Increasingly, realistic object, scene, and event modeling is based on image data rather than manual synthesis. The paper describes a system for visits to a virtual, 3D archeological site. One can navigate through this environment with a virtual guide as companion. One can ask questions using natural, fluent speech. The guide will respond and will bring the visitor to the desired place. Simple answers are given as changes in the orientation of his head, by raising his eyebrows, or by nodding. In the near future the head will speak. The idea of modeling directly from images is applied in three subcomponents of this system. First, there are two systems for 3D modeling. One is a shape-from-video system that turns multiple, uncalibrated images into realistic 3D models. This system was used to model the landscape and buildings of the site. The second projects a special pattern and was used to model smaller pieces, like statues and ornaments, that often had intricate shapes. Secondly, the model of the scene is only as convincing as the texture by which it is covered. As it is impossible to keep images of the texture of a complete landscape, images of the natural surface were used to synthesize more texture of a similar kind, starting from a very compact yet effective texture model. Thirdly, natural lip motions were learned from observed, 3D face dynamics. These will be used to animate the virtual guide in future versions of the system.