Toward photorealistic reconstruction

We in the Visual Computing Focus Group are research enthusiasts pushing the state of the art at the intersection of computer vision, graphics, and machine learning. Our research mission is to obtain high-quality digital models of the real world, which include detailed geometry, surface texture, and material in both static and dynamic environments.

Focus Group Visual Computing

Prof. Matthias Nießner (TUM), Alumnus Rudolf Mößbauer Tenure Track Professor | Prof. Leonidas Guibas (Stanford University), Prof. Luisa Verdoliva (University Federico II of Naples), Hans Fischer Senior Fellows | Prof. Angel X. Chang (Simon Fraser University), Hans Fischer Fellow | Dr. Justus Thies (TUM), Postdoctoral Researcher | Shivangi Aneja, Armen Avetisyan, Dave Zhenyu Cheng, Manuel Dahnert, Ji Hou, Andreas Rössler (TUM), Doctoral Candidates | Host: Visual Computing, TUM

In our research, we heavily exploit the capabilities of RGB-D and range sensing devices that are now widely available. However, we ultimately aim to achieve both 3D and 4D recordings from monocular sensors – essentially, we want to record holograms with a simple webcam or mobile phone. We further employ our reconstructed models for specific use cases, such as video editing, immersive AR/VR, and semantic scene understanding. In addition to traditional convex and non-convex optimization techniques, we see great potential in modern artificial intelligence, mainly deep learning, for achieving these goals. This research is relevant to the many areas affected by 3D digitization and semantic scene understanding, with applications ranging from entertainment and communication to medicine and autonomous robotics. The primary goal, however, is to replace videos and images with the interactive yet photorealistic 3D content of the future, i.e., holograms, which we believe will affect all of these industries.

Current work in the Visual Computing lab focuses on two major topics: 3D scene reconstruction (algorithms to capture, reconstruct, and understand 3D environments) and photorealistic image synthesis from these reconstructed and acquired 3D representations. In addition, we place heavy emphasis on ethical implications by investigating new forensics methods to automatically detect manipulated images and videos.

For synthesizing realistic images, we start with traditional 3D representations of a scene and a virtual camera, and we use rendering techniques such as rasterization or ray tracing to generate a 2D image. Such input 3D content is largely created through manual effort by expert artists, for example, for movie productions. A very recent research direction, however, is synthesis using neural 3D representations, which offer a potential alternative to traditional, explicit 3D reconstruction and tracking. In particular, generative neural networks such as GANs or auto-regressive techniques can now generate quite convincing images. The problem becomes significantly more challenging when the goal is to synthesize a consistent recording from a captured 3D scene, or to make consistent edits in a video stream. The difficulty lies in the need to learn an underlying representation that facilitates viewpoint consistency and seamless animation of dynamic elements. A core idea of our line of research is to incorporate 3D knowledge directly into the neural network architectures in a fully differentiable fashion. This not only simplifies the learning process (3D transformations are hard to learn as a series of 2D convolutions, for example) but also provides stable anchor points for conditioning generative models. The same idea can be used for 3D reconstructions of static scenes. In DeepVoxels, a series of 2D images is lifted to a volumetric grid of 3D features from which novel viewpoints can be synthesized. In contrast to conventional 2D convolutional networks, this approach learns a new scene representation from which images can be generated that are both photorealistic and temporally coherent.
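To make the idea of lifting 2D image content into a 3D grid more concrete, the following is a minimal sketch, not the DeepVoxels implementation. It assumes a simple pinhole camera and voxel centers already expressed in camera coordinates. The names `project` and `lift_to_grid` and the parameters `focal`, `cx`, and `cy` are hypothetical, chosen only for this illustration. Each voxel center is projected into the image and picks up the feature stored at the nearest pixel. A learned, differentiable variant would typically use interpolated sampling and fuse features across many views.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project(point: Vec3, focal: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-space point to pixel coordinates (u, v).

    Returns None for points at or behind the camera plane.
    """
    x, y, z = point
    if z <= 0.0:
        return None
    return (focal * x / z + cx, focal * y / z + cy)


def lift_to_grid(
    feature_map: Sequence[Sequence[float]],
    voxel_centers: Sequence[Vec3],
    focal: float,
    cx: float,
    cy: float,
) -> List[Optional[float]]:
    """Assign each voxel the feature of the nearest pixel it projects to.

    Voxels that fall outside the image or behind the camera receive None.
    Nearest-pixel lookup is a simplification made for this sketch.
    """
    height, width = len(feature_map), len(feature_map[0])
    lifted: List[Optional[float]] = []
    for center in voxel_centers:
        uv = project(center, focal, cx, cy)
        if uv is None:
            lifted.append(None)
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        if 0 <= row < height and 0 <= col < width:
            lifted.append(feature_map[row][col])
        else:
            lifted.append(None)
    return lifted


if __name__ == "__main__":
    # Toy 5x5 "feature map" whose value encodes the pixel position.
    features = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
    voxels = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
    print(lift_to_grid(features, voxels, focal=2.0, cx=2.0, cy=2.0))
```

Because the projection is an explicit geometric operation rather than something the network has to learn, the same voxel receives consistent information from every view in which it is visible, which is the intuition behind the viewpoint consistency described above.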

Figure 1

With the ability to synthesize and edit images and videos, we must also consider societal implications, particularly since videos are often considered credible and given the recent focus on media and fabricated news. As researchers, we have a sincere commitment to contributing to society as a whole, both in educating the public on the possibilities of artificial synthesis and in developing automated approaches to detect image and video manipulations. In our FaceForensics work, we examine manipulation approaches, using Face2Face manipulations as an example: Can humans easily spot fakes? What about supervised learning methods that are trained to detect fakes? In FaceForensics++, our work now covers Face2Face, FaceSwap, and the recently very popular DeepFake technique, introducing a new data set and benchmark with manipulations of over 500,000 source images from YouTube videos, and over 1.5 million synthesized output manipulations from these state-of-the-art editing techniques. It turns out that the 143 participants of a user study were only able to correctly classify 61% of the images in a 50:50 test split of real and fake images; this is only 11 percentage points better than random chance. These results underscore the need for an automated solution. To this end, FaceForensics++ introduces a new learning-based detection method that exploits domain-specific knowledge of the face domain by combining face tracking input with a convolutional neural classifier. The resulting approach yields a classification accuracy of 86.69% on the same benchmark – far beyond human performance. Detecting manipulated imagery becomes significantly more challenging, however, when dealing with unknown manipulation techniques.

Following up on these important steps in the research community, our long-term goal is to obtain a perfect digital replica of the real world – ideally, from single photos or short RGB video sequences. To this end, we need to generalize capture and synthesis to fully dynamic and arbitrary real-world settings. At the same time, we will investigate the shortcomings of existing synthesis methods by developing automated detection methods. This will not only help to improve synthesis, but also open up a discussion about the social impact of artificially created imagery.

Figures 2 and 3

Figure 4

J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt and M. Nießner, “Face2Face: Real-time Face Capture and Reenactment of RGB Videos”, CACM, vol. 62, no. 1, pp. 96-104, 2019.

A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies and M. Nießner, “FaceForensics++: Learning to Detect Manipulated Facial Images”, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1-11.

J. Thies, M. Zollhöfer and M. Nießner, “Deferred Neural Rendering: Image Synthesis using Neural Textures”, ACM Transactions on Graphics, vol. 38, no. 4, 2019.

J. Hou, A. Dai and M. Nießner, “3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans”, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4416-4425.

A. Avetisyan, M. Dahnert, A. Dai, M. Savva, A. X. Chang and M. Nießner, “Scan2CAD: Learning CAD Model Alignment in RGB-D Scans”, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2609-2618.

G. Gafni, J. Thies, M. Zollhöfer and M. Nießner, “Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction”, to be presented at the IEEE Conference on Computer Vision and Pattern Recognition 2021 (CVPR2021), virtual conference, June 19-25, 2021.

A. Dai, Y. Siddiqui, J. Thies, J. Valentin and M. Nießner, “SPSG: Self-Supervised Photometric Scene Generation from RGB-D Scans”, to be presented at the IEEE Conference on Computer Vision and Pattern Recognition 2021 (CVPR2021), virtual conference, June 19-25, 2021.

J. Hou, B. Graham, M. Nießner and S. Xie, “Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts”, to be presented at the IEEE Conference on Computer Vision and Pattern Recognition 2021 (CVPR2021), virtual conference, June 19-25, 2021.

A. Božič, P. Palafox, M. Zollhöfer, J. Thies, A. Dai and M. Nießner, “Neural Deformation Graphs for Globally-consistent Non-rigid Reconstruction”, to be presented at the IEEE Conference on Computer Vision and Pattern Recognition 2021 (CVPR2021), virtual conference, June 19-25, 2021.