TLDR: We present an implicit surface reconstruction algorithm that achieves millimetric precision with 3 minutes of training time, for fast and accurate multi-view human acquisition. The learned volume representation can be rendered at >250 fps out of the box. We also propose MVMannequin, a dataset of 14 dressed mannequins, each with a high-quality hand-scanned geometric reference and corresponding multi-view images for validation.
Detailed human surface capture from multiple images is an essential component of many 3D production, analysis, and transmission tasks. Yet producing 3D models with millimetric precision in practical time, and actually verifying their 3D accuracy in a real-world capture context, remain key challenges due to the lack of methods and data dedicated to these goals. We propose two complementary contributions to this end. The first is a highly scalable neural surface radiance field approach that achieves millimetric precision by construction while demonstrating high compute and memory efficiency. The second is a novel dataset of clothed mannequin geometry captured with a high-resolution hand-held 3D scanner and paired with calibrated multi-view images, allowing the millimetric accuracy claim to be verified.
Although our approach produces highly dense and precise geometry, we show how aggressive sparsification and optimization of the neural surface pipeline lead to estimations requiring only minutes of computation time and a few GB of GPU memory, while allowing real-time, millisecond-scale neural rendering. On the basis of our framework and dataset, we provide a thorough experimental analysis of how such accuracy and efficiency are achieved in the context of multi-camera human acquisition.
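To see why sparsification is decisive for memory, here is a back-of-the-envelope estimate (all sizes below are our own illustrative assumptions, not figures from the paper): a dense 2 mm grid over a human-sized bounding box allocates hundreds of millions of voxels, while a sparse shell around the surface needs only a few million.

```python
# Illustrative order-of-magnitude comparison of dense vs. sparse voxel grids.
# Every quantity here is an assumed, human-sized example, not a measured value.

voxel = 0.002            # 2 mm voxel side length, as in the final level-of-detail
bbox = (1.0, 1.0, 2.0)   # assumed bounding box of a standing person, in meters

# Dense grid: every voxel inside the bounding box is allocated.
dense_voxels = (bbox[0] / voxel) * (bbox[1] / voxel) * (bbox[2] / voxel)

# Sparse grid: only a thin shell of voxels around the surface is kept.
surface_area = 2.0       # assumed body + clothing surface area, in m^2
shell_thickness = 3      # assumed shell thickness, in voxels
sparse_voxels = (surface_area / voxel**2) * shell_thickness

bytes_per_voxel = 32     # assumed per-voxel payload (geometry + color features)
print(f"dense : {dense_voxels:,.0f} voxels ~ {dense_voxels * bytes_per_voxel / 1e9:.1f} GB")
print(f"sparse: {sparse_voxels:,.0f} voxels ~ {sparse_voxels * bytes_per_voxel / 1e9:.2f} GB")
```

With these assumptions the dense grid costs about 8 GB while the sparse shell fits in tens of MB, which is why a sparse representation can reach 2 mm resolution within a few GB of GPU memory.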
Our reconstruction algorithm is designed from the ground up to achieve fast training and real-time volumetric rendering out of the box, without relying on a post-processing step. This is especially desirable in the context of human acquisition, in which hundreds of frames must be processed consistently and as quickly as possible. Our contribution is twofold: (i) a highly scalable neural surface radiance field approach that reaches millimetric precision within minutes of training, and (ii) the MVMannequin dataset, which pairs high-resolution hand-held scans with calibrated multi-view images to verify this precision on real captures.
More specifically, three rendering options have been considered in the neural radiance field community:
We train the implicit surface in a coarse-to-fine manner, starting from a coarse visual hull. We iteratively subdivide the sparse grid, supervising it with images of increasing resolution. At the final level-of-detail, each voxel has a side length of 2 mm and projects to an area comparable to that of a pixel.
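Below is a minimal sketch of this coarse-to-fine schedule, assuming an octree-style split of a sparse set of voxel centers; the pruning callback `keep_mask_fn` (e.g., a visual-hull or surface-proximity test) is a placeholder for the actual criterion, and the per-level training loop is elided.

```python
import numpy as np

def subdivide(centers: np.ndarray, side: float):
    """Split each voxel into its 8 children (octree-style).
    Child centers sit at +/- side/4 from the parent center."""
    offsets = np.array([[dx, dy, dz]
                        for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)],
                       dtype=np.float64) * (side / 4.0)
    children = (centers[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
    return children, side / 2.0

def coarse_to_fine(centers, side, keep_mask_fn, final_side=0.002):
    """Refine a sparse voxel grid until voxels reach ~2 mm.
    At each level, voxels rejected by keep_mask_fn (empty space)
    are dropped; image supervision resolution would be raised in
    lockstep with the grid resolution (training loop not shown)."""
    while side > final_side:
        centers, side = subdivide(centers, side)
        centers = centers[keep_mask_fn(centers, side)]
        # ... optimize the surface at this level-of-detail ...
    return centers, side
```

At the final level, the 2 mm voxel size is what makes per-pixel supervision meaningful: seen from a typical capture distance, such a voxel projects to roughly one pixel.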
Our dataset contains 8 female and 6 male mannequins. For each mannequin, we provide a high-quality hand-scanned mesh and corresponding calibrated multi-view images. This makes it possible to benchmark reconstruction algorithms on real data with a smaller domain gap with respect to human appearance than other popular object datasets. The mannequins are fitted with a diverse selection of clothes displaying complex folds and textured patterns.
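As an illustration of how reconstructions can be scored against the hand-scanned reference, here is a minimal sketch of a symmetric Chamfer evaluation using SciPy's KD-tree; this is a generic recipe under our own assumptions, not necessarily the paper's exact evaluation protocol.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_mm(recon_pts: np.ndarray, ref_pts: np.ndarray) -> float:
    """Symmetric point-to-point Chamfer distance, reported in millimeters.
    recon_pts: (N, 3) points sampled on the reconstructed surface, in meters.
    ref_pts:   (M, 3) points from the hand-scanned reference, in meters.
    Both clouds must live in the same calibrated world frame."""
    d_recon_to_ref, _ = cKDTree(ref_pts).query(recon_pts)
    d_ref_to_recon, _ = cKDTree(recon_pts).query(ref_pts)
    return 1000.0 * 0.5 * (d_recon_to_ref.mean() + d_ref_to_recon.mean())
```

Distances are averaged in both directions so that neither missing geometry nor hallucinated geometry goes unpenalized.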
We evaluate our approach against a representative set of multi-view reconstruction methods: Voxurf, NeuS2, Colmap, 3D Gaussian Splatting (3DGS), 2D Gaussian Splatting (2DGS), and Gaussian Opacity Fields (GOF). Figure 5 presents a side-by-side comparison of renderings from the novel view synthesis methods and shows the benefits of encoding colors at a millimetric scale. Figure 6 shows the corresponding meshes recovered by each baseline.
The multi-view data used for the human reconstruction examples presented here comes from the 4DHumanOutfit dataset.
If you find our work useful, consider citing:
@inproceedings{toussaint:hal-04724016,
TITLE = {{Millimetric Human Surface Capture in Minutes}},
AUTHOR = {Toussaint, Briac and Boissieux, Laurence and Thomas, Diego and Boyer, Edmond and Franco, Jean-S{\'e}bastien},
URL = {https://inria.hal.science/hal-04724016},
BOOKTITLE = {{SIGGRAPH Asia 2024 - 17th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia}},
ADDRESS = {Tokyo, Japan},
YEAR = {2024},
MONTH = Dec,
DOI = {10.1145/3680528.3687690},
KEYWORDS = {Neural Radiance Fields ; Human surface capture ; Differential rendering},
}