Millimetric Human Surface Capture in Minutes

Centre Inria de l’Université Grenoble Alpes · Kyushu University
Fig.1 - Left: Detailed human reconstruction examples. Right: Input image, ground truth scan and our reconstruction.

TLDR: We present an implicit surface reconstruction algorithm that achieves millimetric precision in 3 minutes of training time, for fast and accurate multi-view human acquisition. The learned volume representation can be rendered at >250 fps out of the box. We also propose MVMannequin, a dataset of 14 dressed mannequins, each providing a high-quality hand-scanned geometric reference and corresponding multi-view images for validation.

Reconstruction examples with real-time rendering and playback on the 4DHumanOutfit dataset.

Abstract

Detailed human surface capture from multiple images is an essential component of many 3D production, analysis and transmission tasks. Yet producing 3D models with millimetric precision in practical time, and actually verifying their 3D accuracy in a real-world capture context, remain key challenges due to the lack of specific methods and data for these goals. We propose two complementary contributions to this end. The first is a highly scalable neural surface radiance field approach able to achieve millimetric precision by construction, while demonstrating high compute and memory efficiency. The second is a novel dataset of clothed mannequin geometry captured with a high-resolution hand-held 3D scanner and paired with calibrated multi-view images, which allows the millimetric accuracy claim to be verified.

Although our approach can produce such highly dense and precise geometry, we show how aggressive sparsification and optimization of the neural surface pipeline lead to estimations requiring only minutes of computation time and a few GB of GPU memory, while allowing real-time, millisecond neural rendering. On the basis of our framework and dataset, we provide a thorough experimental analysis of how such accuracy and efficiency are achieved in the context of multi-camera human acquisition.

Overview

Fig.2 - Inference pipeline overview. Each orange box is a GPU kernel.

Our reconstruction algorithm is designed from the ground up to achieve fast training and real-time volumetric rendering out of the box, without relying on a post-processing step. This is especially desirable in the context of human acquisition, where hundreds of frames must be processed consistently and as quickly as possible. Our contribution is twofold:

  • A novel sparse representation for lightweight memory usage and fast empty-space skipping. Voxels can be allocated or pruned in parallel on the GPU at training time, to maintain sparsity as the implicit surface evolves. A single SDF value is stored in each voxel, while the appearance features are factorized into 3 orthogonal planes at a coarser level (see the sketch after this list).
  • A reordered volume rendering pipeline for real-time rendering and fast training.
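
As a rough illustration, the following is a minimal sketch in PyTorch of how such a representation could be laid out: one signed-distance scalar per allocated voxel, and appearance features factorized into three coarser orthogonal planes. All names, shapes and sizes are hypothetical; it only shows the data layout, not the paper's parallel GPU allocation and pruning.

import torch

class SparseSDFTriplane(torch.nn.Module):
    # One SDF scalar per active voxel; appearance features are factorized
    # into three orthogonal feature planes stored at a coarser resolution.
    def __init__(self, num_voxels, plane_res=256, feat_dim=8):
        super().__init__()
        self.sdf = torch.nn.Parameter(torch.zeros(num_voxels))
        self.planes = torch.nn.Parameter(torch.zeros(3, feat_dim, plane_res, plane_res))

    def features(self, xyz):
        # xyz: (N, 3) points normalized to [-1, 1]; sample the XY, XZ and YZ
        # planes bilinearly and sum the three contributions.
        grids = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]])   # (3, N, 2)
        f = torch.nn.functional.grid_sample(
            self.planes, grids.unsqueeze(2), align_corners=True)                # (3, C, N, 1)
        return f.squeeze(-1).sum(dim=0).t()                                     # (N, C)

rep = SparseSDFTriplane(num_voxels=100_000)
feats = rep.features(torch.rand(4096, 3) * 2 - 1)   # (4096, 8) appearance features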

More specifically, three rendering options have been considered in the neural radiance field community:

  1. The classic grid-based NeRF pipeline: render(MLP(interpolate(features)))
  2. The deferred rendering pipeline: MLP(render(interpolate(features)))
  3. The precomputed rendering pipeline: render(interpolate(MLP(features)))

We advocate the third option, in which the view-dependent color is predicted for all the voxels at once, followed by volume rendering of the linearly interpolated predicted colors. This enables several key optimizations (a code sketch of this ordering follows the list):

  1. On-the-fly volume rendering: We can fuse the volume rendering and color interpolation into a single GPU kernel that dynamically skips empty space and only writes the final accumulated color. The number of samples per ray thus becomes unlimited, making it possible to render a complete image all at once, regardless of the number of rays.
  2. Fused color prediction: Most previous works predict colors from interpolated features stored in coarse voxels. Our insight is that there is no need to interpolate features as long as the voxels are small enough, which our efficient sparse structure makes possible. Consequently, the color prediction MLP only needs to encode the angular dependency, while the spatial dependency is fully explained by the voxel grid, which allows the MLP to be further reduced in size. We opt for an MLP with 2 layers of 32 neurons, which proves sufficient for human acquisition. For additional memory and compute savings, we fuse the entire color prediction into a single GPU kernel, so that only the per-voxel predicted colors need to be stored.
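
Below is a minimal, runnable PyTorch sketch of the advocated ordering render(interpolate(MLP(features))). All names, shapes and values are hypothetical, and the trilinear interpolation and kernel fusion of the actual pipeline are replaced by a simple gather for illustration: the point is that the tiny MLP runs once per voxel, after which volume rendering only combines its outputs.

import torch

def tiny_color_mlp(feat_dim=8, hidden=32):
    # 2 layers of 32 neurons: the grid explains spatial detail, so the MLP
    # only has to encode the angular (view-dependent) part of appearance.
    return torch.nn.Sequential(
        torch.nn.Linear(feat_dim + 3, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, 3), torch.nn.Sigmoid())

def composite(colors, alphas):
    # Front-to-back alpha compositing of per-sample colors along each ray.
    # colors: (R, S, 3), alphas: (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas[:, :-1]], dim=1), dim=1)
    weights = trans * alphas                               # (R, S)
    return (weights.unsqueeze(-1) * colors).sum(dim=1)     # (R, 3)

# Toy usage: V voxels, R rays, S samples per ray, C feature channels.
V, R, S, C = 1000, 64, 32, 8
mlp = tiny_color_mlp(C)
voxel_colors = mlp(torch.cat([torch.randn(V, C), torch.randn(V, 3)], dim=1))  # MLP(features), once per voxel
idx = torch.randint(V, (R, S))               # stand-in for trilinear interpolation
sample_colors = voxel_colors[idx]            # interpolate(...): (R, S, 3)
sample_alphas = torch.rand(R, S)             # opacities come from the SDF in the real pipeline
pixels = composite(sample_colors, sample_alphas)           # render(...): (R, 3)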

Fig.3 - Coarse-to-fine training. Top: SDF visualization, Bottom: Corresponding mesh extracted with marching cubes.

We train the implicit surface in a coarse-to-fine manner, starting from a coarse visual hull. We iteratively subdivide the sparse grid, which we supervise with images of increasing resolution. Each voxel has a side length of 2 mm at the final level of detail, projecting to an area comparable to that of a pixel.
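
The schedule below is a schematic Python sketch of this coarse-to-fine loop. The bounding-box size and starting resolution are made-up values; only the ~2 mm target voxel size comes from the text above.

def lod_schedule(bbox_size_mm=2000.0, start_res=64, target_voxel_mm=2.0):
    # Double the sparse grid resolution (i.e. subdivide surviving voxels)
    # until the voxel side length reaches the millimetric target.
    res, levels = start_res, []
    while bbox_size_mm / res > target_voxel_mm:
        res *= 2
        levels.append((res, bbox_size_mm / res))  # (grid resolution, voxel side in mm)
    return levels

for res, voxel_mm in lod_schedule():
    # In the real pipeline, each level is also supervised with images of
    # increasing resolution before the next subdivision.
    print(f"grid {res}^3, voxel side {voxel_mm:.2f} mm")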

Dataset

Fig.4 - Top: Scans of the 14 mannequins. Bottom: Colored renderings.

Our dataset contains 8 female and 6 male mannequins. A high-quality hand-scanned mesh and corresponding multi-view images are provided in each case. This makes it possible to benchmark reconstruction algorithms on real data, with a smaller domain gap to human appearance than other popular object datasets. The mannequins are fitted with a diverse selection of clothes displaying complex folds and textured patterns.

Results

Fig.5 - Left to right: input image, Ours, Voxurf, NeuS2, 3DGS, 2DGS, GOF.
Fig.6 - Left to right: input image, Ground truth scan, Ours, Voxurf, NeuS2, Colmap, 2DGS, GOF.

We compare our approach against a representative set of multi-view reconstruction methods: Voxurf, NeuS2, Colmap, 3D Gaussian Splatting (3DGS), 2D Gaussian Splatting (2DGS) and Gaussian Opacity Fields (GOF). Figure 5 presents a side-by-side comparison of renderings from the novel view synthesis methods and shows the benefits of encoding colors at a millimetric scale. Figure 6 shows the corresponding meshes recovered by each baseline.

Related Links

The multi-view data used for the human reconstruction examples presented here is from the 4DHumanOutfit dataset.

BibTeX

If you find our work useful, consider citing:

@inproceedings{toussaint:hal-04724016,
  TITLE = {{Millimetric Human Surface Capture in Minutes}},
  AUTHOR = {Toussaint, Briac and Boissieux, Laurence and Thomas, Diego and Boyer, Edmond and Franco, Jean-S{\'e}bastien},
  URL = {https://inria.hal.science/hal-04724016},
  BOOKTITLE = {{SIGGRAPH Asia 2024 - 17th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia}},
  ADDRESS = {Tokyo, Japan},
  YEAR = {2024},
  MONTH = Dec,
  DOI = {10.1145/3680528.3687690},
  KEYWORDS = {Neural Radiance Fields ; Human surface capture ; Differential rendering},
  }