Authors:
(1) Han Jiang, HKUST, equal contribution (hjiangav@connect.ust.hk);
(2) Haosen Sun, HKUST, equal contribution (hsunas@connect.ust.hk);
(3) Ruoxuan Li, HKUST, equal contribution (rliba@connect.ust.hk);
(4) Chi-Keung Tang, HKUST (cktang@cs.ust.hk);
(5) Yu-Wing Tai, Dartmouth College (yu-wing.tai@dartmouth.edu).
Given a pre-trained NeRF, a set of masks on its training images marking the target object to be replaced (or removed), and a text prompt, we propose generative promptable inpainting, which decomposes into three objectives: 1) 3D and 4D visual content generation, where the fine-tuned NeRF should contain a new object that is consistent across views and over time; 2) text-prompt-guided generation, where the semantics of the generated object should match the input text prompt; and 3) background consistency, where the inpainted content should be consistent with the existing NeRF background.
Our framework consists of three main stages, as shown in Figure 1. First, we employ Stable Diffusion [22] to inpaint one view as the first seed image, and then generate a coarse set of seed images conditioned on it. The remaining views are inferred from the seed images and refined by Stable Diffusion. This stage pre-processes the training images so that the later fine-tuning converges faster and more easily. Next, we fine-tune the NeRF with a Stable Diffusion version of iterative dataset update [4] to enforce 3D multiview consistency; this stage yields a converged 3D NeRF. Finally, if the target is a 4D NeRF, we propagate the 3D inpainted result along the time dimension. The following subsections describe the three stages in turn.
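The control flow of the three stages can be sketched as follows. This is only an illustrative outline, not the authors' implementation: every function and parameter name is hypothetical, the actual diffusion, rendering, and training operations are passed in as callables, and the schedule for replacing training images (one image every `update_every` steps, in round-robin order) is an assumption made for the sketch.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def preprocess_views(
    views: Sequence[Any],
    prompt: str,
    seed_ids: Sequence[int],
    inpaint_first: Callable[[Any, str], Any],
    make_seed: Callable[[Any, Any, str], Any],
    infer_view: Callable[[Any, List[Any], str], Any],
) -> List[Any]:
    """Stage 1 (hypothetical): build an inpainted training set from seed images."""
    # Inpaint one view to obtain the first seed image.
    first_seed = inpaint_first(views[seed_ids[0]], prompt)
    seeds = {seed_ids[0]: first_seed}
    # Generate the coarse seed set, conditioned on the first seed image.
    for i in seed_ids[1:]:
        seeds[i] = make_seed(views[i], first_seed, prompt)
    seed_list = [seeds[i] for i in seed_ids]
    # Infer (and refine) every remaining view from the seed images.
    return [
        seeds[i] if i in seeds else infer_view(views[i], seed_list, prompt)
        for i in range(len(views))
    ]


def finetune_with_dataset_update(
    nerf: Any,
    dataset: Sequence[Any],
    prompt: str,
    train_step: Callable[[Any, List[Any]], Any],
    refine_render: Callable[[Any, Any, str], Any],
    num_iters: int,
    update_every: int,
) -> Tuple[Any, List[Any]]:
    """Stage 2 (hypothetical): alternate NeRF training with training-image updates."""
    dataset = list(dataset)
    cursor = 0
    for it in range(1, num_iters + 1):
        nerf = train_step(nerf, dataset)
        if it % update_every == 0:
            # Assumed schedule: replace one training image with a refined render.
            dataset[cursor] = refine_render(nerf, dataset[cursor], prompt)
            cursor = (cursor + 1) % len(dataset)
    return nerf, dataset


def inpaint_pipeline(
    nerf: Any,
    views: Sequence[Any],
    prompt: str,
    *,
    seed_ids: Sequence[int],
    inpaint_first: Callable[[Any, str], Any],
    make_seed: Callable[[Any, Any, str], Any],
    infer_view: Callable[[Any, List[Any], str], Any],
    train_step: Callable[[Any, List[Any]], Any],
    refine_render: Callable[[Any, Any, str], Any],
    propagate_in_time: Optional[Callable[[Any, str], Any]] = None,
    num_iters: int = 100,
    update_every: int = 10,
) -> Any:
    """Run stage 1, stage 2, and (only for a 4D target) stage 3."""
    dataset = preprocess_views(
        views, prompt, seed_ids, inpaint_first, make_seed, infer_view
    )
    nerf, _ = finetune_with_dataset_update(
        nerf, dataset, prompt, train_step, refine_render, num_iters, update_every
    )
    if propagate_in_time is not None:
        # Stage 3: extend the converged 3D result along the time dimension.
        nerf = propagate_in_time(nerf, prompt)
    return nerf
```

Passing the stages in as callables keeps the sketch independent of any particular diffusion or NeRF library; leaving `propagate_in_time` as `None` corresponds to the 3D-only case, where the pipeline stops after the second stage.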
This paper is available on arXiv under a CC 4.0 license.