Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely studied recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, which are limited in quantity, and models trained on only a few specific scenes generalize poorly to diverse ones. We therefore propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis that no longer requires paired motion-scene data. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so that motion naturalness and interaction plausibility are maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN-inversion manner, maintaining motion continuity and controlling keyframe poses through a ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture and on real scenes from PROX and Replica. Results show that our framework achieves better motion naturalness and interaction plausibility than state-of-the-art methods, indicating the feasibility of applying DIP to motion synthesis in more general tasks and versatile scenes.
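The blending of rotations in power space and translations in linear space mentioned above can be sketched as follows. This is a hedged, self-contained illustration (not the authors' code): it assumes poses given as unit quaternions (w, x, y, z) plus 3D translations, and blends a pair of them with the standard quaternion power formula q0 (q0⁻¹ q1)^t alongside a linear interpolation of positions.

```python
import numpy as np

def quat_mul(a, b):
    # Hamilton product of two quaternions in (w, x, y, z) order.
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    # Conjugate equals the inverse for unit quaternions.
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_pow(q, t):
    # q = [cos(theta), sin(theta)*axis]; q**t rotates by t*theta about the same axis.
    w = np.clip(q[0], -1.0, 1.0)
    theta = np.arccos(w)
    if theta < 1e-8:  # near-identity rotation: any power stays the identity
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = q[1:] / np.sin(theta)
    return np.concatenate([[np.cos(t * theta)], np.sin(t * theta) * axis])

def blend_pose(q0, q1, p0, p1, t):
    # Rotation fused in power space: q0 * (q0^{-1} q1)^t; translation fused linearly.
    q = quat_mul(q0, quat_pow(quat_mul(quat_conj(q0), q1), t))
    p = (1.0 - t) * p0 + t * p1
    return q, p
```

Blending frame-by-frame with a weight t that ramps from 0 to 1 over the transition window yields the kind of stable hand-off between sub-task motions described in the abstract; the power-space form keeps interpolated rotations on the unit sphere, which plain linear blending of quaternions would not.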
The left subfigure shows the overall pipeline of scene-aware motion synthesis. A feasible command is first decomposed into sub-tasks with action-object pairs. We then synthesize future motion according to the historical motion and the current sub-task. Finally, the synthesized motion is fused with the historical motion to obtain the final long-term motion. The right subfigure presents the framework of Diffusion Implicit Policy (DIP). In each DIP iteration, the diffusion model denoises the motion to make it more natural, while implicit policy optimization from a reward endows the motion with plausible interaction. A random sampling step helps the framework synthesize motions with diverse styles.
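The alternation of denoising, reward-driven optimization, and random sampling described in the caption can be sketched as a toy loop. Everything here is a hypothetical stand-in, not the actual DIP implementation: the real method uses a trained motion diffusion model and scene-interaction rewards, whereas this sketch substitutes a smoothing step for denoising and a goal-reaching reward on a 1D "motion" signal, purely to show the structure of the iteration.

```python
import numpy as np

def toy_denoise(x, step=0.3):
    # Stand-in for one diffusion denoising step: pulls the motion toward a
    # smooth prior (its local moving average), mimicking a naturalness prior.
    padded = np.pad(x, 1, mode="edge")
    smooth = np.convolve(padded, np.ones(3) / 3.0, mode="valid")
    return x + step * (smooth - x)

def reward_grad(x, goal):
    # Stand-in for an interaction reward gradient: negative distance of the
    # final frame to a goal position, so only the last frame receives a pull.
    g = np.zeros_like(x)
    g[-1] = goal - x[-1]
    return g

def dip_iterate(x, goal, n_iters=50, lr=0.2, noise=0.01, seed=0):
    # DIP-style loop (hedged sketch): alternate denoising (naturalness),
    # reward-based optimization of the intermediate motion (plausibility),
    # and a small random perturbation (diversity).
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        x = toy_denoise(x)
        x = x + lr * reward_grad(x, goal)        # implicit-policy update
        x = x + noise * rng.normal(size=x.shape)  # random sampling step
    return x
```

The key structural point is that the reward optimizes the *intermediate* noised motion rather than the final output, so subsequent denoising steps can restore naturalness after each reward-driven update.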
Stand up from the sofa and walk over to the stool, then sit on it.
Stand up from the sofa and walk over to another sofa, then sit on it.
Walk over to the sofa, and then sit on it.
Walk over to the bed and sit down on it.
Stand up from the sofa, walk to the door, then sit back down on the sofa.
Stand up from the bed and walk over to the chair, then sit down on it.
Stand up from the sofa and go out.
Walk over to the bed and sit down on it.
Stand up from the sofa and walk over to the chair, then sit down on it.
@article{gong2024dip,
title={Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis},
author={Gong, Jingyu and Zhang, Chong and Liu, Fengqi and Fan, Ke and Zhou, Qianyu and Tan, Xin and Zhang, Zhizhong and Xie, Yuan and Ma, Lizhuang},
journal={arXiv e-prints},
year={2024}
}