How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects

Anonymous Authors

Abstract

Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects.

To address these challenges, we make three contributions. First, we augment the Truebones Zoo dataset, a high-quality animal motion dataset covering over 70 species, with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to adapt dynamically to arbitrary skeletal templates, enabling motion synthesis for objects with widely varying structures.

Experiments show that our method generates realistic, coherent motions from textual descriptions for diverse and even unseen objects, laying a strong foundation for motion generation across object categories with heterogeneous skeletal templates. Qualitative results are available at this link.

Synthesis Results on 35 Representative Objects

Detailed Views for Each Animal Category

Q: How to Move Your Dragon? A: Text-to-Motion Synthesis!

A collection of newly synthesized dragon motions, generated from textual descriptions and absent from the original dataset.

Adaptability to Novel Motion Descriptions

We compare dragon motions synthesized by different models from a motion description that is novel for the dragon and taken from the ant: “An ant is knocked back and ends up lying on its back, motionless.” Among the compared models, only our method generates a dragon motion that closely aligns with the ant's reference motion.

Compared results (video panels): Reference, Ours, w/ GPT-Caption, w/o Rig Aug., SO-MDMs.

Motion Synthesis on Known/Novel Objects

Video panels: Original Object's Motion; Known/Novel Objects' Motion (Observed During Training; Totally Novel).

Level of Detail in Description and Variability in Synthesized Motions

This section highlights the impact of varying levels of detail in text descriptions on synthesized motions. More detailed descriptions result in precise, fine-grained motion nuances, whereas abstract descriptions allow for a broader range of possible motion interpretations. By adjusting the granularity of textual input, our model adapts dynamically, generating motions that exhibit both diversity and coherence.

Captions Used for Motion Synthesis:

  • Level-1: A T-Rex is walking forward.
  • Level-2: A T-Rex walks forward with its body low to the ground.
  • Level-3: A T-Rex, with its head, neck, back, and tail aligned in a straight line, walks forward. It bends its knees to keep its upper body low to the ground as it moves.
  • Level-4: A T-Rex, maintaining a straight alignment of its head, neck, back, and tail, begins to walk forward. It bends its knees deliberately, allowing its massive upper body to remain low and balanced as it strides.

Long Motion Generation: Journey of T-Rex

Video views: Home Camera, Right Camera.

This demonstration showcases the long motion generation of a T-Rex from multiple perspectives. Though trained on relatively short sequences (up to 90 frames), our model extends to longer motions (more than 800 frames) by conditioning on sequential descriptions and ensuring smooth transitions between consecutive motion chunks via an overlapping window with weighted blending during sampling.
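The overlapping-window idea can be illustrated with a minimal sketch. This is not the authors' implementation: the text states that blending happens during sampling, whereas the hypothetical `blend_chunks` helper below cross-fades already generated chunks after the fact. It also assumes a linear weight ramp and treats each frame as a plain feature vector (rotation features would need a proper interpolation scheme). The function name, the `overlap` parameter, and the chosen overlap length are all assumptions made for illustration.

```python
import numpy as np


def blend_chunks(chunks, overlap):
    """Stitch motion chunks of shape (frames, features) into one sequence.

    Consecutive chunks share `overlap` frames; in the shared region the
    result is a weighted average whose weight shifts linearly from the
    earlier chunk to the later one. (Assumed scheme, for illustration.)
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    out = np.asarray(chunks[0], dtype=float).copy()
    for nxt in chunks[1:]:
        nxt = np.asarray(nxt, dtype=float)
        if overlap == 0:
            out = np.concatenate([out, nxt], axis=0)
            continue
        if len(out) < overlap or len(nxt) < overlap:
            raise ValueError("every chunk needs at least `overlap` frames")
        # Interior points of a 0..1 ramp, so neither chunk is fully
        # discarded at the edges of the overlap window.
        w = np.linspace(0.0, 1.0, overlap + 2)[1:-1, None]
        mixed = (1.0 - w) * out[-overlap:] + w * nxt[:overlap]
        out = np.concatenate([out[:-overlap], mixed, nxt[overlap:]], axis=0)
    return out


if __name__ == "__main__":
    # Ten placeholder chunks of 90 frames each, with an assumed 10-frame overlap.
    rng = np.random.default_rng(0)
    demo = [rng.normal(size=(90, 6)) for _ in range(10)]
    print(blend_chunks(demo, overlap=10).shape)
```

With ten 90-frame chunks and an assumed 10-frame overlap, the stitched sequence has 90 + 9 × 80 = 810 frames, which is consistent with the "more than 800 frames" figure quoted above. The actual overlap length and weighting used by the method are not given here.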

Captions Used for Motion Synthesis:

  • 🦖 A T-Rex wakes up and gets up from a lying position.
  • 🦖 A T-Rex turns its head to the right and observes something.
  • 🦖 A T-Rex bends down and bites something inside a cave.
  • 🦖 A T-Rex walks slowly forward with its body aligned in a straight line.
  • 🦖 A T-Rex is sprinting with its head facing forward and body lowered for balance.
  • 🦖 A T-Rex dashes forward and bites into something, then chews and swallows it.
  • 🦖 A T-Rex steps back slowly while facing forward.
  • 🦖 A T-Rex bends its knees, lowers its body, and reacts to being hit on the right side of its face.
  • 🦖 A T-Rex with an injured leg walks awkwardly, growling in pain.
  • 🦖 A T-Rex is shot and falls to the ground.

BibTeX

@article{anonymous2025t2m4vlo,
  author    = {Anonymous Authors},
  title     = {How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects},
  journal   = {under review},
  year      = {2025},
}