SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

Computer Vision and Pattern Recognition (CVPR) 2026

1University of California, Los Angeles   2University of Southern California   3University of Utah
* Equal contribution

SPARK is a framework that combines VLM-derived part-level and global image guidance with diffusion transformers to produce high-quality, simulation-ready articulated object reconstructions.

✨ Abstract ✨

Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, part-level articulated objects with kinematic structure from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part-level and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.
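
For readers unfamiliar with URDF, the parameters inferred and refined here (joint type, axis, and origin) live in fragments like the following hypothetical microwave-door joint; the link and joint names and all numeric values are illustrative, not values produced by SPARK:

```xml
<robot name="microwave">
  <link name="body"/>
  <link name="door"/>
  <!-- A revolute hinge: the type, axis, and origin below are the
       kinds of parameters SPARK estimates and then refines. -->
  <joint name="door_hinge" type="revolute">
    <parent link="body"/>
    <child link="door"/>
    <origin xyz="0.25 -0.2 0.0" rpy="0 0 0"/>
    <axis xyz="0 0 1"/>
    <limit lower="0" upper="1.57" effort="1.0" velocity="1.0"/>
  </joint>
</robot>
```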

🎯 Pipeline 🎯

We use a VLM to generate per-part reference images, predicted open-state images, and URDF templates with preliminary joint and link estimates. A Diffusion Transformer (DiT) equipped with local, global, and hierarchical attention mechanisms simultaneously synthesizes part-level and complete articulated meshes from a single image with VLM priors. We further employ a generative texture model to produce realistic textures, and refine the URDF parameters using differentiable forward kinematics and differentiable rendering, guided by the predicted open-state images.
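
The refinement step can be illustrated with a toy sketch. Here a hypothetical point-to-point loss against open-state geometry stands in for the differentiable-rendering supervision, and central finite differences stand in for automatic differentiation; all function names and values are illustrative, not the paper's implementation:

```python
import numpy as np

def rotate_about_axis(points, axis, origin, angle):
    """Rodrigues rotation of part points about a joint axis through `origin`."""
    k = axis / np.linalg.norm(axis)
    p = points - origin
    moved = (p * np.cos(angle)
             + np.cross(k, p) * np.sin(angle)
             + np.outer(p @ k, k) * (1 - np.cos(angle)))
    return moved + origin

def loss(params, points, target, angle):
    """MSE between the articulated part and its target open-state geometry."""
    pred = rotate_about_axis(points, params[:3], params[3:], angle)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

def refine(params, points, target, angle, lr=0.1, steps=800, eps=1e-5):
    """Gradient descent on joint axis + origin via central finite differences."""
    params = params.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(params)
        for i in range(params.size):
            d = np.zeros_like(params)
            d[i] = eps
            grad[i] = (loss(params + d, points, target, angle)
                       - loss(params - d, points, target, angle)) / (2 * eps)
        params -= lr * grad
    return params

# Toy example: recover a door hinge (vertical axis through x = 0.5) from a
# perturbed initial estimate, supervised by the fully open configuration.
rng = np.random.default_rng(0)
door = rng.uniform(-0.5, 0.5, size=(64, 3))               # points on the door part
true_params = np.array([0.0, 0.0, 1.0, 0.5, 0.0, 0.0])   # axis (z), then origin
angle = np.pi / 2                                          # fully open
target = rotate_about_axis(door, true_params[:3], true_params[3:], angle)
init = true_params + np.array([0.1, -0.1, 0.0, 0.05, -0.05, 0.0])
refined = refine(init, door, target, angle)
```

In the actual pipeline the supervision is a rendered open-state image rather than point correspondences, but the structure of the optimization (a scalar loss differentiated through the joint parameters) is the same.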

🧸 Qualitative Comparison on Shape Reconstruction 🧸

We compare our results with OmniPart, PartCrafter, and URDFormer. Our method achieves accurate, high-fidelity shape reconstruction of articulated objects.

[Figure: shape reconstruction comparisons for Microwave, Refrigerator, Table, Storage Furniture, Safe, and Washing Machine. Columns: Input Image, Ground Truth, OmniPart, PartCrafter, URDFormer, Ours.]

🧩 Qualitative Comparison on URDF Estimation 🧩

We compare our results with Articulate-Anything and Articulate-AnyMesh. The closed-state results are reconstructed or retrieved meshes, while the open-state configurations are obtained through kinematic transformations using the estimated URDF parameters. Our method achieves more accurate and physically consistent URDF estimation, leading to realistic articulation behavior.
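
As a concrete illustration of these kinematic transformations, the sketch below applies URDF-style revolute and prismatic joints to map part points from their closed (child-frame) pose to an open-state pose. The joint dictionaries are hypothetical stand-ins for parsed URDF parameters:

```python
import numpy as np

def rodrigues(axis, angle):
    """3x3 rotation matrix about `axis` (Rodrigues' formula)."""
    k = axis / np.linalg.norm(axis)
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def joint_transform(joint, q):
    """Child-to-parent 4x4 transform of a URDF-style joint at configuration q."""
    T = np.eye(4)
    T[:3, 3] = joint["origin"]              # fixed joint origin (no rpy offset here)
    M = np.eye(4)
    axis = np.asarray(joint["axis"], float)
    if joint["type"] == "revolute":         # e.g. a door hinge
        M[:3, :3] = rodrigues(axis, q)
    elif joint["type"] == "prismatic":      # e.g. a drawer slide
        M[:3, 3] = axis / np.linalg.norm(axis) * q
    return T @ M

# Hypothetical joints for a cabinet: a hinged door and a sliding drawer.
door_joint = {"type": "revolute", "axis": [0, 0, 1], "origin": [0.5, 0.0, 0.0]}
drawer_joint = {"type": "prismatic", "axis": [1, 0, 0], "origin": [0.0, 0.0, 0.2]}

# Open the door 90 degrees and pull the drawer out 0.3 units.
p = np.array([-0.5, 0.0, 0.0, 1.0])        # homogeneous child-frame point
door_open = joint_transform(door_joint, np.pi / 2) @ p
drawer_open = joint_transform(drawer_joint, 0.3) @ np.array([0.0, 0.0, 0.0, 1.0])
```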

[Figure: URDF estimation comparisons in closed and open states for Microwave, Refrigerator, Table, and Storage Furniture. Columns: Input Image, Ground Truth, Articulate-Anything, Articulate-AnyMesh, Ours.]

🤖 Applications 🤖

Our method generates high-quality articulated objects with URDF parameters, enabling robot learning tasks such as door opening and drawer pulling.

🌷 Acknowledgments 🌷

We acknowledge support from NSF 2153851, 2301040, TRI, Sony, NVIDIA, Nirvana AI, Snap, Style3D, and Disney. We thank Rosalinda Chen for her support in setting up the robotic manipulation tasks used to evaluate our generated assets in downstream applications, Xiaoying Wang for assistance in running the qualitative comparisons during the rebuttal, and Yuchen Lin for guidance on using PartCrafter and for assistance in adapting their implementation as a foundation of our codebase.

🦜 BibTeX 🦜

If you find our work helpful, please consider citing:

@misc{he2025spark,
  title={SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge},
  author={Yumeng He and Ying Jiang and Jiayin Lu and Yin Yang and Chenfanfu Jiang},
  year={2025},
  eprint={2512.01629},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2512.01629}
}