CubePart: An Open-Vocabulary Part-Controllable 3D Generator

Gallery

Each example below pairs a live 3D viewer with the resulting in-engine behavior video.

Jellyfish race car

“A jellyfish-themed race car.”

Drone

Quadrotor with articulated blades and landing legs.

Holistic input mesh generated by another 3D model

Robot

Robot with independent head, torso, arms and legs.

Holistic input mesh created by an artist

Helicopter

Body and rotor blades as independent parts for spin animation.

Wizard with magic staff

Character with separable props (staff, orb, feather) for coordinated motion.

Holistic input mesh generated by another 3D model

Pirate chest

Lid and base as independent parts to drive an opening animation.

Potted flower

Stem, petals and leaves as independent parts for swaying motion.

Abstract

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements.

We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes—one per schema element—that assemble into a coherent object while respecting the specified semantic structure.

To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing.

462K assets in our open-vocabulary part-labeled dataset

2.02M parts; by asset count, the dataset is more than 11× larger than PartVerse-XL

Open-Voc.: user-defined part schemas at inference time

Game-Ready: assets can be integrated into game engines and driven by behavior scripts

Method

CubePart is a two-stage framework that takes a global text prompt and an open-ended parts schema (a list of free-form part names) and produces a set of meshes, one per schema element, that jointly assemble into a coherent object.

CubePart pipeline overview. (a) Stage 1 (Single-Part Mesh Generation) synthesizes a single holistic shape latent with a Multi-Modal DiT (MM-DiT) conditioned on the prompt and schema encoded by Qwen-VL. (b) Stage 2 (Multi-Part Mesh Generation) takes the Stage 1 latent and decomposes it into a set of distinct part latents using a multi-mesh DiT. We initialize Stage 2 with the MM-DiT weights from Stage 1 and inject Cross-Part Attention Residual Blocks to enable structural interaction across parts.

Stage 1 — Schema-aware single-mesh generation

We adapt a vecset-based diffusion transformer for text-to-3D generation. The pretrained model is fine-tuned on schema-augmented prompts of the form "<global caption>. This object contains the following parts: <list of part labels>." so that all requested parts appear in the generated shape.
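The prompt template quoted above can be sketched as a small helper. This is only an illustration: the function name `build_schema_prompt`, the comma separator between part labels, and the example part names are assumptions, since the text specifies the template but not how the list is serialized.

```python
def build_schema_prompt(caption: str, part_names: list[str]) -> str:
    """Append a parts schema to a global caption using the quoted template."""
    if not part_names:
        raise ValueError("the parts schema must contain at least one part name")
    # Drop a trailing period so the template's ". " separator is not doubled.
    caption = caption.strip().rstrip(".")
    # Assumption: labels are joined with ", "; the source does not fix a separator.
    parts = ", ".join(name.strip() for name in part_names)
    return f"{caption}. This object contains the following parts: {parts}."


# Hypothetical schema for one of the gallery prompts.
prompt = build_schema_prompt(
    "A yellow deep-sea research submersible.",
    ["hull", "propeller", "robotic arm"],
)
print(prompt)
```

Changing only the `part_names` list while keeping the caption fixed is how a different decomposition of the same object would be requested under this template.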

Stage 2 — Multi-part decoding

Stage 2 reuses Stage 1 weights and adds zero-initialized Cross-Part Attention Residual Blocks at four layers. This preserves the strong single-mesh prior while letting parts exchange global structural context. Each part is conditioned on a part-aware prompt indicating the target name and the full schema.

Cross-part attention residual block. A dedicated zero-initialized transformer block performs global attention across the latents of all parts and the full shape latent, leaving the single-mesh prior intact while enabling efficient inter-part communication.
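To see why zero initialization preserves the Stage 1 prior, the toy sketch below applies a residual cross-part attention update in plain Python. It is not the authors' implementation: the single-head attention with identity query/key/value projections, the use of the full-shape tokens as context only, and all names (`attention`, `cross_part_block`, `w_out`) are simplifying assumptions. The only property it is meant to show is that a zero-initialized output projection makes the block an identity map at the start of training.

```python
import math


def attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
                      for d in range(dim)])
    return mixed


def cross_part_block(part_latents, full_latent, w_out):
    """Residual update: each part token attends over all parts plus the full shape."""
    # Concatenate the tokens of every part, then the full-shape tokens (context only).
    tokens = [tok for part in part_latents for tok in part] + full_latent
    mixed = attention(tokens)
    updated, cursor = [], 0
    for part in part_latents:
        new_part = []
        for tok in part:
            # Output projection; when w_out is all zeros the residual branch vanishes.
            delta = [sum(w * m for w, m in zip(row, mixed[cursor])) for row in w_out]
            new_part.append([x + d for x, d in zip(tok, delta)])
            cursor += 1
        updated.append(new_part)
    return updated


dim = 2
parts = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]]]  # two parts: 2 tokens and 1 token
full = [[0.3, 0.7]]                                # one full-shape token
zero_w = [[0.0] * dim for _ in range(dim)]
identity_w = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

print(cross_part_block(parts, full, zero_w) == parts)      # zero init: latents unchanged
print(cross_part_block(parts, full, identity_w) == parts)  # trained weights: latents change
```

With `zero_w` the part latents pass through untouched, so the fine-tuned model initially behaves like the single-mesh model; inter-part communication appears only as the projection weights move away from zero.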

Dataset & Data Engine

Training open-vocabulary part-controllable 3D generators requires datasets that are both large and richly part-labeled. We built a scalable data engine that combines artist-provided segmentations with Vision-Language Models (VLMs) and a 3D-aware Set-of-Mark prompting strategy to produce concise, semantically meaningful part names at scale.

Part segmentation and naming comparison on the same Objaverse tank asset across three pipelines. Top-left: the original artist decomposition with 7 parts. Middle: PartVerse produces 17 over-segmented parts with VLM caption artifacts. Right: ours produces 4 concise, meaningful clusters (hull, turret and cannon, side arms, tracks). Our automatic pipeline produces concise, meaningful names (e.g., hull, tracks), whereas captions from prior work suffer from VLM artifacts and lack spatial specificity.

Dataset	Assets	Parts	Open-Vocab.	Part Text
ShapeNetPart	16K	93K	×	Taxonomy
PartNet	26K	573K	×	Taxonomy
PartVerse	12K	91K	✓	Captions
PartVerse-XL	40K	320K	✓	Captions
Ours	462K	2.02M	✓	Names

Dataset comparison. Our dataset contains more than 11× as many assets as PartVerse-XL (462K vs. 40K) and about 6× as many parts (2.02M vs. 320K), while using concise, schema-friendly part names rather than long descriptive captions.
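The scale comparison can be checked directly from the table. The short script below is only a worked calculation on the rounded counts reported above (K = 1e3, M = 1e6), so the ratios are approximate; the names `DATASETS` and `ratio_to` are illustrative.

```python
# (assets, parts) as reported in the comparison table; values are rounded.
DATASETS = {
    "ShapeNetPart": (16_000, 93_000),
    "PartNet": (26_000, 573_000),
    "PartVerse": (12_000, 91_000),
    "PartVerse-XL": (40_000, 320_000),
    "Ours": (462_000, 2_020_000),
}


def ratio_to(baseline: str, target: str = "Ours") -> tuple[float, float]:
    """Return (asset ratio, part ratio) of `target` relative to `baseline`."""
    base_assets, base_parts = DATASETS[baseline]
    assets, parts = DATASETS[target]
    return assets / base_assets, parts / base_parts


asset_ratio, part_ratio = ratio_to("PartVerse-XL")
print(f"vs. PartVerse-XL: {asset_ratio:.2f}x assets, {part_ratio:.2f}x parts")

for name, (assets, parts) in DATASETS.items():
    print(f"{name}: {parts / assets:.1f} parts per asset on average")
```

The asset ratio is the one that exceeds 11; the part ratio is smaller because the dataset averages fewer, coarser parts per asset than PartVerse-XL, in line with the concise part clusters described above.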

Results

Two-Stage Generation Gallery

Conditioned on a text prompt and a parts schema, CubePart synthesizes detailed global shapes and decomposes them into independent, structurally complete part meshes that adhere to the defined schema.

Dwarven steam driller

“A dwarven steam-powered drilling machine with a massive rotating drill bit at the front.”

Rhino tank

“A heavily armored futuristic tank designed to resemble a charging rhinoceros.”

Walking fantasy hut

“A fantasy cottage hut perched on giant mechanical chicken legs.”

Wild-west laser pistol

“A futuristic energy weapon with an old western revolver aesthetic.”

Clockwork horse

“A mechanical horse construct made of brass gears, copper plating, and exposed clockwork mechanisms.”

Deep-sea submersible

“A yellow deep-sea research submersible.”

Varying the Part Schema

The same object can be decomposed at different granularities, from 2 up to 8 parts, just by changing the schema. Ambiguous boundaries (e.g., between fenders and wheels) are resolved by naming the relevant parts explicitly in the schema.

Qualitative results with varying part schemas. Two input meshes (motorcycle and dune buggy) are each decomposed by CubePart into 2, 4, and 8 parts. With 2 parts the model merges fenders into wheels; with 4 and 8 parts, the explicit part names resolve the ambiguity and produce fine-grained components. Our method controls both the semantic identity and the granularity of generated parts.

Multi-Part Generation: Comparisons

Unlike prior methods that either fix the part vocabulary or infer parts implicitly from 2D segmentation, CubePart guarantees alignment between the generated meshes and a user-defined open-vocabulary schema. Compared to controllable (HoloPart) and non-controllable (OmniPart, PartCrafter, PartPacker) baselines, our method produces cleaner part boundaries and stronger geometric fidelity.

Qualitative comparison of multi-part mesh generation on PartObjaverse-Tiny. Five rows (house, character, horned figure, flowerpot, kettle) compare Ground-Truth, Ours, PatchAlign3D+HoloPart, SAM3+OmniPart, PartCrafter, and PartPacker. Under the mesh-conditioned setting, CubePart outperforms HoloPart in both schema adherence and geometric fidelity. Image-conditioned baselines (OmniPart, PartCrafter, PartPacker) do not offer user-defined part control and produce noisier segmentation boundaries.

Quantitative Comparison

Method (PartObjaverse-Tiny)	Part-level CD ↓	Part-level F-score ↑	Holistic CD ↓	Holistic F-score ↑
PartCrafter	0.493	0.290	0.272	0.552
PartPacker	0.374	0.475	0.164	0.792
PatchAlign3D + HoloPart	0.309	0.549	0.050	0.970
SAM3 + OmniPart	0.309	0.630	0.053	0.970
Ours	0.251	0.743	0.048	0.974

Evaluation on part-based multi-mesh generation. Our method achieves the best Chamfer Distance (CD) and F-score at both the part level and the holistic level, with the largest gains at the part level (CD 0.251 vs. 0.309 and F-score 0.743 vs. 0.630 for the strongest baselines).
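For readers unfamiliar with the two metrics, the sketch below computes a Chamfer Distance and an F-score between two small point sets. The exact protocol behind the table (number of sampled points, squared versus unsquared distances, normalization, and the F-score threshold) is not stated here, so the unsquared symmetric form and the threshold `tau` used below are assumptions, and the brute-force nearest-neighbor search is only suitable for tiny inputs.

```python
import math

Point = tuple[float, float, float]


def nearest_distances(src: list[Point], dst: list[Point]) -> list[float]:
    """For each point in `src`, the Euclidean distance to its nearest point in `dst`."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred: list[Point], gt: list[Point]) -> float:
    """Symmetric Chamfer Distance: mean nearest-neighbor distance in both directions."""
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    return sum(pred_to_gt) / len(pred_to_gt) + sum(gt_to_pred) / len(gt_to_pred)


def f_score(pred: list[Point], gt: list[Point], tau: float) -> float:
    """Harmonic mean of precision and recall at distance threshold `tau`."""
    precision = sum(d <= tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]  # second point is off by 0.5

print(chamfer_distance(pred_points, gt_points))    # 0.5
print(f_score(pred_points, gt_points, tau=0.1))    # 0.5
```

Lower CD and higher F-score both indicate closer agreement with the ground truth; part-level scores would apply the same computation per part, while holistic scores apply it to the assembled object.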

BibTeX

@inproceedings{zhu2026cubepart,
  author = {Zhu, Yiheng and Deng, Kangle and Fauconnier, Jean-Philippe
            and Navarro, Inaki and Li, Daiqing and Pun, Ava
            and Zhang, Yinan and Zhuang, Peiye and Sun, Xiaoxia
            and Agrawala, Maneesh and Bhat, Kiran and Zhou, Tinghui},
  title = {CubePart: An Open-Vocabulary Part-Controllable 3D Generator},
  booktitle = {SIGGRAPH},
  year = {2026},
}

Acknowledgments

We thank the leadership, Nishchaie Khanna, Karun Channa, Anupam Singh, and David Baszucki, for their support and guidance throughout this work. We also thank Michael Palleschi, Maurice Chu, Keenan Crane, and Kayvon Fatahalian for helpful discussions. We are grateful to Zhenyu Zhao, Daniel Chin, Michael Spedden, Alvin Chan, and Saurav Dhakad for setting up the evaluation pipeline as part of the broader project. Finally, we are thankful to the ML-Platform team, Anying Li, Yiqing Wang, Steve Han, Sourashis Roy, Chengyi Nie, Wei Zeng, Sal Pathare, Mandar Deshpande, and Andy Shen, for their contributions and collaboration that helped make this project possible.