SIGGRAPH 2026 · Los Angeles

CubePart: An Open-Vocabulary Part-Controllable 3D Generator

  • ¹ Roblox
  • ² Carnegie Mellon University
  • ³ Stanford University

*The first four authors contributed equally to this research.
Project leads.

CubePart teaser: given a text prompt and a parts schema, CubePart generates structurally complete multi-part 3D meshes that can be animated and driven by behavior scripts.
We propose CubePart, an open-vocabulary part-controllable 3D generator. (a) Given a text prompt and a schema defining part decomposition, CubePart synthesizes a multi-part 3D object where each component is a distinct, structurally complete mesh. (b) The controllable part-based framework directly facilitates scripted or physically simulated behaviors (bottom row). CubePart can also accept an existing mesh as input, decomposing it into semantic multi-part meshes according to the input part schema (last column).

Abstract

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements.

We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes—one per schema element—that assemble into a coherent object while respecting the specified semantic structure.

To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing.

  • 462K assets in our open-vocabulary part-labeled dataset
  • 2.02M parts; the dataset has more than 11× as many assets as PartVerse-XL
  • Open-Voc.: user-defined part schemas at inference time
  • Game-Ready: can be integrated into game engines and driven by behavior scripts

Method

CubePart is a two-stage framework that takes a global text prompt and an open-ended parts schema (a list of free-form part names) and produces a set of meshes, one per schema element, that jointly assemble into a coherent object.

CubePart pipeline. Stage 1 generates a single full-shape latent from the text prompt and schema. Stage 2 decomposes it into a set of part latents using a multi-mesh DiT with cross-part attention residual blocks.
Overview. (a) Stage 1 — Single-Part Mesh Generation synthesizes a holistic shape latent with a Multi-Modal DiT (MM-DiT) conditioned on the prompt and schema encoded by Qwen-VL. (b) Stage 2 — Multi-Part Mesh Generation takes the Stage 1 latent and decomposes it into distinct part latents. We initialize Stage 2 with the MM-DiT weights from Stage 1 and inject Cross-Part Attention Residual Blocks to enable structural interaction across parts.

Stage 1 — Schema-aware single-mesh generation

We adapt a vecset-based diffusion transformer for text-to-3D generation. The pretrained model is fine-tuned on schema-augmented prompts of the form "<global caption>. This object contains the following parts: <list of part labels>." so that all requested parts appear in the generated shape.
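The prompt template quoted above can be sketched as a small formatting helper. This is only an illustration, not the authors' code: the function name `build_schema_prompt`, the comma separator between part labels, and the example schema are assumptions; only the sentence template itself comes from the text.

```python
def build_schema_prompt(global_caption: str, part_labels: list[str]) -> str:
    """Format a schema-augmented prompt from a caption and a list of part names."""
    # Drop a trailing period so the template does not produce "..".
    caption = global_caption.strip().rstrip(".")
    labels = [label.strip() for label in part_labels if label.strip()]
    if not labels:
        raise ValueError("the parts schema must contain at least one part name")
    # Assumption: labels are joined with ", "; the page does not state the separator.
    return f"{caption}. This object contains the following parts: {', '.join(labels)}."


prompt = build_schema_prompt(
    "A yellow deep-sea research submersible.",
    ["hull", "propeller", "viewport"],  # hypothetical schema, not taken from the paper
)
print(prompt)
```

Because the schema is an open-ended list of strings, changing the decomposition only means passing a different list; no fixed taxonomy is involved.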

Stage 2 — Multi-part decoding

Stage 2 reuses Stage 1 weights and adds zero-initialized Cross-Part Attention Residual Blocks at four layers. This preserves the strong single-mesh prior while letting parts exchange global structural context. Each part is conditioned on a part-aware prompt indicating the target name and the full schema.

Cross-part attention residual block — zero-initialized transformer block that operates across the latents of all parts and the full shape latent.
A dedicated zero-initialized transformer block performs cross-part global attention, leaving the single-mesh priors intact while enabling efficient inter-part communication.
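Why zero initialization preserves the single-mesh prior can be seen in a toy example. The sketch below is a simplified, dependency-free stand-in, not the CubePart architecture: it assumes a single attention head, no normalization or MLP, that only part tokens are updated, and that the zero-initialized weights are the output projection `w_out`. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_part_residual(part_latents, shape_latent, w_q, w_k, w_v, w_out):
    """Let every part token attend to all parts plus the full shape, then add a residual.

    part_latents: list of parts, each a list of token vectors.
    shape_latent: list of token vectors for the full shape.
    """
    part_tokens = [tok for part in part_latents for tok in part]
    context = part_tokens + shape_latent  # keys and values also cover the full shape
    dim = len(w_q[0])
    q, k, v = matmul(part_tokens, w_q), matmul(context, w_k), matmul(context, w_v)
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in matmul(q, k_t)]
    update = matmul(matmul(weights, v), w_out)
    mixed = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(part_tokens, update)]
    # Split the flat token list back into per-part latents.
    out, start = [], 0
    for part in part_latents:
        out.append(mixed[start:start + len(part)])
        start += len(part)
    return out


rng = random.Random(0)
dim = 4
parts = [random_matrix(3, dim, rng), random_matrix(2, dim, rng)]  # two toy parts
shape = random_matrix(3, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
w_out = [[0.0] * dim for _ in range(dim)]  # zero-initialized output projection

out = cross_part_residual(parts, shape, w_q, w_k, w_v, w_out)
print(out == parts)  # True: at initialization the block is an identity map
```

With `w_out` at zero the block returns its input unchanged, so fine-tuning starts from the behavior of the Stage 1 weights; inter-part communication appears only as `w_out` moves away from zero during training.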

Dataset & Data Engine

Training open-vocabulary part-controllable 3D generators requires datasets that are both large and richly part-labeled. We built a scalable data engine that combines artist-provided segmentations with Vision-Language Models (VLMs) and a 3D-aware Set-of-Mark prompting strategy to produce concise, semantically meaningful part names at scale.

The same Objaverse tank asset, compared across three pipelines. Top-left: original artist decomposition with 7 parts. Middle: PartVerse produces 17 over-segmented parts with VLM caption artifacts. Right: Ours produces 4 concise, meaningful clusters (hull, turret and cannon, side arms, tracks).
Part segmentation and naming comparison. Our automatic pipeline produces concise, meaningful names (e.g. hull, tracks) whereas captions from prior work suffer from VLM artifacts and lack spatial specificity.
Dataset         Assets   Parts   Open-Vocab.   Part Text
ShapeNetPart    16K      93K     ×             Taxonomy
PartNet         26K      573K    ×             Taxonomy
PartVerse       12K      91K     ✓             Captions
PartVerse-XL    40K      320K    ✓             Captions
Ours            462K     2.02M   ✓             Names
Dataset comparison. Our dataset contains more than 11× as many assets as PartVerse-XL (462K vs. 40K) and uses concise, schema-friendly part names rather than long descriptive captions.
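The scale comparison can be checked with a few lines of arithmetic. The counts below are the rounded figures listed in the comparison, so the ratios are approximate; the variable names are illustrative.

```python
# (assets, parts) per dataset, using the rounded counts from the comparison.
datasets = {
    "ShapeNetPart": (16_000, 93_000),
    "PartNet": (26_000, 573_000),
    "PartVerse": (12_000, 91_000),
    "PartVerse-XL": (40_000, 320_000),
    "Ours": (462_000, 2_020_000),
}

ours_assets, ours_parts = datasets["Ours"]
for name, (assets, parts) in datasets.items():
    print(
        f"{name:13s} assets x{ours_assets / assets:6.2f}  "
        f"parts x{ours_parts / parts:6.2f}  parts/asset {parts / assets:5.2f}"
    )
```

Relative to PartVerse-XL this gives roughly 11.6× the assets and 6.3× the parts, which suggests the "more than 11×" figure refers to the asset count.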

Results

Two-Stage Generation Gallery

Conditioned on a text prompt and a parts schema, CubePart synthesizes detailed global shapes and decomposes them into independent, structurally complete part meshes that adhere to the defined schema.

Dwarven steam driller

“A dwarven steam-powered drilling machine with a massive rotating drill bit at the front.”

Rhino tank

“A heavily armored futuristic tank designed to resemble a charging rhinoceros.”

Walking fantasy hut

“A fantasy cottage hut perched on giant mechanical chicken legs.”

Wild-west laser pistol

“A futuristic energy weapon with an old western revolver aesthetic.”

Clockwork horse

“A mechanical horse construct made of brass gears, copper plating, and exposed clockwork mechanisms.”

Deep-sea submersible

“A yellow deep-sea research submersible.”

Varying the Part Schema

The same object can be decomposed at different granularities, from 2 parts up to 8 parts, just by changing the schema. Ambiguous boundaries (e.g. between fenders and wheels) are resolved by naming the relevant parts explicitly in the schema.

Two input meshes (motorcycle and dune buggy) each decomposed by CubePart into 2, 4, and 8 parts. With 2 parts the model merges fenders into wheels; with 4 and 8 parts, the explicit part names resolve the ambiguity and produce fine-grained components.
Qualitative results with varying part schemas. Our method controls both the semantic identity and granularity of generated parts.

Multi-Part Generation: Comparisons

Unlike prior methods that either fix the part vocabulary or infer parts implicitly from 2D segmentation, CubePart guarantees alignment between the generated meshes and a user-defined open-vocabulary schema. Compared to controllable (HoloPart) and non-controllable (OmniPart, PartCrafter, PartPacker) baselines, our method produces cleaner part boundaries and stronger geometric fidelity.

Qualitative comparison on PartObjaverse-Tiny. Five rows (house, character, horned figure, flowerpot, kettle) compare Ground-Truth, Ours, PatchAlign3D+HoloPart, SAM3+OmniPart, PartCrafter, and PartPacker.
Qualitative comparison of multi-part mesh generation. Under the mesh-conditioned setting, CubePart outperforms HoloPart in both schema adherence and geometric fidelity. Image-conditioned baselines (OmniPart, PartCrafter, PartPacker) fail to offer user-defined part control and produce noisier segmentation boundaries.

Quantitative Comparison

Method (on PartObjaverse-Tiny)   Part-Level            Holistic-Level
                                 CD ↓     F-score ↑    CD ↓     F-score ↑
PartCrafter                      0.493    0.290        0.272    0.552
PartPacker                       0.374    0.475        0.164    0.792
PatchAlign3D + HoloPart          0.309    0.549        0.050    0.970
SAM3 + OmniPart                  0.309    0.630        0.053    0.970
Ours                             0.251    0.743        0.048    0.974
Evaluation on part-based multi-mesh generation. Our method achieves the lowest Chamfer Distance (CD) and the highest F-score at both the part level and the holistic level, with the largest margin on the part-level metrics.
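For readers unfamiliar with the two metrics, here is a minimal sketch of how Chamfer Distance and F-score are commonly computed between two sampled point sets. The page does not specify the exact variant used in the evaluation, so the unsquared, mean-based Chamfer definition, the threshold `tau`, the toy point sets, and the brute-force nearest-neighbor search are all assumptions made for illustration.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance in both directions."""
    d_ab, d_ba = nearest_distances(a, b), nearest_distances(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d < tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d < tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy point sets: two of three predicted points match the ground truth exactly.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(chamfer_distance(pred, gt), f_score(pred, gt, tau=0.1))
```

Under these definitions, a part-level score would apply the same functions to each generated part and its corresponding ground-truth part, while a holistic score would compare the assembled object as a whole; how parts are matched in the actual benchmark is not stated here.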

BibTeX

@inproceedings{zhu2026cubepart,
  author = {Zhu, Yiheng and Deng, Kangle and Fauconnier, Jean-Philippe
            and Navarro, Inaki and Li, Daiqing and Pun, Ava
            and Zhang, Yinan and Zhuang, Peiye and Sun, Xiaoxia
            and Agrawala, Maneesh and Bhat, Kiran and Zhou, Tinghui},
  title = {CubePart: An Open-Vocabulary Part-Controllable 3D Generator},
  booktitle = {SIGGRAPH},
  year = {2026},
}

Acknowledgments

We thank the leadership, Nishchaie Khanna, Karun Channa, Anupam Singh, and David Baszucki, for their support and guidance throughout this work. We also thank Michael Palleschi, Maurice Chu, Keenan Crane, and Kayvon Fatahalian for helpful discussions. We are grateful to Zhenyu Zhao, Daniel Chin, Michael Spedden, Alvin Chan, and Saurav Dhakad for setting up the evaluation pipeline as part of the broader project. Finally, we are thankful to the ML-Platform team, Anying Li, Yiqing Wang, Steve Han, Sourashis Roy, Chengyi Nie, Wei Zeng, Sal Pathare, Mandar Deshpande, and Andy Shen, for their contributions and collaboration that helped make this project possible.