One variant from each of 16 part families, each part rotating around its primary axis.
twisted_drill
bevel_gear
twisted_bracket
coil_spring
impeller
bellows
heat_sink
eyebolt
turnbuckle
hinge
ratchet_sector
double_simplex_sprocket
t_pipe_fitting
lathe_turned_part
spline_hub
fan_shroud
Industrial CAD code generation requires more than recognizing the outer shape of a part: it requires understanding the part's 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would actually be designed and manufactured. Models that pass the eye too often fail the caliper: two programs may render to similar envelopes while differing substantially in editability, operation choice, and engineering detail.
BenchCAD is the first public CAD benchmark that combines four properties simultaneously:
helix, twistExtrude, polarArray, and advanced sweeps and lofts rare or absent in prior corpora.
BenchCAD: 17,900 verified CadQuery parts (code · STEP · 4 views · params · ops)
BenchCAD-QA: 2,400 paired image/code numeric QA items
BenchCAD-Edit: 748 verified before/after edit pairs
106 industrial part families
47 ISO / DIN / EN / ASME / IEC codes (49% of families)
46 distinct CadQuery operations covered
Every family is hand-crafted by domain experts directly from industrial standards. Each release ships with Croissant 1.0 metadata; code is released under MIT and data under CC-BY-4.0.
We organise CAD reasoning into a four-level capability pyramid, from low-level perception to high-level synthesis. The same questions are evaluated under both Vision QA (multi-view renders) and Code QA (CadQuery source); the matched-pair design isolates whether a failure stems from visual recognition or from reasoning over the queried attribute.
L1 Holistic Visual Recognition · L2 CAD Operation Understanding · L3 Industrial Parametric Abstraction · L4 Compositional Spatial / Code Reasoning.
Vision2Code (img2cq)
CadQuery generation from multi-view renders
The model receives four canonical orthographic views and emits a CadQuery program. We re-execute the program and score it against the ground truth via IoU, Chamfer, Hausdorff, Feature-F1, essential-op recall, and execution rate.
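The two headline geometric metrics can be sketched as follows. This is a minimal numpy illustration of voxel IoU and symmetric Chamfer distance under common conventions, not the benchmark's released scoring code:

```python
import numpy as np

def voxel_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean occupancy grids (BenchCAD voxelises at 64^3)."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty grids are trivially identical
    return float(np.logical_and(a, b).sum() / union)

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N,3) and (M,3) surface samples:
    the average nearest-neighbour distance, taken in both directions."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    return 0.5 * float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Two small 4x4x4 occupancy grids overlapping in one slice:
a = np.zeros((4, 4, 4), dtype=bool); a[:2] = True   # slices 0-1 filled
b = np.zeros((4, 4, 4), dtype=bool); b[1:3] = True  # slices 1-2 filled
print(voxel_iou(a, b))  # 16 shared voxels / 48 in the union, about 0.333
```

Real scoring would sample the two meshes' surfaces before computing Chamfer; the brute-force pairwise distance matrix here is only practical for small point sets.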
Vision QA (qa_img)
Geometric reasoning from rendered views
Numeric QA from multi-view renders, broken out along the four-level capability hierarchy (L1 Visual Recognition → L4 Spatial Reasoning). ±5% tolerance for ratios, exact match for integers.
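The stated tolerance rule can be sketched as a small grader. This is a hypothetical reconstruction from the description above (the function name and signature are ours, not the released harness):

```python
def grade_numeric(pred: float, gt: float, integer_valued: bool, rel_tol: float = 0.05) -> bool:
    """Score one numeric QA item: exact match for integer-valued answers,
    a +/-5% relative tolerance for ratios and other real-valued answers."""
    if integer_valued:
        return pred == gt  # counts (holes, teeth, fins, ...) must be exact
    if gt == 0:
        return pred == 0  # relative tolerance is undefined at zero
    return abs(pred - gt) <= rel_tol * abs(gt)

print(grade_numeric(6, 6, integer_valued=True))        # True: exact integer match
print(grade_numeric(1.54, 1.5, integer_valued=False))  # True: within 5% of 1.5
```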
Code QA (qa_code)
Symbolic understanding of CadQuery programs
Same QA bank as Vision QA, but conditioned on CadQuery source instead of renders. The matched-pair gap isolates the cost of visual recognition vs. parametric reasoning.
Code Edit (edit_code)
Instruction-guided program editing
Given a CadQuery program and a natural-language edit instruction, output a minimally modified program matching the target. Five edit types T1–T5 range from literal swap to multi-block geometry rebuild. Scored by headroom-normalised improvement.
Per-model performance across the four BenchCAD task categories (frontier proprietary subset).
Detailed per-task leaderboards with all evaluated models, including open-source and CAD-specialist baselines, are below.
Image-to-CadQuery generation from four canonical orthographic views. IoU↑ at 64³ voxels; CD↓ = Chamfer distance; ess↑ = essential-op recall; exec % = fraction of generated programs that run; total↑ = composite score. Best per block in bold.
| Model | IoU ↑ | CD ↓ | ess ↑ | exec % | total ↑ |
|---|---|---|---|---|---|
| **Specialist CAD lineage (~2B)** | | | | | |
| Cadrille-RL | 0.0683 | 0.3507 | 0.3118 | 91.2 | 0.3382 |
| CADEvolve v3 | 0.7497 | 0.0080 | 0.3715 | 92.7 | 0.6014 |
| **Frontier MLLMs** | | | | | |
| GPT-4o | 0.1884 | 0.0623 | 0.4680 | 87.0 | 0.2274 |
| GPT-5.3 | 0.1952 | 0.0594 | 0.4940 | 69.5 | 0.2160 |
| GPT-5.3 thinking | 0.2072 | 0.0566 | 0.4550 | 67.5 | 0.2120 |
| Claude Sonnet 4.6 | 0.2380 | 0.0320 | 0.5060 | 85.4 | 0.3510 |
| Claude Sonnet 4.6 thinking-high | 0.2420 | 0.0220 | – | 90.0 | 0.3850 |
| Claude Opus 4.7 | 0.2740 | 0.0210 | 0.4770 | 94.0 | 0.3830 |
| Claude Opus 4.7 thinking | 0.2670 | 0.0240 | 0.4710 | 90.4 | 0.3780 |
| Gemini 3.1 Pro | 0.2560 | 0.0400 | 0.3780 | 74.0 | 0.3290 |
| Gemini 3.1 Pro thinking | 0.2790 | 0.0250 | 0.5670 | 79.8 | 0.3970 |
| OpenAI o3 | 0.5004 | 0.0143 | 0.4440 | 5.6 | 0.1081 |
| Moonshot v1-128k | 0.1530 | 0.0572 | 0.1922 | 10.5 | 0.0612 |
| Moonshot v1-8k | 0.2421 | 0.0577 | 0.1792 | 9.0 | 0.0604 |
| **Ours: Qwen3-VL-2B trained on BenchCAD** | | | | | |
| Qwen3-VL-2B SFT (OOD) | 0.5630 | 0.0174 | 0.7380 | 94.2 | 0.5877 |
| Qwen3-VL-2B SFT (IID) | 0.6400 | 0.0108 | 0.8710 | 88.1 | 0.6282 |
| Qwen3-VL-2B RL (OOD) | 0.7140 | 0.0047 | 0.8100 | 99.1 | 0.7231 |
| Qwen3-VL-2B RL (IID) | 0.7520 | 0.0041 | 0.8920 | 98.9 | 0.7682 |
| Qwen3-VL-2B (baseline) | 0.0032 | 0.3652 | 0.0042 | 14.6 | 0.0084 |
Same 2,400 numeric questions, two conditioning modalities. The matched-pair gap surfaces the Holistic Spatial & Detailing Deficit: models extract geometric and parametric information far more reliably from explicit code than from rendered images.
Vision QA (qa_img)

| Model | Holistic Visual Recognition | CAD Operation Understanding | Industrial Parametric Abstraction | Spatial Reasoning | Total ↑ |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 0.750 | 0.462 | 0.536 | 0.688 | 0.587 |
| Gemini 3.1 Pro thinking | 0.722 | 0.426 | 0.551 | 0.669 | 0.576 |
| Opus 4.7 thinking | 0.715 | 0.485 | 0.421 | 0.614 | 0.530 |
| Opus 4.7 | 0.699 | 0.464 | 0.426 | 0.668 | 0.526 |
| GPT-5.3 thinking | 0.650 | 0.429 | 0.482 | 0.534 | 0.514 |
| GPT-5.3 | 0.636 | 0.423 | 0.488 | 0.548 | 0.513 |
| GPT-4o | 0.599 | 0.408 | 0.431 | 0.396 | 0.464 |
| Moonshot v1-8k | 0.600 | 0.246 | 0.465 | 0.181 | 0.447 |
| Moonshot v1-128k | 0.556 | 0.387 | 0.427 | 0.334 | 0.442 |
| OpenAI o3 | 0.328 | 0.188 | 0.398 | 0.560 | 0.327 |
| blank-image baseline | 0.376 | 0.325 | 0.418 | 0.296 | 0.375 |
Code QA (qa_code)

| Model | CadQuery Code Recognition | CAD Operation Understanding | Industrial Parametric Abstraction | Spatial Reasoning | Total ↑ |
|---|---|---|---|---|---|
| Gemini 3.1 Pro thinking | 0.907 | 0.783 | 0.876 | 0.537 | 0.838 |
| Gemini 3.1 Pro | 0.914 | 0.782 | 0.867 | 0.537 | 0.836 |
| Opus 4.7 thinking | 0.891 | 0.781 | 0.851 | 0.632 | 0.829 |
| GPT-5.3 | 0.879 | 0.805 | 0.815 | 0.731 | 0.823 |
| GPT-5.3 thinking | 0.885 | 0.802 | 0.811 | 0.730 | 0.821 |
| Opus 4.7 | 0.868 | 0.800 | 0.793 | 0.595 | 0.801 |
| GPT-4o | 0.865 | 0.593 | 0.732 | 0.688 | 0.726 |
| OpenAI o3 | 0.804 | 0.701 | 0.689 | 0.492 | 0.708 |
| Moonshot v1-128k | 0.842 | 0.551 | 0.692 | 0.792 | 0.700 |
| gpt-oss-120b | 0.790 | 0.732 | 0.656 | 0.379 | 0.689 |
| Nemotron-3 120B | 0.771 | 0.660 | 0.661 | 0.293 | 0.671 |
| Gemma-4-31B-it | 0.791 | 0.674 | 0.606 | 0.528 | 0.664 |
| Moonshot v1-8k | 0.772 | 0.603 | 0.555 | 0.536 | 0.620 |
| blank-code baseline | 0.040 | 0.257 | 0.290 | 0.420 | 0.223 |
Best Code QA reaches 0.838 while best Vision QA caps at 0.587: a ~25-point modality gap on identical questions.
The headline metric is Accuracy, a headroom-normalised improvement: how much of the original-to-target IoU gap the edited program closes.
A score of 1.0 means the model fully traversed the gap; the no-change baseline scores 0.
The 748 pairs span five edit types T1–T5: literal replacement, chained transform, relative compute, feature edit, and geometry rebuild.
Per-type accuracy across T1–T5: simple literal swaps are nearly solved; multi-block geometry rebuilds remain difficult.
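Concretely, the headroom normalisation described above can be sketched as follows (our reconstruction, assuming IoU against the target part as the underlying similarity; the variable names are ours):

```python
def edit_accuracy(iou_original: float, iou_edited: float) -> float:
    """Headroom-normalised improvement: the fraction of the original-to-target
    IoU gap that the edited program closes. A perfect edit (IoU = 1.0 against
    the target) scores 1.0; returning the program unchanged scores 0.0."""
    headroom = 1.0 - iou_original  # gap left between original and target
    if headroom <= 0.0:
        return 1.0  # original already matches the target exactly
    return (iou_edited - iou_original) / headroom

print(edit_accuracy(0.5, 0.75))  # 0.5: half the remaining gap was closed
print(edit_accuracy(0.5, 0.5))   # 0.0: the no-change baseline
```

Note the metric can go negative if an edit makes the program a worse match than the original, which usefully penalises destructive rewrites.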
| Model | Thinking | Accuracy ↑ |
|---|---|---|
| GPT-5.3 | ✓ | 0.865 |
| Claude Opus 4.7 | ✓ | 0.853 |
| Gemini 3.1 Pro | ✓ | 0.837 |
| Claude Opus 4.7 | ✗ | 0.811 |
| Gemini 3.1 Pro | ✗ | 0.795 |
| GPT-5.3 | ✗ | 0.740 |
| OpenAI o3 | ✓ | 0.708 |
| GPT-4o | ✗ | 0.615 |
| Nemotron-3 120B | ✗ | 0.608 |
| gpt-oss-120b | ✗ | 0.561 |
| no-change baseline | | 0.000 |
To add a model to the four leaderboards above, open a PR on the code repo with your results.jsonl; submission instructions are in the repo.
@misc{benchcad2026,
title = {BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD},
author = {Zhang, Haozhe and Liu, Kaichen and Chen, Miaomiao and Li, Lei and Yang, Shaojie and Peng, Cheng and Chen, Hanjie},
year = {2026},
eprint = {2605.10865},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.10865}
}