BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen
One variant of the wing_nut family (rotating 3D render) – ground-truth CadQuery, STEP, mesh, 4-view renders, and numeric QA.

  • 17,900 – Vision2Code (img2cq): verified parts · 4-view → CadQuery
  • 2,400 – Vision QA · Code QA (qa_img, qa_code): paired image/code numeric items
  • 748 – Code Edit (edit_code): before/after edit pairs · T1–T5

BenchCAD family distribution – 106 part families across 7 industrial sectors

106 industrial part families spanning fasteners, transmission, structural, fluid, panels, hardware, and enclosures. 49% of families (52/106) are anchored to real specification tables across 47 ISO / DIN / EN / ASME / IEC codes.

BenchCAD is a unified, capability-decomposed benchmark for industrial CAD reasoning: 17,900 execution-verified CadQuery programs across 106 industrial families, every family expert-curated and grounded in industrial standards. It evaluates models through four matched tasks: Vision2Code, Vision QA, Code QA, and Code Edit.

Three releases: BenchCAD (17,900 verified parts) · BenchCAD-QA (2,400 paired image/code numeric QA items) · BenchCAD-Edit (748 verified edit pairs). The operation surface covers 46 distinct CadQuery ops, including helix, twistExtrude, polarArray, loft, and sweep.

Introduction

Industrial CAD code generation requires more than recognizing the outer shape of a part: it requires understanding the part's 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would actually be designed and manufactured. Models that pass the eye too often fail the caliper – two programs may render to similar envelopes while differing substantially in editability, operation choice, and engineering detail.

BenchCAD is the first public CAD benchmark that combines four properties simultaneously:

  • Execution-verified at scale – 17,900 sandbox-executed CadQuery parts across 106 industrial families.
  • Standard-anchored – 49% of families (52/106) bound to real ISO / DIN / EN / ASME / IEC specification tables.
  • Operation-rich – 46 distinct CadQuery ops, including helix, twistExtrude, polarArray, and advanced sweeps and lofts that are rare or absent in prior corpora.
  • Capability-decomposed – four matched tasks (Vision2Code · Vision QA · Code QA · Code Edit) whose image/code paired contrast isolates visual recognition, parametric abstraction, and code synthesis.

Three Released Datasets

  • 17,900 – BenchCAD: verified CadQuery parts (code · STEP · 4 views · params · ops)
  • 2,400 – BenchCAD-QA: paired image/code numeric QA items
  • 748 – BenchCAD-Edit: verified before/after edit pairs
  • 106 – industrial part families
  • 47 – ISO / DIN / EN / ASME / IEC codes referenced (49% of families)
  • 46 – distinct CadQuery operations covered

Every family is hand-crafted by domain experts directly from industrial standards. Each release ships with Croissant 1.0 metadata, code under MIT, data under CC-BY-4.0.
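To give a sense of the kind of program a part family compiles down to, here is a minimal illustrative CadQuery sketch for a simplified hex-nut-style fastener. The dimensions, variable names, and output file are placeholders for illustration only; they are not taken from any BenchCAD family or standard table.

import cadquery as cq

# Illustrative only: a simplified hex-nut-style fastener, not an actual
# BenchCAD family. All dimensions are placeholders, not standard values.
outer_diameter = 10.0   # circumscribed circle of the hexagon (mm)
thickness = 5.0         # nut height (mm)
bore_diameter = 5.5     # clearance hole for the bolt (mm)

nut = (
    cq.Workplane("XY")
    .polygon(6, outer_diameter)   # hexagonal outline (circumscribed diameter)
    .extrude(thickness)           # prismatic body
    .faces(">Z")                  # select the top face
    .hole(bore_diameter)          # drill a through-hole
)

# Released parts ship as code plus STEP and renders; STEP export looks like this.
cq.exporters.export(nut, "hex_nut_example.step")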

Capability Hierarchy

We organise CAD reasoning into a four-level capability pyramid, from low-level perception to high-level synthesis. The same questions are evaluated under both Vision QA (multi-view renders) and Code QA (CadQuery source); the matched-pair design isolates whether a failure stems from visual recognition or from reasoning over the queried attribute.

BenchCAD-QA capability hierarchy – L1 Holistic Visual Recognition · L2 CAD Operation Understanding · L3 Industrial Parametric Abstraction · L4 Compositional Spatial / Code Reasoning – with paired Vision QA and Code QA examples per level.

Four Matched Tasks

Vision2Code img2cq

CadQuery generation from multi-view renders

Model receives four canonical orthographic views and emits a CadQuery program. We re-execute and score against GT via IoU, Chamfer, Hausdorff, Feature-F1, essential-op recall, and exec rate.
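As an illustration of the geometric half of this scoring, the sketch below computes voxel IoU at 64³ over the union of both meshes' bounding boxes using trimesh. The grid construction and handling of non-watertight meshes are assumptions here; this is not the benchmark's official scorer.

import numpy as np
import trimesh

def voxel_occupancy(mesh: trimesh.Trimesh, bounds: np.ndarray, res: int = 64) -> np.ndarray:
    """Occupancy of a res^3 grid of cell centers spanning `bounds` (a 2x3 array)."""
    lo, hi = bounds
    centers = [np.linspace(lo[i], hi[i], res, endpoint=False) + (hi[i] - lo[i]) / (2 * res)
               for i in range(3)]
    pts = np.stack(np.meshgrid(*centers, indexing="ij"), axis=-1).reshape(-1, 3)
    return mesh.contains(pts).reshape(res, res, res)   # requires a watertight mesh

def voxel_iou(pred: trimesh.Trimesh, gt: trimesh.Trimesh, res: int = 64) -> float:
    """IoU of the two occupancy grids on a shared grid (a common convention;
    the benchmark's exact voxelisation settings may differ)."""
    bounds = np.stack([np.minimum(pred.bounds[0], gt.bounds[0]),
                       np.maximum(pred.bounds[1], gt.bounds[1])])
    a = voxel_occupancy(pred, bounds, res)
    b = voxel_occupancy(gt, bounds, res)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / float(union) if union else 0.0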

Vision QA qa_img

Geometric reasoning from rendered views

Numeric QA from multi-view renders, broken out along the four-level capability hierarchy (L1 Visual Recognition → L4 Spatial Reasoning). ±5% tolerance for ratios, exact match for integers.
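A minimal sketch of that tolerance rule follows; the benchmark's actual grader, answer parsing, and edge-case handling may differ, and the zero-target branch below is an assumption.

def numeric_answer_correct(pred: float, gold: float, integer_valued: bool) -> bool:
    """Exact match for integer-valued answers (e.g. hole counts),
    ±5% relative tolerance for continuous answers (e.g. ratios)."""
    if integer_valued:
        return round(pred) == round(gold)
    if gold == 0.0:
        return pred == 0.0            # assumption: not specified by the benchmark
    return abs(pred - gold) / abs(gold) <= 0.05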

Code QA qa_code

Symbolic understanding of CadQuery programs

Same QA bank as Vision QA, but conditioned on CadQuery source instead of renders. The matched-pair gap isolates the cost of visual recognition vs. parametric reasoning.

Code Edit edit_code

Instruction-guided program editing

Given a CadQuery program and a natural-language edit instruction, output a minimally modified program matching the target. Five edit types T1–T5, from literal swap to multi-block geometry rebuild. Scored by headroom-normalised improvement.

Overall Leaderboard

Per-model performance across the four BenchCAD task categories (frontier proprietary subset).

BenchCAD overall leaderboard – per-model performance across CodeGen IoU, CodeEdit, QA-img, and QA-code for the GPT, Claude, and Gemini families.

Detailed per-task leaderboards covering all evaluated models, including open-source and CAD-specialist baselines, follow below.

Vision2Code Leaderboard

Image-to-CadQuery generation from four canonical orthographic views. IoU↑ @ 64³ voxels; CD↓ Chamfer distance; ess↑ essential-op recall; exec % = fraction of generated programs that run; total↑ = composite score. Best per block in bold.

Model | IoU ↑ | CD ↓ | ess ↑ | exec % | total ↑

Specialist CAD lineage (~2B)
Cadrille-RL | 0.0683 | 0.3507 | 0.3118 | 91.2 | 0.3382
CADEvolve v3 | 0.7497 | 0.0080 | 0.3715 | 92.7 | 0.6014

Frontier MLLMs
GPT-4o | 0.1884 | 0.0623 | 0.4680 | 87.0 | 0.2274
GPT-5.3 | 0.1952 | 0.0594 | 0.4940 | 69.5 | 0.2160
GPT-5.3 thinking | 0.2072 | 0.0566 | 0.4550 | 67.5 | 0.2120
Claude Sonnet 4.6 | 0.2380 | 0.0320 | 0.5060 | 85.4 | 0.3510
Claude Sonnet 4.6 thinking-high | 0.2420 | 0.0220 | – | 90.0 | 0.3850
Claude Opus 4.7 | 0.2740 | 0.0210 | 0.4770 | 94.0 | 0.3830
Claude Opus 4.7 thinking | 0.2670 | 0.0240 | 0.4710 | 90.4 | 0.3780
Gemini 3.1 Pro | 0.2560 | 0.0400 | 0.3780 | 74.0 | 0.3290
Gemini 3.1 Pro thinking | 0.2790 | 0.0250 | 0.5670 | 79.8 | 0.3970
OpenAI o3 | 0.5004 | 0.0143 | 0.4440 | 5.6 | 0.1081
Moonshot v1-128k | 0.1530 | 0.0572 | 0.1922 | 10.5 | 0.0612
Moonshot v1-8k | 0.2421 | 0.0577 | 0.1792 | 9.0 | 0.0604

Ours – Qwen3-VL-2B trained on BenchCAD
Qwen3-VL-2B SFT (OOD) | 0.5630 | 0.0174 | 0.7380 | 94.2 | 0.5877
Qwen3-VL-2B SFT (IID) | 0.6400 | 0.0108 | 0.8710 | 88.1 | 0.6282
Qwen3-VL-2B RL (OOD) | 0.7140 | 0.0047 | 0.8100 | 99.1 | 0.7231
Qwen3-VL-2B RL (IID) | 0.7520 | 0.0041 | 0.8920 | 98.9 | 0.7682
Qwen3-VL-2B (baseline) | 0.0032 | 0.3652 | 0.0042 | 14.6 | 0.0084

Vision QA & Code QA Leaderboards

Same 2,400 numeric questions, two conditioning modalities. The matched-pair gap surfaces the Holistic Spatial & Detailing Deficit: models extract geometric and parametric information far more reliably from explicit code than from rendered images.

Vision QA qa_img

Model | Holistic Visual Recognition | CAD Operation Understanding | Industrial Parametric Abstraction | Spatial Reasoning | Total ↑
Gemini 3.1 Pro | 0.750 | 0.462 | 0.536 | 0.688 | 0.587
Gemini 3.1 Pro thinking | 0.722 | 0.426 | 0.551 | 0.669 | 0.576
Opus 4.7 thinking | 0.715 | 0.485 | 0.421 | 0.614 | 0.530
Opus 4.7 | 0.699 | 0.464 | 0.426 | 0.668 | 0.526
GPT-5.3 thinking | 0.650 | 0.429 | 0.482 | 0.534 | 0.514
GPT-5.3 | 0.636 | 0.423 | 0.488 | 0.548 | 0.513
GPT-4o | 0.599 | 0.408 | 0.431 | 0.396 | 0.464
Moonshot v1-8k | 0.600 | 0.246 | 0.465 | 0.181 | 0.447
Moonshot v1-128k | 0.556 | 0.387 | 0.427 | 0.334 | 0.442
OpenAI o3 | 0.328 | 0.188 | 0.398 | 0.560 | 0.327
blank-image baseline: 0.376 · 0.325 · 0.418 · 0.296 · 0.375

Code QA qa_code

Model | CadQuery Code Recognition | CAD Operation Understanding | Industrial Parametric Abstraction | Spatial Reasoning | Total ↑
Gemini 3.1 Pro thinking | 0.907 | 0.783 | 0.876 | 0.537 | 0.838
Gemini 3.1 Pro | 0.914 | 0.782 | 0.867 | 0.537 | 0.836
Opus 4.7 thinking | 0.891 | 0.781 | 0.851 | 0.632 | 0.829
GPT-5.3 | 0.879 | 0.805 | 0.815 | 0.731 | 0.823
GPT-5.3 thinking | 0.885 | 0.802 | 0.811 | 0.730 | 0.821
Opus 4.7 | 0.868 | 0.800 | 0.793 | 0.595 | 0.801
GPT-4o | 0.865 | 0.593 | 0.732 | 0.688 | 0.726
OpenAI o3 | 0.804 | 0.701 | 0.689 | 0.492 | 0.708
Moonshot v1-128k | 0.842 | 0.551 | 0.692 | 0.792 | 0.700
gpt-oss-120b | 0.790 | 0.732 | 0.656 | 0.379 | 0.689
Nemotron-3 120B | 0.771 | 0.660 | 0.661 | 0.293 | 0.671
Gemma-4-31B-it | 0.791 | 0.674 | 0.606 | 0.528 | 0.664
Moonshot v1-8k | 0.772 | 0.603 | 0.555 | 0.536 | 0.620
blank-code baseline: 0.040 · 0.257 · 0.290 · 0.420 · 0.223

Best Code QA reaches 0.838 while best Vision QA caps at 0.587 – a roughly 25-point modality gap on identical questions.

Code Edit Leaderboard

The headline metric is Accuracy, a headroom-normalised improvement: how much of the original-to-target IoU gap the edited program closes. A score of 1.0 means the model fully traversed the gap; the no-change baseline scores 0. The 748 pairs span five edit types T1–T5: literal replacement, chained transform, relative compute, feature edit, and geometry rebuild.
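One natural reading of that definition is sketched below; the official scorer may clamp or weight cases differently, and the handling of regressions and of saturated originals is an assumption.

def headroom_normalised_accuracy(iou_original: float, iou_edited: float) -> float:
    """Fraction of the original-to-target IoU headroom that the edit closes.
    iou_original: IoU(original program, target); iou_edited: IoU(edited program, target).
    The no-change case scores 0.0; fully reaching the target scores 1.0."""
    headroom = 1.0 - iou_original
    if headroom <= 0.0:
        # assumption: original already matches the target, so any non-regression counts
        return 1.0 if iou_edited >= iou_original else 0.0
    # assumption: regressions below the original are clamped to 0
    return max(0.0, (iou_edited - iou_original) / headroom)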

BenchCAD-Edit accuracy across task types T1 literal-replace through T5 geometry-rebuild for frontier MLLMs.

Per-type accuracy across T1–T5: simple literal swaps are nearly solved; multi-block geometry rebuilds remain difficult.

Model | Thinking | Accuracy ↑
GPT-5.3 | ✓ | 0.865
Claude Opus 4.7 | ✓ | 0.853
Gemini 3.1 Pro | ✓ | 0.837
Claude Opus 4.7 | – | 0.811
Gemini 3.1 Pro | – | 0.795
GPT-5.3 | – | 0.740
OpenAI o3 | ✓ | 0.708
GPT-4o | – | 0.615
Nemotron-3 120B | – | 0.608
gpt-oss-120b | – | 0.561
no-change baseline: 0.000

Submit Your Results

To add a model to the four leaderboards above, open a PR on the code repo with your results.jsonl; see the repo for submission instructions.

BibTeX

@misc{benchcad2026,
  title        = {BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD},
  author       = {Zhang, Haozhe and Liu, Kaichen and Chen, Miaomiao and Li, Lei and Yang, Shaojie and Peng, Cheng and Chen, Hanjie},
  year         = {2026},
  eprint       = {2605.10865},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2605.10865}
}