BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen
One variant of the wing_nut family (rotating 3D render) – ground-truth CadQuery, STEP, mesh, 4-view renders, and numeric QA.

  • 17,900 – Vision2Code (img2cq): verified parts · 4-view → CadQuery
  • 2,400 – Vision QA · Code QA (qa_img, qa_code): paired image/code numeric items
  • 748 – Code Edit (edit_code): before/after edit pairs · T1–T5

BenchCAD family distribution – 106 part families across 7 industrial sectors

106 industrial part families spanning fasteners, transmission, structural, fluid, panels, hardware, and enclosures. 49% of families (52/106) are anchored to real specification tables across 47 ISO / DIN / EN / ASME / IEC codes.

BenchCAD is a unified, capability-decomposed benchmark for industrial CAD reasoning: 17,900 execution-verified CadQuery programs across 106 industrial families, every family expert-curated and grounded in industrial standards. It evaluates models through four matched tasks: Vision2Code, Vision QA, Code QA, and Code Edit.

Three releases: BenchCAD (17,900 verified parts) · BenchCAD-QA (2,400 paired image/code numeric QA items) · BenchCAD-Edit (748 verified edit pairs). The operation surface covers 46 distinct CadQuery ops, including helix, twistExtrude, polarArray, loft, and sweep.

Introduction

Industrial CAD code generation requires more than recognizing the outer shape of a part: it requires understanding the part's 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would actually be designed and manufactured. Models that pass the eye too often fail the caliper – two programs may render to similar envelopes while differing substantially in editability, operation choice, and engineering detail.

BenchCAD is the first public CAD benchmark that combines four properties simultaneously:

  • Execution-verified at scale – 17,900 sandbox-executed CadQuery parts across 106 industrial families.
  • Standard-anchored – 49% of families (52/106) bound to real ISO / DIN / EN / ASME / IEC specification tables.
  • Operation-rich – 46 distinct CadQuery ops, including helix, twistExtrude, polarArray, and advanced sweeps and lofts that are rare or absent in prior corpora.
  • Capability-decomposed – four matched tasks (Vision2Code · Vision QA · Code QA · Code Edit) whose image/code paired contrast isolates visual recognition, parametric abstraction, and code synthesis.

Three Released Datasets

  • 17,900 – BenchCAD: verified CadQuery parts (code · STEP · 4 views · params · ops)
  • 2,400 – BenchCAD-QA: paired image/code numeric QA items
  • 748 – BenchCAD-Edit: verified before/after edit pairs
  • 106 – industrial part families
  • 47 – ISO / DIN / EN / ASME / IEC codes referenced (49% of families)
  • 46 – distinct CadQuery operations covered

Every family is hand-crafted by domain experts directly from industrial standards. Each release ships with Croissant 1.0 metadata, code under MIT, data under CC-BY-4.0.
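To give a sense of the kind of program a part family compiles down to, here is a minimal illustrative CadQuery sketch for a simplified hex-nut-style fastener. The dimensions, variable names, and output file are placeholders for illustration only; they are not taken from any BenchCAD family or standard table.

import cadquery as cq

# Illustrative only: a simplified hex-nut-style fastener, not an actual
# BenchCAD family. All dimensions are placeholders, not standard values.
outer_diameter = 10.0   # circumscribed circle of the hexagon (mm)
thickness = 5.0         # nut height (mm)
bore_diameter = 5.5     # clearance hole for the bolt (mm)

nut = (
    cq.Workplane("XY")
    .polygon(6, outer_diameter)   # hexagonal outline (circumscribed diameter)
    .extrude(thickness)           # prismatic body
    .faces(">Z")                  # select the top face
    .hole(bore_diameter)          # drill a through-hole
)

# Released parts ship as code plus STEP and renders; STEP export looks like this.
cq.exporters.export(nut, "hex_nut_example.step")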

Capability Hierarchy

We organise CAD reasoning into a four-level capability pyramid, from low-level perception to high-level synthesis. The same questions are evaluated under both Vision QA (multi-view renders) and Code QA (CadQuery source); the matched-pair design isolates whether a failure stems from visual recognition or from reasoning over the queried attribute.

BenchCAD-QA capability hierarchy – L1 Holistic Visual Recognition · L2 CAD Operation Understanding · L3 Industrial Parametric Abstraction · L4 Compositional Spatial / Code Reasoning – with paired Vision QA and Code QA examples per level.

Four Matched Tasks

Vision2Code img2cq

CadQuery generation from multi-view renders

Model receives four canonical orthographic views and emits a CadQuery program. We re-execute and score against GT via IoU, Chamfer, Hausdorff, Feature-F1, essential-op recall, and exec rate.
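As an illustration of the geometric half of this scoring, the sketch below computes voxel IoU at 64³ over the union of both meshes' bounding boxes using trimesh. The grid construction and handling of non-watertight meshes are assumptions here; this is not the benchmark's official scorer.

import numpy as np
import trimesh

def voxel_occupancy(mesh: trimesh.Trimesh, bounds: np.ndarray, res: int = 64) -> np.ndarray:
    """Occupancy of a res^3 grid of cell centers spanning `bounds` (a 2x3 array)."""
    lo, hi = bounds
    centers = [np.linspace(lo[i], hi[i], res, endpoint=False) + (hi[i] - lo[i]) / (2 * res)
               for i in range(3)]
    pts = np.stack(np.meshgrid(*centers, indexing="ij"), axis=-1).reshape(-1, 3)
    return mesh.contains(pts).reshape(res, res, res)   # requires a watertight mesh

def voxel_iou(pred: trimesh.Trimesh, gt: trimesh.Trimesh, res: int = 64) -> float:
    """IoU of the two occupancy grids on a shared grid (a common convention;
    the benchmark's exact voxelisation settings may differ)."""
    bounds = np.stack([np.minimum(pred.bounds[0], gt.bounds[0]),
                       np.maximum(pred.bounds[1], gt.bounds[1])])
    a = voxel_occupancy(pred, bounds, res)
    b = voxel_occupancy(gt, bounds, res)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / float(union) if union else 0.0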

Vision QA qa_img

Geometric reasoning from rendered views

Numeric QA from multi-view renders, broken out along the four-level capability hierarchy (L1 Visual Recognition → L4 Spatial Reasoning). ±5% tolerance for ratios, exact match for integers.
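A minimal sketch of that tolerance rule follows; the benchmark's actual grader, answer parsing, and edge-case handling may differ, and the zero-target branch below is an assumption.

def numeric_answer_correct(pred: float, gold: float, integer_valued: bool) -> bool:
    """Exact match for integer-valued answers (e.g. hole counts),
    ±5% relative tolerance for continuous answers (e.g. ratios)."""
    if integer_valued:
        return round(pred) == round(gold)
    if gold == 0.0:
        return pred == 0.0            # assumption: not specified by the benchmark
    return abs(pred - gold) / abs(gold) <= 0.05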

Code QA qa_code

Symbolic understanding of CadQuery programs

Same QA bank as Vision QA, but conditioned on CadQuery source instead of renders. The matched-pair gap isolates the cost of visual recognition vs. parametric reasoning.

Code Edit edit_code

Instruction-guided program editing

Given a CadQuery program and a natural-language edit instruction, output a minimally modified program matching the target. Five edit types T1–T5, from literal swap to multi-block geometry rebuild. Scored by headroom-normalised improvement.

Overall Leaderboard

Per-model performance across the four BenchCAD task categories (frontier proprietary subset).

BenchCAD overall leaderboard – per-model performance across CodeGen IoU, CodeEdit, QA-img, and QA-code for the GPT, Claude, and Gemini families.

Detailed per-task leaderboards covering all evaluated models, including open-source and CAD-specialist baselines, follow below.

Vision2Code Leaderboard

Image-to-CadQuery generation from four canonical orthographic views. IoU↑ @ 64³ voxels; CD↓ Chamfer distance; ess↑ essential-op recall; exec % = fraction of generated programs that run; total↑ = composite score. Best per block in bold.

Model | IoU ↑ | CD ↓ | ess ↑ | exec % | total ↑

Specialist CAD lineage (~2B)
Cadrille-RL | 0.0683 | 0.3507 | 0.3118 | 91.2 | 0.3382
CADEvolve v3 | 0.7497 | 0.0080 | 0.3715 | 92.7 | 0.6014

Frontier MLLMs
GPT-4o | 0.1884 | 0.0623 | 0.4680 | 87.0 | 0.2274
GPT-5.3 | 0.1952 | 0.0594 | 0.4940 | 69.5 | 0.2160
GPT-5.3 thinking | 0.2072 | 0.0566 | 0.4550 | 67.5 | 0.2120
Claude Sonnet 4.6 | 0.2380 | 0.0320 | 0.5060 | 85.4 | 0.3510
Claude Sonnet 4.6 thinking-high | 0.2420 | 0.0220 | – | 90.0 | 0.3850
Claude Opus 4.7 | 0.2740 | 0.0210 | 0.4770 | 94.0 | 0.3830
Claude Opus 4.7 thinking | 0.2670 | 0.0240 | 0.4710 | 90.4 | 0.3780
Gemini 3.1 Pro | 0.2560 | 0.0400 | 0.3780 | 74.0 | 0.3290
Gemini 3.1 Pro thinking | 0.2790 | 0.0250 | 0.5670 | 79.8 | 0.3970
OpenAI o3 | 0.5004 | 0.0143 | 0.4440 | 5.6 | 0.1081
Moonshot v1-128k | 0.1530 | 0.0572 | 0.1922 | 10.5 | 0.0612
Moonshot v1-8k | 0.2421 | 0.0577 | 0.1792 | 9.0 | 0.0604

Ours – Qwen3-VL-2B trained on BenchCAD
Qwen3-VL-2B SFT (OOD) | 0.5630 | 0.0174 | 0.7380 | 94.2 | 0.5877
Qwen3-VL-2B SFT (IID) | 0.6400 | 0.0108 | 0.8710 | 88.1 | 0.6282
Qwen3-VL-2B RL (OOD) | 0.7140 | 0.0047 | 0.8100 | 99.1 | 0.7231
Qwen3-VL-2B RL (IID) | 0.7520 | 0.0041 | 0.8920 | 98.9 | 0.7682
Qwen3-VL-2B (baseline) | 0.0032 | 0.3652 | 0.0042 | 14.6 | 0.0084

Vision QA & Code QA Leaderboards

Same 2,400 numeric questions, two conditioning modalities. The matched-pair gap surfaces the Holistic Spatial & Detailing Deficit: models extract geometric and parametric information far more reliably from explicit code than from rendered images.

Vision QA qa_img

Model | Holistic Visual Recognition | CAD Operation Understanding | Industrial Parametric Abstraction | Spatial Reasoning | Total ↑
Gemini 3.1 Pro | 0.750 | 0.462 | 0.536 | 0.688 | 0.587
Gemini 3.1 Pro thinking | 0.722 | 0.426 | 0.551 | 0.669 | 0.576
Opus 4.7 thinking | 0.715 | 0.485 | 0.421 | 0.614 | 0.530
Opus 4.7 | 0.699 | 0.464 | 0.426 | 0.668 | 0.526
GPT-5.3 thinking | 0.650 | 0.429 | 0.482 | 0.534 | 0.514
GPT-5.3 | 0.636 | 0.423 | 0.488 | 0.548 | 0.513
GPT-4o | 0.599 | 0.408 | 0.431 | 0.396 | 0.464
Moonshot v1-8k | 0.600 | 0.246 | 0.465 | 0.181 | 0.447
Moonshot v1-128k | 0.556 | 0.387 | 0.427 | 0.334 | 0.442
OpenAI o3 | 0.328 | 0.188 | 0.398 | 0.560 | 0.327
blank-image baseline: 0.376 · 0.325 · 0.418 · 0.296 · 0.375

Code QA qa_code

Model | CadQuery Code Recognition | CAD Operation Understanding | Industrial Parametric Abstraction | Spatial Reasoning | Total ↑
Gemini 3.1 Pro thinking | 0.907 | 0.783 | 0.876 | 0.537 | 0.838
Gemini 3.1 Pro | 0.914 | 0.782 | 0.867 | 0.537 | 0.836
Opus 4.7 thinking | 0.891 | 0.781 | 0.851 | 0.632 | 0.829
GPT-5.3 | 0.879 | 0.805 | 0.815 | 0.731 | 0.823
GPT-5.3 thinking | 0.885 | 0.802 | 0.811 | 0.730 | 0.821
Opus 4.7 | 0.868 | 0.800 | 0.793 | 0.595 | 0.801
GPT-4o | 0.865 | 0.593 | 0.732 | 0.688 | 0.726
OpenAI o3 | 0.804 | 0.701 | 0.689 | 0.492 | 0.708
Moonshot v1-128k | 0.842 | 0.551 | 0.692 | 0.792 | 0.700
gpt-oss-120b | 0.790 | 0.732 | 0.656 | 0.379 | 0.689
Nemotron-3 120B | 0.771 | 0.660 | 0.661 | 0.293 | 0.671
Gemma-4-31B-it | 0.791 | 0.674 | 0.606 | 0.528 | 0.664
Moonshot v1-8k | 0.772 | 0.603 | 0.555 | 0.536 | 0.620
blank-code baseline: 0.040 · 0.257 · 0.290 · 0.420 · 0.223

Best Code QA reaches 0.838 while best Vision QA caps at 0.587 – a roughly 25-point modality gap on identical questions.

Code Edit Leaderboard

The headline metric is Accuracy, a headroom-normalised improvement: how much of the original-to-target IoU gap the edited program closes. A score of 1.0 means the model fully traversed the gap; the no-change baseline scores 0. The 748 pairs span five edit types T1–T5: literal replacement, chained transform, relative compute, feature edit, and geometry rebuild.
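One natural reading of that definition is sketched below; the official scorer may clamp or weight cases differently, and the handling of regressions and of saturated originals is an assumption.

def headroom_normalised_accuracy(iou_original: float, iou_edited: float) -> float:
    """Fraction of the original-to-target IoU headroom that the edit closes.
    iou_original: IoU(original program, target); iou_edited: IoU(edited program, target).
    The no-change case scores 0.0; fully reaching the target scores 1.0."""
    headroom = 1.0 - iou_original
    if headroom <= 0.0:
        # assumption: original already matches the target, so any non-regression counts
        return 1.0 if iou_edited >= iou_original else 0.0
    # assumption: regressions below the original are clamped to 0
    return max(0.0, (iou_edited - iou_original) / headroom)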

BenchCAD-Edit accuracy across task types T1 literal-replace through T5 geometry-rebuild for frontier MLLMs.

Per-type accuracy across T1–T5: simple literal swaps are nearly solved; multi-block geometry rebuilds remain difficult.

Model | Thinking | Accuracy ↑
GPT-5.3 | ✓ | 0.865
Claude Opus 4.7 | ✓ | 0.853
Gemini 3.1 Pro | ✓ | 0.837
Claude Opus 4.7 | – | 0.811
Gemini 3.1 Pro | – | 0.795
GPT-5.3 | – | 0.740
OpenAI o3 | ✓ | 0.708
GPT-4o | – | 0.615
Nemotron-3 120B | – | 0.608
gpt-oss-120b | – | 0.561
no-change baseline: 0.000

Submit Your Results

To add a model to the four leaderboards above, open a PR on the code repo with your results.jsonl; see the repo for submission instructions.

BibTeX

@misc{benchcad2026,
  title        = {BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD},
  author       = {Zhang, Haozhe and Liu, Kaichen and Chen, Miaomiao and Li, Lei and Yang, Shaojie and Peng, Cheng and Chen, Hanjie},
  year         = {2026},
  eprint       = {2605.10865},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2605.10865}
}