NonoBench Results

Benchmark results for LLM performance on Nonogram puzzle solving, comparing accuracy, speed, and cost across three grid sizes (5×5, 10×10, and 15×15).

Puzzles: 30

Models: 45

Total runs: 1349

Last updated: 2/19/2026, 5:23:00 PM

Model Accuracy (chart, selectable by grid size)

Detailed Model Statistics

Model	Overall	5×5 accuracy	5×5 runs	5×5 correct	5×5 avg cost	5×5 avg time	10×10 accuracy	10×10 runs	10×10 correct	10×10 avg cost	10×10 avg time	15×15 accuracy	15×15 runs	15×15 correct	15×15 avg cost	15×15 avg time
gemini-3.1-pro-preview-high	63.3%	90%	10	9	$0.04	31.88s	60%	10	6	$0.26	163.72s	40%	10	4	$0.25	174.99s
claude-4.5-opus-high	56.7%	100%	10	10	$0.13	88.03s	40%	10	4	$1.19	844.26s	30%	10	3	$0.98	778.95s
gemini-3-pro-preview-high	53.3%	90%	10	9	$0.04	44.69s	60%	10	6	$0.22	180.79s	10%	10	1	$0.26	254.67s
gpt-5.2-high	53.3%	90%	10	9	$0.04	45.42s	50%	10	5	$0.40	531.64s	20%	10	2	$0.64	933.66s
claude-4.5-opus-low	50.0%	100%	10	10	$0.09	51.44s	40%	10	4	$0.37	233.62s	10%	10	1	$0.19	115.41s
gpt-5.2-xhigh	50.0%	90%	10	9	$0.08	90.89s	60%	10	6	$0.61	749.07s	0%	10	0	$0.56	994.77s
deepseek-v3.2-speciale-high	46.7%	90%	10	9	$0.0055	597.11s	30%	10	3	$0.02	2580.79s	20%	10	2	$0.02	1511.17s
gemini-3-flash-preview-high	46.7%	90%	10	9	$0.02	48.55s	40%	10	4	$0.07	164.13s	10%	10	1	$0.08	139.85s
gpt-5.2-low	46.7%	100%	10	10	$0.03	34.78s	20%	10	2	$0.21	278.82s	20%	10	2	$0.16	184.03s
gemini-3.1-pro-preview-low	41.4%	80%	10	8	$0.02	39.45s	22%	9	2	$0.04	65.49s	20%	10	2	$0.03	45.48s
deepseek-v3.2-high	36.7%	90%	10	9	$0.0040	529.73s	10%	10	1	$0.03	882.49s	10%	10	1	$0.03	704.64s
deepseek-v3.2-speciale	36.7%	80%	10	8	$0.0044	542.25s	20%	10	2	$0.02	1795.09s	10%	10	1	$0.01	2008.11s
gpt-oss-120b-high	36.7%	80%	10	8	$0.0018	86.90s	30%	10	3	$0.01	237.03s	0%	10	0	$0.0091	113.93s
kimi-k2.5-high	36.7%	100%	10	10	$0.03	231.56s	10%	10	1	$0.14	1012.85s	0%	10	0	$0.09	708.31s
grok-4	33.3%	80%	10	8	$0.13	146.35s	20%	10	2	$0.70	780.83s	0%	10	0	$0.68	809.63s
minimax-m2.5-high	33.3%	90%	10	9	$0.02	220.06s	10%	10	1	$0.08	1473.31s	0%	10	0	$0.05	845.37s
gpt-oss-120b-low	30.0%	90%	10	9	$0.0002	35.14s	0%	10	0	$0.0018	46.15s	0%	10	0	$0.0009	24.41s
qwen3-next-80b-a3b-thinking	30.0%	90%	10	9	$0.01	79.71s	0%	10	0	$0.04	164.82s	0%	10	0	$0.04	199.00s
seed-1.6-high	30.0%	90%	10	9	$0.01	117.03s	0%	10	0	$0.04	359.01s	0%	10	0	$0.03	293.53s
glm-5-reasoning-high	26.7%	60%	10	6	$0.01	91.85s	10%	10	1	$0.07	650.51s	10%	10	1	$0.07	736.69s
kimi-k2-thinking	26.7%	80%	10	8	$0.04	260.15s	0%	10	0	$0.12	938.03s	0%	10	0	$0.14	428.72s
minimax-m2.1-high	26.7%	80%	10	8	$0.0085	129.04s	0%	10	0	$0.03	279.81s	0%	10	0	$0.02	400.14s
claude-4.5-sonnet-reasoning	23.3%	70%	10	7	$0.09	91.83s	0%	10	0	$0.35	365.92s	0%	10	0	$0.15	160.40s
grok-4.1-fast-reasoning	23.3%	60%	10	6	$0.0033	47.29s	10%	10	1	$0.02	295.44s	0%	10	0	$0.03	390.37s
minimax-m2.5	23.3%	70%	10	7	$0.01	181.17s	0%	10	0	$0.09	1481.82s	0%	10	0	$0.06	732.98s
glm-4.7-reasoning	20.0%	60%	10	6	$0.02	250.43s	0%	10	0	$0.06	653.38s	0%	10	0	$0.04	423.89s
glm-5-reasoning	20.0%	60%	10	6	$0.02	209.77s	0%	10	0	$0.10	770.82s	0%	10	0	$0.08	736.45s
minimax-m2.1	20.0%	60%	10	6	$0.0093	150.83s	0%	10	0	$0.03	313.90s	0%	10	0	$0.02	367.21s
mimo-v2-flash-high	16.7%	50%	10	5	$0.0000	155.63s	0%	10	0	$0.0000	365.93s	0%	10	0	$0.0000	572.42s
glm-4.7-reasoning-high	13.3%	40%	10	4	$0.01	220.38s	0%	10	0	$0.06	652.39s	0%	10	0	$0.04	642.59s
grok-4.1-fast-reasoning-high	13.3%	40%	10	4	$0.0031	43.43s	0%	10	0	$0.02	349.43s	0%	10	0	$0.02	262.72s
seed-1.6-flash-high	13.3%	40%	10	4	$0.0042	89.14s	0%	10	0	$0.0076	172.38s	0%	10	0	$0.0083	183.87s
olmo-3.1-32b-think	10.0%	30%	10	3	$0.0071	153.69s	0%	10	0	$0.01	290.81s	0%	10	0	$0.01	301.58s
deepseek-v3.2	3.3%	10%	10	1	$0.0004	28.16s	0%	10	0	$0.0028	157.06s	0%	10	0	$0.0008	40.96s
kimi-k2	3.3%	10%	10	1	$0.0051	105.92s	0%	10	0	$0.01	101.82s	0%	10	0	$0.0073	110.70s
claude-4.5-sonnet-non-reasoning	0.0%	0%	10	0	$0.01	16.12s	0%	10	0	$0.0095	10.42s	0%	10	0	$0.0093	10.60s
gemini-3-flash-preview-minimal	0.0%	0%	10	0	$0.0002	1.10s	0%	10	0	$0.0005	1.36s	0%	10	0	$0.0009	1.93s
gemini-3-pro-preview-low	0.0%	0%	10	0	$0.0026	3.84s	0%	10	0	$0.0040	4.49s	0%	10	0	$0.0055	5.32s
glm-4.7-non-reasoning	0.0%	0%	10	0	$0.0002	5.73s	0%	10	0	$0.0003	9.51s	0%	10	0	$0.0004	16.67s
glm-5-non-reasoning	0.0%	0%	10	0	$0.0003	7.96s	0%	10	0	$0.0004	11.57s	0%	10	0	$0.0005	250.52s
grok-4.1-fast-non-reasoning	0.0%	0%	10	0	$0.0001	2.16s	0%	10	0	$0.0066	99.06s	0%	10	0	$0.0009	11.48s
kimi-k2.5-non-reasoning	0.0%	0%	10	0	$0.0002	1.63s	0%	10	0	$0.0005	2.42s	0%	10	0	$0.0006	4.06s
mimo-v2-flash	0.0%	0%	10	0	$0.0000	10.98s	0%	10	0	$0.0000	12.71s	0%	10	0	$0.0000	91.02s
ministral-14b-2512	0.0%	0%	10	0	$0.0001	0.54s	0%	10	0	$0.0001	2.30s	0%	10	0	$0.0008	39.92s
mistral-large-2512	0.0%	0%	10	0	$0.0002	1.13s	0%	10	0	$0.0004	2.36s	0%	10	0	$0.0010	7.24s
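
The Overall column appears to be pooled across grid sizes (total correct divided by total runs) rather than a plain mean of the three per-size percentages. This is inferred from the numbers and is not stated on the page: gemini-3.1-pro-preview-low has only 9 runs at 10×10, and 12/29 gives the listed 41.4%, whereas averaging 80%, 22%, and 20% would give roughly 40.7%. A minimal Python sketch of that calculation, using (correct, runs) pairs copied from the table; the function name `overall_accuracy` is hypothetical:

```python
def overall_accuracy(per_size):
    """Pooled accuracy in percent: total correct / total runs.

    per_size is a list of (correct, runs) pairs, one per grid size.
    Assumption: the dashboard pools runs; this is inferred, not documented.
    """
    correct = sum(c for c, _ in per_size)
    runs = sum(r for _, r in per_size)
    return round(100 * correct / runs, 1)


# (correct, runs) for 5x5, 10x10, 15x15, copied from the table above
rows = {
    "gemini-3.1-pro-preview-high": [(9, 10), (6, 10), (4, 10)],
    "gemini-3.1-pro-preview-low": [(8, 10), (2, 9), (2, 10)],
}

for model, per_size in rows.items():
    print(f"{model}: {overall_accuracy(per_size)}%")
```

With 10 runs per size the two methods agree, so the difference only shows up for rows with a missing run.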

Statistics by grid size

Grid	Avg accuracy	Solved	Avg time	Avg cost
5×5	56.2%	253/450	118.24s	$0.02
10×10	12.0%	54/449	457.31s	$0.12
15×15	4.7%	21/450	394.90s	$0.11
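
The per-size accuracy figures above are consistent with solved puzzles divided by runs, and the three run counts add up to the 1349 total runs reported at the top of the page. A short Python sketch of that cross-check, assuming accuracy is simply solved/runs rounded to one decimal (the names `solved_by_size` and `accuracy_pct` are hypothetical):

```python
# (solved, runs) per grid size, copied from the statistics above
solved_by_size = {
    "5x5": (253, 450),
    "10x10": (54, 449),
    "15x15": (21, 450),
}


def accuracy_pct(solved, runs):
    """Accuracy in percent, rounded to one decimal place."""
    return round(100 * solved / runs, 1)


for size, (solved, runs) in solved_by_size.items():
    print(f"{size}: {accuracy_pct(solved, runs)}%")

# Sum of runs across the three grid sizes
total_runs = sum(runs for _, runs in solved_by_size.values())
print("total runs:", total_runs)
```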