AliciaBench - Evaluating LLMs in Mazes

AliciaBench is a benchmark created by Juan Echeverria to measure LLMs' ability to solve a very specific problem: escaping from a maze.

To learn why and how I created it, see the About page.

1st Place: gpt-5-mini-2025-08-07 (997.29)

2nd Place: grok-4-0709 (988.48)

3rd Place: grok-4-1-fast-reasoning (967.00)

Ranking	LLM	Total score	Cost	n=5	n=7	n=9	n=11	n=13	n=15	n=17	n=19	n=21	n=23
#1	GPT-5-mini	997.29	$3.55	100.00	100.00	100.00	100.00	100.00	100.00	99.88	100.00	100.00	97.42
#2	Grok 4	988.48	$17.67	100.00	100.00	100.00	96.67	97.67	99.31	99.38	98.93	100.00	96.53
#3	Grok 4.1 Fast	967.00	$0.70	100.00	100.00	100.00	96.67	99.00	100.00	99.50	93.64	99.88	78.31
#4	GPT-5	965.46	$19.48	100.00	100.00	100.00	97.77	100.00	94.60	99.25	98.52	98.24	77.07
#5	Sonnet 4.5	918.40	$12.93	100.00	100.00	100.00	95.23	97.67	92.61	88.38	94.76	85.24	64.52
#6	Gemini 3.0 Pro	858.28	$18.35	100.00	100.00	94.70	95.49	97.67	93.70	77.51	58.64	94.09	46.49
#7	Deepseek R1	650.71	$4.96	100.00	100.00	100.00	96.44	98.67	81.21	49.00	5.65	19.75	0.00
#8	GPT Oss 120b	637.32	$0.54	100.00	100.00	96.36	98.00	94.33	67.50	45.27	19.33	16.52	0.00
#9	Flash 2.5 (09/25)	625.57	$5.46	90.00	70.00	90.02	84.36	79.67	76.57	38.26	28.16	58.93	9.60
#10	qwq-32b	442.01	$0.32	100.00	100.00	88.33	57.44	57.67	18.57	20.00	0.00	-	-
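
The page does not say how the Total score is computed. From the numbers in the table, it appears to be the sum of the ten per-size columns (n=5 through n=23), with a "-" cell contributing nothing. The sketch below checks that reading against three rows copied from the table. The names `LEADERBOARD`, `summed_score`, and `check_totals` are hypothetical, and the 0.05 tolerance is an assumption made to absorb rounding in the displayed values.

```python
# Rows copied from the leaderboard table: (reported total, per-size scores).
# None stands for a "-" cell; treating it as 0 is an assumption.
LEADERBOARD = {
    "GPT-5-mini": (997.29, [100.00, 100.00, 100.00, 100.00, 100.00,
                            100.00, 99.88, 100.00, 100.00, 97.42]),
    "Grok 4": (988.48, [100.00, 100.00, 100.00, 96.67, 97.67,
                        99.31, 99.38, 98.93, 100.00, 96.53]),
    "qwq-32b": (442.01, [100.00, 100.00, 88.33, 57.44, 57.67,
                         18.57, 20.00, 0.00, None, None]),
}


def summed_score(scores):
    """Sum the per-size scores, skipping missing ("-") cells."""
    return sum(s for s in scores if s is not None)


def check_totals(board, tolerance=0.05):
    """Map each model to True if its recomputed sum is within tolerance
    of the reported total, False otherwise."""
    return {
        name: abs(summed_score(scores) - reported) <= tolerance
        for name, (reported, scores) in board.items()
    }


if __name__ == "__main__":
    for name, ok in check_totals(LEADERBOARD).items():
        reported, scores = LEADERBOARD[name]
        print(f"{name}: reported={reported:.2f} "
              f"recomputed={summed_score(scores):.2f} match={ok}")
```

The recomputed sums differ from the reported totals by at most 0.01 for these rows, which would be consistent with the totals being summed from unrounded per-size scores.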