AliciaBench is a benchmark created by Juan Echeverria to measure LLMs' ability to solve a very specific problem: escaping from a maze.
To learn why and how I created it, visit the About page.

- 1st Place: gpt-5-mini-2025-08-07 (997.29)
- 2nd Place: grok-4-0709 (988.48)
- 3rd Place: grok-4-1-fast-reasoning (967.00)

| Ranking | LLM | Total score | Cost | n=5 | n=7 | n=9 | n=11 | n=13 | n=15 | n=17 | n=19 | n=21 | n=23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | gpt-5-mini-2025-08-07 | 997.29 | $3.55 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 99.88 | 100.00 | 100.00 | 97.42 |
| #2 | grok-4-0709 | 988.48 | $17.67 | 100.00 | 100.00 | 100.00 | 96.67 | 97.67 | 99.31 | 99.38 | 98.93 | 100.00 | 96.53 |
| #3 | grok-4-1-fast-reasoning | 967.00 | $0.70 | 100.00 | 100.00 | 100.00 | 96.67 | 99.00 | 100.00 | 99.50 | 93.64 | 99.88 | 78.31 |
| #4 |  | 965.46 | $19.48 | 100.00 | 100.00 | 100.00 | 97.77 | 100.00 | 94.60 | 99.25 | 98.52 | 98.24 | 77.07 |
| #5 |  | 918.40 | $12.93 | 100.00 | 100.00 | 100.00 | 95.23 | 97.67 | 92.61 | 88.38 | 94.76 | 85.24 | 64.52 |
| #6 |  | 858.28 | $18.35 | 100.00 | 100.00 | 94.70 | 95.49 | 97.67 | 93.70 | 77.51 | 58.64 | 94.09 | 46.49 |
| #7 |  | 650.71 | $4.96 | 100.00 | 100.00 | 100.00 | 96.44 | 98.67 | 81.21 | 49.00 | 5.65 | 19.75 | 0.00 |
| #8 |  | 637.32 | $0.54 | 100.00 | 100.00 | 96.36 | 98.00 | 94.33 | 67.50 | 45.27 | 19.33 | 16.52 | 0.00 |
| #9 |  | 625.57 | $5.46 | 90.00 | 70.00 | 90.02 | 84.36 | 79.67 | 76.57 | 38.26 | 28.16 | 58.93 | 9.60 |
| #10 |  | 442.01 | $0.32 | 100.00 | 100.00 | 88.33 | 57.44 | 57.67 | 18.57 | 20.00 | 0.00 | - | - |
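
This page does not say how the total score is computed. The figures in the table are consistent with the total being the sum of the ten per-size scores (maximum 1000), give or take rounding. The sketch below checks that reading against three rows copied from the table. The names `ROWS`, `summed_score`, and `mismatches` are hypothetical, and treating a "-" cell as contributing zero is an assumption, not something the page states.

```python
# Rows copied from the leaderboard: (rank, listed total, scores for n=5..n=23).
# None stands for a cell shown as "-" in the table.
ROWS = [
    ("#1", 997.29, [100.00, 100.00, 100.00, 100.00, 100.00,
                    100.00, 99.88, 100.00, 100.00, 97.42]),
    ("#3", 967.00, [100.00, 100.00, 100.00, 96.67, 99.00,
                    100.00, 99.50, 93.64, 99.88, 78.31]),
    ("#10", 442.01, [100.00, 100.00, 88.33, 57.44, 57.67,
                     18.57, 20.00, 0.00, None, None]),
]


def summed_score(per_size):
    """Sum the per-size scores, counting "-" cells (None) as zero (assumption)."""
    return sum(score for score in per_size if score is not None)


def mismatches(rows, tolerance=0.05):
    """Return the ranks whose listed total differs from the summed per-size scores."""
    return [
        rank
        for rank, listed_total, per_size in rows
        if abs(summed_score(per_size) - listed_total) > tolerance
    ]


if __name__ == "__main__":
    for rank, listed_total, per_size in ROWS:
        print(rank, listed_total, round(summed_score(per_size), 2))
    print("rows outside tolerance:", mismatches(ROWS))
```

The tolerance absorbs the rounding of the displayed per-size values to two decimals. Under this reading, a model that fails the largest mazes loses at most 100 points per size.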