What is AliciaBench?
AliciaBench is a benchmark designed to evaluate LLMs' ability to complete a very specific task: finding the exit of a maze.
The name, as you've probably guessed, is inspired by "Alice in Wonderland". There's also some wordplay here: 'Alice' in Spanish is 'Alicia', and 'AI' in Spanish is 'IA', so 'AlicIA' carries the Spanish acronym for AI within its name.
Why this interest in mazes?
Very good question... The truth is that I created this benchmark somewhat by chance. One day, curiosity got the better of me and I decided to check whether GPT-4 was capable of finding the exit of a maze. It turned out it couldn't, even for very simple mazes, which left me a bit shocked at how badly the model handled an apparently simple task.
I left the experiment there until OpenAI launched its first reasoning model, o1. I decided to check whether o1 could complete the maze-solving task any more successfully, and it turned out it could! That made me curious to see how well other models performed, and that's how I ended up creating this benchmark.
Why do I think this benchmark is interesting?
To be honest, I'm not entirely sure what ability we're measuring with this benchmark. Finding the exit of a maze when you already know the exit coordinates is not a task that a priori requires great intelligence; a monkey could solve it by trial and error, given some memory and infinite time.
But LLMs don't have infinite time to 'think'; they are limited by the number of tokens they can generate per request, which is why there are many mazes they cannot solve.
What I believe this benchmark measures is the models' processing capacity: the efficiency with which they can execute that trial-and-error strategy to find the exit.
Still, there are two reasons why I think this benchmark is very interesting:
- It cannot be saturated, because the dimensions of the mazes can be increased indefinitely.
- It doesn't require any prior knowledge, so the results are not biased by the amount of data the models were trained on.
How does the benchmark work?
The first thing I did was build a maze generator to produce a dataset of mazes for testing different models. The mazes are actually square NxN matrices; for example, this is maze 501:
[0,0,0,X,0],
[0,1,1,1,0],
[0,1,0,1,0],
[0,1,1,1,0],
[0,1,0,0,0]
The "1"s represent paths, the "0"s represent walls, and the "X" represents the position where the LLM is located, which is initially the entrance to the maze. In the case of this maze, the solution to find the exit would be: 3 cells down, 2 cells to the left, 1 cell down.
The dimension N of the maze is what allows me to control its complexity. The larger N is, the more rows and columns the maze has, and therefore more paths, which means more decisions need to be made, making it more difficult to find the exit. For each dimension, I have created 10 mazes. The mazes can only have odd dimensions due to how the algorithm that creates them works.
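To make the representation concrete, here is a minimal Python sketch (my own illustration, not the benchmark's actual code) of how a maze like this can be stored and how a movement can be checked against walls and the border of the matrix:

# Maze 501 as a list of rows; 1 = path, 0 = wall, "X" = current position.
MAZE = [
    [0, 0, 0, "X", 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
]

# Direction -> (row delta, column delta) for one cell of movement.
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def find_position(maze):
    """Return the (row, col) of the 'X' marker."""
    for r, row in enumerate(maze):
        for c, cell in enumerate(row):
            if cell == "X":
                return r, c

def try_move(maze, direction, cells):
    """Move the 'X' cell by cell; stop early if a wall or the border is hit.
    Returns (cells actually moved, reason for stopping or None)."""
    r, c = find_position(maze)
    dr, dc = DELTAS[direction]
    for step in range(cells):
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(maze) and 0 <= nc < len(maze[0])):
            return step, "it would leave the maze"
        if maze[nr][nc] == 0:
            return step, "it would go through a wall"
        maze[r][c], maze[nr][nc] = 1, "X"   # leave a path behind, carry the marker forward
        r, c = nr, nc
    return cells, None

# The optimal solution for maze 501: 3 cells down, 2 left, 1 down (6 steps in total).
for direction, cells in [("down", 3), ("left", 2), ("down", 1)]:
    print(direction, try_move(MAZE, direction, cells))   # every movement is viable

Running this moves the "X" from the entrance at the top to the exit at the bottom edge, which is exactly the 6-step solution described above.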
To evaluate the different models, I do the following:
- I evaluate the LLM with the 10 mazes of the smallest dimension: 5.
- If the LLM is not able to find the exit of any maze --> I stop the evaluation.
- If the LLM has been able to find the exit of at least one of the mazes --> I evaluate it with mazes of the next dimension.
That's why you'll see that not all models have been evaluated for all dimensions.
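In code terms, that ladder looks roughly like the following sketch; evaluate_maze is a hypothetical stand-in for actually running a model against a single maze, and the names are my own:

# Illustrative sketch of the evaluation ladder (not the benchmark's real code).
# evaluate_maze(model, maze) should return the score obtained on one maze,
# with 0 meaning the model never found the exit.

def evaluate_model(model, mazes_by_dimension, evaluate_maze):
    """mazes_by_dimension maps each odd dimension (5, 7, 9, ...) to its 10 mazes."""
    results = {}
    for dimension in sorted(mazes_by_dimension):
        scores = [evaluate_maze(model, maze) for maze in mazes_by_dimension[dimension]]
        results[dimension] = scores
        if all(score == 0 for score in scores):
            break   # the model solved none of the 10 mazes at this size: stop climbing
    return results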
How do I calculate the Score?
The Score measures the efficiency with which the LLM has been able to find the exit of the maze. Each maze has a minimum number of steps that must be taken to reach its exit; it is not possible to reach it in fewer steps than that minimum. In the case of maze 501, the minimum number of steps is 6.
That would be the perfect solution. If an LLM is able to reach the exit of maze 501 in 6 steps, then its score = 100. The formula to calculate the score is --> Score = (1 - (steps taken - minimum steps) / minimum steps) * 100.
Each LLM also has a maximum of 3 requests to reach the exit. If after those 3 requests it still hasn't reached it, its score is directly 0. I'll explain below what happens in each request...
In the table and chart on the Home page, the average score per dimension is shown. If you click on a specific model, you can see the score obtained for each maze.
In addition to all this, to reduce randomness, I give each model 3 attempts at each maze. That is, if a model obtains a score of 75 with maze 501, I let it try 2 more times to see if it can improve its score. Of the 3 attempts, I always keep the best one.
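Written as code, the scoring rule would look something like the sketch below (the names are mine; an attempt that never reached the exit within its 3 requests scores 0, as explained above):

def maze_score(steps_taken, minimum_steps, reached_exit):
    """Score for a single attempt, following the formula above."""
    if not reached_exit:   # exit not found within the 3 allowed requests
        return 0
    return (1 - (steps_taken - minimum_steps) / minimum_steps) * 100

def best_of_attempts(attempts):
    """Each maze is tried 3 times; only the best attempt counts."""
    return max(maze_score(*attempt) for attempt in attempts)

# Example with maze 501 (minimum = 6 steps): a perfect run, a longer run, and a failure.
print(best_of_attempts([(6, 6, True), (10, 6, True), (0, 6, False)]))   # -> 100.0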
What do I ask the LLM in each request? What prompt do I use?
Most importantly: I don't pass the LLM any image of the maze; I pass it the maze in the form of a matrix. I did this initially because not all models were multimodal, and I wanted to use this benchmark to compare as many models as possible. Now almost all of them are multimodal, so I could pass them an image of the maze at each step, but I prefer to keep the original format.
The prompt is as follows:
I need to navigate through a maze, give me directions to help me find the exit.
<Maze map>
[[0,0,0,X,0],[0,1,1,1,0],[0,1,0,1,0],[0,1,1,1,0],[0,1,0,0,0]]
</Maze map>
The maze is a matrix, where cells represented by "1" are paths and cells represented by "0" are walls. I am the "X" symbol.
Each element of the matrix is a row of the maze, ordered from top to bottom.
The maze has only one entrance and one exit. The entrance is located at position [3,0] using a coordinate system [x,y], where the cell located in the upper left corner is [0,0].
Help me move towards the exit, considering that:
- I can only move in one direction: horizontal or vertical, but not diagonally
- I can move multiple cells in a single movement
Return a JSON containing information about the movements I should make to reach the exit. The JSON must have the following format: {"movements": [{"direction": <direction>, "cells": <cells>}, {"direction": <direction>, "cells": <cells>}]}, where
- direction: up, down, right, or left
- cells: number of cells I should move in each movement
Return only the JSON.
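For example, a response to this prompt that solves maze 501 optimally would look like the JSON below, which (as an illustration in Python) can be read with a standard json.loads:

import json

# A response in the requested format, containing the optimal solution for maze 501.
raw = '{"movements": [{"direction": "down", "cells": 3}, {"direction": "left", "cells": 2}, {"direction": "down", "cells": 1}]}'

for move in json.loads(raw)["movements"]:
    print(move["direction"], move["cells"])   # down 3, left 2, down 1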
In a single request, the LLM gives me the entire list of movements that must be performed to get from the entrance to the exit. If any of the movements is not viable, because it would go through a wall or outside the maze, I reply to the LLM explaining why that movement is not viable. In that same message, I pass it an updated map of the maze, after applying all the movements that were valid.
As I mentioned before, if the LLM has generated 3 responses and still hasn't reached the exit of the maze, I stop the process and assign a score of 0. I do this to keep costs under control; otherwise, the budget would not allow me to evaluate the most expensive models.
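Putting the pieces together, one attempt could be sketched like this, reusing try_move, find_position and MAZE from the maze sketch above and maze_score from the scoring sketch. Here ask_llm is just a placeholder for the real API call, and I'm assuming movements are applied in order until the first invalid one:

import copy

MAX_REQUESTS = 3

def ask_llm(maze, feedback):
    """Placeholder for the actual model call. Here it just returns a canned answer."""
    return [{"direction": "down", "cells": 3},
            {"direction": "left", "cells": 2},
            {"direction": "down", "cells": 1}]

def run_attempt(maze, exit_cell, minimum_steps):
    steps_taken = 0
    feedback = None
    for _ in range(MAX_REQUESTS):
        for move in ask_llm(maze, feedback):
            moved, reason = try_move(maze, move["direction"], move["cells"])
            steps_taken += moved
            if reason is not None:
                # Reply with why the movement is not viable, plus the updated maze map.
                feedback = f"Movement {move} is not viable: {reason}."
                break
            if find_position(maze) == exit_cell:
                return maze_score(steps_taken, minimum_steps, True)
    return 0   # exit not reached within 3 requests -> score 0

print(run_attempt(copy.deepcopy(MAZE), exit_cell=(4, 1), minimum_steps=6))   # -> 100.0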
Contact
If you have any feedback/suggestions/comments, you can write to me by email or reach me on social media: