15-Puzzle with Reinforcement Learning (2024)

When I was a kid I had this puzzle called the 15-puzzle: a 4x4 grid with 15 movable tiles and one empty cell. The idea is to scramble the tiles by sliding them around the empty cell and then restore the original order.


You can play this game online at https://15puzzle.netlify.app/.

Recently I bought the very same puzzle and started playing with it again. I wanted to try solving it using Reinforcement Learning (RL), which was an interesting learning experience for me.

Here I want to show you how I did this using reinforcement learning in Java. I didn't use any special library to program this in Java. I read the famous book on the subject, "Reinforcement Learning: An Introduction" by Sutton and Barto, and started coding!

The very abstract and general idea behind RL is that we solve a problem by experiencing (playing) it and receiving rewards for the actions we take. The reward is like a signal which tells us whether we are doing well or badly. It might be given to us after each action we take, or it could be a lump-sum reward at the end of the game or activity (that is, if there is an end).

We don't necessarily want to maximize the reward with each individual action. That is a greedy strategy and might not yield a globally optimal solution; we might get stuck in a local maximum which is not really a solution at all. Whether we are playing a game, solving a physics problem or predicting the stock market with RL, we are always facing a current situation (or state) and a set of actions that we can take from that state. For example, here is a state in the 15-Puzzle:

-----------------
| 10| 13|   |  6|
| 14| 12|  4|  2|
|  7| 15|  9|  8|
|  5|  1| 11|  3|
-----------------

The actions we can take from this state are:

  • Move 4
  • Move 13, or
  • Move 6.

But which of these actions is the best, the one that leads us to the final state fastest? By the way, getting to the final state faster can itself be encoded as a reward, so we don't have to keep track of time or the number of steps as a separate concept. It can all be encoded as reward.

In RL, through trial and error, we look for a policy (π), which is a mapping from each state to the best action: the one that gives us the maximum reward and leads us to the final goal. Finding this policy is what RL is all about, and RL gives us a framework for finding it. There are many different methods for doing so, such as:

  • Dynamic Programming
  • Monte Carlo
  • Temporal Difference

And several more. Each of them has its pros and cons and is suited to a particular set of problems.

For the purposes of the 15-Puzzle, we are going to use the dynamic programming method, which is the easiest of them all. I'll try to briefly explain the method, but for a very good explanation see the book (page 73).

States

The idea behind dynamic programming is that there is a known set of states and a known set of actions we can take from those states. We give the various states numeric values and then find the optimal policy by finding, for each state, the action that maximizes the state value.

This method assumes that we have the full set of states. That is a bold assumption. Usually we either don't have all the states beforehand, because we actually need to play the game or do the activity to discover them, or there are so many states that they don't practically fit in memory. There are tricks, techniques and methods to overcome this, though.

How many states are there for the 15-Puzzle? Each tile can be in any of the 16 cells, hence the number of states is 16! = 20,922,789,888,000. That is a huge number! There is no way we can fit that many states in memory, so we need some tricks to do this with dynamic programming.

If you play this puzzle, you would naturally think it is probably a good idea to start by fixing the first row (1, 2, 3, 4), then the second row (5, 6, 7, 8), and then the third and fourth rows. While you are trying to put 1, 2, 3 and 4 into their places, you probably don't care about the other numbers. They all look the same to you, as if they don't exist and their numbers don't matter. That is the main idea we use here to reduce the total number of states.

For example, if the current state is:

-----------------
| 10| 13|   |  6|
| 14| 12|  4|  2|
|  7| 15|  9|  8|
|  5|  1| 11|  3|
-----------------

and we are trying to solve for the first row, we can set all the other numbers to 0:

-----------------
|  0|  0|   |  0|
|  0|  0|  4|  2|
|  0|  0|  0|  0|
|  0|  1|  0|  3|
-----------------
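
In code, this masking step might look like the following minimal sketch. I'm assuming the board is an int[16] in row-major order with 16 marking the empty cell; the real code in the repo differs in its details:

import java.util.Set;

public class MaskDemo {
    // Masks a board so that only the tiles we currently care about keep
    // their identity; every other tile becomes 0, collapsing many boards
    // into a single state.
    static int[] mask(int[] board, Set<Integer> keep) {
        int[] masked = new int[16];
        for (int i = 0; i < 16; i++) {
            int v = board[i];
            // The empty cell (16) and the tiles being solved keep their values.
            masked[i] = (v == 16 || keep.contains(v)) ? v : 0;
        }
        return masked;
    }

    public static void main(String[] args) {
        int[] board = {10, 13, 16, 6, 14, 12, 4, 2, 7, 15, 9, 8, 5, 1, 11, 3};
        // First-row phase: keep only 1, 2, 3 and 4.
        System.out.println(java.util.Arrays.toString(mask(board, Set.of(1, 2, 3, 4))));
        // Prints: [0, 0, 16, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 1, 0, 3]
    }
}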

Now if we play around and move the empty cell, the number of distinct states we can reach is exactly:

16! / 11! = 524,160

(the zeros are all identical and do not contribute to the permutations)

524,160 states is very reasonable and fits easily into memory. Once we solve the first row, we do the same thing with the other numbers:

-----------------
|  1|  2|  3|  4|
|  0|  0|  6|  8|
|  0|  0|   |  0|
|  7|  5|  0|  0|
-----------------

We could turn 1, 2, 3 and 4 into zeros as well, but there is no need, because we are not going to move them (they remain stationary while we solve the subsequent rows). How many states are there when we solve the second row?

12! / 7! = 95,040

We then solve for the third and fourth rows together, because combined they have so few states: 8! = 40,320.

Each state must ideally be represented by a unique hash value. To do that, we can simply concatenate the numbers in the cells row by row. For example, the hash for the following state:

S = -----------------
    |  1|  2|  3|  4|
    |  0|  0|  6|  8|
    |  0|  0|   |  0|
    |  7|  5|  0|  0|
    -----------------

Hash(S) = 1,2,3,4,0,0,6,8,0,0,16,0,7,5,0,0

The integer value 16 stands for the empty cell. This hash representation is good enough, but it is not super efficient. However, we won't worry about that, since we don't have that many states. We could probably do better by converting each cell value into a 4-bit code and packing everything into a single 64-bit number, but I'll leave that to the interested reader.
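
For the curious, here is one way that packing could work. This is just a sketch of the idea, not code from the repo. The wrinkle is that the empty cell's value, 16, does not fit in 4 bits, but since exactly one cell is empty we can store its position instead:

// Packs a board into a single 64-bit key. We store the empty cell's
// position in 4 bits and the values of the other 15 cells (all 0..15
// once the empty cell's 16 is factored out) in 4 bits each:
// 15 * 4 + 4 = 64 bits exactly.
static long pack(int[] board) {
    long key = 0;
    int emptyPos = 0;
    for (int i = 0; i < 16; i++) {
        if (board[i] == 16) {
            emptyPos = i;                // remember where the empty cell is
        } else {
            key = (key << 4) | board[i]; // append one 4-bit cell value
        }
    }
    return (key << 4) | emptyPos;        // low 4 bits: empty cell position
}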

So, we have all the states and we have the hashing figured out. Next we will do the actual dynamic programming to solve the puzzle. But before we do, what are the drawbacks of the state reduction trick above? One issue I can think of is that if there is a grand solution which solves the whole puzzle in far fewer steps, we will never find it.

The following is a sketch of the state class which represents and hashes a state in the 15-Puzzle (the full version is in the repo). Important things to notice in the code:

  • Hash generation
  • Masking a state
  • Terminal states for each phase of solve
  • Generating possible actions
  • Generating a next state from current state
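
Masking was sketched earlier; the rest could look roughly like this (class and method names here are simplified, not necessarily the repo's):

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FifteenState {
    public static final int EMPTY = 16;
    private final int[] cells; // row-major, 16 cells, EMPTY marks the hole

    public FifteenState(int[] cells) {
        this.cells = cells.clone();
    }

    // Hash generation: concatenate the cell values row by row.
    public String hash() {
        return IntStream.of(cells).mapToObj(String::valueOf)
                .collect(Collectors.joining(","));
    }

    // Terminal check for a phase: the cells match that phase's goal board.
    public boolean isTerminal(int[] goal) {
        return java.util.Arrays.equals(cells, goal);
    }

    // Possible actions: indices of the tiles orthogonally adjacent to the
    // empty cell; each such tile can slide into the hole.
    public List<Integer> possibleActions() {
        int e = emptyIndex();
        List<Integer> actions = new ArrayList<>();
        if (e / 4 > 0) actions.add(e - 4); // tile above
        if (e / 4 < 3) actions.add(e + 4); // tile below
        if (e % 4 > 0) actions.add(e - 1); // tile to the left
        if (e % 4 < 3) actions.add(e + 1); // tile to the right
        return actions;
    }

    // Next state: slide the tile at tileIndex into the empty cell.
    public FifteenState next(int tileIndex) {
        int[] moved = cells.clone();
        moved[emptyIndex()] = moved[tileIndex];
        moved[tileIndex] = EMPTY;
        return new FifteenState(moved);
    }

    private int emptyIndex() {
        for (int i = 0; i < 16; i++) if (cells[i] == EMPTY) return i;
        throw new IllegalStateException("no empty cell");
    }
}

For the example state shown at the beginning, possibleActions() returns the indices of tiles 13, 4 and 6, matching the moves we listed there.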

We also need a piece of code which, starting from the various terminal states we have, generates all the possible states we need for dynamic programming.
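
A simple breadth-first search from each terminal state does the job. Here is a sketch reusing the FifteenState above (again, the repo's code differs in its details):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Enumerates every state reachable from a terminal state. Since every move
// is reversible, this finds exactly the states that can reach the terminal.
static Map<String, FifteenState> generateStates(FifteenState terminal) {
    Map<String, FifteenState> seen = new HashMap<>();
    Deque<FifteenState> frontier = new ArrayDeque<>();
    seen.put(terminal.hash(), terminal);
    frontier.add(terminal);
    while (!frontier.isEmpty()) {
        FifteenState s = frontier.poll();
        for (int a : s.possibleActions()) {
            FifteenState next = s.next(a);
            if (seen.putIfAbsent(next.hash(), next) == null) {
                frontier.add(next);
            }
        }
    }
    return seen; // 524,160 entries for the first-row phase
}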

Value Iteration

Value iteration is a dynamic programming algorithm in RL which estimates the value of each state. It converges once there are no more significant changes to the value of any state. Once the optimal values are calculated, the algorithm finds the optimal policy based on them. That part is easy: for each state, we only have to find the action which maximizes the term [r + γ * V(s')].

for all terminal states t:
    V(t) = 0

loop:
    Δ = 0
    for each non-terminal state s:
        v = V(s)
        V(s) = max over each action a, with reward r and next state s', of [r + γ * V(s')]
        Δ = max(Δ, |v - V(s)|)
until Δ < θ

# Then we attempt to find the optimal policy:
for each non-terminal state s:
    π(s) = the action a, with reward r and next state s', which maximizes [r + γ * V(s')]

The Java code for this algorithm can be found here.
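
To give a rough idea of its shape, here is a condensed sketch of the core loop, reusing the hypothetical FifteenState from above and assuming a flat reward of -1 per move so that shorter solutions are worth more (the repo's reward scheme and structure differ):

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Value iteration over the generated states. `states` must be closed under
// moves (as the BFS generation guarantees); gamma is the discount factor (γ)
// and theta the convergence threshold (θ).
static Map<String, Double> valueIteration(Collection<FifteenState> states,
                                          Set<String> terminalHashes,
                                          double gamma, double theta) {
    Map<String, Double> V = new HashMap<>();
    for (FifteenState s : states) V.put(s.hash(), 0.0);
    double delta;
    do {
        delta = 0.0;
        for (FifteenState s : states) {
            if (terminalHashes.contains(s.hash())) continue; // V(t) stays 0
            double v = V.get(s.hash());
            double best = Double.NEGATIVE_INFINITY;
            for (int a : s.possibleActions()) {
                FifteenState next = s.next(a);
                best = Math.max(best, -1.0 + gamma * V.get(next.hash()));
            }
            V.put(s.hash(), best);
            delta = Math.max(delta, Math.abs(v - best));
        }
    } while (delta >= theta);
    return V;
}

The optimal policy then simply picks, in each state, the action whose successor maximizes [r + γ * V(s')].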

In order to run the puzzle, check out the repository and run the following commands:

# In the repo's root
mvn clean compile package
java -cp target/Sutton-Barto-RLBook-1.0-jar-with-dependencies.jar \
com.github.amshali.rl.fifteen.FifteenPuzzle -th 0.000001 \
-g 0.95 -tr 2000 -p

The above command runs the algorithm to find the optimal policy and then starts solving a random puzzle.

The average number of steps to complete the puzzle is around 70, which is not bad if you have ever tried the puzzle yourself. I realized that this RL algorithm actually finds some interesting ways to solve the puzzle. For example, in order to solve the first row, it kind of stacks the tiles 1, 2, 3, 4 in a way that makes it super easy to shift them into their true places with no wasted moves. It also solves the last two rows very efficiently.

Watch a sample run of this algorithm solving a random puzzle here:
