May 04, 2018
Previously, I wrote about teaching a computer to play board games and learning strategies from it. The example game, Machi Koro, is relatively simple: the computer learned to play it very quickly, and I was able to extract some easy-to-understand strategies from it.
Recently, I wrote a similar but more complex AI that learned to play the game Splendor. The rules themselves are not complicated, but the strategy and planning involved are.
The starting layout for 4 players looks like this:
(the layout is similar for 2 or 3 players, but some of the square cards at the top/gems in the piles on the right are removed)
The basic rules are as follows:
For a detailed run-through, see Board Game Geek's tutorial.
Alternatively, you can check out a more thorough explanation at BoardGameHalv.com.
When humans play the game, the general strategy is to purchase less expensive cards first so that you can use the discounts they grant as much as possible. Acquiring the nobles at the top by purchasing the right combination of cards is another common goal that shapes players' decisions.
The difficulty of making these decisions involves planning ahead:
When simulating the game, I gave each of the four players a different, randomly initialized neural network to use for decision-making. For a simple explanation of neural networks, see my article, A Simple Explanation of how Computers Recognize Images. The input format is shared across all players, and includes each player's reserved cards, number of gems, number of cards owned, points, and turn order. I also include a representation of the game state: all of its cards, all of its nobles (objectives), and the turn number. The implementation of the neural network creation can be found in this source code. Alternatively, I have a visualization of the network below:
Each of the boxes represents numbers that are input to the next layer. The number of inputs is indicated in parentheses. Note that some of the inputs, such as cards on the board and a player's reserved cards, have a blue box below them indicating how many copies of those layers are used. For example, there are 12 cards on the board, so there are 12 separate game-card inputs, each represented by 15 input values (5 colors for cost, 5 colors for the card color, 3 for tier, 1 for point value, 1 for a blank indicator).
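As a concrete sketch of that 15-value card encoding, a single card could be flattened like this (the field names and dict layout here are my own illustrative choices, not necessarily those in the actual source code):

```python
# Sketch of encoding one Splendor card as 15 neural-network inputs:
# 5 values for gem costs, 5 for a one-hot card color, 3 for a one-hot
# tier, 1 for point value, and 1 flag marking an empty/blank card slot.
# Field names are illustrative, not taken from the real implementation.
GEM_COLORS = ["white", "blue", "green", "red", "black"]

def encode_card(card):
    """Flatten a card dict into a 15-element list of floats."""
    if card is None:  # empty slot on the board -> only the blank flag is set
        return [0.0] * 14 + [1.0]
    cost = [float(card["cost"].get(color, 0)) for color in GEM_COLORS]
    color = [1.0 if card["color"] == c else 0.0 for c in GEM_COLORS]
    tier = [1.0 if card["tier"] == t else 0.0 for t in (1, 2, 3)]
    return cost + color + tier + [float(card["points"])] + [0.0]

example = {"cost": {"green": 2, "blue": 1}, "color": "red",
           "tier": 1, "points": 0}
vector = encode_card(example)  # 15 values, as in the diagram above
```

The 12 board cards would then contribute 12 such vectors to the input layer.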
Additionally, I connect each player's input directly to the main dense layer, since a lot of information could be lost when condensing it in the "funnel layers" that I use.
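A minimal sketch of that idea, assuming plain fully connected layers (the layer sizes here are made up for illustration): each player's inputs pass through a narrow "funnel" layer, but a copy of the raw inputs is also concatenated into the main dense layer so that details lost in the funnel remain available.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out):
    """One fully connected layer with ReLU (weights are random here,
    just to illustrate shapes; in training they would be learned)."""
    w = rng.normal(size=(x.shape[0], n_out))
    return np.maximum(0.0, x @ w)

player_inputs = rng.normal(size=30)        # e.g. 30 values per player
funneled = dense(player_inputs, 8)         # condensed "funnel" output
# Skip connection: concatenate raw inputs with the funneled summary
main_input = np.concatenate([player_inputs, funneled])
main_layer = dense(main_input, 64)         # the main dense layer
```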
The motivation behind using this type of network is that I can reasonably constrain the values the network learns so that they better represent the variables used to predict chances of winning or the various score outputs I describe below.
When training the AI, it needs some sort of output to optimize for, as well as a way to act on its predictions of that output. I had it predict 4 values:
These values are added together in various proportions to yield a target "score". When initially training the network, I gave a much higher weight to the short-term predictions than to the long-term ones, and I slowly shifted the weighting to emphasize long-term predictions more as training progressed. Additionally, decisions become less random over time, so that decisions predicted to have better results are chosen even more frequently than before. Here is a simple flowchart to describe the process:
Essentially, I simulate a few hundred games, then train the AI. After a few rounds of this, I also reset the data and update the decision weights and temperature. The temperature is a parameter inspired by temperature in physics that controls the randomness of the AI's decisions. At high temperatures, all possible decisions are almost equally likely, and at low temperatures, the best decision is almost certainly the only one that will be made.
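The loop above can be sketched as follows; the function names, stubs, and the particular decay schedule are hypothetical placeholders, not the actual implementation:

```python
# Hypothetical sketch of the training schedule described above; the
# simulate_games and train_network stubs stand in for the real code.
def simulate_games(n, temperature):
    return [("game_record", temperature)] * n   # placeholder game data

def train_network(data, long_term_weight):
    pass                                        # placeholder training step

def training_run(n_phases=4, rounds_per_phase=3, games_per_round=300):
    temperature = 2.0        # high temperature -> near-random decisions
    long_term_weight = 0.1   # short-term predictions dominate early on
    for _ in range(n_phases):
        data = []            # reset collected data each phase
        for _ in range(rounds_per_phase):
            data.extend(simulate_games(games_per_round, temperature))
            train_network(data, long_term_weight)
        long_term_weight = min(1.0, long_term_weight * 2)  # shift to long-term
        temperature *= 0.5   # decisions become less random over time
    return temperature, long_term_weight
```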
When determining the probability of each possible action, I use the Boltzmann distribution, with the "energy" being the negative score of each predicted outcome.
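Concretely, with predicted scores s_i for the candidate actions and temperature T, the probability of choosing action i is proportional to exp(s_i / T), i.e. exp(-E_i / T) with energy E_i = -s_i. A minimal sketch (the specific scores are made up for illustration):

```python
import math
import random

def boltzmann_probabilities(scores, temperature):
    """Softmax over predicted scores; the 'energy' is the negative score."""
    # Subtract the max score for numerical stability before exponentiating
    m = max(scores)
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def choose_action(actions, scores, temperature):
    probs = boltzmann_probabilities(scores, temperature)
    return random.choices(actions, weights=probs, k=1)[0]

scores = [0.2, 0.5, 0.9]
hot = boltzmann_probabilities(scores, 10.0)   # near-uniform choices
cold = boltzmann_probabilities(scores, 0.01)  # best action dominates
```

At high temperature the three probabilities are nearly equal; at low temperature the highest-scoring action is chosen almost exclusively, matching the behavior described above.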
When running the first trials, my goal was to determine effectiveness using two metrics:
When the AI plays against other AIs with the same amount of training, the number of training rounds doesn't affect the expected game length much. This is likely because the strategy shifts from earning a lot of points yourself to preventing your opponents from doing the same.
For the second test, I chose to pit AI with 2, 6, and 12 rounds of training against each other. Each inter-round pair has 200 samples (50 for each player).
The 2-round AI was clearly inferior to the others, but it is difficult to say what effect 12 rounds had versus 6. The 30.5% win rate of 12 rounds vs 6 is not consistent with the 24.5% rate that a single 6-round AI has versus 3 12-round AIs. It is possible that the 12-round AIs all outsmarted each other, eliminating their advantages.
Using the R code `1-pbinom(63, 200, 0.25)` for the p-value calculation yields a probability of 0.0154, which indicates that the high win rate is highly unlikely to be purely by chance if the odds of winning were equal among all players. That p-value grows quickly, however, when one tests win probabilities above 26%.
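As a cross-check, the same exact binomial tail can be computed without R; this reproduces `1 - pbinom(63, 200, 0.25)`, i.e. the probability of more than 63 wins out of 200 games when each game is won with probability 0.25:

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X > k) for X ~ Binomial(n, p); equals 1 - pbinom(k, n, p) in R."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1, n + 1))

p_value = binom_upper_tail(63, 200, 0.25)  # approximately 0.0154
```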
I will need to make adjustments to the learning process to improve the AI in the future, and more extensive tests will be performed then. Ideally I can get a narrower confidence interval to determine exact performance rates.
I also looked at how the winning players differ across amounts of training. It turns out the main difference is that the 6- and 12-round AIs acquire a second or third objective much more frequently, with an expected 1.4 objectives per game rather than 1.07. This also translates to one more cheap card (tier 1) and one-fifth fewer expensive cards (tier 3) per game.
I will be testing out longer sets of training for future runs, as well as slightly larger networks. I will also try fine-tuning networks with very low temperatures at the end of different phases of training so that the randomness of decision-making can be limited. I will also add a command-line interface for human players to interact with the game to get an idea of what it's like to play against the AI.
The games take about 6-10 seconds to run, so I may need to acquire more RAM in order to run games in parallel.
GitHub Repository: https://github.com/mcandocia/splendor_ai
For running the code, I used IPython notebooks for their relative ease of use for testing and running code.