Reinforcement Learning of Kniffel/Yahtzee
I set myself the challenge of developing a deep reinforcement learning algorithm that solves the game Kniffel/Yahtzee. I coded the game in Python and wrapped it in a custom OpenAI Gym environment.
Everything is running, but unfortunately I am not really making progress. I have tested many different hyperparameter settings:
- different learning rates
- different hidden layers
- different units per hidden layer
- and so on and so on
The agent manages to play only a few rounds without failing. So I am taking a step back and asking whether my gym environment is perhaps set up wrong and I have misunderstood something.
Right now my gym is configured as follows:
Reward Gain:
1. This is the reward system when the agent wants to finish its turn and select a category:
| Category | Formula | Example 1 | Example 2 |
|---|---|---|---|
| ONES** | = rewards_calculator(#ONES) | [1,2,1,3,1] → 3 reward | [1,2,5,4,6] → -0.2 reward |
| TWOS** | = rewards_calculator(#TWOS) | [1,2,1,3,1] → -0.2 reward | [1,2,5,2,6] → -0.1 reward |
| THREES** | = rewards_calculator(#THREES) | [3,3,3,3,3] → 5 reward | [1,2,5,4,6] → -0.5 reward |
| FOURS** | = rewards_calculator(#FOURS) | [1,4,4,3,1] → -0.1 reward | [1,2,5,4,6] → -0.2 reward |
| FIVES** | = rewards_calculator(#FIVES) | [1,2,5,5,1] → -0.1 reward | [1,5,5,5,5] → 4 reward |
| SIXES** | = rewards_calculator(#SIXES) | [1,2,6,6,6] → 3 reward | [1,2,5,4,6] → -0.2 reward |
| THREE TIMES THE SAME | = SUM / 5 | [1,2,6,6,6] → 4.2 reward | [1,2,2,2,6] → 2.6 reward |
| FOUR TIMES THE SAME | = SUM / 5 | [1,6,6,6,6] → 5 reward | [1,1,1,1,6] → 2 reward |
| FULL HOUSE | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| SMALL STREET | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| LARGE STREET | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| KNIFFEL | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| CHANCE | = SUM / 5 | [1,6,6,6,6] → 5 reward | [1,1,1,1,6] → 2 reward |
# = number of dice showing the corresponding face value
** To get the bonus, the agent needs at least three dice of each face value. Everything below three is therefore penalised and everything above is rewarded with extra points.
    def rewards_calculator(self, dice_count) -> float:
        """Calculate the reward based on the number of matching dice when finishing the round.

        :param dice_count: number of matching dice
        :return: reward
        """
        if dice_count == 0:
            return -0.5
        elif dice_count == 1:
            return -0.2
        elif dice_count == 2:
            return -0.1
        elif dice_count == 3:
            return 3
        elif dice_count == 4:
            return 4
        elif dice_count == 5:
            return 5
        elif dice_count == 6:
            return 6
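Putting the table together: the counting categories (ONES to SIXES) pass the number of matching dice through rewards_calculator, the sum-based categories divide the dice sum by 5, and the remaining categories return a fixed 3. A hypothetical dispatcher that could sit next to rewards_calculator above; the category names and the dice-as-list representation are my assumptions, not taken from the repo:

    def category_reward(self, dice, category) -> float:
        """Hypothetical mapping from the reward table above to a single value.

        :param dice: list of five dice values, e.g. [1, 2, 1, 3, 1]
        :param category: category name, e.g. "ONES"
        """
        counting = {"ONES": 1, "TWOS": 2, "THREES": 3,
                    "FOURS": 4, "FIVES": 5, "SIXES": 6}
        if category in counting:
            # e.g. ONES with [1, 2, 1, 3, 1] -> three matching dice -> reward 3
            return self.rewards_calculator(dice.count(counting[category]))
        if category in ("THREE TIMES THE SAME", "FOUR TIMES THE SAME", "CHANCE"):
            # e.g. [1, 2, 6, 6, 6] -> 21 / 5 = 4.2
            return sum(dice) / 5
        # FULL HOUSE, SMALL STREET, LARGE STREET, KNIFFEL: fixed constant
        return 3.0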
2. Reward for a normal reroll of the dice
reward = + 1.5
3. Bonus
If the agent manages to receive the bonus after selecting a category, it receives an extra 2 points of reward.
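Since three dice of each face value in the upper section add up to exactly 3 * (1+2+3+4+5+6) = 63 points (the usual Kniffel/Yahtzee bonus threshold), the check could look roughly like this; the attribute names are invented for illustration:

    # hypothetical bonus check after a category has been selected
    if self.upper_section_score >= 63 and not self.bonus_rewarded:
        reward += 2.0              # extra reward for securing the bonus
        self.bonus_rewarded = True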
Reward Penalty:
1. Game over
If the agent does something that is not allowed, it is penalised with -10 points.
For example, it rolls [1,2,3,4,5] and tries to select the category KNIFFEL: the game ends and the agent is penalised with -10 points.
2. Normal round
For each round the agent plays, 0.1 points are subtracted.
Example
Round 1: Agent rolls [1,2,3,5,6] > agent decides to reroll dice 2 and 3
Reward: 1.5 (reroll reward) - 0.1 (round penalty) = 1.4
Round 2: Agent rolls [1,5,5,5,6] > agent decides to choose THREE TIMES THE SAME
Reward: 1.4 (previous total) + ((1+5+5+5+6) / 5) (THREE TIMES THE SAME reward) - 0.1 (round penalty) = 5.7
Round 3: Agent rolls [4,5,6,1,5] > agent decides to log in KNIFFEL
Reward: 5.7 (previous total) - 10 (game over penalty) - 0.1 (round penalty) = -4.4 --> GAME OVER
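Note that the "Reward" values in the example are cumulative (the return so far). Per step, the scheme described above could be expressed like this; the function and argument names are made up for illustration:

    def step_reward(action_is_reroll, category_reward=0.0,
                    illegal_action=False, got_bonus=False):
        """Hypothetical per-step reward mirroring the rules above."""
        reward = -0.1                      # round penalty, applied every step
        if illegal_action:
            return reward - 10.0           # game over penalty
        if action_is_reroll:
            reward += 1.5                  # reroll reward
        else:
            reward += category_reward      # reward of the selected category
            if got_bonus:
                reward += 2.0              # bonus reward
        return reward

    # Round 1: step_reward(True)                       -> 1.4
    # Round 2: step_reward(False, category_reward=4.4) -> 4.3   (1.4 + 4.3 = 5.7)
    # Round 3: step_reward(False, illegal_action=True) -> -10.1 (5.7 - 10.1 = -4.4)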
Gym Env
Observation Input
The gym uses an array like the following as the observation:
    [[ 2 1 2 5 4 1 1 1 1 1 0 0 0 0 0 1]
     [ 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 2]
     [ 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 3]
     [ 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4]
     [ 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 5]
     [ 6 6 6 6 6 6 6 6 6 6 0 0 0 0 0 7]
     [ 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
     [ 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 8]
     [ 6 6 6 5 5 0 0 0 0 0 0 0 0 0 0 9]
     [ 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 10]
     [ 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 11]
     [ 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 12]
     [ 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 13]]
Each row is one round of the game. Each number except the last is a die value; if that die has not been rolled, the number is 0. Once the round is finished, the last number identifies the selected category; 13 is, for example, CHANCE.
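For reference, such an observation can be assembled with NumPy roughly like this (a sketch; I am assuming each row holds up to three rolls of five dice plus the category id, and the helper name is invented):

    import numpy as np

    N_ROUNDS, ROW_LEN = 13, 16   # 13 rounds; 15 dice slots + 1 category id per row

    def build_observation(rounds):
        """rounds: one (rolled_dice, category_id) tuple per round played so far;
        rolled_dice is a flat list of up to 15 values (3 rolls x 5 dice)."""
        obs = np.zeros((N_ROUNDS, ROW_LEN), dtype=np.int64)
        for i, (rolled_dice, category_id) in enumerate(rounds):
            obs[i, : len(rolled_dice)] = rolled_dice   # dice not yet rolled stay 0
            obs[i, -1] = category_id                   # e.g. 13 = CHANCE
        return obs

    # build_observation([([2, 1, 2, 5, 4, 1, 1, 1, 1, 1], 1)]).shape == (13, 16)
    # Flattened, this yields the 208 inputs the model below expects.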
Action Space
An action is a discrete number between 1 and 44 that identifies what the agent does, e.g. 1 = select the KNIFFEL category or 33 = reroll dice 1, 4 and 5.
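In Gym terms that corresponds to a Discrete action space and a Box observation space; a minimal skeleton could look like this (assuming the classic gym API; note that spaces.Discrete(44) numbers the actions 0-43, and the method bodies are placeholders rather than the repo's code):

    import gym
    import numpy as np
    from gym import spaces

    class KniffelEnv(gym.Env):
        """Illustrative skeleton of the environment described above."""

        def __init__(self):
            super().__init__()
            # 44 discrete actions: category selections + reroll combinations
            self.action_space = spaces.Discrete(44)
            # 13 rows x 16 columns, values between 0 and 13
            self.observation_space = spaces.Box(low=0, high=13,
                                                shape=(13, 16), dtype=np.int64)
            self.state = np.zeros((13, 16), dtype=np.int64)

        def reset(self):
            self.state = np.zeros((13, 16), dtype=np.int64)
            return self.state

        def step(self, action):
            # ... apply the action, update self.state, compute the reward ...
            reward, done, info = 0.0, False, {}
            return self.state, reward, done, info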
My approach
I used a DQN with a learning rate in the range 0.0001-0.001.
My model looked like this:
    Layer (type)                 Output Shape              Param #
    =================================================================
    flatten (Flatten)            (None, 208)               0
    dense (Dense)                (None, 128)               26752
    dense_1 (Dense)              (None, 64)                8256
    dense_2 (Dense)              (None, 44)                2860
    =================================================================
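For completeness, a network producing exactly that summary can be built like this (a sketch; the hidden activations are my assumption, since the summary does not show them):

    from tensorflow.keras.layers import Dense, Flatten
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Flatten(input_shape=(13, 16)),    # 13 * 16 = 208 flattened inputs
        Dense(128, activation="relu"),    # 208 * 128 + 128 = 26752 params
        Dense(64, activation="relu"),     # 128 * 64 + 64 = 8256 params
        Dense(44, activation="linear"),   # one Q-value per action: 64 * 44 + 44 = 2860 params
    ])
    model.summary()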
Question
What can I improve? Is it smart to always send the whole observation to the agent, even in the first round when all the other rows are still 0? Should I instead work with the window size and send only the current round, but increase the window size to 13 (the maximum number of rounds), as in the sketch below?
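To make the window-size idea concrete: if the agent is built with keras-rl (an assumption on my part), the per-step observation would shrink to the 16 values of the current round and SequentialMemory would stack the last 13 of them. The parameter values below are illustrative, not tuned:

    from rl.agents.dqn import DQNAgent
    from rl.memory import SequentialMemory
    from rl.policy import EpsGreedyQPolicy
    from tensorflow.keras.optimizers import Adam

    # per-step observation: one round = 16 values; the memory stacks the last 13,
    # so the model above with input shape (13, 16) still fits
    memory = SequentialMemory(limit=50000, window_length=13)
    dqn = DQNAgent(model=model,
                   nb_actions=44,
                   memory=memory,
                   policy=EpsGreedyQPolicy(eps=0.1),
                   nb_steps_warmup=1000,
                   target_model_update=1e-2)
    dqn.compile(Adam(learning_rate=1e-4), metrics=["mae"])
    # dqn.fit(env, nb_steps=100000, verbose=1)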
I hope everything is clear and somebody has an idea.
The whole project is also on https://github.com/elianderlohr/Kniffel/tree/feature/ai under the feature/ai branch.
But please be aware: my coding style is not that advanced, and I'm more of a beginner in Python and deep neural networks.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow