Reinforcement Learning of Kniffel/Yahtzee
I set myself the challenge of developing a deep reinforcement learning algorithm that solves the game Kniffel/Yahtzee. I coded the game in Python and wrapped it in a custom OpenAI Gym environment.
Everything is running, but unfortunately I am not really making progress. I have tested many different hyperparameter settings:
- different learning rates
- different hidden layers
- different units per hidden layer
- and so on and so on
The agent manages to play only a few rounds without failing. So I am taking a step back and asking whether my gym environment is perhaps set up wrong and I have misunderstood something.
Right now my gym is configured as follows:
Reward Gain:
1. This is the reward system when the agent wants to finish its turn and select a category:
| Category | Formula | Example 1 | Example 2 |
|---|---|---|---|
| ONES** | = rewards_calculator(#ONES) | [1,2,1,3,1] → 3 reward | [1,2,5,4,6] → -0.2 reward |
| TWOS** | = rewards_calculator(#TWOS) | [1,2,1,3,1] → -0.2 reward | [1,2,5,2,6] → -0.1 reward |
| THREES** | = rewards_calculator(#THREES) | [3,3,3,3,3] → 5 reward | [1,2,5,4,6] → -0.5 reward |
| FOURS** | = rewards_calculator(#FOURS) | [1,4,4,3,1] → -0.1 reward | [1,2,5,4,6] → -0.2 reward |
| FIVES** | = rewards_calculator(#FIVES) | [1,2,5,5,1] → -0.1 reward | [1,5,5,5,5] → 4 reward |
| SIXES** | = rewards_calculator(#SIXES) | [1,2,6,6,6] → 3 reward | [1,2,5,4,6] → -0.2 reward |
| THREE TIMES THE SAME | = SUM / 5 | [1,2,6,6,6] → 4.2 reward | [1,2,2,2,6] → 2.6 reward |
| FOUR TIMES THE SAME | = SUM / 5 | [1,6,6,6,6] → 5 reward | [1,1,1,1,6] → 2 reward |
| FULL HOUSE | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| SMALL STREET | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| LARGE STREET | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| KNIFFEL | = 3 (fixed constant) | [1,4,4,3,1] → 3 reward | [1,2,5,4,6] → 3 reward |
| CHANCE | = SUM / 5 | [1,6,6,6,6] → 5 reward | [1,1,1,1,6] → 2 reward |
# = number of dice showing the corresponding face value
** To get the bonus, the agent needs at least three dice of each face value. Everything below three is therefore penalised and everything above is rewarded with extra points.
    def rewards_calculator(self, dice_count) -> float:
        """Calculate the reward based on the number of matching dice when finishing the round.

        :param dice_count: number of matching dice
        :return: reward
        """
        if dice_count == 0:
            return -0.5
        elif dice_count == 1:
            return -0.2
        elif dice_count == 2:
            return -0.1
        elif dice_count == 3:
            return 3
        elif dice_count == 4:
            return 4
        elif dice_count == 5:
            return 5
        elif dice_count == 6:
            return 6
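Putting the table together: the counting categories (ONES to SIXES) pass the number of matching dice through rewards_calculator, the sum-based categories divide the dice sum by 5, and the remaining categories return a fixed 3. A hypothetical dispatcher that could sit next to rewards_calculator above; the category names and the dice-as-list representation are my assumptions, not taken from the repo:

    def category_reward(self, dice, category) -> float:
        """Hypothetical mapping from the reward table above to a single value.

        :param dice: list of five dice values, e.g. [1, 2, 1, 3, 1]
        :param category: category name, e.g. "ONES"
        """
        counting = {"ONES": 1, "TWOS": 2, "THREES": 3,
                    "FOURS": 4, "FIVES": 5, "SIXES": 6}
        if category in counting:
            # e.g. ONES with [1, 2, 1, 3, 1] -> three matching dice -> reward 3
            return self.rewards_calculator(dice.count(counting[category]))
        if category in ("THREE TIMES THE SAME", "FOUR TIMES THE SAME", "CHANCE"):
            # e.g. [1, 2, 6, 6, 6] -> 21 / 5 = 4.2
            return sum(dice) / 5
        # FULL HOUSE, SMALL STREET, LARGE STREET, KNIFFEL: fixed constant
        return 3.0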
2. Reward for a normal reroll of the dice
reward = + 1.5
3. Bonus
If the agent manages to receive the bonus after selecting a category, it receives an extra 2 points of reward.
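Since three dice of each face value in the upper section add up to exactly 3 * (1+2+3+4+5+6) = 63 points (the usual Kniffel/Yahtzee bonus threshold), the check could look roughly like this; the attribute names are invented for illustration:

    # hypothetical bonus check after a category has been selected
    if self.upper_section_score >= 63 and not self.bonus_rewarded:
        reward += 2.0              # extra reward for securing the bonus
        self.bonus_rewarded = True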
Reward Penalty:
1. Game over
If the agent does something that is not allowed, it is penalised with -10 points.
For example, it rolls [1,2,3,4,5] and tries to select the category KNIFFEL: the game ends and the agent is penalised with -10 points.
2. Normal round
For each round the agent plays, 0.1 points are subtracted.
Example
Round 1: Agent rolls [1,2,3,5,6] > agent decides to reroll dice 2 and 3
Reward: 1.5 (reroll reward) - 0.1 (round penalty) = 1.4
Round 2: Agent rolls [1,5,5,5,6] > agent decides to choose THREE TIMES THE SAME
Reward: 1.4 (previous total) + ((1+5+5+5+6) / 5) (THREE TIMES THE SAME reward) - 0.1 (round penalty) = 5.7
Round 3: Agent rolls [4,5,6,1,5] > agent decides to log in KNIFFEL
Reward: 5.7 (previous total) - 10 (game over penalty) - 0.1 (round penalty) = -4.4 --> GAME OVER
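Note that the "Reward" values in the example are cumulative (the return so far). Per step, the scheme described above could be expressed like this; the function and argument names are made up for illustration:

    def step_reward(action_is_reroll, category_reward=0.0,
                    illegal_action=False, got_bonus=False):
        """Hypothetical per-step reward mirroring the rules above."""
        reward = -0.1                      # round penalty, applied every step
        if illegal_action:
            return reward - 10.0           # game over penalty
        if action_is_reroll:
            reward += 1.5                  # reroll reward
        else:
            reward += category_reward      # reward of the selected category
            if got_bonus:
                reward += 2.0              # bonus reward
        return reward

    # Round 1: step_reward(True)                       -> 1.4
    # Round 2: step_reward(False, category_reward=4.4) -> 4.3   (1.4 + 4.3 = 5.7)
    # Round 3: step_reward(False, illegal_action=True) -> -10.1 (5.7 - 10.1 = -4.4)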
Gym Env
Observation Input
The gym uses an array like the following as the observation:
    [[ 2 1 2 5 4 1 1 1 1 1 0 0 0 0 0 1]
     [ 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 2]
     [ 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 3]
     [ 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4]
     [ 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 5]
     [ 6 6 6 6 6 6 6 6 6 6 0 0 0 0 0 7]
     [ 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
     [ 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 8]
     [ 6 6 6 5 5 0 0 0 0 0 0 0 0 0 0 9]
     [ 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 10]
     [ 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 11]
     [ 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 12]
     [ 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 13]]
Each row is one round of the game. Each number except the last is a die value; if that die has not been rolled, the number is 0. Once the round is finished, the last number identifies the selected category; 13 is, for example, CHANCE.
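For reference, such an observation can be assembled with NumPy roughly like this (a sketch; I am assuming each row holds up to three rolls of five dice plus the category id, and the helper name is invented):

    import numpy as np

    N_ROUNDS, ROW_LEN = 13, 16   # 13 rounds; 15 dice slots + 1 category id per row

    def build_observation(rounds):
        """rounds: one (rolled_dice, category_id) tuple per round played so far;
        rolled_dice is a flat list of up to 15 values (3 rolls x 5 dice)."""
        obs = np.zeros((N_ROUNDS, ROW_LEN), dtype=np.int64)
        for i, (rolled_dice, category_id) in enumerate(rounds):
            obs[i, : len(rolled_dice)] = rolled_dice   # dice not yet rolled stay 0
            obs[i, -1] = category_id                   # e.g. 13 = CHANCE
        return obs

    # build_observation([([2, 1, 2, 5, 4, 1, 1, 1, 1, 1], 1)]).shape == (13, 16)
    # Flattened, this yields the 208 inputs the model below expects.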
Action Space
An action is a discrete number between 1 and 44 that identifies what the agent does, e.g. 1 = select the KNIFFEL category or 33 = reroll dice 1, 4 and 5.
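In Gym terms that corresponds to a Discrete action space and a Box observation space; a minimal skeleton could look like this (assuming the classic gym API; note that spaces.Discrete(44) numbers the actions 0-43, and the method bodies are placeholders rather than the repo's code):

    import gym
    import numpy as np
    from gym import spaces

    class KniffelEnv(gym.Env):
        """Illustrative skeleton of the environment described above."""

        def __init__(self):
            super().__init__()
            # 44 discrete actions: category selections + reroll combinations
            self.action_space = spaces.Discrete(44)
            # 13 rows x 16 columns, values between 0 and 13
            self.observation_space = spaces.Box(low=0, high=13,
                                                shape=(13, 16), dtype=np.int64)
            self.state = np.zeros((13, 16), dtype=np.int64)

        def reset(self):
            self.state = np.zeros((13, 16), dtype=np.int64)
            return self.state

        def step(self, action):
            # ... apply the action, update self.state, compute the reward ...
            reward, done, info = 0.0, False, {}
            return self.state, reward, done, info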
My approach
I used a DQN with a learning rate in the range 0.0001-0.001.
My model looked like this:
    Layer (type)                 Output Shape              Param #
    =================================================================
    flatten (Flatten)            (None, 208)               0
    dense (Dense)                (None, 128)               26752
    dense_1 (Dense)              (None, 64)                8256
    dense_2 (Dense)              (None, 44)                2860
    =================================================================
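For completeness, a network producing exactly that summary can be built like this (a sketch; the hidden activations are my assumption, since the summary does not show them):

    from tensorflow.keras.layers import Dense, Flatten
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Flatten(input_shape=(13, 16)),    # 13 * 16 = 208 flattened inputs
        Dense(128, activation="relu"),    # 208 * 128 + 128 = 26752 params
        Dense(64, activation="relu"),     # 128 * 64 + 64 = 8256 params
        Dense(44, activation="linear"),   # one Q-value per action: 64 * 44 + 44 = 2860 params
    ])
    model.summary()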
Question
What can I improve? Is it smart to always send the whole observation to the agent, even in the first round when all the other rows are still 0? Should I instead work with the window size and send only the current round, but increase the window size to 13 (the maximum number of rounds), as in the sketch below?
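To make the window-size idea concrete: if the agent is built with keras-rl (an assumption on my part), the per-step observation would shrink to the 16 values of the current round and SequentialMemory would stack the last 13 of them. The parameter values below are illustrative, not tuned:

    from rl.agents.dqn import DQNAgent
    from rl.memory import SequentialMemory
    from rl.policy import EpsGreedyQPolicy
    from tensorflow.keras.optimizers import Adam

    # per-step observation: one round = 16 values; the memory stacks the last 13,
    # so the model above with input shape (13, 16) still fits
    memory = SequentialMemory(limit=50000, window_length=13)
    dqn = DQNAgent(model=model,
                   nb_actions=44,
                   memory=memory,
                   policy=EpsGreedyQPolicy(eps=0.1),
                   nb_steps_warmup=1000,
                   target_model_update=1e-2)
    dqn.compile(Adam(learning_rate=1e-4), metrics=["mae"])
    # dqn.fit(env, nb_steps=100000, verbose=1)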
I hope everything is clear and somebody has an idea.
The whole project is also on https://github.com/elianderlohr/Kniffel/tree/feature/ai under the feature/ai branch.
But please be aware: my coding style is not that advanced, and I'm more of a beginner in Python and deep neural networks.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow