Learning Reinforcement Learning - Part 1
June 8, 2025
Reinforcement Learning (RL) is a powerful paradigm for training agents to make decisions in uncertain environments. To deepen my understanding, I decided to implement an RL agent that learns to play The Witcher's Dice Poker using the Proximal Policy Optimization (PPO) algorithm. In this post, I'll walk through the garbage I have produced to try to achieve that end.
The project is organized as follows:
- Dice Rolling: Simulates rolling dice and retrieving outcomes.
- Hand Evaluation: Assesses the player's hand and determines winners.
- RL Environment: A custom Gymnasium environment for the agent.
- Training Script: Uses Stable Baselines3's PPO to train the agent.
The Game: Dice Poker
Each player rolls five dice, can re-roll selected dice up to two times, and the best poker-style hand wins. One caveat: my RL environment models a single agent playing to maximize its own hand value. My goal was to first check whether the agent can learn the rules of the game.
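To make "best poker-style hand" concrete, here is a minimal sketch of a hand evaluator. It is not the project's HandEvaluator: the `HandRank` enum, its member names, and the standalone `evaluate_hand` function are assumptions for illustration, using the nine hand ranks (0-8) commonly listed for Witcher dice poker.

```python
from collections import Counter
from enum import IntEnum


class HandRank(IntEnum):
    """Hypothetical ranking, lowest to highest (0-8)."""
    NOTHING = 0
    PAIR = 1
    TWO_PAIRS = 2
    THREE_OF_A_KIND = 3
    FIVE_HIGH_STRAIGHT = 4  # 1-2-3-4-5
    SIX_HIGH_STRAIGHT = 5   # 2-3-4-5-6
    FULL_HOUSE = 6
    FOUR_OF_A_KIND = 7
    FIVE_OF_A_KIND = 8


def evaluate_hand(dice):
    """Return the HandRank of five dice given as ints in 1..6."""
    # How many times each face appears, largest group first, e.g. [3, 1, 1]
    counts = sorted(Counter(dice).values(), reverse=True)
    if counts[0] == 5:
        return HandRank.FIVE_OF_A_KIND
    if counts[0] == 4:
        return HandRank.FOUR_OF_A_KIND
    if counts == [3, 2]:
        return HandRank.FULL_HOUSE
    faces = sorted(dice)
    if faces == [2, 3, 4, 5, 6]:
        return HandRank.SIX_HIGH_STRAIGHT
    if faces == [1, 2, 3, 4, 5]:
        return HandRank.FIVE_HIGH_STRAIGHT
    if counts[0] == 3:
        return HandRank.THREE_OF_A_KIND
    if counts[:2] == [2, 2]:
        return HandRank.TWO_PAIRS
    if counts[0] == 2:
        return HandRank.PAIR
    return HandRank.NOTHING
```

Under this assumed ranking, `evaluate_hand([6, 2, 2, 2, 1])` is `THREE_OF_A_KIND` and `evaluate_hand([2, 2, 2, 2, 1])` is `FOUR_OF_A_KIND`, whose `.value` is 7.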
Gymnasium Environment
I created a custom environment compatible with Gymnasium, where the observation includes the five dice and the current hand value. The agent's action is a binary mask indicating which dice to re-roll. Note that the following environment is named DicePokerEnv2 because I was testing several approaches; in this one, the observation holds the dice values plus the hand value for that particular combination.
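import gymnasium as gym
import numpy as np
from gymnasium import spaces
# HandEvaluator comes from the project's hand-evaluation module (import not shown here)
class DicePokerEnv2(gym.Env):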
    metadata = {"render_modes": ["human"]}
    def __init__(self, max_rolls=3):
        super().__init__()
        # 5 dice (1-6) + 1 hand value (0-8, assuming 9 hand ranks)
        self.observation_space = spaces.Box(
            low=np.array([1, 1, 1, 1, 1, 0]),
            high=np.array([6, 6, 6, 6, 6, 8]),
            shape=(6,),
            dtype=np.int32
        )
        self.action_space = spaces.MultiBinary(5)
        self.hand_evaluator = HandEvaluator()
        self.max_rolls = max_rolls
        self.current_roll = 0
        self.state = None
        self.first_hand_value = 0
    def _get_obs(self):
        hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        return np.array(list(self.state) + [hand_value], dtype=np.int32)
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Use the environment's seeded RNG so reset(seed=...) is reproducible
        self.state = self.np_random.integers(1, 7, size=(5,), dtype=np.int32)
        self.current_roll = 1
        self.first_hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        return self._get_obs(), {}
    def step(self, action):
        for i in range(5):
            if action[i]:
                self.state[i] = self.np_random.integers(1, 7)  # seeded RNG, upper bound exclusive
        self.current_roll += 1
        hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        obs = np.array(list(self.state) + [hand_value], dtype=np.int32)
        if self.current_roll >= self.max_rolls:
            reward = hand_value  - self.first_hand_value # Only final reward
            terminated = True
        else:
            reward = 0
            terminated = False
        truncated = False
        info = {"hand": self.state.copy(), "hand_value": hand_value}
        return obs, reward, terminated, truncated, info
    def render(self):
        hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        print(f"Hand: {self.state}, value: {hand_value}")
    def close(self):
        pass
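The per-die loop in step can also be written as one vectorized operation. The sketch below is an alternative formulation, not the code used in the project; the `reroll` helper and the explicit `rng` argument are assumptions for illustration. Passing a seeded NumPy generator makes the re-rolls reproducible.

```python
import numpy as np


def reroll(dice, mask, rng):
    """Return a new array where dice flagged in mask (1 = re-roll) get fresh values."""
    dice = np.asarray(dice, dtype=np.int32)
    mask = np.asarray(mask).astype(bool)
    # Draw a candidate value for every die; the upper bound 7 is exclusive
    fresh = rng.integers(1, 7, size=dice.shape, dtype=np.int32)
    # Take the fresh value only where the mask is set, keep the old die otherwise
    return np.where(mask, fresh, dice)


rng = np.random.default_rng(seed=0)
hand = np.array([6, 2, 2, 2, 1], dtype=np.int32)
new_hand = reroll(hand, [1, 0, 0, 0, 1], rng)
print(new_hand)  # dice at indices 1-3 stay 2; the first and last are redrawn
```

One design difference from the loop: this version draws five random numbers on every call, even for dice that are kept, so the two versions consume the random stream differently.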
                
Training the Agent with PPO
PPO (Proximal Policy Optimization) is a robust policy gradient algorithm that balances learning speed and stability. I used Stable Baselines3 for training:
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from dice_poker_env import DicePokerEnv
env = DicePokerEnv()
env = Monitor(env, "./logs")
policy_kwargs = dict(net_arch=[128, 128])
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    tensorboard_log="./ppo_tensorboard/",
    policy_kwargs=policy_kwargs,
    device="cpu"
)
model.learn(total_timesteps=500_000)
model.save("ppo_dice_poker")
Evaluating the Agent
If you're paying attention, you noticed that training logs to TensorBoard (in ./ppo_tensorboard/), but I'll leave graph interpretation for another blog post, since I'm still learning what some of the metrics mean. So, to check whether the agent is producing meaningful results, I run some games and look at its decisions. For instance, the following game:
Starting game # 1
Starting hand: [6 2 2 2 1]
Action: [1. 0. 0. 0. 0.]
Hand: [2 2 2 2 1]
Action: [0. 0. 0. 0. 0.]
Hand: [2 2 2 2 1]
Final reward: 7
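A trace like the one above can be produced with a small episode loop. This is a sketch rather than the project's evaluation script: `play_episode` and its `policy` argument are hypothetical names, and it assumes the first five entries of the observation are the dice, as in DicePokerEnv2.

```python
def play_episode(env, policy, verbose=True):
    """Play one episode and return the total reward.

    env    -- any object following the Gymnasium reset()/step() API
    policy -- callable mapping an observation to an action
    """
    obs, _info = env.reset()
    if verbose:
        print(f"Starting hand: {obs[:5]}")
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += reward
        done = terminated or truncated
        if verbose:
            print(f"Action: {action}")
            print(f"Hand: {obs[:5]}")
    if verbose:
        print(f"Final reward: {total_reward}")
    return total_reward
```

With a trained Stable Baselines3 model, the policy could be `lambda obs: model.predict(obs, deterministic=True)[0]`, since `predict` returns the action together with a recurrent state that is unused here.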
It decided to re-roll only the first die and never re-rolled the last one. I'd have chosen to re-roll the first and last dice. Anyway, there is still a lot to be done. Next steps are:
- Make it play against a rule-based opponent or set up a multi-agent environment
- Add the betting mechanics
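The choice I described for [6 2 2 2 1] (re-roll the first and last dice) matches a simple heuristic: keep the most frequent face and re-roll everything else. Here is a minimal sketch of that rule, which could also serve as a starting point for a rule-based opponent. The function name and the tie-breaking rule are assumptions for illustration, and the heuristic deliberately ignores straights and full houses.

```python
from collections import Counter


def keep_most_common_mask(dice):
    """Return a 0/1 re-roll mask that keeps only the most frequent face.

    Ties are broken in favour of the higher face (an arbitrary choice).
    """
    counts = Counter(dice)
    keep_face = max(counts, key=lambda face: (counts[face], face))
    return [0 if d == keep_face else 1 for d in dice]


print(keep_most_common_mask([6, 2, 2, 2, 1]))  # [1, 0, 0, 0, 1]
print(keep_most_common_mask([2, 2, 2, 2, 1]))  # [0, 0, 0, 0, 1]
```

Because it only looks at the largest group, this rule would break up a full house such as [3, 3, 3, 2, 2] by re-rolling the pair, so it is a baseline to compare against, not a strong strategy.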
For now, if you want to check the code, I've tagged the first version here. Feedback is welcome!