Learning Reinforcement Learning - Part 1
June 8, 2025
Reinforcement Learning (RL) is a powerful paradigm for training agents to make decisions in uncertain environments. To deepen my understanding, I decided to implement an RL agent that learns to play The Witcher's Dice Poker using the Proximal Policy Optimization (PPO) algorithm. In this post, I'll walk through the garbage I have produced trying to achieve that end.
The project is organized as follows:
- Dice Rolling: Simulates rolling dice and retrieving outcomes.
- Hand Evaluation: Assesses the player's hand and determines winners.
- RL Environment: A custom Gymnasium environment for the agent.
- Training Script: Uses Stable Baselines3's PPO to train the agent.
The Game: Dice Poker
Each player rolls five dice, can re-roll selected dice up to two times, and the best poker-style hand wins. One note here: my RL environment models a single agent playing to maximize its own hand value, since my goal was first to check whether the agent can learn the rules of the game.
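The environment below relies on a HandEvaluator that maps five dice to a hand rank between 0 and 8. My actual evaluator isn't shown in this post, but a minimal sketch, assuming the usual Witcher-style ranks in this particular 0-8 order (that ordering is my assumption), could look like this:

from collections import Counter
from enum import IntEnum

# Assumed rank ordering (0-8), matching the 0-8 hand values used by the environment below.
class HandRank(IntEnum):
    NOTHING = 0
    PAIR = 1
    TWO_PAIRS = 2
    THREE_OF_A_KIND = 3
    FIVE_HIGH_STRAIGHT = 4   # 1-2-3-4-5
    SIX_HIGH_STRAIGHT = 5    # 2-3-4-5-6
    FULL_HOUSE = 6
    FOUR_OF_A_KIND = 7
    FIVE_OF_A_KIND = 8

class HandEvaluator:
    def evaluate_hand(self, dice):
        # Count how many times each face appears, largest group first
        counts = sorted(Counter(dice).values(), reverse=True)
        if counts[0] == 5:
            return HandRank.FIVE_OF_A_KIND
        if counts[0] == 4:
            return HandRank.FOUR_OF_A_KIND
        if counts[0] == 3 and counts[1] == 2:
            return HandRank.FULL_HOUSE
        if sorted(dice) == [2, 3, 4, 5, 6]:
            return HandRank.SIX_HIGH_STRAIGHT
        if sorted(dice) == [1, 2, 3, 4, 5]:
            return HandRank.FIVE_HIGH_STRAIGHT
        if counts[0] == 3:
            return HandRank.THREE_OF_A_KIND
        if counts[0] == 2 and counts[1] == 2:
            return HandRank.TWO_PAIRS
        if counts[0] == 2:
            return HandRank.PAIR
        return HandRank.NOTHING

Since HandRank is an IntEnum, the .value attribute used by the environment gives back the plain integer rank.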
Gymnasium Environment
I created a custom environment compatible with Gymnasium, where the observation includes the five dice and the current hand value. The agent's action is a binary mask indicating which dice to re-roll. Notice that the following environment is named DicePokerEnv2; that is because I was testing several approaches, and this one, where the state holds the dice values plus the hand value of that particular combination, is the version I ended up keeping.
import gymnasium as gym
import numpy as np
from gymnasium import spaces

# HandEvaluator is provided by the project's hand-evaluation module

class DicePokerEnv2(gym.Env):
    metadata = {"render_modes": ["human"]}

    def __init__(self, max_rolls=3):
        super().__init__()
        # Observation: 5 dice (1-6) + 1 hand value (0-8, assuming 9 hand ranks)
        self.observation_space = spaces.Box(
            low=np.array([1, 1, 1, 1, 1, 0]),
            high=np.array([6, 6, 6, 6, 6, 8]),
            shape=(6,),
            dtype=np.int32
        )
        # Action: a binary mask over the 5 dice, 1 = re-roll that die
        self.action_space = spaces.MultiBinary(5)
        self.hand_evaluator = HandEvaluator()
        self.max_rolls = max_rolls
        self.current_roll = 0
        self.state = None
        self.first_hand_value = 0

    def _get_obs(self):
        hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        return np.array(list(self.state) + [hand_value], dtype=np.int32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # The initial roll of all five dice counts as roll number 1
        self.state = np.random.randint(1, 7, size=(5,), dtype=np.int32)
        self.current_roll = 1
        self.first_hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        return self._get_obs(), {}

    def step(self, action):
        # Re-roll only the dice selected by the binary mask
        for i in range(5):
            if action[i]:
                self.state[i] = np.random.randint(1, 7)
        self.current_roll += 1
        hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        obs = np.array(list(self.state) + [hand_value], dtype=np.int32)
        if self.current_roll >= self.max_rolls:
            # Only the final roll is rewarded: improvement over the starting hand
            reward = hand_value - self.first_hand_value
            terminated = True
        else:
            reward = 0
            terminated = False
        truncated = False
        info = {"hand": self.state.copy(), "hand_value": hand_value}
        return obs, reward, terminated, truncated, info

    def render(self, mode="human"):
        hand_value = self.hand_evaluator.evaluate_hand(self.state.tolist()).value
        print(f"Hand: {self.state}, value: {hand_value}")

    def close(self):
        pass
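Before training, it doesn't hurt to check that the environment behaves like a standard Gymnasium env. The following snippet isn't from my project, just a sketch of a random rollout:

# Sketch: play one episode with random actions to sanity-check the env
env = DicePokerEnv2()
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()   # random re-roll mask
    obs, reward, terminated, truncated, info = env.step(action)
    env.render()
print("episode reward:", reward)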
Training the Agent with PPO
PPO (Proximal Policy Optimization) is a robust policy gradient algorithm that balances learning speed and stability. I used Stable Baselines3 for training:
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from dice_poker_env import DicePokerEnv2

env = DicePokerEnv2()
env = Monitor(env, "./logs")  # log episode rewards and lengths to ./logs

policy_kwargs = dict(net_arch=[128, 128])  # two hidden layers of 128 units

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    tensorboard_log="./ppo_tensorboard/",
    policy_kwargs=policy_kwargs,
    device="cpu"
)
model.learn(total_timesteps=500_000)
model.save("ppo_dice_poker")
Evaluating the Agent
If you're paying attention, you'll have noticed that training logs to TensorBoard (in ./ppo_tensorboard/), but I'll leave interpreting those graphs for another blog post, since I'm still learning some of the metrics myself. So, to check whether the agent is producing meaningful results, I run a few games and inspect its decisions.
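The evaluation script itself isn't included in this post; a minimal sketch of the kind of loop I use, assuming the model and environment defined above (the exact print format differs from my actual script), would be:

from stable_baselines3 import PPO
from dice_poker_env import DicePokerEnv2

env = DicePokerEnv2()
model = PPO.load("ppo_dice_poker")

for game in range(1, 4):
    print(f"Starting game # {game}")
    obs, info = env.reset()
    print("Starting hand:", obs[:5])   # first five entries are the dice
    terminated = truncated = False
    while not (terminated or truncated):
        action, _ = model.predict(obs, deterministic=True)
        print("Action:", action)
        obs, reward, terminated, truncated, info = env.step(action)
        print("Hand:", info["hand"])
    print("Final reward:", reward)

For instance, here is one of the games: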
Starting game # 1
Starting hand: [6 2 2 2 1]
Action: [1. 0. 0. 0. 0.]
Hand: [2 2 2 2 1]
Action: [0. 0. 0. 0. 0.]
Hand: [2 2 2 2 1]
Final reward: 7
It decided to re-roll only the first die, but not the last one; I'd have re-rolled both the first and the last. Anyway, there is still a lot to be done. The next steps are:
- Make it play against a rule-based opponent or set up a multi-agent environment
- Add the betting mechanics
For now, if you want to check the code, I've tagged the first version here. Feedback is welcome!