The n-armed bandit problem is a reinforcement learning problem in which the agent is given a slot machine with n arms (also called bandits). Each arm has a different chance of winning, and the agent does not know these chances in advance. Pulling an arm either rewards or punishes the agent, i.e., the pull is a success or a failure. The agent's goal is to pull the arms one at a time so that the total reward collected over all pulls is maximized.
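
To make the setup concrete, here is a minimal Python sketch of such a slot machine and one simple way an agent might play it. Several things here are assumptions and do not come from the text above: the names `BernoulliBandit` and `run_epsilon_greedy` are hypothetical, a success is encoded as a reward of 1 and a failure as 0, and the epsilon-greedy rule is just one common strategy picked for illustration, not the only or necessarily the best one.

```python
import random


class BernoulliBandit:
    """Hypothetical n-armed slot machine: each arm wins with its own fixed probability."""

    def __init__(self, win_probs, seed=None):
        self.win_probs = list(win_probs)
        self._rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self.win_probs)

    def pull(self, arm):
        # Assumption: success pays 1, failure pays 0.
        return 1 if self._rng.random() < self.win_probs[arm] else 0


def run_epsilon_greedy(bandit, n_pulls, epsilon=0.1, seed=None):
    """Pull arms one at a time, mostly choosing the arm that looks best so far."""
    rng = random.Random(seed)
    counts = [0] * bandit.n_arms       # how often each arm was pulled
    estimates = [0.0] * bandit.n_arms  # running average reward per arm
    total_reward = 0

    for _ in range(n_pulls):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(bandit.n_arms)
        else:
            # Exploit: pick the arm with the highest estimate (first one on ties).
            arm = max(range(bandit.n_arms), key=lambda a: estimates[a])

        reward = bandit.pull(arm)
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, counts, estimates


if __name__ == "__main__":
    machine = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
    total, counts, estimates = run_epsilon_greedy(machine, n_pulls=1000, seed=1)
    print("total reward:", total)
    print("pulls per arm:", counts)
    print("estimated win rates:", [round(e, 2) for e in estimates])
```

The agent never sees `win_probs` directly; it only observes the outcome of each pull. That is why it has to balance trying arms it knows little about against repeatedly pulling the arm that has paid off best so far.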