Reinforcement Learning – The Multi Arm Bandit Problem using TensorFlow


The n-arm bandit problem is a reinforcement learning problem in which the agent is given a slot machine with n arms (bandits). Each arm has a different probability of success. Pulling an arm gives the agent a reward that signals either success or failure.
The agent’s objective is to pull the arms one at a time so that it maximizes the total reward collected by the time the process ends. The agent does not know the success probability of any arm in advance. It learns gradually through trial and error, estimating the value of each arm as it goes.


In this blog, we learn to use the policy gradient method. We use TensorFlow to create a simple neural network that consists of one weight per arm, each representing how likely that arm is to fetch a reward from the slot machine. The agent chooses an arm using an e-greedy approach: most of the time it chooses the action with the largest expected value, but occasionally it chooses at random.

In this way, the agent keeps trying out the different arms and continues to learn more about them. Once the agent has taken an action, i.e., chosen an arm of the slot machine, it receives a reward of either 1 or -1.
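The e-greedy choice described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name epsilon_greedy and the estimates list are made up for this example and are not part of the TensorFlow code used later in the post.

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Return a random index with probability epsilon, otherwise the index of the largest value."""
    if rng.random() < epsilon:
        # explore: pick any arm uniformly at random
        return rng.randrange(len(values))
    # exploit: pick the arm with the largest current estimate
    return max(range(len(values)), key=lambda i: values[i])

# made-up value estimates for 6 arms, for illustration only
estimates = [1.0, 1.2, 0.9, 1.5, 1.1, 1.0]
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)  # never explores
random_choice = epsilon_greedy(estimates, epsilon=1.0)  # always explores
print("greedy choice:", greedy_choice)
print("random choice:", random_choice)
```

With epsilon set to 0 the agent always exploits, and with epsilon set to 1 it always explores. The value 0.1 used later in this post sits between the two.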

Practical Code Implementation

Below is a short implementation of the n-arm/multi-arm bandit problem in Python.

We take n=6 for our code implementation (6 arms of the slot machine) and define the arms by the values [2,0,0.2,-2,-1,0.8].

As training proceeds, we will see that the agent learns to choose the bandit that fetches the largest reward.

Import necessary libraries

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

The function tf.disable_v2_behavior (as the name suggests) switches all global behaviors that are different between TensorFlow 1.x and 2.x to behave as intended for 1.x.

Finding rewards for the arms

We create a slot_arms array that defines our bandits. len_slot_arms stores 6, i.e., the length of the array. The function findReward() draws a random number from a normal distribution with mean 0 and returns 1 if that number is greater than the arm's value, and -1 otherwise.
The lower the arm's value, the more likely the agent is to receive a positive reward (1).

slot_arms = [2,0,0.2,-2,-1,0.8]
len_slot_arms = len(slot_arms)
def findReward(arm):
    #draw one sample from a normal distribution with mean 0
    result = np.random.randn(1)
    if result > arm:
        #returns a positive reward
        return 1
    else:
        #returns a negative reward
        return -1
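To see which arm should win, we can compute each arm's chance of a positive reward directly. The sketch below assumes the standard normal distribution that np.random.randn samples from, so the chance of drawing a number greater than the arm's value is 1 minus the normal CDF at that value. The helper name positive_reward_probability is introduced here only for illustration.

```python
import math

def positive_reward_probability(arm):
    """P(Z > arm) for a standard normal Z, written with the complementary error function."""
    return 0.5 * math.erfc(arm / math.sqrt(2))

slot_arms = [2, 0, 0.2, -2, -1, 0.8]
probs = [positive_reward_probability(a) for a in slot_arms]
# index of the arm with the highest chance of a +1 reward
best_arm = max(range(len(slot_arms)), key=lambda i: probs[i])

for i, p in enumerate(probs):
    print("bandit", i + 1, "P(reward = +1) =", round(p, 3))
print("best bandit:", best_arm + 1)
```

The arm with value -2 (bandit 4) has the highest success probability, so that is the bandit we expect the agent to settle on.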

Our neural agent

tf.reset_default_graph()
weights = tf.Variable(tf.ones([len_slot_arms]))
chosen_action = tf.argmax(weights,0)

The function tf.reset_default_graph of the TensorFlow library clears the default graph stack and resets the global default graph. The remaining two lines initialize the weight of each bandit to 1 and then choose the arm with the largest weight.

reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)

The above lines of code set up the training. The reward and the chosen action (arm) are fed to the network, which then computes the loss using the formula given below. This loss is used to update the weight of the chosen arm.

Loss = -log(weight for action) * A, where A is the advantage, i.e., the reward measured against a baseline (here the baseline is 0).
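To make the update concrete, here is one gradient-descent step on this loss written out by hand. Since the baseline is 0, the advantage is simply the reward, and the derivative of -log(w) * reward with respect to w is -reward / w. The function name policy_gradient_step is hypothetical and only mirrors what the optimizer above does for the single chosen weight.

```python
def policy_gradient_step(weight, reward, learning_rate=0.001):
    """One gradient-descent step on loss = -log(weight) * reward."""
    grad = -reward / weight  # d(loss)/d(weight)
    return weight - learning_rate * grad

# starting from the initial weight of 1
w_after_win = policy_gradient_step(1.0, reward=1)
w_after_loss = policy_gradient_step(1.0, reward=-1)
print("after a reward of  1:", w_after_win)
print("after a reward of -1:", w_after_loss)
```

A positive reward nudges the chosen arm's weight up and a negative reward nudges it down, which is how the arm with the best odds ends up with the largest weight.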

Training our agent and finding the most probable arm/bandit

total_episodes = 1000
total_reward = np.zeros(len_slot_arms) #output reward array
e = 0.1 #chance of taking a random action.
init = tf.global_variables_initializer()
with tf.Session() as sess:
  sess.run(init)
  i = 0
  while i < total_episodes:
    #explore with probability e, otherwise pick the arm with the largest weight
    if np.random.rand(1) < e:
      action = np.random.randint(len_slot_arms)
    else:
      action = sess.run(chosen_action)
    reward = findReward(slot_arms[action])
    #update the network with the reward and the chosen action
    _,resp,ww = sess.run([update,responsible_weight,weights], feed_dict={reward_holder:[reward],action_holder:[action]})
    total_reward[action] += reward
    if i % 50 == 0:
      print ("Running reward for the n=6 arms of slot machine: " + str(total_reward))
    i += 1
print ("The agent thinks bandit " + str(np.argmax(ww)+1) + " has highest probability of giving positive reward")
if np.argmax(ww) == np.argmax(-np.array(slot_arms)):
  print("which is right.")
else:
  print("which is wrong.")

We train the agent by letting it take actions and receive rewards. The above lines of code launch a TensorFlow session. In each episode an action is chosen (at random with probability e, otherwise the arm with the largest weight), and a reward is drawn from the chosen arm. This reward is used to update the network, and the running reward is printed every 50 episodes.
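For readers who want to follow the loop without a TensorFlow session, the same idea can be sketched in plain Python. This is not the code from this post: run_bandit is a hypothetical helper, the gradient step is written by hand from the loss formula, and random.gauss stands in for np.random.randn.

```python
import random

def run_bandit(slot_arms, episodes=1000, epsilon=0.1, learning_rate=0.001, seed=0):
    """Train one weight per arm with e-greedy selection and a hand-written gradient step."""
    rng = random.Random(seed)
    weights = [1.0] * len(slot_arms)
    pulls = [0] * len(slot_arms)
    for _ in range(episodes):
        if rng.random() < epsilon:
            # explore: random arm
            action = rng.randrange(len(slot_arms))
        else:
            # exploit: arm with the largest weight
            action = max(range(len(weights)), key=lambda i: weights[i])
        # same rule as findReward: +1 if a standard normal draw exceeds the arm's value
        reward = 1 if rng.gauss(0, 1) > slot_arms[action] else -1
        # gradient-descent step on loss = -log(w) * reward
        weights[action] += learning_rate * reward / weights[action]
        pulls[action] += 1
    return weights, pulls

weights, pulls = run_bandit([2, 0, 0.2, -2, -1, 0.8])
print("learned weights:", [round(w, 3) for w in weights])
print("pull counts:", pulls)
```

Because the learning rate is small and the run is short, the arm the agent settles on can vary with the random seed, which is also why the TensorFlow version above prints whether the agent's final guess is right or wrong.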
