Dash App: Experiment with the Best Multi-Armed Bandit Algorithms.

My interest in revenues optimization has been growing ever since I joined the world of Online Advertising. What an interesting challenge to optimize a given situation considering the complexity of its environment. We might want to improve a specific metric, say, a conversion rate, but maybe the client whose performance proves best is likely to update his prices in the near future. We might not want to make too quick a decision when impermanence is the only thing we can predict for certain. From a business perspective, it is very important to bear that in mind and balance the options so profitability is maintained in the long run.

Apart from that, I am always on the lookout for new learning content that can help me grow as a Data Scientist. A few weeks ago, I took a course on Udemy called Bayesian Machine Learning in Python: A/B Testing. The course covers traditional A/B Testing as well as some Reinforcement Learning techniques such as Epsilon Greedy, Optimistic Initial Values, UCB1 and Thompson Sampling.

Internet is already filled with many articles that take a deep dive on this topic. I have tried to make it original and built my own Dash app that lets you invest some amount in USD and see which algorithm is the most profitable. Before that, I have attempted to explain in my own words what I understand from those concepts. If at least one person finds this useful, it will be a success.

I built a Docker image for my app. Please feel free to access it via Docker Hub following the instructions on my Github.

What is the problem?

A one-armed bandit refers to one slot machine wherein you insert a coin, pull a lever and get an immediate reward. The Casino designs the machines in such a way so that it ends up winning in the long run. A Multi-Armed bandit problem entails several such machines. An agent must choose the machines in a way that maximizes the cumulative gain after a given set of trials.

A/B Tests are one common approach to go about it. An experiment runs for a predefined period of time. We identify the best option using some statistical tools and then play the winner for the rest of the decided period.

Do you see the problem here?

This means that we would have to waste part of the trials on the least profitable machines before we can reach a conclusion on the best one to play. It is not ideal. We would want to exploit the data we collect along the way and not waste valuable trials. Another approach would consists in exploiting the Maximum Likelihood Estimation (MLE). At each trial, the agent plays the machine having the highest MLE at the time irrespective of the confidence in the estimation and the data collected. A Greedy approach. It is not ideal either as it fails to collect a sufficient amount of data to make an unbiased decision.

We refer to this famous trade-off as:

The Exploration Exploitation Dilemma.

Luckily, some algorithms attempt to solve this problem.

Let’s go over a few of them.

Algo 1: Epsilon Greedy

One alternative to the Greedy approach is to allow some exploration. We call the exploration rate Epsilon (Ɛ). Imagine Ɛ=10%. This means that in 10% of the time we select an option at random (exploration) whereas in the rest of the time, we select the best option at the time — highest MLE (exploitation).

Although the approach gives some margin for exploration, it is not ideal either. The algorithm is “flawed” by design. It will never stop to explore even if it identifies an option that seems to outperform the others. Its expected cumulative return will be lower than what some other algorithms can achieve.

Algo 2: Optimistic Initial Values

This approach resembles the Greedy approach we first mentioned. However, instead of starting at a zero estimate and then updating up to the true reward probabilities, it starts at some high value and then decreases down the true reward probabilities.

It allows for some exploration at an early stage until it identifies the best option. Then it plays that option with little, if any, exploration at later stages. This will depend on how close are the true reward probabilities to each other.

This algorithm performs quite well in that it keeps exploring until the actual data indicates otherwise. On the contrary, the Greedy algorithm exploits right from the start and we might be missing out on valuable data as a result.

Algo 3: UCB1

The UCB1 approach aims to convert a set of average rewards into a set of decision values (DV). It takes into account how many times we played the options in addition to their MLE. Then, for each trial, we play the option having the best DV.

UCB1 Algorithm: Decision Value (DV) Formula

Given the design of the formula, it seems that if we play one machine fewer times compared to the others, we will be more likely to play it at a later stage. This is because N is at the numerator and ni is at the denominator. As the number of total plays increases, if the number of machine plays does not increase, the machine DV will increase, hence the increase in the likelihood of being chosen next.

The DV takes into account the level of uncertainty around a certain estimate.

By the look of it, it could be interesting to try less picked options (a1 blue or a2 red) as opposed to a3 green — highest MLE. Indeed, our confidence around those options is not strong enough yet to make the decision to only play a3. Those have a potential to perform that we need to investigate further.

Algo 4: Thompson Sampling

Contrary to all previous approaches, Thompson Sampling does not update previous estimates. Instead, it updates the distribution around the reward probability. It then samples from it to decide what action to carry out next.

In our binary outcome scenario, we use the Beta Distribution where alpha (α) stands for the number of successes and beta (β) stands for the number of failures. You might have recognized the Bernoulli distribution that is a special case of the Binomial distribution where only a single trial is performed. Here, the outcome is twofold, either a win (α) or a loss (β).

In Bayesian Inference, the Beta distribution is the conjugate prior probability distribution for the Bernoulli distribution. It builds upon the number of successes and failures to model the distribution around the win rate.

In our example, our initial state (red) with α = 1 (1 success) and β = 1 (1 failure), the distribution is uniform. We have no idea what the probability of success is. After a few trials (blue), we collected 4 successes and 6 failures, the distribution centers around a mean reward of 0.4 (40%) — 4 / (4 + 6). Now, we have some idea about the probability of success, however, the distribution still has fat tails, which tells us that we are not so sure about that 40% yet. Finally, after many trials, we collected 90 successes and 60 failures, the distribution centers around a mean reward of 60% with thinner tails, which indicates that we are quite confident about the reward probability of that machine. We are more likely to sample a reward probability close to 60% although it still allows for some exploration around that mean.

That model provides some level of confidence to the reward probability. And this confidence increases as we collect more samples.

Ok, interesting all this technical jargon but when do we get to the fun part?

How do they compare?

Let’s imagine you go to the Casino tonight and you want to bet, say, 100$ and have some 20 minutes to spare before you go have dinner with your friends.

We consider that the number of trials equals the amount invested. We will play for 100 trials in our scenario. Also, we consider that the amount you bet at each trial increases as you progress in your trials: $1 at first trial, 2$ at second trial etc. We bet more as we get more confident in our choices. Why do we do that? Theoretically, our algorithm should learn quickly and in so doing make more money to bet along the way. Performing the experiment that way allows us to penalize poor learners more significantly.

The results that I present here are averaged over 100 repetitions. We define the performance as the cumulative win rate divided by the maximum win rate amongst all machines. The machines Slot A, Slot B and Slot C have true win rates or reward probabilities of 25%, 50% and 75% respectively. As an example, if an algorithm reaches a cumulative win rate of 72% after finishing the 100 trials, its performance is 72% / 75% = 96%.

What should be your strategy so you can pay for your friends dinner? :)

Clearly, three algorithms come on top: Thompson Sampling, Optimistic Initial Values and UCB1. They learn quicker and have, in that particular case, an average cumulative performance of 92%, 88% and 85% respectively.

The violin plots also support that insight. As a matter of fact, we can see that although Epsilon Greedy may perform well in some occasions with a median performance of 79% and 50% of the repetitions in the performance range of 65%-88%, it is no match for Thompson Sampling where 75% of the repetitions have a performance greater than 88%.

Optimistic Initial Values and UCB1 have similar performances with medians of 90% and 84% respectively although the former tends to perform slightly better in that specific experiment. This may depend on factors such as the slot machine win rates, the number of trials, the number of repetitions and how close are the win rates to each other.

Now how can we maximize our gains?

For each trial, we bet an amount, if we win, we get that amount doubled and if we lose, that amount invested goes to the Casino’s pocket and we keep playing with the rest of the money until we finish our trials.

The total gains shown below do not include your initial 100$.

Once again, this proves that Thompson Sampling yields the best results followed by a small margin by Optimistic Initial Values.

Trying the app a few times will give you some intuition as to how each algorithm perform over 100 repetitions. I strongly recommend it.


Although the performances may vary with context, one seems to be more consistent with repetitions. As you remember, Thompson Sampling maps the uncertainty around the reward probability. Every time, it gets closer to the true probabilities of the options at our disposal.

Epsilon Greedy can perform well but is limited by design. It keeps exploring during the whole period. This does not prove efficient in a stationary environment where reward probabilities do not change in time.

Though, it could come handy in situations where they do change in time. A web site gets lots of traffic every day. The behavior of users may be very varying depending on the content. They may favor one Ad today and favor a completely different one the next day, let alone the next hour.

It could be interesting to analyze the performances of those algorithms in a non-stationary environment and see if other approaches may best fit.

Maybe a subject to consider for another article…

Thanks for reading !

Aspiring Data Scientist always on the lookout for the next content to learn. I believe that technology can help shape a better education system. Let’s grow!

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store