

Thompson Sampling is a highly effective algorithm for the multi-armed bandit problem, an optimization challenge often encountered in online advertising, clinical trials, and recommendation systems. The crux of the problem is to balance exploration (trying out new options) and exploitation (choosing the best-known option) in order to maximize rewards over time. This post walks through how Thompson Sampling works, what makes it attractive, and where it is used in practice.

What is the Multi-Armed Bandit Problem?

Imagine a gambler at a casino, faced with multiple slot machines, each with its own unknown probability of winning. The gambler’s goal is to maximize their earnings by playing the machine with the highest win probability. However, they must first determine which machine that is, without wasting too much time and money on inferior options. This is the essence of the multi-armed bandit problem: making optimal decisions under uncertainty.
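To make the casino picture concrete, here is a minimal sketch of such an environment. The class name `BernoulliBandit`, the win probabilities, and the "regret" helper are illustrative assumptions, not part of any library; regret here simply means the expected reward given up by not playing the best machine.

```python
import random


class BernoulliBandit:
    """Hypothetical slot-machine environment: each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, win_probs, seed=None):
        self.win_probs = list(win_probs)
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Play one arm and return a reward of 1 (win) or 0 (loss)."""
        return 1 if self.rng.random() < self.win_probs[arm] else 0

    def regret(self, arm):
        """Expected reward lost by playing `arm` instead of the best arm."""
        return max(self.win_probs) - self.win_probs[arm]


# Assumed win probabilities, chosen only for illustration.
bandit = BernoulliBandit([0.2, 0.5, 0.7], seed=0)

# Baseline: a gambler who picks machines uniformly at random and never learns.
chooser = random.Random(1)
choices = [chooser.randrange(3) for _ in range(1000)]
total_reward = sum(bandit.pull(arm) for arm in choices)
total_regret = sum(bandit.regret(arm) for arm in choices)
print(f"reward={total_reward}, regret={total_regret:.1f}")
```

The random gambler keeps paying regret at a constant rate per play. A good bandit algorithm should make that rate shrink as it learns which machine is best.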

Thompson Sampling: A Bayesian Approach

Thompson Sampling is a Bayesian approach to solving the multi-armed bandit problem. It involves maintaining a probability distribution for each arm (option) based on observed rewards and using those distributions to make decisions. The algorithm follows these steps:

  1. For each arm, start with a prior distribution representing our belief about its expected reward (for win/lose rewards, a Beta distribution is the usual choice).
  2. At each time step, draw a random sample from each arm’s current distribution.
  3. Choose the arm with the highest sampled value and play it, then observe the reward.
  4. Update the chosen arm’s distribution with the observed reward using Bayes’ rule; the resulting posterior becomes that arm’s distribution for the next step.
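The four steps can be sketched in a few lines for the common case of win/lose (Bernoulli) rewards. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `thompson_sampling`, the uniform Beta(1, 1) priors, and the simulated win probabilities are all choices made for this example.

```python
import random


def thompson_sampling(win_probs, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling against simulated arms.

    `win_probs` are the hidden true win rates, used only to simulate rewards.
    """
    rng = random.Random(seed)
    n_arms = len(win_probs)

    # Step 1: a Beta(1, 1) (uniform) prior for every arm, an assumed default.
    alpha = [1] * n_arms  # 1 + number of observed wins
    beta = [1] * n_arms   # 1 + number of observed losses
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        # Step 2: draw one sample from each arm's current distribution.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]

        # Step 3: play the arm with the highest sample and observe the reward.
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < win_probs[arm] else 0

        # Step 4: Bayesian update; for a Beta prior this is just a counter increment.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return pulls, alpha, beta


pulls, alpha, beta = thompson_sampling([0.2, 0.5, 0.7], n_rounds=2000)
print("pulls per arm:", pulls)
```

The Beta distribution is used here because it is the conjugate prior for Bernoulli rewards, which keeps the update to two additions. Other reward types need a different distribution family and update rule.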

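This approach is both exploratory and exploitative. Arms whose distributions are concentrated on high values produce the highest sample most of the time, so they are played most often. Arms that have been tried only a few times have wide distributions that occasionally produce a high sample, so they still get explored. As evidence accumulates, the distributions narrow and exploration tapers off on its own.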
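One way to see this balance is to estimate, for a fixed set of beliefs, how often each arm would win the "highest sample" contest. The sketch below does that by simulation; the function name `selection_probabilities` and the three Beta distributions are hypothetical values picked for illustration.

```python
import random


def selection_probabilities(posteriors, n_draws=20000, seed=0):
    """Estimate how often each arm would be chosen, given Beta(a, b) beliefs."""
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    for _ in range(n_draws):
        samples = [rng.betavariate(a, b) for a, b in posteriors]
        wins[samples.index(max(samples))] += 1
    return [w / n_draws for w in wins]


# Arm 0: well explored, Beta(71, 31), mean about 0.70 and narrow.
# Arm 1: barely explored, Beta(3, 3), mean 0.50 but very wide.
# Arm 2: well explored, Beta(51, 51), mean 0.50 and narrow.
probs = selection_probabilities([(71, 31), (3, 3), (51, 51)])
print([round(p, 3) for p in probs])
```

Arms 1 and 2 have the same mean, yet the barely explored arm 1 is selected far more often than arm 2, because its wide distribution sometimes yields a sample above arm 0's. That is the exploration at work: uncertainty, not just the average, drives the choice.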

Benefits of Thompson Sampling

  1. Simple and efficient: Thompson Sampling is computationally efficient and relatively easy to implement. It only requires maintaining a distribution for each arm and updating them as new information becomes available.
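  2. Convergence to optimal arm: When the reward distributions are fixed, each arm's distribution concentrates around its true expected reward as data accumulates, so the best arm is chosen an increasing share of the time. For win/lose rewards, the expected regret is known to grow only logarithmically with the number of plays.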
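  3. Adaptability: Because the distributions are updated after every play, Thompson Sampling fits naturally into online applications. The basic algorithm assumes the reward distributions stay fixed, though; when rewards drift over time, it is usually paired with a mechanism that forgets old data, such as discounting past observations or using a sliding window.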
  4. Incorporation of prior knowledge: By using a Bayesian approach, Thompson Sampling allows for the inclusion of prior knowledge or domain expertise through the choice of initial prior distributions.
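Two of these points lend themselves to a short sketch for Beta-distributed beliefs. The first helper encodes prior knowledge as pseudo-observations. The second is a discounted update, a common variant (not part of the basic algorithm) that gradually forgets old evidence so the beliefs can follow a changing reward rate. Both function names, the 5% expected rate, and the discount factor `gamma=0.99` are assumptions made for this example.

```python
def prior_from_belief(mean, strength):
    """Convert 'I expect a rate of `mean`, worth `strength` observations' into Beta(alpha, beta)."""
    if not 0 < mean < 1 or strength <= 0:
        raise ValueError("mean must be in (0, 1) and strength must be positive")
    return mean * strength, (1 - mean) * strength


def discounted_update(alpha, beta, reward, gamma=0.99):
    """Shrink old evidence by `gamma`, then add the new 0/1 reward.

    With gamma < 1 the total evidence alpha + beta levels off near 1 / (1 - gamma),
    so recent rewards always keep some influence. gamma = 1 gives the standard update.
    """
    return gamma * alpha + reward, gamma * beta + (1 - reward)


# Domain knowledge: click-through rate is believed to be around 5%,
# held with the weight of about 100 observations.
alpha, beta = prior_from_belief(mean=0.05, strength=100)
print("prior:", alpha, beta)

# Suppose the ad suddenly starts getting clicked every time.
a, b = alpha, beta
for _ in range(500):
    a, b = discounted_update(a, b, reward=1)
print("mean after drift:", a / (a + b))
```

A stronger prior (larger `strength`) takes more data to override, which is useful when the domain knowledge is reliable and risky when it is not. The discount factor involves a similar trade-off: smaller values react faster to change but make the estimates noisier.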

Real-World Applications

Thompson Sampling has been successfully applied to various real-world scenarios, including:

  1. Online advertising: In digital advertising, Thompson Sampling can be used to optimize ad placements, maximizing click-through rates or conversion rates by intelligently allocating resources to the most effective ads.
  2. Clinical trials: In adaptive clinical trials, Thompson Sampling can be employed to assign patients to the most effective treatments while minimizing the number of patients exposed to inferior treatments.
  3. Recommendation systems: Thompson Sampling can be used to recommend items or content to users, balancing the need to explore new options while exploiting the best-known recommendations.


Thompson Sampling is a powerful, elegant algorithm that addresses the multi-armed bandit problem by balancing exploration and exploitation in a principled, probabilistic manner. Its simplicity, adaptability, and ability to converge to the optimal solution make it a valuable tool in various fields, from online advertising to clinical trials. With its growing popularity and ongoing research, Thompson Sampling is poised to remain a cornerstone in decision-making under uncertainty.