What is Reinforcement Learning?

"painting of a dog trying to decide which button to push" by DALL-E

Reinforcement learning (RL) is a simple and intuitive way to train machine learning models using rewards. Reinforcement learning is one of three major approaches to machine learning, alongside supervised learning and unsupervised learning.

With reinforcement learning, a decision is made by an ML model, then positive rewards are assigned for desirable outcomes (a “carrot”) and negative rewards are assigned for undesirable outcomes (a “stick”). The model is then re-trained to incorporate the learnings from past decisions.

By learning from past successes and failures, the model gets smarter over time and increasingly makes better decisions.

Reinforcement Learning API

Improve AI casts reinforcement learning as an intuitive API. There is a DecisionModel with two functions: which() and addReward(). which() makes a decision using the decision model and addReward() assigns a reward for that decision.

For example, optimizing the headline of an article might look like:

headline = headlineModel.which("Fruit Truck Crashes, Creates Jam", 
                               "Accident on I-80")

If the article is clicked, a reward is assigned.

if (clicked) {
    headlineModel.addReward(1.0)
} else {
    // no reward
}
In this case a simple binary reward is used. 1.0 is assigned if the article is clicked and no reward is assigned otherwise. With this simple reward scheme, future trainings of the headline model will optimize the headline to maximize the expected probability of a click.
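
To make the link between binary rewards and click probability concrete, here is a minimal Python sketch. ToyDecisionModel is a hypothetical stand-in written for this illustration, not the Improve AI library: it mirrors which() and addReward() using Python naming, chooses variants at random, and only keeps per-variant tallies. The click rates are made-up simulation values. With 1.0 for a click and nothing otherwise, the average reward per decision is simply the observed click-through rate, which is the quantity a trained model would try to maximize.

```python
import random
from collections import defaultdict


class ToyDecisionModel:
    """Hypothetical stand-in for a decision model (not the Improve AI library)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.decisions = defaultdict(int)   # times each variant was chosen
        self.rewards = defaultdict(float)   # total reward per variant
        self._last = None

    def which(self, *variants):
        # Choose uniformly at random; a trained model would favor better variants.
        self._last = self._rng.choice(variants)
        self.decisions[self._last] += 1
        return self._last

    def add_reward(self, reward):
        # Credit the reward to the most recent decision.
        self.rewards[self._last] += reward

    def mean_reward(self, variant):
        n = self.decisions[variant]
        return self.rewards[variant] / n if n else 0.0


# Simulated click probabilities, assumed purely for illustration.
TRUE_CLICK_RATE = {
    "Fruit Truck Crashes, Creates Jam": 0.30,
    "Accident on I-80": 0.10,
}

model = ToyDecisionModel(seed=42)
clicks = random.Random(7)
for _ in range(20000):
    headline = model.which(*TRUE_CLICK_RATE)
    if clicks.random() < TRUE_CLICK_RATE[headline]:
        model.add_reward(1.0)  # clicked
    # otherwise: no reward

for headline in TRUE_CLICK_RATE:
    print(headline, round(model.mean_reward(headline), 3))
```

The printed averages should land close to the simulated click rates, since each variant's mean reward is just clicks divided by impressions.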

In addition to simple decisions, reinforcement learning can also be used for ranking, scoring, and jointly optimizing multiple variables.

Metrics as Rewards

Since rewards are numeric, rewards can be business metrics, such as revenue, conversions, engagement, or user retention. When RL rewards are business metrics, the model will optimize decisions to automatically improve those metrics over time. With each decision, the RL model will attempt to maximize its expected reward.

That’s like A/B testing on steroids.

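In fact, reinforcement learning can be far more efficient than A/B testing. Instead of splitting traffic evenly until a test concludes, it shifts decisions toward better-performing variants as evidence accumulates, wasting far fewer trials before learning to make good decisions.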

For example, we could implement dynamic offers that learn to maximize profits. A user could be presented with an offer, with different offers having different discounts, free trial periods, etc.

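offer = offersModel.whichFrom(offers)
if (purchased) {
    // reward with the value of the purchased offer (the field name is illustrative)
    offersModel.addReward(offer.value)
}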

By rewarding the offersModel with the value of the offer, which could be an expected customer lifetime value or the expected profit of the offer, the model will learn to present the most profitable offer to the user.
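
A short worked example shows why rewarding with value differs from rewarding with a plain conversion. The offers, profit figures, and purchase probabilities below are invented for illustration only. Because the reward is the profit when a purchase happens and nothing otherwise, the expected reward of an offer is its profit weighted by its purchase probability.

```python
# Hypothetical offers: the profit figures and purchase probabilities are
# made-up numbers for illustration, not measured data.
offers = [
    {"name": "10% discount", "profit": 40.0, "purchase_probability": 0.05},
    {"name": "30% discount", "profit": 25.0, "purchase_probability": 0.10},
    {"name": "free trial", "profit": 15.0, "purchase_probability": 0.12},
]


def expected_reward(offer):
    # Reward is the profit when purchased and nothing otherwise, so the
    # expected reward is profit weighted by the chance of a purchase.
    return offer["profit"] * offer["purchase_probability"]


best = max(offers, key=expected_reward)
most_converting = max(offers, key=lambda o: o["purchase_probability"])

for offer in offers:
    print(f'{offer["name"]}: expected reward {expected_reward(offer):.2f}')
print("highest expected reward:", best["name"])
print("highest purchase probability:", most_converting["name"])
```

In this made-up data the free trial converts most often, yet the 30% discount has the highest expected reward, so a model rewarded with profit would come to favor the discount.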

Exploration vs Exploitation

Unlike supervised learning, where the correct response is known ahead of time, with reinforcement learning the model is not told which actions to take, but instead must discover which actions yield the most reward through trial and error.

This is the exploration vs exploitation trade-off. Should the model spend more time exploring variants to gather more data, or should it double down on variants that are already performing well?

Much academic ink has been spilled on this topic, and the consensus is that Bayesian reinforcement learning methods such as Thompson Sampling, which was first described in 1933, consistently demonstrate excellent performance. In essence, these methods generate multiple hypotheses about the relationship between variants, context, and rewards, and choose the variant with the highest expected reward given the current hypothesis. Our real-world deployments of reinforcement learning have further validated the efficacy of Bayesian methods, which can often provide good performance with only thousands of decisions per variant.
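
As a rough illustration of the idea, here is a minimal Thompson Sampling sketch in Python for the simplest case: binary rewards and no context. It is a textbook Beta-Bernoulli formulation, not Improve AI's implementation; the class name, the variant names, and the simulated reward rates are all assumptions made for the demo. Each call to which() samples one plausible reward rate per variant (one "hypothesis") and picks the variant that looks best under that sample.

```python
import random


class BetaBernoulliThompson:
    """Thompson Sampling for binary rewards with a Beta(1, 1) prior per variant."""

    def __init__(self, variants, seed=0):
        self._rng = random.Random(seed)
        # Counts start at 1, which corresponds to a uniform prior.
        self.successes = {v: 1 for v in variants}
        self.failures = {v: 1 for v in variants}

    def which(self):
        # Sample one plausible reward rate per variant (one "hypothesis"),
        # then act as if that hypothesis were true.
        samples = {
            v: self._rng.betavariate(self.successes[v], self.failures[v])
            for v in self.successes
        }
        return max(samples, key=samples.get)

    def add_reward(self, variant, reward):
        # Binary reward: a positive value counts as a success, zero as a failure.
        if reward > 0:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1


# Simulated environment; these reward rates are assumptions for the demo.
true_rates = {"A": 0.05, "B": 0.10, "C": 0.20}
env = random.Random(1)
model = BetaBernoulliThompson(true_rates, seed=2)
chosen = {v: 0 for v in true_rates}
for _ in range(5000):
    variant = model.which()
    chosen[variant] += 1
    model.add_reward(variant, 1 if env.random() < true_rates[variant] else 0)

print(chosen)
```

Early on, the posteriors are wide and every variant gets tried. As evidence accumulates, samples for weaker variants rarely come out on top, so exploration tapers off on its own without any tuning parameter.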

Other approaches to exploration, such as random exploration and the Epsilon Greedy algorithm, have enjoyed popularity in recent years due to their simplicity, but we do not recommend them when Bayesian methods are available.
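
For comparison, Epsilon Greedy fits in a few lines, which explains much of its popularity. The sketch below is a generic version in Python; the class name, the epsilon value of 0.1, and the simulated reward rates are assumptions chosen for illustration.

```python
import random


class EpsilonGreedy:
    """Explore uniformly at random with probability epsilon, otherwise exploit."""

    def __init__(self, variants, epsilon=0.1, seed=0):
        self._rng = random.Random(seed)
        self.variants = list(variants)
        self.epsilon = epsilon
        self.counts = {v: 0 for v in self.variants}
        self.totals = {v: 0.0 for v in self.variants}

    def _mean(self, variant):
        n = self.counts[variant]
        return self.totals[variant] / n if n else 0.0

    def which(self):
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.variants)   # explore
        return max(self.variants, key=self._mean)    # exploit the best average

    def add_reward(self, variant, reward):
        self.counts[variant] += 1
        self.totals[variant] += reward


# Simulated environment; these reward rates are assumptions for the demo.
true_rates = {"A": 0.05, "B": 0.10, "C": 0.20}
env = random.Random(1)
model = EpsilonGreedy(true_rates, epsilon=0.1, seed=2)
for _ in range(5000):
    variant = model.which()
    model.add_reward(variant, 1.0 if env.random() < true_rates[variant] else 0.0)

print(model.counts)
```

Note that the exploration rate stays fixed at epsilon no matter how much evidence has accumulated, so clearly inferior variants keep receiving a constant share of decisions.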

Improve AI uses Bayesian exploration automatically, with no special configuration required on the part of the developer.

Contextual Decision Making

Furthermore, unlike A/B testing, a type of reinforcement learning called Contextual Bandits uses context to personalize each decision.

With context, the model can learn a different best choice for each situation. For example, given that the user is a Spanish speaker, the greetings model will learn to choose Hola.

greeting = greetingsModel.given({"language": "cowboy"})
                         .which("Hello", "Howdy", "Hola")

Given the language is cowboy, the variant with the highest expected reward is Howdy and the model will learn to make that choice.

The combination of given() and which() creates a sort of AI if/then statement.
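
The learned if/then behavior can be sketched with a toy contextual bandit in Python. ToyContextualModel is hypothetical and deliberately naive: it keeps an independent Beta posterior for every (context, variant) pair and chooses by Thompson Sampling, whereas a production system would typically generalize across contexts with a trained model. The language codes and reward probabilities in the simulation are assumptions.

```python
import random
from collections import defaultdict


class ToyContextualModel:
    """Hypothetical sketch: independent Beta posteriors per (context, variant)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        # [successes, failures], starting from a uniform Beta(1, 1) prior.
        self._posterior = defaultdict(lambda: [1, 1])
        self._context = None
        self._last = None

    def given(self, context):
        # Freeze the context dict into a hashable key and allow chaining.
        self._context = tuple(sorted(context.items()))
        return self

    def which(self, *variants):
        def sample(variant):
            a, b = self._posterior[(self._context, variant)]
            return self._rng.betavariate(a, b)

        choice = max(variants, key=sample)
        self._last = (self._context, choice)
        return choice

    def add_reward(self, reward):
        # Credit the most recent decision, in the context where it was made.
        self._posterior[self._last][0 if reward > 0 else 1] += 1


GREETINGS = ("Hello", "Howdy", "Hola")
# Assumed preferences for the simulation: one favored greeting per language.
preferred = {"cowboy": "Howdy", "es": "Hola", "en": "Hello"}

model = ToyContextualModel(seed=3)
env = random.Random(4)
for _ in range(3000):
    language = env.choice(list(preferred))
    greeting = model.given({"language": language}).which(*GREETINGS)
    p = 0.6 if greeting == preferred[language] else 0.1
    model.add_reward(1.0 if env.random() < p else 0.0)


def favorite(language):
    # The greeting with the most recorded successes for this language.
    key = (("language", language),)
    return max(GREETINGS, key=lambda g: model._posterior[(key, g)][0])


for language in preferred:
    print(language, "->", favorite(language))
```

Nobody wrote a rule mapping cowboy to Howdy. The mapping emerges from rewards alone, which is what makes the given() and which() pairing behave like a learned conditional.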

Deep Reinforcement Learning

Deep RL is an approach to reinforcement learning where the underlying decision model is a deep neural network. Deep neural networks are typically used when the input data is unstructured, such as in video, images, or audio. Deep reinforcement learning has achieved impressive results in game play and robotics.

The challenge with deep reinforcement learning is that the model is simultaneously required to learn representations of unstructured data, such as raw pixels, while also learning long-term, multi-step cause and effect, where one reward is assigned to many past decisions.
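
One standard way to spread a single delayed reward over the decisions that preceded it is the discounted return, where each earlier step receives a geometrically smaller share. The Python sketch below is a generic illustration of that idea rather than a description of any particular deep RL system; the discount factor gamma and the four-step episode are assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Four decisions, with a single reward arriving only after the last one.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))
# prints [0.125, 0.25, 0.5, 1.0]
```

The earliest decision receives only a faint signal from the final reward, which hints at why learning long chains of cause and effect takes so many episodes.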

Solving this dual problem of representation and decisions often requires hundreds of millions to billions of decisions, which is feasible for only the largest of online digital products.

Because of this requirement for an enormous number of decisions, many of which will lead to poor outcomes, deep reinforcement learning typically requires a simulated or wholly virtual environment, such as a physics simulator for robotics, or a video game where the model can repeatedly play against itself.

For these reasons, while deep reinforcement learning has achieved impressive results for game playing, its usage and applicability to real-world business applications remain minimal.

What’s Next for Reinforcement Learning?

Reinforcement learning has enjoyed a long history of study and refinement within academia, with Thompson Sampling being originally described as early as 1933. Only recently has reinforcement learning gained wider awareness within the professional software development community.

For professional software developers, what has been missing is tooling. The coming years will see a greater proliferation of tools that make it simple for software developers to implement, and cost-effective for businesses to deploy, RL-based solutions. At Improve AI we’re making it easy to deploy RL-based solutions for iOS, Android, and Python for making decisions, ranking, scoring, and multivariate optimization.
