### Introduction to Reinforcement Learning

Reinforcement Learning is a hot topic in the field of machine learning. Think of self-driving cars or bots that play complex games. However, applications of this technique to operational problems are still scarce. In this blog post, we will guide you through the basic concepts of Reinforcement Learning and show how it can be used to solve a simple order-pick routing problem in a warehouse using Python.

Reinforcement Learning can be applied to a wide variety of problems. As formulated by Bhatt [?]: “Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences.” Unlike supervised learning, RL does not need labeled input and output data; instead, the agent learns from the rewards it receives while interacting with the environment.

A typical RL problem is modeled as a Markov Decision Process (MDP) and consists of:

- **Environment**: The world in which the agent operates.
- **State**: The current situation of the environment; the set of states is denoted by *S*.
- **Action**: A decision made by the agent that affects the environment and changes the current state; the set of actions is denoted by *A*. The set of actions can also be defined separately for each state *s ∈ S*, in which case it is denoted as *A(s)*.
- **Reward**: Feedback from the environment; the reward after the transition from state *s* to state *s′* under action *a* is denoted by *r_a(s, s′)*.
- **State-action-transition probability**: The probability of the transition from state *s* to state *s′* under action *a*, denoted by *p(s′ | s, a)*. The set of probabilities is denoted by *P*.

These concepts are illustrated in figure 1. At every discrete time step *t*, the agent interacts with the environment by observing the current state *s_t* and performing an action *a_t* from the set of available actions. After performing action *a_t*, the environment moves to a new state *s_{t+1}* and the agent observes a reward *r_{t+1}* associated with the transition (*s_t, a_t, s_{t+1}*). The ultimate goal of the agent is to maximize the future reward by learning from the impact of its actions on the environment. At every time step, the agent needs to make a trade-off between the long-term and the short-term reward.
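This interaction loop can be sketched directly in code. The `ChainEnv` below is a hypothetical toy environment (not yet the warehouse): the agent earns a reward of 1 once it reaches the end of a short chain.

```python
class ChainEnv:
    """Toy environment (hypothetical): walk right along a short chain."""
    def __init__(self, length=3):
        self.length = length

    def reset(self):
        self.pos = 0                          # initial state s_0
        return self.pos

    def step(self, action):                   # action: 0 = stay, 1 = move right
        self.pos = min(self.pos + action, self.length)
        done = self.pos == self.length        # terminal state reached?
        reward = 1.0 if done else 0.0         # feedback from the environment
        return self.pos, reward, done

def run_episode(env, policy):
    """One episode of the agent-environment loop described above."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                    # agent picks a_t in state s_t
        state, reward, done = env.step(action)    # environment returns r_{t+1}, s_{t+1}
        total_reward += reward
    return total_reward

total = run_episode(ChainEnv(), policy=lambda s: 1)   # always move right
```

Any RL problem plugs into this same loop through three ingredients: a state observation, an action choice, and a reward.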

Due to its generality, Reinforcement Learning can be applied to a wide variety of problems. For example, RL is frequently used to build AI for playing computer games such as Pac-Man, backgammon and Go (as in AlphaGo), but also to design software for self-driving cars.

# Order-pick routing formulated as an RL problem

Now that we know the core concepts of an RL problem, we want to know how we can translate our order-pick routing problem into these core concepts.

Order-picking in a warehouse can be seen as a special form of the classical Traveling Salesman Problem (TSP), which asks: “Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?”. Translated to order-pick routing, this question becomes: “Given a list of pick locations and the distances between each pair of pick locations, what is the shortest possible route that visits each pick location and returns to the I/O point?”. The TSP is a widely studied problem in the field of combinatorial optimization and many heuristics have been developed to solve it. Here, we formulate warehouse order-pick routing as an MDP:

A worker with a cart (agent) travels through the warehouse (environment) to visit a set of pick nodes. At every time step *t*, the agent decides which node is visited next, changing the selected node from *unvisited* to *visited* (state). The agent tries to learn the best order in which to traverse the nodes, such that the negative total distance (reward) is maximized. The core concepts of this MDP are as follows:

- **State**: A representation of the warehouse environment with the *current node* and a set of *unvisited nodes*; *s_t* contains the current node and a list of nodes still to be visited.
- **Action**: A finite set of actions *A* of selecting the next node to be visited. *A* can be the set of all nodes and identical for each state *s*, or it can depend on *s*, in which case the set of possible actions decreases over time. In the latter case, once all nodes have been visited at time *t*, the only possible action *a_t* is to return to the I/O point.
- **Reward**: Feedback from the environment; the reward after the transition from state *s* to state *s′* under action *a* is the negative distance between the current nodes of *s* and *s′*, except for the transition into the terminal state, where the agent receives a large additional reward for completing its route.
- **State-action-transition probability**: The probability of the transition from state *s* to state *s′* under action *a*, denoted by *p(s′ | s, a)*. For every action we know exactly what the next state will be, hence *p(s′ | s, a) = 1* for the known next state *s′* and 0 for all other states.

Now, we will illustrate these concepts using a small example of a warehouse that we model in Python.

# Small example modeled in Python

Suppose we have a warehouse with a layout as in figure 2. Now the agent gets an order and has to pick a total of four products at different locations. We number the pick locations from 1 to 4 and denote the starting location as location 0. By the structure of the warehouse, we can calculate the shortest paths between all the points and store these in the distance matrix D of size 5 by 5 (see table 1).
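In Python, this distance matrix can be stored as a simple nested list (or a NumPy array). The numbers below are hypothetical placeholders; the actual values follow from the shortest paths in the layout of figure 2 (table 1).

```python
# Hypothetical stand-in for the distance matrix D of table 1.
# D[x][y] is the shortest-path distance from location x to location y;
# location 0 is the starting (I/O) point, locations 1-4 are pick locations.
D = [
    [0, 2, 4, 4, 2],
    [2, 0, 2, 4, 4],
    [4, 2, 0, 2, 4],
    [4, 4, 2, 0, 2],
    [2, 4, 4, 2, 0],
]
# Sanity check: a distance matrix is symmetric and has a zero diagonal.
```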

**Guiding**

The concrete goal of the agent is to visit all pick locations and return to the starting location in the shortest way possible. For this specific example it is easy to calculate the optimal order of nodes to traverse by just going through all possibilities, 4! = 24 in total. Since we know the optimal route, we can easily check whether our agent is able to learn the optimal route. For this specific example the set of actions is the same for each state s ∈ S, hence A(s) = A for all s ∈ S and is defined by:

A = {0,1,2,3,4}
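For such a small instance, the optimal route can indeed be found by brute force over all 4! = 24 visiting orders. The distance matrix below is a hypothetical stand-in for table 1.

```python
from itertools import permutations

# Hypothetical placeholder for the distance matrix D of table 1;
# row/column 0 is the starting (I/O) location.
D = [
    [0, 2, 4, 4, 2],
    [2, 0, 2, 4, 4],
    [4, 2, 0, 2, 4],
    [4, 4, 2, 0, 2],
    [2, 4, 4, 2, 0],
]

def route_length(order):
    route = (0,) + tuple(order) + (0,)   # start and end at location 0
    return sum(D[a][b] for a, b in zip(route, route[1:]))

# Enumerate all 24 visiting orders and keep the shortest one.
best = min(permutations([1, 2, 3, 4]), key=route_length)
```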

Note that this means that the agent can decide to go to a pick node that has already been visited. A state is defined by a tuple *s = (i_s, V_s)* consisting of the current location *i_s* and a set of locations *V_s* still to be visited. For example, the state *s = (2, {1, 3})* means that the agent is at pick location 2 and still needs to visit pick locations 1 and 3. Formally, we define the set of states by:

*S = {(i, V) : i ∈ A, V ⊆ A \ {0, i}}*

Note that one can verify by hand that the total number of states in this example is equal to 48. In general, if we define the set of states in this way, the number of states is equal to:

*|S| = 2^(|A|−1) + (|A| − 1) · 2^(|A|−2)*   (2)

In equation (2), if the agent is at location 0, there are 2^(|A|−1) possible sets of locations still to be visited; for each of the other (|A| − 1) locations, there are 2^(|A|−2) possible sets. For every given state we know for every action what the next state will be. For example, if the agent is in state (0, {1, 2, 3, 4}) and decides to go to pick location 3, the next state is (3, {1, 2, 4}). Formally, we define the state-action-transition probability as:

*p(s′ | s, a) = 1* if *s′ = (a, V_s \ {a})*, and *p(s′ | s, a) = 0* otherwise.

Now, all that is left is to define the reward function. We give the agent a negative reward when it travels from location *x* to location *y*, equal to minus the distance between *x* and *y*: *−D(x, y)*. If the agent returns to the starting point having visited all pick locations, i.e. when it reaches the terminal state, it receives a big reward of 100 (or another number that is large in comparison to the distances). Formally, the reward is given by:

*r_a(s, s′) = −D(i_s, a) + 100* if *s′* is the terminal state *(0, ∅)*, and *r_a(s, s′) = −D(i_s, a)* otherwise.

## Python code
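One possible way to model this MDP in Python is sketched below. A state is a tuple of the current location and a frozenset of unvisited locations; the distance matrix is again a hypothetical stand-in for table 1.

```python
# Sketch of the order-pick MDP. Distances are hypothetical placeholders
# for table 1; states are tuples (current_location, unvisited_locations).
D = [
    [0, 2, 4, 4, 2],
    [2, 0, 2, 4, 4],
    [4, 2, 0, 2, 4],
    [4, 4, 2, 0, 2],
    [2, 4, 4, 2, 0],
]
ACTIONS = [0, 1, 2, 3, 4]                 # A = {0, 1, 2, 3, 4}
START = (0, frozenset({1, 2, 3, 4}))      # at the I/O point, nothing visited yet
TERMINAL = (0, frozenset())               # back at the I/O point, all visited
BONUS = 100                               # extra reward for completing the route

def step(state, action):
    """Deterministic transition: p(s'|s, a) = 1 for exactly one next state s'."""
    loc, unvisited = state
    next_state = (action, unvisited - {action})
    reward = -D[loc][action]              # negative travel distance
    done = next_state == TERMINAL
    if done:
        reward += BONUS                   # big reward in the terminal state
    return next_state, reward, done
```

For example, taking action 3 in the start state moves the agent to state (3, {1, 2, 4}) with reward −D(0, 3), exactly as in the formal definition above.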

**Solving with Q-learning**

Several reinforcement learning algorithms have been developed to train the agent. One of the most widely used is Q-learning, introduced by Chris Watkins in 1989. The algorithm maintains a function that assigns a quality measure to every possible state-action combination:

*Q : S × A → ℝ*

The value *Q(s_t, a_t)* tells, loosely speaking, how good it is to take action *a_t* while being in state *s_t*. Q-learning iteratively updates the Q-values to obtain the final Q-table. From this Q-table, one can read off the policy of the agent by taking in every state *s_t* the action *a_t* that yields the highest value. Updating is done according to the following rule:

*Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t))*
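One application of this update, written out in Python with hypothetical numbers (a current estimate of 0, an observed reward of −2 and a best next-state Q-value of 10):

```python
# One Q-learning update with hypothetical numbers.
alpha, gamma = 0.2, 0.9
q_old = 0.0         # current estimate Q(s_t, a_t)
reward = -2.0       # observed reward r_{t+1}
max_next = 10.0     # max_a Q(s_{t+1}, a)

q_new = q_old + alpha * (reward + gamma * max_next - q_old)
# q_new = 0.0 + 0.2 * (-2.0 + 9.0 - 0.0) ≈ 1.4
```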

The algorithm uses the following parameters/variables:

- **Initial values (Q₀)**: Usually, the initial Q-values are set to 0. However, if we want to stimulate exploration (a concept explained later in this section), we can pick relatively high initial Q-values. These are called optimistic initial values.
- **Learning rate (α)**: The learning rate determines how strongly new information overrides old information. A learning rate of 0 means that the agent learns nothing and the Q-values are never updated; a value of 1 means that the agent only considers the most recent information and forgets everything it learned before.
- **Discount factor (γ)**: The discount factor indicates how much weight the agent gives to future rewards. If γ = 0, the agent only looks at the immediate reward.

If there is a terminal state, one episode of the algorithm ends when this terminal state is reached.

Always taking the action that gives the highest Q-value in a certain state is called a greedy policy. However, for many problems, always selecting the greedy action could get the agent stuck in a local optimum. Therefore, we make a distinction between exploitation and exploration:

- **Exploitation**: Selecting in every iteration the action that results in the best known reward according to a predefined function.
- **Exploration**: Selecting an action that does not necessarily lead to the best known reward, but gives more information about the environment. This can be a random action or an action chosen according to some probability function.

A common way to implement the trade-off between exploitation and exploration is the ε-greedy strategy. With probability 1 − ε the agent chooses the action that it believes has the best long-term effect (exploitation), and with probability ε it takes a random action (exploration). Usually, ε is a constant parameter, but it can be adjusted over time if one prefers more exploration in the early stages of training.
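A minimal sketch of ε-greedy action selection, assuming the Q-table is stored as a dictionary keyed by (state, action) pairs with unseen pairs defaulting to 0:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:                  # exploration
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploitation
```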

Now, we will implement this algorithm in Python to solve our small order-pick routing example. Since we are dealing with an episodic setting with a terminal state, we set the discount factor γ = 0.9. We take learning rate α = 0.2 and exploration parameter ε = 0.3.
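A complete tabular Q-learning implementation for this example could look as follows. Note that the distance matrix is again a hypothetical stand-in for table 1, so the learned route applies to these placeholder distances; the parameters are the ones from the text.

```python
import random
from collections import defaultdict

# Hypothetical placeholder for the distance matrix D of table 1.
D = [
    [0, 2, 4, 4, 2],
    [2, 0, 2, 4, 4],
    [4, 2, 0, 2, 4],
    [4, 4, 2, 0, 2],
    [2, 4, 4, 2, 0],
]
ACTIONS = [0, 1, 2, 3, 4]
START = (0, frozenset({1, 2, 3, 4}))
TERMINAL = (0, frozenset())

def step(state, action):
    loc, unvisited = state
    nxt = (action, unvisited - {action})
    reward = -D[loc][action] + (100 if nxt == TERMINAL else 0)
    return nxt, reward, nxt == TERMINAL

def train(episodes=20000, alpha=0.2, gamma=0.9, epsilon=0.3, seed=42):
    random.seed(seed)
    Q = defaultdict(float)                    # Q-values, initialised at 0
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if random.random() < epsilon:     # exploration: random action
                action = random.choice(ACTIONS)
            else:                             # exploitation: greedy action
                action = max(ACTIONS, key=lambda a: Q[state, a])
            nxt, reward, done = step(state, action)
            target = reward + (0 if done else gamma * max(Q[nxt, a] for a in ACTIONS))
            Q[state, action] += alpha * (target - Q[state, action])
            state = nxt
    return Q

def greedy_route(Q):
    """Read the learned policy from the Q-table."""
    state, route = START, [0]
    while state != TERMINAL and len(route) < 10:   # cap guards against loops
        action = max(ACTIONS, key=lambda a: Q[state, a])
        state, _, _ = step(state, action)
        route.append(action)
    return route

route = greedy_route(train())
```

With the placeholder distances above, the greedy route read from the converged Q-table visits every pick location once and returns to the I/O point.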

We run the algorithm until the Q-values converge; the final Q-table can be found in table 2. From this table we can read the solution found with Q-learning by selecting in each state the action that yields the highest value and following the deterministic state transitions: 0 → 4 → 3 → 2 → 1 → 0. We see that the agent visits every pick node exactly once and returns to the starting point. Moreover, it was able to find the optimal solution!

**Conclusion**

In this blog post, we gave an introduction to Reinforcement Learning and showed how Q-learning can be used to solve a small order-pick routing example in a warehouse. For larger routing instances, the number of states explodes and maintaining a Q-table becomes computationally infeasible. Neural networks can be used to overcome this problem: instead of maintaining a Q-table entry for every state-action pair, a neural network is trained to estimate the Q-values.