Reinforcement Learning, Cookies and your Mission Statement

I am fond of eating a cookie once in a while (who isn’t?). As I caught myself eating one this morning, I realized I was quite unconsciously being led by a reward mechanism.

Eating a cookie triggers the sweet-taste receptors on the tongue, which in turn prompts the brain to release dopamine. That is a good thing as far as the body is concerned: when humans were evolving, finding a sugary treat was indeed a good thing that helped survival.

So why is it not good?

Humans are complex, and we hold many, often conflicting, goals at the same time. Eating cookies until the body sends its satiety signals is most definitely not a good action for achieving health-related goals.

For instance, eating cookies every day might release dopamine in the short term, but in the long term it will most likely lead to a huge negative reward of obesity and health problems.

Further, by picking up the cookie, we become prone to picking up the cookie-eating habit. That is, the probability of me picking up another cookie tomorrow increases ever so slightly if I picked up a cookie today, or if I have been eating cookies every day.

So, although eating cookies produces a dopamine reward in the short run, and from a hedonistic perspective my goal is to pick up rewards such as these in life, doing so eventually results in the state of being obese, which carries a large negative reward. This negative reward factors in the health adversities and/or the societal stigma.

Thus, when I have to make the decision to pick up that cookie or not, I should consider not just the immediate reward of the dopamine release, but also the long-term (negative) reward of obesity.

In reinforcement learning, this consideration of long-term rewards quantifies the goal of an agent. If, in a lofty state of mind, I define my goal to be a healthy life, then the dopamine reward is a non-reward. But practically, even if I consider the sugar rush a desirable condition, I should be able to make the right decision as long as I properly factor in the long-term regrets.

Formally, I define G at the present moment to be my adjusted reward for taking an action, one that takes into account all future rewards that I expect to get by following this policy (habit). I realize that by eating a cookie today, I’m more likely to eat a cookie every time I am in this state (being in front of a cookie), and thus am more likely to end up unhealthy and get a huge negative reward in the future.
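
A sketch of the standard discounted return from reinforcement learning, which is presumably the quantity G described here. The symbols R (the reward received at each step), t (the present time step) and the weight gamma on future rewards are notation assumed for illustration:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad 0 \le \gamma \le 1
```

The first term is the immediate reward (the dopamine from the cookie); every later term is a future reward, weighted less the further away it lies whenever gamma is below 1.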

The popular adage goes something like ‘A bird in the hand is worth two in the bush’, which attempts to capture the uncertainty of future rewards. In reinforcement learning this is captured by the discount factor ‘gamma’, a number between 0 and 1 that scales down each future reward the further away it lies. If we are fairly certain that continuing to take this action (eating the cookie) according to the current policy (see a cookie, eat a cookie) is bound to make us unhealthy in the long term, then we set ‘gamma’ to 1, so that the future penalty counts as much as today’s pleasure.
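
To make the role of ‘gamma’ concrete, here is a minimal Python sketch that compares the cookie habit with abstaining. All the numbers are invented for illustration (a reward of +1 per cookie for 30 days, followed by a single health penalty of -100), and the function name discounted_return is just a label chosen here, not something from a library:

```python
def discounted_return(rewards, gamma):
    """Sum rewards[k] * gamma**k, where rewards[0] is the immediate reward."""
    return sum(r * gamma**k for k, r in enumerate(rewards))


# Hypothetical reward streams, chosen only for illustration:
# +1 per cookie for 30 days, then one large health penalty.
eat_cookies = [1.0] * 30 + [-100.0]
skip_cookies = [0.0] * 31  # no pleasure, no penalty

for gamma in (0.5, 0.9, 1.0):
    g_eat = discounted_return(eat_cookies, gamma)
    g_skip = discounted_return(skip_cookies, gamma)
    choice = "eat" if g_eat > g_skip else "skip"
    print(f"gamma={gamma}: G(eat)={g_eat:.2f}, G(skip)={g_skip:.2f} -> {choice}")
```

With gamma at 1 the penalty counts in full (30 - 100 = -70), so skipping the cookie wins. With gamma at 0.5 the penalty is discounted almost to nothing and the cookie habit looks attractive, which is exactly the short-sightedness described above.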

Mission Statement

So what does this line of thought have to do with a company’s mission statement?

Replace ‘eating a cookie’ with ‘making a short-term profit’ and ‘being healthy’ with ‘surviving and thriving’.

The purpose of your company’s mission statement is not just to inspire employees with lofty rhetoric, but to expose a behavioral policy.

Let’s say Company A, an airline, has the mission statement of increasing shareholder profit. Its head employee (the CEO), when faced with the question of whether to charge cancellation fees, will naturally think in terms of shareholder profit and decide accordingly.

Company B, another airline, by contrast has a mission statement aligned with serving more customers. An employee faced with a disgruntled customer might then choose a behavior that not only ensures the customer stays this time, but also makes it more likely that the customer comes back or refers other customers.

There is no telling for sure which mission statement is the better one, but the focus of the first seems to be on short-term rewards and that of the second on long-term health.

What is paramount, then, is to understand the nature of the reward-generation process and to quantify the ‘gamma’. For this, we will try to better understand the Environment we operate in. (Next post)