Thursday, November 15, 2007

Thinking Cap questions (comment on the blog)



Here are some thinking cap questions.

(1) Underlying much of the discussion on MDPs is a fundamental principle of decision making under uncertainty--a rational decision maker does the action that maximises the *expected* utility. For example, suppose you have two actions: A1, which gets you to S1 with probability 0.9 and to S2 with probability 0.1; and A2, which gets you to S2 with probability 0.2 and to S3 with probability 0.8. The utilities of the states are S1=10, S2=100, S3=-20.
The expected utility of doing action A1 is 0.9*10 + 0.1*100 = 19, and that of doing A2 is 0.2*100 + 0.8*(-20) = 20 - 16 = 4. So, a rational agent must pick A1 over A2.
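A minimal sketch of the same calculation in Python (the state names and numbers are just the ones from the example above):

# Expected utility of an action = sum over outcomes of P(outcome) * U(outcome).
utilities = {"S1": 10, "S2": 100, "S3": -20}

def expected_utility(outcomes):
    # outcomes: dict mapping state -> probability of reaching it
    return sum(p * utilities[s] for s, p in outcomes.items())

eu_a1 = expected_utility({"S1": 0.9, "S2": 0.1})   # 0.9*10 + 0.1*100 = 19
eu_a2 = expected_utility({"S2": 0.2, "S3": 0.8})   # 0.2*100 + 0.8*(-20) = 4
print(eu_a1, eu_a2)   # roughly 19 and 4, so a rational agent picks A1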

Consider a typical lottery. You buy a ticket for $1, and you get a 1-in-ten-million chance of winning $1 million. You have two possible actions: buy the ticket or don't buy the ticket. The expected return of buying the ticket works out to a loss of about 90 cents, while the expected return of not buying the ticket is 0.
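The same arithmetic as a small Python sketch (using the round figures from the question):

# Expected monetary return of buying one $1 ticket with a
# 1-in-10-million chance of winning $1,000,000.
p_win = 1 / 10_000_000
prize = 1_000_000
ticket_price = 1

ev_buy = p_win * prize - ticket_price   # 0.10 - 1.00 = -0.90
ev_dont_buy = 0.0
print(ev_buy, ev_dont_buy)              # -0.9 0.0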

Despite this, we know that lotteries are flourishing everywhere. Does the success of lotteries show that the underlying assumption of decision theory is irrevocably flawed in accounting for what humans do? Or is there a decision-theoretic explanation for why some humans seem to go for lotteries?



(2) In today's class we agreed that the "optimal policy" is simply a mapping from states to actions. However, if two people, one 20 years old and the other 60 years old, but otherwise completely identical in their financial situation, go to a financial advisor, the advisor will give them different advice (typically, the 20-year-old will be told to invest it all in stocks, and the 60-year-old to invest in bonds or even put the money under the pillow).
To the extent MDPs can be thought of as a model of our everyday decision making, how can you explain this difference between policies for two agents that seem to be identical in all respects?



(3) When we come into this world, we probably don't have a very good model of the world (a kid, for example, doesn't quite know what the outcomes of many different actions are; so the kid will eagerly try putting their hand on the red glowing thing--even if the glowing thing in question happens to be the cooking range). So, presumably we learn the model. How about the "reward" model? Is the reward model also completely learned from scratch?

Comments, on the blog, welcome.

cheers
Rao


     

8 comments:

sd said...

A model learned from scratch requires trial and error to come up with the best/optimal path to the goal. This is best seen by comparing with the classical setting, where everything is known ahead of time when choosing the next state. But rewards are generally obtained only after the decision to move to the next state has been made, and that decision is based on the utility function. So if the rewards also have to be learned from scratch, the values may not reflect a state's true worth (as in the best choice of state, from the pool of stochastic outcomes, that leads to the goal in the optimal sense). So the reward model has to start with some preset numbers for each state, which get updated to better rewards for the "good" states as we learn the underlying model, so that the cumulative rewards reflect optimality.
-Sushma

Subbarao Kambhampati said...

Re: Sushma's comment.

My question is slightly different. It is possible to have absolutely no idea about the action models (i.e., what the actions do in the world) and still get by. Is it possible to come into the world with absolutely no reward model?

Rao

Anupam said...

2. This can be explained using the relative lengths of the horizons and the corresponding discount factors.
Assuming the same total lifespan, the old man has a much shorter remaining horizon, so the utility of future rewards is smaller for him than for the young man (a small numeric sketch of this appears at the end of this comment).

3. The reward model cannot be learned completely from scratch.
In the example, the kid does not know the consequences of putting his hand on the cooking range, but he (or rather his body's machinery) does know that a burn has negative utility. So as soon as he put his hand on it, he evaluated the resulting immediate reward as negative. The state transition model can be learned by exploring, but a capability of distinguishing up front between positive and negative rewards is needed. For example, a machine can learn to play chess by trying different moves and evaluating the utilities of the resulting states, but it needs to know that winning a game is a good thing and losing is not. Given a preliminary (hardcoded) reward model, though, it can be combined with the environment model to build higher-level models.
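Coming back to point 2 above, here is a small numeric sketch of the horizon effect (the yearly reward, discount factor, and horizon lengths are made-up numbers, purely for illustration):

# Present value of a constant yearly reward r over a finite horizon H,
# discounted by gamma: sum over t = 0..H-1 of gamma^t * r.
def discounted_value(r, gamma, horizon):
    return sum((gamma ** t) * r for t in range(horizon))

r, gamma = 1.0, 0.95
print(discounted_value(r, gamma, 60))   # ~19.1 : 20-year-old, long horizon
print(discounted_value(r, gamma, 20))   # ~12.8 : 60-year-old, short horizon

The same future stream of rewards is simply worth less to the agent with the shorter horizon, so the recommended policies can differ even when the reward per year is identical.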

Ravi Gummadi said...

(1).
The question considers only the monetary returns as rewards, but the sheer act of buying a ticket might move the person into a state with some reward 'x', and that is what matters to the person.

This can probably be compared to buying a movie ticket. The expected monetary loss of buying is $10, whereas not buying is $0. But if the reward for watching the movie is 100 and the reward for spending $10 is -10,
then the effective reward of buying is 100 + (-10) = 90,
while not spending the $10 has reward 0.

So watching a movie (or gambling) can be considered more rewarding for some people.

(2).

The policy changes because of the reward function.

In the first case (20 years old) the reward for losing money is small in magnitude, say -10, because he still has a chance to earn the money back before he dies.

But in the second case the reward for losing money is much worse, say -1000, because he doesn't have a chance to get the money back. So he is advised a policy that keeps him in states where the probability of losing money is low.

So it is the reward function that differentiates the two people.

(3).
I am not sure if I understood the question correctly.

I guess in this case all the states give a uniform reward. So the agent might not be able to distinguish between good and bad states, and would just reach the goal state by the shortest path (if the reward is negative) or the longest path (if it is positive).

But I am not sure the reward model can be learned! Rewards differ for each agent: what one agent considers a positive reward might be a negative reward for another. So extracting a reward model from one agent's actions and generalising it doesn't seem to be a good idea!

sukru said...

I guess the third question is about Q-learning (but I may be wrong).

Since we don't know the values of particular actions, but we do have some idea of the final outcome, we can use that knowledge to "reward" or "punish" the path we've taken.

So we can indeed learn the reward model. (If we don't know whether the final state we've reached is good or bad, then we have another problem.)
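Since Q-learning came up, here is a minimal sketch of the tabular update rule; the states, actions, and numbers are invented purely for illustration:

from collections import defaultdict

# Tabular Q-learning update:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
# The reward r comes from the environment as the agent acts, so action values
# can be learned without knowing the transition model in advance.
Q = defaultdict(float)          # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.9
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One hypothetical experience: from state 0, taking "right" gave reward -1
# and led to state 1.
q_update(0, "right", -1.0, 1)
print(Q[(0, "right")])          # -0.1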

Subbarao Kambhampati said...

Regarding Anupam's comment

I think the answer about the reward model is very well put.
While we may learn to love new things, we do come into this world with a beginning reward system.
At the chemical level, the dopamine rush in the brain is the hard-wired reward system for us.

When people are "depressed" and/or say that they are not able to enjoy life, in a way, the built-in reward system is not working well.

Similarly, the egregious effect of narcotics is that they short-circuit the "achieve something, get a reward" cycle by directly triggering the dopamine rush in the brain.

Of course, this begs the question--what was the evolutionary advantage of having this type of reward system built-in?

rao

Subbarao Kambhampati said...

Regarding Ravi Gummadi's answer to the Lottery question...

It is reasonable to say that the act of taking part in a lottery itself has its own reward for some, and that with respect to that reward model, it is possibly rational to choose the "play lottery" action.

There is also another way of looking at it. What if the "utility" of 1M dollars is NOT 1M times the utility of $1?

Note that this is an issue of utility dependency--the utility of (right shoe + left shoe) is NOT the same as the utility of the right shoe + the utility of the left shoe.

Think about how utility of 1M$ may be related to the utility of 1$ to different types of people.
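Here is one way to make that question concrete; the two utility-of-money curves below are invented purely for illustration, and the real question is what the curve looks like for different kinds of people:

p_win, prize, ticket = 1 / 10_000_000, 1_000_000, 1

def linear(dollars):
    # $1M is worth exactly a million times $1.
    return dollars

def life_changing(dollars):
    # Invented curve: small sums are worth about their face value, but any
    # sum above $100,000 carries a large extra "quit your job" bonus.
    return dollars + (20_000_000 if dollars >= 100_000 else 0)

def eu_buy(u):
    # Expected utility of the change in money from buying one ticket.
    return p_win * u(prize - ticket) + (1 - p_win) * u(-ticket)

print(eu_buy(linear))          # about -0.9: buying looks irrational
print(eu_buy(life_changing))   # about +1.1: buying looks rational

Under the second (made-up) curve the jackpot is worth far more than a million times the lost dollar, and the very same expected-utility principle now recommends buying the ticket.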

(Also, check out the article in today's NY Times on the state lotteries: http://www.nytimes.com/2007/11/18/business/18queen.html)

Rao

RANDY said...

1. People's intuitions about probability, not decision theory itself, are at fault in explaining lotteries. People suffer from the "gambler's fallacy" that they must eventually win/lose in any repeated random event & from simple lack of knowledge of how unlikely winning the lottery actually is. Some people also weigh large & small gains & losses differently, which could be accounted for by adding weights to the decision-theoretic calculations (a small sketch of this idea appears after these answers).

2. The difference in age results in a different value for life expectancy, & the returns on investments have to actually be retrieved in order for them to be useful.

3. The reward model is partially predetermined by such biological considerations as our digestive system & pain receptors, & it also appears that people have an innate drive to seek novelty. These things can of course be "un-learned" later, but hunger, pain, & curiosity seem universal in young children.
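To make the weighting point from answer 1 concrete, here is a tiny sketch; the "perceived probability" rule below is invented purely for illustration:

# If people subjectively overweight tiny probabilities, the same lottery can
# look attractive even with money valued linearly.
p_win, prize, ticket = 1 / 10_000_000, 1_000_000, 1

def perceived(p, floor=5e-6):
    # Invented rule of thumb: probabilities below 'floor' feel like 'floor'.
    return max(p, floor)

ev_objective = p_win * prize - ticket             # -0.90
ev_perceived = perceived(p_win) * prize - ticket  # 5.0 - 1 = 4.0
print(ev_objective, ev_perceived)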