Thursday, November 15, 2007
Thinking Cap questions (comment on the blog)
Here are some thinking cap questions.
(1) Underlying much of the discussion on MDPs is a fundamental principle about decision making under uncertainty: a rational decision maker does the action that maximises the *expected* utility. For example, suppose you have two actions: A1, which gets you to S1 with probability 0.9 and to S2 with probability 0.1, and A2, which gets you to S2 with probability 0.2 and to S3 with probability 0.8; the utilities of the states are S1=10, S2=100, S3=-20.
Then the expected utility of doing action A1 is 0.9*10 + 0.1*100 = 19, and that of doing A2 is 0.2*100 + 0.8*(-20) = 20 - 16 = 4. So, a rational agent must pick A1 over A2.
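If you want to check the arithmetic yourself, here is a throwaway Python sketch of the same computation (the state utilities and transition probabilities are just the numbers from the example above):

```python
# Utilities of the states, as given in the example.
utilities = {"S1": 10, "S2": 100, "S3": -20}

def expected_utility(outcomes):
    """Expected utility of an action given (state, probability) pairs."""
    return sum(p * utilities[s] for s, p in outcomes)

# A1: S1 with prob 0.9, S2 with prob 0.1
eu_a1 = expected_utility([("S1", 0.9), ("S2", 0.1)])  # 0.9*10 + 0.1*100 = 19
# A2: S2 with prob 0.2, S3 with prob 0.8
eu_a2 = expected_utility([("S2", 0.2), ("S3", 0.8)])  # 0.2*100 + 0.8*(-20) = 4

print(eu_a1, eu_a2)  # A1 wins: 19 > 4
```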
Consider a typical lottery. You buy a ticket for $1, and you get a one-in-ten-million chance of winning $1 million. You have two possible actions: buy the ticket or don't buy the ticket. The expected return of buying the ticket is a loss of 90 cents, while the expected return of not buying is 0.
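The same kind of back-of-the-envelope check works for the lottery (again, the ticket price, prize, and odds are just the numbers from the example):

```python
# Lottery from the example: $1 ticket, 1-in-10-million chance of $1 million.
p_win = 1 / 10_000_000
prize = 1_000_000
ticket_price = 1

ev_buy = p_win * prize - ticket_price  # 0.10 - 1.00 = a loss of 90 cents
ev_skip = 0.0                          # not buying changes nothing

print(ev_buy, ev_skip)  # expected utility says: don't buy
```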
Despite this, we know that lotteries are flourishing everywhere. Does the success of lotteries show that the underlying assumption of decision theory is irrevocably flawed in accounting for what humans do? Or is there a decision-theoretic explanation for why some humans seem to go for lotteries?
(2) In today's class we agreed that the "optimal policy" is simply a mapping from states to actions. However, suppose two people, one 20 years old and the other 60 years old, but otherwise completely identical in their financial situation, go to a financial advisor. The advisor will give the two of them differing advice (typically, the 20-year-old will be told to invest everything in stocks, and the 60-year-old to invest in bonds, or even to put the money under the pillow).
To the extent MDPs can be thought of as a model of our everyday decision making, how can you explain this difference between policies for two agents that seem to be identical in all respects?
(3) When we come into this world, we probably don't have a very good model of the world (a kid, for example, doesn't quite know what the outcomes of many different actions are; so the kid will eagerly try putting their hand on the red glowing thing, even if the glowing thing in question happens to be the cooking range). So, presumably we learn the model. How about the "reward" model? Is the reward model also completely learned from scratch?
Comments, on the blog, welcome.