Consider the following:

0. Consider the MDP applet demo we saw in class today. The default discount factor was set to 0.9, and the reward states were made "absorbing". What do you think will happen if you make the discount factor 1? Confirm your intuition with the applet. Now do this part again, but without any absorbing states. Again, try to guess what will happen, and confirm your intuition.
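The effect can be seen in a minimal value-iteration sketch. This is not the applet's grid world -- it is a made-up 3-state chain (my own toy example), where entering state 2 yields reward 1, and "absorbing" means no further reward is collected there:

```python
def value_iteration(gamma, absorbing, sweeps=60):
    # Toy chain 0 -> 1 -> 2; entering state 2 yields reward 1.
    V = [0.0, 0.0, 0.0]
    for _ in range(sweeps):
        new = [0.0] * 3
        for s in range(3):
            if absorbing and s == 2:
                new[s] = 0.0          # absorbing reward state: no further value
                continue
            # two deterministic actions: "stay" at s, or "advance" to s+1
            succs = {s, min(s + 1, 2)}
            new[s] = max((1.0 if s2 == 2 else 0.0) + gamma * V[s2]
                         for s2 in succs)
        V = new
    return V

# With gamma = 1 and an absorbing reward state, values still converge (the
# reward is collected once).  Without absorbing states, gamma = 1 lets the
# agent re-collect the reward forever, so values grow without bound, while
# gamma = 0.9 caps the reward loop's value at 1/(1 - 0.9) = 10.
```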

1. What happens to MDPs if the underlying dynamics are "deterministic"? Can we still talk about value and policy iteration? Do we still have non-stationary policies for finite horizon deterministic MDPs?
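One way to build intuition: with deterministic dynamics, each (state, action) pair has exactly one successor, so the MDP is just a weighted graph and value iteration reduces to Bellman-style dynamic programming over paths. A sketch on a hypothetical 4-node graph (the states and rewards are made up for illustration):

```python
# Deterministic MDP as a weighted graph: state -> list of (reward, successor).
graph = {
    "a": [(1.0, "b"), (0.0, "c")],
    "b": [(2.0, "d")],
    "c": [(5.0, "d")],
    "d": [],                              # terminal state
}

def values(gamma=1.0, sweeps=10):
    V = {s: 0.0 for s in graph}
    for _ in range(sweeps):
        V = {s: max((r + gamma * V[t] for r, t in graph[s]), default=0.0)
             for s in graph}
    return V

# With gamma = 1, value iteration here is exactly a best-reward-path
# computation: V["a"] = 5.0 via a -> c -> d, so optimal policies reduce
# to path-finding in the graph.
```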

2. We talked about how the infinite horizon case is easier than the finite horizon case, and said, in passing, that this is one case where "infinite" is easier to handle. Consider the following two cases:

2.1. CSPs where the variables are "real valued", and constraints between variables are expressed as linear inequalities. Clearly, the number of potential assignments for the CSP variables is infinite. What do you think will be the complexity of finding a satisfying assignment of the variables? (recall that discrete variable CSP is NP-complete) If the complexity is "low" by AI standards, what can you do to increase it? (hint: consider modifying constraints).
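A real-valued linear CSP is a linear-programming feasibility problem, which is decidable in polynomial time despite the infinite assignment space. Here is a toy decision procedure for the two-variable case via Fourier-Motzkin elimination (the `(a, b, c)` encoding for `a*x + b*y <= c` and the two-variable restriction are my own choices, just to make the idea concrete):

```python
def feasible(cons):
    # cons: list of (a, b, c) meaning a*x + b*y <= c, over the reals
    ups, lows, xcons = [], [], []          # bounds on y; constraints on x alone
    for a, b, c in cons:
        if b == 0:
            xcons.append((a, c))           # a*x <= c
        elif b > 0:
            ups.append((a / b, c / b))     # y <= c' - a'*x
        else:
            lows.append((a / b, c / b))    # y >= c' - a'*x (sign flips)
    # eliminate y: every lower bound must lie below every upper bound
    for a1, c1 in lows:
        for a2, c2 in ups:
            xcons.append((a2 - a1, c2 - c1))   # (a2 - a1)*x <= c2 - c1
    lo, hi = float("-inf"), float("inf")
    for a, c in xcons:
        if a > 0:
            hi = min(hi, c / a)
        elif a < 0:
            lo = max(lo, c / a)
        elif c < 0:
            return False                   # constraint "0 <= c" is violated
    return lo <= hi
```

For instance, `feasible([(1, 1, 4), (-1, 0, 0), (0, -1, 0)])` (the system x + y <= 4, x >= 0, y >= 0) is satisfiable, while x <= 0 together with x >= 1 is not.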

2.2. Planning problem where (some of) the variables are real valued. Actions have effects that can increment and decrement the variables by arbitrary amounts. What do you think will be the complexity of planning in this case? (recall that discrete variable planning is PSPACE-complete, i.e., it is among the hardest problems that can be solved with polynomial space).
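To make the planning question concrete, here is a toy forward-search "planner" over numeric state variables (a sketch only; the action names and integer deltas are made up, and the search bound is an arbitrary cutoff). With arbitrary real-valued increments the reachable state space is infinite, which is exactly what makes the complexity question interesting:

```python
from collections import deque

def plan(start, goal, actions, limit=10_000):
    # Breadth-first search over numeric states; `limit` caps the number of
    # distinct states visited, since the space is unbounded.
    seen, frontier = {start}, deque([(start, [])])
    while frontier and len(seen) < limit:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for name, delta in actions.items():
            nxt = tuple(v + d for v, d in zip(state, delta))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [name]))
    return None                       # not found within the search bound
```

Note that `plan` is not a decision procedure: a `None` result only means no plan was found within the bound, and in general reachability questions of this kind can be very hard.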

3. The MDPs we considered until now are called "fully observable"--in that during execution, the agent can tell which state it is in (thus the policy needs to only map states to actions). What happens if the domain is only "partially observable"?

Note that this means that the agent may not know which unique state it is in, but knows the "probability distribution" over the possible states it could be in. When the agent does an action, the effect of the action is to modify the distribution into a new distribution over states (with some states becoming more likely and others less likely).

Notice that the "distribution" is fully observable although the underlying states aren't.

So, one idea will be to consider the distributions as the states. What happens to the complexity of MDPs? How is this connected to MDPs over "continuous" state spaces?

[Notice that the above still works for the special case of a fully observable domain. The initial distribution will be a delta function--with probability 1.0 for a single state, and 0.0 for all others. Doing an action converts it into another delta function--with the 1.0 on a different state.]
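The distribution-as-state idea can be sketched in a few lines. This shows only the action (prediction) step of the belief update -- conditioning on an observation via Bayes' rule is omitted -- and the `T[s][a][s2]` indexing convention is my own assumption:

```python
# T[s][a][s2] = probability of reaching s2 when doing action a in state s.
def belief_update(belief, action, T):
    n = len(belief)
    new = [sum(belief[s] * T[s][action][s2] for s in range(n))
           for s2 in range(n)]
    total = sum(new)
    return [p / total for p in new]   # renormalize

# For a fully observable, deterministic-transition domain, a delta-function
# belief stays a delta function, as the bracketed note above describes.
```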

That is all for now...

Rao


## 2 comments:

1. The deterministic case seems to be a special case of the MDP (i.e., the transition probabilities are either zero or one). So every algorithm that works for stochastic problems will also work on deterministic ones (though of course the relative efficiency will change). As for non-stationary policies: yes, we'll need them, but only up to depth N (the state count). (Beyond that, they would have to cycle.)

2.1. Linear programming over the reals is solvable in polynomial time, but restricting the variables to integers (integer programming) makes the problem NP-complete. Non-linear constraints are harder still, so we could use them to increase the complexity (for example, x^2 - x < 0, etc.).

2.2. If we only have increments and decrements by arbitrary amounts, the constraints stay linear (and thus the problem should stay in NP). However, if some actions can do more (e.g., x = sqrt(x), etc.), then the problem may no longer be decidable. (I guess I should not make such a bold statement here.)

3. If we treat the distributions as states, then this model is a discretization of a continuous MDP (an MDP over a "continuous" state space). It discretizes (i.e., splits into particular points) the continuous function representing the 'states' of the continuous MDP. Note that a particular discretization can fit a large number of continuous MDPs.

Thus the complexity in this case is the same as for discretized continuous MDPs, which is at least PSPACE-hard, i.e., harder than regular MDPs.
