**Posted: April 16, 2013 at 1:15 pm**

Following are some thoughts on Reinforcement Learning (RL) in relation to the current conception of the project. Note that this assumes a basic introductory understanding of RL.

### State and Action Space

Let’s say that each state is one possible arrangement of percepts. If there are, say, 15 percepts, and their activation is binary (they appear or not, no degrees), then that would be 2^15 possible states, right? Additionally, each state could lead to 2^15 possible actions (one action for every possible activation of percepts). 15 percepts is absurdly few; even 100 is too few. I’m aiming for 1000s (visually speaking).

I realize that it’s possible to reduce the number of states by only allowing certain combinations of percepts, perhaps by “pruning” the state space, which we should talk about. If a typical single image has 70 segments, then it seems like 70 percepts would be the lower limit, and that seems to imply 2^70 possible states. It seems the action space could be smaller, but as of yet it’s unclear to me how to limit it.
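To get a feel for how quickly these numbers grow, here is a minimal sketch that counts binary percept arrangements for the sizes mentioned above. The function name `num_binary_states` is just illustrative, and it assumes every combination of percepts is allowed (no pruning).

```python
def num_binary_states(num_percepts: int) -> int:
    """Count the on/off arrangements of `num_percepts` binary percepts."""
    return 2 ** num_percepts


# Percept counts taken from the discussion above: 15, 70 and "1000s".
for n in (15, 70, 1000):
    count = num_binary_states(n)
    # The number of decimal digits minus one gives the order of magnitude.
    print(f"{n} percepts -> roughly 10^{len(str(count)) - 1} states")
```

Even at 70 percepts the count is far beyond anything a plain lookup table could hold, which is why pruning or a more compact representation seems necessary.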

Additionally, by being Markovian, the future state may depend only on the current state, but I’m not sure this holds in this case. Can the current set of selected percepts really predict the next set without information from previous states? For background percepts, it makes sense that the system could assume that the most likely next state would be the same as the current state (with minor chances of other states). For foreground percepts, the current state may have nothing to do with the next state. If we look at the plots of the number of foreground percepts over time on the blog, we see there are long gaps of no foreground, and it’s quite hard to see any structure there.

Then there is the issue of chimeric elements (fusions of multiple people / places). If the state-space is every possible collection of images, and there are only transitions from seen combinations of percepts to previously seen combinations of percepts, then no fusions are possible. How could fusions be possible? One idea is that if we break the state space into chunks, then we could have two different (incompatible) predictions in two different chunks, leading to impossible fusions. I have no idea on what basis the state-space would be broken up this way.
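The chunking idea can be illustrated with a toy example. This is only a sketch under made-up assumptions: four binary percepts, two hypothetical observed states, and an arbitrary split into two chunks of two percepts each. If each chunk only remembers the sub-states it has seen, recombining them can produce whole states that were never observed.

```python
from itertools import product

# Hypothetical observed states over 4 binary percepts (illustrative data only).
seen_states = [(1, 0, 1, 0), (0, 1, 0, 1)]


def split(state, size=2):
    """Split a state into a left chunk and a right chunk (arbitrary choice)."""
    return state[:size], state[size:]


# Each chunk only knows the sub-states it has seen.
left_seen = {split(s)[0] for s in seen_states}
right_seen = {split(s)[1] for s in seen_states}

# Chunks predict independently, so any pairing of seen halves is reachable.
reachable = {left + right for left, right in product(left_seen, right_seen)}

# "Fusions" are reachable whole states that were never observed together.
fusions = reachable - set(seen_states)
print(sorted(fusions))
```

Here the fusions are exactly the states that pair the left half of one observation with the right half of the other. How the real state-space should be split is still the open question.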

I remember we talked about using the circadian clock to provide a reference cycle for percepts. It does make sense that the foreground percepts are much more likely at particular times of day than others. It’s unclear to me how RL would learn this… This may be back to a question of representation…

### Representation

While the core algorithms are fairly clear, I still don’t have a very good sense of how variables are represented. For example, maybe a more compact representation that could reduce the state space and maintain the number of percepts is possible.

There is also the issue that clusters (once the max has been reached) shift their values as they continue to integrate new sensory information. They don’t always represent the same thing, and it’s possible they shift a lot in content over time. How would this affect RL when each “state” could actually be changing over time?

I’m unclear on how policies are actually represented. It seems to make sense that the transition function and action selection would be stochastic, so learning would be manifest in the changes of probabilities from one state to another. What are the bounds of this table though? A table holding the probability that state B follows state A, for every pair of states, would be really big.
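To make the size question concrete: a dense table over all pairs of states needs (2^n)^2 entries, whereas a table that only stores transitions actually observed stays as small as the experience. The sketch below is an assumption on my part about one possible representation, not a settled design; `SparseTransitionTable` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


def dense_table_entries(num_percepts: int) -> int:
    """Entries in a full state-by-state table over binary percepts."""
    num_states = 2 ** num_percepts
    return num_states * num_states


class SparseTransitionTable:
    """Stores counts only for transitions that have actually been seen."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, state, next_state):
        """Record one observed transition from `state` to `next_state`."""
        self.counts[state][next_state] += 1

    def probability(self, state, next_state):
        """Empirical estimate of P(next_state | state); 0.0 if state unseen."""
        row = self.counts.get(state)
        if not row:
            return 0.0
        return row[next_state] / sum(row.values())


table = SparseTransitionTable()
a, b, c = (1, 0, 0), (1, 1, 0), (0, 0, 1)
for nxt in (b, b, b, c):
    table.observe(a, nxt)
print(dense_table_entries(15), table.probability(a, b))
```

Learning would then show up as these empirical probabilities shifting as more transitions are observed.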

### Exploration vs Exploitation

Knowledge is always exploited during perception, mind-wandering and dreaming. Mind-wandering and dreaming could be “exploration”, where less likely probabilities are used in the prediction.

This could be controlled by habituation, such that maximal habituation causes actions to be selected based on their inverse probability?
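One way this might look, purely as a sketch: blend the learned probabilities with their normalized inverses, with a habituation level in [0, 1] controlling the mix. The linear blend, the function names, and the `eps` guard against division by zero are all assumptions for illustration.

```python
import random


def habituated_weights(probs, habituation, eps=1e-9):
    """Blend probabilities with their normalized inverses.

    habituation = 0.0 keeps the learned probabilities unchanged;
    habituation = 1.0 selects purely by inverse probability.
    """
    if not 0.0 <= habituation <= 1.0:
        raise ValueError("habituation must be between 0 and 1")
    inverse = [1.0 / (p + eps) for p in probs]
    inv_total = sum(inverse)
    inverse = [w / inv_total for w in inverse]
    return [(1.0 - habituation) * p + habituation * q
            for p, q in zip(probs, inverse)]


def select_action(actions, probs, habituation, rng=random):
    """Sample one action using the habituation-adjusted weights."""
    weights = habituated_weights(probs, habituation)
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical actions and learned probabilities.
actions = ["a0", "a1", "a2"]
probs = [0.7, 0.2, 0.1]
print(habituated_weights(probs, 1.0))
print(select_action(actions, probs, 1.0))
```

At maximal habituation the least likely action becomes the most likely to be chosen, which is the inversion described above.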

What is the purpose of exploration when the maximal reward will always be the closest prediction for the next time-step? There is no location of the environment that has the potential to provide some hidden greater reward.

There is this thought of dreaming as memory consolidation. This would be manifest in pruning. How could exploration in RL result in pruning? (Could the exploration process be used to feed a possible “chunking” of the state-space as described earlier?)

For “exploration” to be useful, a reward must be given, but dreaming and mind-wandering are driven in the absence of external sensory information, and thus no reward is possible. Could expected reward come into play here?

Why would dreams be the selection of less likely actions? This would contribute to “bizarreness”, but in terms of theory dreams should be the same prediction as perception, always the most likely next state. It’s only the lack of external perception that leads dreams and mind-wandering to diverge from reality.

### Sketch of RL

I think DM3 should be model-free and myopic (no discount for future reward), and so it would be running on a non-discounted infinite horizon. Learning with Monte Carlo?

**S** The state space encodes all possible combinations of percepts (whose number is fixed once a max has been reached). Note this is not quite true, because we don’t stop adding percepts in the middle of segmenting a frame, so the max number of percepts is not exactly known in advance.

**A** The action space resembles the state space. It’s still all the possible combinations of all known percepts. An action is the selection of percepts (priming) that are expected next. There is no difference between S′ and A in this case, as actions have known effects.

**T** This is unclear, but should be stochastic.

**R** The distance between A and S, such that if S=A then the reward is maximal; if S is the opposite of A, then the reward is negative.

It seems that lots of the equations can be highly simplified for this context, though I’m not sure how it all fits together.
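As one possible reading of the reward in the sketch above (maximal when A matches S, negative when they are opposite), here is a minimal version based on the fraction of mismatched percepts. The scaling to the range [-1, 1] and the name `prediction_reward` are assumptions; any distance over binary percept vectors could be swapped in.

```python
def prediction_reward(predicted, observed):
    """Reward in [-1, 1] from the mismatch between two binary percept vectors.

    Returns 1.0 when they are identical and -1.0 when every percept differs.
    """
    if len(predicted) != len(observed):
        raise ValueError("vectors must have the same length")
    if not predicted:
        raise ValueError("vectors must not be empty")
    mismatches = sum(p != o for p, o in zip(predicted, observed))
    return 1.0 - 2.0 * mismatches / len(predicted)


# Hypothetical primed percepts (A) and percepts actually observed (S).
action = (1, 0, 1, 1)
print(prediction_reward(action, (1, 0, 1, 1)))
print(prediction_reward(action, (0, 1, 0, 0)))
```

With a reward this simple, the myopic case reduces to picking the action closest to the expected next state.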