I had a valuable meeting of more than two hours with two fellow students specializing in AI (Omid Alemi: Multi-Agent Systems, Graeme McCaig: Deep Learning) to discuss how RL could be applied in this project, and to solidify my vague thoughts on the notion. Consensus was that the fit between the problem (predicting what percepts will appear at the next state, given the current state) and RL is not straightforward. This is because RL is aimed at allowing an agent to make a sequence of actions that maximizes reward, which is given when actions result in desirable states. Core to RL is the notion of delayed rewards, where a current action can inherit some of the reward given by a state attained in the future. Even if the action itself does not lead to reward, if it leads to another state and another action that leads to a reward, then it is still a valuable action for this state. As RL is oriented towards an agent’s behaviour in a complex environment, it includes an explicit way of trading off between exploitation (do what you know is best) and exploration (look around to find a better way than the one you currently think is best).
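To make the idea of delayed reward concrete, here is a minimal sketch of the standard discounted return, in which each step inherits a discounted share of the rewards that follow it. The function name `discounted_returns` and the discount factor `gamma` are my own illustrative choices and are not part of any implementation in this project.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t.

    rewards[t] is the reward received after the action taken at step t.
    gamma (0..1) controls how much future reward flows back to earlier steps.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can inherit from the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Only the last action is rewarded directly, yet the earlier
    # actions still receive credit when gamma > 0.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))
    # With gamma = 0 only the immediate reward counts.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.0))
```

With `gamma = 0` no credit flows backwards at all, which is the degenerate case in which every reward is immediate.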
In our case, the prediction of sensory information, the reward is determined by the error between the prediction and the actual sensory information. The greatest possible reward corresponds to zero error, and that is known immediately. Because rewards are immediate, and the greatest possible reward is a prediction identical to reality, the role of exploration is unclear. In a proper RL framework, the purpose of exploration is to find another causal path to greater reward, but if no such path exists, exploration serves no purpose. By the end of the meeting, a recording of which is available here (apologies for the poor quality), we had not reached consensus but had considered a few ways that RL could be applicable, and also the alternative of an MLP for the purpose of prediction. While I was convinced that the awkwardness of RL for this project meant that an MLP may be more appropriate, Graeme continued to consider the mapping between the system as currently conceptualized and RL, which we are still discussing. Following is an attempt to list pros and cons for MLP vs RL for this project:
- RL
  - Relation to psychological conditioning
  - Model is implicit in policy
  - Can be trained continuously
  - Atypical mapping (no environment, no delayed reward)
  - Large state space
  - Optimization during dreaming (pruning) does appear naturally. (Pruning of unlikely transitions?)
  - Does the Markovian property hold for this context?
  - Can the system imagine unseen patterns?
  - Requires lots of training iterations
  - Requires custom implementation (few choices for C++ libraries)
- MLP
  - Designed to learn patterns of vectors from supervised error.
  - Dreaming stimulation would be feeding output back into input.
  - Pruning could remove weak links (optimization during sleep)
  - Lots of existing C++ libraries
  - Possible over-fitting with continuous training?
  - Unknown topology (tuning required)
  - Can the system imagine unseen patterns?
Up to this point I’ve been considering ML methods in the context of the system learning to predict which percepts are likely to appear next. The predictive system is not so easy to excise from the previous conception of associative propagation. Three features of the system, as currently conceptualized, do not map obviously onto RL:
- Continuity between perception / mind-wandering and dreaming (change in behaviour of a single system)
- Habituation was initially included to reduce over-activation. This is no longer an issue, but habituation does offer attention-like mechanisms and would allow a lack of novelty in stimulus to lead to mind-wandering.
- Pruning: During sleep, shifts in activation that resemble the shifts between REM and NREM sleep could allow weak links between neurons to be pruned, optimizing the network.
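As a rough illustration of the pruning idea only, the sketch below zeroes every link whose weight magnitude falls under a threshold. Simple magnitude thresholding, the function name `prune_weak_links`, and the `threshold` parameter are all assumptions made for this example. The mechanism by which sleep-like shifts in activation would select links for pruning is not specified here.

```python
def prune_weak_links(weights, threshold):
    """Return a copy of a weight matrix with weak links set to zero.

    weights   -- list of rows, one row of link weights per neuron
    threshold -- links with |weight| below this value are removed
    (Magnitude thresholding is an assumed stand-in for sleep-driven pruning.)
    """
    return [
        [w if abs(w) >= threshold else 0.0 for w in row]
        for row in weights
    ]


if __name__ == "__main__":
    links = [[0.8, -0.05], [0.01, -0.6]]
    print(prune_weak_links(links, threshold=0.1))
```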
The most important of these is the continuity between waking, mind-wandering and dreaming. This is a key aspect of the theory proposed as part of this research. In the case of RL, waking perception would be the system learning what percepts are active and priming the percepts it expects to be next. Dreaming would be the exploitation of that learning. It seems that this would require a hard switch between learning and exploitation, which does not seem compatible with the continuity described in the theory.
As currently conceptualized, habituation reduces overall activation in the associative network, and this lack of activation initiates mind-wandering. When the background is highly static, the system would free-associate, but that free-association could be interrupted by non-habituated external stimulus. One proposal for the function of mind-wandering is to cause dishabituation, the theory being that when memory structures are over-activated, a chance for the brain to rest can improve performance. Mind-wandering is considered a fusion of waking and dreaming states, where percepts are activated by both external perception and internal mechanisms simultaneously, as modulated by overall activation. In the case of RL, it’s unclear what habituation would be, or what a mix of exploration and exploitation would look like. It seems that it would require a feedback loop. During mind-wandering, the current state should be partly due to perception and partly due to the system’s own prediction. Ideally the current state would be a sum of the current external state and the predicted state. The differences between mind-wandering, dreaming and perception would be modulations of the degree of impact of external stimuli on this feedback process.
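One way to picture this modulation is as a weighted sum of the external and predicted activations. The sketch below is only an illustration under assumptions of my own: the name `blend_state`, the single scalar `external_weight`, and the use of a convex combination are not settled parts of the design.

```python
def blend_state(external, predicted, external_weight):
    """Mix external and predicted percept activations into one state.

    external_weight = 1.0 -> pure perception (waking)
    external_weight = 0.0 -> pure prediction (dreaming)
    values in between     -> mind-wandering
    """
    if not 0.0 <= external_weight <= 1.0:
        raise ValueError("external_weight must be between 0 and 1")
    if len(external) != len(predicted):
        raise ValueError("state vectors must have the same length")
    return [
        external_weight * e + (1.0 - external_weight) * p
        for e, p in zip(external, predicted)
    ]


if __name__ == "__main__":
    seen = [1.0, 0.0, 0.0]       # activations driven by the senses
    expected = [0.0, 1.0, 0.0]   # activations driven by the predictor
    for w in (1.0, 0.5, 0.0):
        print(w, blend_state(seen, expected, w))
```

In this picture habituation and the circadian clock would both act by moving `external_weight`, rather than by switching between separate modes.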
It is unclear how RL learning would function considering feedback along these lines. While habituation is also unclear when it comes to the MLP, the feedback described above seems quite natural. There is even a class of ANNs that learn temporal sequences through recurrence, where the outputs are fed back as inputs. Perception would be the MLP learning to predict the next state, where its input would be entirely dependent on external sensory information. A dream would be the result of the output of the predictor being fed directly back into the input, which would cause a second output (and a second set of activations) and so on. This loop would clearly diverge from the starting input and be a simulation of a sequence. It’s unclear to me whether the dreaming recurrent MLP would be able to construct new spatial arrangements of objects not seen in perception. Mind-wandering would be a sum of the MLP’s output and external sensory input. I’m not sure how this would affect learning, but it provides nice smooth continuity between states, as the effects of external and predicted activations are modulated according to habituation and the circadian clock. In fact, if habituation only affects rendering and the modulation of state, then it could be a system independent of the RL or MLP predictor, and simply affect how percepts are rendered and the weight of internal and external stimulus on the predictor.
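The perception and dreaming loops described above can be sketched with a very small MLP. This is a toy under stated assumptions, not the project’s implementation: `TinyMLP`, `train_on_sequence` and `dream` are hypothetical names, the network has one hidden layer and no bias terms, and percepts are stood in for by short activation vectors.

```python
import math
import random


class TinyMLP:
    """One-hidden-layer MLP trained online to map state(t) -> state(t+1)."""

    def __init__(self, n_state, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_state)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_state)]

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        y = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h))))
             for row in self.w2]
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        """One gradient step on squared error; returns the error before the update."""
        h, y = self.forward(x)
        d_out = [(yi - ti) * yi * (1.0 - yi) for yi, ti in zip(y, target)]
        # Hidden deltas are computed before the output weights change.
        d_hid = [(1.0 - hj * hj) * sum(d_out[k] * self.w2[k][j]
                                       for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, row in enumerate(self.w2):
            for j in range(len(row)):
                row[j] -= lr * d_out[k] * h[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * d_hid[j] * x[i]
        return sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def train_on_sequence(mlp, sequence, epochs):
    """Perception: learn each next state from the current external state.

    Returns the summed error of the first and the last epoch.
    """
    first = last = None
    for _ in range(epochs):
        err = sum(mlp.train_step(cur, nxt)
                  for cur, nxt in zip(sequence, sequence[1:]))
        if first is None:
            first = err
        last = err
    return first, last


def dream(mlp, start, steps):
    """Dreaming: feed each prediction straight back in as the next input."""
    state = list(start)
    trace = []
    for _ in range(steps):
        state = mlp.predict(state)
        trace.append(state)
    return trace


if __name__ == "__main__":
    # Three stand-in percept patterns that repeat in a cycle.
    seq = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
    net = TinyMLP(n_state=3, n_hidden=6)
    print(train_on_sequence(net, seq, epochs=300))
    for state in dream(net, seq[0], steps=6):
        print([round(v, 2) for v in state])
```

Note that during dreaming the network sees its own soft outputs rather than the clean patterns it was trained on, so small errors can accumulate from step to step. That is one concrete reason the dreamed sequence may drift away from anything perceived.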
After this discussion and my thoughts above, it seems to me that an MLP is the most appropriate ML method for the predictor in this system. I’m still open to RL as an option, but as of yet it seems unclear what RL would offer for the system that the MLP would not.