Synthetic Dataset Continued

If the MLP is able to learn a sequence, and demonstrates that learning by producing the correct pattern for a particular input, then feedback should result in the network replaying the sequence. From the network's point of view, being fed a given state is the same no matter where that pattern comes from, whether the training data or its own previous output. So why does the network appear to learn the sequence in the previous post, while feedback does not result in a replay of the learned sequence?

The answer is that the network is not learning the sequence as well as it appears. This is evident in the fact that the non-discretized output of the network is quite dissimilar to the input patterns, even though the discretized output matches them. What seems to happen is that any differences between the outputs and the desired state of the sequence accumulate through the feedback process, such that even when the outputs are discretized, error seems to increase over multiple iterations of feedback. If this is indeed the problem, then the clear solution is to train the MLP until error is so low that feedback results in replay. This raises the risk of over-fitting, though: the network may replay only when those exact patterns are presented, and thus not be tolerant to noise.

In order to investigate this, I've trained a batch of MLPs with different parameters to determine which ones learn best. The following plot shows the error reported after each epoch for a number of different networks, all trained on the same data. The only variables changed are whether the learning rate decreases over time, and the number of hidden units.
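To make the feedback procedure concrete, here is a minimal sketch of the replay loop. It is not the code used for these experiments: `predict` is a hypothetical stand-in for the trained MLP's forward pass, the 0.5 threshold for discretization is an assumption, and the one-hot sequence with a perfect "shift by one" predictor is only a toy to show what a successful replay looks like.

```python
import numpy as np

def replay(predict, start, steps, discretize=True, threshold=0.5):
    """Feed the predictor's own output back in as its next input.

    predict   : callable mapping one state vector to the next (hypothetical
                stand-in for the trained MLP's forward pass)
    start     : the state used to seed the feedback loop
    steps     : number of feedback iterations
    discretize: if True, threshold each output before feeding it back
                (the 0.5 threshold is an assumption)
    """
    state = np.asarray(start, dtype=float)
    outputs = []
    for _ in range(steps):
        state = np.asarray(predict(state), dtype=float)
        if discretize:
            state = (state >= threshold).astype(float)
        outputs.append(state)
    return np.array(outputs)

# Toy sequence of four one-hot states; each state is the previous one
# shifted right by one position.
sequence = np.eye(4)

# A perfect predictor for this toy sequence (not a trained network).
perfect = lambda x: np.roll(x, 1)

# Seed with the first state and run three feedback iterations.
replayed = replay(perfect, sequence[0], steps=3)
```

With a perfect predictor, `replayed` matches `sequence[1:]` row for row. With a real network, any deviation in one iteration becomes part of the next input, which is how small errors can compound across iterations.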


Performance is pretty consistent between the decreasing and constant learning rate tests, and poor performance seems largely due to an insufficient number of hidden units. Of course the real test is how close each output pattern is to the corresponding sequence state (rather than the MSE of the whole network, as pictured above). In the following plot the x axis is the iteration of the feedback of the MLP, and the y axis is the difference between the output at that iteration of the feedback and the sequence state it should correspond to. A perfect replay would appear as a horizontal line at zero. Note that according to this measure, the constant learning rate seems to provide better results.
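The per-iteration measure described above, and the idea of ranking runs by their summed error, could be computed roughly as follows. This is a sketch under assumptions: the post does not say how the "difference" is calculated, so mean absolute difference is used here as a placeholder, and the function names, run names, and toy arrays are all hypothetical.

```python
import numpy as np

def replay_error(replayed, targets):
    """Difference between each feedback output and its target state.

    Returns one value per feedback iteration. Mean absolute difference
    is an assumption; the original measure is not specified.
    """
    replayed = np.asarray(replayed, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.abs(replayed - targets).mean(axis=1)

def best_run(errors_by_run):
    """Name of the run with the smallest error summed over all iterations."""
    return min(errors_by_run, key=lambda name: errors_by_run[name].sum())

# Toy target sequence: three one-hot states.
targets = np.eye(3)

# One hypothetical run replays perfectly; the other loses the last state.
good = targets.copy()
bad = targets.copy()
bad[2] = 0.0

errors = {
    "good_run": replay_error(good, targets),
    "bad_run": replay_error(bad, targets),
}
winner = best_run(errors)
```

Plotting each entry of `errors` against the iteration index would give one curve per run, with a perfect replay sitting flat at zero.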


The test with the least sum of errors over all iterations had 19 hidden units and used constant learning, and resulted in the following sequences:

[Figure: five panels, left to right: Input (states-training-1-epoch), D-Input (PHASE2_19Hidden_constantLearning-valIndex), Input (PHASE2_19Hidden_constantLearning-valIndexCont), D-Feedback (PHASE2_19Hidden_constantLearning-valFeedback), Feedback (PHASE2_19Hidden_constantLearning-valFeedbackCont)]

This is quite a poor example of replay beyond the first two iterations. On the hunch that this relative success was accidental, I ran the same test with different initial weights, which resulted in the following:


Clearly, there is a lot of variance in performance between runs, and none of them learned as well as the best (accidental) case selected above. So the gist seems to be that this network is just not learning this sequence, and it is unclear how to make it do so, short of abandoning much work and moving to another method (e.g. an RNN) and a wholly new implementation of the predictor.