Revisiting ML for Zombie Formalist

Since my last post on ML for the ZF, I’ve been running the system on Twitter and collecting data. The working assumption is that the model’s failure to generalize (i.e., to work accurately on the test set) is due to a lack of data. Since the classes are imbalanced, with many more “bad” compositions than “good” ones, I end up throwing out a lot of the generated data to keep the classes balanced.
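For concreteness, here is a minimal sketch of that kind of balancing-by-discarding, assuming the logged compositions live in a pandas DataFrame with a “label” column; the column name and CSV path are hypothetical stand-ins, not the actual ZF pipeline:

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str = "label",
                seed: int = 0) -> pd.DataFrame:
    """Randomly drop majority-class rows so both classes end up the same size."""
    counts = df[label_col].value_counts()
    n = counts.min()  # size of the minority class
    parts = [df[df[label_col] == lab].sample(n=n, random_state=seed)
             for lab in counts.index]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows

# balanced = undersample(pd.read_csv("zf_interactions.csv"))  # hypothetical file
```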

In the previous experiment I balanced classes only by removing samples that received very low attention. I considered these interactions spurious and expected they would just add noise. That data-set (E) had 568 good and 432 bad samples. The results of the most recent experiment follow.
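A rough sketch of that filtering step, assuming each sample carries an aggregate “attention” score; the threshold value and column name here are hypothetical:

```python
import pandas as pd

ATTENTION_THRESHOLD = 0.05  # hypothetical cut-off for "very low attention"

def drop_spurious(df: pd.DataFrame) -> pd.DataFrame:
    """Remove samples whose attention is too low to be a deliberate response."""
    kept = df[df["attention"] >= ATTENTION_THRESHOLD]
    print(f"kept {len(kept)} of {len(df)} samples")
    return kept
```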

In this experiment, I zeroed out the layers that are not visible in the final composition. I also balanced the classes and filtered out spurious low-attention interactions. The best model (selected by 10-fold cross-validation on the validation set) achieved accuracies of 62% (training), 65% (validation), and 54% (test). Its f1-scores were 56% (“bad”) and 71% (“good”) on the validation set, and 43% (“bad”) and 61% (“good”) on the test set. The data-set has 1084 samples per class. While this is a little more than double the data of the previous set, the pattern of poor generalization persists. It will take months to double the data yet again, and this is really slowing down the ML side of the project.
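As a sketch of what that selection loop could look like (the classifier here is a generic stand-in, not the actual ZF model, and the feature array X and label array y are assumed to be extracted already):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def cross_validate(X: np.ndarray, y: np.ndarray, n_splits: int = 10):
    """10-fold CV reporting mean accuracy and per-class f1-scores."""
    accs, f1_bad, f1_good = [], [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[val_idx])
        accs.append(accuracy_score(y[val_idx], pred))
        # pos_label selects which class each f1-score is computed for
        f1_bad.append(f1_score(y[val_idx], pred, pos_label="bad"))
        f1_good.append(f1_score(y[val_idx], pred, pos_label="good"))
    return np.mean(accs), np.mean(f1_bad), np.mean(f1_good)
```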

I changed the train/validation/test split ratio (it was 75/15/15) to 50/25/25, hoping that the greater variation within the validation set would translate to the test set. This was not the case. With the previous split there would have been 426 “good” samples in the training set; with the new split (and the larger data-set) there are 542. This is not much of a change in terms of increasing the training data to improve generalization. To compare with the previous experiment, I’d have to go back to the old split and re-run it. I should also re-run the experiments where I used Twitter-only and attention-only labels; I wonder whether there is too much variation across audiences…
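The 50/25/25 split itself can be done in two stratified stages so the (already balanced) classes stay even across all three sets; a sketch with scikit-learn, assuming array-like features and labels:

```python
from sklearn.model_selection import train_test_split

def split_50_25_25(X, y, seed=0):
    # First peel off 50% for training...
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=seed)
    # ...then split the remainder evenly: 25% validation, 25% test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```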