Since I’ve been having trouble with generalizing classifier results (where the model achieves tolerable accuracy on training, and perhaps validation, data but performs poorly on test data), I thought I would throw more data at the problem: I combined all of the Twitter data collected to date (even though some of the code changed between test runs) into a single dataset. This super-set contains 12,861 generated compositions, 2,651 of which were uploaded to Twitter. I labelled samples as “good” where their score was greater than 100 (at least one like or RT, and enough in-person attention to be uploaded to Twitter). After filtering outliers (twice the system “saw” a face where there was no face, leading to very large and impossible attention values), this results in 1,867 “good” compositions. After balancing the classes, the total set ends up with 3,734 “good” and “bad” samples. That is still not very big compared to my hand-labelled 15,000-sample pilot set, which contained 3,971 “good” compositions. The amalgamated super-set was used for the experiments that follow.
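The labelling and balancing procedure above could be sketched as follows; this is a minimal illustration, assuming a score threshold of 100 and downsampling the larger class, with an invented `score` field rather than the project’s actual data schema:

```python
import random

def label_and_balance(samples, threshold=100, seed=42):
    """Split samples into 'good' (score > threshold) and 'bad' classes,
    then balance by downsampling the larger class."""
    good = [s for s in samples if s["score"] > threshold]
    bad = [s for s in samples if s["score"] <= threshold]
    random.seed(seed)
    n = min(len(good), len(bad))  # size of the smaller class
    return random.sample(good, n), random.sample(bad, n)

# Toy data: 5 high-scoring and 8 low-scoring compositions.
data = [{"score": 150}] * 5 + [{"score": 10}] * 8
good, bad = label_and_balance(data)
print(len(good), len(bad))  # balanced classes: 5 5
```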
The first results showed a small improvement in accuracy: 61% test accuracy, compared to 54% for the smaller, most recent dataset. Even with this larger super-set, the problem is still generalization. The best model (according to validation accuracy) achieved 63% accuracy on the training set, 67% on the validation set and 61% on the test set. This is still well short of my best results for my hand-labelled set (~74% test accuracy).
I had an idea that my accuracy issues could perhaps be solved by grouping parameters and training multiple smaller networks on those groups. This could mean needing less training data and could improve classification accuracy. I split my features into six groups: COLOUR, SSTM (composition skew, scale, translation and mode), CT (stripe contrast and threshold for layers), FREQ (stripe frequency for layers), OFFSET (X and Y offsets of layers) and ORIENT (stripe orientation for layers). The smaller networks did not improve accuracy with the same amount of data. In fact, the COLOUR, CT and OFFSET groups achieved only slightly better than chance accuracy on the test set (~53%). The best-performing groups, SSTM, FREQ and ORIENT, achieved ~58% test accuracy, which is very close to the ~60% using all parameters. So it actually seems that it is not the amount of data that is the problem so much as that perhaps the aesthetic cannot be generalized.
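The grouping could look something like this; a minimal sketch, assuming the features arrive as a flat vector, with column indices and per-group widths invented for illustration (the real feature layout will differ):

```python
# Hypothetical column layout; the real feature order and counts differ.
FEATURE_GROUPS = {
    "COLOUR": [0, 1, 2],     # per-layer colour parameters
    "SSTM":   [3, 4, 5, 6],  # composition skew, scale, translation, mode
    "CT":     [7, 8],        # stripe contrast and threshold
    "FREQ":   [9, 10],       # stripe frequency per layer
    "OFFSET": [11, 12],      # layer X/Y offsets
    "ORIENT": [13, 14],      # stripe orientation per layer
}

def split_by_group(row):
    """Return one feature sub-vector per group, to feed each group's
    smaller network, given a full feature row."""
    return {g: [row[i] for i in idx] for g, idx in FEATURE_GROUPS.items()}

row = list(range(15))
subsets = split_by_group(row)
print(subsets["SSTM"])  # [3, 4, 5, 6]
```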
I decided to do a test experiment where, rather than using Twitter engagement and attention as a proxy for aesthetic value, I would specify a rule for value: only concentric circles are good. So I generated “good” labels for 90% of all concentric circles in the super-set, leaving the other 10% as “noise”; these, combined with non-concentric-circle samples, were labelled “bad”. This tests a perfect scenario where the viewer has a narrow aesthetic and consistently selects compositions for it. The results are unsurprisingly much better! The best model (selected for validation accuracy) achieved accuracies of 95% (training), 97% (validation) and 92% (test). 10 “bad” compositions were detected as “good”, 55 concentric circles were included in the “bad” class, and only 2 “good” compositions were predicted to be “bad”.
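The rule-based labelling could be sketched like this; a minimal illustration, assuming a boolean `is_concentric` flag (an invented field, not the project’s schema) and a 10% noise fraction:

```python
import random

def rule_labels(samples, noise=0.1, seed=0):
    """Label ~90% of concentric-circle samples 'good'; the remaining
    concentric samples are 'noise' and, like all non-concentric
    samples, are labelled 'bad'."""
    random.seed(seed)
    labels = []
    for s in samples:
        if s["is_concentric"] and random.random() >= noise:
            labels.append("good")
        else:
            labels.append("bad")
    return labels

# Toy super-set: 1000 concentric and 1000 non-concentric compositions.
data = [{"is_concentric": True}] * 1000 + [{"is_concentric": False}] * 1000
labels = rule_labels(data)
good = labels.count("good")
print(good)  # roughly 900 of the 1000 concentric samples
```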
This all leads to the big question: are aesthetic values actually learnable using these features? The answer seems to be only if the aesthetic is highly narrow. The data currently being collected collapses a lot of different aesthetics: those of many Twitter followers and those who interact with the Zombie in person. With this collapse of groups, it is an open question whether any amount of data would be enough to generalize. So where do I go from here? I think the big question is still worthwhile, so I’ll keep collecting data; but in the narrow sense of moving the Zombie Formalist forward, what should my strategy be? Should I assume a narrow aesthetic (since that may be the only learnable aesthetic) and deploy a system that works for this idealized case? The best model for this case is quite simple: two hidden layers with a “funnel” shape, a batch_size of 16, 5 training epochs, dropouts of 0 and relu activation functions. I suspect this best performance was ‘lucky’, since I don’t think there is enough data to properly train two hidden layers, and I suspect a single-hidden-layer network would be comparable.
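To make the “funnel” shape concrete, here is a stdlib-only sketch of the forward pass of such a network; the layer widths (16 → 8 → 1) are illustrative assumptions, since the post only specifies two hidden layers narrowing toward the output, with relu activations and no dropout:

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def make_layer(n_in, n_out, seed):
    """Random weights stand in for trained ones in this sketch."""
    random.seed(seed)
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

# "Funnel": input -> 16 -> 8 -> 1 (widths are assumptions).
n_features = 15
w1, b1 = make_layer(n_features, 16, seed=1)
w2, b2 = make_layer(16, 8, seed=2)
w3, b3 = make_layer(8, 1, seed=3)

x = [0.1] * n_features
h1 = relu(dense(x, w1, b1))
h2 = relu(dense(h1, w2, b2))
out = dense(h2, w3, b3)
print(len(h1), len(h2), len(out))  # 16 8 1
```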
Due to the COVID delays in working with contractors on the enclosure, I’ve been procrastinating on the development of the learning system in the Zombie. I’m thinking I’ll try the ANN features included in dlib, since I’m already using dlib for the face detection (aside: it does not work with masks, and it’s hard to know when mask-free exhibitions are going to be possible). I guess I should start on that soon, since the task of accurately predicting the collapse of so many different aesthetics is over for now.