Following on from my previous post, I used the same approach to vary the threshold for how much Twitter engagement is required for a composition to be “good”. The following table shows the results, where the “TWIT Threshold” is applied to the sum of likes and RTs for each composition. Of course, increasing the threshold decreases the number of “good” samples significantly: there are 880 “good” samples at Threshold 1, 384 at Threshold 2, and 158 at Threshold 3. (This is comparable to the number of samples when using attention to determine labels.) The small number of samples at high thresholds is why I did not try thresholds higher than 3.
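As a rough illustration of the labelling described above, here is a minimal sketch in Python. The `Composition` class, its field names, and the toy data are hypothetical, and I am assuming a composition counts as “good” when likes plus RTs is greater than or equal to the threshold; the real pipeline may differ.

```python
from dataclasses import dataclass


@dataclass
class Composition:
    # Hypothetical record: field names are illustrative only.
    likes: int
    retweets: int


def is_good(comp: Composition, threshold: int) -> bool:
    # Assumption: "good" means the engagement sum meets or exceeds the threshold.
    return (comp.likes + comp.retweets) >= threshold


def count_good(comps: list[Composition], threshold: int) -> int:
    # Number of compositions labelled "good" at this threshold.
    return sum(is_good(c, threshold) for c in comps)


# Toy data (not the real data set): engagement sums are 0, 1, 2, 3, 3.
comps = [
    Composition(likes=0, retweets=0),
    Composition(likes=1, retweets=0),
    Composition(likes=1, retweets=1),
    Composition(likes=2, retweets=1),
    Composition(likes=0, retweets=3),
]

# The "good" class can only stay the same size or shrink as the threshold rises.
counts = {t: count_good(comps, t) for t in (1, 2, 3)}
```

Even in this toy set the “good” class shrinks with each step up in threshold, which is the same effect that limits how high the threshold can usefully go.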
Interestingly, the results show the opposite pattern to that observed using attention to generate labels, where test validation accuracy increases as the threshold increases. It seems Twitter engagement scores are actually much more accurate than those using the attention data. It makes sense that explicitly liking and RTing on Twitter is a better signal for “good”, even though it collapses many more people’s aesthetics. Indeed, some would argue there are global and objective aesthetics most of us agree on, but I’m less convinced.
I also ran a series of experiments using the amalgamated data-set (where the ZF code changed between subsequent test sessions) with the same Twitter thresholds (1399 “good” samples at Threshold 1, 560 at Threshold 2, and 228 at Threshold 3). These showed only a 2% difference in test accuracy, which peaked at 58%. Another experiment I was working on was a proxy for a single-style ZF that would generate only circles, for example. This would remove some of the feature vector params and potentially increase accuracy, as it would be an “apples to apples” comparison for the audience. It also reduces the number of samples since, for example, “circles” are only 1/3 of the whole data set. Doing this for the F integration test resulted in a best accuracy of around 60% (where Threshold 3 has 129 “good” samples). I considered doing the same with the amalgamated training set, which contains 1438 circle samples that were uploaded to Twitter, compared to the 898 included in the most recent integration test. Looking back at the amalgamated data-set, though, it actually has about the same number of circle compositions with high Twitter scores as the F data-set, so there is no point in going back to it for more samples!
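The single-style proxy can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample layout (`style`, `features`, `good`), the feature names, and the choice of which params to drop are all hypothetical, not the actual ZF feature vector.

```python
def single_style_subset(samples, style, drop_keys):
    """Keep only samples of one style and drop features assumed irrelevant to it."""
    subset = []
    for sample in samples:
        if sample["style"] != style:
            continue  # skip compositions in other styles
        # Remove params that (by assumption) no longer carry information
        # once the style is fixed.
        features = {k: v for k, v in sample["features"].items() if k not in drop_keys}
        subset.append({"features": features, "good": sample["good"]})
    return subset


# Toy data (hypothetical feature names, not the real data set).
samples = [
    {"style": "circles", "features": {"layers": 3, "hue": 0.2, "style_id": 0}, "good": True},
    {"style": "lines", "features": {"layers": 5, "hue": 0.7, "style_id": 1}, "good": False},
    {"style": "circles", "features": {"layers": 8, "hue": 0.9, "style_id": 0}, "good": False},
]

# "style_id" is constant within a single-style subset, so it is dropped here.
circles = single_style_subset(samples, style="circles", drop_keys={"style_id"})
```

The trade-off is visible even here: the feature vector gets shorter, but so does the data set, which is why the number of high-scoring circle compositions matters so much.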
Through all of this, it seems clear that online learning of viewer aesthetics from scratch would take a very, very long time, and that shipping the project with a starting model based on the Twitter data collected to date is perhaps the best approach. The Zombie Formalist has been on Twitter for about a year and over that time has generated 15833 compositions, only slightly more than my initial hand-labelled training set of 15000 compositions, for which my best test accuracy was 70% (but I’ve done some feature engineering since then).