Returning to Machine Learning with Twitter Data

Now that I have the system running, uploading to Twitter, and collecting a pretty good amount of data, I’ve done some early ML work using this new data set! I spent a week framing this as a regression task (predicting scores) rather than a classification task (predicting “good” or “bad” classes). The regression was not working well at all and I abandoned it; it was also impossible to compare its results with the previous classification work. I’ve returned to framing this as a classification problem and run a few parameter searches.

Previous work with hand-labelled data

First, a little background: each composition has three variables used to determine its ‘score’ (a measure of good or bad). The ‘attention’ is the amount of time, in frames, that the in-person viewer looks at the work. If the viewer looks at a composition for more than 50 frames (more than a few seconds), the composition is uploaded to Twitter and the number of likes (l) and retweets (r) is tracked. As attention < 50 is bad and attention >= 50 is good, this is a binary signal. One like or one retweet is weighted the same as that initial binary signal, so score = attention + 50(l+r).
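A minimal sketch of the scoring rule above, in Python. The function and constant names are illustrative only and are not taken from the Zombie Formalist code.

```python
# Illustrative names; only the constants 50 and the formula come from the text.
ATTENTION_THRESHOLD = 50  # frames of viewing needed for the binary 'good' signal
ENGAGEMENT_WEIGHT = 50    # one like or retweet counts as much as that signal


def composition_score(attention, likes=0, retweets=0):
    """score = attention + 50(l + r), as stated in the text."""
    return attention + ENGAGEMENT_WEIGHT * (likes + retweets)


def attention_is_good(attention):
    """Binary attention signal: attention >= 50 is good, otherwise bad."""
    return attention >= ATTENTION_THRESHOLD
```

For example, a composition viewed for exactly 50 frames that then receives a single like gets a score of 100, the same as a composition viewed for 100 frames with no Twitter engagement.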

In collecting this data I knew there was an issue with unbalanced classes, so I balanced classes by throwing away random ‘bad’ compositions such that the number of ‘good’ and ‘bad’ samples was equal. Of course this was ‘balanced’ according to the attention binary signal. In order to incorporate the likes and retweets as training signals, the threshold had to change from 50 to something greater. Up to this point I’ve been using 100, which unbalances the classes again (even fewer good compositions).
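The balancing step described above can be sketched as random undersampling of the ‘bad’ class. This is only an illustration under stated assumptions: the helper name, the fixed seed, and the representation of samples as plain attention values are hypothetical, and the real system balances the data as it is collected rather than in one batch.

```python
import random


def undersample_bad(samples, is_good, seed=0):
    """Randomly discard 'bad' samples until there are no more of them than 'good' ones.

    `samples` is any list; `is_good` is a function returning True for 'good' samples.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    good = [s for s in samples if is_good(s)]
    bad = [s for s in samples if not is_good(s)]
    n_keep = min(len(bad), len(good))
    return good + rng.sample(bad, n_keep)


# Hypothetical attention values (frames); 'good' means attention >= 50.
attentions = [5, 8, 12, 20, 30, 45, 49, 55, 70, 120]
balanced = undersample_bad(attentions, lambda a: a >= 50)
```

Relabelling the same balanced set with a stricter threshold (100 instead of 50) can only move samples from ‘good’ to ‘bad’, which is why the classes become unbalanced again.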

Before getting into the recent experiments, I’m going to talk about my previous best results from my pilot work using hand-labelled data, for comparison. The data set contains 15,000 images generated by the Zombie Formalist and labelled ‘good’, ‘bad’ or ‘neutral’. I threw away the ‘neutral’ images, which left 3872 ‘bad’ and 3921 ‘good’ samples. The data is split into 75% training, 15% validation and 15% testing. The best model I trained on this data was 78% accurate on validation data (data used to determine which model structure is best through hyperparameter optimization), and the following shows the confusion matrix for test data that was used in neither hyperparameter optimization nor training.

The corresponding f1-scores for this best model for the validation set were 71% (bad) and 78% (good). The same model using the test set achieved f1-scores of 71% (bad) and 77% (good). This was the best performance after many experiments.
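For readers unfamiliar with per-class f1-scores, here is a small sketch of how they follow from a two-class confusion matrix. The matrix values below are made up for illustration and are not the results reported in this post; the function names are likewise hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """f1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_f1(confusion):
    """confusion[true][predicted], with class 0 = 'bad' and class 1 = 'good'."""
    f1_bad = f1_from_counts(
        tp=confusion[0][0], fp=confusion[1][0], fn=confusion[0][1]
    )
    f1_good = f1_from_counts(
        tp=confusion[1][1], fp=confusion[0][1], fn=confusion[1][0]
    )
    return {"bad": f1_bad, "good": f1_good}


# Made-up example: rows are true classes, columns are predictions.
example = [[40, 10],
           [20, 30]]
scores = per_class_f1(example)
```

Each class is treated in turn as the ‘positive’ class, so a model can score well on ‘bad’ while scoring poorly on ‘good’, which overall accuracy alone would hide.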

Using The ‘Real’ Data

Onto the new data collected by the running Zombie Formalist! The Zombie Formalist generated 1524 compositions (after ongoing balancing based on the attention threshold of 50). Using the updated threshold of 100 (where “good” compositions need double the attention, or minimal attention + 1 like or retweet), there are 1092 bad and 432 good compositions in the set. While the best model according to binary accuracy in hyperparameter optimization, using the same split as above, showed somewhat comparable accuracy on the validation (72.3%) and test (63.3%) sets, the f1-scores show much worse performance on predicting ‘good’ compositions: the validation set achieved f1-scores of 81% (bad) and only 49% (good), while the test set was even worse with f1-scores of 74% (bad) and 35% (good). The corresponding confusion matrix is below.
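One way to read these accuracy figures against the class imbalance is to compare them with the accuracy of a trivial model that always predicts the majority class. The sketch below computes that baseline from the class counts given above; the function name is illustrative, and the validation and test splits may have somewhat different class proportions than the full set.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Class counts from the text, using the threshold of 100.
counts = {"bad": 1092, "good": 432}
baseline = majority_baseline_accuracy(counts)
print(f"Always predicting 'bad' is correct {baseline:.1%} of the time")
```

On the full set this baseline is roughly 71.7%, so a validation accuracy of 72.3% is only marginally better than never predicting ‘good’ at all, which is consistent with the low f1-scores for the ‘good’ class.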

I thought this may be due to the very unbalanced classes, so I kept the threshold but additionally removed all very low attention compositions (keeping only attention > 10), as I noticed there are a lot of incidental attentions where someone walks by without actually looking at the composition. After doing so, the classes are a little more balanced at 568 (bad) and 432 (good). This did not improve performance: validation accuracy was 61% and test accuracy only 54%. The f1-scores tell an even worse story: the best model achieved f1-scores of 72% (bad) and 41% (good) on the validation set, and 65% (bad) and 32% (good) on the test set. I conclude that there is just not enough data to train a model that generalizes well and is able to predict ‘good’ compositions. The task is to predict the aesthetic preferences of the audience, so the prediction of ‘good’ is most important.
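The filtering step can be sketched as follows. This assumes the parenthetical means that only compositions with attention > 10 are kept, and that a score of 100 or more counts as ‘good’ (whether the comparison is inclusive is an assumption). The record layout and function name are hypothetical.

```python
from collections import Counter


def filter_and_label(compositions, min_attention=10, good_threshold=100):
    """Drop incidental glances, then label the rest by score.

    Each composition is a dict with 'attention', 'likes' and 'retweets' keys
    (a hypothetical layout used only for this sketch).
    """
    labelled = []
    for c in compositions:
        if c["attention"] <= min_attention:
            continue  # assumed to be someone walking past without looking
        score = c["attention"] + 50 * (c["likes"] + c["retweets"])
        label = "good" if score >= good_threshold else "bad"  # inclusive: assumption
        labelled.append((c, label))
    return labelled


# Made-up records for illustration.
records = [
    {"attention": 4, "likes": 0, "retweets": 0},    # dropped
    {"attention": 30, "likes": 0, "retweets": 0},   # bad
    {"attention": 55, "likes": 0, "retweets": 0},   # bad (score 55)
    {"attention": 55, "likes": 1, "retweets": 0},   # good (score 105)
    {"attention": 130, "likes": 0, "retweets": 0},  # good (score 130)
]
class_counts = Counter(label for _, label in filter_and_label(records))
```

Because the filter only ever removes low-attention samples, it shrinks the ‘bad’ class and leaves the number of ‘good’ samples unchanged, matching the counts reported above (432 good before and after).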

Twitter likes and retweets may actually be adding noise to the data rather than boosting the training signal, though. Some very good compositions may not get a lot of likes because of the relatively small number of followers that the Zombie Formalist has and the importance of timing in getting likes. Compositions are generated and uploaded based on user interaction, which means they are distributed through the day and not focused on the peak Twitter posting times. There is another problem: the audience. The previous best model was only able to predict my aesthetic ~75% of the time, with hand-labelled data. While running the ZF, I noticed that my partner and I ‘like’ quite different compositions, and I can imagine there are even more diverse aesthetic preferences on Twitter. Then there is the question of whether viewing time even signals the audience’s aesthetic. I still think this is a worthwhile approach since, even if it is not a good aesthetic signal, optimizing for the time viewers spend looking at the work means increasing (implicit) participation.

More data is needed, but with all this diversity of aesthetic preferences, is a global ‘aesthetic’ actually learnable? Does it even exist? Maybe I need a version of the Zombie Formalist that does not use any in-person interaction, but only follows Twitter norms in its uploading schedule, and contrast that with an in-person-only aesthetic. I’ll run an experiment where I use only the attention data and see how that performs. I can also try the same for only the Twitter data and see where more noise is. At least likes and retweets are pretty clear signals.