For this experiment I used the Twitter data (likes and retweets) alone to generate labels where ‘good’ compositions have at least 1 like or retweet. There are relatively few compositions that receive any likes or retweets (presumably due to upload timing and the twitter algorithm). Due to this, I random sample the ‘bad’ compositions to balance the classes, leading to 197 ‘good’ and 197 ‘bad’ samples. The best model archives an accuracy of 76.5% for the validation set and 56.6% on the test set. The best model archived f1-scores of 75% (bad) and 78% (good) for the validation set and 55% (bad) and 58% (good) for the test set. The following image shows the confusion matrix for the test set. The performance on the validation set is very good, but that does not generalize to the test set, likely because there is just too little data here to work with.
I was just thinking about this separation of likes from attention and realized that since compositions with little attention don’t get uploaded to twitter, they certainly have no likes; I should only be comparing compositions that have been uploaded to twitter if I’m using the twitter data without attention to generate labels. The set used in the experiment discussed herein contains 320 uploaded compositions and 74 compositions that were not uploaded. I don’t think it makes sense to bother with redoing this experiment with only the uploaded compositions because there are just too few samples to make any progress at this time.
In this data-set 755 compositions were uploaded and 197 received likes or retweets. For the data-collection in progress as of last night 172 compositions have been uploaded and 86 have received likes or retweets. So it’s going to be quite the wait until this test collects enough data to move the ML side of the project forward.