Looking at my data I noticed that there were quite a few weak compositions in the top 50 greatest attention set for the still-collecting F integration test. Some of these were due to outlier levels of attention caused by a false positive face detection in the bathroom, others seem to be either a change of heart, or my partner’s aesthetic. Since there seemed to be some quite poor results, I wondered about changing the attentional threshold to generate labels where “good” only if they received a lot of attention. The results are that the higher the threshold, the fewer the samples and the poorer the generalization:
|Test Set Accuracy:||56%||53%||45%|
Next I’ll try the same thing with a few different thresholds for the Twitter engagement (likes and retweets). I have lower expectations here because there are is potentially much greater variance in aesthetics preferred by the Twitter Audience. At the same time, the Twitter audience is more explicit about their aesthetic since they need to interact with tweets.