The results from my second attempt, using attention alone to determine the label and filtering out samples with attention < 6, are in! This unbalanced data-set gives much higher validation (74.2%) and test (66.5%) accuracies. The F1 scores achieved by the best model are also much better: 36% (‘bad’) and 84% (‘good’) on the validation set, and 27% (‘bad’) and 78% (‘good’) on the test set. Since this data-set is quite unbalanced and the aim is to predict ‘good’ compositions rather than ‘bad’ ones, I think these results are promising. I therefore chose not to balance the classes this time: true positives matter more than true negatives here, so throwing away ‘good’ samples makes no sense.
It is unclear whether this improvement comes from having fewer ‘bad’ samples, or whether the filtered-out samples (attention < 6) were simply noise with no aesthetic meaning. The test confusion matrix below shows how rarely the model predicts ‘bad’ compositions, as well as the higher number of ‘bad’ compositions misclassified as ‘good’.
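For reference, the per-class F1 scores and confusion matrix above can be computed with scikit-learn. This is only an illustrative sketch: the label arrays here are made up (1 = ‘good’, 0 = ‘bad’), not the actual test-set predictions, but they mimic the pattern described, with ‘bad’ predictions being rare and most errors being ‘bad’ samples classified as ‘good’.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Illustrative labels only (1 = 'good', 0 = 'bad'), not the real test set.
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1])

# Rows = true class, columns = predicted class, ordered [bad, good].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# average=None returns one F1 score per class, as reported above.
f1_bad, f1_good = f1_score(y_true, y_pred, labels=[0, 1], average=None)

print(cm)
print(f"F1 bad: {f1_bad:.2f}, F1 good: {f1_good:.2f}")
```

In this toy example the ‘bad’ column of the confusion matrix is sparse and the ‘bad’ F1 is much lower than the ‘good’ F1, which is the same asymmetry the real test results show.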