After a few more experiments hoping to get a test accuracy close to the 63% I achieved recently, I realized I had not tried re-running that same experiment with the same hyperparameters. The only differences would be the random seeds used for the initial weights and which samples end up in the test and validation sets. So I re-ran the recent experiment (using 3 as the Twitter score threshold), and the best-performing model, in terms of validation accuracy, achieved a test accuracy of… 43%. Validation accuracy was quite consistent: previously 73%, now 71%. So the lesson is that I need to run each experiment multiple times with different test sets to know how well it is *actually* working, because apparently my best results merely came from accidentally generalizable test sets or favourable initial weights.
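A minimal sketch of what repeating runs might look like. The `run_experiment` function here is a hypothetical stand-in (the real one would train the model and return test accuracy for a given seed); the point is just to repeat the same hyperparameters across several seeds and report the spread rather than a single number:

```python
import random
import statistics

def run_experiment(seed):
    # Hypothetical stand-in for one full train/evaluate cycle; the seed
    # would control weight initialisation and the test/validation split.
    rng = random.Random(seed)
    # Toy "accuracy" to illustrate spread; a real run would train the model.
    return 0.53 + rng.uniform(-0.10, 0.10)

# Same hyperparameters, several seeds, summarised as mean ± std.
accuracies = [run_experiment(seed) for seed in range(10)]
mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)
print(f"test accuracy: {mean_acc:.2%} ± {std_acc:.2%}")
```

With a spread like 73% vs 43% on single runs, the standard deviation is the more honest headline figure than any one seed's result.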
The next step is to reset to the F data set, filtering for only the Twitter-uploaded compositions, and see how much variation there is in test accuracy when using low Twitter score thresholds. It is certainly an issue that a composition may have no likes not because it is disliked but because nobody on Twitter ever saw it. Perhaps I should consider compositions liked by only one person “bad” and those with more than one like “good”; that way I’m only comparing compositions that have certainly been seen!
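The relabelling idea above could be sketched like this. The record shape (`id`/`likes` dicts) and the helper name `label_by_likes` are my own assumptions, not the actual data format; the logic is just the rule described: drop zero-like compositions as possibly unseen, label exactly one like as “bad”, and more than one as “good”:

```python
def label_by_likes(compositions):
    """Keep only compositions that were certainly seen (>= 1 like),
    labelling one like as 'bad' and more than one as 'good'."""
    labelled = []
    for comp in compositions:
        likes = comp["likes"]
        if likes == 0:
            continue  # may simply never have been seen on Twitter
        comp = dict(comp, label="good" if likes > 1 else "bad")
        labelled.append(comp)
    return labelled

sample = [{"id": 1, "likes": 0}, {"id": 2, "likes": 1}, {"id": 3, "likes": 5}]
print(label_by_likes(sample))
# → [{'id': 2, 'likes': 1, 'label': 'bad'}, {'id': 3, 'likes': 5, 'label': 'good'}]
```

One consequence of this filter worth watching: it shrinks the data set to only compositions with at least one like, so it trades label noise for sample size.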