After all the issues with ‘lucky’ results, I wanted to go back and confirm that my 70% best-case results were not lucky as well. The good news is that those results are valid! I trained on the hand-labelled data with the same hyper-parameter search I’ve been using for the recent experiments, and the results are great: the mean test accuracy was 71%, and the average F1 scores on the test sets ranged from 68% to 73%. So I’ll only be aiming to get near 70% at best on the Twitter data, and even that is unlikely, considering that those labels collapse multiple aesthetics of the Twitter (or in-person) audience. The current best performance on Twitter data is ~60% accuracy.
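For reference, here is a minimal sketch of how a summary like "mean test accuracy plus the range of average F1 scores" could be computed across several test sets. It is not the code used for these experiments. The function names (`accuracy`, `macro_f1`, `summarize`) are hypothetical, "average F1" is assumed to mean the macro average over classes, and the labels in the usage lines are made-up toy values, not the real data.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of 'average F1')."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize(test_sets):
    """test_sets: list of (y_true, y_pred) pairs, one pair per test set."""
    accs = [accuracy(t, p) for t, p in test_sets]
    f1s = [macro_f1(t, p) for t, p in test_sets]
    return {"mean_accuracy": mean(accs), "f1_min": min(f1s), "f1_max": max(f1s)}


# Toy usage with invented labels (two test sets of four examples each).
toy_sets = [
    ([1, 0, 1, 0], [1, 0, 1, 0]),
    ([1, 0, 1, 0], [1, 1, 1, 0]),
]
print(summarize(toy_sets))
```

Reporting the F1 range alongside the mean accuracy is what makes it possible to tell a single lucky split from a result that holds up across test sets.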