Following on from the last couple of ML posts, I’ve been looking at the Integration G data-set. This set has 1734 uploaded compositions (only slightly fewer than the F data-set). Interestingly, without the filter-by-in-person-attention mechanism (face detection) to determine whether a composition is “good” enough to be uploaded, the “good” and “bad” classes are more balanced: about half the compositions are liked or retweeted. I presume fewer likes happen over the North American overnight, as I’ve observed; hopefully the Hong Kong exhibition will increase the number of followers in the Eastern hemisphere. I should look at the distribution of engagements overnight in North America.
If “bad” means no likes or retweets (RTs) and “good” means at least one, then there are ~1000 “good” and ~700 “bad” compositions. Since the classes are fairly balanced, I did an initial experiment without re-balancing; since the aim is to detect “good” compositions, it does not make sense to balance by throwing any of them away. The results are OK, but the f1-scores are quite inconsistent between the “good” and “bad” classes. The average test accuracy was 57%, with a peak of 60%. The f1-score for the best performing model from the best search was 72% for “good” but only 29% for “bad”. I suspect this is due to the unbalanced classes and the test split used in that search being lucky (having more “good”, or similar to “good”, compositions). The best performing model from the worst search attained a test f1-score of 62% for “good” and 43% for “bad”. (The only differences between these two searches are the initial weights and the random train/val/test splits.)
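For reference, the labelling rule and the per-class f1-score I’m quoting can be sketched in a few lines of Python (the function names and toy data here are hypothetical; this is not the actual training code):

```python
def label(likes: int, retweets: int) -> int:
    # "good" (1) = at least one like or retweet; "bad" (0) = none
    return 1 if likes + retweets > 0 else 0

def f1_for_class(y_true, y_pred, cls):
    # Per-class f1, as reported separately for "good" and "bad" above
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A model that leans toward predicting “good” can get a high f1 on the “good” class while doing badly on “bad”, which is the imbalance pattern seen in the best search.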
I tried a different threshold for “good”, where compositions with only 1 like or retweet were considered “bad” and those with more than 1 were “good”. Compositions with no likes or retweets were removed. This resulted in 1003 training samples, a significant reduction, with very balanced classes (503 “good” and 500 “bad”). While this produced slightly more balanced f1-scores, it also resulted in lower average accuracy and a poorer f1-score for the “good” class. The average accuracy on test sets was 55% (compared to 57%), with a peak test accuracy of 58% (compared to 60%). This seems consistent, but the details bear out a lack of improvement; for the best model in the best search, the f1-scores are 50% for “good” and 60% for “bad”, a significant reduction in the f1-score for “good” compositions from the previous 72%. The best model from the worst search was very similar, with f1-scores of 58% for “good” and 52% for “bad”.
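The second thresholding scheme, with zero-engagement compositions dropped, might look like this (again a hypothetical sketch, not the real pipeline):

```python
def label_v2(likes: int, retweets: int):
    # Zero engagement: dropped from the data-set entirely (None);
    # exactly 1 like or RT: "bad" (0); more than 1: "good" (1)
    engagement = likes + retweets
    if engagement == 0:
        return None
    return 1 if engagement > 1 else 0

# Toy (likes, retweets) pairs; the real counts come from Twitter
samples = [(0, 0), (1, 0), (0, 2), (3, 1)]
labels = [lab for lab in (label_v2(l, r) for l, r in samples)
          if lab is not None]
```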
The conclusion is that there is no improvement from throwing away compositions with no likes or retweets on the assumption that they were seen by fewer people. So we’re about in the same place, hovering around 60% accuracy. The current integration test, H, includes a more constrained set of compositions (only circles, with only 3 layers); this reduces the number of parameters and constrains the compositions significantly. My hope is that this results in better performance, but I’ll have to wait until it collects a similar number of samples. Before sending the ZF to the crater, 342 compositions were generated during that test, so there will be quite a wait to generate enough data to compare, especially considering the travel time to and from HK. So I’m going to set the ML aside again for now.