As each composition uses 5 layers, I wanted to create the illusion of lower density without changing the number of parameters. To do this, I allow for offsets where a layer slides completely out of view, making it invisible. This allows for compositions of only the background colour, as well as simplified compositions where only a few layers are visible.
The problem with this, from an ML perspective, is that the parameters of the invisible layers are still in the training data; this is because the training data represents the instructions for making the image, not the image itself. The features of a hidden layer therefore still influence learning even though that layer contributes nothing to the visible composition. I thought I would run another hyperparameter search where I zero out all the parameters of layers that are not visible. I reran an older experiment to test against, and the results are promising.
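As a rough sketch of what that zeroing step could look like (the per-layer parameter count, the position of the offset parameter, and the visibility test below are all assumptions, not the Zombie Formalist’s actual encoding):

```python
import numpy as np

NUM_LAYERS = 5
PARAMS_PER_LAYER = 8          # assumed; the real per-layer parameter count differs
OFFSET_INDEX = 0              # assumed position of the offset within each layer's block
VISIBLE_RANGE = (-1.0, 1.0)   # assumed range; offsets outside it slide the layer off-screen

def zero_invisible_layers(sample: np.ndarray) -> np.ndarray:
    """Zero every parameter of layers whose offset pushes them out of view."""
    sample = sample.copy()
    for layer in range(NUM_LAYERS):
        start = layer * PARAMS_PER_LAYER
        offset = sample[start + OFFSET_INDEX]
        if not (VISIBLE_RANGE[0] <= offset <= VISIBLE_RANGE[1]):
            sample[start:start + PARAMS_PER_LAYER] = 0.0
    return sample
```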
For this experiment I used the Twitter data (likes and retweets) alone to generate labels, where ‘good’ compositions have at least one like or retweet. Relatively few compositions receive any likes or retweets (presumably due to upload timing and the Twitter algorithm), so I randomly sample the ‘bad’ compositions to balance the classes, leading to 197 ‘good’ and 197 ‘bad’ samples. The best model achieves an accuracy of 76.5% on the validation set and 56.6% on the test set, with f1-scores of 75% (bad) and 78% (good) for the validation set and 55% (bad) and 58% (good) for the test set. The following image shows the confusion matrix for the test set. The performance on the validation set is very good, but it does not generalize to the test set, likely because there is just too little data here to work with.
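The labelling and balancing step is simple enough to sketch; the column names and values below are invented stand-ins for the real per-composition log:

```python
import pandas as pd

# Toy stand-in for the per-composition Twitter log
df = pd.DataFrame({
    "composition_id": range(8),
    "likes":    [0, 2, 0, 0, 1, 0, 0, 3],
    "retweets": [0, 0, 0, 1, 0, 0, 0, 0],
})

# 'good' = at least one like or retweet
df["label"] = ((df["likes"] + df["retweets"]) > 0).astype(int)

good = df[df["label"] == 1]
bad = df[df["label"] == 0].sample(n=len(good), random_state=42)  # random under-sampling

balanced = pd.concat([good, bad]).sample(frac=1, random_state=42)  # shuffle
print(balanced["label"].value_counts())
```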
I was just thinking about this separation of likes from attention and realized that since compositions with little attention don’t get uploaded to Twitter, they certainly have no likes; I should only be comparing compositions that have been uploaded to Twitter if I’m using the Twitter data without attention to generate labels. The set used in the experiment discussed here contains 320 uploaded compositions and 74 compositions that were not uploaded. I don’t think it makes sense to redo this experiment with only the uploaded compositions, because there are just too few samples to make any progress at this time.
In this data-set 755 compositions were uploaded and 197 received likes or retweets. For the data collection in progress as of last night, 172 compositions have been uploaded and 86 have received likes or retweets. So it’s going to be quite the wait until this test collects enough data to move the ML side of the project forward.
The results from my second attempt, using attention alone to determine labels and filtering out samples with attention < 6, are in! This unbalanced data-set has much higher validation (74.2%) and test (66.5%) accuracies. The f1 scores achieved by the best model are much better also: for the validation set, 36% (bad) and 84% (good); for the test set, 27% (bad) and 78% (good). As this data-set is quite unbalanced and the aim is to predict ‘good’ compositions, not ‘bad’ ones, I think these results are promising. I chose not to balance the classes this time because true positives are more important than true negatives, so throwing away ‘good’ samples does not make sense.
It is unclear whether this improvement is due to there being fewer bad samples, or whether the samples with attention < 6 really are noise without aesthetic meaning. The test confusion matrix is below; it shows how rarely ‘bad’ predictions are made, as well as the large number of ‘bad’ compositions predicted to be ‘good’.
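For reference, the per-class f1 scores and confusion matrices reported here are the standard ones; a minimal sketch with scikit-learn (toy labels and predictions, not my actual model output) looks like this:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for the test-set labels and model predictions (0 = bad, 1 = good)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred, target_names=["bad", "good"]))
```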
Following from my previous ML post, I ran an experiment doing a hyperparameter search using only the attention data, ignoring the Twitter data for now. The results are surprisingly poor, with the best model achieving no better than chance accuracy and f1 scores on the test set! For the validation set, the best model achieved an accuracy of 65%. The following image shows the confusion matrix for the test set:
The f1 scores show that this model is about equally poor at predicting the good and bad classes: for the validation set, 67% for the bad class and 62% for the good class; for the test set, a very poor 55% (bad) and 45% (good).
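I haven’t detailed the model or search tooling in these posts, but the overall shape of the hyperparameter search is conventional; a minimal sketch using scikit-learn’s GridSearchCV over a small MLP (with toy data standing in for the composition parameters and labels) would be:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for the composition-parameter features and 'good'/'bad' labels
X, y = make_classification(n_samples=400, n_features=40, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "alpha": [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(MLPClassifier(max_iter=2000), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```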
As I mentioned in the previous post, I think a lot of noise is added by incidental interactions where someone walks by without actually attending to the composition. Watching behaviour around the piece, I’ve determined that attention values below 6 are very likely to be incidental. I’m now running a second experiment using the same setup as this one, except that these low-attention samples are removed. Of course this unbalances the data-set, in this case in favour of the ‘good’ compositions (754) compared to the ‘bad’ compositions (339). As there is so little data here, I’m not going to filter the ‘good’ samples further to balance the classes. After that, I’ll repeat these experiments with the Twitter data and see where that leaves things.
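A minimal sketch of that filtering step (column names and values are invented; only the < 6 threshold comes from my observations):

```python
import pandas as pd

ATTENTION_THRESHOLD = 6   # below this, an interaction is treated as an incidental passer-by

# Toy stand-in for the per-composition attention log
df = pd.DataFrame({
    "composition_id": range(8),
    "attention": [1, 12, 3, 40, 7, 2, 25, 5],
    "label":     [0, 1, 0, 1, 0, 0, 1, 1],   # 'good'/'bad' labels as already assigned
})

filtered = df[df["attention"] >= ATTENTION_THRESHOLD]
print(filtered["label"].value_counts())   # shows how unbalanced the classes end up
```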
Now that I have the system running, uploading to Twitter, and collecting a pretty good amount of data, I’ve done some early ML work using this new data-set! I spent a week looking at framing this as a regression task (predicting scores) vs a classification task (predicting ‘good’ or ‘bad’ classes). The regression was not working well at all and I abandoned it; it also made it impossible to compare results with previous classification work. I’ve returned to framing this as a classification problem and have run a few parameter searches.
The Zombie Formalist is taking a break from posting compositions to Twitter to create space for, amplify, and be in solidarity with Black and Indigenous people facing death, violence, and harassment as facilitated by white colonial systems.
I took this pause in generation to tweak the code that generates stripes. The offsets no longer cut off the stripes, because the code uses the frequency to determine appropriate places to cut (troughs). The following image shows a random selection of images using the new code. This change replaced a lot of work-around code (blurring, padding, etc.) and opened up aesthetic variation that was not previously possible.
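The idea is that, given a stripe layer’s frequency, the generator only allows offsets that land on a trough of the wave, so a slide never slices a stripe mid-band. A toy illustration of that snapping logic (in Python rather than the actual generator code, and assuming the stripes come from a simple sine wave):

```python
def snap_offset_to_trough(raw_offset: float, frequency: float) -> float:
    """Snap a raw layer offset to the nearest trough of sin(2*pi*frequency*x).

    Troughs sit at x = (k + 3/4) / frequency, so cutting the layer there
    avoids chopping a stripe in half.
    """
    period = 1.0 / frequency
    k = round((raw_offset - 0.75 * period) / period)
    return (k + 0.75) * period

# Example: frequency 5 (period 0.2) snaps a raw offset of 0.33 to the trough at 0.35
print(snap_offset_to_trough(0.33, 5.0))
```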