Following from the last couple of ML posts, I’ve been looking at the Integration G data-set. This set has 1734 uploaded compositions (only slightly fewer than the F data-set). Interestingly, without the in-person attention filter (face detection) deciding whether a composition is “good” enough to be uploaded, the “good” and “bad” classes are more balanced, i.e. about half the compositions are liked or retweeted. I presume fewer likes happen during the North American overnight, as I’ve observed; hopefully the Hong Kong exhibition will increase the number of followers in the Eastern hemisphere. I should look at the distribution of engagements overnight in North America.
If “bad” means no likes or retweets (RTs) and “good” means at least one, then there are ~1000 “good” and ~700 “bad” compositions. Since the classes are fairly balanced I did an initial experiment without re-balancing. Since the aim is to detect “good” compositions, it does not make sense to balance by throwing away those compositions. The results are OK, but the f1-scores are quite inconsistent for “good” and “bad” classes. The average test accuracy was 57% with a peak of 60%. The f1-score for the best performing model from the best search was 72% for “good” but only 29% for “bad”. I suspect this is due to the unbalanced classes and the test split used in that search being lucky (having more “good” or similar to “good” compositions). The best performing model from the worst search attained a test f1-score of 62% for “good” and 43% for “bad”. (The difference between these two searches is only initial weights and different random train/val/test splits).
I tried a different threshold for “good” where compositions with only 1 like or retweet were considered “bad” and those with more than 1 were “good”. Compositions with no likes or retweets were removed. This resulted in 1003 training samples, a significant reduction, with very balanced classes (503 “good” and 500 “bad”). While this resulted in slightly more balanced f1-scores, it also resulted in lower average accuracy and poorer f1-scores for the “good” class. The average accuracy on test sets was 55% (compared to 57%) with a peak test accuracy of 58% (compared to 60%). This seems consistent, but the details bear out a lack of improvement; for the best model in the best search, the f1-scores are 50% for “good” and 60% for “bad”, meaning a significant reduction in f1-score for “good” compositions from the previous 72%. The best model from the worst search was very similar, with f1-scores of 58% for “good” and 52% for “bad”.
The conclusion is that there is no improvement from throwing away compositions with no likes or retweets on the assumption that they were seen by fewer people. So we’re in about the same place, hovering around 60% accuracy. The current integration test, H, includes a more constrained set of compositions (only circles with only 3 layers); this reduces the number of parameters and constrains the compositions significantly. My hope is that this results in better performance, but I’ll have to wait until it collects a similar number of samples. Before sending the ZF to the crater, there were 342 compositions generated during that test, so there will be quite a wait to generate enough data to compare, especially considering the travel time to and back from HK. So I’m going to set the ML aside again for now.
The following photos show the assembly process for The Zombie Formalist! This version will test for a week before getting crated and shipped off to Hong Kong for Art Machines 2! It’s more restrained (only Twitter engagement and generates circles with three layers) in the hopes that the ML part works better with more constraint in the training data.
Robert and I had a miscommunication about the frame design and he ended up making a new version. The following images show the production process.
I got the metal enclosure back from the fabricator! There were a few issues: the camera mounting holes were not quite in the correct position, the Jetson board mounting holes were reversed, and I did not take into account the length of the power connectors, so the power supply does not fit as expected. My tech Bobbi had the tools, so we were able to make the modifications and I test fit the components. The gauge of the metal was thicker than I (or the designer) was expecting, so the next unit will probably be a thinner gauge.
Also, the Zombie Formalist was accepted for the Art Machines 2 conference in Hong Kong in June 2021! Robert is currently working on the design of the wood frame and I’m working on getting the Jetson board to interface with Bobbi’s button interface.
After all the issues with ‘lucky’ results I wanted to go back and confirm my 70% best-scenario results were not lucky! The good news is that those results are valid! I trained on the hand-labelled data with the same hyper-parameter search I’ve been using for the recent experiments and the results are great! The mean test accuracy was 71% and the average f1 scores for test sets ranged from 68% to 73%. Thus I’ll only be aiming to get near 70% for this, at best, but again such results are unlikely on the Twitter (or in-person) data, considering it collapses the aesthetics of multiple viewers. The current best performance on Twitter data is ~60% accuracy.
After a few more experiments hoping to get a test accuracy close to the 63% I achieved recently, I realized I had not tried to run that same experiment again with the same hyperparameters. The only difference would be the random seeds used for initial weights and which samples end up in the test and validation sets. So I re-ran the recent test (using 3 as the Twitter score threshold) and the best performing model, in terms of validation accuracy, achieved a test accuracy of… 43%. Validation accuracy was quite consistent: previously 73% and now 71%. So the lesson is that I need to run each experiment multiple times with different test sets to know how well it is *actually* working, because apparently my best results are merely the product of accidentally generalizable test sets or favourable initial weights.
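As a rough sketch of what running each experiment multiple times could look like, the following repeats a train/val/test split over several seeds and summarises the spread of test accuracy. The 15% split fractions, the `run_experiment` callback, and the `majority_baseline` stand-in are all assumptions for illustration; they are not the actual hyperparameter search.

```python
import random
import statistics


def split_indices(n, seed, val_frac=0.15, test_frac=0.15):
    """Shuffle indices 0..n-1 with the given seed and cut train/val/test."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def repeated_runs(n_samples, run_experiment, seeds):
    """Run one experiment per seed and summarise the test accuracies.

    run_experiment(train, val, test, seed) is a hypothetical callback that
    trains a model and returns its test accuracy as a float in [0, 1].
    """
    scores = [run_experiment(*split_indices(n_samples, s), s) for s in seeds]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def majority_baseline(labels):
    """Stand-in experiment: always predict the most common training label."""
    def run(train, val, test, seed):
        ones = sum(labels[i] for i in train)
        guess = 1 if ones * 2 >= len(train) else 0
        return sum(labels[i] == guess for i in test) / len(test)
    return run


# Toy labels (60 "good", 40 "bad"); real labels would come from the data-set.
toy_labels = [1] * 60 + [0] * 40
summary = repeated_runs(len(toy_labels), majority_baseline(toy_labels), range(5))
print(summary)
```

Reporting the mean and spread across seeds, rather than the single best run, makes it harder for one favourable test split to pass as a real improvement.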
The next step is to go back to the F data-set, filtering for only the compositions uploaded to Twitter, and see how much variation there is in test accuracy when using low Twitter score thresholds. It is certainly an issue that a composition may have no likes not because it’s unliked but because it was not seen by anyone on Twitter. Perhaps I should consider compositions liked by only one person “bad” and those with more than one like “good”; that way I’m only comparing compositions that have certainly been seen!
This is the design going to the fabricator! It’s nice that things are finally moving after all the challenges finding a new designer and fabricator during COVID. The main change to this design is that the fabricator requires a 3/8″ gap between all holes and bends, which means shifting things quite a bit. It also means changing the top where the camera and buttons are mounted.
I also thought I would take this chance to double check my calculations for the camera angle, and it’s good I did because they were incorrect! I had interpreted the camera’s 70° angle of view as horizontal, but it is actually diagonal, so I had to recalculate the field of view for the sensor to make sure the monitor does not block it. Short version: the vertical angle of view is 45°, not the 35° previously specified.
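For reference, here is a minimal sketch of converting a diagonal angle of view into horizontal and vertical angles. It assumes an ideal rectilinear lens and a 4:3 sensor; the aspect ratio is an assumption and not a spec taken from the camera, though under it a 70° diagonal works out to roughly 45.6° vertical, in line with the corrected figure.

```python
import math


def fov_from_diagonal(diagonal_deg, aspect_w, aspect_h):
    """Return (horizontal, vertical) angle of view in degrees.

    Assumes an ideal rectilinear lens, where half-extents on the sensor
    scale with tan(half-angle).
    """
    diag = math.hypot(aspect_w, aspect_h)
    t = math.tan(math.radians(diagonal_deg) / 2)
    horizontal = 2 * math.degrees(math.atan(t * aspect_w / diag))
    vertical = 2 * math.degrees(math.atan(t * aspect_h / diag))
    return horizontal, vertical


# The 4:3 aspect ratio is an assumption for illustration.
h, v = fov_from_diagonal(70, 4, 3)
print(f"horizontal {h:.1f} deg, vertical {v:.1f} deg")
```

A wide-angle lens with noticeable distortion would deviate from this, so the numbers are only a sanity check against the manufacturer’s spec.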
Following from my previous post, I used the same approach to change the thresholds for how much Twitter engagement is required for a composition to be “good”. The following table shows the result, where the “TWIT Threshold” is the minimum sum of likes and RTs a composition needs to count as “good”. Of course, increasing the threshold decreases the number of “good” samples significantly; there are 880 “good” samples in Threshold 1, 384 in Threshold 2, and 158 in Threshold 3. (This is comparable to the number of samples using attention to determine labels.) The small number of samples at high thresholds is why I did not try thresholds higher than 3.
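The thresholding described above can be sketched as follows. The record layout and the `likes` and `retweets` field names are hypothetical, and the toy records are made up purely to show how the “good” count shrinks as the threshold rises.

```python
def label_by_engagement(compositions, threshold):
    """Label each composition 1 ("good") if likes + retweets >= threshold."""
    return [1 if c["likes"] + c["retweets"] >= threshold else 0
            for c in compositions]


# Toy records; the field names are hypothetical.
toy = [
    {"likes": 0, "retweets": 0},
    {"likes": 1, "retweets": 0},
    {"likes": 1, "retweets": 1},
    {"likes": 3, "retweets": 1},
]
for threshold in (1, 2, 3):
    labels = label_by_engagement(toy, threshold)
    print(threshold, sum(labels))  # number of "good" samples at this threshold
```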
Interestingly, the results show the opposite pattern to that observed using attention to generate labels: here, test accuracy increases as the threshold increases. It seems Twitter engagement scores are actually much more accurate than those using the attention data. It makes sense that explicitly liking and RTing on Twitter is a better signal for “good”, even though it collapses many more people’s aesthetics. Indeed some would argue there are global and objective aesthetics most of us agree on, but I’m less convinced.
I also did a series of experiments using the amalgamated data-set (where the ZF code changed between subsequent test sessions) and the same Twitter thresholds (with 1399 “good” in Threshold 1, 560 in Threshold 2, and 228 in Threshold 3); these showed only a 2% difference in test accuracy, which peaked at 58%. Another experiment I was working on was a proxy for a single-style ZF that would generate only circles, for example. This would remove some of the feature vector params and potentially increase accuracy, as it would be an “apples to apples” comparison for the audience. This also involves reducing the number of samples as, for example, “circles” are only 1/3 of a whole data-set. Doing this for the F integration test resulted in a best accuracy of around 60% (where Threshold 3 has 129 “good” samples) and I’m considering doing the same with the amalgamated training set, which contains 1438 circle samples that were uploaded to Twitter, compared to the 898 that are included in the most recent integration test. Looking back at the amalgamated data-set, it actually has about the same number of circle compositions with high Twitter scores as the F data-set, so no point in going back to that for more samples!
Through all of this it seems clear that online learning of viewer aesthetics from scratch would take a very very long time and perhaps shipping the project with a starting model based on Twitter data collected to date is the best approach. The Zombie Formalist has been on Twitter for about a year and over that time generated 15833 compositions, only slightly more than my initial hand-labelled training set of 15000 compositions, for which my best test accuracy was 70% (but I’ve done some feature engineering since then).
Looking at my data I noticed that there were quite a few weak compositions in the top 50 greatest attention set for the still-collecting F integration test. Some of these were due to outlier levels of attention caused by a false positive face detection in the bathroom, others seem to be either a change of heart, or my partner’s aesthetic. Since there seemed to be some quite poor results, I wondered about changing the attention threshold used to generate labels, so that compositions are “good” only if they received a lot of attention. The results are that the higher the threshold, the fewer the samples and the poorer the generalization:
| Test Set Accuracy: | 56% | 53% | 45% |
Next I’ll try the same thing with a few different thresholds for the Twitter engagement (likes and retweets). I have lower expectations here because there is potentially much greater variance in the aesthetics preferred by the Twitter audience. At the same time, the Twitter audience is more explicit about their aesthetic since they need to interact with tweets.
Since I’ve been having trouble with generalizing classifier results (where the model achieves tolerable accuracy on training, and perhaps validation, data but poorly on test data) I thought I would throw more data at the problem; I combined all of the Twitter data collected to date (even though some of the code changed between various test runs) into a single data-set. This super-set contains 12861 generated compositions, 2651 of which were uploaded to twitter. I labelled samples as “good” where their score was greater than 100 (at least one like or RT and enough in person attention to upload to twitter). After filtering outliers (twice the system “saw” a face where there was no face, leading to very large and impossible attention values) this results in 1867 “good” compositions. When balancing the classes, the total set ends up with 3734 “good” and “bad” samples. Still not very big compared to my hand-labelled 15,000 sample pilot set, which contained 3971 “good” compositions. The amalgamated super-set was used for a number of experiments as follows.
Since my past post on ML for the ZF, I’ve been running the system on Twitter and collecting data, the assumption being that the model’s lack of ability to generalize (work accurately for the test set) is due to a lack of data. Since the classes are imbalanced (there are a lot of “bad” compositions compared to “good” ones), I end up throwing out a lot of generated data.
In the previous experiment I balanced classes only by removing samples that had very low attention. I considered these spurious interactions and thought they would just add noise. That data-set (E) had 568 good and 432 bad samples. The results of this most recent experiment follow.
This most recent iteration of the case design is very close to finalized! There are still some tweaks, but I’m confident not too many changes will be needed. I’ve already sent this design off to a few local fabricators; only once I hear back will I have a good sense of where my budget lands and how many painting appropriation prints I can make!
Since I was on the fence about the two test prints I had previously done, I thought I should make smaller prints of all of the remaining short-listed paintings to do the final selection.
I got some test prints from my printer! The images above are #19 (top) and #4 (bottom). #4 looks pretty fantastic; the blacks are quite deep and the whites quite bright; visually comparing with my Endura Metallic prints, the blacks are a little lighter but the whites are quite close. I was a little concerned about the (relatively) low resolution of these works both due to the source images and also due to the slowness of processing. Looking at the digital file you can see a little banding due to the subtle gradients, but these look very seamless and the texture of the canvas certainly contributes to the smoothness.
While #19 was quite popular in my Twitter poll, it seems to fall quite flat on canvas; I think the luminosity contrast is too low. Looking at the luminosity contrast of the other short-listed compositions, it looks like #22, #24, and perhaps #3 could also fall quite flat. If I choose not to print those, I would eliminate the more contemporary paintings including cubist and surrealist pieces. The remaining source paintings were made from 1517 to 1633, so quite a narrow window. I’m unsure how to proceed, but I think I’ll need more test prints. I also did not include some of these in my video versions, so I’ll do some of that work next.
This aspect of the project has been quite slow and I have not been up to date on the blog; my last post was when I finished my first sketchy drawing in December! The company I had originally gotten a quote from no longer was able to do the job, which included technical drawing, design and fabrication in wood and metal. I approached quite a few companies but no one was able to do all aspects of the job and / or did not want to take on the design task.
After desperate searching my partner suggested I ask a friend of hers and Robert Billard has taken on the design and technical drawing task! This is a real favour since an architect is far overqualified for a small job like this. Thanks to him, this part of the project is finally moving and I should be able to get realistic quotes for the metal fabrication job! The images following show various renderings of the enclosure through a number of iterations; they are incomplete, but do give a sense of progress from older (top) to newer (bottom).
Since I revisited many of the paintings and used the epoch training method used for the videos, I’ve made a longer revised short list of paintings and here they are all together:
This painting did not make the previous short list due to the patchy colour (bottom image); I thought I would go back to it with epoch training, and I’m quite happy with the results!
As each composition uses 5 layers, I wanted to create the illusion of less density without changing the number of parameters. To do this, I allow for the possibility of offsets where each layer slides completely out of view, making it invisible. This allows for compositions of only the background colour, as well as simplified compositions where only a few layers are visible.
The problem with this from an ML perspective is that the parameters of the layers that are not visible are still in the training data; this is because the training data represents the instructions for making the image, not the image itself. The model therefore sees features for a layer that has no effect on what the viewer sees. I thought I would run another hyperparameter search where I zero out all the parameters for layers that are not visible. I reran an older experiment to test against and the results are promising.
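A minimal sketch of zeroing out hidden layers might look like this. It assumes a flat feature vector with a fixed number of parameters per layer and a separate per-layer visibility flag; both the layout and the way visibility is derived from the offsets are assumptions, not the actual ZF feature format.

```python
def zero_hidden_layers(features, visible, params_per_layer):
    """Return a copy of a flat feature vector with hidden layers zeroed.

    features: flat list of numbers, layer 0's parameters first.
    visible: one boolean per layer, True if the layer is on screen.
    """
    if len(features) != len(visible) * params_per_layer:
        raise ValueError("feature vector does not match layer layout")
    out = list(features)
    for layer, is_visible in enumerate(visible):
        if not is_visible:
            start = layer * params_per_layer
            out[start:start + params_per_layer] = [0.0] * params_per_layer
    return out


# Two toy layers of three parameters each; the second layer is off screen.
cleaned = zero_hidden_layers([0.2, 0.9, 0.5, 0.7, 0.1, 0.3], [True, False], 3)
print(cleaned)
```

With this, two compositions that look identical to a viewer also look identical to the model, regardless of what the hidden layers would have drawn.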
For this experiment I used the Twitter data (likes and retweets) alone to generate labels where ‘good’ compositions have at least 1 like or retweet. There are relatively few compositions that receive any likes or retweets (presumably due to upload timing and the Twitter algorithm). Due to this, I randomly sample the ‘bad’ compositions to balance the classes, leading to 197 ‘good’ and 197 ‘bad’ samples. The best model achieves an accuracy of 76.5% for the validation set and 56.6% on the test set. The best model achieved f1-scores of 75% (bad) and 78% (good) for the validation set and 55% (bad) and 58% (good) for the test set. The following image shows the confusion matrix for the test set. The performance on the validation set is very good, but that does not generalize to the test set, likely because there is just too little data here to work with.
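The balancing step can be sketched as random under-sampling of the larger class. This is a generic illustration with toy data; the function name, the fixed seed, and the 1/0 label encoding are assumptions.

```python
import random


def balance_by_undersampling(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal."""
    rng = random.Random(seed)
    good = [s for s, y in zip(samples, labels) if y == 1]
    bad = [s for s, y in zip(samples, labels) if y == 0]
    n = min(len(good), len(bad))
    good, bad = rng.sample(good, n), rng.sample(bad, n)
    balanced = [(s, 1) for s in good] + [(s, 0) for s in bad]
    rng.shuffle(balanced)  # avoid all "good" samples coming first
    return balanced


# Toy data: 10 "good" and 40 "bad" sample ids.
toy_samples = list(range(50))
toy_labels = [1] * 10 + [0] * 40
balanced = balance_by_undersampling(toy_samples, toy_labels)
print(len(balanced), sum(y for _, y in balanced))
```

Fixing the seed keeps the subset reproducible, but it also means each run discards the same ‘bad’ samples, so varying it across runs is one way to see how much the result depends on which ones were kept.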
I was just thinking about this separation of likes from attention and realized that since compositions with little attention don’t get uploaded to twitter, they certainly have no likes; I should only be comparing compositions that have been uploaded to twitter if I’m using the twitter data without attention to generate labels. The set used in the experiment discussed herein contains 320 uploaded compositions and 74 compositions that were not uploaded. I don’t think it makes sense to bother with redoing this experiment with only the uploaded compositions because there are just too few samples to make any progress at this time.
In this data-set 755 compositions were uploaded and 197 received likes or retweets. For the data-collection in progress as of last night 172 compositions have been uploaded and 86 have received likes or retweets. So it’s going to be quite the wait until this test collects enough data to move the ML side of the project forward.
The results from my second attempt using attention only to determine labels and filtering out samples with attention < 6 are in! This unbalanced data-set has much higher validation (74.2%) and test (66.5%) accuracies. The f1 scores achieved by the best model are much better also: for the validation set 36% (bad) and 84% (good) and for the test set 27% (bad) and 78% (good). As this data-set is quite unbalanced and the aim is to predict ‘good’ compositions, not ‘bad’ ones, I think these results are promising. I thus chose not to balance the classes for this one because true positives are more important than true negatives so throwing away ‘good’ samples does not make sense.
It is unclear whether this improvement is due to fewer bad samples, or whether the samples with attention < 6 are noise without aesthetic meaning. The test confusion matrix is below, and shows how rarely predictions of ‘bad’ compositions are made, as well as a higher number of ‘bad’ compositions predicted to be ‘good’.
Following from my previous ML post, I ran an experiment doing hyperparameter search using only the attention data, ignoring the Twitter data for now. The results are surprisingly poor with the best model achieving no better than chance accuracy and f1 scores on the test set! For the validation set, the best model achieved an accuracy of 65%. The following image shows the confusion matrix for the test set:
The f1 scores show that this model is equally poor at predicting good and bad classes: The f1 score for the validation set was 67% for bad classes and 62% for good. In the test set the f1 scores are very poor at 55% for the bad class and 45% for the good class.
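For readers who want to relate the per-class f1 scores to a confusion matrix, here is a minimal sketch. It treats “good” as the positive class; the counts in the example are made up for illustration and are not taken from the experiment.

```python
def per_class_f1(tp, fp, fn, tn):
    """Return (f1_good, f1_bad) from binary confusion-matrix counts.

    "good" is the positive class; tp/fp/fn/tn are counted relative to it.
    """
    def f1(hit, false_alarm, miss):
        denom = 2 * hit + false_alarm + miss
        return 2 * hit / denom if denom else 0.0

    f1_good = f1(tp, fp, fn)
    # For the "bad" class the roles swap: tn are its hits, fn are samples
    # wrongly predicted "bad", and fp are "bad" samples that were missed.
    f1_bad = f1(tn, fn, fp)
    return f1_good, f1_bad


# Made-up counts: a model that rarely predicts "bad".
good_f1, bad_f1 = per_class_f1(tp=9, fp=6, fn=1, tn=4)
print(f"good {good_f1:.2f}, bad {bad_f1:.2f}")
```

A model that leans toward predicting one class can keep that class’s f1 high while the other drops, which is why the two scores are worth reporting separately alongside accuracy.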
As I mentioned in the previous post, I think a lot of noise is added with incidental interactions where someone walks by without actually attending to the composition. Watching behaviour around it, I’ve determined that attention values below 6 are very likely to be incidental. I’m now running a second experiment using the same setup as this one except where these low attention samples are removed. Of course this unbalances the data-set, in this case in favour of the ‘good’ compositions (754) compared to ‘bad’ compositions (339). As there is so little data here I’m not going to do more filtering of ‘good’ results to balance classes. After that I’ll repeat these experiments with the Twitter data and see where this leaves things.
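The filtering step can be sketched as below. The `attention` and `label` field names are hypothetical, and the toy records only illustrate how removing incidental samples can shift the class balance.

```python
from collections import Counter


def drop_incidental(samples, min_attention=6):
    """Keep only samples whose attention value reaches the cutoff."""
    return [s for s in samples if s["attention"] >= min_attention]


def class_counts(samples):
    """Count samples per label to check how balanced the classes are."""
    return Counter(s["label"] for s in samples)


# Toy records; field names and values are made up for illustration.
toy = [
    {"attention": 2, "label": "bad"},
    {"attention": 5, "label": "bad"},
    {"attention": 7, "label": "bad"},
    {"attention": 30, "label": "good"},
    {"attention": 55, "label": "good"},
]
kept = drop_incidental(toy)
print(class_counts(toy), class_counts(kept))
```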
Now that I have the system running, uploading to Twitter, and have collected a pretty good amount of data, I’ve done some early ML work using this new data set! I spent a week looking at doing this as a regression (predicting scores) task vs a classification (predicting “good” or “bad” classes). The regression was not working well at all and I abandoned it; it was also impossible to compare results with previous classification work. I’ve returned to framing this as a classification problem and run a few parameter searches.