Following from my previous post, I’ve written code that extracts foreground objects from the background using the masks previously calculated with foreground segmentation. Since I’m using the tenFPS set, subsequent frames are quite similar to each other. The following is a small subset of extracted foreground objects corresponding to only about 2s of time. The redundancy is due to what we would recognize as the same object appearing in multiple frames. This gives a sense of the edge quality, which is tweaked a little with slightly different filtering operations to clean up edge boundaries and remove noise. In previous works from the Watching and Dreaming series I used a clustering algorithm to group objects deemed similar, which had a strong ephemeral effect! It may be worth exploring here, though, since I’m working only with foreground objects (not all objects in the frame, as in that previous work) and there is a lot of clear redundancy here.
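The extraction itself is simple once the masks exist. Something like this numpy sketch shows the idea (the actual code uses OpenCV and findContours for grouping; the names and shapes here are illustrative):

```python
import numpy as np

def extract_object(frame, mask):
    """Cut a foreground object out of a frame using a binary mask.

    frame: H x W x 3 uint8 image; mask: H x W bool array (True = foreground).
    Returns the masked object cropped to the mask's bounding box.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask, nothing to extract
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = frame[y0:y1, x0:x1].copy()
    crop[~mask[y0:y1, x0:x1]] = 0  # zero out background pixels
    return crop

# Tiny synthetic example: a 4x4 frame with a 2x2 "object"
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
obj = extract_object(frame, mask)
print(obj.shape)  # (2, 2, 3)
```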
I also did a version where I took all of the segments extracted from about an hour’s worth of time and layered them on top of each other:
This represents only the objects that moved or changed over time during this period. There are some issues with the edge quality here: the isolated segments (e.g. those floating away from other segments) seem to have a dark boundary. The trails of pedestrians and moving cars are quite interesting. I also made a version where I used the averaged background (without moving objects) as the background for the image above:
This follows from my work using a much less interesting visual data-set from my 2012 residency as part of the New Forms Festival where I showed work in progress on my Dreaming Machine.
[B. D. R. Bogart. The Zombie Formalist: An Art Generator that Learns. In Richard William Allen, editor, Art Machines 2: International Symposium on Machine Learning and Art 2021 Proceedings, pages 165–166, 2021.]
In writing my proposal for a Grunt Gallery show, I ended up making a few more explorations. I was curious how the colour in the frame would be manifest so I took the code from the MPCIP project and used that on one of the frames from the 24h set. The results are quite nice and really do show off the intensity of colour during the summer days. I ended up doing 8000 training iterations with a neighbourhood size of 250 (top) and 2000 iterations with a neighbourhood size of 1500 (bottom). At some point I’ll adapt this code so that the size of the neighbourhood is different for each pixel and use the previous average of foreground extraction to determine the relative size. I’m quite excited to see those results!
I also did a quick test using the foreground segmentation masks as an alpha channel to pull out (create) moving objects from the corresponding frame. It’s quite messy, as I did not do any filtering for noise, nor did I use the findContours code to create cleaner object bounds. I’m thinking about my emphasis on boundary-making and whether I should investigate using these raw masks without findContours for ‘object’ creation. The aesthetic would certainly be more chaotic and textural… There is also the problem of determining how to group white pixel blobs together, which is what findContours does.
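For reference, using the raw mask directly as an alpha channel is nearly a one-liner in numpy (illustrative shapes; the actual test works on the full frames):

```python
import numpy as np

def mask_to_rgba(frame, mask):
    """Use a foreground mask directly as the alpha channel of an RGBA image.

    frame: H x W x 3 uint8; mask: H x W uint8 (0 = background, 255 = foreground).
    No noise filtering or contour grouping: raw mask straight to transparency.
    """
    return np.dstack([frame, mask])

frame = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.array([[255, 0], [0, 255]], dtype=np.uint8)
rgba = mask_to_rgba(frame, mask)
print(rgba.shape)  # (2, 2, 4)
```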
I ran the foreground segmentation algorithm described in the previous post, and the results are looking interesting! Over the 24-hour period, the average of all the foreground segmentation results is quite striking:
These remind me a little bit of the long film exposures I made during the big 2003 Eastern blackout, where the city is lit from activity on the street rather than from buildings. There is certainly a sense of flow with traffic movement, and I love the pedestrian glow behind the black stillness of the street sign near the middle of the frame. The tree movement is also very interesting, and the other bright details are the traffic light (which swings a little and changes colour) and the flashing lights in the retail windows on the corner of Scotia and Broadway. After the brightness adjustment, the image is not very smooth:
This is because the 64-bit float image gets a lot of degradation on conversion to a mere 8-bit image with values of only 0–255. I’m now running the averaging process again, except now saving the file as a 16-bit-per-channel PNG with a little contrast enhancement, and also saving the raw float data. Since one of the ideas with these images is to use them to determine the SOM neighbourhood size (see previous post), I can use the floats directly rather than using an image file as an intermediary. The grid pattern above is actually the underlying macro-blocks in the compression algorithm, which are more visible in areas of lower contrast (i.e. the pavement and the sky).
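The accumulation and quantization issue can be sketched like this (a toy numpy example, not the actual averaging code; it just shows why a 16-bit-per-channel file preserves so much more of the float average than an 8-bit one):

```python
import numpy as np

# Running mean of many frames kept in float64, then quantized.
rng = np.random.default_rng(0)
acc = np.zeros((8, 8), dtype=np.float64)
n = 1000
for i in range(n):
    frame = rng.random((8, 8))          # stand-in for a decoded frame in [0, 1]
    acc += (frame - acc) / (i + 1)      # incremental mean, no overflow

img8 = np.round(acc * 255).astype(np.uint8)        # 256 levels
img16 = np.round(acc * 65535).astype(np.uint16)    # 65536 levels

# Quantization error is ~256x smaller at 16 bits per channel.
err8 = np.abs(acc - img8 / 255.0).max()
err16 = np.abs(acc - img16 / 65535.0).max()
print(err8 > err16)  # True
```

Keeping the raw floats avoids even the 16-bit rounding entirely, which is why using them directly for the SOM neighbourhood sizes makes sense.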
In addition to the averages of the foreground segmentation, I also made averages of the daylight period of the 24-hour set, and of the night after the 24-hour set. I did not use the early morning of the 24-hour set because construction happened overnight; the vehicles were static for quite long periods and remained visible in the average:
The following images are the day and night averages:
These both have quite a dreamy quality to them. The trees in the night image are quite a bit more distinct because there are fewer frames used to make them (there are many more daylight hours than night-time hours here); also, there could have been more movement during the day (as shown in the foreground segmentation at the top of this post). Averaging over the movement of the sun over quite a long period (on a bright sunny day) ends up looking like an overcast day, but with a strangely blue sky. Revisiting this with the long-term time-lapse will be quite interesting, with all the variation in weather. I could even average only the frames where the sun is at a particular position in the sky, where moving objects would disappear but the shadows would remain.
Even doing some of these processes live with a feed to the screen would be interesting: an always-updating long exposure shown over a long period, or a representation of the movement of the intersection over a long period of time.
I started doing some explorations using foreground segmentation on the 7Day set, but the difference between frames was too large and it did not work very well, so I went straight to the 24 hour set! These old algorithms are quite fast so things are moving along quite well. It took a bit of relearning; I have not used these particular algorithms since my PhD. The gist is that the algorithm learns what the “background” is by tracking what has not changed over time (akin to a long exposure) and marks the differences as “foreground”. This is a little clunky because objects are not independent of their contexts and things like shadows also change over time. Aside: A key philosophical aspect of my work is the question of objects; I understand objects as dynamic constructions formed by boundary-making. In this case the machine provides a naive boundary and creates ‘objects’ accordingly.
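The gist can be sketched in a few lines of numpy. This is a deliberately naive version of the idea (the actual algorithm is more sophisticated and also models shadows, as described below); parameter names and thresholds are illustrative:

```python
import numpy as np

def foreground_masks(frames, alpha=0.05, thresh=30):
    """Naive background subtraction: the background model is a slowly
    updated running average (a 'long exposure'); foreground is wherever
    the current frame differs from that model by more than `thresh`."""
    bg = frames[0].astype(np.float64)
    masks = []
    for frame in frames[1:]:
        diff = np.abs(frame.astype(np.float64) - bg)
        masks.append((diff > thresh).astype(np.uint8) * 255)
        bg = (1 - alpha) * bg + alpha * frame  # fold frame into the model
    return masks

# Static 4x4 scene with one pixel that suddenly changes
frames = [np.full((4, 4), 100, dtype=np.uint8) for _ in range(3)]
frames[2] = frames[2].copy()
frames[2][0, 0] = 250
masks = foreground_masks(frames)
print(masks[1][0, 0], masks[1][1, 1])  # 255 0
```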
The following image shows a single frame where the white areas are “foreground” and the black areas “background”. The scene is quite readable, and the little points of light are small changes (i.e. noise) in the image. The light on the mountains and sky has changed quite a lot, as have the leaves in the trees on the right and the vehicles on the road. The “grey” areas are actually what the algorithm thinks are shadows, which is not very accurate for this complex scene.
My next step was to filter these results to remove the noise, leaving only the large changes. Looking at these images gave me pause, though, because the texture is actually very interesting. I proposed this project as a continuation of this piece. In that work, there is a clear horizon (a line where the abstraction process begins). The frame for these images doesn’t offer a clear choice for how to manifest the progressive abstraction. Aside: abstraction here is actually the reorganization of pixels to dissolve the boundaries that construct objects.
I’ve been thinking about how I could specify a horizon for this project. The “horizon” could be a division of natural and unnatural areas; it could also be a division of static and dynamic areas. What if the areas where movement is most likely are the more photographic (corresponding to the bottom of the previous piece) and the areas of stability are more painterly (corresponding to the middle to top of the previous piece)? I’m quite unsure what this would look like. When I first proposed this project for the MPCAS, I was even thinking about using a depth camera where ‘objects’ that are further away would be more abstract and the ‘objects’ nearer more photographic. In the end, I decided it was best to use the existing camera.
Following this idea of considering in which areas of the image movement is more likely, I thought I would average the 1000 test frames of foreground segmentation. This should make areas of movement brighter and keep areas of stillness dark. Actually, this is exactly what is happening in my old piece from 2001 (different algorithms, but still inter-frame differences summed up over time)! Here is the result for the 1000 foreground-segmented frames:
The following is the same image except with the levels modified:
1000 frames at 10fps is less than 2 minutes, but the results are already interesting. For these averages I appreciate the details and textures and I’m thinking about not filtering them to remove noise. If I decide I want to actually pull out the objects, I can do the erode filtering at that stage. I started running this foreground segmentation process on the whole 24h set (should only take 5 hours) and I also started running 24hour (daylight) and 24hour+ (overnight) averages. I’ll report these results soon!
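For reference, the erode filtering mentioned above amounts to binary erosion. A numpy-only sketch of a 3×3 erosion follows (in practice OpenCV’s cv2.erode does this; the version here is just illustrative):

```python
import numpy as np

def erode3x3(mask):
    """Binary erosion with a 3x3 structuring element: a pixel survives only
    if its whole 3x3 neighbourhood is foreground. Isolated noise pixels
    (smaller than the element) vanish."""
    padded = np.pad(mask.astype(bool), 1, constant_values=False)
    out = np.ones_like(mask, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= padded[1 + dy:padded.shape[0] - 1 + dy,
                          1 + dx:padded.shape[1] - 1 + dx]
    return out.astype(np.uint8) * 255

mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:4, 1:4] = 255   # 3x3 blob
mask[0, 4] = 255       # single-pixel noise
eroded = erode3x3(mask)
print(eroded[2, 2], eroded[0, 4])  # 255 0
```

Note that erosion shrinks real objects too (only the centre of the 3×3 blob survives), which is why it can be deferred to the object-extraction stage rather than applied to the averages.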
Since the previous failed capture on the Summer Solstice proper I was able to successfully capture frames at 10fps. CPU and disk usage were quite low so I would have been able to get away with more, but really I just wanted a fast enough frame-rate to get a sense of flow and movement. Even at 10fps, a 24 hour period ends up being almost one million frames. As I have not yet determined exactly how I’ll process the video, even this may be too much.
The summer solstice was on June 20th, and the successful capture for the first midnight to midnight period ended up happening on June 25th. The length of the day on June 20th was 16:14:59 and on June 25th it was 16:13:59, so one minute shorter than the actual solstice is not too bad! I adjusted the “noise reduction” on this run and that seems to have removed the issue with overly soft focus. The images below are every 10th frame (1fps) of a subset of frames to show movement. This already flows quite nicely compared to the 1 frame per minute time-lapse. I’m going back to the smaller 7 day set for aesthetic exploration and set this near real-time version aside for now.
Today the Zombie Formalist returned from the Art Machines 2 exhibition in Hong Kong. The video of my talk will be available in the fall. I’m working on getting the revisions for the second Zombie metal enclosure and thinking through what changes need to be made for the wood frame. The crate worked well and the Zombie was undamaged in transport!
After getting the drive back and being able to get a close look at the 4 million image files, I made a disappointing discovery. The (old) laptop was not able to keep up with dumping 30 frames per second to disk. Unfortunately this did not manifest in a small number of files being written, but rather in the same frame being written over and over again! So there seemed to be the right number of files, and paging down through them they appear to change over time; looking at the details, though, only ~150 images were actually unique out of the 4 million written to disk. Suffice it to say this is disappointing. I thought maybe some of these files may be useful (I actually have not looked at them yet), but because the machine was attempting to keep up with the stream, the lag between different frames being written to disk increased steadily over time, as shown in the plot below. By the time I stopped the process, the gap between different files being written to disk was over 4 hours.
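Checking how many of the dumped frames are actually distinct is straightforward with content hashing; a stdlib sketch (the demo files and paths are illustrative, not the actual capture):

```python
import hashlib
import tempfile
from pathlib import Path

def count_unique(paths):
    """Count distinct files by content hash; duplicate writes of the same
    frame collapse to a single digest."""
    digests = set()
    for p in paths:
        digests.add(hashlib.sha256(Path(p).read_bytes()).hexdigest())
    return len(digests)

# Demo with temp files: three files, two of which have identical bytes
tmp = tempfile.mkdtemp()
for name, data in [("a.jpg", b"frame1"), ("b.jpg", b"frame1"), ("c.jpg", b"frame2")]:
    Path(tmp, name).write_bytes(data)
paths = sorted(Path(tmp).iterdir())
print(count_unique(paths))  # 2
```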
Once I realized this (yesterday) I thought I should get in the room and try again with a faster machine and a more reasonable 5–10FPS rather than 30. The Grunt Gallery tech, Sebnem, was nice enough to give me a ride and I spent about 4 hours today trying to get the shuttle to run the version of ffmpeg with NDI support. I could not figure out what was going wrong, but on that specific shuttle machine (running the same version of the same OS as the laptop, both freshly installed) the NDI library fails to initialize. I even copied the same binaries over from the laptop and they don’t work. Then I tried the example files that come with the NDI SDK and even those don’t work! It looks specifically like the function NDIlib_initialize() returns 0 on that machine. I checked the NDI libraries; they are properly installed. I checked the compiled sample file, and ldd shows that it is indeed linked with the NDI library. I finally gave up and I’ll post this issue to the NewTek forum. I’ve tried this on two additional machines and they both seem to work fine; now that I think about it, the shuttle in question is an AMD machine and all the others are Intel… hmmm
I hope my account for the forum will be approved quickly so I can post this and get a response. As I’m trying to get this capture done close to the solstice I’m now inclined to use my faster shuttle PC, which I just confirmed does not show this issue, and give that a try rather than trying to debug in the server room…
I thought I would do a couple more explorations of the 7 days of frames following the previously posted average. The first is recreating the same frame where each column is actually extracted from a different time. I tested this using the sample of different lighting conditions from this post, and did another test using the first 1920 frames from the full 7Day set, below it. I think the latter results are quite uninteresting due to the amount of change minute to minute (i.e. the change of traffic manifesting in vertical lines in the image).
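The column-per-frame recomposition can be sketched like this (a toy numpy version; the frame sizes and the frame-to-column mapping are illustrative):

```python
import numpy as np

def column_composite(frames):
    """Rebuild a single frame where column x is taken from frame x:
    the image's horizontal axis becomes a time axis."""
    h, w = frames[0].shape[:2]
    assert len(frames) >= w, "need at least one frame per column"
    out = np.empty_like(frames[0])
    for x in range(w):
        out[:, x] = frames[x][:, x]
    return out

# 3 synthetic 2x3 grayscale frames with constant values 0, 1, 2
frames = [np.full((2, 3), t, dtype=np.uint8) for t in range(3)]
comp = column_composite(frames)
print(comp[0].tolist())  # [0, 1, 2]
```

This is why rapid minute-to-minute change (like traffic) shows up as vertical lines: each column captures a different moment.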
The image above shows the mean of all 7 days of video. The effect of the night-time frames make it look quite different than most long exposures. This makes me wonder what a month, 6 month, or year+ means would look like…
I’ve visited the Public Art Control Room a few times now and set up a laptop and 5TB disk array for image capture. After doing a few tests to make sure my two ideas are possible, I’m moving forward with the “final” captures. My two ideas are (1) a time-lapse with one capture every minute for ~7 days and (2) a 30fps capture for 24 hours. I was not sure the old laptop would be up to the job, but I was able to capture to disk in real-time; though I did have to quit the window manager and run ffmpeg from the console to divert all available resources to the task. For the latter, I’ll be capturing on the summer solstice from midnight to midnight; the capture will actually be 48 hours to avoid the awkward hours and I’ll only keep frames captured during the solstice. As I’m saving individual JPEG frames, the 48 hour test takes close to 3TB of space. The laptop is now deleting the before and after midnight files, which takes a very long time; thus, I was unable to determine if I’ll have the space to keep the backup 24h test and the 48h solstice capture; I’ll find that out on Saturday morning when I’m back in the room to start the solstice capture.
The following images are a snap-shot of the diversity of weather over the 7 days of the capture. One thing I changed from the previous test was turning off the camera “sharpening”, which seems to give everything a bit of a blurry look. I’m not sure how this will work for possible segmentation experiments, but it does seem to cut down on some of the JPEG artifacts. I should also explore changing the compression quality for the 7Day capture since it takes up so much less space already; the 24-hour capture will need to stay at this quality, though.
Looking back on these images they really do look blurry; could the camera be having a focus issue? In comparing this to the previous test, the image quality is clearly less sharp and it seems “sharpening” is not actually a post-process for sharpening, but a smoothing post-process. Too bad about that, since I’m not likely to have the amazing diversity of weather over a 7 day period any time soon. I’ll have to keep an eye on the weather and see if I have another opportunity. It could also be the aperture of the camera… Also the exposure is automatic, but the sky very often gets totally blown out (e.g. frame 4212, middle left). The exposure of the street does look good though, so it is probably just exposing for the majority of the frame, exposing for the sky could make the street very dark; if I have to choose between exposed street and sky, I suppose the street makes more sense. I could also try the camera’s dynamic range mode. The sun certainly does shine directly into the camera at times (e.g. frame 9837, bottom left). The following video shows the entire time-lapse.
Last week I got my first chance to get into the server room for the MPCAS. This is the room with all the hardware that drives the screen and also the outdoor PTZ camera. The room is smaller than I expected, about the size of a closet with a server rack on wheels. The space is so small that we have to prop the door open and push the rack slightly out of the room to open the access terminal! After some fiddling last week, I was able to get an old laptop onto the local network and access the camera feed! Unfortunately, I was unable to control the camera pan/tilt, resulting in a stream of images like this:
The camera is a BirdDog A200 outdoor PTZ IP camera. Since working on the Dreaming Machine, I have not looked at surveillance camera technology, and it seems a newer protocol, NDI (Network Device Interface), is quite popular. While this is a royalty-free protocol, working with it does not seem that straightforward. I found that ffmpeg does support these streams “built-in”, but only in version 3.4 specifically. Newer versions do not support NDI because NewTek (the company that created NDI) distributed ffmpeg without adhering to the GPL. Downloading the ffmpeg source and the NDI SDK allowed me to compile ffmpeg and access the video stream.
This morning Sebnem (the Grunt Gallery tech) and I got back into the room to check on the previous test. Unfortunately only a few hours of frames were saved, as a technician with the screen company (who installed the camera) did something to the camera that required a power cycle. Once we were on site and cycled the power for the camera, we were able to control it! I also noticed that ffmpeg does seem to recover and write new frames in the case that the camera is inaccessible, so that is good to know for the future. After a little exploration I settled on the following frame as a starting point:
Compare this to my description of what I imagined the camera would see from my initial proposal:
I envision a frame showing the city-scape to the East with dynamism of Kingsway below, Kingsgate Mall in the lower middle, the growing density of the area beyond in the middle upper and the mountains and sky above. The camera’s view will contain a rich diversity of forms, colours and textures that are both natural and artificial.
Kingsgate Mall is out of frame to the right behind the trees, but the dynamism of the busy intersection of Kingsway and Broadway below contrasts nicely with the trees, mountains and sky. I also saved a “preset” view without Kingsway below and with more sky, but I think this composition is more balanced. I ended up using as many manual settings as possible, but with this much contrast (between the shadows in the trees and the bright sky) the sky looks quite blown out and some settings may need to be tweaked. There should be some rain coming at the end of this week, so I should get a better sense of changing light through day, night and variable weather when I access the footage next week. I’m also quite curious about the image quality at night.
Following from the last couple of ML posts, I’ve been looking at the Integration G data-set. This set has 1734 uploaded compositions (only slightly fewer than the F data-set). Interestingly, without the filter-by-in-person-attention mechanism (face detection) to determine if a composition is “good” enough to be uploaded, the “good” and “bad” classes are more balanced, i.e. about half the compositions are liked or retweeted. I presume, as I’ve observed, that fewer likes happen overnight in North America; hopefully the Hong Kong exhibition will increase the number of followers in the Eastern hemisphere. I should look at the distribution of engagements overnight in North America.
If “bad” means no likes or retweets (RTs) and “good” means at least one, then there are ~1000 “good” and ~700 “bad” compositions. Since the classes are fairly balanced I did an initial experiment without re-balancing. Since the aim is to detect “good” compositions, it does not make sense to balance by throwing away those compositions. The results are OK, but the f1-scores are quite inconsistent for “good” and “bad” classes. The average test accuracy was 57% with a peak of 60%. The f1-score for the best performing model from the best search was 72% for “good” but only 29% for “bad”. I suspect this is due to the unbalanced classes and the test split used in that search being lucky (having more “good” or similar to “good” compositions). The best performing model from the worst search attained a test f1-score of 62% for “good” and 43% for “bad”. (The difference between these two searches is only initial weights and different random train/val/test splits).
I tried a different threshold for “good”, where compositions with only 1 like or retweet were considered “bad” and those with more than 1 were “good”. Compositions with no likes or retweets were removed. This resulted in 1003 training samples, a significant reduction, with very balanced classes (503 “good” and 500 “bad”). While this resulted in slightly more balanced f1-scores, it also resulted in lower average accuracy and poorer f1-scores for the “good” class. The average accuracy on test sets was 55% (compared to 57%) with a peak test accuracy of 58% (compared to 60%). This seems consistent, but the details bear out a lack of improvement; for the best model in the best search, the f1-scores are 50% for “good” and 60% for “bad”, meaning a significant reduction in f1-score for “good” compositions from the previous 72%. The best model from the worst search was very similar, with f1-scores of 58% for “good” and 52% for “bad”. (The difference between these two searches is only initial weights and different random train/val/test splits.)
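For reference, the per-class f1-scores discussed above can be computed from scratch. This toy example (with made-up illustrative labels, not my actual data) shows how a model that over-predicts the majority class still gets a decent “good” f1 while the “bad” f1 suffers:

```python
def f1_per_class(y_true, y_pred, label):
    """Precision/recall/F1 for one class; with imbalanced classes a model
    can score well on the majority class and badly on the minority one."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier that mostly predicts "good" looks fine on the "good" f1
# but poor on "bad" -- mirroring the imbalance effect described above.
y_true = ["good"] * 6 + ["bad"] * 4
y_pred = ["good"] * 8 + ["bad"] * 2
print(round(f1_per_class(y_true, y_pred, "good"), 2))  # 0.86
print(round(f1_per_class(y_true, y_pred, "bad"), 2))   # 0.67
```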
The conclusion is that there is no improvement in throwing away compositions with no likes and retweets on the assumption that they were seen by fewer people. So we’re in about the same place, hovering around 60% accuracy. The current integration test, H, includes a more constrained set of compositions (only circles with only 3 layers); this reduces the number of parameters and constrains the compositions significantly. My hope is that this results in better performance, but I’ll have to wait until it collects a similar number of samples. Before crating the ZF, there were 342 compositions generated during that test, so there will be quite a wait to generate enough data to compare, especially considering the travel time to and from HK. So I’m going to set the ML aside again for now.
The following photos show the assembly process for The Zombie Formalist! This version will test for a week before getting crated and shipped off to Hong Kong for Art Machines 2! It’s more restrained (it uses only Twitter engagement and generates only circles with three layers) in the hopes that the ML part works better with more constraint in the training data.
I got the metal enclosure back from the fabricator! There were a few issues: the camera mounting holes were not quite in the correct position, the Jetson board mounting holes were reversed, and I did not take into account the length of the power connectors, so the power supply does not fit as expected. My tech Bobbi had the tools, so we were able to make the modifications and I test-fit the components. The gauge of the metal was thicker than I (or the designer) was expecting, so the next unit will probably be a thinner gauge.
Also, the Zombie Formalist was accepted for the Art Machines 2 conference in Hong Kong in June 2021! Robert is currently working on the design of the wood frame and I’m working on getting the Jetson board to interface with Bobbi’s button interface.
After all the issues with ‘lucky’ results, I wanted to go back and confirm my 70% best-scenario results were not lucky! The good news is that those results are valid! I trained using the hand-labelled data with the same hyper-parameter search I’ve been using for the recent experiments, and the results are great! The mean test accuracy was 71% and the average f1-scores for test sets ranged from 68% to 73%. Thus I’ll only be aiming to get near 70% at best; even those results are unlikely for the Twitter data, considering its labels collapse the multiple aesthetics of the Twitter (or in-person) audience. The current best performance on Twitter data is ~60% accuracy.
After a few more experiments hoping to get a test accuracy close to the 63% I achieved recently, I realized I had not tried to run that same experiment again with the same hyperparameters. The only difference would be the random seeds used for initial weights and which samples end up in the test and validation sets. So I re-ran the recent test (using 3 as the Twitter score threshold) and the best performing model, in terms of validation accuracy, achieved a test accuracy of… 43%. Validation accuracy was quite consistent: previously 73% and now 71%. So the lesson is that I need to run each experiment multiple times with different test sets to know how well it is *actually* working, because apparently my best results are merely accidentally generalizable test sets or favourable initial weights.
The next step is to go back to the F data-set, filter for only the Twitter-uploaded compositions, and see how much variation there is in test accuracy when using low Twitter score thresholds. It is certainly an issue that a composition may have no likes not because it’s unliked, but because it was not seen by anyone on Twitter. Perhaps I should consider compositions liked by only one person “bad” and those with greater than one “like” good; that way I’m only comparing compositions that have certainly been seen!
This is the design going to the fabricator! It’s nice that things are finally moving after all the challenges finding a new designer and fabricator during COVID. The main change to this design is that the fabricator requires a 3/8″ gap between all holes and bends, which means shifting things quite a bit. It also means changing the top, where the camera and buttons are mounted.
I also thought I would take this chance to double-check my calculations for the camera angle, and it’s good I did, because they were incorrect! I had interpreted the camera angle as being 70° horizontal, but it was actually diagonal, so I had to recalculate the field of view for the sensor to make sure the monitor does not block it. Short version: the vertical angle of view was 45°, not the 35° previously specified.
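The recalculation amounts to converting a diagonal angle of view to a vertical one via the sensor’s aspect ratio. A sketch follows; note that the 4:3 aspect ratio is my assumption here (it reproduces roughly the 45° figure, whereas a 16:9 sensor would give about 38°):

```python
import math

def vertical_fov(diagonal_fov_deg, aspect_w, aspect_h):
    """Convert a diagonal angle of view to the vertical angle using the
    sensor's aspect ratio: tan(v/2) = (h / diagonal) * tan(d/2)."""
    diag = math.hypot(aspect_w, aspect_h)
    half_v = math.atan(aspect_h / diag * math.tan(math.radians(diagonal_fov_deg / 2)))
    return math.degrees(2 * half_v)

# 70 degree diagonal on an assumed 4:3 sensor gives roughly the 45 degrees
# mentioned above; a 16:9 sensor would give about 38 degrees.
print(round(vertical_fov(70, 4, 3), 1))   # 45.6
print(round(vertical_fov(70, 16, 9), 1))  # 37.9
```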
Following from my previous post, I used the same approach to change the thresholds for how much Twitter engagement is required for a composition to be “good”. The following table shows the results, where the “TWIT Threshold” is the sum of likes and RTs for each composition. Of course, the increasing threshold decreases the number of “good” samples significantly; there are 880 “good” samples at Threshold 1, 384 at Threshold 2, and 158 at Threshold 3. (This is comparable to the number of samples using attention to determine labels.) The small number of samples at high thresholds is why I did not try thresholds higher than 3.
Interestingly, the results show the opposite pattern to that observed when using attention to generate labels: here, test accuracy increases as the threshold increases. It seems Twitter engagement scores are actually much more accurate than those using the attention data. It makes sense that explicitly liking and RTing on Twitter is a better signal for “good”, even though it collapses many more people’s aesthetics. Indeed, some would argue there are global and objective aesthetics most of us agree on, but I’m less convinced.
I also did a series of experiments using the amalgamated data-set (where the ZF code changed between subsequent test sessions) with the same Twitter thresholds (1399 “good” in Threshold 1, 560 in Threshold 2, and 228 in Threshold 3); these showed only a 2% difference in test accuracy, peaking at 58%. Another experiment was a proxy for a single-style ZF that would generate only circles, for example. This would reduce some of the feature-vector parameters and potentially increase accuracy, as it would be an “apples to apples” comparison for the audience. It also reduces the number of samples since, for example, “circles” are only 1/3 of a whole data-set. Doing this for the F integration test resulted in a best accuracy of around 60% (where Threshold 3 has 129 “good” samples). I considered doing the same with the amalgamated training set, which contains 1438 circle samples that were uploaded to Twitter, compared to the 898 included in the most recent integration test; looking back at the amalgamated data-set, though, it actually has about the same number of circle compositions with high Twitter scores as the F data-set, so there is no point in going back to that for more samples!
Through all of this it seems clear that online learning of viewer aesthetics from scratch would take a very very long time and perhaps shipping the project with a starting model based on Twitter data collected to date is the best approach. The Zombie Formalist has been on Twitter for about a year and over that time generated 15833 compositions, only slightly more than my initial hand-labelled training set of 15000 compositions, for which my best test accuracy was 70% (but I’ve done some feature engineering since then).
Looking at my data, I noticed that there were quite a few weak compositions in the top-50 greatest-attention set for the still-collecting F integration test. Some of these were due to outlier levels of attention caused by a false-positive face detection in the bathroom; others seem to be either a change of heart, or my partner’s aesthetic. Since there seemed to be some quite poor results, I wondered about changing the attentional threshold used to generate labels, so that compositions are “good” only if they received a lot of attention. The result is that the higher the threshold, the fewer the samples and the poorer the generalization:
Test Set Accuracy:
Next I’ll try the same thing with a few different thresholds for the Twitter engagement (likes and retweets). I have lower expectations here because there is potentially much greater variance in the aesthetics preferred by the Twitter audience. At the same time, the Twitter audience is more explicit about their aesthetic since they need to interact with tweets.
Since I’ve been having trouble with generalizing classifier results (where the model achieves tolerable accuracy on training, and perhaps validation, data but poorly on test data) I thought I would throw more data at the problem; I combined all of the Twitter data collected to date (even though some of the code changed between various test runs) into a single data-set. This super-set contains 12861 generated compositions, 2651 of which were uploaded to twitter. I labelled samples as “good” where their score was greater than 100 (at least one like or RT and enough in person attention to upload to twitter). After filtering outliers (twice the system “saw” a face where there was no face, leading to very large and impossible attention values) this results in 1867 “good” compositions. When balancing the classes, the total set ends up with 3734 “good” and “bad” samples. Still not very big compared to my hand-labelled 15,000 sample pilot set, which contained 3971 “good” compositions. The amalgamated super-set was used for a number of experiments as follows.
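The labelling, outlier filtering, and balancing steps can be sketched like this (the field names, the attention cap, and the tiny synthetic samples are all illustrative, not my actual data or thresholds):

```python
import random

def label_and_balance(samples, score_threshold=100, attention_cap=10_000, seed=0):
    """Label compositions 'good' when their score exceeds the threshold, drop
    outliers with impossible attention values (false-positive faces), then
    downsample the larger class so 'good' and 'bad' are balanced."""
    kept = [s for s in samples if s["attention"] <= attention_cap]
    good = [s for s in kept if s["score"] > score_threshold]
    bad = [s for s in kept if s["score"] <= score_threshold]
    n = min(len(good), len(bad))
    rng = random.Random(seed)
    return rng.sample(good, n), rng.sample(bad, n)

# Synthetic samples: 5 good, 3 bad, and 1 attention outlier
samples = (
    [{"score": 150, "attention": 50} for _ in range(5)]
    + [{"score": 10, "attention": 50} for _ in range(3)]
    + [{"score": 150, "attention": 99_999}]
)
good, bad = label_and_balance(samples)
print(len(good), len(bad))  # 3 3
```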
Since my past post on ML for the ZF, I’ve been running the system on Twitter and collecting data, the assumption being that the model’s lack of ability to generalize (work accurately on the test set) is due to a lack of data. Since the classes are imbalanced (there are a lot of “bad” compositions compared to “good” ones), I end up throwing out a lot of generated data.
In the previous experiment I balanced classes only by removing samples that had very low attention. I considered these spurious interactions and thought they would just add noise. That data-set (E) had 568 good and 432 bad samples. The results of this most recent experiment follow.