Video of a dream (imaginary sequence generated from feedback in predictive model):
After some experimentation with LSTM topologies, I ended up with an 8-layer network with 32 LSTM units per hidden layer. These networks take a lot longer to train, and the MSE for my 10,000-iteration test was 0.0201 (worse than other topologies). The amazing part is that using the feedback mechanism to reconstruct the sequence, scene transitions are preserved! In my previous single- and 4-layer LSTM tests, the scene changes were not reconstructed using feedback in the model. The image below shows the results.
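For reference, here is a minimal sketch of that topology in Keras. Only the depth (8 layers) and width (32 units) come from the test above; the sequence length, feature count, loss, and optimizer are placeholder assumptions.

```python
# Sketch of an 8-layer LSTM stack, 32 units per hidden layer (Keras, TF back-end).
# n_steps and n_features are hypothetical; adjust to the actual input vector.
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

n_steps, n_features = 50, 27  # assumed sequence length and feature count

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(n_steps, n_features)))
for _ in range(7):  # seven more hidden layers, eight in total
    model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(n_features)))  # one predicted vector per time-step
model.compile(loss='mse', optimizer='rmsprop')  # assumed loss/optimizer
```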
I thought I would try training without using the size of regions as features. The macro structure is quite nice, but it did not converge any better or faster than the runs using all the features. I do like the even distribution of the different-sized segments over the composition. I think the previous versions are likely best, but they would need to be rendered larger (not possible on my current hardware) so that the large number of percepts do not overlap so much.
I wanted to do a test with a large number of segments spread evenly over the set of all segments to represent the palette of the whole film; the following image and details show the result. Now that I'm using such a large number of percepts, I'm noticing there is a dark outline around most percepts. This seems to be an anti-aliasing effect, and I'm testing a version of the collage code that disables it. Due to the large number of segments, I used a relatively small number of training iterations (approximately 5 million) and thus the organization is not very good. Still, the results are interesting and quite painterly. In my next test I'll go in the other direction and use a smaller number (100,000) of percepts evenly distributed over the set of all segments.
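The anti-aliasing test amounts to compositing each segment without interpolation at its edges. A sketch of the idea, using OpenCV as a stand-in for the actual collage code (file name and scale factor are hypothetical):

```python
# Scale a segment with nearest-neighbour interpolation so its alpha edge is not
# smoothed; smoothed alpha edges blend toward black, which may be the source of
# the dark outlines described above.
import cv2

seg = cv2.imread('segment.png', cv2.IMREAD_UNCHANGED)  # RGBA segment (hypothetical)
scaled = cv2.resize(seg, None, fx=0.5, fy=0.5,
                    interpolation=cv2.INTER_NEAREST)   # no edge smoothing
```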
I’ve been tinkering with making collages from raw segments, rather than percepts. These have not been clustered or averaged and are simply cut out of Blade Runner frames without further processing. Thanks to ANNetGPGPU changes I’ve been able to generate some quite large-scale collages. The one below (and its detail underneath) is generated from 1 million image segments (the 1 million largest of the 30 million extracted). They take a lot of training (10 million iterations here), and still seem somewhat disorganized. I think there is potential here, but because of the number of (large) percepts, I think my max GPU texture size (16384 × 16384) is a little small. This leads to a lot of overlap between segments, which does look quite interesting up close (see detail) but is perhaps a little too dense. It’s possible that at 48″ square (as intended) that rich texture could make the overall composition successful.
I am not very happy with the lack of diversity of colours; this is because there is an over-representation of a few similar regions segmented from subsequent frames. I’m currently training a 6 million segment version using a stride (keeping every 5th segment) that will hopefully result in an image more representative of the whole time-line. In the long term, the best approach may be a stride based on frame numbers, but that information is not preserved in the current implementation.
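The stride itself is trivial; a sketch, assuming the segments are stored in extraction order (which roughly follows the time-line):

```python
# Keep every 5th segment: 30 million extracted segments -> ~6 million kept,
# spread across the whole extraction order rather than bunched in a few shots.
segments = load_segments()  # hypothetical loader returning a list in extraction order
kept = segments[::5]
```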
The following images show a comparison of three modes of visual mentation all using the restricted set of 1000 percepts. The top image is the “Watching” mode where percepts are located in the same location as they are sensed. The middle image is something like “Imagery” where the position of percepts is random but constrained by the distribution of percept positions in Watching, and therefore still tied to sensation. The bottom image is a first attempt at dreaming decoupled from sensory information. Percepts are positioned randomly, but constrained by the distribution of percepts as learned by an LSTM network. The position in time and space of each percept is wholly determined by the LSTM predictive model.
What keeps this from being ‘real’ dreaming (according to the Integrative Theory) is that the sequence of distributions generated by the predictive model is seeded by every time-step in Watching (keeping it from diverging significantly from Watching). In real dreaming, a single time-step would seed a feedback loop in the predictive model to generate a sequence that is expected to diverge significantly from Watching. I think these are working quite well; the generation of positions from distributions certainly softens a lot of the structure in Watching, but holds onto some resemblance. There is some literature on the possibility of mental imagery and dreaming being hazier and less distinct than external perception. I’ve also included a video at the bottom that shows the whole reconstructed sequence from the LSTM model.
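A minimal sketch of that feedback loop, assuming a sequence-to-sequence model like the one sketched earlier (output shape `(1, n_steps, n_features)`); the function and variable names are mine, not the project's:

```python
# Seed with one Watching time-step, then feed each prediction back in as the
# next input, so the generated sequence is free to diverge from Watching.
import numpy as np

def dream(model, seed_step, n_steps, length):
    """seed_step: feature vector of one Watching frame, shape (n_features,)."""
    window = np.tile(seed_step, (n_steps, 1))[None, ...]  # (1, n_steps, n_features)
    frames = []
    for _ in range(length):
        pred = model.predict(window)[0, -1]   # the model's next-step prediction
        frames.append(pred)
        window = np.roll(window, -1, axis=1)  # slide the input window forward
        window[0, -1] = pred                  # feedback: output becomes input
    return np.array(frames)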
Following from my previous post I’ve been investigating reducing the number of clusters in order to scale the predictor problem (for Dreaming) down to something feasible. The two pairs of images below show the original reconstructions with 200,000 clusters and the corresponding reconstructions with 1000 clusters. For more context, see this post. I’ll try generating a short sequence and see how they look in context.
In working on Dreaming, I recalculated the K-Means segment clusters (percepts) with only 1000 means (there were 200,000 previously). The images below show the results. It seems that when it comes to collages, the most interesting segments are the outliers (and, I expect, probably the raw segments). The fact that so many segments get averaged in these clusters means they end up being very small, and 1000 means is just not enough to capture the width and height features (hence the two very wide and very tall percepts). Clearly the colour palette is still preserved, but that is pretty much it. The areas of colour below are so small that these images end up being only 1024px wide or smaller. These SOMs are trained over only 10,000 iterations to get a sense of what all the percepts look like together.
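For the curious, recomputing the clusters looks roughly like this; MiniBatchKMeans is my stand-in (the posts do not say which K-Means implementation is used), and the feature file is hypothetical:

```python
# Recompute percepts with k = 1000 over the segment feature vectors.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

features = np.load('segment_features.npy')  # (n_segments, n_features), hypothetical
km = MiniBatchKMeans(n_clusters=1000, batch_size=10000, random_state=0)
labels = km.fit_predict(features)   # which percept each segment belongs to
percepts = km.cluster_centers_      # each centre is one averaged percept
```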
As filtering by area led to such interesting results, I went ahead and split the percepts into three groups according to percept area. The triptych below shows all 200,000 percepts, but separated into three separately trained and differently sized SOMs. I’ve also included details of the latter two SOMs. I thought this approach would lead to more cohesion within each map, but the redundancy between the second and third images leads me to believe that 200,000 is too many clusters. Since I need to reduce the number of clusters for the Dreaming part of Watching and Dreaming, I’ll put the collage project aside until I’ve determined a reasonable maximum number of clusters for LSTM prediction, and then come back to it.
After looking at the previous results, I think the issue is that there is simply too much diversity in all 200,000 components to make an image with any degree of consistency. I’ve implemented code to filter image components based on pixel area. The following images and details are composed of the top 5,000 and 10,000 largest components. Due to the large size of these components, these are full size (no scaling) and suitable for large-scale printing. I think the first image with 5,000 components is the most compelling. I will now look at making collages from the remaining smaller components, or a subset thereof.
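The area filter reduces to a sort-and-slice; a sketch, assuming each component's pixel area is available as an array (names are hypothetical):

```python
# Keep only the N largest components by pixel area.
import numpy as np

areas = np.load('component_areas.npy')  # pixel area per component, hypothetical
N = 5000
keep = np.argsort(areas)[-N:]           # indices of the N largest components
```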
The following shows the results of training over the weekend. It seems that with this many inputs (200,000) and the requirement for over-fitting (the number of neurons ≈ the number of inputs), we need a lot of iterations. I think this is the most interesting so far, but I also had the idea to break the percepts into sets and make a different SOM for each set. This would make each one more unified (in terms of scale) and give each a very different character.
The following image is the result of a 5,000,000 iteration training run. Note the comparative lack of holes where no percepts are present. The more I look at these images, the more I think they would need to be shown not as a print but as a light-box. I wonder what the maximum contrast of a light-box would be… On the plus side, the collages seem to work best at a lower resolution (4096px square below) due to the small size of the percepts (extracted from a 1080p HD source); this would mean much smaller (27″ @ 150ppi, 14″ @ 300ppi) and more affordable light-boxes. I wonder how the collages using the 30,000,000 raw segments will compare, since they will not have soft edges and will have higher brightness and saturation. It will be a while before I get to those, since the code I’m using is quite slow to return segment positions (17 hours for 200,000 percepts) and is not currently scalable to 30,000,000 segments.
I have been working on getting large percepts to stick in the middle of the composition so they don’t push the outer edges too much. I attempted this by explicitly setting particular neurons in the middle of the SOM with features corresponding to the largest percepts. While this worked for a smaller number of training iterations (1000), it did not seem to make any difference over a large number of training iterations. The following images show the results where large percepts are scaled down to reduce the size variance. The lack of training leads to quite a few dead spots where no percepts are located. While quite dark, the black background works better for this content. I’ve included a visualization of the raw weights and a few details.
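The seeding amounts to overwriting the central neurons' weights before training; a sketch under the assumption of a weight grid shaped (rows, cols, n_features), with all names mine:

```python
# Write the feature vectors of the largest percepts into a 3x3 block of neurons
# at the centre of the SOM grid before training begins. With long training runs
# the neighbourhood updates can wash these seeds out, as observed above.
import numpy as np

def seed_center(weights, large_features):
    """weights: (rows, cols, n_features); large_features: up to 9 vectors."""
    rows, cols, _ = weights.shape
    r0, c0 = rows // 2, cols // 2
    for i, feat in enumerate(large_features[:9]):
        weights[r0 + i // 3 - 1, c0 + i % 3 - 1] = feat
    return weights
```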
The image above shows some early results of organizing 200,000 percepts (the same vocabulary used in “Watching (Blade Runner)”) in a collage according to their similarity (in width, height, and colour). I’ve included a few details below showing the fine structure of the composition. The image directly below shows a visualization of the SOM that determines the composition of the work.
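The project uses ANNetGPGPU for SOM training; as a compact stand-in, the same organization can be sketched with MiniSom. The grid size, neighbourhood, and iteration count here are placeholders:

```python
# Organize percepts on a SOM by a [width, height, r, g, b] feature vector, then
# draw each percept at the grid position of its best-matching neuron.
import numpy as np
from minisom import MiniSom

features = np.random.rand(200000, 5)  # stand-in for the real percept features

som = MiniSom(256, 256, 5, sigma=8.0, learning_rate=0.5)
som.random_weights_init(features)
som.train_random(features, 100000)  # the posts use millions of iterations

positions = [som.winner(f) for f in features]  # (row, col) per percept
```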
I had not worked on making large collages from Blade Runner clusters and segments since the residency. I ended up writing some code for my public art commission (“As our gaze peers off into the distance, imagination takes over reality…“, 2016) that arranged segments using a SOM. I did not end up using that approach in the final work, so I’m now adapting it to make collages from Blade Runner clusters and then segments.
The following image shows the colour values of each of the 200,000 clusters, in no particular order:
I’ve stalled on the ‘Dreaming’ side of the project for now, realizing that changes I made for ‘Watching’ significantly impact Dreaming. With 200,000 percepts, each able to appear in multiple locations in every frame, the LSTM (prediction network) would have an input vector of 5.7 million elements (including a 19+8 position histogram per percept for each frame). That is too big for me to even build a model (at least on my hardware). I took the opportunity to rethink what I should do and came to the conclusion that I’ll need to recompute segments to downscale the LSTM input vector to something feasible. This will take about a month of computation time, so I’ve spent some time working on other projects, such as “Through the haze of a machine’s mind we may glimpse our collective imaginations (Blade Runner)”.
I’ve been doing a lot of reading and tutorials to get a sense of what I need to do for the “Dreaming” side of this project. I initially planned to use TensorFlow directly, but found it too low-level and could not find enough examples, so I ended up using Keras. Performance should be very close, since I’m using TensorFlow as the Keras back-end. I created the following simple toy sequence to play with:
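The original toy sequence isn't reproduced here, so the following is a stand-in of the same flavour: a noisy sine wave and a small Keras LSTM trained to predict the next value from a sliding window.

```python
# Toy sequence prediction: windowed noisy sine -> next value (Keras, TF back-end).
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

t = np.arange(0, 100, 0.1)
seq = np.sin(t) + 0.1 * np.random.randn(len(t))  # noisy toy sequence

window = 20
X = np.array([seq[i:i + window] for i in range(len(seq) - window)])[..., None]
y = seq[window:]  # the value following each window

model = Sequential()
model.add(LSTM(32, input_shape=(window, 1)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=10, batch_size=32)
```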
I’ve written some code that will be the interface between the data produced during “perception”, the future ML components, and the final rendering. One problem is that in perception, the clusters are rendered in the positions of the original segmented regions. This is not possible in dreaming, as the original samples are not accessible. My first approach is to calculate probabilities for the position of clusters for each frame in the ‘perceptual’ data, and then generate random positions using those probabilities to reconstruct the original image.
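A sketch of that probability-based placement, using the 19×8 bin grid described in these posts and the 1920×800 frame mentioned elsewhere; the per-cluster position arrays and function name are assumptions:

```python
# Estimate a per-cluster position distribution from the 'perceptual' data, then
# sample new positions from it to reconstruct the frame without the originals.
import numpy as np

def sample_positions(xs, ys, n, width=1920, height=800, bins=(19, 8)):
    """xs, ys: observed positions of one cluster; returns n sampled positions."""
    hist, xedges, yedges = np.histogram2d(xs, ys, bins=bins,
                                          range=[[0, width], [0, height]])
    p = hist.flatten() / hist.sum()
    idx = np.random.choice(p.size, size=n, p=p)     # pick bins by probability
    bx, by = np.unravel_index(idx, hist.shape)
    # jitter uniformly within each chosen bin
    x = xedges[bx] + np.random.rand(n) * (xedges[1] - xedges[0])
    y = yedges[by] + np.random.rand(n) * (yedges[1] - yedges[0])
    return x, y
```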
The results are surprisingly successful, and clearly more ephemeral and interpretive than the perceptually generated images. One issue is that many clusters appear in frames only once; this means there is very little variation in their probability distributions. The image above is such a frame, where the bins of the probability distribution (19×8) are visible. I can break this grid by generating more samples than there were in the original data; in the images in this post the number of samples is doubled. The extra samples break up the grid a little (see below). Of course, for clusters that appear only once in perception, this means they are doubled in the output, as can be seen in the images below. The following images show the original frame, the ‘perceptual’ reconstruction, and the new imagery reconstruction from probabilities. The top set of images has very little repetition of clusters, and the bottom set has a lot.
After spending some time tweaking the audio, I’ve finally processed the entire film, both visually and aurally. At this point the first half of the project “Watching and Dreaming (Blade Runner)” is complete; the next step is the “Dreaming” part, which involves using the same visual and audio vocabulary, where a sequence is generated through feedback within a predictive model trained on the original sequence. Following is an image and an excerpt of the final sequence.
Following is a selection of frames from the second third of the film. I’ve also finished rendering the final third, but I have not taken a close look at the results yet. Once I revisit the sound, I will have completed the “Watching” part of the project.
I have been slowly making progress on the project, but had my attention split due to the public art commission for the City of Vancouver. I managed to run k-means on the 30 million segments of the entire film and generated 200,000 percepts (which took 10 days due to the slow USB2 external HDD I’m working from). The following images show a selection of frames from the first third of the film. I’m currently working on the second third. They show quite a nice balance between stability and noise with my fixes for bugs in the code written at the Banff Centre (which contained inappropriate scaling factors for the size features).
After tweaking some more code, the template-matched percepts look even more blocky than the previous centred ones. Due to this, I’m doing one more test using template matching, and if that does not provide better results, I’ll abandon template matching for percept generation. I’m also considering changing the area and aspect-ratio features to the width and height of segments. Currently the aspect ratio is over-weighted because it’s not normalized; normalization could be based on the widest and tallest extents (1920 and 800 pixels respectively), but the difference between them would overly weight the wide percepts.
In my previous post I claimed that the inclusion of large readable percepts in the output was due to the lack of filtering, but in fact I was filtering out large percepts. It seems the appearance of apparently large percepts is due to tweaks I made to the feature vector for regions. Previously, the area of percepts was very small because it was normalized relative to the largest possible region (1920×800 pixels); as percepts this large are very unlikely, the area feature had less range than the other features. I increased the weight of the area feature tenfold, hoping it would increase the diversity of percept size. Instead, it seems this extra sensitivity preserves percepts composed of larger regions, increasing their visual emphasis.
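A sketch of such a feature vector, with width and height normalized by the 1920×800 frame and the area feature boosted tenfold; everything beyond those two facts (the colour features, the exact layout) is an assumption:

```python
# Region feature vector: normalized width/height, ten-fold weighted area, and
# mean colour. The area boost gives size a range comparable to other features.
import numpy as np

def region_features(w, h, r, g, b, area_weight=10.0):
    return np.array([
        w / 1920.0,                                 # normalized width
        h / 800.0,                                  # normalized height
        area_weight * (w * h) / (1920.0 * 800.0),   # area, boosted ten-fold
        r / 255.0, g / 255.0, b / 255.0,            # mean colour
    ])
```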
Throughout the Banff Centre residency I was trying to find a midway point between pointillism and the more readable percepts; it seems I stumbled upon the solution. I’m still not happy with their instability, so I’m now generating new percepts using template matching, so that regions associated with one cluster are matched according to their visual features (to an extent). I have no idea how this will look, but it could make percepts less likely to be symmetrical, since regions are no longer centred in percepts.
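A sketch of the matching step with OpenCV, assuming each region fits within the bounds of its cluster's percept (file names are hypothetical):

```python
# Find where a region best matches within its cluster's percept, so the region
# can be placed by visual similarity rather than centred.
import cv2

percept = cv2.imread('percept.png')  # the cluster average (hypothetical file)
region = cv2.imread('region.png')    # a segment assigned to that cluster

result = cv2.matchTemplate(percept, region, cv2.TM_CCOEFF_NORMED)
_, _, _, max_loc = cv2.minMaxLoc(result)  # best-matching offset in the percept
```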
Since randomizing the order of samples in the clustering process worked so well, I went back to not filtering out large regions before clustering. The results are more interesting as stills, but they are too literal and unstable in video, so I’ve abandoned this line of exploration. The tweaking of features for clustering has certainly helped emphasize the aspect ratio, and I’ve increased the weight of the area feature hoping it will increase the diversity in the size of the percepts.
Watching (Blade Runner) (Work in Progress) is one channel of what is envisioned as a two-channel generative video installation, which was the focus of my tenure as a Banff Artist in Residence. Two seven-minute sequences were exhibited as part of the Open Studios in the Project Space at the Walter Phillips Gallery at the Banff Centre in February 2016. These two sequences use different clips from Ridley Scott’s Blade Runner and show the development of the work through the residency, as documented on this production blog.
Following is photo documentation of the work-in-progress shown at the Open Studios on February 10th.
In the most recent collages I was interested in the range of aesthetic results from different relative scales of the constituent parts. To explore this, I rendered one frame for each scale setting, resulting in a smooth transition between two extremes of percept scale.
Due to the final push to get the video of the second clip ready for the Open Studios event, I did not have a chance to create any collages from those percepts. The following are a few explorations, some of which involve sorting the percepts according to some of their features, such as the area of the region, or its hue or saturation.
Following is a tweak of the video I showed at the Open Studios yesterday; I’ve increased the size of the percepts so they blend together more.