Since this project is meant for the complex, ever-changing real world, I’ve moved from the simple toy examples (balls and coasters) to outdoor scenes. Unfortunately, things are not working very well.
In order to make a greater contribution, I want to situate this work in what is known about the visual system, in particular the separation of the two major streams leaving the primary visual cortex. The dorsal stream (occipital to parietal lobe) is considered, by one theory, the “where” region of the visual system, associated with locations and places. The ventral stream (occipital to temporal lobe) is the “what” region, associated with particular classes of objects.
The idea is to separate the visual analysis into these two streams. Some rough ideas regarding how this could work are in this document. It is a subset of the “Dreaming Machine #3 Notes” document posted previously and contains some additional ideas for the final paper, in particular a first attempt to map the system to biological processes.
I put a quick SOM into the current test patch to see how well it deals with this idealized object data. The SOM is a 2×2, trained using a constant learning rate of 0.5 and a constant neighbourhood size of 1. I did not keep track of the number of training iterations. As proposed, the images were abstracted into an RGB histogram (768 values) and a 40×30 pixel edge-detection before being fed into the SOM.
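A minimal sketch of this abstraction and the SOM update, in Python/numpy rather than PD (the gradient-magnitude edge operator and the Gaussian neighbourhood weighting are stand-ins for whatever the patch actually uses):

```python
import numpy as np

def features(img):
    """img: (30, 40, 3) uint8 array -> concatenated feature vector."""
    # 768-value RGB histogram (256 bins per channel)
    hist = np.concatenate([
        np.histogram(img[..., c], bins=256, range=(0, 256))[0]
        for c in range(3)]).astype(float)
    # 40x30 edge detection, here approximated by gradient magnitude
    grey = img.mean(axis=2)
    gy, gx = np.gradient(grey)
    edges = np.hypot(gx, gy).ravel()          # 1200 values
    return np.concatenate([hist, edges])      # 1968 values total

def som_step(codebooks, x, lr=0.5, sigma=1.0):
    """One training step for a 2x2 SOM: constant rate, constant neighbourhood."""
    grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    bmu = np.argmin(((codebooks - x) ** 2).sum(axis=1))   # best-matching unit
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = lr * np.exp(-d2 / (2 * sigma ** 2))   # Gaussian neighbourhood weights
    codebooks += h[:, None] * (x - codebooks)
    return bmu

rng = np.random.default_rng(0)
codebooks = rng.random((4, 1968))
img = rng.integers(0, 256, (30, 40, 3), dtype=np.uint8)
som_step(codebooks, features(img))
```
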
The following images illustrate how the SOM would accumulate the images. These were not accumulated in PD but manually; they are layered according to how the SOM would choose to accumulate them:
Here are some rough ideas for DM3, with particular attention paid to the perception / synthesis system I am currently working on. These notes have been written in Tomboy and reflect my research up to this point. All linked notes are also included, which leads to a large HTML file as exported by Tomboy.
Here is the simulated accumulation of the same coaster images discussed in this previous post. This time they are cropped by the single largest contour found. Note how much less emphasis the poorly registered image has.
Here is the 3-page proposal I’ve written for the meta-creation class. It describes the whole system in more detail. Philippe’s feedback was that this perception component does not make enough of a contribution to the field. He suggested that I either choose a more interesting method of object segmentation, perhaps a more biologically oriented one, or use a different type of SOM, like a GSOM. A GSOM is a SOM that increases the number of units depending on the quantization error (QE) calculated for each unit. A high QE means too many inputs are associated with a particular unit, and that the map is therefore likely too small.
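A minimal sketch of the QE criterion behind the GSOM suggestion (the growth threshold here is a hypothetical parameter, and the distance measure is a plain Euclidean norm):

```python
import numpy as np

def quantization_error(codebooks, inputs):
    """QE per unit: summed distance of the inputs mapped to that unit."""
    # distance of every input to every unit
    d = np.linalg.norm(inputs[:, None, :] - codebooks[None, :, :], axis=2)
    bmu = d.argmin(axis=1)            # best-matching unit for each input
    qe = np.zeros(len(codebooks))
    for i, b in enumerate(bmu):
        qe[b] += d[i, b]
    return qe

def units_to_grow(codebooks, inputs, threshold):
    """Units whose QE is too high, i.e. where the map is likely too small."""
    return np.flatnonzero(quantization_error(codebooks, inputs) > threshold)
```
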
Next steps are to look into other object segmentation methods, to continue implementing the system as currently proposed in gridflow, and to look into implementing a SOM, and then a GSOM, in gridflow.
The perception/synthesis project is a component of the next “Dreaming Machine” installation. In DM1 and 2 entire images were stored, not components of images. The purpose of this project is to determine a method for extracting components from an image, but without a complex shape recognition system.
After exhibiting the high-resolution SOMs I’ve become very interested in their aesthetic. I’ve come up with an idea for a system that trains a map using the robot camera. The problem of the finite number of images is solved by giving the installation a temporal aspect. The camera would capture images, and the SOM would be trained. The exhibition would start with a blank slate before the SOM is trained, and over time the SOM would be refined more and more. Once the SOM is trained the exhibition would end, or perhaps remain for a time in a static state.
Something like the dual SOM would likely be needed, as training works best when the SOM is presented with the same images repeatedly and in random order. The first SOM would store the frames, which the second SOM would randomly iterate over in training. Another idea would be to simply have the SOM trained slowly with the live images from the camera. Even if the camera took an image every 3s, it would take at least 70 days to train a 70×70 SOM. The act of capturing images would then be more closely connected to the training process. As each image is captured it would immediately be presented on the display.
The display could be an array of smaller screens, perhaps 4-9 inches diagonal. Each screen would show one single image at full frame. One problem with this approach is that the frames would take too much visual emphasis and distract from the structural relationships between the images. Controlling as many as 40×40 or 70×70 units would also be technically difficult.
Another idea would be to have multiple projectors each showing a portion of the whole image: 4×4 projectors (each showing 10×10 or 15×15 units) or 8×8 projectors (each showing 5×5 or 10×10 units), depending on resolution. All the projectors would be projected on the back of a single rear-projection surface. The strength of this method is that the Gaussian masks could still be used, multiple projectors would blend together very well, and the setup would be technically simpler. The resulting image would be extremely high resolution while still using live content from the camera. Even the brightness of the space would be less of an issue, as the projectors would likely be close to the projection surface and therefore brighter.
Even with cheaper (non-HD) projectors an 8×8 grid could create a resolution as high as 10,240×8,192 pixels which would allow the viewer to really see the details in each individual image as well as the overall structure.
The idea is to take a sequence, video or film. Take the first n (say 100) frames and train a SOM on them. That trained SOM is the first frame of the output sequence. Then take the n frames starting at frame #2, and so on. The resulting sequence would be n−1 frames shorter than the input sequence. The result would show the structure over the last n frames. A new scene would start a new cluster and grow in the field until it shrinks as the scene ends and is displaced by the next scene.
The number of frames used would define the number of clusters in the field and the number of units in the frame. HD+ resolution would likely be needed as each frame would be composed of many frames from the original footage.
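The sliding-window scheme could be sketched like this (in Python; train_som here is just an averaging placeholder for the real SOM training):

```python
import numpy as np

def train_som(frames):
    return np.mean(frames, axis=0)     # placeholder for real SOM training

def som_sequence(frames, n=100):
    """Each output frame is a SOM trained on n consecutive input frames."""
    return [train_som(frames[i:i + n]) for i in range(len(frames) - n + 1)]

frames = [np.zeros((8, 8)) for _ in range(120)]
out = som_sequence(frames, n=100)
# the output is n - 1 frames shorter than the input
assert len(out) == len(frames) - (100 - 1)
```
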
Even though I had to remove the code to save images from the installation, somehow I was able to save the memory field when striking the installation. The sawtooth method for training seems to work quite well; there is certainly a cluster of darker and lighter images. There are many local errors though, visible in the number of stray colours floating around. This may be due to there simply not being enough iterations; this was only a two-week exhibition. Here is the whole 40×40 memory field:
Here is a shot of the system being tested with the 52″ LCD TV provided by Videotage. Whereas the software showed a 3×3 portion of the memory map in Norway, this project will show a 7×7 section of the map and make use of the full 1920×1080 available pixels. Surprisingly, the patch has scaled well and driving this large display is no issue. The DVI-I to HDMI cable worked just like a native DVI cable, and the nvidia card read the EDID of the display fine. The only issue was that the contrast seemed to increase when running in proper HD mode, which did not happen at lower resolutions. For this application the contrast change was not a problem, so I did not need to resolve it.
It has been a while since I posted, due to being very busy working on Dreaming Machine. I spent Monday and Tuesday babysitting the system while it captured images of False Creek and the central branch of the Vancouver Public Library. I have so far been unable to train a SOM suitable for printing on these data-sets, as training takes over 750,000 iterations. Here are some images of the setup and context of False Creek and the VPL:
I did a few tests by reducing the number of sensors in the SOM. This was accomplished by systematic sampling of the histograms. The results show that the reduction of sensors makes no significant difference to the number of iterations required to create a good map. A visual inspection of the memory fields shows almost no difference between the maps resulting from 128 and 768 sensors. The following plot shows the number of iterations (x) plotted against the number of associated memory locations (y):
Here are the memory fields of the 768 (top) and 128 (bottom) sensor training sessions:
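The systematic sampling of the histograms mentioned above is simple; a sketch (assuming every 6th bin of the 768-value histogram is kept to get 128 sensors):

```python
import numpy as np

def reduce_sensors(hist, n_out):
    """Systematic sampling: keep every k-th bin of the histogram."""
    step = len(hist) // n_out
    return np.asarray(hist)[::step][:n_out]

hist768 = np.arange(768)
hist128 = reduce_sensors(hist768, 128)   # 768 / 6 = 128 sensors
assert hist128.shape == (128,)
```
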
After working with these large SOMs (50×50 or 75×75), the grid and uniformity of the images appears perhaps too consistent. One idea to inject some variation is to have the camera motivation keep track of the difference between the histogram of the middle region of the current frame and that of the previous frame. The larger the difference, the larger that memory unit could be rendered. A longer-term idea could be to do a histogram analysis to figure out clusters of similarity (similar U-Matrix values). These U-Matrix values could be mapped to unit size.
So once I added a random codebooks init to ann_som, I configured oprofile to get a sense of what portions of the patch would need optimization. The results make it very clear that ann_som itself uses as much as 80% of the PD CPU usage. Python is using a measly 10%. My assumption that python may be a bottleneck is clearly unfounded, and the only way to improve performance would be to limit the number of iterations ann_som goes through. Now that I have a fully populated SOM, I’ll see how few iterations are needed to train the second SOM. I wonder what will happen if I use the linear training method multiple times without clearing the SOM. It should optimize much faster the second time, as the majority of the data would not have changed.
Once I get an idea of how that will work, I should integrate the motivated camera and dual-SOM stuff into the current DM system.
I have been able to run a dual SOM in a sufficient amount of time. The more memories are stored in the first SOM, the slower the training appears to be. This is not in terms of number of iterations but in terms of CPU time of accessing many more memory locations. Here is the U-matrix of the first SOM, once it has only been partially populated (3468 out of 5625 (75×75) units):
And the second SOM trained on those memories:
This second SOM was trained in 30s, over 15,000 iterations (2ms / iteration). When training this quickly the CPU usage is somewhat high, and does interfere with rendering. In order to see how feasible it really is I would need to integrate it into the DM system and see how it performs. One problem is that in the current DM system the pix_buffer is in the parent patch, and it’s the second patch that simply decides in which memory location a particular input should be stored. In a dual-SOM the pix_buffer for the initial SOM will need to be in that second patch, and therefore would not be available to the first. A second pix_share could be used to send the data back to the parent patch, but it’s unclear if that would be fast enough. Following is the second SOM trained once the first SOM has associated all its units with images.
There are still lots of issues with making this approach work, but the quality of these second SOMs is so high that this may be worth it. Ideas to make it optimize faster:
- Add random codebooks initialization to ann_som
- Try using a different numpy data type (rather than a python list) for faster iteration.
- Perhaps a C PD external that does the job of hist2numpy.py
While testing a faster way to train a SOM I trained a SOM on some of the motivated gaze images. I used linear training functions so it is a pretty good indication of the topology of the data. This feature map took 100,000 iterations to train, and some units have still not been associated with images:
The dual SOM idea is good, but it turns out that the data just makes the SOM difficult to optimize. I’m not sure if I will be able to calculate a new second SOM fast enough to recalculate one for each dream.
Here is the memory field and U-Matrix for the primary (75×75) pixel SOM:
Here is the corresponding secondary (30×30 unit) SOM (trained over 15,000 iterations using linear decreasing functions):
Notice that the organization of this second SOM (trained only on the images stored by the first SOM) gives an impression of the structure of the memory very similar to that of a SOM trained on the whole data-set using linearly decreasing functions.
So it appears this dual-SOM method gives results of static data-set quality from a continuous feed of new data. The next question is how to train the SOM over 15,000 iterations (or perhaps more) in under a minute. One approach is to store a concatenated histogram for each image stored by the first SOM, and train the second SOM directly on those hists. These hists could be stored as numpy buffers and dumped directly from python into ann_som.
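A sketch of what that buffer could look like (the layout and the way stored hists are drawn in random order are assumptions; ann_som itself is not modelled here):

```python
import numpy as np

N_UNITS = 75 * 75          # first SOM size
HIST_LEN = 768             # concatenated RGB histogram

# one row per first-SOM unit, filled in as images are stored
hists = np.zeros((N_UNITS, HIST_LEN), dtype=np.float32)
occupied = np.zeros(N_UNITS, dtype=bool)

def store(unit, hist):
    """Record the concatenated histogram of the image stored at a unit."""
    hists[unit] = hist
    occupied[unit] = True

def training_stream(rng, n_iterations):
    """Yield stored hists in random order for training the second SOM."""
    idx = np.flatnonzero(occupied)
    for _ in range(n_iterations):
        yield hists[rng.choice(idx)]
```
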
Another question is the dreaming. Now there are two SOMs, one that is highly organized, the other more like a staging area or very rough organization. Do dreams only propagate through the second SOM, or both? One SOM would provide a highly abstract free-association, where the second would be more concrete. Perhaps associations propagating through these two SOMs simultaneously could correspond to Gabora’s associative and analytical modes of thought.
I’ve started exploring the idea of having two SOMs. The first SOM simply chooses which images should be stored, using a cyclic learning function. The second SOM uses linearly decreasing training functions and is trained on the (finite) results of the first SOM in order to make a highly organized map. The second SOM is fed with input data in random order, not the original order received from the camera. The first experiment used two SOMs of the same size (30×30 units). The problem was that the massive number of iterations possible in the first SOM just doesn’t scale to the second SOM (of the same size, but needing to optimize between dreams). The result is that the first SOM may have all its memory locations occupied while the second SOM is unable to occupy all of its locations. This is the case even when using the codebooks of the first SOM in the second.
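A minimal sketch of the second-stage retraining (a winner-only update stands in for ann_som, and the linearly decreasing rate and iteration counts are placeholders):

```python
import numpy as np

def som_step(codebooks, x, lr):
    """Winner-only update; a stand-in for ann_som's real update."""
    bmu = np.argmin(((codebooks - x) ** 2).sum(axis=1))
    codebooks[bmu] += lr * (x - codebooks[bmu])
    return bmu

def retrain_second_som(stored, n_units, n_iter, rng):
    """Train a fresh second SOM on the first SOM's stored data,
    presented in random order with a linearly decreasing rate."""
    codebooks = rng.random((n_units, stored.shape[1]))
    for t in range(n_iter):
        lr = 0.5 * (1.0 - t / n_iter)             # linear decrease
        som_step(codebooks, stored[rng.integers(len(stored))], lr)
    return codebooks
```
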
Here is a U-Matrix of the hists of images stored in the first SOM:
The second SOM:
Here is an example of the second SOM with fewer units (15×15):
This shows the memory field for the above SOM:
I’m currently training a 75×75 unit SOM and will use that to feed smaller second stage SOMs.
After capturing some test images from the motivated camera I’ve been working on the SOM structure. The quality of the SOM (when using typical linearly decreasing learning and neighbourhood functions) is very interesting when the camera provides images that are already clustered. That cluster structure is rich enough that the resulting SOM is quite complex. Following are a representation of the memory field, its U-Matrix, and the U-Matrix of the codebooks (neuron weights):
U-Matrix of images stored in the memory field:
U-Matrix of codebooks:
In comparison, here is the memory field resulting from the same learning settings, except that images are fed in their original order and, due to the slow learning rate, training runs for 30,000 iterations:
This clearly shows that the order in which images are presented is highly significant to the foldedness of the resulting SOM. This is problematic, considering the basis of the camera motivation is making subtle variations on the camera’s position based on the visual scene. I could explore using very large multipliers in the motivation, but that would lose some of the quality of the camera following features of the scene; it would appear to just be randomly jumping between points. Another approach could be to use a two-stage SOM. An initial SOM would simply store images (the number yet to be determined) as a first effort at organization. This would be highly folded, as seen above. The question is whether a second SOM, trained on only those images stored by the first SOM, would produce a more organized result. The first SOM would have to be trained on a cyclic function (to integrate new data), possibly a sawtooth function. The second would read the images in a random order and retrain between each dream. I wonder how quickly the second SOM could be trained. The size of both SOMs is also interesting: if the first SOM was larger (in terms of the number of units) and the second smaller, this could be an analog to longer- and shorter-term memory.
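One possible shape for that sawtooth function (the period and rate limits are assumptions):

```python
def sawtooth_lr(t, period=1000, lr_max=0.5, lr_min=0.01):
    """Learning rate ramps down over each period, then jumps back up,
    so newly captured images keep being integrated."""
    frac = (t % period) / period
    return lr_max - (lr_max - lr_min) * frac

# at the start of each cycle the rate is back at lr_max
assert sawtooth_lr(0) == sawtooth_lr(1000) == 0.5
```
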
After doing a full day’s testing on the camera motivation, I think things are close. In order to keep the camera from getting too lost in the small details, I’ve increased the multiplier (not the offset) for each step. The result is that areas of focus are quite large. Additionally, I’ve added a reset so that when the motivation takes the camera to the edge of the visual field, a random pan/tilt is generated. This allows the path of the camera to search over the whole space with much better coverage. Out of the 23,181 iterations, the camera motivation was reset 2,394 times, representing only about 10% of the camera movements. Although the goal was to remove the random aspect of the camera, I’m quite happy with this direction. Perhaps another idea will come up in the future to remove this random requirement; for example, the camera could be reset to the centre of the least dense area. This would require a much more complex statistical analysis of its motivational behaviour.
Here is a plot of the motivation paths of the camera during this test. The random movements have been removed.
As usual, the movements start in red and end in green.
Here is a 2D histogram of the density of the data. Notice the coverage over the visual field.
Here is the histogram overlaid on the visual field. Areas that are white were not visited often; areas that are visible were visited often (higher density).
The next step is to capture images from each location and see how the SOM responds to the non-uniformity. It seems clear the visible areas above (houses, trees) will take up much of the SOM, where other areas may not be represented (sky, grass).
By increasing the multiplier and decreasing the offset I’ve got the motivation providing much better coverage, though it still does appear to get stuck in certain areas. Following is a 2D histogram showing the pockets of density in certain areas:
Here is the same representation of data from the previous post for comparison:
An idea is to have the camera jump to a random position when it reaches the edges, and perhaps when it has spent too much time in one area, to provide better coverage. It is unclear how the non-evenly distributed gaze will manifest itself in the SOM. The increased frequency in particular “areas of interest” (to the motivation) should result in clusters in the SOM corresponding to those areas.
This approach to motivation is more subtle than the first approach. Rather than fixing camera positions in a grid, and using the histograms to choose which grid position to move to next, this method uses the difference between the middle histogram and the LRTB (left, right, top, bottom) histograms to create a vector for the next move. The more different the edges are, the larger the movement. The result has a rather obsessive quality. The camera’s gaze tends to obsess about the details of a small area, and eventually (after an indeterminate time) move onto another region to obsess about. Here is a plot of the camera’s movement. It starts in red, and ends up in green:
Notice the clusters where the camera explores the small details of one region. The obvious colour shift in the second from the right cluster indicates the camera spent much more time in that area than in the other clusters. Here is a detail of that area:
Upon closer inspection it seems clear that this cluster is actually two clusters, the second of which (in green) is much more dense. The camera spent 2,571 iterations in this cluster alone, out of a total of only 4,274 iterations, representing approximately 60%. In the next run I’ll attempt to increase the likelihood of the gaze escaping these clusters by increasing the length of the vector.
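A sketch of how the motivation vector described above could be computed from the LRTB-vs-middle histogram differences (the region layout, bin count, greyscale input, and the multiplier are assumptions; a real version would likely normalize for the differing region sizes):

```python
import numpy as np

def region_hist(img, rows, cols):
    return np.histogram(img[rows, cols], bins=16, range=(0, 256))[0]

def motivation_vector(img, multiplier=1.0):
    """Pan/tilt step from the LRTB-vs-middle histogram differences."""
    h, w = img.shape[:2]
    r3, c3 = h // 3, w // 3
    mid = region_hist(img, slice(r3, 2 * r3), slice(c3, 2 * c3))

    def diff(rows, cols):
        return np.abs(region_hist(img, rows, cols) - mid).sum()

    left = diff(slice(None), slice(0, c3))
    right = diff(slice(None), slice(2 * c3, None))
    top = diff(slice(0, r3), slice(None))
    bottom = diff(slice(2 * r3, None), slice(None))
    # the more different the edges, the larger the movement
    return multiplier * (right - left), multiplier * (bottom - top)
```
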
This mock-up shows the path of the camera overlaid on the visual field. The gaze is clearly attracted to areas including many edges, and tends to escape when the vector is aligned with edges in the frame:
I’ve made my first attempt to remove the random control of the camera’s gaze. This approach is based on an analysis of the histograms of the middle, top, right, left, and bottom regions of the image. The x and y regions that are most different from the middle control the pan/tilt direction. This was done so that the camera moves over a fixed grid, and locations that have already been visited cannot be revisited. Even with this mechanism, the gaze of the camera is highly looped and overlapped. It also tends to get stuck in certain areas. The following plot shows 2,667 iterations:
The plot starts in the red area and ends up in the green area. The density of the camera’s fixation on certain areas is clear:
The upper right corner has been visited extremely disproportionately. This is even more extreme when the range of the camera (the area in which the camera is able to look) is included in the plot:
The next steps will be to give up on this grid-based approach and calculate a vector from the differences between the various histograms in order to point the camera in a new direction. Since this vector will contain some of the complexity of the image I hope it will not be as likely to get stuck in a certain area.
Before starting to implement the ideas of having the camera control its own path, I need to be able to verify what the camera is looking at. To that end I have determined a mapping between unit ID and pan/tilt location, in order to visualize the field of the camera. Following is a montage of 56×11 640×480 images arranged by their position in pan/tilt space. Future plots of camera paths will be superimposed on this field.
The images on the far left edge are blurry because the camera was not given sufficient time, between captures, to move from the far right back to the far left.
The red Xs are units that were never associated with images. Note that due to a bug, the histograms were truncated to 100 elements while being stored. This U-Matrix therefore only shows the similarity of the first 100 elements of the R channel histogram. A U-Matrix is normally calculated from the mean of the sum of differences between each unit and its neighbours. In this case the measure of similarity is the sum of the differences divided by the number of neighbours. The large number of dark units means that this SOM was highly folded. Compare this to the U-Matrix and memory field below:
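For reference, a minimal version of that per-unit measure (sum of differences to the up-to-4 grid neighbours, divided by the number of neighbours):

```python
import numpy as np

def u_matrix(codebooks):
    """codebooks: (rows, cols, dim) array of unit weights."""
    rows, cols, _ = codebooks.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            diffs = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    diffs.append(np.abs(codebooks[r, c]
                                        - codebooks[nr, nc]).sum())
            u[r, c] = sum(diffs) / len(diffs)   # mean over valid neighbours
    return u
```
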
The following is an excerpt from a discussion with David Jhave Johnston.
david jhave johnston wrote:
one thought i had is that dreams are so personal so energetic street footage doesn’t convey the emotivity. thats a tough ledge to get around.
How can you compare a person’s lifetime of experiences with an installation’s life of 10 days in a shop window? Things could/would only really get interesting with a huge network (I upgraded my machines to 4GB, and am using a different method to make the patch as scalable as possible), so we’ll see how far I can push it. There is also the aspect of storing components of stimulus, rather than whole images, which could allow the construction of imaginary memories; I have no idea how that would work currently. I have spoken briefly to Gabora about the emotional aspect. I’m playing with the (still Piagetian) idea that emotions could all be reduced to low-level emotions related to biological state.
If all knowledge is based on sense data (as in the Piagetian project) then why, Gabora asked, would a gun associate closely with a knife? They have little to do with one another as far as sense data is concerned. My answer was that perhaps they are bound not just by their sensory impressions, but also (and perhaps more importantly) by their effect on our internal state. They associate with one another because they cause a similar emotional state.
The problem with that line of enquiry (for my PhD) is that it would require a model of those low-level emotions. Interestingly, the character of those emotions would just be another channel of sense data fed into the SOM. I have still not figured out how to deal with multiple channels of sense data; the current idea is to have a SOM for each channel, cross-linked by temporal correlation. A free association could then move through the space of similar images, and then cross into a free association of sound, where the sound matches the image. Then cross into emotion… Ideally this should be an nD SOM, with 2 dimensions for every sensor channel. I can’t get my head around this part.
A measure of stimulation would be very interesting, where the cross-link happens where the stimulus is very strong in the SOM. (A strong stimulus could be as simple as a very close similarity between the activated unit and the input stimulus)
Here is a plot of how many memories were captured at which time. The X-axis shows the dates of the installation; the Y-axis the histogram counts. The peak on the first day of the installation is due to the accelerated collection of the first 50% of memories.
It is interesting that there are also peaks on the Friday and Saturday after the opening (on Thursday the 11th). It will be worth watching for these trends in a longer installation.