Clustering and Aesthetics

Posted: March 20, 2013 at 1:38 pm

The clustering code is working pretty well for background percepts. Following is a video that shows the raw frames on the left (in 720p) and the resulting clustered output on the right (also in 720p) through ~300 consecutive frames. Note that the video is quite high resolution (2560×720); the best playback is likely attained by downloading it (right click and “save video as”) and using a native video player. For each new frame, its regions are compared against the percepts from the previous frame: if two regions are sufficiently similar, they are merged by averaging into a single percept.
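A minimal sketch of this frame-to-frame clustering might look as follows. The Percept structure, the cosine-similarity measure, and the threshold are all assumptions for illustration, not the actual implementation; it also assumes patches have been resampled to a common size so they can be averaged.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.9  # assumed cutoff; tuning would be empirical

class Percept:
    def __init__(self, image, mask, position, features):
        self.image = image          # masked pixel patch (assumed common size)
        self.mask = mask            # region mask, same size as the patch
        self.position = np.asarray(position, dtype=float)  # (x, y) centroid
        self.features = np.asarray(features, dtype=float)  # e.g. a colour histogram

def similarity(a, b):
    """Cosine similarity between two percepts' feature vectors."""
    return float(np.dot(a.features, b.features) /
                 (np.linalg.norm(a.features) * np.linalg.norm(b.features) + 1e-9))

def merge(a, b):
    """Average two sufficiently similar percepts into one cluster.
    Averaging positions is what produces the 'wiggle' discussed later."""
    return Percept(
        image=((a.image.astype(np.float32) + b.image.astype(np.float32)) / 2).astype(np.uint8),
        mask=((a.mask.astype(np.float32) + b.mask.astype(np.float32)) / 2).astype(np.uint8),
        position=(a.position + b.position) / 2,
        features=(a.features + b.features) / 2)

def cluster_frame(previous_percepts, new_regions):
    """Compare each region in the new frame against the previous frame's
    percepts; merge matches, keep non-matches as new percepts.
    (Unmatched previous percepts are dropped here for brevity.)"""
    merged = []
    for region in new_regions:
        best, best_sim = None, -1.0
        for p in previous_percepts:
            s = similarity(region, p)
            if s > best_sim:
                best, best_sim = p, s
        if best is not None and best_sim >= SIMILARITY_THRESHOLD:
            merged.append(merge(region, best))
        else:
            merged.append(region)
    return merged
```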

In working on the clustering I noticed quite a few aesthetic implications. The first clustered percepts I saw included subtle foreground objects in the background regions, which are still visible (although to a lesser degree) in the video above. It took me some time to figure out why, as background segmentation explicitly removes foreground objects by subtracting them from the percept masks; the regions extracted by segmentation should not contain any foreground objects. The issue is caused by the clustering process. While the masks for background objects explicitly mask out the foreground regions, the images being masked still contain the foreground objects. This is because, while the background model is used to calculate the segmentation boundaries, those boundaries are applied to extract pixels from the current live video frame. When two masks are averaged (where one contains a foreground knockout and the other does not), the mask without the knockout partially reveals the corresponding area of the merged image, which now contains a foreground object.

I solved this partially by adding a test for whether a background region contains a foreground object; if it does, the masks are not averaged, and the mask with the foreground “knocked out” is used for the merged cluster. This is only a partial solution because it’s possible that both regions contain knockouts of two different foreground objects. To remove the foreground objects entirely, more processing must be done on the masks to ensure that foreground objects are not included in the final merged percepts. Another option would be to extract pixels not from the live frame but only from the background model. This would prevent any foreground objects from being visible, as the background model never contains them. The open question is whether these regions would offer as much visual and textural diversity as the current frames, since they are smoothed statistical summations. It would certainly involve less processing to segment only the background model.
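The partial fix could be sketched like this, assuming a per-frame foreground mask is available. The contains_foreground test, the bounding-rect heuristic, and all names here are hypothetical, not the project’s actual code.

```python
import cv2
import numpy as np

def contains_foreground(region_mask, foreground_mask):
    """True if any detected foreground pixels fall within the region's
    extent, i.e. the region mask carries a foreground knockout."""
    x, y, w, h = cv2.boundingRect(region_mask)
    return bool(np.any(foreground_mask[y:y + h, x:x + w] > 0))

def merge_masks(mask_a, mask_b, foreground_mask):
    """Average the masks unless exactly one contains a foreground knockout,
    in which case the knocked-out mask wins, preserving the hole."""
    a_ko = contains_foreground(mask_a, foreground_mask)
    b_ko = contains_foreground(mask_b, foreground_mask)
    if a_ko and not b_ko:
        return mask_a
    if b_ko and not a_ko:
        return mask_b
    # Both or neither contain knockouts: fall back to averaging.
    # (Both containing *different* knockouts is exactly where this
    # partial solution fails, as noted above.)
    return ((mask_a.astype(np.float32) + mask_b.astype(np.float32)) / 2).astype(np.uint8)
```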

All this raises the question of why so much effort is put into separating foreground and background, considering the technical difficulties with background segmentation. This question is often asked in various forms: Why not consider the background a single percept and skip segmenting it? Why include the background at all? Firstly, the concepts of foreground and background are highly artificial, and highly entrenched in Computer Vision, where the background is most often noise to be ignored. Some of the fastest and most accurate algorithms for making sense of images depend on this assumption of foreground and background. Secondly, this work is meant as a long-term public exhibition, sensing visual images at all times of the day and night. Seen from eye level, the majority of the area of a frame is composed of background elements; humans and moving objects occupy relatively little of the visual surface. Additionally, depending on the context of installation, there are long gaps in activity where no foreground objects are visible for extended periods (for example, at night). Thirdly, dreams have been reported to contain “chimeric” elements that are fusions of multiple people or places. If the background were not segmented, the system would be incapable of combining the same place at multiple moments in time. Chimeric elements are important to include because they demonstrate the flexibility and fluidity of dream generation. On a side note, it’s important that any reinforcement learning scheme used in the prediction aspect of the system be capable of generating these chimeric elements, and not just recalling already-seen images.

After watching the video, a few aesthetic aspects are likely to be apparent. For example, quite a few percepts appear to wiggle around the screen, slightly shifting their position with every frame. This occurs because when two regions of pixels are merged, the new position is the mean of the constituents’ positions; percepts move around because their constituents are not stable, due to ambiguity and noise. Ambiguity is when two patches are very similar in terms of most features but do not share the same position; the clustering process sees them as the same, and merging then shifts their positions. Many of these ambiguous regions have visible rectangular edges, which occur when the mask touches the edges of the image’s bounding box. My first thought was to add some spatial padding around the images so that the masks would be isolated, but I realized this would not be possible at the edges of the frame, where these hard edges are highly likely to occur. It was initially unclear where these hard edges arise, because they are not present in the patches segmented for the New Forms Festival; I suspected they too were related to the clustering process, but a quick check (pictured following) shows that it is indeed the segmentation causing these lines, which requires more investigation. If they are not removable, one option is to add a gradient mask to the edges of the percepts (a sketch of this follows the image below). This would soften the borders of each percept, and fit well with the plan to increase the smoothness of the mask edges (which appear pixelated due to segmentation regions being calculated at a lower resolution).

Segmentation w/out Clustering
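The gradient mask could be implemented as a simple feathering of the binary mask, e.g. with a Gaussian blur, so percept borders fade out rather than ending in hard, pixelated edges. This is only a sketch of that option; the feather_mask name and radius parameter are mine, not from the project.

```python
import cv2
import numpy as np

def feather_mask(mask, radius=7):
    """Return a float alpha mask in [0, 1] with softened (feathered) edges.
    `radius` controls how wide the fade-out band is."""
    alpha = mask.astype(np.float32) / 255.0
    k = 2 * radius + 1  # Gaussian kernel dimensions must be odd
    return cv2.GaussianBlur(alpha, (k, k), 0)

# Compositing a percept onto the canvas with the feathered alpha:
#   canvas = alpha[..., None] * percept + (1.0 - alpha[..., None]) * canvas
```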