Merging Patches into Percepts

Posted: April 25, 2012 at 10:39 am

After spending far too much time on early perception, I really need to move on. Above is a somewhat aesthetically interesting reconstruction built from patches that could not be merged. The current very rough, sketchy state of the system is OK for now, but merging is problematic.

The first issue is the obvious one: the more patches there are, the longer it takes to compare them (even using simple Euclidean distance). This is compounded by the fact that merges only operate on pairs. Since the camera is stable there is a lot of redundancy, so we also need to merge unmerged and merged patches in a recursive/nested/hierarchical fashion. While this would reduce the total number of percepts, it also greatly increases the time required to compare them.
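
The comparison step can be sketched in Python; the function name and the idea that each patch reduces to a small feature vector are assumptions for illustration, since the post doesn't show the actual representation:

```python
import math

# Hypothetical sketch: each patch is reduced to a feature vector and
# compared with every stored percept by plain Euclidean distance.
def nearest_percept(patch, percepts):
    return min(percepts, key=lambda p: math.dist(patch, p))

# With n stored units this is O(n) per patch, and since every frame
# adds patches that must each scan all of memory, total comparison
# work grows quadratically as percepts accumulate.
```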

Now I’m not sure merging solves more problems than it causes. Unfortunately it’s the simplest thing I could think of that abstracts away from specific sensory images toward a more conceptual representation. It also only works on non-moving objects. I’m slowly going through a clustering book that should help with shortcuts. The conception of percepts as merged patches is important because it (1) abstracts away from raw sensor data (the retinal image), (2) provides a unit of experience that bridges concept-like and sensor-like representations, (3) provides a method of compressing the massive amount of sensor information, and (4) reflects that some visual processing happens subconsciously. Perhaps I can accomplish this without merging percepts. Following are some test cases that describe the current state of the system.

13 frames, separated by 30 frames, clearing unmerged units every 25 frames.

The image above was created when I forgot to multiply the end point by the steps between frames. The result is that only 13 frames were processed, and no unmerged patches were removed. Due to this, all unmerged patches were available for the composition, resulting in the most interesting image produced by this code, to date (the image at the top of this post). Following is a plot of the various parameters dumped by the perception program during execution:

  - “Total PercepUnits”: the total number of percepts + patches in memory.
  - “Total merged PercepUnits”: the total number of percepts (merged).
  - “Total nonmerged PercepUnits”: the total number of patches (unmerged percepts).
  - “in merges, number of removals”: the number of patches removed during merging.
  - “New Merges each 2nd frame”: the number of merges done each second frame.
  - “New Merges each 5th frame”: the number of merges done each fifth frame.

The merging function only works on pairs, so the current code runs the merge process periodically to keep the number of outstanding merges down. More patches are sensed than merged, so the total number of percepts is ever increasing. Note that the merging code does not attempt to merge items within the same frame; it only merges items from the current frame with those from all previous frames. Following is the approximate time taken to process each frame (including all segmentation and merging operations):

Note that the y-axis is in seconds, which means the peak time to process a frame (and do merging operations) is as much as 11 seconds. This longest time occurs toward the end of the run, when there are over 2000 merged and unmerged percepts. While the image at the top of this post shows the unmerged percepts, the following image shows all the merged percepts. The mean number of merges for merged percepts is 973, and the max is 1964, which means that many of the percepts making up the image are constructed from many patches. Note how it complements the top image, especially in the area around the sign in the upper left.
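
The per-frame merge pass described above can be sketched roughly as follows; the dict-based percept and the running-average merge are assumptions, since the post doesn't show the actual merge operation:

```python
def merge_frame(new_patches, memory, threshold=0.5):
    # Each sensed patch (a feature vector) is compared against all
    # stored percepts; patches from the same frame are never compared
    # with each other.
    for vec in new_patches:
        best, best_d = None, threshold
        for p in memory:
            d = sum((a - b) ** 2 for a, b in zip(vec, p["vec"])) ** 0.5
            if d < best_d:
                best, best_d = p, d
        if best is not None:
            # Merge immediately: fold the patch into the percept as a
            # running mean and count the constituent patch.
            best["count"] += 1
            c = best["count"]
            best["vec"] = [a + (b - a) / c for a, b in zip(best["vec"], vec)]
        else:
            # No match: store the patch as a new unmerged percept.
            memory.append({"vec": list(vec), "count": 1})
```

Because every matched pair triggers an immediate merge, the cost of this pass grows with the size of memory, which is exactly the blow-up visible in the timing plot.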

300 frames, separated by 30 frames, clearing unmerged units every 25 frames.

The sawtooth waves are due to the clearing process, which removes all unmerged percepts, and all percepts merged only once, every 25 frames. With 300 frames, the total number of percepts grows very large (up to 9344 units), which corresponds to a horrid increase in processing time:

The maximum processing time (for one frame + merging + clearing iteration) is 839 seconds. Following is the corresponding “merged” reconstruction produced at the end of this run:

The unmerged reconstruction is empty because it was cleared at the end of the run. The mean number of merges in all the percepts by the end of the run is 1669, with a maximum of 3339. Despite the large number of percepts and the large number of merges, the coverage of the visual field is very poor.
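
The clearing step that produces the sawtooth can be sketched as a periodic filter; the `merges` field and the exact retention rule are assumptions based on the description above:

```python
def clear_weak_units(memory, frame_idx, interval=25):
    # Every `interval` frames, drop unmerged patches (merges == 0) and
    # percepts merged only once; well-merged percepts survive.
    if frame_idx % interval != 0:
        return memory
    return [p for p in memory if p["merges"] > 1]
```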

I see two options for continuing:

  1. I could continue to tweak the existing approach, perhaps treating merged percepts as long-term memory that does not decay (or decays slowly), and treating the raw unmerged patches as short-term memories in the visual buffer that decay more quickly. This would depend on solving the slow processing by using proper clustering methods.
  2. I could give up on the merging process and figure out some other way of making higher level percepts from pixel patches. It’s unclear what this would be at this point.
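
Option 1 could be sketched as two decay rates; the rates, the activation field, and the floor are made-up placeholders, not values from the actual system:

```python
def decay_step(memory, stm_rate=0.2, ltm_rate=0.01, floor=0.05):
    # Merged percepts act as slowly decaying long-term memory;
    # unmerged patches act as a fast-decaying short-term visual buffer.
    survivors = []
    for p in memory:
        rate = ltm_rate if p["merges"] > 0 else stm_rate
        p["activation"] *= 1.0 - rate
        if p["activation"] > floor:
            survivors.append(p)
    return survivors
```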

In either case, the idea is to do as little as possible on this aspect of the system for now, dump all the percepts to disk after a run, and write a python prototype of the conceptual network part of the system. A slow and offline system is appropriate for this stage of development (in order to get through the proposal). Optimization and improvements can be made once there is a prototypical implementation of the bulk of the system.

After looking briefly through the clustering book, two things occurred to me. (1) I could rethink the idea of “merging” in the context of the higher-level features discussed in the post on system architecture (rather than low-level pixel features, like histograms or shape). This would mean that patches could be merged based on their high-level features (though that still requires a distance calculation for those features), and/or it provides a filtering method where percepts in common locations of the feature space are thrown away. In the latter case, the leftover percepts would not be merged, but would simply be exemplars of an area of the feature space. (2) In the current system, when a pair of matching percepts is found, they are merged right away. This means that for each potential merge an actual (CPU-intensive) merge is done. Another approach is to cache all the potential merges (in a recursive manner) and only then do the actual merges in bulk. I would expect this to be much faster.
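
Idea (2) could look something like this union-find sketch, where matching pairs are only recorded in the first phase and the expensive merge runs once per matched group rather than once per pair; using the mean as the merge operation is an assumption:

```python
def batch_merge(vectors, threshold=0.5):
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Phase 1: cache potential merges; record matches, merge nothing.
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2
                    for a, b in zip(vectors[i], vectors[j])) ** 0.5
            if d < threshold:
                parent[find(j)] = find(i)

    # Phase 2: one actual merge per group (here, a simple mean).
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(vectors[i])
    return [[sum(col) / len(g) for col in zip(*g)]
            for g in groups.values()]
```

This also naturally handles the recursive/nested case, since transitively matching pairs end up in one group and are merged in a single step.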

I like the idea of filtering all percepts based on their distance from the distribution of all percepts (with the likelihood of removal increasing as their distance to the norm shrinks). This would not be O(n^2), because each percept would be compared to the distribution rather than to every other percept. It also provides nice consistency with the conceptual network aspect of the system. The only drawback I can think of is the lack of hierarchy in the system: no lower-level abstraction feeding a higher-level process. (That said, segmentation is certainly a lower-level process.)
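
The distribution-based filter could be sketched like this: each percept is scored once against the mean and spread of all percepts, so the pass is linear in the number of percepts. A deterministic z-score cutoff stands in here for the probabilistic removal described above, and the threshold is a placeholder:

```python
import numpy as np

def filter_by_typicality(features, keep_z=1.0):
    X = np.asarray(features, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0) + 1e-9            # avoid divide-by-zero
    # One distance per percept, measured against the distribution
    # rather than against every other percept: O(n), not O(n^2).
    z = np.linalg.norm((X - mu) / sd, axis=1)
    # Percepts near the norm are dropped; atypical ones survive as
    # exemplars of their region of feature space.
    return [f for f, zi in zip(features, z) if zi > keep_z]
```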