Open Perceptual Problems: Over-Stimulation, Stability Over Time and Task Independence

Posted: February 13, 2012 at 10:56 am

I’m stuck on a couple of problems and wanted to post about them before continuing the work. These problems are highly interrelated and highly relevant to the link between theory and implementation. All of them are rooted in a single core problem: the over-stimulation of the system, in terms of memory rather than activation. One of the central aspects of the “Dreaming Machine” project is a conception of perception as inherently constructive. The evidence is clear that what we see is not just what is out in the world, but a combination of what our eyes sense and what our brains expect. We often resolve ambiguous forms according to context, even to the disregard of features (see this paper). This means that perceptual images are constructed, but from what? The argument is that they are constructed from memories of visual components, and the whole purpose of the perceptual mechanisms of the “Dreaming Machine” is the collection of these visual components.

The first problem is that each segmented image produces many perceptual atoms, so many that just a few frames are enough to use up all available memory. A solution is to group percepts by similarity, so that only a single percept needs to be stored, which may refer to a number of presentations over time (for example, static objects in the background of a scene). This is the approach currently taken. At present only static percepts are considered, and the data-set has been limited to static (real-world and noisy) background frames. If the centres of two segmented patches are within a certain distance, the two patches are merged into a single percept (a sketch follows below). The root problem is that even though many of these segments are merged, not enough are merged to keep memory from growing with every new frame; at best, 50% of the patches can be merged. One option was to simply throw away unmerged patches, but so few patches are left that they provide a very poor approximation of the scene. The reason for so few merges appears to be that patches have very low stability over time, even just in terms of centre position (the only feature currently used).
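
For concreteness, the merging rule amounts to something like this (a minimal sketch; the class, names and threshold are hypothetical stand-ins, not the actual implementation):

```python
import math

MERGE_RADIUS = 10.0  # hypothetical distance threshold, in pixels


class Percept:
    """A stored percept: one representative patch centre plus the
    frames (presentations) in which it appeared."""
    def __init__(self, centre, frame):
        self.centre = centre          # (x, y) of the representative patch
        self.presentations = [frame]  # frame indices where it was seen


def merge_patch(percepts, centre, frame):
    """If a new patch's centre falls within MERGE_RADIUS of an existing
    percept, record it as another presentation of that percept;
    otherwise store it as a new percept."""
    for p in percepts:
        if math.hypot(p.centre[0] - centre[0],
                      p.centre[1] - centre[1]) <= MERGE_RADIUS:
            p.presentations.append(frame)
            return p
    p = Percept(centre, frame)
    percepts.append(p)
    return p
```

Memory should then grow with the number of distinct percepts rather than the number of patches, and that is exactly the property that is failing: centres drift enough between frames that most patches never fall within the radius of an existing percept.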

Why the lack of stability? Firstly, the test data is collected from the most realistic context for this city: overcast and raining. This introduces a significant amount of noise into the image, though one would think the mean-shift filtering and morphology operations (sketched below) would deal with it sufficiently. Secondly, and I think this is the most likely cause, the notion of a task-independent perceptual system may be inherently problematic. Many of the segmentation methods in OpenCV were ruled out because they require interactive input to assist in the segmentation; that is, the system looks for a specific kind of segmentation, as led by the user. Since this system has no user, where could those guiding landmarks come from? It is known that human perception is highly modulated by task requirements. It may be these task requirements (even something as simple as biologically rooted desire) that allow an infant to learn salient features of the world in order to eventually conceptualize it. Is a general, task-independent and temporally stable segmentation even possible?
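
For reference, the noise handling I have in mind looks roughly like this (a minimal OpenCV sketch; the parameter values are placeholders, not the project's actual settings):

```python
import cv2
import numpy as np


def denoise(frame, mask):
    """Mean-shift filtering on the colour image, morphological
    opening/closing on the binary segmentation mask."""
    # Edge-preserving smoothing; sp = spatial radius, sr = colour radius
    smoothed = cv2.pyrMeanShiftFiltering(frame, sp=10, sr=30)
    # Opening removes speckle noise, closing fills small holes
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return smoothed, mask
```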

Where does that leave the project? This work continues from MAM, where the choice was emphatic and explicit that the author should control as little of the relation between the machine and the world as possible. This is relaxed to some degree in the current project, but not to the degree of specifying a priori tasks for the machine. Where is the middle ground? Is it possible to automate the selection of salient features in the world that both allow temporal stability and provide a rich and complex enough bank of components for the construction of dreams? Boring perception leads to boring dreams.

My next steps are to segment a number of background frames into patches using the current method and dump their features, to see which are most stable (a sketch of this measurement follows). One of the greatest contributors to this lack of stability over time is that each frame yields a different number of patches. In fact, the variance in the number of patches per frame, over 20 test frames, is currently ~58, with a mean of ~106 patches per frame. This alone causes a high degree of temporal instability, as a single patch in the previous frame may be represented by two patches in the current frame, whose centres would obviously differ. Would any features be the same across time in this context?
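
A first pass at that measurement might look like this (a sketch; `segment` stands in for the current segmentation pipeline, and the feature set is reduced to centre and area for illustration):

```python
import csv
import statistics


def dump_patch_stats(frames, segment, out_path="patch_features.csv"):
    """Segment each frame, dump per-patch features to CSV for inspection,
    and report the mean and variance of the patch count across frames.
    `segment(frame)` is assumed to return (centre_x, centre_y, area) tuples."""
    counts = []
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "patch", "centre_x", "centre_y", "area"])
        for i, frame in enumerate(frames):
            patches = segment(frame)
            counts.append(len(patches))
            for j, (cx, cy, area) in enumerate(patches):
                writer.writerow([i, j, cx, cy, area])
    print("mean patches/frame:", statistics.mean(counts))
    print("variance of patch count:", statistics.variance(counts))
```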

Perhaps there are ways of automating the interactive segmentation methods, such as grabcut and watershed.
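
Watershed, for example, can be seeded from the image itself rather than from user strokes, with markers derived from the distance transform of a foreground mask. A sketch of that textbook recipe (using OpenCV 3+ names; I have not tested this against the project's data):

```python
import cv2
import numpy as np


def auto_watershed(frame, mask):
    """Watershed without interactive input: sure-foreground seeds come
    from peaks of the distance transform of a binary mask."""
    kernel = np.ones((3, 3), np.uint8)
    sure_bg = cv2.dilate(mask, kernel, iterations=3)      # sure background
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, cv2.THRESH_BINARY)
    sure_fg = sure_fg.astype(np.uint8)                    # sure foreground
    unknown = cv2.subtract(sure_bg, sure_fg)              # ambiguous region
    # Label each sure-foreground blob as a seed marker
    _, markers = cv2.connectedComponents(sure_fg)
    markers = markers + 1          # shift so background is 1, not 0
    markers[unknown == 255] = 0    # watershed resolves the 0-labelled region
    return cv2.watershed(frame, markers)
```

Whether the resulting segments would be any more stable over time than the current ones is exactly the open question.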