DNN Face Detection Confidence — Part 2

I ran a full-day (~8 hour) test while no one was home, with a low confidence threshold (0.1) for deep face detection. As I had previously seen, non-faces can be assigned very high confidence values. Before sunset (which, strangely, leads to high confidence in noise), the confidence wavers quite a lot and the maximum confidence remains 0.96.

The following image shows the extreme wavering of confidence over time where no faces are present (blue), shown alongside the short face test (red). The horizontal lines show the means of the face and no-face sets. It seems that under certain (lighting) conditions, like the dip below, the DNN reports very low confidence values (0.36) that would be easily differentiated from true-positive faces. Since I’m working with example code, I have not been dumping the camera frames that correspond with these values; I may need to do that to determine under what conditions the DNN performs well. Tomorrow I’ll run a test while I’m working (with my face present), try to make sure there are no false positives, and collect more samples. Over this larger data-set I have determined that the bump of no-face samples around 0.8 confidence does not happen in appropriate (bright) lighting conditions; see the histogram below.
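If I do end up dumping frames, the logging could look something like the sketch below. This is only an illustration: `detect_confidences()` is a hypothetical stand-in for whatever detector the example code wraps, and the file names are placeholders.

```python
import csv
import os
import time

import cv2

# Hypothetical stand-in for whatever detector is being evaluated;
# it should return the list of confidence values for a frame.
def detect_confidences(frame):
    return []  # placeholder

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture(0)
with open("confidence_log.csv", "w", newline="") as log:
    writer = csv.writer(log)
    writer.writerow(["timestamp", "max_confidence", "frame_file"])
    for _ in range(8 * 60 * 60):  # ~one sample per second for ~8 hours
        ok, frame = cap.read()
        if not ok:
            break
        confidence = max(detect_confidences(frame), default=0.0)
        frame_file = f"frames/{time.time():.3f}.jpg"
        cv2.imwrite(frame_file, frame)  # keep the frame to inspect lighting later
        writer.writerow([time.time(), confidence, frame_file])
        time.sleep(1.0)
```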

Without more information it’s unclear what confidence threshold would be appropriate, or even whether the DNN face detector is actually performing better than the Haar-based detector. This reference showed a significant difference in performance between the DNN and Haar methods, so I’ll see what model they used and hope for better performance using that…

DNN Face Detection Confidence

As I mentioned in the previous post, I was curious whether the DNN method would be any harder to “fool” than the old Haar method. The bad news is that the DNN will report quite high confidence when there are no faces, even in a dark room where most of the signal is actually sensor noise. The following plot shows the confidence over time in the face (red) and no-face (blue) cases. The no-face case involved the sun setting and the room getting dark, which can be seen in the increase of the variance of the confidence over time (compared to the relatively stable confidence of the face case). The confidence threshold was 0.6 for the face case and 0.1 for the no-face case.


Deep Face Detection

Following from my realization that the Haar-based classifier is extremely noisy for face detection, I decided to look into deep-network-based face detection methods. I found example code optimized for the Jetson to do inference using deep models. Some bugs in the code have made it hard to test, but I’ve fixed enough of them to start an early evaluation at least.

At first blush, the DNN method (using the facenet-120 model) is quite robust, but one of the bugs resets the USB camera’s brightness and focus, which makes evaluation difficult. There appear to be very few false positives; unfortunately, there are also quite a lot of false negatives. A complex background appears to be a problem for the DNN face detector, as it was for the Haar classifier.

I’m now dumping a bunch of confidence values in a context in which I know there is only one face being detected, to get a sense of the variance… Then I’ll do a run where I know there will be no faces in the images and see what the variance of confidence is for that case. There is also some DNN-based face detection code in OpenCV that looks to be compatible, which I’m also trying to figure out.
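For reference, the OpenCV route can be queried for raw confidences fairly directly. A minimal sketch, assuming the commonly distributed res10 SSD Caffe model files (the paths are placeholders, and this is not the Jetson example code I’ve been testing):

```python
import cv2

# Placeholder paths to the commonly distributed OpenCV face-detection model files.
PROTOTXT = "deploy.prototxt"
MODEL = "res10_300x300_ssd_iter_140000.caffemodel"

net = cv2.dnn.readNetFromCaffe(PROTOTXT, MODEL)

def face_confidences(frame):
    """Return the confidence of every detection in the frame."""
    blob = cv2.dnn.blobFromImage(
        cv2.resize(frame, (300, 300)), 1.0, (300, 300),
        (104.0, 177.0, 123.0))  # mean-subtraction values for this model
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, N, 7); confidence at index 2
    return [float(detections[0, 0, i, 2]) for i in range(detections.shape[2])]

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    print(sorted(face_confidences(frame), reverse=True)[:5])
```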

Face Detection Inaccuracy

After getting the new rendering code and face detection into an integrated prototype that I can test (and use to generate training data), I’m realizing the old-school Haar classifier running on the GPU works very poorly. Running the system with suitable lighting (I stopped labelling data once the images got too dark) yielded 628 face detections; of those, 325 were false positives. This is not great, and the complex background did not help; see the image below. I did not keep track of the number of frames processed (true negatives), so these numbers look much worse than they actually are in terms of accuracy: there were likely 1000s of true negatives. In a gallery context there would be much more control over the background, but I should try some example code using a trained CNN to detect faces and see how it performs.
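To put those counts in perspective: precision can be computed from the detections alone, even without the true-negative count (which accuracy would need):

```python
# Detection counts from the Haar test above.
detections = 628
false_positives = 325
true_positives = detections - false_positives  # 303

precision = true_positives / detections
print(f"precision: {precision:.1%}")  # ~48.2%
```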

False positive in complex background

New Compositions with X and Y Layer Offsets

The following image shows 25 randomly generated compositions where the layers can be offset in both directions. This allows for a lot more variation, and also for circles to include radial stripes that do not terminate in the middle. I’m about to meet with my tech, Bobbi Kozinuk, to talk about my new idea for a case design and any technical implications. I’ll also create a prototype that collects the time I look at each composition as a new data-set for training.

Long-List of Appropriated Paintings

The gallery below shows the strongest of all my explorations and refinements of the appropriated paintings. I’ll use this set to narrow down to a shortlist that will be finalized and produced. I’m not yet sure about the print media or size, but was thinking of normalizing them to ~19″ high to match the height of the Zombie Formalist. This would mean the tallest in this long-list would be ~8.5″ x 19″ (W x H) and the widest ~43″ x 19″. For media, I was thinking inkjet on canvas would emphasize painting.

AA Solution

I ended up adding the padding only to the right edge, which cleans up the hard outer edges of circles, where they bothered me the most. I also realized that there were dark pixels around the feathered edges; this was due to a blending error where I was setting a framebuffer to transparent black rather than to transparent with the background colour. There are still some jaggies, as shown in the images below, but the edges are working quite well.

I also made some quick changes after realizing that radial lines were never offset inwards or outwards from the circle; this was because offsets were only applied in 1D. I’ve added a second offset parameter for 2D offsets, and there is a lot of additional variety. I just realized this also means my previously trained model is no longer useful (due to the additional parameter), but I’ll need to train on some actual attention data anyhow. I’ll post some of the new compositions soon.

AA Edges (Again)…

After more testing I realized the padding approach previously posted has some unintended consequences: since all edges had padding, the circles are no longer continuous, and the padding introduces a seam where 0° = 360°, as shown in the following image. I also noticed that in some cases the background colour can be totally obscured by the stripes, which makes the padding look like a thin frame in a very different colour than the rest of the composition. In the end, while these changes make the edges look less digital, they introduce more problems than they solve.

AA Edges in Zombie Formalist Renderer

In my test data for machine learning I was not very happy with the results because of strong jaggies, especially on outer edges where the edge of the texture cuts off the sine-wave gradient. I added some padding to each single-row layer on the left and right edges and used a 1D shader blur to soften those cut-off edges. This works quite well but, as shown below, it only works on the left and right edges; the top and bottom stay jaggy. (Note: due to the orientation of layers, sometimes these ‘outer’ jaggies are radial and sometimes circular.)
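The blur itself is a GLSL fragment shader, but conceptually it’s just a 1D convolution across each row of the padded texture; a rough numpy sketch of the same idea (not the actual shader code):

```python
import numpy as np

def blur_rows_1d(texture, kernel_size=5):
    """Apply a horizontal (1D) box blur to each row of a single-channel texture."""
    kernel = np.ones(kernel_size) / kernel_size
    # Pad the left/right edges so the cut-off gradient has something to blend into.
    pad = kernel_size // 2
    padded = np.pad(texture, ((0, 0), (pad, pad)), mode="edge")
    return np.array([np.convolve(row, kernel, mode="valid") for row in padded])
```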


#4 Exploration and Refinement

The composition of this painting has a large black hole in the middle. The abstraction process seems to emphasize this, and I’m not totally sure about the results. The best image (top) does seem a little too abstract, but the emphasis on that dark area is reduced. I think I’ll try something in between sigma 500 and 600 if this image makes the final cut. Explorations below.

Final Experiment Using Colour Histogram Features

My Talos search using a 24-bin colour histogram finished. The best model achieved accuracies of 76.6% (training), 74.6% (validation) and 74.2% (test). Compare this to accuracies of 93.3% (training), 71.2% (validation) and 72.0% (test) for the previous best model using the initial features. On the test set, this is an improvement of only ~2%. The confusion matrix is quite a lot more skewed, with 224 false positives and only 78 false negatives, compared to 191 false positives and 136 false negatives for the previous best model using the initial features. As the histogram features would need to be calculated after rendering, I think it’s best to stick with the initial features, where the output of a generator can be classified before rendering, which will be much more efficient.
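For reference, these features are just per-channel colour histograms of the rendered composition. A minimal sketch, assuming 8 bins per channel for the 24-bin case and OpenCV for the histogram (the image path is a placeholder):

```python
import cv2
import numpy as np

def colour_histogram_features(image_path, bins_per_channel=8):
    """Concatenate per-channel colour histograms into one normalized feature vector."""
    image = cv2.imread(image_path)  # BGR
    features = []
    for channel in range(3):
        hist = cv2.calcHist([image], [channel], None, [bins_per_channel], [0, 256])
        features.append(hist.flatten())
    features = np.concatenate(features)
    return features / features.sum()  # normalize so image size doesn't matter

# 24-bin features (8 bins per channel); 16 per channel gives the 48-bin version.
print(colour_histogram_features("composition.png").shape)  # (24,)
```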

The following images show the new 100 compositions classified by the best model using these histogram features.

“Good” Compositions
“Bad” Compositions


Classification using final model

The following two images show the classification done by the final model, trained on all the data using the architecture parameters from the hyperparameter search. I think these are slightly better than those from the previous post.

“Good” Compositions
“Bad” Compositions

Looking back through my experiments, I thought I would take a crack at one more histogram-feature experiment. I saw a peak validation accuracy (using the method I had ruled out as problematic) of 75% with a 24-bin colour histogram, so I thought it would be worth a revisit.

Splits and new classified compositions!

One thing I realized about my previous experiments was that I had not changed the train/validate/test split, so I ran a few experiments with different splits; 50/25/25 was my initial choice, and I also tried 80/10/10, 75/15/15 and 60/20/20. My results showed that 75/15/15 seemed to work the best, and I wrote some code to classify new images using that trained model. The following are the results! I think the classification is actually working quite well; a couple of compositions I consider “bad” made it in there, but looking at these two sets I’m quite happy with the results.

“Good” Compositions
“Bad” Compositions
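As an aside on the split experiments above: re-cutting the data amounts to two passes of a splitting function. A minimal sketch using scikit-learn’s train_test_split (an illustration, not necessarily the code I actually ran), with the fractions as the variables being compared:

```python
from sklearn.model_selection import train_test_split

def make_split(X, y, val_frac=0.25, test_frac=0.25, seed=0):
    """Two-stage split into train/validate/test sets with the given fractions."""
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=val_frac + test_frac, random_state=seed, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / (val_frac + test_frac),
        random_state=seed, stratify=y_hold)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```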

My next ML steps are:

  • finalize my architecture and train the final model
  • integrate the painting generator and face detection to run as a prototype that logs looking durations for each composition
  • run some experiments using this new dataset collected in the ‘wild’ and decide on thresholds for mapping from duration of looking to “good” and “bad” labels (see the sketch after this list).
  • finally determine the best approach to running training code on the Jetson (embed keras? use ANNetGPGPU? FANN?) and implement it.
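For the thresholding step, the mapping itself is trivial once a cut-off is chosen; a sketch with a made-up threshold value (the real one will come from the distribution of logged durations):

```python
# Hypothetical threshold (seconds of looking) separating "good" from "bad";
# the real value will be chosen from the logged duration data.
LOOK_THRESHOLD = 2.0

def label_from_duration(duration_seconds):
    """Map a logged looking duration to a binary training label."""
    return 1 if duration_seconds >= LOOK_THRESHOLD else 0  # 1 = "good", 0 = "bad"

durations = [0.3, 1.2, 4.5, 7.0, 0.0]
print([label_from_duration(d) for d in durations])  # [0, 0, 1, 1, 0]
```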

Histogram Features Don’t Improve Classification Accuracy

Rerunning the grid search using the 48-bin (16 bins per channel) colour histogram features provided no classification improvement. The search reported a peak validation accuracy of 74% and a peak training accuracy of 83%. The best model achieved a classification accuracy of 84.6% for training, 70.6% for validation and 72.3% for testing. The confusion matrix for the test set is as follows:

  • 649 bad predicted to be bad.
  • 319 bad predicted to be good.
  • 220 good predicted to be bad.
  • 761 good predicted to be good.
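As a sanity check, the reported test accuracy follows directly from these counts (treating “good” as the positive class):

```python
# Test-set confusion matrix counts from above.
tn, fp, fn, tp = 649, 319, 220, 761  # bad→bad, bad→good, good→bad, good→good

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy:  {accuracy:.1%}")   # ~72.3%, matching the reported test accuracy
print(f"precision: {precision:.1%}")  # ~70.5%
print(f"recall:    {recall:.1%}")     # ~77.6%
```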

So it appears I’ve hit a wall and I’m out of ideas. I’ll stick with the initial (instructional) features and see if I can manage 75% accuracy for an initial model. Looking back at my experiments, it looks like my validation accuracies have ranged from ~62% to ~75%, and test accuracies from ~70% to ~74%.

At least all this experimentation means I have a pretty good idea that such a model will work on the Jetson, and I will not even need a deep network. I may even be able to implement the network using one of the C++ libraries I’ve already been using, like FANN or ANNetGPGPU.

No Significant Improvement Using Dropout Layers or Changing the Number of Hidden Units

After the realization that the ~80%+ results were in error, I’ve run a few more experiments using the initial features; unfortunately there was no improvement over the ~70% results. I added dropout to the input and hidden layers (there was previously only dropout on the input layer) and changed the number of units in the hidden layer (rather than using the same number as inputs). I did not try adding a second hidden layer because I have not seen a second hidden layer improve performance in any experiment; perhaps this is due to a lack of sufficient training samples for deeper networks.
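For context, the network being tuned is just a small Keras MLP along these lines (a sketch with placeholder values; the dropout rates and unit counts are exactly the things being searched):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features, hidden_units=32, input_dropout=0.2, hidden_dropout=0.2):
    """Single-hidden-layer classifier with dropout on the input and hidden layers."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dropout(input_dropout),          # dropout on the input layer
        layers.Dense(hidden_units, activation="relu"),
        layers.Dropout(hidden_dropout),         # dropout on the hidden layer
        layers.Dense(1, activation="sigmoid"),  # binary "good"/"bad" output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```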

The parameter search found a validation accuracy of 73.4%, while the best model showed a validation accuracy of 73.9% and a test accuracy of 71.8%. The network was not over-fit, with a training accuracy of 88.1%. The confusion matrix for the test set is as follows:

  • 658 bad predicted to be bad.
  • 291 bad predicted to be good.
  • 258 good predicted to be bad.
  • 742 good predicted to be good.

I’m now running a slightly broader hyperparameter search using the 48-bin colour histogram, and if I still can’t get closer to 80% accuracy I’ll classify my third (small) data set and see how it looks. In thinking about this problem I realized that there has always been a tension in this project: if the network is always learning, its output will become increasingly narrow and it will never be able to ‘nudge’ the audience’s aesthetic into new territories; the system needs to show the audience ‘risky’ designs to find new aesthetic possibilities. This is akin to getting trapped in a local minimum; there may be compositions the audience likes even more, but those can only be generated by taking a risk.

~86% Test Accuracy Appears to be Spurious

After running a few more experiments, it seems the reported near-90% test accuracy is spurious and related to a lucky random split of the data that probably overlapped heavily with the training split. The highest test and validation accuracies I’ve seen after evaluating models using the same split as training are merely ~74% and ~71%, respectively.

I did a little more reading on dropout and realized I had not tried different numbers of hidden units in the hidden layer, so I’m running a new search over different input- and hidden-layer dropout rates, numbers of hidden units, and a range of epochs and batch_size. If this does not significantly increase test and validation accuracy, then I’ll go back to the colour histogram features, and if that does not work… I have no idea…
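The search itself boils down to a Talos parameter dictionary plus a Scan call, roughly as in this sketch (the parameter ranges are placeholders, and the dummy data is only there so the sketch runs; it is not my actual search code):

```python
import numpy as np
import talos
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder search space; the real ranges may differ.
p = {
    "input_dropout": [0.0, 0.2, 0.4],
    "hidden_dropout": [0.0, 0.2, 0.4],
    "hidden_units": [8, 16, 32, 64],
    "batch_size": [16, 32, 64],
    "epochs": [50, 100, 200],
}

def zf_model(x_train, y_train, x_val, y_val, params):
    """Model function in the (x_train, y_train, x_val, y_val, params) form Talos expects."""
    model = keras.Sequential([
        keras.Input(shape=(x_train.shape[1],)),
        layers.Dropout(params["input_dropout"]),
        layers.Dense(params["hidden_units"], activation="relu"),
        layers.Dropout(params["hidden_dropout"]),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=params["batch_size"],
                        epochs=params["epochs"],
                        verbose=0)
    return history, model

# Dummy data so the sketch runs; substitute the real feature/label arrays.
x = np.random.rand(200, 10).astype("float32")
y = np.random.randint(0, 2, size=(200,)).astype("float32")

scan = talos.Scan(x=x, y=y, params=p, model=zf_model,
                  experiment_name="zf_dropout_search")
```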

#24 Exploration and Refinement

I spent a little too much time on #24, but I quite like Yves Tanguy and I thought the muted colour palette here would be interesting. I can’t say I’m happy with the results. I suspect the lack of colour diversity is what causes these to require so many training iterations to obliterate the original. The top image is my favourite, and the gallery below shows the other explorations. Next I’m moving on to #15.