Final Experiment Using Colour Histogram Features

My talos search using a 24-bin colour histogram has finished. The best model achieved accuracies of 76.6% (training), 74.6% (validation) and 74.2% (test). Compare this to accuracies of 93.3% (training), 71.2% (validation) and 72.0% (test) for the previous best model using initial features. On the test set, this is an improvement of only ~2 percentage points. The confusion matrix is considerably more skewed, with 224 false positives and only 78 false negatives, compared to 191 false positives and 136 false negatives for the previous best model using initial features. As the histogram features would need to be calculated after rendering, I think it’s best to stick with the initial features, where the output of a generator can be classified before rendering, which will be much more efficient.

The following images show the 100 new compositions classified by the best model using these histogram features.

“Good” Compositions
“Bad” Compositions


Classification using final model

The following two images show the classification done by the final model trained on all the data using the architecture params from hyperparameter search. I think these are slightly better than those from the previous post.

“Good” Compositions
“Bad” Compositions

Looking back through my experiments, I thought I would take a crack at one more histogram feature experiment. I saw a peak validation accuracy of 75% with a 24-bin colour histogram (using the problematic validation method I have since ruled out), so I thought it would be worth a revisit.

Meeting the Universe Halfway: Chapter 4 – Agential Realism

I finally got to reading Karen Barad’s book (titled above) and thought I would post my notes here while I reflect on them. After reading I also realized that I had gotten Bohm and Bohr confused in my notes from the Karen Barad Seminar; this has now been corrected. In parallel with the collage production one idea is to reconsider my current Artist Statement and rewrite it to be consistent with Agential Realism. Next, I think I’m going to read Chapter 7 to focus on what is meant by “entanglements”. My notes on chapter 4 are as follows:

Read more

Refinement of 3,000,000 Training Iteration Version

Since the previous post, I’ve focused on developing the 3,000,000 iteration version. I was not happy with the shuffled version, shown below on the right of the 3,000,000 iteration version. I prefer the balance of large photo-readable segments and small segments that emphasize flow in the left (previously posted) version.

Following this, I generated a sorted version of this composition where larger segments are behind the smaller segments; this emphasizes greater flow, but at the expense of photo-readable segments being visible. I’ve included the sorted version and a few details below. I was just thinking that perhaps I could include a small subset of the large (or medium) segments in front of the small ones by manipulating their order in a more complex way; for example, randomly select a few segments from the large end and insert them on the small end?
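A minimal sketch of that reordering idea, assuming each segment is described only by its area and that the list is drawn back to front. The function and parameter names (`stack_order`, `n_promoted`) are made up for illustration and are not taken from the collage code:

```python
import random

def stack_order(areas, n_promoted=3, seed=None):
    """Return segment indices in drawing order (first drawn = furthest back).

    Segments are sorted largest-first so small segments end up on top, then
    a few randomly chosen segments from the large end are moved to random
    positions within the small (front) end of the stack.
    """
    rng = random.Random(seed)
    order = sorted(range(len(areas)), key=lambda i: areas[i], reverse=True)
    half = len(order) // 2
    n_promoted = min(n_promoted, half)
    if n_promoted == 0:
        return order
    # pick the segments to promote from the large (back) half only
    promoted = rng.sample(order[:half], n_promoted)
    promoted_set = set(promoted)
    remaining = [i for i in order if i not in promoted_set]
    for idx in promoted:
        # re-insert somewhere in the front (small) half of the stack
        pos = rng.randint(len(remaining) // 2, len(remaining))
        remaining.insert(pos, idx)
    return remaining
```

Setting `n_promoted=0` gives the plain sorted version, so the same function could produce both variants for comparison.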

Fewer Iterations and Random Shuffling

Following from previous collages, I thought I would try fewer iterations (100,000) and randomly shuffling the stacking order of percepts. I can’t say I’m happy with these results; the most recent iteration is still the strongest. I’ve included below a few of these explorations. I’m now calculating a couple variations with 3,000,000 training iterations. I’m also going to focus on Barad and (re)framing my thinking about objects in relation to how I’ve been thinking about Machine Subjectivity. This will manifest in rewriting my artist statement, and I’ve also been playing with the idea of the artist statement as indeterminate, where the specific language is manifested as multiple permutations.

Splits and new classified compositions!

One thing I realized in my previous experiments was that I did not change the train/validate/test split. So I ran a few experiments with different splits; 50/25/25 was my initial choice. I tried 80/10/10, 75/15/15 and 60/20/20. My results showed that 75/15/15 seemed to work the best, and I wrote some code to classify new images using that trained model. The following are the results! I think the classification is actually working quite well; a couple compositions I consider “bad” made it in there, but looking at these two sets I’m quite happy with the results.

“Good” Compositions
“Bad” Compositions
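For reference, trying different train/validate/test splits like those described above could look roughly like this. It is only a sketch: it assumes samples are addressed by integer index, and `split_indices` is an illustrative name rather than anything from my training code:

```python
import random

def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test
```

Calling it with `(0.8, 0.1, 0.1)` or `(0.5, 0.25, 0.25)` and the same seed keeps the shuffle identical, so only the cut points change between experiments.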

My next ML steps are:

  • finalize my architecture and train the final model
  • integrate the painting generator and face detection to run as a prototype that logs looking durations for each composition
  • run some experiments using this new dataset collected in the ‘wild’ and decide on thresholds for mapping from duration of looking to “good” and “bad” labels.
  • finally determine the best approach to running training code on the Jetson (embed keras? use ANNetGPGPU? FANN?) and implement it.

Histogram Features Don’t Improve Classification Accuracy

Rerunning the grid search using the 48-bin (16 bins per channel) colour histogram features provided no classification improvement. The search reported peak accuracies of 74% for validation and 83% for training. The best model achieved a classification accuracy of 84.6% for training, 70.6% for validation and 72.3% for testing. The confusion matrix for the test set is as follows:

  • 649 bad predicted to be bad.
  • 319 bad predicted to be good.
  • 220 good predicted to be bad.
  • 761 good predicted to be good.
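As a reminder of what the 48-bin feature vector is, here is a rough NumPy sketch of a per-channel colour histogram with 16 bins per channel. It assumes an 8-bit RGB image array, and the normalization by pixel count is my assumption for the example, not necessarily what the experiment used:

```python
import numpy as np

def colour_histogram(image, bins_per_channel=16):
    """image: H x W x 3 uint8 array. Returns 3 * bins_per_channel features."""
    features = []
    for channel in range(3):
        counts, _ = np.histogram(
            image[..., channel], bins=bins_per_channel, range=(0, 256)
        )
        features.append(counts)
    hist = np.concatenate(features).astype(np.float64)
    # assumed normalization: each channel's bins sum to 1
    return hist / image[..., 0].size
```

With `bins_per_channel=8` the same function gives a 24-value vector.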

So it appears I’ve hit the wall and I’m out of ideas. I’ll stick with the initial (instructional) features and see if I can manage a 75% accuracy for an initial model. Looking back at my experiments, it looks like my validation accuracies have ranged from ~62% to ~75% and test from ~70% to ~74%.

At least all this experimentation has meant that I have a pretty good idea that such a model will work on the Jetson and I will not even need a deep network. I may even be able to implement the network using one of the C++ libraries I’ve already been using like FANN or ANNetGPGPU.

No Significant Improvement Using Dropout Layers or Changing the Number of Hidden Units

After the realization that the ~80%+ results were in error, I’ve run a few more experiments using the initial features. Unfortunately, there was no improvement over the ~70% results. I added dropout to input and hidden layers (there was previously only dropout on the input layer) and changed the number of units in the hidden layer (rather than matching the number of inputs). I did not try adding a second layer because I have not seen a second hidden layer improve performance in any experiment; perhaps this is due to a lack of sufficient training samples for deep networks.

The parameter search found a validation accuracy of 73.4%, while the best model showed a validation accuracy of 73.9% and a test accuracy of 71.8%. The network was not over-fit with a training accuracy of 88.1%. The confusion matrix for the test set is as follows:

  • 658 bad predicted to be bad.
  • 291 bad predicted to be good.
  • 258 good predicted to be bad.
  • 742 good predicted to be good.
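The four counts above are enough to recover the test accuracy and to see how each class fares on its own. A small sketch (the function name and dictionary keys are just for illustration):

```python
def confusion_summary(bad_as_bad, bad_as_good, good_as_bad, good_as_good):
    """Overall accuracy and per-class recall from a 2x2 confusion matrix."""
    total = bad_as_bad + bad_as_good + good_as_bad + good_as_good
    return {
        "accuracy": (bad_as_bad + good_as_good) / total,
        # fraction of truly bad compositions that were labelled bad
        "bad_recall": bad_as_bad / (bad_as_bad + bad_as_good),
        # fraction of truly good compositions that were labelled good
        "good_recall": good_as_good / (good_as_bad + good_as_good),
    }

summary = confusion_summary(658, 291, 258, 742)
# accuracy is about 0.718, matching the 71.8% test accuracy reported above
```

The per-class numbers make the skew visible: with these counts, good compositions are recognized somewhat more reliably than bad ones.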

I’m now running a slightly broader hyperparameter search using the 48-bin colour histogram, and if I still can’t get closer to 80% accuracy I’ll classify my third (small) data set and see how it looks. In thinking about this problem, I did realize that there was always a tension in this project. If the network is always learning, its output will become increasingly narrow and it will never be able to ‘nudge’ the audience’s aesthetic into new territories; there is a need for the system to show the audience ‘risky’ designs to find new aesthetic possibilities. This is akin to getting trapped in local minima; there may be compositions the audience likes even more, but those can only be generated by taking a risk.

~86% Test Accuracy Appears to be Spurious

After running a few more experiments, it seems the reported ~86% test accuracy is spurious and related to a lucky random split of data that was probably highly overlapping with the training data split. The highest test and validation accuracies I’ve seen after evaluating models using the same split as training are merely ~74% and ~71%, respectively.

I did a little more reading on dropout and realized I had not tried different numbers of hidden units in the hidden layer, so I’m running a new search with different input and hidden layer dropout rates, numbers of hidden units and some range of epochs and batch_size. If this does not significantly increase test and validation accuracy then I’ll go back to the colour histogram features and if that does not work… I have no idea…
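To give a sense of the size of such a search, here is a sketch of the grid as a plain dictionary of lists, which is the general shape a grid search expects. Every value below is a placeholder I picked for the example, not the ranges actually searched, and the enumeration is plain Python rather than the talos API:

```python
from itertools import product

# hypothetical search space; all values are placeholders
search_space = {
    "input_dropout": [0.0, 0.2, 0.5],
    "hidden_dropout": [0.0, 0.2, 0.5],
    "hidden_units": [26, 52, 104],
    "epochs": [50, 100],
    "batch_size": [32, 64],
}

def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = sorted(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

n_models = sum(1 for _ in grid(search_space))  # 3 * 3 * 3 * 2 * 2 = 108
```

Even this small grid means 108 models to train, which is why adding one more parameter noticeably lengthens a search.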

#24 Exploration and Refinement

I spent a little too much time on #24, but I quite like Yves Tanguy and I thought the muted colour palette here would be interesting. I can’t say I’m happy with the results. I suspect the lack of colour diversity is what causes these to require so many training iterations to obliterate the original. The top image is my favourite, and the gallery below shows the other explorations. Next, I’m moving on to #15.

~86% Test Accuracy Using Initial Features?

Following from the previous results using the new workflow, I went back to my initial features (the 52-element vector of instructions used to generate compositions). The results have turned out to be amazing. The best model achieved accuracies of 85.5% (training), 85.6% (validation) and 85.9% (test). This is a significant increase from the previous best result of 79% (validation). These accuracies are means of accuracies reported over five runs with different splits of the data-set. Note, these splits are still 50/25/25 so that the sizes of the subsets are comparable with previous results. The ‘training’ accuracy is then not actually the accuracy on the data used to train the network, but the accuracy on a random subset of similar size to the training set. 616 bad compositions were predicted to be bad, 105 bad predicted to be good, 105 good predicted to be bad and 634 good predicted to be good. Again, these are averages over multiple predictions with different splits.

As I write this, I’m realizing that my validation method is problematic. I set aside a test set (during training) to check generalizability beyond the training and validation sets. My validation code is a separate instance and has no access to that specific test split. I need to save that specific test set and then validate the best model based on it, not multiple random runs with random splits. This may be skewing my results, since my random splits use both training and validation samples. So what I need to do is save the split used during training and evaluation and run predictions on it. I’m working on those code changes now…
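The fix described above comes down to persisting the split once and reusing it in every later evaluation. A rough sketch, assuming samples are identified by integer indices and the split is stored as JSON (the function names and file layout are illustrative, not my actual code):

```python
import json

def save_split(path, train_idx, val_idx, test_idx):
    """Write the exact indices used during training to disk."""
    split = {
        "train": list(train_idx),
        "val": list(val_idx),
        "test": list(test_idx),
    }
    with open(path, "w") as f:
        json.dump(split, f)

def load_split(path):
    """Read a saved split and refuse to continue if the subsets overlap."""
    with open(path) as f:
        split = json.load(f)
    seen = set()
    for name in ("train", "val", "test"):
        ids = set(split[name])
        if seen & ids:
            raise ValueError(f"'{name}' overlaps an earlier subset")
        seen |= ids
    return split
```

The evaluation script would then call `load_split` and predict only on the `"test"` indices, so no training sample can leak into the reported test accuracy.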

#1 Refinement

I ran a few more iterations appropriating #1 and they are looking quite nice. I think the top image is the most successful, but I’m not convinced by the blueish band near the right edge. I’m happy with the degree of abstraction where the structure breaks away from the figure form, which is still visible in the lower image. I’m starting to realize my choice of neighbourhood size seems to be related to the size of faces in the source. Portraits of one person require larger neighbourhoods than group portraits. An interesting side exploration would be to use face detection to automatically determine neighbourhood size for paintings with faces (assuming face detection works well enough on painted faces). I think I’ll leave this one here for now and move along.

Revisiting Older Experiments

After those recent strong results with the changed code, I’m revisiting older experiments to see if they were in fact showing promise; I’m figuring out whether it was the previous features or the previous validation method that led to that 70% accuracy ceiling.

The 24 colour histogram feature results do not improve upon the 24 hist + 31 non-colour parameter results. I did learn a few things in the process, including that the stochastic splits change the measured accuracy of the best selected model. From this point I’ll be reporting the mean accuracy and the mean confusion matrix over 5 runs using different random splits of validation and test data. I also re-ran the evaluation code on the previous experiment with 24+31 features in case the good results were a fluke. Following are the results.

31 + 24 Features

Mean Accuracy:


Mean of Confusion Matrices

375.0 bad predicted to be bad
106.4 bad predicted to be good
112.8 good predicted to be bad
381.8 good predicted to be good

24 Hist Features

Mean Accuracy:


Mean of Confusion Matrices

531.8 bad predicted to be bad.
194.6 bad predicted to be good.
155.2 good predicted to be bad.
579.4 good predicted to be good.

So the 31 + 24 features have performed much better than the 24 colour hist features alone. I’m rerunning the initial and variance feature experiments using the new validation method.
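The fractional counts in the mean confusion matrices come from averaging element-wise over runs. A small sketch of that averaging, with two made-up runs standing in for the real five (the numbers in the example are invented purely to show the arithmetic):

```python
def mean_confusion(matrices):
    """Element-wise mean of 2x2 matrices laid out as
    [[bad_as_bad, bad_as_good], [good_as_bad, good_as_good]]."""
    n = len(matrices)
    return [
        [sum(m[r][c] for m in matrices) / n for c in range(2)]
        for r in range(2)
    ]

def accuracy(matrix):
    """Share of samples on the diagonal of a confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# invented example runs, not experimental results
runs = [[[10, 2], [3, 9]], [[12, 0], [1, 11]]]
mean_matrix = mean_confusion(runs)  # [[11.0, 1.0], [2.0, 10.0]]
```

When every run evaluates the same number of samples, the accuracy of the mean matrix equals the mean of the per-run accuracies, so the two summaries stay consistent.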

#1 and #3 Initial Sketches

As I work my way up in resolution, I’ve generated initial sketches of #1 and #3. #1 requires a larger neighbourhood to create more abstraction since the original is so well known. #3 also needs more iterations, as some of the original painting (God’s face) is still visible. I also tried to do a run of one of the larger paintings, #4, but the process crashed, presumably due to a memory error.