Histogram Features Don’t Improve Classification Accuracy

Posted: September 17, 2019 at 4:16 pm

Rerunning the grid search using the 48 bin (16 bins per channel) colour histogram features provided no classification improvement. The search reported a peak validation accuracy of 74% and 83% for the training set. The best model achieved a classification accuracy of 84.6% for training, 70.6% for validation and 72.3% for testing. The confusion matrix for the test set is as follows:

  • 649 bad predicted to be bad.
  • 319 bad predicted to be good
  • 220 good predicted to be bad.
  • 761 good predicted to be good.

So it appears I’ve hit the wall and I’m out of ideas. I’ll stick with the initial (instructional) features and see if I can manage a 75% accuracy for an initial model. Looking back at my experiments, it looks like my validation accuracies have ranged from ~62% to ~75% and test from ~70% to ~74%.

At least all this experimentation has meant that I have a pretty good idea that such a model will work on the Jetson and I will not even need a deep network. I may even be able to implement the network using one of the C++ libraries I’ve already been using like FANN or ANNetGPGPU.

No Significant Improvement Using Dropout Layers nor Changing the Number of Hidden Units.

Posted: September 15, 2019 at 6:13 pm

After the realization that the ~80%+ results were in error, I’ve run a few more experiments using the initial features. Unfortunately no improvement from the ~70% results. I added dropout to input and hidden layers (there was previously only dropout on the input layer) and changed the number of units in the hidden layer (rather than using the same number of inputs). I did not try adding a second layer because I have not seen a second hidden layer improve performance in any experiment; perhaps this is due to a lack of sufficient training samples for deep networks.

The parameter search found a validation accuracy of 73.4%, while the best model showed a validation accuracy of 73.9% and a test accuracy of 71.8%. The network was not over-fit with a training accuracy of 88.1%. The confusion matrix for the test set is as follows:

  • 658 bad predicted to be bad.
  • 291 bad predicted to be good
  • 258 good predicted to be bad.
  • 742 good predicted to be good.

I’m now running a slightly broader hyperparameter search using the 48 bin colour histogram and if I still can’t get closer to 80% accuracy I’ll classify my third (small) data set and see how it looks. In thinking about this problem I did realize that there was always a tension in this project. If the network is always learning its output will become increasingly narrow and never be able to ‘nudge’ the audience’s aesthetic into new territories; there is a need for the system to show the audience ‘risky’ designs to find new aesthetic possibilities. This is akin to getting trapped in local minima; there may be compositions the audience likes even more, but those can only be generated by taking a risk.

#15 Exploration and Refinement

Posted: September 15, 2019 at 5:33 pm

The top image shows my favourite result for #15, which I think is pretty successful; I was not sure how the abstraction of the original (cubist) source would work out. I think this shows sufficient dissolution of the original. Explorations are included in a gallery below.


~86% Test Accuracy Appears to be Spurious

Posted: September 13, 2019 at 5:09 pm

After running a few more experiments, it seems the reported near 90% test accuracy is spurious and related to a lucky random split of data that was probably highly overlapping with the training data split. The highest test and validation accuracies I’ve seen after evaluating models using the same split as training are merely ~74% and 71%, respectively.

I did a little more reading on dropouts and realized I had not tried different numbers of hidden units in the hidden layer, so I’m running a new search with different input and hidden layer dropout rates, number of hidden units and some range of epochs and batch_size. If this does not significantly increase test and validation accuracy then I’ll go back to the colour histogram features and if that does not work… I have no idea…

#24 Exploration and Refinement

Posted: September 13, 2019 at 3:43 pm

I spent a little too much time on #24, but I quite like Yves Tanguy and I thought the muted colour palette here would be interesting. I can’t say I’m happy with the results. I suspect the lack of colour diversity is what causes these to require so many training iterations to obliterate the original. The top image is my favourite, and the gallery below shows the other explorations. I’m next moving onto #15.

#3 Refinement

Posted: September 7, 2019 at 9:39 am

I’ve found it quite difficult to get a version of #3 smooth and without remnants of the original. The image on the top here is closest, even though there is a very small detail in the original which is still visible. Images below were ruled out.

~86% Test Accuracy Using Initial Features?

Posted: September 5, 2019 at 3:58 pm

Following from the previous results using the new workflow, I went back to my initial features (the 52 vector of instructions used to generate compositions). The results are have turned out to be amazing. The best model achieved accuracies of 85.5% (training), 85.6% (validation) and 85.9% (test). This is a significant increase from the previous best result of 79% (validation). These accuracies are means of accuracies reported over five runs with different splits of the data-set. Note, these splits are still 50/25/25 so that the size of the subsets are comparable with previous results. The ‘training’ accuracy, is then not actually the accuracy on the data used to train the network, but the accuracy on a random subset of similar size as the training set. 616 bad compositions were predicted to be bad, 105 bad predicted to be good, 105 good predicted to be bad and 634 bad predicted to be bad. Again, these are averages over multiple predictions with different splits.

As I’m writing this I was thinking that my validation method is problematic. I set aside a test set (during training), to check generalizability beyond the training and validation sets. My validation code is a separate instance and has no access to that specific test split. I need to save that specific test set and then validate the best model based on it, not multiple random runs with random splits. This may be skewing my results, since my random splits use both training and validation samples. So what I need to do is save the split used during training and evaluation and run predictions on them. I’m working on those code changes now…

#1 Refinement

Posted: September 5, 2019 at 10:20 am

I ran a few more iterations appropriating #1 and they are looking quite nice. I think the top image is the most successful, but I’m not convinced by the blueish band near the right edge. I’m happy with the degree of abstraction where the structure breaks away from the figure form which is still visible in the lower image. I’m starting to realize my choice of neighbourhood size seems to be related to the size of faces in the source. Portraits of one person require larger neighbourhoods than group portraits. An interesting side exploration would be to use face detection to automatically determine neighbourhood size for paintings with faces (assuming face detection works well enough on painted faced). I think I’ll leave this one here for now and move along.

Revisiting Older Experiments

Posted: September 3, 2019 at 5:46 pm

After those recent strong results with the changed code, I’m revisiting older experiments to see if the they were in fact showing promise; I’m figuring out whether it was previous features, or the previous validation method that lead to that 70% accuracy ceiling.

The 24 colour histogram feature results do not improve upon the 24 hist + 31 non-colour parameter results. I did learn a few things in the process, including that the stochastic splits change the measured accuracy of the best selected model. From this point I’ll be reporting the mean of accuracy and confusion matrices of 5 runs using different random splits of validation and test data. I also re-ran the evaluation code on the previous experiment with 24+31 features in case the good results were a fluke. Following are the results.

31 + 24 Features

Mean Accuracy:


Mean of Confusion Matrices

375.0 bad predicted to be bad
106.4 bad predicted to be good
112.8 good predicted to be bad
381.8 good predicted to be good

24 Hist Features

Mean Accuracy:


Mean of Confusion Matrices

531.8 bad predicted to be bad.
194.6 bad predicted to be good
155.2 good predicted to be bad.
579.4 good predicted to be good.

So the results are that the 31 + 24 features have performed much better than 24 colour hist features alone. I’m rerunning the initial and variance feature experiments using the new validation method.

#1 and #3 Initial Sketches.

Posted: September 3, 2019 at 10:44 am

As I work my way up in resolution, I’ve generated an initial sketch of #1 and #3. #1 requires a lager neighbourhood to create more abstraction since the original is so well known. #3 also needs more iterations as some of the original painting (God’s face) is still visible. I also tried to do a run of one of the larger paintings, #4, but the process crashed; presumably due to a memory error.

#07 Refinements

Posted: August 30, 2019 at 11:14 am

I’m now setting this aside and moving onto the next images in the short list. The top image is the best result at this time.

#7 Explorations

Posted: August 29, 2019 at 11:17 am

While I’m not quite satisfied with these results, the top image shows what I think of as the most successful iteration; there is still a little of the initial conditions showing in in the faces though, so I’m running another session with slightly more iterations. The gallery below shows all my explorations of #7 up to this point. I’m struggling a little with the tension between smoothness and somewhat uniform colour patches with their harder edges. For this source painting, the patches in the ground can cue camouflage patterns that I’m not keen about.

#23 and #8 Revisited

Posted: August 26, 2019 at 5:33 pm

As I mentioned in the previous post, I wanted to revisit the previously ruled out paintings. I used smaller learning rates to see if that salvaged them. I can’t say I’m happy with the results; although they are more smooth, they are still lacking.

Further Narrowing Down for #5.

Posted: August 25, 2019 at 3:33 pm

After doing a few more runs with tweaked parameters I’m not sure I’m doing much better so I’m going to leave #05 here and re-run the two lower resolution paintings that were previously ruled out (#23 and #18). The first image is the most successful, but is very similar to the those in the top row of the gallery. The bottom row includes the least successful, though I still think there is something to the larger neighbourhood in the lower right image.

Narrowing Down Explorations of #5 With Smaller Learning Rates

Posted: August 24, 2019 at 12:15 pm

After the insight in the previous post, I’ve explored a few variations using learning-rates smaller than 1.0. The following images are my favourites. They balance abstraction and emergent structure quite well, but are not quite there. The image on the left is insufficiently abstract where remnants of the mast in the original are still present. The wave-like structures in the lower left are very interesting and suggest quite a bit of depth and also cue the waves in the original. The image on the right shows quite good abstraction, but lacks some of that complexity in the waves, due to the larger neighbourhood (sigma = 200px).

The following images show the rest of the explorations, including highly over-abstracted versions that approach gradients. I’ve also included an attempt with a relatively high learning rate of 0.5, the highest of these explorations where the rest are 0.25 or 0.1. In that image (upper left of bottom gallery) the wave section in the lower left is very interesting, although approaches the appearance of spires; I’m not sure about the harder edges and mottled patches. That composition also shows a degree of under-organization at the smaller scale, e.g. splashes of red in the area above the bright spot.

Spires and Full Resolution Explorations.

Posted: August 21, 2019 at 12:56 pm

The images above show a few attempts to reproduce the aesthetic of the mid-resolution exploration of #5 at full resolution. As the ‘spires’ clearly overwhelm the image I wrote to the author of ANNetGPGPU. The conclusion is that the interaction of high learning rates and small neighbourhood functions lead to cases where the next BMU is very likely to be close to the previous BMU. The result is a trail of BMUs that progress across the SOM. It is unclear why they always progress at the same angle. I’m now running a test with a learning rate of 0.75 (rather than 1.0 as used previously) and I’ll continue to change learning rates and see how that looks! I may want to also revisit my previously ruled out paintings with this new insight. Now that I know these spires are an emergent result of the SOM, it’s something I should explicitly explore in the future!

Finally Cracked the 70% Validation Accuracy Wall!

Posted: August 21, 2019 at 12:46 pm

I changed my Talos code to explicitly include a best model selection call, running Predict(), and added a call to do 10-fold cross validation of models, running Evaluate() before saving the search session. It is not quite clear to me whether these two actions change the criteria by which models are selected for deployment, but in my first use of these calls my performance has jumped 10%.

I also split my data differently; data is split into 50/25/25% for training, validation and testing. The validation set is used in Talos Scan() and the testing set is used in Evaluate(). The features of this last session were using 31 features from the initial dataset (instructions to generate compositions, excluding colour data) and 25 colour histogram features. I was also wondering if the number of dimensions of my features meant I was not going to get anywhere with as few samples as I have.

The best model reported an accuracy of 78.4% on the training set, 80% on the validation set and 77% on the testing set. This indicates a huge improvement and makes me wonder if Talos was just selecting a very poor ‘best model’ previously. One caveat is that the log Talos generates that shows performance during training shows very different results; in the log, the greatest accuracy was reported as 56.8% on the validation set and 100% on the training set, highly divergent from the prediction accuracy made by the best model. I should also note that I removed the fixed RNG seeds for splits and data shuffling, so the search is stochastic and may be getting a broader picture since it’s not limited by reproducibility. The best model using the validation set predicted 304 bad compositions to be bad, 70 bad to be good, 74 good to be bad and 284 good to be good.

If I can reproduce this performance, I’ll then generate a new set of random compositions and see how the best model classifies them.

Early Full-Resolution Explorations

Posted: August 18, 2019 at 1:05 pm

Starting from the lowest resolution images of the 7 short-listed, I’ve been exploring using them at full resolution. Using the previous parameters for the intermediary resolutions, I was unable to get any strong results, see below. I’m wondering if colour diversity tends to result in images that are poorer… The main aesthetic weakness is the hard edges that manifest, even though the neighbourhood function has Gaussian edges. This was not seen, at least to the same degree, in the expanded intermediary resolution explorations. I’m currently computing a full resolution version of #5 (intermediary, original) and hope it’s more successful.

Fewer Histogram Features

Posted: August 18, 2019 at 9:12 am

After the lack of success in the previous experiment using the 768 element vector, I have the results of the 96 histogram bin experiment. During the search, Talos reported a peak validation accuracy of 73.3%. The best model reported a validation accuracy of 66.4% and a training accuracy of 99.7%. Clearly the model is learning the training set well, but again not generalizing to the validation set. The following image shows the confusion matrix for the validation set. I note that there is no appreciable difference between 1000 and 10,000 epochs to validation accuracy.

Expanded Intermediary Resolution Explorations

Posted: August 16, 2019 at 2:27 pm

The following images were computed over night using the same params as in the previous post. The training time is significantly longer than estimated, due to the larger number of pixels (due to aspect ratio), so only three were generated at the time of writing. While these results are going in the right direction, they are still too similar to the original compositions (with the exception of 07, lower right) and need further abstraction (increase of neighbourhood size). I emailed the author of the GPU accelerated SOM I’m using and see if he can reproduce these spire effects. Since the number of iterations has such a significant effect, it seems I should be working image by image at full resolution. As inefficient as I may be that seems like the next step; I’ll prioritize the lowest resolution images for exploration sake!

Intermediary Resolution Explorations

Posted: August 15, 2019 at 4:47 pm

I’m thinking that it makes the most sense to move up in resolution and do some experimentation at each resolution until the desired resolution is reached. It will be clear from this post that the quality of the aesthetic changes significantly at various resolutions. In order to prevent the image from approaching a gradient with such a high number of training iterations (required to provide a good sampling of the underlying diversity of the original painting), I’ve been using very small neighbourhood sizes. The image below is my best choice and it’s trained over 0.5 epochs (half the pixels) and a neighbourhood of 35px. At HD resolution, this image takes 2.5 hours to compute. If you look carefully, you’ll see some dark ‘spires’ growing from the lower left that look to be the same as those I encountered during the development of “As our gaze peers off into the distance, imagination takes over reality…” (2016). I still have no explanation of them…

For comparison, I’ve included the original image and the low resolution sketch below. At the bottom of this post images show the other neighbourhood sizes I experimented with (left: 78px; right: 150px), and rejected due to their over-abstraction.


Decomposed Survey of Long-Listed Paintings

Posted: August 14, 2019 at 6:00 pm

I realized that I would not be able to get a survey of images that at least sketch out how they may look without down-scaling significantly. I’ve reduced the resolution of my working files from fitting in an HD frame down to 10% and calculated SOMs where the number of iterations matches the number of pixels. I’m quite happy with the quality of these results! Only a few seem quite weak to me, due to (a) the lack of diversity (which is exaggerated by the brutal down-sampling here) or (b) a lack of colour restraint. The images below are in the same order as the painting long-list post.

Painting Decomposition by SOM: Initial Work in Progress

Posted: August 14, 2019 at 5:00 pm

While Talos is searching for suitable models for the Zombie Formalist, I’ve started experimenting with revisiting the painting appropriation side of the project. For the initial exploration, I’m using da Vinci’s “Mona Lisa” (1517).

The following images are various explorations of abstracting the above image using the SOM to reorganize constituent pixels. Through exploring these I realized that one of the greatest influences on the quality of the result is the random sampling of pixels. The working image is 1080×1607 pixels, which means 1,735,560 training samples. In my tests using ~20,000 training iterations, only a small subset of the diversity of those pixels influence the resulting image. In these tests, I realized the most successful results are those that happen to select (randomly) a large diversity of pixels to train the SOM. The same parameters can produce very different results:

I think the image on the left is more successful because it happened to select a few brighter pixels in the original. I can produced better results by down-scaling the image to increase the diversity of pixels selected by random sampling, but that is not ideal since I’m limiting both the output resolution and the diversity of data used in training. It seems I should stick with the number of iterations that equal (at least) the number of training samples (the number of pixels in the original). Looking again at my old code, I did not realized I had fixed the neighbourhood function; in all the images below, the only variable that effects the output is the number of iterations.


Not seeing improvement with hist features.

Posted: August 13, 2019 at 2:09 pm

It took nearly 10 days for Talos to search possible models using the 768 item vector representing the colour histogram for each composition. The best validation accuracy listed by the search was 68.5% and the best model 66.2%. The best model achieved a training accuracy of 77.9%. 465 bad compositions were predicted to be bad, 294 bad compositions were predicted to be good, 232 good compositions were predicted to be bad and 568 good compositions were predicted to be good.

This is a very minor improvement from the variance features. The low training accuracy indicates there may not be enough epochs for such a large dimensional vector. I’m now running a second experiment where the 768 bin (256 bins per channel) histogram is reduced to a 96 bins (32 bins per channel). This is more comparable to the initial 57 element training vectors. If the problem is the size of the vector, this should allow for higher training accuracy and I hope, also better generalization in the next search.

Variance features result in even lower validation accuracy.

Posted: August 4, 2019 at 1:31 pm

The quick variance features were easy to implement, but provided no improvement and performed worse than the previous features. The parameter search resulted in a peak validation accuracy of 64.1% while the best model achieved 66% accuracy on training data and 62.1% on validation data. The following image shows the confusion matrix for validation data. I’m next going to generate colour histograms for the 15000B compositions and see if leads to any improvement.

Long list of paintings for appropriation.

Posted: August 3, 2019 at 1:49 pm

With all this focus on the Zombie Formalist I’ve been spending some of the ML search time researching for the painting history appropriation aspect of the project. I’ve narrowed down a long list of paintings based on popularity and their trajectory from Northern European Renaissance realism to modern problematizations of realism; I’ve selected works from the Renaissance, Cubism and Surrealism, as follows. Thumbnail images are included below the table.

The next step for this component of the project is to do some ML to reorganize the pixels and see what works best. The resolution some of the sources are quite high, quite low for others. It’s yet unclear how to consider the scale of the originals in the appropriation works as some are very large. I’m also not sure how large I will be able to go with the self-reorganization process.

Leonardo da VinciMona Lisa1517
Leonardo da VinciSalvator Mundi1500
MichelangeloThe Creation of Adam, Sistine Chapel ceiling1512
CaravaggioThe Conversion of Saint Paul1601
RembrandtThe Storm on the Sea of Galilee1633
RembrandtThe Anatomy Lesson of Dr Nicolaes Tulp1632
RembrandtThe Night Watch1642
Juan GrisGlass and Checkerboard1917
Juan GrisNature morte à la nappe à carreaux (Still_Life_with_Checked_Tablecloth)1915
Juan GrisPortrait of Pablo Picasso1912
DuchampNude Descending a Staircase No. 21912
Jean MetzingerLe goûter (Tea Time)1911
Georges BraqueViolin and Palette (Violon et palette, Dans l’atelier)1909
Georges BraqueNature Morte (The Pedestal Table)1911
Georges BraqueMan with a Guitar (Figure, L’homme à la guitare)1912
Georges BraqueBottle and Fishes1912
Fernand LégerLes Fumeurs (The Smokers)1912
Albert GleizesPortrait de Jacques Nayral1911
Albert GleizesL’Homme au Balcon (Man on a Balcony)1912
Rene MagritteThe Son of Man1964
Rene MagritteThe Human Condition1933
Yves TanguyMama, Papa Is Wounded1927
Yves TanguyThrough birds through fire but not through glass1943


No Improvement with New Features

Posted: July 31, 2019 at 5:22 pm

Running a scan of hyperparameters over 145 models resulted in no improvement of the 70% validation accuracy. (Well, actually one model reported 74% validation accuracy during the search, but was not saved as the “best model” by talos.) Below are the confusion matrices for both training and validation sets. Based on these results, it’s time to change to other features. I’ll try calculating the variance of each feature across layers first, since that’s pretty easy to implement, and resort to colour histograms of images if that leads no where.


New Features with new Dataset.

Posted: July 16, 2019 at 6:03 pm

Following from my last post I finished generating and labelling a new dataset. I’m right now rerunning the previous experiment in talos to see if this new data-set makes any change. An initial look at a few of the distributions of my features, good and bad compositions remain quite evenly distributed but a second look shows there is some uneveness to the distribution of some features, such as offset:

More offsets near 0 were labelled to be bad. Those two spikes are also quite far from an even distribution.

The new dataset is also 15,000 items including 3921 “good”, 3872 “bad” and 7207 “neutral” labels. Following is a random sampling of good and bad compositions from the new dataset:


Deep Networks provide no increase of validation accuracy.

Posted: June 12, 2019 at 4:20 pm

After doing quite a bit more learning I used talos to search hyperparameters for deeper networks. I ran a few experiments and in no case could I make any significant improvement to validation accuracy from the simple single hidden layer network. While there is some improvement tuning various hyperparameters, all the tested network configurations resulted in validation accuracy ranging from 61% to 73% (60% to 100% for training data). The following plot shows the range of validation accuracy over the number of hidden layers. Note the jump from 1 to 2 hidden layers does increase validation accuracy, but only an mean increase of 0.3%.

The confusion matrix for the best model is about the same as it was for the single hidden layer model first trained in keras (without hyperparameter tuning!):

Through the last couple of weeks I have made no significant gains with keras. So the problem is clearly my features. Everything I’m seeing seems to indicate that my initial fears in regards to the lack of separability in my initial T-sne results. So I have a few ideas on how to move forward:

  1. Rather than using the raw features for classification, do some initial stats on those features and use those stats for training. This only effects features that can be grouped, e.g. stats on the set of colours of all layers in a composition. Two ideas are variance of such groups of features, or full histograms for each group of features.
  2. Since my features are normalized, they all have the same range. This means that regardless of their labels, all features will have the same stats, making #1 moot! So it looks like I should convert my code so that the features are the actual numbers used to generate compositions and not these 0-1 evenly distributed random numbers. This means generating and labelling a new data-set.

Reproducing previous R results in Keras.

Posted: May 31, 2019 at 6:35 pm

After spending so much time in R trying to get a simple network to work I’ve made the jump into Keras, sklearn, scipy, etc. in order to build deeper networks. Work flow is a lot more awkward, but I’ve managed to figure our the key metrics (categorical accuracy and confusion matrices) and reproduced the previous results in R (note, I did not use the indexical features in the keras model).

Comparable to the R model, a single 52 unit hidden layer network trained on 80% of the data over 300 epochs achieved an accuracy on the training data of 99% and an accuracy on the validation data of 72% (73% in the previous model). There were 105 ‘bad’ samples predicted to be ‘good’ (compared to the previous 101) and 113 ‘good’ samples predicted to be ‘bad’ (compared to the previous 109). The following images show the confusion matrices for training and validation data, respectively.