Thanks to the suggestion of Sofian Audry, I’ve added an additional 4000 compositions to the initial set of 1000. It took me the entire week to label these new compositions. The complete 5000 composition set breaks down with these labels: good (287); neutral (1976); bad (2737). As in the initial training set, only about 5-6% of the compositions are good, but the share of bad compositions grew from ~38% to ~55%. Following are random samples from the good and bad sets, respectively.
The next steps are to train using this data set and, if that does not work, to investigate some methods to re-balance the training data, since “good” samples are very rare.
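One simple re-balancing option is to weight each class inversely to its frequency. The sketch below uses the label counts reported above. The function names are hypothetical, dropping the neutral class is an assumption, and this is only one of several possible re-balancing methods, not necessarily the one that will be used.

```python
def class_shares(counts):
    """Return each label's fraction of the whole set."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Label counts from the 5000 composition set.
counts = {"good": 287, "neutral": 1976, "bad": 2737}
shares = class_shares(counts)

# Assumption: treat the task as binary and leave the neutral samples out.
binary_counts = {label: n for label, n in counts.items() if label != "neutral"}
weights = inverse_frequency_weights(binary_counts)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
for label, weight in weights.items():
    print(f"weight for {label}: {weight:.2f}")
```

With weights like these, a misclassified “good” sample costs roughly ten times as much as a misclassified “bad” one during training. Oversampling the good set is a possible alternative.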
I selected one composition (#569) from the training set as the reference and computed its distance (RMS) from all other samples in the training set (without neutral samples). Not unexpectedly, the result is that the good compositions are spread throughout the bad compositions (see below).
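The distance computation can be sketched as follows, assuming that RMS here means the square root of the mean squared element-wise difference between two instruction vectors. The function names and the toy vectors are hypothetical, not the actual composition data.

```python
import math


def rms_distance(a, b):
    """Root-mean-square difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def distances_to_reference(vectors, ref_id):
    """Distance from the reference vector to every other vector, keyed by id."""
    ref = vectors[ref_id]
    return {cid: rms_distance(ref, v) for cid, v in vectors.items() if cid != ref_id}


# Toy vectors standing in for the instruction vectors.
vectors = {
    "ref": [0.0, 0.0, 0.0],
    "x": [1.0, 1.0, 1.0],
    "y": [0.0, 3.0, 0.0],
}
print(distances_to_reference(vectors, "ref"))
```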
There also seems to be no visual relationship between compositions with smaller RMS distances; #878 is not more similar to #569 than #981 is. This is confirmed by plotting the images themselves according to their RMS distance to the reference (in the upper left corner, filling rows first):
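The layout described here (reference in the upper left, then increasing distance, filling rows first) amounts to a sort followed by a row-major placement. A minimal sketch, with a hypothetical function name and made-up distances:

```python
def grid_positions(distances, columns):
    """Map each id to (row, col), nearest first, filling each row left to right."""
    ordered = sorted(distances, key=distances.get)
    return {cid: divmod(i, columns) for i, cid in enumerate(ordered)}


# Made-up distances; the reference has distance 0 and so lands at (0, 0).
distances = {"ref": 0.0, "a": 0.3, "b": 0.1, "c": 0.2}
print(grid_positions(distances, columns=2))
```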
So it seems using these instructions as feature vectors may be a no go. The benefit of using these vectors was that a composition could be evaluated by the classifier without actually being rendered. Next I’ll try using colour histogram features and see if my results are any better.
Using the labelled data set, I was unable to get a (simple MLP) classifier to perform with accuracy better than 50%; it seems my fears, based on the t-sne visualization previously posted, were warranted. There is the underlying question regarding whether I should even treat the instructions to make compositions (my vectors) as features of the compositions. To look at this a different way, I thought I should generate histograms for each composition and see how t-sne and simple classifiers perform on those features.
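A colour histogram feature could look something like the sketch below. It assumes each rendered composition is available as a sequence of 8-bit (r, g, b) pixels and uses per-channel histograms concatenated into one vector. The bin count and the function name are illustrative choices, not settings from this project.

```python
def colour_histogram(pixels, bins=4):
    """Concatenated, normalized per-channel histograms for 8-bit RGB pixels."""
    hist = [0] * (3 * bins)
    n_pixels = 0
    for rgb in pixels:
        for channel, value in enumerate(rgb):
            # Map 0..255 onto bin indices 0..bins-1.
            bin_index = min(value * bins // 256, bins - 1)
            hist[channel * bins + bin_index] += 1
        n_pixels += 1
    if n_pixels == 0:
        raise ValueError("image has no pixels")
    # Normalize so each channel's bins sum to 1, independent of image size.
    return [count / n_pixels for count in hist]


# Two-pixel toy image: one pure red pixel, one pure blue pixel.
features = colour_histogram([(255, 0, 0), (0, 0, 255)], bins=2)
print(features)
```

Unlike the instruction vectors, these features require rendering each composition first, which is the trade-off already noted.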
I was thinking that perhaps the rarity of “good” compositions in the training set was a problem. Splitting the data set into 80% training and 20% validation (using sampling that keeps the distribution of the three labels similar in both sets) leads to a training set with 41 “good” compositions, and a validation set with 10 (~12% of the good and bad samples in both cases).
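A stratified split of this kind can be sketched in plain Python as below. The label counts are those of the initial set, while the function name, the seed, and the rounding rule are assumptions rather than the actual procedure used.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices so each label keeps about the same share in both sets."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train, validation = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return train, validation


# Label counts of the initial set: 51 good, 376 bad, 513 neutral.
labels = ["good"] * 51 + ["bad"] * 376 + ["neutral"] * 513
train, validation = stratified_split(labels)

good_train = sum(labels[i] == "good" for i in train)
good_validation = sum(labels[i] == "good" for i in validation)
print(good_train, good_validation)
```

Splitting each label separately is what keeps the handful of good samples from landing entirely in one of the two sets.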
There are also a lot of “neutral” samples that are neither good nor bad, and that is certainly not helping with what (at least initially) is a binary classification problem. So I did a test removing all the neutral samples and the classifier accuracy jumped from 50% to 82%, which is obviously significant. Unfortunately (because of the rarity of good compositions?) this translates into 6 “bad” compositions predicted to be “good” and 0 “good” compositions predicted to be “good”. The following images were labelled “bad” and predicted to be “good”:
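The gap between the accuracy figure and the actual predictions is easier to see with precision and recall next to accuracy. The sketch below reuses the 6 false positives and 0 true positives reported above. The validation set size (10 good, 75 bad) is an assumption derived from a 20% split, not a reported number, and the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="good"):
    """Confusion counts plus accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Assumed validation set: 10 good and 75 bad samples.
y_true = ["good"] * 10 + ["bad"] * 75
# Every good sample is missed; 6 bad samples are predicted good.
y_pred = ["bad"] * 10 + ["good"] * 6 + ["bad"] * 69

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under these assumed counts the accuracy is just over 80% while recall for “good” is zero, and a classifier that always answers “bad” would score 75/85 (about 88%). Accuracy alone therefore says little about whether good compositions are being found.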
I have a few other things to investigate, including arranging images according to their distance to a reference (an arbitrary composition) to see (a) whether the distance corresponds to some sense of visual similarity, and (b) how the good and bad compositions are distributed (are good compositions more distant from bad ones?). I suspect the latter will mirror the t-sne results, but it’s worth looking at whether distances in vector space match any sense of visual similarity. Another investigation will be to generate colour histograms for each composition and see how those features look according to t-sne and the classifier.
The plot above shows the t-sne results of the vectors that represent each composition. It’s very clear that bad, good, and neutral compositions are evenly distributed and conflated with no discernible separability. I spent a little time trying to figure out what the implication is, but I seem to only find information on linear separability and classification. My concern is that a classifier trained on these vectors will not be able to discriminate between good and bad compositions. If this is the case, I would need a different representation of each composition and it’s unclear what an appropriate representation would be.
In preparation for the machine learning aspect of this project I’ve generated 1000 images (and the vectors that represent them) and labelled them as good, bad or neither good nor bad. There were 51 good, 376 bad and 513 neutral images in the training set.
The labelling is based on my intuitive compositional sense and is a stand-in for viewer interaction (via preferential looking and social media likes). The idea is to get a sense of this data set and train a few classifiers to see if they can discriminate between good and bad compositions.
The next step is to plot the corresponding vectors using t-sne in R and see how my labels are distributed in that vector space.