Binary Classifiers

With machine learning, more precisely with caret, Kuhn’s (2015) wrapper for numerous machine-learning algorithms in R, it is possible to build so-called binary classifiers. These binary classifiers form the backbone of the VokalJäger’s classification engine and are trained to detect the presence of certain elementary phonetic conditions.

The statistical Gaussian Mixture Discriminant Analysis (MDA) model was found to perform well in the VokalJäger vowel feature classification task [Hastie/Tibshirani 1996; Leisch/Hornik/Ripley 2013; full description in Keil 2017, p. 170–224]. Other models were tested as well: Support Vector Machines, Neural Networks, Random Forests and K-Nearest Neighbours (KNN) yielded decent results on a par with the MDA, while simpler models such as Linear Discriminant Analysis (LDA) fell behind. Binary classifiers trained with repeated cross-validation on the first three formant values (robustly Lobanov-normalized) proved to be the most promising and most robust setup. In its default setting, the VokalJäger classifies German vowels using mixture discriminant analysis and the formants F1 to F3.
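The normalization step mentioned above can be illustrated with a minimal sketch. It shows the classic Lobanov procedure, which z-scores each formant per speaker; the "robust" variant used by the VokalJäger is not specified here, so this is not that variant. The function name, the data layout and the formant values are assumptions made purely for illustration.

```python
from statistics import mean, stdev

FORMANTS = ("F1", "F2", "F3")


def lobanov_normalize(tokens):
    """Z-score each formant per speaker (classic Lobanov normalization).

    tokens: list of dicts with the keys "speaker", "F1", "F2", "F3" (in Hz).
    Each speaker needs at least two tokens with differing values,
    otherwise the standard deviation is undefined or zero.
    """
    # Collect every speaker's values, separately for each formant.
    by_speaker = {}
    for t in tokens:
        values = by_speaker.setdefault(t["speaker"], {f: [] for f in FORMANTS})
        for f in FORMANTS:
            values[f].append(t[f])

    # Per-speaker mean and sample standard deviation (a choice of this sketch).
    stats = {
        spk: {f: (mean(v), stdev(v)) for f, v in values.items()}
        for spk, values in by_speaker.items()
    }

    normalized = []
    for t in tokens:
        z = dict(t)
        for f in FORMANTS:
            mu, sigma = stats[t["speaker"]][f]
            z[f] = (t[f] - mu) / sigma
        normalized.append(z)
    return normalized


# Made-up formant values for a single hypothetical speaker "A".
tokens = [
    {"speaker": "A", "F1": 300.0, "F2": 2200.0, "F3": 3000.0},
    {"speaker": "A", "F1": 700.0, "F2": 1200.0, "F3": 2600.0},
    {"speaker": "A", "F1": 500.0, "F2": 900.0, "F3": 2500.0},
]
normalized = lobanov_normalize(tokens)
```

After normalization, each speaker's formant values have mean 0 and standard deviation 1, so that classifiers can be trained on values that are comparable across speakers.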

The response of the binary classifier

The example chart below shows the response of a VokalJäger MDA binary classifier (here based on the F1/F2 formants only) that was trained to detect the vowel [ɔ]. For the floating phonetic feature openness of back vowels, [ɔ] constitutes the support sound of the elementary phonetic feature value m = 3, that is, mid open. The plotted response on the scale from 0 to 1 is a measure of the probability p that the speech sound is indeed an [ɔ]. As can be seen, the binary classifier performs as expected: the support sounds [ɔ] are usually detected (responses > 0.5), while most of the contrast sounds [a:/ε:/o:] are rejected (responses < 0.5). The neutral sounds, which are omitted from the training, behave according to their phonetic proximity: for example, [a] is rejected, as is the contrast sound [a:].

[Fig. Response of a binary classifier]

Figure. Probability response of a binary classifier trained to detect the vowel [ɔ] when confronted with incoming vowels of all kinds [Keil 2017, figure 78, p. 209; colored version].

Binary Classifiers: Results in the F1/F2 plane

If different binary classifiers are assembled into an ensemble, each of them can be seen to detect the “correct” sounds, that is, to give a positive answer when fed with acoustic variables originating from its respective “area” in the F1/F2 plane. This is shown in the picture below, where the color indicates which binary classifier “wins” in the ensemble and thus which sound was found. The colors align well with the “areas” that the corresponding vowels usually “occupy”, which are indicated by the ellipses.

Figure. The ensemble for the openness of back vowels and the responses of its constituent binary classifiers [Keil 2017, figure 72, p. 180; colored version].
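The idea of a “winning” binary classifier can be sketched in a few lines. This is only an illustration, not the VokalJäger implementation: the function name and the response values are made up, and returning no winner when every response stays at or below 0.5 is an assumption of this sketch.

```python
def winning_classifier(responses, threshold=0.5):
    """Return the label of the binary classifier with the highest response.

    responses: dict mapping a vowel label to that classifier's probability.
    Returns None if even the highest response does not exceed the threshold
    (an assumption of this sketch, not a documented VokalJäger rule).
    """
    label, p = max(responses.items(), key=lambda item: item[1])
    return label if p > threshold else None


# Made-up responses of the four classifiers for one incoming vowel token.
responses = {"a:": 0.05, "ɔ": 0.81, "o:": 0.32, "u:": 0.01}
winner = winning_classifier(responses)
```

With these illustrative values, the classifier trained on [ɔ] gives the strongest response and is therefore the one that colors the corresponding point in the F1/F2 plane.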

Binary Classifiers: Here I am

If one “walks” a “line” through the F1/F2 plane, touching (in order) the [a:], [ɔ], [o:] and [u:] “areas”, the individual binary classifiers return probabilities that indicate the supposed presence of the sound they have been trained on, comparable to a cry of “here I am”. As a measure of the location in the F1/F2 plane, F1 is plotted on the x-axis here, which basically corresponds to the vertical position in an F1/F2 chart. One can clearly recognize the binary classifiers kicking in and fading out:

Figure. Responses of the binary classifiers in the ensemble for the openness of back vowels when “walking” the F1/F2 plane from [a:] to [u:], here measured in F1 [Keil 2017, figure 71, p. 179; colored version].

Assembling Binary Classifiers into floating phonetic features

From the individual binary classifiers’ probability values it is now possible to calculate the floating feature value ζ, which is shown in the next diagram. One can observe the desired pattern: when walking the line between the [a:], [ɔ], [o:] and [u:] “areas”, the floating feature value ζ “jumps” between the elementary feature values m = 4, 3, 2 and 1. There is some clustering, literally forming a “staircase”; these are the values associated 1:1 with the winning binary classifier. On the other hand, there is some intermediate migration in between; these are the values usually described with diacritics.

Figure. The ζ value of the floating phonetic feature openness of back vowels when “walking” the F1/F2 plane from [a:] to [u:], here measured in F1 [Keil 2017, figure 69, p. 174; colored version]
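The exact formula for ζ is not given in this section (see Keil 2017 for the actual definition). One plausible way to obtain a value that sits on the “stairs” when one classifier clearly wins and migrates in between otherwise is a probability-weighted mean of the elementary feature values m. The sketch below assumes that formula purely for illustration; the function name and the response values are likewise hypothetical.

```python
# Elementary feature values m for the openness of back vowels,
# as named in the text: [a:] = 4, [ɔ] = 3, [o:] = 2, [u:] = 1.
ELEMENTARY_VALUES = {"a:": 4, "ɔ": 3, "o:": 2, "u:": 1}


def floating_feature_value(responses):
    """Probability-weighted mean of the elementary feature values.

    ASSUMPTION: this weighting is an illustrative guess, not necessarily
    the formula used by the VokalJäger.
    responses: dict mapping a vowel label to its classifier's probability.
    """
    total = sum(responses.values())
    if total == 0:
        raise ValueError("all classifier responses are zero")
    weighted = sum(ELEMENTARY_VALUES[label] * p for label, p in responses.items())
    return weighted / total


# One classifier clearly wins: the value sits on a "stair".
zeta_clear = floating_feature_value({"a:": 0.0, "ɔ": 1.0, "o:": 0.0, "u:": 0.0})

# Two neighbouring classifiers respond equally: the value lies in between.
zeta_between = floating_feature_value({"a:": 0.0, "ɔ": 0.5, "o:": 0.5, "u:": 0.0})
```

Under this assumption, a token detected only by the [ɔ] classifier receives ζ = 3, while a token halfway between [ɔ] and [o:] receives ζ = 2.5, the kind of intermediate value that would usually be transcribed with a diacritic.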