Phonetic Algorithms

The first building block of the VokalJäger concerns phonetics, or more precisely: processes to manage the measurement, extraction and normalization of acoustic variables. The most important acoustic variables for the later classification of phonetic features are the formants and the fundamental frequency.


The VokalJäger algorithm builds heavily on preceding phonetic measurements performed by specialized software, most notably the de facto standard tool in phonetics: Praat [Boersma/Weenink 2015]. But, and this constitutes a key challenge, such software may produce entirely different measurements depending on how certain calibration parameters have been set. Setting those parameters usually requires experience in how to tune them and in what results to expect, and calibrating them in practice can be tedious and time-consuming. The VokalJäger employs a post-processing approach: it simply asks the phonetic software to perform the measurements while sweeping through tens of different parameter settings, and then picks the parameter setting that produces the most appropriate measurement. This makes a fully automated process possible and eliminates manual and arbitrary intervention.

This requires a definition of “appropriate”. The standard approach in the VokalJäger is to prefer measurements of acoustic variables that produce smooth and sound curves; that is, it enforces continuity constraints which are in line with the human speech generation process and the human vocal tract. This works well for formants.


To that end, the VokalJäger analyses the curves of acoustic variables (here most notably the formants F1, F2 and F3) over a speech segment. The algorithm requires those curves to be continuous, i.e. without jumps or breaks, and to follow a base pattern which is compatible with the human vocal tract configuration. If one allows the base patterns of flatness/constancy, increase/decrease and positive/negative peaking, a mathematical construction exists that admits only those patterns: the Discrete Cosine Transformation (DCT) [Ahmed/Rao 1974; Keil 2017, p. 58-61]. The DCT is a commonly used approach to smooth or parametrize formant curves [e.g. Zahorian/Jagharghi 1993; Watson/Harrington 1999].

A DCT seeks to approximate a series of measurement points with a sum of centered cosine functions. Mathematically, a DCT of order 3 reads as follows:

[Keil 2017, formula (5), p. 60].
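
For orientation, a three-term cosine expansion of this kind can be written as below. This is only a sketch in DCT-II conventions, chosen so that G[1] is the mean of the series; the sample index n, the number of points N and the normalization are assumptions and may differ from formula (5) in Keil (2017).

```latex
\hat{y}[n] = G[1]
  + G[2]\,\cos\!\left(\frac{\pi\,(n + \tfrac{1}{2})}{N}\right)
  + G[3]\,\cos\!\left(\frac{2\pi\,(n + \tfrac{1}{2})}{N}\right),
\qquad n = 0, \dots, N-1
```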

The first DCT parameter, G[1], equals the average μ over the series; the second, G[2], controls the increase/decrease, i.e. the slope; and the third, G[3], controls whether or not there is a central peak or valley. A third-order DCT is the default setting in the VokalJäger processing of vowels. This is best explained with the following picture:

The basis shapes of a discrete cosine transformation (DCT) of third order. The first DCT parameter (here: G1) accounts for vertical shifts, the second (here: G2, expressed in % of G1) for slope and the third (here: G3, expressed in % of G1) for peaks or valleys. The shapes are illustrated with the prototypical F0 contours of the universal discourse particles of Tillmann (2016).
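
The role of the three parameters can be made concrete with a minimal Python sketch (the VokalJäger itself is implemented in R). It assumes a DCT-II style basis on a centered sample grid; the function names `dct3_fit` and `dct3_curve` are hypothetical and the normalization may differ from the one used in Keil (2017). Because the cosines are orthogonal on this grid, the least-squares fit reduces to simple projections.

```python
import math


def dct3_fit(y):
    """Least-squares fit of a three-term cosine expansion to the samples y.

    Returns (g1, g2, g3): g1 is the mean, g2 weights the half-period cosine
    (slope), g3 weights the full-period cosine (central peak or valley).
    Assumed DCT-II conventions, not necessarily those of the original work.
    """
    n = len(y)
    if n < 3:
        raise ValueError("need at least three samples")
    g1 = sum(y) / n
    g2 = 2.0 / n * sum(v * math.cos(math.pi * (i + 0.5) / n)
                       for i, v in enumerate(y))
    g3 = 2.0 / n * sum(v * math.cos(2.0 * math.pi * (i + 0.5) / n)
                       for i, v in enumerate(y))
    return g1, g2, g3


def dct3_curve(g, n):
    """Evaluate the smoothed curve at the n centered sample positions."""
    g1, g2, g3 = g
    return [g1
            + g2 * math.cos(math.pi * (i + 0.5) / n)
            + g3 * math.cos(2.0 * math.pi * (i + 0.5) / n)
            for i in range(n)]


# Round trip: a curve built from known parameters is recovered by the fit.
samples = dct3_curve((500.0, -40.0, 25.0), 12)
recovered = dct3_fit(samples)
```

In this convention a positive g2 describes a falling curve, because the half-period cosine runs from positive to negative values over the segment.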


To pick the most appropriate curve, different measurements based on varying calibration parameters are considered. In the setting shown in the picture below, Praat conducted formant curve measurements (in the picture: the colored points) under different settings of the upper ceiling frequency parameter (in the picture: the number on top of the charts; note that German-style notation is applied and ‘.’ is the thousands separator). A DCT has been fitted to each measurement (in the picture: the colored curves). The best-fit curve is the one used for further processing, where best fit means minimal error energy in dB between the measurement points and the fitted DCT curve; the best-fit criterion itself can be calibrated. Similar mechanisms to “optimize the formant ceiling” have been proposed, e.g. by Escudero/Boersma (2009).

F1, F2 and F3 formant readings obtained with Praat by varying the upper frequency ceiling (points). DCT curves of third order have been fitted [Keil 2017, p. 65, figure 17; colored simplified version].
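
The selection step can be sketched as follows in Python. This is a simplified illustration under stated assumptions: only one formant track per ceiling setting is considered, the error energy is taken as 10·log10 of the mean squared residual, and the function names and the example readings are hypothetical, not taken from Praat output or from the original implementation.

```python
import math


def dct3_smooth(y):
    """Return the three-term cosine (DCT-II style) smoothing of y."""
    n = len(y)
    basis = [[math.cos(k * math.pi * (i + 0.5) / n) for i in range(n)]
             for k in (1, 2)]
    g1 = sum(y) / n
    g2 = 2.0 / n * sum(v * b for v, b in zip(y, basis[0]))
    g3 = 2.0 / n * sum(v * b for v, b in zip(y, basis[1]))
    return [g1 + g2 * basis[0][i] + g3 * basis[1][i] for i in range(n)]


def error_energy_db(y):
    """Error energy between measured points and fitted curve, in dB.

    Assumed definition: 10*log10 of the mean squared residual.
    """
    fitted = dct3_smooth(y)
    mse = sum((a - b) ** 2 for a, b in zip(y, fitted)) / len(y)
    return 10.0 * math.log10(mse + 1e-12)  # small offset avoids log(0)


def pick_ceiling(tracks):
    """Pick the ceiling whose track is best approximated by its DCT curve."""
    return min(tracks, key=lambda ceiling: error_energy_db(tracks[ceiling]))


# Hypothetical F1 readings (Hz) for two ceiling settings (Hz).
tracks = {
    4500: [500, 510, 520, 530, 540, 550, 560, 570],    # smooth rise
    5500: [500, 510, 520, 1400, 1410, 550, 560, 570],  # jump to another formant
}
best = pick_ceiling(tracks)
```

The track with the jump cannot be followed by a smooth third-order curve, so its error energy is much larger and the other ceiling setting is selected.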

The process described above allows a fully automated determination of the calibrating parameters when conducting phonetic measurements with Praat. A full and detailed description can be found in Keil (2017, p. 53-71 and p. 105-108). The algorithm is implemented in R.


Once a DCT-smoothed curve has been fitted, the formant target value (and similarly the fundamental frequency value) can easily be extracted at specific points of the smoothed curve, such as the maximum or minimum, the point at one third of the curve, or the middle of the curve.
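
A minimal Python sketch of such a target extraction is given below. It assumes the smoothed curve is described by three parameters (g1, g2, g3) of a DCT-II style expansion and is evaluated over a relative position t between 0 (segment start) and 1 (segment end); the function names and this parametrization are assumptions for illustration.

```python
import math


def smoothed_value(g, t):
    """Value of the smoothed curve at relative position t in [0, 1]."""
    g1, g2, g3 = g
    return g1 + g2 * math.cos(math.pi * t) + g3 * math.cos(2.0 * math.pi * t)


def targets(g, steps=200):
    """Extract candidate target values from the smoothed curve."""
    # Maximum and minimum are located on a dense grid for simplicity.
    values = [smoothed_value(g, i / steps) for i in range(steps + 1)]
    return {
        "midpoint": smoothed_value(g, 0.5),
        "one_third": smoothed_value(g, 1.0 / 3.0),
        "maximum": max(values),
        "minimum": min(values),
    }


# Hypothetical parameters of a fitted F1 curve (Hz).
result = targets((500.0, -40.0, 25.0))
```

Because the targets are read from the smoothed curve rather than from individual measurement points, a single outlier reading does not directly determine the extracted value.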


One particular challenge in the automated processing of acoustic variables, especially formant values, is dealing with measurements that are obvious “rubbish”. But how to test for “rubbish-ness”?

Another effective way to check formant readings is to know where to expect them to be
[Thomas 2010, p. 48].

The VokalJäger seeks to extract appropriate acoustic variables by applying a best-fit approach. In the section “curve smoothing” it was explained how the algorithm picks a set of formant curves (and hence the “target” F1/F2 formant values). But the values picked in this way could still be nonsensical: nothing yet ensures that the F1/F2 values lie within the area of the F1/F2 space which is physically possible, given the nature of the human vocal tract.


The VokalJäger determines the area in the F1/F2 plane where, given a reference universe (here: the Kiel PHONDAT Corpus for High German, separated by gender), most of the samples are located, under the hypothesis that the area is shaped like a triangle. This so-called extreme triangle is used to improve the best-fit curve picking process described in the previous sections: curves with formant readings within the physically reasonable triangle are favored over unreasonable readings outside the triangle. Unrealistic values, and the Praat parameter settings yielding those values, are discouraged. As a result, values outside the triangle are “folded back” into the triangle. This VokalJäger mechanism, which is based on a geometric optimization, is called the Rückfaltung (German for “folding back”) and is described in detail in Keil (2017, p. 93-101 and p. 105-108).

The following pictures exemplify the process:

The first picture shows the original situation for one male speaker from the REDE corpus: the process would have allowed F1/F2 readings that lie outside the extreme triangle (colored in blue, on the lower left) [Keil 2017, figure 28, left, p. 96; colored version].

The second picture shows the situation after the Rückfaltung: curves, and thus Praat calibration parameters, that produce values within the extreme triangle are now favored. The readings formerly outside the triangle (blue points) have now been “folded back” into the triangle [Keil 2017, figure 28, right, p. 96; colored version].
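
The idea of favoring readings inside the extreme triangle can be illustrated with a small Python sketch. It is not the geometric optimization described in Keil (2017): it only tests whether an (F2, F1) reading lies inside a triangle and adds a fixed penalty to the fit error of candidates outside it. The triangle corners, the penalty value and all function names are hypothetical and are not derived from the Kiel PHONDAT Corpus.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def inside_triangle(p, tri):
    """True if point p lies inside or on the border of triangle tri."""
    a, b, c = tri
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)  # all signs agree -> inside


def pick_candidate(candidates, tri, penalty_db=20.0):
    """Pick the candidate with the lowest penalized fit error.

    candidates: list of (fit_error_db, (f2_hz, f1_hz)) tuples, one per
    parameter setting. The fixed penalty is an assumption for illustration.
    """
    def cost(candidate):
        error_db, point = candidate
        return error_db + (0.0 if inside_triangle(point, tri) else penalty_db)
    return min(candidates, key=cost)


# Hypothetical triangle corners as (F2, F1) in Hz.
triangle = ((2300.0, 250.0), (700.0, 280.0), (1400.0, 850.0))

candidates = [
    (-5.0, (400.0, 200.0)),   # slightly better fit, but outside the triangle
    (-2.0, (1400.0, 450.0)),  # slightly worse fit, inside the triangle
]
chosen = pick_candidate(candidates, triangle)
```

With the penalty applied, the reading inside the triangle wins although its raw fit error is slightly higher, which mirrors how unrealistic readings are discouraged rather than strictly forbidden.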