The VokalJäger algorithm builds heavily on phonetic measurements performed by specialized software, most notably the de facto standard tool in phonetics: Praat [Boersma/Weenink 2015]. A key challenge is that such software may produce entirely different measurements depending on how certain calibration parameters are set. Setting those parameters usually requires experience in how to tune them and what results to expect, and calibrating them in practice can be tedious and time-consuming. The VokalJäger therefore employs a post-processing approach: it asks the phonetic software simply to perform the measurements while sweeping tens of different parameter settings, and then picks the parameter setting that produces the most appropriate measurement. This makes it possible to build an automated process and eliminates manual and arbitrary intervention.
This requires a definition of “appropriate”. The standard approach in the VokalJäger is to prefer measurements of acoustic variables that produce smooth and sound curves; that is, it enforces continuity constraints that are in line with the human speech production process and the human vocal tract. This works well for formants.
The Discrete Cosine Transformation (DCT)
To that end, the VokalJäger analyses the curves of acoustic variables, here most notably the formants F1, F2 and F3, over a speech segment. The algorithm requires those curves to be continuous, i.e. without jumps, breaks etc., and to follow a base pattern that is compatible with the human vocal tract configuration. If one allows for the base patterns of flatness/constancy, increase/decrease and positive/negative peaking, a mathematical construction exists that admits only those patterns: the Discrete Cosine Transformation (DCT) [Ahmed/Rao 1974; Keil 2017, p. 58-61]. The DCT is a commonly used approach to smooth or parametrize formant curves [e.g. Zahorian/Jagharghi 1993; Watson/Harrington 1999].
A DCT approximates a series of measurement points with a series of centered cosine functions. Mathematically, a DCT of order 3 is written as follows:
[Keil 2017, formula (5), p. 60].
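As an orientation, and as an assumption about notation rather than a verbatim copy of the cited formula, a third-order DCT that approximates N equally spaced measurement points y_0, ..., y_{N-1} can be sketched in the standard DCT-II cosine basis as:

```latex
\hat{y}_n \;=\; G_1
  \;+\; G_2 \cos\!\left(\frac{\pi\,(n + \tfrac{1}{2})}{N}\right)
  \;+\; G_3 \cos\!\left(\frac{2\pi\,(n + \tfrac{1}{2})}{N}\right),
\qquad n = 0, \dots, N-1
```

In this form the two cosine terms sum to zero over the N sample points, so a least-squares fit yields G_1 equal to the mean of the series. The normalization and sign conventions in Keil (2017) may differ.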
The first DCT parameter, G1, equals the average μ over the series; the second, G2, controls the increase/decrease, i.e. the slope; and the third, G3, controls whether or not there is a central peak or valley. A third-order DCT is the default setting in the VokalJäger processing of vowels. This is best explained with the following figure:
Figure: The basis shapes of a third-order discrete cosine transformation (DCT). The first DCT parameter (here: G1) accounts for vertical shifts, the second (here: G2, expressed in % of G1) for slope and the third (here: G3, expressed in % of G1) for peaks or valleys. The example shown uses the prototypical F0 contours of Tillmann (2016)’s universal discourse particles.
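The role of the three parameters can be made concrete with a minimal sketch. It assumes the standard DCT-II cosine basis, cos(π·k·(n+½)/N) for k = 0, 1, 2, and equally spaced measurement points; the function names (`dct_basis`, `fit_dct`, `eval_dct`) and the example track are hypothetical, and the VokalJäger itself is implemented in R, so this Python code only illustrates the idea.

```python
import math

def dct_basis(k, n, size):
    """Value of the k-th DCT-II cosine basis function at sample n (of size samples)."""
    return math.cos(math.pi * k * (n + 0.5) / size)

def fit_dct(values, order=3):
    """Return least-squares DCT coefficients [G1, G2, G3, ...] for equally spaced values."""
    size = len(values)
    if size < order:
        raise ValueError("need at least as many points as DCT coefficients")
    coeffs = []
    for k in range(order):
        # The basis functions are orthogonal on the sample grid, so each
        # coefficient is the projection of the data onto one basis function.
        norm = size if k == 0 else size / 2.0
        projection = sum(v * dct_basis(k, n, size) for n, v in enumerate(values))
        coeffs.append(projection / norm)
    return coeffs

def eval_dct(coeffs, size):
    """Evaluate the smoothed curve at the original sample positions."""
    return [sum(g * dct_basis(k, n, size) for k, g in enumerate(coeffs))
            for n in range(size)]

# Hypothetical F1 track in Hz (illustrative numbers, not measured data).
f1_track = [500.0, 520.0, 545.0, 560.0, 570.0, 565.0, 550.0]
g1, g2, g3 = fit_dct(f1_track)
smoothed = eval_dct([g1, g2, g3], len(f1_track))

# Express the slope and peak terms relative to the mean, as in the figure caption.
g2_percent = 100.0 * g2 / g1
g3_percent = 100.0 * g3 / g1
print(round(g1, 1), round(g2_percent, 2), round(g3_percent, 2))
```

With this particular basis, G1 is exactly the arithmetic mean of the track. A negative G2 corresponds to a rising curve and a negative G3 to a central peak, because the underlying cosines start at their maximum; other sign conventions are equally possible.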
Automated curve and calibration parameter picking
To pick the most appropriate curve, different measurements based on varying calibration parameters are considered. In the setting shown in the figure below, Praat measured formant curves (in the figure: the colored points) under different settings of the upper ceiling frequency parameter (in the figure: the number on top of each chart; note that German number notation is used, so ‘.’ is the thousands separator). A DCT has been fitted to each measurement (in the figure: the colored curves). The best-fit curve is chosen for further processing, where best fit means minimal error energy in dB between the measurement points and the fitted DCT curve; the best-fit criterion itself can be calibrated. Similar mechanisms to “optimize the formant ceiling” have been proposed, e.g. by Escudero/Boersma (2009).
Figure: F1, F2 and F3 formant readings obtained with Praat by varying the upper frequency ceiling (points). DCT curves of third order have been fitted [Keil 2017, p. 65, figure 17; colored simplified version].
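The selection step can be sketched as follows. The sketch assumes that the formant tracks for each ceiling setting have already been obtained (for example from Praat) and that "error energy in dB" is the mean squared residual on a 10·log10 scale; this definition, the function names and the example tracks are assumptions for illustration, not the published criterion. For brevity, only a single formant per setting is scored.

```python
import math

def dct_smooth(values, order=3):
    """Fit an order-term DCT (DCT-II basis) and return the smoothed values."""
    size = len(values)
    basis = [[math.cos(math.pi * k * (n + 0.5) / size) for n in range(size)]
             for k in range(order)]
    coeffs = []
    for k, row in enumerate(basis):
        norm = size if k == 0 else size / 2.0
        coeffs.append(sum(v * b for v, b in zip(values, row)) / norm)
    return [sum(coeffs[k] * basis[k][n] for k in range(order)) for n in range(size)]

def error_energy_db(values, fitted):
    """Mean squared residual on a dB scale (assumed definition of error energy)."""
    mse = sum((v - f) ** 2 for v, f in zip(values, fitted)) / len(values)
    return 10.0 * math.log10(mse + 1e-12)  # small offset avoids log10(0)

def pick_best_setting(tracks_by_ceiling):
    """Return the ceiling whose track is closest to its own DCT fit, plus all scores."""
    scores = {ceiling: error_energy_db(track, dct_smooth(track))
              for ceiling, track in tracks_by_ceiling.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical F2 readings in Hz for three formant ceiling settings.
candidates = {
    4500: [1500.0, 1520.0, 1545.0, 1560.0, 1570.0, 1565.0, 1550.0],  # smooth
    5000: [1500.0, 1520.0, 2400.0, 1560.0, 1570.0, 900.0, 1550.0],   # tracking jumps
    5500: [1480.0, 1700.0, 1450.0, 1720.0, 1500.0, 1690.0, 1470.0],  # zigzag
}
best_ceiling, scores = pick_best_setting(candidates)
print(best_ceiling, {c: round(s, 1) for c, s in scores.items()})
```

A track with jumps or zigzags cannot be reproduced by three smooth cosine terms, so its residual energy is high and the corresponding ceiling setting is rejected. In a fuller version the scores of F1, F2 and F3 would have to be combined, and how that is done is a design choice left open here.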
The process described above allows a fully automated determination of the calibration parameters when conducting phonetic measurements with Praat. A full and detailed description can be found in Keil (2017, p. 53-71 and p. 105-108). The algorithm is implemented in R.
Formant target value extraction
Once a DCT-smoothed curve has been fitted, the formant target value (and, similarly, the fundamental frequency value) can easily be extracted at specific points of the smoothed curve, such as the maximum/minimum, one third of the way into the curve, or the middle of the curve.
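A minimal sketch of such an extraction is given below. It assumes that the smoothed curve is described by three coefficients in a cosine basis, cos(π·k·t), over a relative position t from 0 (segment start) to 1 (segment end); the function names and the coefficient values are hypothetical.

```python
import math

def dct_curve(coeffs, t):
    """Value of the smoothed curve at relative position t in [0, 1]."""
    return sum(g * math.cos(math.pi * k * t) for k, g in enumerate(coeffs))

def extract_targets(coeffs, steps=200):
    """Read target values off the smoothed curve at a few characteristic points."""
    grid = [i / steps for i in range(steps + 1)]
    curve = [dct_curve(coeffs, t) for t in grid]
    # Locate the extrema on a dense grid; returns (position, value) pairs.
    i_max = max(range(len(curve)), key=curve.__getitem__)
    i_min = min(range(len(curve)), key=curve.__getitem__)
    return {
        "midpoint": dct_curve(coeffs, 0.5),
        "one_third": dct_curve(coeffs, 1.0 / 3.0),
        "maximum": (grid[i_max], curve[i_max]),
        "minimum": (grid[i_min], curve[i_min]),
    }

# Hypothetical coefficients G1, G2, G3 in Hz: a rising F1 curve with a central peak.
coeffs = [550.0, -20.0, -30.0]
targets = extract_targets(coeffs)
for name, value in targets.items():
    print(name, value)
```

Because the fitted curve is a closed-form expression, it can be evaluated at any relative position and not only at the original measurement points. Which point serves as the target (midpoint, one third, or an extremum) is a choice made per analysis.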