Electroacoustic methods of determining the parameters of speech sound generator

Kłaczyński, Maciej; Wszołek, Wiesław

Vibroengineering Procedia

Browse Procedia

Published: 10 October 2014

Check for updates

Electroacoustic methods of determining the parameters of speech sound generator

Maciej Kłaczyński¹

Wiesław Wszołek²

^{1, 2}AGH University of Science and Technology, Department of Mechanics and Vibroacoustics, Krakow, Poland

Corresponding Author:

Maciej Kłaczyński

Cite the article Download PDF

Downloads 1142

Abstract

The spectrum character of speech wave is connected with the fundamental frequency (F0) of human vocal folds vibration. As it is considered, F0 of the source during voicing contains an abundance of information on the larynx pathology, individual trait, the emotional state and ethnographic origin of speaker. The present paper presents results of research that conducted simultaneous measurement of fundamental frequency of vocal fold vibration by the electroglottography (EGG) and with the acoustic methods. The analysis of the $F_{0}$ function exactitude and the usefulness of these methods were executed too.

1. Introduction

The emitted speech signal is a source of useful diagnostic and prognostic information. Besides of the individual features of a speaker, the speech signal carries semantic and emotional state information, and other kinds, enabling to determine speaker’s ethnic origin, social status, education, and overall health. The speech signal can become, through selected parameters, an additional source of information on anatomic, physiological and pathological (deformation) conditions of human internal organs. A number of authors’ research proves that maximum information on phonetic action can be assembled by delimitation the parameters of speech sound generator such as the fundamental frequency $F_{0}$ , short and long term frequency perturbations, short and long term amplitude perturbations, noise related, tremor, voice break and subharmonic. The fundamental tone function $F_{0}$ can be estimated by internal measurements (e.g. optical methods) or external measurements (like acoustic or electrical methods) [1, 2]. The optical methods include: stroboscopy, cinematography, videokymography (VKG), photoglottography (PGG), electrolaryngography (ELG) and two-point holographic interferometry. The acoustic methods include ultrasonography (USG), multi-dimension speech signal analysis and test evaluation of the voice acoustic pressure, while the electrical method is usually electroglottography (EGG). The literature demonstrates non numerous researches for Polish speech have conducted simultaneous measurement of fundamental frequency by the EGG and with the acoustic methods. The present paper presents results of such research. In this paper had been carried out the analysis of the accuracy of algorithms (zero crossing measure ZCM, cepstral analysis – CEPA, higher-order spectra analysis – HOSA) to determining the parameters of $F_{0}$ , Jitter, Shimmer [3-7].

2. Speech signal production

An acoustic speech signal, defined as a variation of acoustic pressure in time, has a complex graph, being a reflection of its complex articulation process. On the parameters of the signal influence both its source (i.e. the vibrating vocal folds or sound caused by turbulent air flow through the narrowing of speech organs) and dynamical properties of the vocal channel, forming the structure of the signal. In the time domain the speech signal $p (t)$ can be mathematically described using a convolution of time-dependent signal source $g (t)$ and pulselike answer of the voice channel $h (t)$ [8]:

1

$p (t) = \int_{0}^{t} h (t - τ) g (τ) d τ .$

Interpretation of Eq. (1) indicates that in the time-dependent acoustic speech signal the properties of the source and the properties of the sound forming voice channel are closely related (Fig. 1).

Fig. 1a) Acoustic speech signal, b) vibrations of vocal folds

a)

b)

The repetition time (period) of the vocal cords vibrations is called fundamental frequency $F_{0}$ and approximate value can be expressed by the following formula [9]:

2

$F_{0} = \frac{1}{2 π} \sqrt{\frac{s}{m}},$

where $m$ – mass of the vibrating vocal [kg], $s$ – stiffness constant of the chords [N/m].

The fluctuation of the fundamental frequency and signal amplitude can be estimate by Jitter and Shimmer parameters. Jitter ( $J i t t$ ) denotes the deviation of the larynx tone frequency in consecutive cycles from the average frequency of the larynx tone according formula:

3

$J i i t = \frac{\frac{1}{N - 1} \sum_{i = 1}^{N - 1} |F_{0}^{(i)} - F_{0}^{(i + 1)}|}{\frac{1}{N} \sum_{i = 1}^{N - 1} F_{0}^{(i)}} ∙ 100 %,$

where $N$ – number of instantaneous signal periods.

Shimmer ( $S h i m$ ) denotes the deviation of the larynx tone amplitude in the consecutive cycles from the average amplitude of the larynx tone according formula:

4

$S h i m = \frac{\frac{1}{N - 1} \sum_{i = 1}^{N - 1} |A_{0}^{(i)} - A_{0}^{(i + 1)}|}{\frac{1}{N} \sum_{i = 1}^{N - 1} A_{0}^{(i)}} ∙ 100 %,$

where $A_{0}$ – amplitude of fundamental frequency in instantaneous signal periods.

3. Research material and methodology

The goal of this research and analysis was to determine the difference between the $F_{0}$ , Jitter and Shimmer estimation from the acoustic signals and the EGG signals during phonation. The experiment was carried out on the group of 328 people, both men and women, age 19 to 80 years, so-called standards of Polish language, without any pathologies that could affect the voice quality.

The time-dependent acoustic speech signal and the EGG signal were recorded simultaneously in an anechoic chamber, at the Department of Mechanics and Vibroacoustic, AGH University of Science and Technology, Kraków, Poland. The diagram of the measurement setup is shown in Fig. 2.

Fig. 2The block diagram of the measuring setup, where: 40 AF – G.R.A.S microphone, 1201 – Norsonic preamplifire, 12AA – G.R.A.S amplifire, 1314 – M-AUDIO IN/OUT chart, 6103 – Kayelemetrics Electroglottograph (EGG)

The block diagram of the measuring setup, where: 40 AF – G.R.A.S microphone, 1201 – Norsonic preamplifire, 12AA – G.R.A.S amplifire, 1314 – M-AUDIO IN/OUT chart, 6103 – Kayelemetrics Electroglottograph (EGG)

The task of the group people being the subjects of examination was to read out the phonetic text slowly and without any intonation. They had to repeat three times: the vowels – a, e, i, u; the vowels with the prolonged phonation – a, e, i, u; the words – “ala”, “as”, “ula”, “ela”, “igła” (i.e. Polish names and Polish equivalent for “needle”) and the sentence – “dziś jest ładna pogoda” (i.e. the Polish equivalent of the sentence “We have a good weather today”).

4. Results

To depict and compare the $F_{0}$ , determined by the acoustic and the electroglotographic methods, the analysis in the frequency domain was made, using Short Term Fourier Transform (STFT). Before frequency analysis, the data were subjected to the process of preemphasis with the band-pass FIR filter, with $f_{l o w} =$ 50 Hz and $f_{u p p} =$ 400 Hz. The dynamical spectrum $W (t, f)$ containing 56 lines with the $f =$ 10 Hz width, made with the $t =$ 0.1 s time quantum and the level quantum equal to $L =$ 0.2 dB, was obtained. The subject of analysis was 4 vowels pronounced by each person (3936 records – 328 persons × 4 vowels × 3 expression). The goal of the analysis was to determine the difference between the spectra obtained from the acoustic and the EGG signals. These vowels have a fundamental significance in the examination of the voice channel condition (especially of the glottis) because of their stationary-like time dependence. The examples of the over-time-averaged spectra of the vowels with the prolonged phonation, obtained from the EGG and the acoustic signals, are presented in Fig. 3.

Analysis of the frequency spectra carried out for each investigated signal sample, showed only minor differences (in shape and envelope) between the fundamental tone spectra determined from the acoustic signal and from the EGG signals. The substantial differences, observed in the relative level (amplitude) of recorded signal, are related to the signal normalization process. For each group (acoustic signal sample, EGG signal sample), the averaged minimal value for all samples recorded in the given group was used as a reference level in the logarithmic scale.

In the second part of this research, comparison between the averaged values of $F_{0}$ obtained by the acoustic methods and the $F_{0}$ value determined with the help of EGG was made. The algorithms carrying out the detection of $F_{0}$ based on the zero crossing measure, higher-order spectra analysis, cepstral analysis were also implemented in the MATLAB environment. Estimation of relative error and standard uncertainty were done. Table 1 details a sample results for $F_{0}$ determined in acoustic methods for the /a/ vowel and these are displayed “vis a vis” results for $F_{0}^{*}$ , determined by the EGG method. Table 2 shows an example results of Jitter and Shimmer parameters estimation.

Fig. 3Averaged spectrum of the vowel

a) “a” with the prolonged phonation

b) “i“ with the prolonged phonation

Table 1Example results for calculation of relative error of F0 function for the “e” vowel with prolonged phonation

ID of sample	$F_{0}^{*}$ [Hz] (EGG)	$F_{0}$ [Hz] (ZCM)	$F_{0}$ [Hz] (CEPA)	$F_{0}$ [Hz] (HOSA)	$Δ F_{0}$ % (ZCM)	$Δ F_{0}$ % (CEPA)	$Δ F_{0}$ % (HOSA)
1	119	119	119	119	0	0	0
2	109	109	109	109	0	1	0
3	101	101	107	100	0	6	1
4	98	99	104	98	1	6	0
5	126	128	122	126	1	3	0
6	123	122	121	123	0	2	0
7	108	108	108	108	0	0	0
8	123	123	121	123	0	2	0
9	94	94	101	94	0	8	0
10	120	120	118	119	0	1	0

Table 2Example results of Jitter and Shimmer estimation

	Jitter [%]	Shimmer [%]
ZCM	0.02	0.03
CEPA	0.01	0.08
HOSA	0.02	0.09
EGG	0.02	0.02

5. Conclusions

The data analysis showed that for all analyzed vowels (the prolonged phonation), the mean squared error for the determination of $F_{0}$ by using the acoustic methods does not exceed 2 Hz for the zero crossing measure (ZCM), 1.5 Hz for the cepstrum algorithm (CEPA), and only 1 Hz for the higher-order spectra analysis (HOSA). This makes clear that the acoustic methods for $F_{0}$ derivation are effective and accurate, and can be treated as precise tools for the examination of non-pathologic $F_{0}$ derived from a healthy glottis.

References

Hess W. Pitch Determination of Speech Signals. Springer-Verlag Berlin, Heidelberg, New York, Tokyo, 1983.

Search CrossRef
Marasek K. Electroglottography Description of Voice Quality. Phonetic AIMS, Univesitat Stuttgard, 1997.

Search CrossRef
Swami A., Mendel J. M., Nikias C. L. Higher-Order Spectral Analysis Toolbox for use with Matlab. Natick, The MathWorks Inc., 1995.

Search CrossRef
Xudong J. Fundamental frequency estimation by higher order spectrum. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 2000, p. 253-256.

Search CrossRef
Wszołek W., Kłaczyński M. Estimation of the vocal folds vibration fundamental frequency by higher order spectrum. Archives of Acoustics, Vol. 33, Issue 4, 2008, p. 183-188.

Search CrossRef
Wszołek W., Kłaczyński M., Engel Z. The acoustic and electroglottographic methods of determination the vocal folds vibration fundamental frequency. Archives of Acoustics, Vol. 32, Issue 4, 2007, p. 143-150.

Search CrossRef
Wszołek W., Kłaczyński M. Comparative study of the selected methods of laryngeal tone determination. Archives of Acoustics, Vol. 31, Issue 4, 2006, p. 219-226.

Search CrossRef
Tadeusiewicz R. Speech Signals. WKiŁ, Warszawa, 1988.

Search CrossRef
Wszołek W., Kłaczyński M. Outcome of F0 determination using acoustic and electroglottographic algorithms. Speech and Language Technology, Polish Phonetic Association, Poznan Division, p. 39-49.

Search CrossRef

About this article

Received

Accepted

16 September 2014

Published

10 October 2014

Keywords

speech signal

electroglottography

vocal fold vibration

fundamental frequency

Acknowledgements

The paper has been written and the respective research undertaken within the project 2011/01/D/ST6/07178 (National Science Centre).

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.