Pages coming from (2004-2013):

Music and New Technologies - Conservatory "A.Casella" in L'Aquila- Italy

Voice and its timbre: individual, gender and age features

These quick notes deal with voice timbre from the point of view of signal processing. For the physiological and specifically acoustic aspects of the human voice, please refer to the Psychoacoustics classes and to general books on the subject (for instance, Chapter 11 by Mauro Uberti in "Acustica Musicale e Architettonica" - UTET).

The human voice 

Physically, vocal emission results from the vibration of the vocal cords, which produce an excitation that propagates through the pharynx and the oral and nasal cavities. These cavities have their own resonance modes, which spectrally "mould" the emitted sound and give it a specific timbre.
From the point of view of signal processing, the excitation produced by the vocal cords can be modelled as a sawtooth waveform (the vocal cords operate roughly like a reed), that is, as a harmonic signal whose partial amplitudes decrease with frequency.
If we restrict ourselves to vowels (the sounds that most strongly identify the specific timbre and individuality of a voice), the filtering effect of the cavities (the shaping of the spectral profile) can be modelled as a rather selective enhancement of a few specific frequencies. It is usual to consider at most six of these. These prominent frequencies are called "formants". Every vowel has its own set of formants, and identifying a vowel actually consists in the unconscious identification of its formants.
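As a rough illustration of this source-filter picture, the following Python sketch builds the harmonic spectrum of a sawtooth-like source and shapes it with three formant peaks. The function names, the simple bell-shaped resonance curve, the 100 Hz bandwidth and the formant values (approximate textbook figures for a male /a/) are all assumptions made for the example; they are not taken from these notes.

```python
import math

def resonance_gain(f, centre, bandwidth):
    """Gain of a simple bell-shaped resonance (an assumed shape) at frequency f."""
    return 1.0 / math.sqrt(1.0 + ((f - centre) / (bandwidth / 2.0)) ** 2)

def vowel_spectrum(f0, formants, bandwidth=100.0, n_harmonics=40):
    """Return (frequency, amplitude) pairs: sawtooth source shaped by formants."""
    spectrum = []
    for n in range(1, n_harmonics + 1):
        f = n * f0
        source = 1.0 / n  # sawtooth: harmonic amplitudes fall off as 1/n
        shaping = sum(resonance_gain(f, fc, bandwidth) for fc in formants)
        spectrum.append((f, source * shaping))
    return spectrum

def spectral_peaks(spectrum):
    """Frequencies of the harmonics that are stronger than both neighbours."""
    return [spectrum[i][0] for i in range(1, len(spectrum) - 1)
            if spectrum[i - 1][1] < spectrum[i][1] > spectrum[i + 1][1]]

# Illustrative formants for a male /a/ (approximate values, assumed).
FORMANTS_A = (730.0, 1090.0, 2440.0)
spectrum = vowel_spectrum(110.0, FORMANTS_A)
peaks = spectral_peaks(spectrum)
print(peaks)
```

With a 110 Hz fundamental, the local maxima fall on the harmonics closest to each formant: the formants are not partials themselves, only the regions where the partials of the source are enhanced.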
The formant frequencies are denoted by the letter F followed by a number, in order of increasing frequency: F1, F2, F3, F4, F5 and F6. The importance of the formants decreases as the order grows; in fact, to a zeroth-order approximation, vowels can be identified by the first two formants alone (F1 and F2) or, better, by their ratio (F2 / F1).
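A minimal sketch of identification by the F2 / F1 ratio is given below. The three reference vowels and their formant values are approximate textbook averages for adult male speakers, assumed here purely for illustration, and the function name is hypothetical. With a larger vowel set the ratio alone would probably not be enough.

```python
import math

# Approximate (F1, F2) reference values in Hz; assumed, for illustration only.
REFERENCE_VOWELS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def identify_vowel(f1, f2, reference=REFERENCE_VOWELS):
    """Return the reference vowel whose F2/F1 ratio is closest (on a log scale)."""
    measured = math.log(f2 / f1)
    return min(
        reference,
        key=lambda v: abs(measured - math.log(reference[v][1] / reference[v][0])),
    )

print(identify_vowel(280.0, 2200.0))   # formants close to the reference /i/
print(identify_vowel(876.0, 1308.0))   # reference /a/ with both formants raised by 20%
```

Because the ratio is unchanged when all formants are scaled by the same factor, the second call still finds the same vowel even though both formants are 20% higher than the reference.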


A quick discussion with many figures about vowels: hyperphysics

A more in-depth discussion about speech by OGI School of Science & Engineering, Oregon Health & Science University

Acustica della Voce by  Mauro Uberti (in Italian). 

Analisi e grafici di alcune vocali inglesi, Gabriele Azzaro, Università di Bologna (in Italian)

On the Magic of Overtone singing (Piero Cosi, Graziano G. Tisato)

Fisiologia e fonetica nel canto e nel parlato (Silvana Iuliano) (in Italian)

A huge repository of important papers about voice and music at KTH in Stockholm.

A recent survey paper by  Johan Sundberg: Research on the singing voice in retrospect

A virtual laboratory about some aspects of the human voice.

Thanks to Svante Granqvist of KTH in Stockholm, who has made available some smart (simple, light and small) software tools, we can easily carry out many interesting experiments. The page of links to computer tools for signal processing also contains the link to these smptools.

Let us now consider two of them: Madde, a singing synthesizer, and RTSect, a real-time virtual oscilloscope and spectrum analyzer.

Madde produces a synthesized singing voice (by additive synthesis), and RTSect can be used to view in real time the signals produced by Madde.

How to use  Madde and RTSect at  the same time

Using Madde

Using RTSect

The voice: individual, gender and age features.

What makes the difference between voices? What, in particular, distinguishes a male voice, a female voice and a child's voice?

The formants are resonances of the vocal tract. They depend greatly on the shapes assumed by the cavities (to the extent that we can modify them in order to articulate vowels) and on their dimensions. We obviously cannot change the dimensions, which therefore mark the voice decisively through the formant profile.

We can therefore expect higher formants for smaller dimensions: higher formants for children, for instance. Women's vocal tracts are on average roughly 15-20% shorter than men's, and their oral cavities are generally smaller. As a result, their formants are higher than those of men (in adults, roughly 20% higher). This, and not pitch alone, is what makes the difference of gender (and of age) in the timbre of the voice; pitch by itself is inadequate to explain gender and age timbre. A man singing or speaking in falsetto will still show a masculine timbre (sopranos and falsettists sound quite different). In Madde, this can be tried out by modifying the "Factor" in the Formants pane. We can also try setting the formants to frequencies taken from the data found in some of the links quoted here.
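The dependence of the formants on vocal tract length can be sketched with the classic idealization of the tract as a uniform tube, closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1) c / (4 L). The tube lengths (17.5 cm and 14.5 cm) and the speed of sound below are assumed round figures, and `scale_formants` is only a hypothetical stand-in for what a global formant factor does; it does not describe how Madde is actually implemented.

```python
SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, humid air in the vocal tract

def tube_resonances(length_m, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) c / (4 L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_m)
            for n in range(1, count + 1)]

def scale_formants(formants, factor):
    """Multiply every formant frequency by the same factor."""
    return [f * factor for f in formants]

male = tube_resonances(0.175)     # assumed adult male tract length
female = tube_resonances(0.145)   # assumed adult female tract length
ratio = female[0] / male[0]       # equals 0.175 / 0.145 for every resonance

print([round(f) for f in male])
print([round(f) for f in female])
print(round(ratio, 3))
print([round(f) for f in scale_formants(male, 1.2)])
```

In this idealized model a tract about 17% shorter raises every resonance by about 20%, which is consistent with the figure quoted above for adult voices.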

Link: This article is addressed to counsellors of transsexual clients, inviting them, in order to avoid future disappointment, to make clear to their clients that even after sex reassignment surgery the timbre of the voice will remain the original one. It cannot be changed or simulated simply by modifying the pitch (i.e. by speaking in falsetto): Acoustic Correlates of Speaker Sex Identification - Coleman.

What is "nasal" timbre? It is the appearance of nasal formants at 200-300 Hz, 1 kHz and 2 kHz and of some "antiformants" (antiresonances, that is, suppressed frequencies), together with an overall loss of power in the first formant and in the higher frequencies and a decrease in Q for all the formants (Eric Keller, University of Lausanne: Tutorial Review: The Analysis of Voice Quality in Speech Processing).

Speech Analyzer

It is free software for speech analysis from SIL International (Summer Institute of Linguistics), an international organization devoted to advancing the knowledge of little-known or unwritten languages, founded in 1934 by William Cameron Townsend.
SIL makes available a wide repository of linguistics software, much of it free, for different platforms (Windows, Mac OS, Linux, PalmOS, Unix).


An alternative (more powerful and flexible) tool for this kind of analysis is WaveSurfer, by KTH in Stockholm. It is free and multiplatform (Linux, Windows 95/98/NT/2K/XP, Macintosh, Sun Solaris, HP-UX, FreeBSD, and SGI IRIX) and is based on SNACK, the sound package for Tcl/Tk.

Obviously, these tools can also be used to analyze sounds of musical interest, not only the singing voice.

A forensic point of view. 

In forensic practice, recordings obtained by wiretapping or interception are often submitted to expert examination in order to identify and prove who the actual speaker was. As the frequency location of the first formants depends greatly on the dimensions of the vocal tract, formant values are clearly hard to modify or counterfeit. In principle, a speaker can thus be identified by the statistics of the positions of his/her formants. According to Manfred Schroeder, however, these analyses are well suited only to excluding a speaker, not to identifying one, because of the large uncertainty in the determination. As a result, he questions the term "voiceprints" (by analogy with "fingerprints") as misleading.
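The asymmetry between excluding and identifying a speaker can be illustrated with a toy comparison of formant statistics. Everything in the sketch below is hypothetical: the function names, the threshold of three pooled standard deviations, and above all the formant measurements, which are invented numbers and not real forensic data. It is not an accepted forensic procedure.

```python
import math
import statistics

def formant_distance(reference, questioned):
    """Difference of the means, in units of the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(reference) + statistics.variance(questioned)) / 2.0
    )
    return abs(statistics.mean(reference) - statistics.mean(questioned)) / pooled_sd

def exclusion_check(reference, questioned, threshold=3.0):
    """A large mismatch on any formant excludes; a match proves nothing."""
    for name in reference:
        if formant_distance(reference[name], questioned[name]) > threshold:
            return "excluded"
    return "not excluded (inconclusive)"

# Invented F1/F2 measurements (Hz) of the same vowel in several utterances.
suspect = {"F1": [720, 735, 710, 742, 728], "F2": [1180, 1210, 1195, 1170, 1205]}
recording_a = {"F1": [725, 715, 738, 731, 722], "F2": [1190, 1200, 1175, 1215, 1185]}
recording_b = {"F1": [850, 862, 845, 858, 870], "F2": [1400, 1420, 1390, 1410, 1430]}

result_a = exclusion_check(suspect, recording_a)
result_b = exclusion_check(suspect, recording_b)
print(result_a)
print(result_b)
```

A clear mismatch allows an exclusion, while a match only means that the suspect cannot be ruled out, since other speakers may well share similar formant statistics.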

Link: LPC (Linear Predictive Coding) analysis splits the signal into two components: an excitation and a formant filter. It is therefore particularly suitable for the vocal signal. Here is a tutorial on LPC.
This device, instead, uses this type of analysis in real time, with a corresponding resynthesis, to modify the formants (besides the pitch) and, consequently, the timbre of the voice. We can add that it can completely confuse any speaker identification system, in either direction: it can be used either to make oneself unrecognizable or to have oneself recognized as a different person.
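The LPC split into an excitation and a formant filter can be sketched in a few lines of Python. The example below is a minimal illustration using the autocorrelation method with the Levinson-Durbin recursion, applied to a synthetic signal with a single formant. The function names, the test signal (a 700 Hz resonance at an 8 kHz sampling rate) and the prediction order are assumptions chosen for the example; this is not the method of the tutorial or of the device linked here.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for the prediction-error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a, err

def resonator_impulse_response(fc, radius, fs, n):
    """Impulse response of a two-pole resonance: one formant, one glottal pulse."""
    theta = 2.0 * math.pi * fc / fs
    b1, b2 = 2.0 * radius * math.cos(theta), -radius * radius
    y = []
    for i in range(n):
        y1 = y[i - 1] if i >= 1 else 0.0
        y2 = y[i - 2] if i >= 2 else 0.0
        y.append((1.0 if i == 0 else 0.0) + b1 * y1 + b2 * y2)
    return y

fs, order = 8000.0, 2
signal = resonator_impulse_response(700.0, 0.97, fs, 2000)
a, _ = levinson_durbin(autocorrelation(signal, order), order)

# Formant filter: the pole angle gives the frequency, the pole radius the bandwidth.
pole_radius = math.sqrt(a[2])
est_freq = math.acos(-a[1] / (2.0 * pole_radius)) * fs / (2.0 * math.pi)
est_bandwidth = -math.log(pole_radius) * fs / math.pi

# Excitation: what is left after inverse filtering the signal with A(z).
residual = [signal[n] + sum(a[j] * signal[n - j]
                            for j in range(1, order + 1) if n - j >= 0)
            for n in range(len(signal))]
print(round(est_freq, 1), round(est_bandwidth, 1))
```

In this idealized case the estimated filter recovers the resonance that generated the signal, and the residual reduces to the single initial pulse. Real speech needs a higher prediction order, windowed frames and some care with pre-emphasis.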

Link: This free program from the Institute of Phonetic Sciences of the University of Amsterdam performs a wide range of voice analyses and resyntheses, including the reconstruction of the dimensions of the vocal tract from the analysis of the formants. It is Praat, one of the best tools, if not the best, for this purpose.

The "singer's formant"

On this subject, nothing is better than quoting the abstract of one of Sundberg's papers:


The “singer’s formant” is a prominent spectrum envelope peak near 3 kHz, typically found in voiced sounds produced by classical operatic singers. According to previous research, it is mainly a resonatory phenomenon produced by a clustering of formants 3, 4, and 5. Its level relative to the first formant peak varies depending on vowel, vocal loudness, and other factors. Its dependence on vowel formant frequencies is examined. Applying the acoustic theory of voice production, the level difference between the first and third formant is calculated for some standard vowels. The difference between observed and calculated levels is determined for various voices. It is found to vary considerably more between vowels sung by professional singers than by untrained voices.

The center frequency of the singer’s formant as determined from long-term spectrum analysis of commercial recordings is found to increase slightly with the pitch range of the voice classification.

Johan Sundberg, Voice Research Centre, Department of Speech Music Hearing, KTH, Stockholm, Sweden, Level and Center Frequency of the Singer’s Formant, Journal of Voice, Vol. 15, No. 2, pp. 176–186

The reason for this prominent concentration of acoustic power may be a relative weakness in the same spectral region in the long-term spectra of an operatic orchestra. The singer's formant is thus a means of standing out against a large orchestra.

(National Center for Voice and Speech, University of Iowa)

A research report on Sutherland's and Gruberova's voices, Geneva University.

A paper on the preference in choral singing for resonances close to the singer's formant, International Journal of Research in Choral Singing.

Underlying complexity. 

Please don't be deceived by this quick exposition: what we are dealing with here is far from simple, and far from settled once and for all. First, we have restricted our attention to vowels, whereas the individuality of voices also rests on further features (transients, as in the consonants), not unlike what happens in musical instruments, where the spectral shape is only one of the various components of timbre. Even restricting our attention to vowels, it must be clear that they belong to a universe more populated than any list that anyone could compile on the basis of his own knowledge alone. Vowels are not an absolute datum, and they do not depend only on the language: there are innumerable dialect and sub-dialect variations. This study deals with the differences between Pisan and Florentine vowels, while this one is a wide review of work in progress on vowels (and on their formant representation) in relation to gender, age and (very) specific dialect. Beyond coarse subdivisions, determining the connection between formants and vowels is full of uncertainties, due to the high variability and, probably, to the intervention of different, not purely perceptual mechanisms (for instance, semantic and cognitive ones) in the human process of vowel identification.

As often happens, statistics are unable to describe and capture phenomena in which human behavior, in a broad and manifold sense, plays a critical part. This complexity can explain the extreme slowness with which, after an initial leap forward, speech recognition systems (STT - Speech To Text, or ASR - Automatic Speech Recognition) are progressing today. They do not even come close to the robustness of any human "recognizer" (namely, an interlocutor).

The state of the art of the opposite systems (TTS - Text To Speech) is quite different. They can automatically read written texts aloud at a good level of quality. In this field the Italian firm "Loquendo" (formerly CSELT, of the STET group, today owned by Telecom Italia) deserves mention, together with its system "Actor", which marked notable progress over the earlier "Eloquens". Here you can try out the capabilities of the system and compare it with the previous one (which is "Mario, Robotic Voice" in the menu of voices).


formants - "formants determination" - formants analysis

A laboratory for the analysis.