Voice and its timbre: individual, gender
and age features
These quick notes are about voice timbre from the point of view of
signal processing. For the physiological and specifically acoustic
aspects of the human voice, please refer to the Psychoacoustics classes
and to general books on the subject (for instance, Chapter 11 by Mario
Uberti in "Acustica Musicale e Architettonica").
The human voice
Physically, vocal emission is the result of the vibration of the vocal
cords, which produce an excitation that propagates through the pharynx
and the oral and nasal cavities. These cavities have their own
resonance modes, which spectrally "mould" the emitted sound, giving it
a specific timbre quality.
From the point of view of signal processing, the excitation produced by
the vocal cords can be schematized as a sawtooth waveform (the vocal
cords roughly operate like a reed), that is, as a harmonic signal whose
amplitudes decrease with frequency.
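The sawtooth idealisation can be checked numerically. The following is only a minimal sketch: the function name and the number of samples are illustrative assumptions, not part of the original notes. It takes one period of an ideal sawtooth and shows that the amplitude of the n-th harmonic falls off roughly as 1/n.

```python
import numpy as np

def sawtooth_harmonic_amplitudes(n_harmonics, samples_per_period=4096):
    """Amplitudes of the first harmonics of one sawtooth period,
    normalised so that the fundamental has amplitude 1."""
    # One period of an ideal sawtooth ramping from -1 towards 1.
    t = np.arange(samples_per_period) / samples_per_period
    saw = 2.0 * t - 1.0
    # The DFT of exactly one period puts the k-th harmonic in bin k.
    spectrum = np.abs(np.fft.rfft(saw))
    harmonics = spectrum[1:n_harmonics + 1]
    return harmonics / harmonics[0]

amps = sawtooth_harmonic_amplitudes(8)
for n, a in enumerate(amps, start=1):
    print(f"harmonic {n}: amplitude {a:.3f}  (1/n = {1.0 / n:.3f})")
```

A 1/n decay corresponds to a slope of about 6 dB per octave. Real glottal sources are usually described as falling off more steeply, so this is only the crude schematization used in the text.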
If we restrict ourselves to vowels (which are the sounds that mostly
identify the specific timbre and individuality of a voice), the
filtering effect (or shaping of the spectral profile) of the cavities
can be schematized as a rather selective enhancement of some specific
frequencies. It is usual to consider at most six of these. These
prominent frequencies are called "formants". Every vowel has its
specific sequence of formants, and the identification of vowels
actually consists in the unconscious identification of the formants
themselves.
The formant frequencies are denoted by the letter F followed by a
number, in order of increasing frequency: F1, F2, F3, F4, F5 and F6.
The importance of the formants decreases as the order grows; in fact,
to a zeroth-order approximation, vowels can be identified by the first
two formants alone (F1 and F2) or, better, by their ratio (i.e.
F2 / F1).
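As a rough illustration of identifying a vowel from F1 and F2 alone, here is a minimal sketch of a nearest-neighbour lookup. The reference values are approximate, commonly quoted averages for adult male speakers and are not taken from these notes. The table, the function name and the log-frequency distance are all assumptions made for the example.

```python
import math

# Approximate, commonly quoted average formants (Hz) for adult male
# speakers. Illustrative values only, not from the original notes.
REFERENCE_VOWELS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def classify_vowel(f1, f2, references=REFERENCE_VOWELS):
    """Return the reference vowel closest to (f1, f2) on a log-frequency scale."""
    def distance(ref):
        r1, r2 = ref
        # Log ratios make the distance depend on relative, not absolute, shifts.
        return math.hypot(math.log(f1 / r1), math.log(f2 / r2))
    return min(references, key=lambda vowel: distance(references[vowel]))

print(classify_vowel(310.0, 2100.0))
print(classify_vowel(700.0, 1200.0))

# The F2 / F1 ratio alone already separates these three vowels.
for vowel, (f1, f2) in REFERENCE_VOWELS.items():
    print(vowel, "F2/F1 =", round(f2 / f1, 2))
```

With only three reference points this is a toy. A realistic classifier would need many more vowels and some normalisation for the speaker.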
A virtual laboratory about some aspects of human
Thanks to Svante Granqvist, of the KTH in Stockholm, who has made
available some smart (simple, light and small) software tools, we can
easily carry out many interesting experiments. The page of links to
computer tools for signal processing also contains the link to these
smptools.
Let us now consider two of them: Madde, a singing synthesizer, and
RTSect, a real-time virtual oscilloscope and spectrum analyzer.
Madde produces a synthesized singing voice (by additive synthesis), and
RTSect can be used to watch in real time the signals produced by Madde.
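To give an idea of what additive synthesis of a sung vowel involves, here is a minimal sketch. It is not Madde's actual algorithm: the resonance formula, the bandwidths and the formant values are assumptions chosen for illustration. Each harmonic of the fundamental gets a sawtooth-like source amplitude of 1/n, which is then weighted by a sum of simple resonance peaks standing in for the formants.

```python
import numpy as np

def resonance_gain(freq, center, bandwidth):
    """Magnitude of a simple second-order resonance peak (gain 1 at its center)."""
    return 1.0 / np.sqrt(1.0 + ((freq**2 - center**2) / (freq * bandwidth))**2)

def synthesize_vowel(f0, formants, bandwidths, duration=0.5, sample_rate=16000):
    """Additive synthesis: sum harmonics of f0, shaped by the formant peaks."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    n_harmonics = int((sample_rate / 2 - 1) // f0)  # stay below Nyquist
    signal = np.zeros_like(t)
    for n in range(1, n_harmonics + 1):
        freq = n * f0
        source = 1.0 / n  # sawtooth-like excitation
        envelope = sum(resonance_gain(freq, fc, bw)
                       for fc, bw in zip(formants, bandwidths))
        signal += source * envelope * np.sin(2.0 * np.pi * freq * t)
    return signal / np.max(np.abs(signal))  # normalise to full scale

# Hypothetical settings: an "a"-like vowel sung at 110 Hz.
voice = synthesize_vowel(110.0, formants=(730.0, 1090.0, 2440.0),
                         bandwidths=(80.0, 90.0, 120.0))
print(len(voice), "samples, peak amplitude", np.max(np.abs(voice)))
```

Viewed in a spectrum analyzer such as RTSect, a signal of this kind shows the harmonic comb of the source with its envelope raised around the formant frequencies.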
How to use
Madde and RTSect at the same time
The voice: individual, gender and age features.
What makes the differences among voices? What, in particular, makes the
difference between a male voice, a female voice and a child's voice?
The formants are resonances of the vocal tract, and they greatly depend
on the shapes assumed by the cavities (to the extent that we are able
to modify them in order to articulate vowels), and on their dimensions.
We obviously cannot change the latter,
which therefore mark the voice, through the formant profile, in a
So we can expect higher formants for smaller dimensions: higher
formants for children, for instance. Women have a vocal tract that is
on average roughly 15-20% shorter than that of men (a few centimetres
on a tract of about 17 cm), and in general smaller oral cavities. As a
result, their formants are higher than those of men (in adults, roughly
20% higher).
This is what makes the difference of gender (and of age) in the timbre
of the voice; pitch alone is not enough to explain it. A man singing or
speaking in falsetto will still show a masculine timbre (sopranos and
falsettists sound quite different). In Madde, this can be tried out by
modifying the "Factor" in the Formants pane. We can moreover try
setting the formants to frequencies taken from the data found in some
of the links quoted here.
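The link between tract length and formant height can be illustrated with the standard textbook idealisation of the vocal tract as a uniform tube, closed at the glottis and open at the lips (a quarter-wavelength resonator). This is only a sketch under that assumption: the lengths, the speed of sound and the function names are illustrative, and the scaling function merely mimics what a global formant factor would presumably do.

```python
SPEED_OF_SOUND = 350.0  # m/s, approximate value for warm, humid air (assumption)

def tube_formants(length_m, count=3, c=SPEED_OF_SOUND):
    """Resonances of a uniform tube closed at one end and open at the other:
    F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

def scale_formants(formants, factor):
    """Multiply every formant by the same factor."""
    return [f * factor for f in formants]

long_tract = tube_formants(0.175)   # illustrative adult male length
short_tract = tube_formants(0.145)  # illustrative adult female length

print("17.5 cm tract:", [round(f) for f in long_tract])
print("14.5 cm tract:", [round(f) for f in short_tract])
print("ratio:", round(short_tract[0] / long_tract[0], 2))
print("long tract scaled by 1.2:", [round(f) for f in scale_formants(long_tract, 1.2)])
```

In this model all the formants scale inversely with the length, so shortening the tube by about a sixth raises every formant by about 20%. Real vocal tracts are not uniform tubes, so the actual differences vary from formant to formant and from vowel to vowel.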
Link: This article is addressed to counsellors of transsexual clients,
inviting them - in order to avoid future disappointment - to make clear
to their clients that even after sex reassignment surgery the timbre of
the voice will remain the original one. It cannot be changed or
simulated simply by modifying the pitch (i.e. by speaking in falsetto):
Acoustic Correlates of Speaker Sex Identification.
What is the "nasal" timbre? It is the rise of nasal formants at
200-300 Hz, 1 kHz and 2 kHz and the appearance of some "antiformants"
(antiresonances, i.e. suppressed frequencies), an overall loss of power
in the first formant and in the higher frequencies, together with a
decrease of the Q of all the formants (Eric Keller, University of
Review: The Analysis of Voice Quality in Speech Processing).
It is free software for speech analysis from SIL International (Summer
Institute of Linguistics), an international organization devoted to
improving the knowledge of little-known or unwritten languages, founded
in 1934 by William Cameron
SIL makes available a wide repository of software for linguistics, much
of it free, for different platforms (Windows, Mac OS, Linux, PalmOS,
Unix).
An alternative (more powerful and flexible) tool for this kind of
analysis is WaveSurfer, by the KTH in Stockholm, free and multiplatform
(Linux, Windows 95/98/NT/2K/XP, Macintosh, Sun Solaris, HP-UX, FreeBSD,
and SGI IRIX). It is based on Snack, the sound toolkit for Tcl/Tk.
These tools can obviously also be used to analyze sounds of musical
interest, not only the singing voice.
A forensic point of view.
In forensic practice it is usual to submit recordings obtained by
wiretapping or interception to expert examination, in order to identify
and prove who the actual speaker was. As the frequency location of the
first formants greatly depends on the dimensions of the vocal tract, it
is quite clear that the formant values are hard to modify or
counterfeit. A speaker can thus in principle be identified by the
statistics of the positions of his/her formants. According to Manfred
Schroeder, these analyses are well suited only to excluding a speaker,
not to the contrary, because of the huge uncertainty in the
determination. As a result, he questions the use of the term
"voiceprints" (by analogy with "fingerprints") as
Link: LPC (Linear Predictive Coding) analysis splits the signal into
two components: an excitation and a formant filter. It is therefore
particularly well suited to the vocal signal. Here is a tutorial on
LPC.
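The split into excitation and formant filter can be sketched with the classic autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration, not the method of any tool mentioned here: the function names, the test signal and the pole radius are assumptions. The test signal is the impulse response of an all-pole filter with two known resonances, so the analysis should recover them.

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method LPC. Returns (a, error), where
    a = [1, a1, ..., ap] is the inverse (prediction-error) filter A(z)
    and 1/A(z) is the formant filter."""
    x = np.asarray(signal, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this step of the Levinson-Durbin recursion.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        error *= 1.0 - k * k
    return a, error

def formant_frequencies(a, sample_rate):
    """Frequencies (Hz) of the resonances of 1/A(z), from the roots of A(z)."""
    roots = [z for z in np.roots(a) if z.imag > 0]
    return sorted(np.angle(z) * sample_rate / (2.0 * np.pi) for z in roots)

def all_pole_filter(excitation, a):
    """Filter the excitation through 1/A(z) (direct recursion)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Hypothetical test signal: two resonances at 700 Hz and 1200 Hz.
sample_rate = 8000
poles = []
for f in (700.0, 1200.0):
    p = 0.97 * np.exp(2j * np.pi * f / sample_rate)
    poles += [p, np.conj(p)]
a_true = np.real(np.poly(poles))

impulse = np.zeros(2000)
impulse[0] = 1.0
voiced = all_pole_filter(impulse, a_true)

a_est, residual_energy = lpc(voiced, order=4)
print([round(f, 1) for f in formant_frequencies(a_est, sample_rate)])
```

On real speech the signal is analysed frame by frame, usually after windowing and pre-emphasis, with an order chosen according to the sampling rate. Filtering a frame with A(z) then yields the excitation (the residual).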
This device instead uses this type of analysis in real time, with a
corresponding resynthesis, to modify the formants (besides the pitch)
and, consequently, the timbre of the voice. We can add, moreover, that
it can totally confuse any speaker identification system, in one
direction or the other: it can be used either to make oneself
unrecognizable, or to have oneself recognized as a different person.
Link: This free program from the Institute of Phonetic Sciences of the
University of Amsterdam performs a wide range of analyses and
resyntheses of the voice, including the reconstruction of the
dimensions of the vocal tract from the analysis of the formants. It is
Praat, one of the best tools, if not the best, for this purpose.
The "singer's formant"
On this subject, nothing is better than quoting the abstract of a paper
by Sundberg:
The singer's formant is a prominent
spectrum envelope peak near 3 kHz, typically found in voiced sounds
produced by classical operatic singers. According to previous research,
it is mainly a resonatory phenomenon produced by a clustering of
formants 3, 4, and 5. Its level relative to the first formant peak
varies depending on vowel, vocal loudness, and other factors. Its
dependence on vowel formant frequencies is examined. Applying the
acoustic theory of voice production, the level difference between the
first and third formant is calculated for some standard vowels. The
difference between observed and calculated levels is determined for
various voices. It is found to vary considerably more between vowels
sung by professional singers than by untrained voices.
The center frequency of the singer's
formant as determined from long-term spectrum analysis of commercial
recordings is found to increase slightly with the pitch range of the
Johan Sundberg, Voice Research Centre,
Department of Speech Music Hearing, KTH, Stockholm, Sweden, Level and
Center Frequency of the Singer's Formant, Journal of Voice, Vol. 15,
No. 2, pp. 176-186
The reason for this prominent concentration of acoustic power may be
the relative weakness of the long-term spectrum of an opera orchestra
in the same spectral region. The singer's formant is thus a means of
emerging with respect to a large
(National Center for Voice and Speech, Iowa
A research report on the voices of Sutherland and Gruberova, Geneva University.
A paper on the preference in choral singing for
resonances close to the singer's formant, International Journal of
Research in Choral Singing.
Please don't be deceived by this quick exposition: what we are dealing
with here is far from simple, and far from settled once and for all.
First, we have restricted our attention to vowels, whereas the
individuality of voices also rests on further features (transients, as
in the consonants), not unlike what happens in musical instruments,
where the spectral shape is only one of the various components of
timbre. Even restricting our attention to vowels, it must be clear that
they belong to a universe more populated than any list one could
compile on the basis of one's own knowledge alone. Vowels are not an
absolute datum, and they do not depend only on the language: there are
innumerable dialect and sub-dialect variations. This study deals with
the differences between the vowels of Pisa and those of Florence, while
this one is a wide review of the work in progress on the study of
vowels (and of their formant representation) in relation to gender, age
and (very) specific dialect. The determination of the relationship
between formants and vowels, beyond coarse subdivisions, is full of
uncertainties, due to the high variability and - probably - to the
intervention of different, not purely perceptual mechanisms (for
instance, semantic and cognitive ones) in the human process of vowels
As often happens, statistics are unable to describe and capture
phenomena in which human behavior - in a broad and manifold sense -
plays a critical part. This complexity can explain the extreme slowness
with which speech recognition systems (STT - Speech To Text, or ASR -
Automatic Speech Recognition) are progressing today, after an initial
leap forward. They do not even come close to the robustness of any
human "recognizer" (namely an interlocutor).
The state of the art of the opposite systems (TTS - Text To Speech) is
instead quite different. They are able to read written texts aloud
automatically, at a good level of quality. In this field the Italian
firm "Loquendo" (formerly CSELT of the STET group, today owned by
Telecom Italia) deserves a mention, together with its system "Actor",
which has marked notable progress compared with the former "Eloquens".
Here you can try the capabilities of the system, and compare it with
the previous one (which is "Mario, Robotic Voice" in the menu of
voices).