The effects of harmonicity in Mandarin tone perception Audio Demo

Yiran Ding

Main Reference

McPherson, M. J., & McDermott, J. H. (2018). Diversity in pitch perception revealed by task dependence. Nature human behaviour, 2(1), 52-66.

Popham, S., Boebinger, D., Ellis, D. P., Kawahara, H., & McDermott, J. H. (2018). Inharmonic speech reveals the role of harmonicity in the cocktail party problem. Nature communications, 9(1), 1-13.

F0, pitch and harmoinic

Fundamental Frequency (F0)

The Fundamental Frequency of a speech signal, often denoted by F0, refer to the approximate frequency of the (quasi-)periodic structure of voiced speech signals.

it is typically not stationary, but changes constantly within a word or sentence, so it can be used for expressive purposes to signify, like emphasis, question, and lexical tone in tonal languages

Pitch

As F0 describes the actual physical phenomenon, whereas pitch describes how ears and brains interpret the signal, in terms of periodicity.
For example, a voiced signal could have an F0 of 100hz, if a high-pass filter applies to remove all singal component below 450hz (which would remove the actual F0). The lowest remaining periodic compeonet would be 500hz (equal to fifth harmonic of original F0) but a human listener would then typically still preceive a pitch of 100hz even it doesn't exist.This well-known phenomenon is however still not completely understood.

Harmonic and Harmonicity

A typical attribute of vocal sound is harmonic structure, depends on the waveform produced by the vibrating vocal cords. Like the musical instrument, the human voice is not a pure tone, rather, it is composed of a fundamental tone(frequency) and a series of higher frequencies called upper harmonics. Usually corresponding to a simple mathematical ratio of harmonics. and the peak in harmonics often reflect the Formants of a vowel. harmonic and formants

Harmonicity play a important role in speech perception, as long as the harmonics are precise multiples of the F0, the voice will sound clear and pleasant. If the voice contains partial of (non-)un-harmonicity, like the old piano, it might increase degree of roughness, harshness,or hoarseness.

Harmoincs also plays a crucial role in terms of pitch perception, another important feature of harmonic is the resolvability. as show below.

The frequency resolution of the peripheral auditory system can be represented by auditory filters that increase in bandwidth with increasing frequency (in gray). The low-numbered harmonics (1-6, in blue) of a tone with a fundamental frequency F0 are processed within distinct filters and are said to be 'resolved'. Neighboring high-numbered harmonics (above the 12th, in red) interact within the same filter and are said to be 'unresolved'.

There are several possible central mechansim that can be used to extract cues to decode the pitch of harmoinc sound.

The simplest mechanism is the use the lowest-frequency component to extract pitch (F0).
Secondly, audition system might store many spectral harmonic templates, a match between one of these templates and resolved harmonics determins the pitch.
THe third potenital mechanism is to extract pitch from the interactions among unsolved harmonics within an auditory filter, which generate a temporal envelope with a periodicity equal to the pitch.

Evaluation(test-purpose)

The effect of harmonicity and relation between pitch and fundamental frequency has been long studied in general perception tasks, but the effects of harmonicity in Mandarin tone perception have remained deabated. I'd like to create a series of stimulus to study the possible computations underline during tone processing

This isn't the experiment-oriented demo, so I will embed the audio as well as the spectrogram (which will indicate the manipulation result). It will contain almost 15 audio samples and takes like 5 mins to listen, I will be much appreciated for your feedback.

To date, there are three possible ways to analyze and manipulate speech then generate (relatively) audible speech,

Pyworld A typical source-filter model of speech, it can analysis and manipulate of speech based on F0, Spectral Envelope and Aperodic Envelope genereate by WORLD vocoder
Mosaic Speech a modified Noise-vocoded method to mask the desire parts of speech
Harmonic plus stochastic model a modified version of Sinusoids model that can decompose signal into harmonics and stochastics parts for analysis and synthesis.

Below are a few demo audios.

Mosaic speech demo.@ the low/high (resolved/unresolved) harmonic can be masked or normalized by using mosaic method, see whether there is difference of tone perception when parital information lost

left: only low number harmonics maintain, right: only high-number harmonic maintain. (bi-syllable and sentence example provide.)

Pyworld-based speech demo.@ instead of mask or normalize the low/high harmonics, pyworld based method can remove the energy or amplitude of centain harmonics

one syllable audio provided, seems no difference between mosaic

Harmonic plus stochastic model demo @ this method can decompose the signal into harmonic parts and stochastics parts and then manipulate and synthesis, this is only one method able to control harmonic seperately for now.

a female and male sentence provided

first row : the original sentences produced by male/female speakers. second row: the resynthesized inharmonicity sentences by male/female speakers (the harmonics are stretched or compressed that not perfect ratio of F0) it was claimed that it will make the signal aperiodic and inconsistent with any signle F0 and hence, rendering F0/pitch from Harmonics become less possible.

This part we will have some demos about what if the harmonics are conflict.