Formant question speech/screaming synthesis

Guest

Hi all,

I was working on some speech synthesis in PD by emulating spectral analysis of speech. This was doable. But I got curious about emulating other types of speech, like screaming and singing.

Any advice on how to accomplish this? I used fft~ to analyze my own screams, and did some searches for analysis already done online. There were a couple papers available.

I made some experiments to simulate screaming already. The fft showed that, unlike a normal speaking voice, which shows only a few spectral peaks and is convincingly mimicked with a few bp filters, screaming has spectral peaks active throughout the audible spectrum.

First I tried setting up a regular formant synth with higher fundamental tones, setting the gain higher on the partials and adding in some noise. This sounded too much like whispering.
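
For readers following along outside Pd, here is a minimal Python sketch of that kind of setup: a sawtooth (standing in for [phasor~]) mixed with some noise and sent through a few parallel band-pass filters. It is not the actual patch. The function names, the `VOWEL_A` formant values, and the filter design (a standard biquad band-pass) are all assumptions chosen for illustration.

```python
import math
import random

SR = 44100  # sample rate in Hz (assumed)

def phasor(freq, n, sr=SR):
    """Rising ramp in [0, 1), similar to Pd's [phasor~]."""
    out, phase = [], 0.0
    for _ in range(n):
        out.append(phase)
        phase = (phase + freq / sr) % 1.0
    return out

def bandpass(signal, center, q, sr=SR):
    """Biquad band-pass with 0 dB peak gain (Audio EQ Cookbook form)."""
    w0 = 2.0 * math.pi * center / sr
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def formant_voice(f0, formants, noise_mix, seconds, sr=SR, seed=0):
    """Saw + noise source through parallel band-pass 'formant' filters."""
    rng = random.Random(seed)
    n = int(seconds * sr)
    saw = [2.0 * p - 1.0 for p in phasor(f0, n, sr)]
    source = [(1.0 - noise_mix) * s + noise_mix * rng.uniform(-1.0, 1.0)
              for s in saw]
    out = [0.0] * n
    for center, q, gain in formants:
        band = bandpass(source, center, q, sr)
        for i in range(n):
            out[i] += gain * band[i]
    return out

# Rough, illustrative (center Hz, Q, gain) values for an "ah"-like vowel.
VOWEL_A = [(800.0, 10.0, 1.0), (1150.0, 12.0, 0.5), (2900.0, 15.0, 0.25)]

samples = formant_voice(220.0, VOWEL_A, noise_mix=0.2, seconds=0.1)
```

Raising `noise_mix` pushes the result toward the whispery quality described above, since more of the energy reaching the filters is then unpitched.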

Second I created parallel bp filters for the noise to try and make it less airy and more direct/forceful sounding.

Third I tried to skip the noise entirely and simply add more [phasor~] objects where I saw partials. There had to be so many that they tended to cancel out or sound way too distorted to be like a speaking or screaming voice.

I'm open to any kind of response, theoretical or practical. Just can't wrap my head around the problem yet.

austin_ep

Firstly, this sounds incredibly interesting! Formants and synthesized speech have always fascinated as well as puzzled me, so congratulations to you for being able to tackle it. Just reading about the steps that you have taken so far and all the knowledge that you have on it is really impressive to me.

Sadly, I have no clue how one would go about synthesizing screaming, but on the upside of your ventures you have made a whisper!

I don't doubt that you have already stumbled on this article:
http://www.soundonsound.com/sos/mar01/articles/synthsec.asp
It was quite a good read, and those diagrams are easy to make again in pd. I could not find anything else on screaming. I was sitting here thinking about just what screaming is, and I really cannot get a grasp on it. It isn't a distorted voice, it's louder than speaking. To me speaking would be like an [osc~] and screaming would be like a [phasor~], which it seems that you have already tried.

Hopefully someone who knows what they're actually doing in pd can chime in and give you some guidance, I just wanted to say how much this post fascinated me and even gave me a few ideas!

Guest

Actually someone else on the forums discovered that adding noise makes whispering sounds. I don't have it in front of me right now so I can't credit the creator, but I found an incredibly nice abstraction of a formant synth that was my jumping off point for this project. It's the one that includes Turkish vowels. Comes out pretty convincing - sounds like the whisper voice on Mac.

Many thanks for this article, I had not seen it.

Here is an intriguing article I referenced that analyzes death metal growling, if you or anyone is interested:
http://courses.physics.illinois.edu/phys406/Student_Projects/Spring05/Chuck_Stelzner/Chuck_Stelzner_P498POM_Final_Report.pdf

acreil

Screaming often involves oscillation of the false vocal cords. This is also sometimes done in throat singing (kargyraa, I think). Example:

This sort of interacts with the normal vocal oscillation, so maybe you could mix two phasors (one an octave lower than the other for this example) and distort the result.
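
A rough sketch of that idea in Python rather than Pd: two sawtooth ramps an octave apart are summed and pushed through a soft clipper. The `tanh` waveshaper and the `sub_level` and `drive` parameters are assumptions for illustration, since the post does not say which kind of distortion to use.

```python
import math

SR = 44100  # sample rate in Hz (assumed)

def saw(freq, n, sr=SR):
    """Bipolar sawtooth in [-1, 1), like [phasor~] scaled and offset."""
    out, phase = [], 0.0
    for _ in range(n):
        out.append(2.0 * phase - 1.0)
        phase = (phase + freq / sr) % 1.0
    return out

def rough_source(f0, n, sub_level=0.7, drive=4.0, sr=SR):
    """Mix a saw at f0 with one an octave below, then soft-clip the sum."""
    upper = saw(f0, n, sr)        # stands in for the normal vocal folds
    lower = saw(f0 / 2.0, n, sr)  # stands in for the false vocal folds
    return [math.tanh(drive * (u + sub_level * l))
            for u, l in zip(upper, lower)]

samples = rough_source(220.0, 4410)
```

Because the clipping is nonlinear, the two ramps no longer simply add: the lower one modulates the shape of every other cycle of the upper one, which is one way to approximate the interaction described above.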

Or alternatively, apply noise modulation to the amplitude and pitch of the phasor. If you think of each period of the phasor as a glottal pulse, you could sample a noise source once per period, and use that for the amplitude modulation. I think this should have less of a whispery quality.
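
The per-period sampling could look something like this in Python (a sketch only, with `depth` and `pitch_depth` as made-up parameter names): a new random amplitude and a new random pitch offset are drawn each time the ramp wraps, and held for the whole following period.

```python
import random

def shimmer_saw(f0, n, depth=0.5, pitch_depth=0.05, sr=44100, seed=0):
    """Sawtooth whose amplitude and pitch are re-randomized once per period."""
    rng = random.Random(seed)
    out = []
    phase = 0.0
    amp = 1.0
    freq = f0
    for _ in range(n):
        out.append(amp * (2.0 * phase - 1.0))
        phase += freq / sr
        if phase >= 1.0:
            # The ramp wrapped: treat this as a new glottal pulse and
            # sample the noise source once, holding it until the next wrap.
            phase -= 1.0
            amp = 1.0 - depth * rng.random()
            freq = f0 * (1.0 + pitch_depth * rng.uniform(-1.0, 1.0))
    return out

samples = shimmer_saw(200.0, 4410)
```

Since the noise only changes at period boundaries, it roughens the pulse train without adding continuous broadband hiss, which is the reason to expect less of a whispery quality.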

Flipp

I'm curious about it as well.. So how is it going??
A true text-to-speech made with pd, that would be nice..
I tried do it with my own voice. It works - at least if you know what was written...
But it's not true tts since I just tried to combine sounds of letters in a more or less intelligent manner..

Will it be possible to create something like a voice-synth in pd?
Maybe something like: You write the text to be read out and then one can set some parameters like "scream<>whisper", "fast<>slow", "hi<>low", "drunk<>lisp" or whatever over time..
The "scream" things sound very interesting.

Guest

acreil, I did a search for 'false vocal cords'. It definitely seems right that they play a role. If they are thicker than true vocal cords perhaps they should have their own timbre, maybe warmer sounding. Perhaps if they are meant to interact then I should add them in series rather than parallel, and add feedback? Also, thanks for the throat singing example!

Adding two phasors worked pretty well, it actually produced its own distortion. I haven't worked on this in a while, but today I played around with some osc~ based amplitude modulation after my formant filters and it added some realism.

Flipp, I found out there is actually an external for speech synthesis.
http://puredata.hurleur.com/sujet-2092-speech-synthesis

I couldn't build it on my Mac, though it's apparently possible. But I wasn't actually thinking of going in this direction simply because I don't have enough programming experience. I just wanted to create some vowel sounds, or partial words.

Flipp

Thank you! It sounds good. Although it seems to be a little buggy, maybe because it's from 2008.. but it works, on Windows.
However I don't think it can scream etc.. It generally sounds pretty synthetic, in contrast to some great natural-sounding online TTS engines. I have no idea, but I guess it must be a different engine, or something..
Does anyone know if it is possible with pure pd (ext), even if it is a synthetic voice?! As long as it can scream (etc.) I'll be happy..

As I said, my version works by 'more or less intelligently' crossfading between the right sample files. So it's generally a time-domain approach. But I guess some fft would be better.. maybe something like a phase vocoder.. and some "morphing", whatever that means in this case.
One popular synth I found is the "vocaloid".

acreil

I found some papers that might help...

http://quod.lib.umich.edu/cgi/p/pod/dod-idx/vocal-fold-and-false-vocal-fold-vibrations-in-throat-singing.pdf?c=icmc;idno=bbp2372.2001.030

http://quod.lib.umich.edu/cgi/p/pod/dod-idx/synthesis-of-the-laryngeal-source-of-throat-singing-using.pdf?c=icmc;idno=bbp2372.2002.002

Flipp

Thanks! Some time ago I had found some comparable papers. But I can't find them anymore.. However they ended in physical modeling as well.. But I don't know precisely how the vocal tract produces its sounds.. But since I can't imitate guitar or bell sounds with my voice, I guess a digital waveguide ("karplus strong") will not be sufficient.. And for anything else, like mass-mass interactions, pd is not the best solution since it would imply single-sample loops, if I'm not mistaken..
So for pd, physically modeling these sounds does not seem to be the best..
Basically it should be one osc and one filter, so at first glance it doesn't sound like physical modeling is really needed (although it is always the ultimate solution, as long as you want to recreate 'real' sounds). It rather sounds like a vocoder would be ok... So did anyone try to get some spectra from her/his mouth?! Recording an impulse response of a vocal tract does not sound like the easiest task.. but theoretically it should work, if not with an impulse response, then maybe with white noise (I guess the phase is not that important) or a sine sweep.
Finally feed the filter with an osc (e.g. saw), or noise (for hiss sounds). The waveform could be recorded from your voice as well (just open your mouth as much as possible to eliminate the filter - I imagine this to look quite funny)
One just has to look at what happens to the filter if one changes the mouth shape.. so how to morph from one state to another..

Now if all this will not be working - why not?!

acreil

Here's another paper about modeling:

http://quod.lib.umich.edu/cgi/p/pod/dod-idx/voice-synthesis-usingthe-generalized-pressure-controlled.pdf?c=icmc;idno=bbp2372.2008.151

It's also worth pointing out that if you're doing PSOLA formant synthesis, you can randomly vary the amplitude of each windowed waveform segment. This is called spectral shimmer and can be used to model the random amplitude variation of each glottal pulse. Similarly you can randomly vary the frequency while preserving the formant (spectral jitter). I think large amounts of this might get closer to a screaming-type sound. Maybe you could even use some sort of chaotic oscillator to provide the spectral shimmer modulation.
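
As a loose illustration (not a full PSOLA implementation), the Python sketch below overlap-adds one fixed windowed grain at roughly pitch-period spacing. Randomizing each grain's gain gives the amplitude variation, and randomizing the spacing gives the frequency variation, while the grain itself, and therefore the formant, stays the same. The grain shape (a Hann-windowed sine burst) and the `jitter` and `shimmer` parameter names are assumptions.

```python
import math
import random

def grain(formant_hz, length, sr):
    """Hann-windowed sine burst whose spectrum peaks near formant_hz."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (length - 1)))
            * math.sin(2.0 * math.pi * formant_hz * i / sr)
            for i in range(length)]

def jittered_grains(f0, formant_hz, seconds,
                    jitter=0.1, shimmer=0.5, sr=44100, seed=0):
    """Overlap-add identical grains with random spacing and random gain."""
    rng = random.Random(seed)
    n = int(seconds * sr)
    out = [0.0] * n
    g = grain(formant_hz, int(2 * sr / f0), sr)  # two nominal periods long
    pos = 0.0
    while int(pos) < n:
        amp = 1.0 - shimmer * rng.random()       # per-pulse amplitude
        start = int(pos)
        for i, v in enumerate(g):
            if start + i >= n:
                break
            out[start + i] += amp * v
        # Per-pulse period variation; the grain (formant) is unchanged.
        pos += (sr / f0) * (1.0 + jitter * rng.uniform(-1.0, 1.0))
    return out

samples = jittered_grains(110.0, 800.0, seconds=0.2)
```

Swapping `rng` for a chaotic map, as suggested above, would only change how `amp` and the period offset are generated; the overlap-add loop stays the same.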

acreil

Another one...

http://quod.lib.umich.edu/cgi/p/pod/dod-idx/spectrographic-analysis-of-vocal-techniques-in-extreme-metal.pdf?c=icmc;idno=bbp2372.2012.015