Creating Cartoon Voices with Math

Have you ever wanted to create a humorous or entertaining voice like a cartoon character’s voice for a get-well video, a Valentine’s video, the narration for a DVD of home videos, an advertisement for your business or some other application? This article tells how to create cartoon voices using mathematics to shift the pitch of normal voices. The article includes the Octave source code for an Octave function chipmunk that applies pitch shifting to audio.

The standard audio pitch shifting incorporated in many commonly used audio editors such as the free open-source Audacity editor is presented in detail. The article also shows the results of using a more sophisticated algorithm that produces a more natural sounding pitch-shifted voice similar to the voice of the famous cartoon character Mickey Mouse.

One of the basic concepts and methods of signal and speech processing is the Fourier transform, named after the French mathematician and physicist Joseph Fourier. The basic concept is that almost any real function [tex] f(x) [/tex] encountered in practice can be represented as a sum of trigonometric sine and cosine functions. For example, a function [tex] f(x) [/tex] defined on the region [tex] (-L, L) [/tex] can be expanded as the sum of sines and cosines:

[tex]\displaystyle f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{n\pi x}{L}\right) + b_n \sin\left(\frac{n\pi x}{L}\right) \right][/tex]

where the coefficients [tex] a_n [/tex] and [tex] b_n [/tex] are known as Fourier coefficients. This expansion is known as a Fourier series; the continuous Fourier Transform extends the same idea to functions that are not periodic.

There is a discrete version of the Fourier Transform, often used in digital signal processing:

[tex]\displaystyle a_s=\frac{1}{\sqrt{n}}\sum_{r=1}^n u_r e^{2\pi i(r-1)(s-1)/n}[/tex]

where [tex]r[/tex] is the index of an array of discrete values such as audio samples, [tex] u_r [/tex] is the value of the [tex]r[/tex]th audio sample, [tex]s[/tex] is the index of the discrete Fourier coefficients [tex] a_s [/tex] and [tex] n [/tex] is the number of discrete values such as the number of audio samples in an audio “frame”. The index [tex] s [/tex] is essentially the frequency of the Fourier component. This version of the discrete Fourier Transform uses the mathematical identity:

[tex]\displaystyle e^{ix} = \cos(x) + i \sin(x) [/tex]

where

[tex]\displaystyle i = \sqrt{-1} [/tex]

to combine the cosine and sine function components into complex functions and numbers.
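
As a rough check on the discrete formula above, the following Octave sketch evaluates the sum directly and compares it with the built-in ifft function. The eight sample values are made up for illustration. Octave's ifft uses the same sign in the exponent as the formula above but scales by 1/n rather than 1/sqrt(n), so the two results should differ only by a factor of sqrt(n).

```octave
% Direct evaluation of the discrete Fourier transform as written above:
%   a_s = (1/sqrt(n)) * sum over r of u_r * exp(2*pi*i*(r-1)*(s-1)/n)
u = [1; 2; 0; -1; 3; 0.5; -2; 1];    % eight made-up "audio samples"
n = numel(u);
r = (1:n)';                          % sample index (column vector)
s = 1:n;                             % coefficient index (row vector)
kernel = exp(2*pi*1i*(r-1)*(s-1)/n); % n-by-n matrix, entry (r, s)
a = (u.' * kernel).' / sqrt(n);      % column vector of coefficients a_s

% ifft has the same sign convention but divides by n, not sqrt(n)
a_builtin = sqrt(n) * ifft(u);
max_difference = max(abs(a - a_builtin));
printf("largest difference from sqrt(n)*ifft(u): %g\n", max_difference);
```

The printed difference should be at the level of floating point rounding error.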

In audio signal processing such as speech or music, the Fourier Transform has a straightforward meaning. The sound is broken up into a combination of frequency components. In most instrumental music, this is very simple. The music is a collection of notes or tones with specific frequencies. Percussion instruments and certain other instruments can produce more complex sounds with many frequency components. A spectrogram of a signal such as speech or music shows time on the horizontal axis and frequency on the vertical axis; the strength of each frequency component is shown by the color or brightness at that point. This is the spectrogram of a pure 100 Hertz (cycles per second) tone:

Spectrogram of 100 Hz Tone

The spectrogram is generated using the specgram function in the Octave signal processing package by dividing the signal into a series of overlapping audio frames. Overlapping audio frames are frequently used to achieve better time resolution during signal processing in the Fourier domain. Each audio frame is windowed using the Hanning window to reduce spectral leakage, the smearing of energy into neighboring frequency bins caused by cutting the signal into finite frames.
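
The framing step can be sketched in a few lines of Octave. This is only an illustration of the idea, not the specgram implementation: the one second test tone and the variable names are assumptions, and the Hanning window is computed by hand so that the sketch does not need the signal package.

```octave
% Split a signal into overlapping frames and window each frame.
sampleRate = 44100;
fftSize = 2048;                      % samples per audio frame
numberOverlaps = 4;                  % each sample is covered by 4 frames
stepSize = fftSize / numberOverlaps; % hop between frame starts (512)
t = (0:sampleRate-1)' / sampleRate;  % one second of time stamps
x = sin(2*pi*100*t);                 % pure 100 Hz test tone

% Hanning window computed directly (signal package not required)
window = 0.5 - 0.5*cos(2*pi*(1:fftSize)' / (fftSize + 1));

numFrames = floor((numel(x) - fftSize) / stepSize) + 1;
frames = zeros(fftSize, numFrames);  % one windowed frame per column
for m = 1:numFrames
  startIndex = (m - 1)*stepSize + 1; % first sample of frame m
  frames(:, m) = x(startIndex:startIndex + fftSize - 1) .* window;
end
printf("%d frames of %d samples, hop %d\n", numFrames, fftSize, stepSize);
```

With four overlaps, consecutive frames share three quarters of their samples, which is why the spectrogram can follow changes in the sound more finely than the frame length alone would suggest.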

The Fourier transform is applied to each windowed audio frame, giving a series of frequency components, which are displayed on the vertical dimension of the spectrogram. Each frequency component is a bin in frequency covering a frequency range equal to the audio sample rate divided by the number of samples in the audio frame. This frequency bin size or frequency resolution of the Fourier transform is about 21.5 Hz in the spectrogram above (44100 samples per second/2048 samples in an audio frame = 21.533 cycles per second). Because the 100 Hz tone in the example is not perfectly centered in a frequency bin (it falls between the bins centered near 86 Hz and 108 Hz), the tone spreads out in the spectrogram, contributing to other bins as can be seen above. This is a limitation of the discrete Fourier transform which can lead to problems with signal processing such as pitch shifting.
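
The arithmetic above can be reproduced with a short Octave sketch. It assumes the same 44100 samples per second and 2048 sample frame, builds a single windowed frame of a 100 Hz tone, and reports which bin holds the peak and how much of the tone leaks into the neighboring bins. The variable names are illustrative only.

```octave
sampleRate = 44100;
fftSize = 2048;
freqResolution = sampleRate / fftSize;        % about 21.533 Hz per bin
toneFrequency = 100;
binPosition = toneFrequency / freqResolution; % fractional bin, about 4.64

n = (0:fftSize-1)';
frame = sin(2*pi*toneFrequency*n/sampleRate);
window = 0.5 - 0.5*cos(2*pi*(1:fftSize)' / (fftSize + 1)); % Hanning window
spectrum = abs(fft(frame .* window));

% Rows 1..fftSize/2 hold the non-negative frequencies; row k+1 is bin k.
[peakValue, peakRow] = max(spectrum(1:fftSize/2));
peakBin = peakRow - 1;
printf("resolution %.3f Hz, tone at bin %.2f, peak in bin %d (%.1f Hz)\n", ...
       freqResolution, binPosition, peakBin, peakBin*freqResolution);
% Neighboring bins also carry energy because the tone is off-center.
printf("relative strength in bins 3..6: %s\n", ...
       mat2str(spectrum(4:7)' / peakValue, 3));
```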

Speech has a much more complex structure than a pure tone. In fact, the structure of speech remains poorly understood, which is why current (2011) speech recognition systems perform poorly in realistic field conditions compared to human beings. This spectrogram shows the structure of the introduction to United States President Barack Obama’s April 2, 2011 speech on the energy crisis: “Hello everybody. I’m speaking to you today from a UPS customer center in Landover, Maryland where I came to talk about an issue that is affecting families and businesses just like this one — the rising price of gas and what we can…”.

President Obama on the Rising Price of Gas

The spectrogram below shows the region from 0 to 600 cycles per second (Hertz). One can see a series of bands in the spectrogram. These bands are located at integer multiples (1, 2, 3, …) of the lowest frequency band, which is often referred to as F0 in the scholarly speech literature. The bands are known as the harmonics. F0 is known as the fundamental frequency. This is the frequency of vibration of the glottis which provides the driving sound for speech and is located in the throat. The glottis vibrates at frequencies ranging from as low as 80 cycles per second (Hertz) in some men to as high as 400 cycles per second (Hertz) in some women and children. This fundamental frequency appears to be loosely correlated with the height of the speaker, higher for short speakers such as children and lower for taller women and men.
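
The harmonic structure can be imitated with a toy signal. The Octave sketch below is not a model of real speech; it simply adds four sine waves at integer multiples of an assumed fundamental frequency of 150 Hz, with made-up strengths, and confirms that the strongest Fourier components sit at 1, 2, 3 and 4 times F0. The sample rate and one second duration are chosen so that each frequency bin is exactly 1 Hz wide.

```octave
sampleRate = 8000;
f0 = 150;                                % assumed fundamental frequency in Hz
t = (0:sampleRate-1)' / sampleRate;      % exactly one second of samples
amplitudes = [1.0 0.6 0.4 0.25];         % made-up harmonic strengths
x = zeros(size(t));
for h = 1:numel(amplitudes)
  x = x + amplitudes(h) * sin(2*pi*h*f0*t);
end
spectrum = abs(fft(x));
half = spectrum(1:sampleRate/2);         % 1 Hz per row; row k+1 is k Hz
[sortedValues, order] = sort(half, 'descend');
peakFrequencies = sort(order(1:4) - 1)'; % the four strongest components
disp(peakFrequencies);                   % frequencies in Hz
disp(peakFrequencies / f0);              % ratios to the fundamental
```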

The fundamental frequency F0 fluctuates in a rhythmic pattern that is not well understood as people speak. In some languages such as Mandarin Chinese, the changing pitch conveys meaning; a word with rising pitch has a different meaning from an otherwise identical word with falling pitch. In English, a rising pitch at the end of a phrase or sentence indicates that a question is being asked. “The chair.” is pronounced with falling pitch whereas “The chair?” is pronounced with a rising pitch at the end. It is difficult and even sometimes impossible to understand English if the rhythmic pattern of the fundamental frequency or pitch is abnormal.

President Obama on the Rising Price of Gas (to 600 CPS)

This spectrogram shows President Dwight David Eisenhower saying “in the councils of government we must guard against the acquisition of unwarranted influence, whether sought or unsought, by the military industrial complex” from his Farewell Address, January 17, 1961, probably his most famous phrase and his most famous speech today.

Eisenhower on the Military Industrial Complex

This spectrogram shows the same passage in the range 0 to 600 Hertz (cycles per second). Again, one can easily see the repeating bands.

Eisenhower on the Military Industrial Complex (to 600 CPS)

Human beings perceive something which we call “pitch” in English which appears closely related to or identical to the center frequency of the F0 band in the spectrogram. The F0 band will be higher in higher pitched speakers such as many women and most children. Both President Obama and President Eisenhower have similar pitches, varying between 200 and 75 Hertz with an average of about 150 Hertz. Nonetheless, their voices sound very different. The F0 band can be as low as 70 or 80 Hertz (cycles per second) in a few speakers. Former California governor and actor Arnold Schwarzenegger used an extremely low pitched voice while playing the Terminator, his most famous role.

In general, low pitched voices tend to convey seriousness and sometimes menace whereas high pitched voices tend to convey less seriousness, although there are exceptions. The voice of the genocidal Daleks in the BBC’s Doctor Who series is both high pitched and menacing at the same time. Cartoon style voices can be created by shifting the pitch of normal speakers. This has been done for the Alvin and the Chipmunks characters created by Ross Bagdasarian Sr. It is probable that some form of pitch shifting has been used over the years to create some of the voices of the Daleks on Doctor Who. Some robot voices have probably been created by combining pitch shifting with other audio effects.

Traditional Pitch Shifting

Pitch shifting predates the digital era. In the analog audio era, one could shift the pitch of a speaker by playing a record or tape faster or slower than normal. This shifts the pitch but also changes the tempo — speed or rate of speaking — as well. One can achieve a pure pitch shift by, for example, recording a voice performer speaking at half normal speed and then playing the recording back at twice the normal rate. In this case, the pitch will be shifted up by a factor of two and the tempo or rate of speaking will be normal. One can create the Alvin and the Chipmunks high pitched voice in this way using analog tapes or records. One can also create lower pitched voices by appropriately combining the tempo of the original voice and the playback rate of the recording. Although these voices are easily understandable, they have artificial, electronic qualities not found in normal low or high pitched speakers or voice performers intentionally creating a low or high pitched voice. The voice of Walt Disney’s Mickey Mouse was performed by a series of voice artists starting with Walt Disney himself. This high pitched voice sounds much more natural than the Alvin and the Chipmunks voice.
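
The effect of playing a recording at twice the normal rate can be imitated digitally by keeping every other sample. The Octave sketch below uses a 100 Hz tone as a stand-in for a voice; the sample rate and variable names are assumptions for illustration. A real recording would normally be low-pass filtered first, since simply dropping samples can fold high frequencies back into the audible range.

```octave
sampleRate = 8000;
t = (0:sampleRate-1)' / sampleRate;      % one second
x = sin(2*pi*100*t);                     % 100 Hz tone standing in for a voice
y = x(1:2:end);                          % keep every other sample

originalSeconds = numel(x) / sampleRate; % duration at the normal rate
fastSeconds = numel(y) / sampleRate;     % duration after the speed-up

% Locate the strongest frequency component of the sped-up signal.
spectrum = abs(fft(y));
[peakValue, peakRow] = max(spectrum(1:floor(numel(y)/2)));
peakHz = (peakRow - 1) * sampleRate / numel(y);
printf("duration %.1f s -> %.1f s, pitch 100 Hz -> %g Hz\n", ...
       originalSeconds, fastSeconds, peakHz);
```

The pitch doubles and the duration halves, which is why the performer has to speak at half speed for the final tempo to come out normal.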

In digital audio, it is possible to shift the pitch of the voice without changing the tempo of the speech. This can be done by manipulating the Fourier transform of the speech, the spectrogram, and converting back to the “time domain,” the actual audio samples. One can simply shift the Fourier components from their original frequency bin in the spectrogram to an appropriate higher or lower frequency bin. For example, if a Fourier component is in the 100 Hz bin, one shifts this Fourier component value to the 200 Hz bin to double the pitch. This must be done for each and every non-zero Fourier component. In general, this will produce a recognizable pitch shifted voice. If the Fourier components are not centered in each bin, which is normally the situation, this pitch shifted voice will have an annoying beat or modulation. It is necessary to perform some additional mathematical acrobatics to compensate for these effects to produce a relatively smooth pitch shifted voice similar to the output of the analog processing described above.
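
The bin shifting step can be sketched for a single audio frame as follows. To keep the example clean, the test tone is deliberately centered on bin 5, so none of the compensation described above is needed; the frame size, the bin number and the variable names are assumptions for illustration, not part of any particular audio editor.

```octave
fftSize = 2048;
pitchShift = 2.0;
sourceBin = 5;                                % tone centered on bin 5
n = (0:fftSize-1)';
frame = cos(2*pi*sourceBin*n/fftSize);

spectrum = fft(frame);
half = spectrum(1:fftSize/2);                 % bins 0 .. fftSize/2 - 1
shifted = zeros(size(half));
for k = 0:fftSize/2 - 1
  target = round(k*pitchShift);               % destination bin
  if target <= fftSize/2 - 1
    shifted(target + 1) = shifted(target + 1) + half(k + 1);
  end
end

% Rebuild the full conjugate-symmetric spectrum and return to samples.
fullSpectrum = zeros(fftSize, 1);
fullSpectrum(1:fftSize/2) = shifted;
fullSpectrum(fftSize/2 + 2:end) = conj(flipud(shifted(2:end)));
newFrame = real(ifft(fullSpectrum));

newSpectrum = abs(fft(newFrame));
[peakValue, peakRow] = max(newSpectrum(1:fftSize/2));
printf("tone moved from bin %d to bin %d\n", sourceBin, peakRow - 1);
```

For a tone that is not centered in a bin, the same procedure moves the leaked energy as well, which is one way to picture where the beat or modulation comes from.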

This video is President Obama’s original introduction from his April 2, 2011 speech on the energy crisis.

This video is President Obama speaking with his pitch doubled by shifting the Fourier components but without the mathematical acrobatics to compensate for un-centered frequency components:

This video is President Obama speaking with a chipmunked voice; his pitch has been doubled.

This video is President Obama speaking with a deep voice; his pitch has been reduced to seventy percent of normal.

Octave is a free open-source numerical programming environment that is mostly compatible with MATLAB. The Octave source code below, the Octave function chipmunk, implements the standard pitch shifting algorithm in widespread use. The Octave code requires both Octave and the Octave Forge signal processing package for the specgram function which computes the spectrogram of the signal.

The videos in this article were created by downloading the original MPEG-4 videos from the White House web site and splitting the audio and video into a MS WAVE file and a sequence of JPEG still images using the FFMPEG utility. Presidential speeches and video are in the public domain in the United States. The original still images were reduced in size by half using the ImageMagick convert utility. The audio was pitch shifted in Octave using the chipmunk function below. The new audio and video were recombined into the MPEG-4 videos in this article by again using the FFMPEG utility. Variants of this pitch shifting algorithm can be found in many programs including the widely used free open-source Audacity audio editor (the Audacity pitch shifting algorithm may be slightly different from the algorithm implemented below):

function [ofilename, new_phase, output] = chipmunk(filename, pitchShift, fftSize, numberOverlaps, thresholdFactor)
% [ofilename, new_phase, output] = chipmunk(filename [,pitchShift , fftSize, numberOverlaps, thresholdFactor]); 
%
% chipmunk audio effect (as in Alvin and the Chipmunks)
%
% ofilename -- name of output file with pitch shifted audio
% new_phase -- the recomputed phases for the pitch shift audio (for debugging)
% output -- the pitch shifted audio samples
%
% arguments:
%
% filename -- input file name (MS Wave audio file)
% pitchShift -- frequency/pitch shift (default=2.0)
% fftSize -- size of FFT (default = 2048)
% numberOverlaps -- number of overlaps (default = 4)
% thresholdFactor -- threshold factor for zeroing silence frames 
%
% $Id: chipmunk.m 1.44 2011/08/04 01:25:35 default Exp default $
% (C) 2011 John F. McGowan, Ph.D.
% E-Mail: jmcgowan11@earthlink.net
% Web: https://www.jmcgowan.com/
%

if nargin < 2
	pitchShift = 2.0; % frequency shift
end
nPitchShift = uint32(pitchShift*100); % to write output file

if nargin < 3
	fftSize = 2048; % size of audio blocks/FFT size
end

if nargin < 4
	numberOverlaps = 4; % number of overlaps
end

if nargin < 5
	thresholdFactor = 0.002;
end

printf("pitchShift: %f fftSize: %d numberOverlaps: %d thresholdFactor: %f\n", pitchShift, fftSize, numberOverlaps, thresholdFactor);
fflush(stdout);

stepSize = fftSize/numberOverlaps;
phaseShift = 2.0*pi*(stepSize/fftSize);

printf("loading %s\n", filename);
fflush(stdout);

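% fileparts avoids the trailing spaces that char(strsplit(...)) pads onto short names
[filedir, filestem] = fileparts(filename);
if ~isempty(filedir)
	filestem = [filedir filesep filestem];
end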
ext = sprintf("_oct_%d_%d_%d.wav", nPitchShift, fftSize, numberOverlaps);
ofilename = [filestem ext];

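% note: wavread/wavwrite were removed from recent Octave releases, where audioread/audiowrite are the replacements
[data, sampleRate, bits] = wavread(filename);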

freq_resolution = sampleRate / fftSize; % frequency resolution = sample rate / fft size

if columns(data) > 1
	raw_data = data(:,1); % input is stereo with 2 channels in 2 columns of array
else
	raw_data = data; % mono sound input
end
data = [];
clear data; % free memory

mx_input = max(abs(raw_data(:)));

printf("applying fft\n");
fflush(stdout);
%spectrogram = fft(spectrogram);

overlap = fftSize - stepSize;
printf("stepSize: %d overlap is %d\n", stepSize, overlap);
fflush(stdout);

nsamples = length(raw_data);

% hanning window
window = hanning(fftSize); % window the output
window = (numel(window)/sum(window(:)) )*window; % normalize the window

% use Octave signal package specgram function to apply fft to windowed overlapping frames
% the normalized hanning window computed above is passed in explicitly
%
[spectrogram, f, t] = specgram(raw_data, fftSize, sampleRate, window, overlap);

printf("spectrogram has dimensions %d %d\n", rows(spectrogram), columns(spectrogram));
fflush(stdout);

% free memory
raw_data = [];
clear raw_data;

intensity = dot(spectrogram, spectrogram, 1); % each column is an audio frame
max_intensity = max(intensity(:));
threshold = thresholdFactor*max_intensity;

speech_frames = intensity > threshold;

printf("speech_frames has dimensions: %d %d \n", rows(speech_frames), columns(speech_frames));
fflush(stdout);

printf("zeroing silence frames...\n");
fflush(stdout);

speech_frames = repmat(speech_frames,rows(spectrogram), 1);

spectrogram = spectrogram .* speech_frames; 

printf("dimensions spectrogram are now: %d %d \n", rows(spectrogram), columns(spectrogram));
fflush(stdout);

printf("computing phase...\n");
fflush(stdout);

% spectrogram is half-array without duplicate fft coefficients
% 1:fftSize/2 rows, number time steps columns
% each row is an fft coefficient
%
magn = 2.*abs( spectrogram ); % magnitude of fft coefficients
phase = arg( spectrogram ); % phase of fft coefficients

previous_phase = zeros(size(phase));
previous_phase(:,2:end) = phase(:,1:end-1);

phaseShifts = (0:(fftSize/2)-1)*phaseShift; % expected phase shift if frequency component is centered in bin
phaseShifts = repmat(phaseShifts', 1, columns(phase));

spec_buf = phase - previous_phase; % change in phase from previous time step
spec_buf = spec_buf - phaseShifts; % difference between change in phase and expected phase change
         % if frequency component is centered in frequency bin

printf("computing phase adjustment\n");
fflush(stdout);									
% handle mapping to -pi to pi range of atan2/arg (below)
% fix() truncates toward zero and keeps negative values, which uint32() would clamp to 0
phase_adjust = fix(spec_buf./pi); % 0 if spec_buf between -pi and pi
phase_adjust = phase_adjust + sign(phase_adjust).*mod(abs(phase_adjust), 2); % push odd multiples away from zero

spec_buf = spec_buf - pi*phase_adjust;
spec_buf = numberOverlaps*spec_buf./(2*pi);

printf("computing corrected frequencies\n");
fflush(stdout);
% compute corrected frequency 
frequencies = repmat(f',1,columns(spectrogram)); % f is row vector when returned by specgram

spec_buf = frequencies + spec_buf*freq_resolution;

corrected_freq = spec_buf;

printf("applying frequency shift\n");
fflush(stdout);

shifted_magn = zeros(size(magn));
shifted_freq = zeros(size(corrected_freq));


oldTime = time;
for k = 1:fftSize/2
	ind = uint32((k-1)*pitchShift) + 1;
	if (ind <= fftSize/2)
		shifted_magn(ind,:) += magn(k,:);
		shifted_freq(ind,:) = corrected_freq(k,:) * pitchShift;
	end
	newTime = time;
	deltaTime = newTime - oldTime;
	if (deltaTime > 1)
		pct = (k / (fftSize/2))*100.0; % percent progress (loop runs over fftSize/2 bins)
		printf("frequency shift: processed %3.1f%% %d/%d\n", pct, k, fftSize/2);
		fflush(stdout);
		oldTime = time;
	end % end if
end
 
%shifted_freq = corrected_freq * pitchShift;

% now convert from mag and freq to mag and phase
%
printf("computing new phase\n");
fflush(stdout);

spec_buf = zeros(size(spectrogram)); % make sure start with zeros

printf("new phase: assigning shifted frequencies\n");
fflush(stdout);

spec_buf(2:end,:) = shifted_freq(2:end,:);

printf("new phase: subtracting center frequencies\n");
fflush(stdout);

spec_buf(2:end,:) = spec_buf(2:end,:) - (frequencies(2:end,:) );

printf("new phase: dividing by frequency resolution\n");
fflush(stdout);

spec_buf(2:end,:) /= freq_resolution;

printf("new phase: adjusting for overlap\n");
fflush(stdout);

spec_buf(2:end,:) = 2.*pi*spec_buf(2:end,:)/numberOverlaps;

printf("new phase: computing delta phase\n");
fflush(stdout);

delta_phase = spec_buf + phaseShifts;

%delta_phase = phaseShifts;

new_phase = delta_phase;

printf("new phase: adding delta phase\n");
fflush(stdout);

%new_phase = spec_buf;
new_phase = zeros(size(spec_buf));
% % %new_phase(:,1) = spec_buf(:,1);
% % %dc coefficient has no phase (always a non-negative real)
oldTime = time;
ncols = columns(spec_buf);
for i = 2:ncols
	new_phase(2:end,i) = new_phase(2:end,i-1) + delta_phase(2:end,i-1);
	newTime = time;
	deltaTime = newTime - oldTime;
	if (deltaTime > 1)
		pct = (i / ncols)*100.0; % percent progress
		printf("new phase: processed %3.1f%% %d/%d\n", pct, i, ncols);
		fflush(stdout);
		oldTime = time;
	end % end if
end

spec_buf = [];
clear spec_buf; % free memory

new_spectrogram = zeros(fftSize, columns(spectrogram)); % allocate full fft array for inverse fft

new_spectrogram(1,:) = shifted_magn(1,:); % dc coefficient
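% use 1i for the imaginary unit: the loop above reuses i as a counter, so a bare i is no longer sqrt(-1) here
new_spectrogram(2:fftSize/2,:) = shifted_magn(2:end,:).*cos(new_phase(2:end,:)) + 1i*shifted_magn(2:end,:).*sin(new_phase(2:end,:));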

new_spectrogram(fftSize/2 + 2:end,:) = conj(flipud(new_spectrogram(2:fftSize/2,:))); % reflect fft coefficients

spectrogram = [];
clear spectrogram;

% INVERSE FFT
%
printf("applying inverse fft\n");
fflush(stdout);

new_data = real(ifft(new_spectrogram))/fftSize; 

printf("dimensions new_data are %d %d\n", rows(new_data), columns(new_data));
fflush(stdout);

new_spectrogram = [];
clear new_spectrogram;

% each column is an audio frame which may overlap with previous audio frame by overlap samples
%

iframe = 1; % start at frame 1
it = 1; % start at first sample of output
output = zeros(nsamples,1); % all rows, 1 column

printf("applying overlap and add...\n");
fflush(stdout);

while( (it+fftSize-1) < nsamples)
	update = (new_data(:,iframe).*window)/numberOverlaps; % row of audio data
	output(it:it+fftSize-1) = output(it:it+fftSize-1) + update(1:fftSize);
	it = it + stepSize; % advance to next time
	iframe = iframe + 1; % advance to next audio frame (column of new_data)
end % while

new_data = [];
clear new_data;

mx = max(abs(output(:)));
%mean = sum(abs(output(:)))/numel(output);

if mx > 0
	% rescale so the output peak matches the input peak (avoids clipping and very quiet output)
	scale_factor = mx / mx_input;
	printf("scaling output by %f\n", 1.0/scale_factor);
	fflush(stdout);
	output = output / scale_factor;
end

printf("writing shifted audio to %s\n", ofilename);
fflush(stdout);
%
wavwrite(output, sampleRate, bits, ofilename);

disp('ALL DONE');
end % function
%

The screenshot below shows the chipmunk function running in Octave 3.2.4 on a PC under Windows XP Service Pack 2. The function is called from the Octave prompt using the default values of the function’s arguments. The argument numberOverlaps controls the mathematics to compensate for the uncentered frequency components. If numberOverlaps is one, there is no compensation. The larger numberOverlaps, the more effective the compensation. The more overlaps, the more computer time and resources required by the pitch shifting. A value of numberOverlaps of thirty-two (32) was used to pitch shift President Obama’s voice in the chipmunked video above.

Running the Chipmunk Function in Octave
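
The compensation controlled by numberOverlaps rests on comparing the phase of a frequency bin in two consecutive overlapping frames. If the component were centered in the bin, its phase would advance by a known amount from one frame to the next; any extra advance reveals how far the true frequency lies from the bin center. The Octave sketch below illustrates this for the 100 Hz tone used earlier. It is a simplified illustration with assumed variable names, not an extract from the chipmunk function.

```octave
sampleRate = 44100;
fftSize = 2048;
numberOverlaps = 4;
stepSize = fftSize / numberOverlaps;
freqResolution = sampleRate / fftSize;
toneFrequency = 100;                              % not centered in any bin

n = (0:fftSize + stepSize - 1)';
x = sin(2*pi*toneFrequency*n/sampleRate);
window = 0.5 - 0.5*cos(2*pi*(1:fftSize)' / (fftSize + 1)); % Hanning window

% Two consecutive overlapping frames, one step apart.
frameA = fft(x(1:fftSize) .* window);
frameB = fft(x(stepSize + 1:stepSize + fftSize) .* window);

k = round(toneFrequency / freqResolution);        % nearest bin (0-based)
expected = 2*pi*k*stepSize/fftSize;               % phase advance if centered
deviation = arg(frameB(k + 1)) - arg(frameA(k + 1)) - expected;
deviation = mod(deviation + pi, 2*pi) - pi;       % wrap into [-pi, pi)
binOffset = numberOverlaps * deviation / (2*pi);  % offset in units of bins
estimatedHz = (k + binOffset) * freqResolution;
printf("bin center %.2f Hz, estimated true frequency %.2f Hz\n", ...
       k*freqResolution, estimatedHz);
```

The estimate should land very close to 100 Hz even though the nearest bin is centered near 107.67 Hz. With only one overlap the frames are a full frame apart, and in this picture the wrapped phase difference can no longer separate neighboring bins, which is consistent with the remark above that a value of one gives no compensation.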

Although easily understandable, these pitch-shifted voices sound somewhat artificial. Indeed, this artificial quality is part of the appeal of the Alvin and the Chipmunks voice.

Pitch Shifting Gets Better

Pitch shifting algorithms have improved. It is now possible to produce voices that sound much more like natural voices at the desired new pitch, very similar to the voice of Mickey Mouse. This video is President Obama speaking with a voice similar to the voice of Mickey Mouse:

This particular pitch shifting algorithm does better with producing natural sounding high pitched voices than low pitched voices.

Conclusion

There are many ways to manipulate voices using mathematics. One of the most common is pitch shifting, which has been described in detail including working source code above. Traditional pitch shifting algorithms give artificial qualities to the pitch-shifted voice. There are now new, improved algorithms that can create more natural sounding pitch-shifted voices. These voices can be used for humor, entertainment, or emphasis in movies, television, video games, video advertisements for small businesses, personal and home video, and in many other applications.

© 2011 John F. McGowan

About the Author

John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing video compression and speech recognition technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.
