Design And Development Of Voice Transformation.

DESIGN AND DEVELOPMENT OF VOICE TRANSFORMATION

LILY LING AI LING

This report is submitted in partial fulfillment of the requirements for the award of
Bachelor of Electronic Engineering (Computer Engineering) With Honours

Faculty of Electronic Engineering and Computer Engineering
Universiti Teknikal Malaysia Melaka

April 2009

DESIGN AND DEVELOPMENT OF VOICE TRANSFORMATION
Session of Study: 2008/2009

I, LILY LING AI LING (BLOCK LETTERS), agree to allow this Bachelor's Degree Project Report to be kept at the Library under the following conditions of use:
1. The report is the property of Universiti Teknikal Malaysia Melaka.
2. The Library is permitted to make copies for study purposes only.
3. The Library is permitted to make copies of this report as exchange material between institutions of higher learning.
4. Please tick ( ):

CONFIDENTIAL* (Contains information of security classification or of importance to Malaysia as set out in the OFFICIAL SECRETS ACT 1972)

RESTRICTED* (Contains restricted information as determined by the organisation/body where the research was carried out)

UNRESTRICTED

__________________________
(AUTHOR'S SIGNATURE)

Permanent Address: ……………………………......
……………………………......

Certified by:

___________________________________
(SUPERVISOR'S STAMP AND SIGNATURE)

“I hereby declare that this report is the result of my own work except for quotes as cited
in the references”

Signature : …………………………………
Author : Lily Ling Ai Ling
Date : 27 April 2009

“I hereby declare that I have read this report and in my opinion this report is sufficient
in terms of the scope and quality for the award of Bachelor of Electronic Engineering
(Computer Engineering) With Honours.”

Signature : …………………………………
Supervisor’s Name : Mdm Juwita Bt Mohd Sultan
Date : 27 April 2009

Dedicated to my beloved family members, especially my father and mother, and also to my
friends.

ACKNOWLEDGEMENT

First of all, I would like to thank my supervisor, Madam Juwita binti Mohd
Sultan, for her valuable guidance in completing the project and thesis. I am especially
grateful to my beloved father, mother and family members for all their
support, patience and understanding regarding my study load and research work.
I would like to acknowledge the contributions of my classmates at Universiti
Teknikal Malaysia Melaka, whose great efforts, priceless support and help made the
successful completion of this project possible.
Lastly, thanks to my dearest friends Leong Eng Chui and Pang Pek Hong for
their help, guidance and ideas. To those with whom I did not have the pleasure of
interacting personally, your contributions are nevertheless extremely admirable and valuable to me.

ABSTRACT

This project is the DSP implementation of innovative algorithms for voice
transformation in real time. Voice transformation is the process of transforming the
characteristics of speech uttered by a source speaker such that a listener would believe
the speech was uttered by a target speaker. In this project, two aspects of the
transformation problem are addressed: voice quality and intonation. The main steps of
the project are developing a method for high quality voice transformation and
designing a suitable algorithm in Matlab/Simulink. Voice transformation technology
is used increasingly widely in many fields. Yet the source voice patterns after
transformation may still exhibit a substantial degree of variance from the target speaker. The
objective of this project was to develop a digital voice transformation program in
Matlab that is able to transform the voice of a source speaker into that of a target speaker. Matlab
provided the necessary tools to record, filter and analyze different voice
samples and compare them to the archived sample. Research on the related topics is
carried out before the program is designed in Matlab, and troubleshooting is done if any
error occurs. At the end of this project, a method for high quality voice
transformation will be implemented and a suitable algorithm will be designed in
Matlab/Simulink.

ABSTRAK

This project aims to produce a DSP algorithm that can perform voice
transformation. Voice transformation is the process of changing the characteristics of a
speaker's voice so that a listener will believe that the voice was
produced by the target speaker. Voice quality and intonation are the two main aspects
addressed in this project. The main steps to complete the project
include a method for producing high quality voice transformation and
the development of an algorithm in Matlab/Simulink. This technology has been used in
many fields, but the accuracy of voice transformation results is not yet
satisfactory. The objective of this project is to produce a voice transformation system in
Matlab that can record, filter, analyze and compare voices.
All related research is carried out before the system is developed.
The outcome of this project is a complete project with
a method for achieving high quality voice transformation and a suitable algorithm
designed in Matlab/Simulink.


CONTENTS

TITLE
REPORT STATUS VERIFICATION FORM
DECLARATION
SUPERVISOR VERIFICATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SHORT FORM

CHAPTER 1  INTRODUCTION
1.1  Introduction of Project
1.2  Objective of Project
1.3  Problem Statement
1.4  Scope
1.5  Methodology
1.6  Thesis Outline

CHAPTER 2  LITERATURE REVIEW
2.1  Introduction of Voice Transformation
2.2  Speech Model
2.3  Speaker Characteristics
2.4  Component of Voice Conversion System
  2.4.1  Feature Extraction
  2.4.2  Model Estimation
  2.4.3  Voice Mapping
2.5  Existing Voice Transformation Systems
  2.5.1  Voice Quality Conversion
    2.5.1.1  Representation of Speech
    2.5.1.2  Mapping Method
2.6  Transforming the Spectral Envelope
  2.6.1  Computing Transformation Parameters
  2.6.2  Unvoiced Section Transformation
2.7  Intonation Transformation
2.8  Sample Rate Conversion
2.9  Pitch and Frequency
  2.9.1  Pitch Range
2.10  Pitch Synchronous Overlap Add (PSOLA)
2.11  Virtual Dubbing Process
  2.11.1  Advantage of Virtual Dubbing
2.12  Application of Voice Transformation
  2.12.1  Text to Speech Adaptation
  2.12.2  Speaker Identification System
2.13  Matlab
  2.13.1  History of Matlab
  2.13.2  Rules on Variable and Function Names
  2.13.3  Graphics
  2.13.4  Character Set
  2.13.5  Commenting in MATLAB Editor
2.14  Graphical User Interface
  2.14.1  Elements of GUI

CHAPTER 3  METHODOLOGY
3.1  Introduction
3.2  Project methodology
  3.2.1  Collect information
  3.2.2  Understand basic of voice Transformation
  3.2.3  Design source code
  3.2.4  Testing the program
3.3  Monitoring program flow chart
3.4  Software
  3.4.1  Matlab
3.5  Voice Analysis and Voice Mapping

CHAPTER 4  RESULTS AND ANALYSIS
4.1  Introduction
4.2  Results
4.3  Analysis of the Results

CHAPTER 5  CONCLUSION AND SUGGESTION
5.1  Introduction
5.2  Conclusion and Recommendation

REFERENCES

APPENDIX
Appendix A: Source for the main program (Voice Transformation System)
Appendix B: Source Code for Sub Program (Load The File)

LIST OF FIGURES

No.  TITLE

2.1  Human vocal tract
2.2  TD-PSOLA Transformation of pitch, intonation and duration Parameters
2.3  Virtual Dubbing Block Diagrams
2.4  Example of comment with comment symbol
2.5  Example of using Matlab editor to select group of line
2.6  Example of comment out part of statement
2.7  Comment out text within a multiline statement
3.1  Flow chart of project
3.2  Flow chart of program
4.1  Blank GUI (default)
4.2  GUI Window
4.3  Property Inspector for Voice Transformation System
4.4  Drawing GUI in GUIDE Template
4.5  Output GUI for the Voice Transformation System
4.6  GUI when “Load Voice” was clicked
4.7  Output when either one of the option was clicked
4.8  Signal waveform of the user before and after the transformation (Cartoon Voice for 5 seconds)
4.9  Signal waveform of the user before and after the transformation (Cartoon Voice for 15 seconds)
4.10  Signal waveform of the user before and after the transformation (Man to Woman Voice for 5 seconds)
4.11  Signal waveform of the user before and after the transformation (Man to Woman Voice for 15 seconds)
4.12  Signal waveform of the user before and after the transformation (Woman to Man Voice for 5 seconds)
4.13  Signal waveform of the user before and after the transformation (Woman to Man Voice for 15 seconds)
4.14  The wav file that had prerecorded and save in the file.
4.15  Signal waveform of the user before and after the transformation (Load the file to transform the voice to cartoon voice)
4.16  Signal waveform of the user before and after the transformation (Load the file to transform the voice from woman to man voice)
4.17  Signal waveform of the user before and after the transformation (Load the file to transform the voice from man to woman voice)

LIST OF TABLES

No.  TITLE

2.1  Lists of Operator

LIST OF SHORT FORM

DSP - Digital Signal Processing
DFT - Discrete Fourier Transform
DTW - Dynamic Time Warping
EM - Expectation Maximization
FFT - Fast Fourier Transform
FIR - Finite Impulse Response
GMM - Gaussian Mixture Model
HMM - Hidden Markov Model
HNM - Harmonic plus Noise Model
IIR - Infinite Impulse Response
LPC - Linear Predictive Coding
MFCC - Mel Frequency Cepstral Coefficients
RELP - Residual Excited Linear Prediction
PSOLA - Pitch Synchronous Overlap Add

CHAPTER 1

INTRODUCTION

1.1 Introduction of Project

Speech is the most widely used means of communication between people. We are born with the
ability to speak, learn the skill easily during early childhood and communicate
with each other mostly through speech throughout our lives. With the development of
communication technologies in recent years, speech has started to become an important interface
for many systems. Compared with a variety of complex interfaces, speech is an easier way to
communicate with computers.

This project is the DSP implementation of innovative algorithms for voice
transformation in real time. The entire set of operations represents a particular
implementation of the so-called Virtual Dubbing procedure. Voice transformation is
the process of transforming the characteristics of speech uttered by a source speaker
such that a listener would believe the speech was uttered by a target speaker. In this
project, two aspects of the transformation problem are addressed: voice quality and
intonation. The main steps of the project are developing a method for high quality
voice transformation and designing a suitable algorithm in Matlab/Simulink.

1.2 Objectives of Project

The objectives of this project are:
a. To design and develop the algorithm for a high quality voice
transformation system.
b. To analyze the resulting signal after transformation.

1.3 Problem Statement

Voice transformation technology is now used increasingly widely in many
fields, for example in the virtual dubbing process and in text to speech programs.

Besides the noise introduced by microphone devices, other factors can also affect
the quality of voice samples. Examples include
mispronounced verbal phrases, the use of different media for enrollment and verification
(for instance, a land line telephone for the enrollment process but a cell phone for
the verification process), and the emotional and physical condition of the
individual.

1.4 Scope

The system implemented in this project is a user independent system which
can transform any voice into the desired voice. The devices intended for
capturing an individual's voice samples are computer microphones. Two
important aspects of the transformation problem are addressed: voice quality and
intonation.

The user can choose between two transformation options and record for 5 or 15
seconds. This project mainly discusses the algorithm of the system
and therefore does not include hardware design.

1.5 Methodology

First, after the title of the project was confirmed, research on the topic
was carried out by gathering important information from journals, reference books and internet
resources. The features of Matlab and the basic concepts of voice transformation were studied.

After that, the graphical user interface (GUI) and source code were designed in
Matlab. The program was checked and troubleshooting was done whenever errors
occurred within the program.

The project was considered complete and successful once the program ran without errors.

1.6 Thesis Outline

This thesis reports the ideas generated, the concepts applied, the
activities carried out and the final year project produced. It consists of five chapters:
Chapter 1: Introduction, Chapter 2: Literature Review, Chapter 3: Methodology,
Chapter 4: Results and Discussion and Chapter 5: Conclusion and
Recommendation.

Chapter 1 introduces the project. It contains the objectives,
problem statement, scope of work, methodology and thesis outline of this project.

Chapter 2 presents the literature review of this project. The features of
Matlab are studied, and the applications of voice transformation systems are also covered
in this chapter.

Chapter 3 briefly describes the method used in this project to solve
the problem. It also covers the factors and reasons considered when choosing
the method, and discusses the advantages of the method.

Chapter 4 deals with the analysis of the results at the final stage, in which the
voice transformation was completely designed and implemented in Matlab. The
monitoring source code is written in the Matlab language.

Chapter 5 describes the conclusion and results of the project at the final stage.
Recommendations and future development of this project are discussed in order to
upgrade the voice transformation system.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Voice conversion, also called voice transformation, is the process of taking the
speech of a speaker (the source speaker) and transforming the characteristics of the
signal in such a way that a human listener would believe that the transformed speech
was uttered by another specific speaker (the target speaker).

2.2 Speech Model

The human voice consists of sound made by a human being using the vocal
folds for talking, singing, laughing, crying, screaming, etc. Human voice is specifically
that part of human sound production in which the vocal folds (vocal cords) are the
primary sound source. Generally speaking, the mechanism that generates the voice can be subdivided into three parts:
the lungs, the vocal folds, and the articulators. The lungs must produce adequate airflow
to vibrate the vocal folds (air is the fuel of the voice). The vocal folds are the
vibrators, neuromuscular units that ‘fine tune’ pitch and tone [1]. The articulators (the vocal
tract, consisting of tongue, palate, cheek, lips, etc.) articulate and filter the sound.

Figure 2.1: Human vocal tract [2]

Human speech is produced by the vocal tract, which starts at the glottis (vocal
folds) and ends at the lips. The lungs contract to force air through the trachea and
pharynx and out through the nasal and oral cavities. In English there are four different
types of sounds that can be created: aspiration noise, frication noise, plosion and voicing. Voicing is a
quasi periodic vibration of the vocal folds. The frequency of the vibration is called the
fundamental frequency, or F0, and is perceived as pitch.

A voice frequency, or voice band, is one of the frequencies within the part of
the audio range that is used for the transmission of speech (in telephony, approximately
300 Hz to 3400 Hz). The voiced speech of a
typical adult male has a fundamental frequency from 85 to 155 Hz, and that of
a typical adult female from 165 to 255 Hz [3]. Thus, the fundamental frequency of most
speech falls below the bottom of the "voice frequency" band as defined above [4].
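
To make the idea of fundamental frequency concrete, the following short Python sketch estimates F0 of a synthetic periodic signal by autocorrelation and compares the estimate with the ranges quoted above. It is only an illustration and is not part of the project's Matlab program: the function names, the 8000 Hz sampling rate, the 60 to 400 Hz search range and the synthetic test tone are all assumptions introduced here.

```python
import math


def synth_tone(f0_hz, sample_rate=8000, duration_s=0.1):
    """Build a periodic test signal: a fundamental plus two weaker harmonics.

    This is a stand-in for voiced speech, not a recording (assumption).
    """
    n = int(sample_rate * duration_s)
    return [
        math.sin(2 * math.pi * f0_hz * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * f0_hz * t / sample_rate)
        + 0.25 * math.sin(2 * math.pi * 3 * f0_hz * t / sample_rate)
        for t in range(n)
    ]


def estimate_f0(signal, sample_rate=8000, f0_min=60.0, f0_max=400.0):
    """Return the F0 estimate (Hz) whose lag maximizes the autocorrelation."""
    lag_min = int(sample_rate / f0_max)  # shortest period searched
    lag_max = int(sample_rate / f0_min)  # longest period searched
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Raw autocorrelation at this lag
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def describe_range(f0_hz):
    """Compare an F0 value with the typical ranges quoted in the text."""
    if 85 <= f0_hz <= 155:
        return "typical adult male range"
    if 165 <= f0_hz <= 255:
        return "typical adult female range"
    return "outside the quoted ranges"


if __name__ == "__main__":
    for true_f0 in (120.0, 210.0):
        f0 = estimate_f0(synth_tone(true_f0))
        print(f"true {true_f0:.0f} Hz, estimated {f0:.1f} Hz: {describe_range(f0)}")
```

Because the lag is restricted to whole samples, the estimate is quantized: at 8000 Hz the candidate values near 120 Hz are about 2 Hz apart. A real recording would also need framing and a voiced/unvoiced decision, which this sketch leaves out.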

2.3 Speaker Characteristics

Speech can differ between speakers in a very large number of respects.
These can be divided into three main types of speaker identity:

a. Segmental: In linguistics, the term segment may be defined as "any discrete unit
that can be identified, either physically or auditorily, in the stream of speech" [5].
Segments are called “discrete” because they are separate and individual, such as
consonants and vowels, and occur in a distinct temporal order.

b. Suprasegmental: These characteristics describe the prosodic features of the
voice related to the style of speaking. This includes information about how the
fundamental frequency (F0) varies during utterances, how duration varies and
how stress varies over the course of a sentence. Units such as tone, stress,
and sometimes secondary articulations such as nasalization, may coexist with
multiple segments and cannot be discretely ordered with them [6]. These
elements are termed suprasegmental. It is not clear how the concept of segment
applies to sign languages.