
Information and Communication Technology Seminar, Vol. 1 No. 1, August 2005 ISSN 1858-1633 2005 ICTS

The text converter changes an input sentence in a given language into voice codes, which are usually represented as phoneme codes with their duration and pitch. This part is language dependent [6]. The phoneme-to-speech converter takes as input the phoneme, pitch, and duration codes produced by the text converter. From these codes it produces the speech signal that corresponds to the sentence to be spoken. Two alternative techniques are commonly used to implement this part: formant synthesis and concatenative synthesis [6].

2.2 Text-to-Video Stages and Design

2.2.1. Text-to-Speech Algorithm

Figure 1. Text to speech algorithm

2.2.2. Text Preprocessing

This block analyzes the input paragraph: it recognizes punctuation and translates numbers into their written-out (letter) form.
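The preprocessing step can be illustrated with a minimal sketch. The function names, the English digit names, and the digit-by-digit reading are all assumptions made for illustration; the paper does not specify how numbers are spelled out or which punctuation marks are recognized.

```python
import re

# Hypothetical digit names; a real system would use the target language's
# number words and read whole numbers, not single digits.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def spell_digits(match):
    """Replace a run of digits with space-separated digit names."""
    return " ".join(DIGITS[int(ch)] for ch in match.group())

def preprocess(paragraph):
    """Return (token, kind) pairs, where kind is 'word' or 'punct'."""
    # Number translation: digits become letters before tokenizing.
    text = re.sub(r"\d+", spell_digits, paragraph.lower())
    # Punctuation recognition: keep a small, assumed set of marks as tokens.
    tokens = re.findall(r"[a-z]+|[.,;:!?]", text)
    return [(t, "word" if t.isalpha() else "punct") for t in tokens]
```

For example, preprocess("Room 42, please.") yields the words "room", "four", "two", a comma token, the word "please", and a period token.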

2.2.3. Text Parsing

Text parsing separates words into smaller speech units and normalizes letters into symbols, which reduces the complexity of the implemented program. For example, the comma and the period are mapped to a single shared symbol. This block also contains the rules that distinguish the different ways of reading the letter ‘e’. As the TTS block diagram shows, text parsing draws on two databases: a dictionary and a set of language rules. The dictionary contains the words that do not follow the language rules.
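A minimal sketch of this lookup order (dictionary first, rules as fallback) is given below. Every symbol, dictionary entry, and rule in it is a hypothetical placeholder: the paper does not list its voice symbols or its actual rule for the letter 'e'.

```python
# Hypothetical exception dictionary: words that break the rules below.
EXCEPTIONS = {"beta": ["b", "E", "t", "a"]}
PAUSE = "_"  # assumed shared symbol for both ',' and '.'

def apply_rules(word):
    """Rule-based fallback mapping each letter to a voice symbol.

    Placeholder rule for 'e': a word-final 'e' becomes 'E', any other
    'e' becomes '@'. All other letters map to themselves.
    """
    last = len(word) - 1
    return [("E" if i == last else "@") if ch == "e" else ch
            for i, ch in enumerate(word)]

def parse(tokens):
    """Convert word and punctuation tokens into a flat symbol list."""
    symbols = []
    for tok in tokens:
        if tok in {",", "."}:
            symbols.append(PAUSE)            # punctuation normalization
        elif tok in EXCEPTIONS:
            symbols.extend(EXCEPTIONS[tok])  # dictionary overrides rules
        else:
            symbols.extend(apply_rules(tok))
    return symbols
```

Checking the dictionary before the rules keeps the rule set small, since irregular words never reach it.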

2.2.4. Speech Generation

This block arranges the voice symbols produced by the previous block into a sequence of diphones. The voice signal for each diphone is taken as a .wav file from a database built beforehand.
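The arrangement of diphones can be sketched as pairing each symbol with its successor. The naming scheme "a-b", the silence symbol, and the in-memory dictionary standing in for the .wav database are assumptions for illustration only.

```python
def to_diphones(symbols, silence="_"):
    """Pair adjacent symbols into diphone names, padding with silence."""
    padded = [silence] + list(symbols) + [silence]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]

def fetch_units(diphones, database):
    """Look up the samples for each diphone name.

    `database` is a plain dict standing in for the recorded .wav files;
    a missing diphone raises KeyError.
    """
    return [database[name] for name in diphones]
```

For the symbols ["s", "a"], this produces the diphone names "_-s", "s-a", and "a-_".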

2.2.5. Synthesizer

The synthesizer in the TTS system uses the overlap-add method. Each diphone is first windowed. A Kaiser window is used because changing the value of beta changes the shape of the impulse response in the time domain. After windowing, each diphone is overlapped with the next voice signal. The overlap is intended to mask the audible artifacts introduced by the editing process, since it is impossible to cut every diphone exactly at the midpoint of its transition. To obtain a smoother output, especially in the overlap region, two filtering steps are applied: average smoothing and a low-pass filter (LPF).
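The windowing and overlap steps can be sketched in plain Python as follows. The beta value, the overlap length, and the three-sample moving average are illustrative assumptions, not the paper's settings, and the moving average stands in for both smoothing steps since the filter design is not given.

```python
import math

def bessel_i0(x, terms=25):
    """Zeroth-order modified Bessel function via its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) / k      # term = (x/2)^k / k!
        total += term * term
    return total

def kaiser(length, beta):
    """Kaiser window; a larger beta tapers the edges more strongly."""
    if length == 1:
        return [1.0]
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0   # position in [-1, 1]
        arg = beta * math.sqrt(max(0.0, 1.0 - r * r))
        window.append(bessel_i0(arg) / denom)
    return window

def overlap_add(units, overlap, beta=5.0):
    """Window each unit, then sum it onto the tail of the output."""
    out = []
    for unit in units:
        shaped = [s * g for s, g in zip(unit, kaiser(len(unit), beta))]
        start = max(0, len(out) - overlap)
        for i, value in enumerate(shaped):
            if start + i < len(out):
                out[start + i] += value    # overlapping region: add
            else:
                out.append(value)          # new region: extend
    return out

def moving_average(signal, width=3):
    """Average smoothing; the window is truncated at the signal edges."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed
```

With two units of 8 samples each and an overlap of 3, the output is 13 samples long, because the last 3 samples of the first unit are summed with the first 3 of the second.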

2.2.6. Facial Animation Algorithm

Morphing is used to animate the transformation between two different pictures so that the change is seen gradually. Three facial animation techniques are considered: cross dissolve, feature morphing, and mesh morphing. Of these three methods, cross dissolve is the fastest to compute and feature morphing is slower, while mesh morphing gives the best result. This research uses the cross dissolve method because of its fast computation and its reasonably good result for Text-to-Video. Cross dissolve displays the two pictures, source and target, in a transparent mode whose level can be controlled. At the start of the animation, the level of the first picture is set to the maximum (100), while that of the second picture is set to the minimum (0). During the animation, the level of the first picture is gradually reduced while that of the second is increased, such that the two levels together always total 100.
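A minimal sketch of cross dissolve on flat lists of pixel intensities is shown below. It assumes a linear schedule and reads the requirement on the two levels as "they always sum to 100"; the function name and the list-based image representation are illustrative choices, not the paper's implementation.

```python
def cross_dissolve(source, target, steps):
    """Return steps + 1 frames blending source into target.

    The source level falls from 100 to 0 while the target level rises
    from 0 to 100, so the two levels sum to 100 in every frame.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same size")
    frames = []
    for step in range(steps + 1):
        target_level = 100 * step / steps
        source_level = 100 - target_level
        frames.append([(source_level * s + target_level * t) / 100
                       for s, t in zip(source, target)])
    return frames
```

The first frame equals the source picture, the last frame equals the target picture, and each frame in between is a weighted average of the two, which is why this method is cheap to compute compared with feature or mesh morphing.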

2.2.7. Text-to-Video Algorithm