
Information and Communication Technology Seminar, Vol. 1 No. 1, August 2005, ISSN 1858-1633, ICTS

The text converter changes an input sentence in a given language into voice codes, usually represented as phoneme codes with their duration and pitch. This part is language dependent [6]. The phoneme-to-speech converter takes the phoneme, pitch, and duration codes produced earlier and, based on those codes, generates the speech signal that matches the sentence to be spoken. Two alternative techniques are commonly used to implement this part: formant synthesis and concatenative synthesis [6].

2.2 Text-to-Video Scheme and Design

2.2.1. Text-to-Speech Algorithm

Figure 1. Text to speech algorithm

2.2.2. Text Preprocessing

This block analyzes the input paragraph: it recognizes punctuation and translates numbers into their spelled-out letters.
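As a minimal sketch of this step, the fragment below spells out digits one by one using Indonesian digit names; the rule set and the digit-by-digit strategy are illustrative simplifications, not the paper's full translator:

```python
# Simplified text preprocessing: spell out digits with Indonesian names.
DIGITS = {"0": "nol", "1": "satu", "2": "dua", "3": "tiga", "4": "empat",
          "5": "lima", "6": "enam", "7": "tujuh", "8": "delapan", "9": "sembilan"}

def preprocess(text):
    out = []
    for token in text.split():
        if token.isdigit():
            # Digit-by-digit expansion; a full translator would read
            # multi-digit numbers as numerals ("12" -> "dua belas").
            out.extend(DIGITS[d] for d in token)
        else:
            out.append(token)
    return " ".join(out)
```

A full implementation would also normalize punctuation marks here, as the text describes.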

2.2.3. Text Parsing

Text parsing separates words into smaller speech units and normalizes letters into symbols, which reduces the complexity of the implemented program. For example, commas and periods are changed into another, uniform symbol. This block also applies the rules that distinguish the ways of reading the letter 'e'. As the TTS block diagram shows, text parsing draws on two databases: a dictionary and the set of language rules. The dictionary holds the words that do not follow the regular language rules.

2.2.4. Speech Generation

This block arranges the diphones from the voice symbols produced by the previous block. The .wav voice signals are taken from the database prepared beforehand.
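A diphone spans the transition between two adjacent sounds, including silence at the word boundaries. A minimal sketch of splitting a word into diphone names (using letters as a stand-in for phonemes, and "_" for boundary silence) could look like:

```python
def diphones(word):
    # Treat each letter as a sound and mark word-boundary silence with "_";
    # each diphone covers the transition between two adjacent units.
    units = ["_"] + list(word) + ["_"]
    return [units[i] + units[i + 1] for i in range(len(units) - 1)]
```

Each diphone name would then index the corresponding .wav signal in the database; the naming scheme here is hypothetical.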

2.2.5. Synthesizer

The synthesizer used in the TTS system is the overlap-add (OLA) method. Every diphone is windowed; a Kaiser window is used, because changing its beta value changes the impulse response in the time domain. After windowing, each diphone is overlapped with the next voice signal. The purpose of this overlap is to suppress the audible artifacts introduced by editing, since it is impossible to cut every diphone exactly at the middle of its transition. To obtain a smoother output, especially in the overlap area, two filtering processes are applied: average smoothing and a low-pass filter (LPF).
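A minimal sketch of the overlap-add step, assuming equal-overlap joins and a fixed beta (both simplifications; the real system takes these from the diphone database and the experiments in Section 3.3):

```python
import numpy as np

def overlap_add(segments, overlap, beta=2.0):
    # Window each diphone segment with a Kaiser window, then add it into
    # the output sharing `overlap` samples with its neighbour, so the
    # tapered edges cross-fade instead of clicking at the join.
    total = sum(len(s) for s in segments) - overlap * (len(segments) - 1)
    out = np.zeros(total)
    pos = 0
    for s in segments:
        w = np.kaiser(len(s), beta)
        out[pos:pos + len(s)] += np.asarray(s) * w
        pos += len(s) - overlap
    return out
```

The average smoothing and low-pass filtering mentioned above would be applied to `out` afterwards.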

2.2.6. Facial Animation Algorithm

The morphing technique animates the transformation between two different pictures so that the change appears gradual. Three kinds of facial-animation techniques are considered: cross dissolve, feature morphing, and mesh morphing. Among the three, cross dissolve gives the fastest computation, feature morphing the slowest, and mesh morphing the best result. This research uses the cross-dissolve method, because of its fast computation and its result, which is good enough for Text-to-Video. Cross dissolve displays the two pictures, source and target, in a transparency mode that can be controlled. At the beginning of the animation, the transparency of the first picture is set to the maximum (100) and that of the second picture to the minimum (0). During the animation, the transparency of the first picture is reduced gradually while that of the second picture is increased, keeping the two transparency levels summing to 100.
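The controlled-transparency blend above amounts to a linear interpolation between the two images. A minimal sketch, assuming the images are numeric arrays of the same shape:

```python
import numpy as np

def cross_dissolve(src, dst, n_frames):
    # Frame 0 shows only the source; the last frame shows only the target.
    # In between, the source's opacity falls as the target's rises, the
    # two always summing to full opacity.
    frames = []
    for i in range(n_frames):
        alpha = i / (n_frames - 1)           # 0.0 -> 1.0
        frames.append((1 - alpha) * src + alpha * dst)
    return frames
```

This per-pixel blend is what makes cross dissolve so much cheaper than mesh or feature morphing, which must also deform the image geometry.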

2.2.7. Text-to-Video Algorithm

Text-to-Video is the combination of Text-to-Speech and facial animation. Some TTS blocks, namely text preprocessing and text parsing, are also used in the facial animation. The problem of Text-to-Video is how to combine the two systems, Text-to-Speech and facial animation, so that they stay synchronized and each word is correctly aligned.

Figure 2. Image generation process

Text-to-Video: Text to Facial Animation Video Conversion – Hamdani Winoto, Hadi Suwastio, Iwan Iwut T.

In the image generation process, first of all the voice database is read and kept in picture variables. From the normalized text input, pairs of picture variables, source and target pictures, are found. The cross-dissolve morphing process is done for every pair. For synchronization between the morphing and the voice, the duration of the morphing process has to equal the diphone duration. But listening carefully to the pronunciation of a word, for example the phoneme "ba" in the word "baik", the duration of the letter "b" is not the same as the duration of the letter "a", so treating them as equal sounds unnatural. For that reason, duration rules are needed for the animation between two chronologically adjacent letters; these rules are shown in the table below:

Table 1. Duration Rules

With these rules, the animation looks more natural.

3. RESULT

3.1 Voice Diphone Quality Analysis

In terms of quality, the kinds of concatenative-synthesis databases can be ordered from the best synthesis to the worst: word, phoneme, triphone, diphone, and phone. In terms of quantity, the order is the reverse. Because of its good quality and small quantity, the Text-to-Speech system uses a diphone database.

3.2 NLP Module Analysis

The NLP module handles the following cases:

1. Diphthong letters ai, au, oi
2. Numeral reading
3. Abbreviations and acronyms
4. Period and comma punctuation
5. Dash "-"
6. Mathematical symbols
7. The letter "e"
8. Space duration and punctuation

3.3 DSP Module Analysis

The method used in the DSP module of Text-to-Speech is OLA (OverLap-Add). The window used is the Kaiser window, because its response can be changed through the beta parameter. From the experiments, the optimal beta value is found to be about β = 1.7–2.2, because in that range the side-lobe level of the voice signal is about half of the peak response. With a larger beta, the signal looks discontinuous at the joins; with a smaller beta, there is a pop in the join area, because the signals are added together there.
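The effect of beta can be sketched directly: a larger beta tapers the window edges more strongly, which suppresses side lobes at the cost of a wider main lobe. The beta values below follow the range quoted above; the window length is an arbitrary example:

```python
import numpy as np

# How beta shapes the Kaiser window: the edge sample shows how strongly
# the window tapers toward zero at the joins between diphones.
for beta in (1.7, 2.0, 2.2):
    w = np.kaiser(64, beta)
    print(f"beta={beta}: edge sample {w[0]:.3f}, centre sample {w[32]:.3f}")
```

A stronger edge taper (smaller edge sample) means gentler joins but a more smeared main lobe, which is the trade-off the experiments balanced.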

3.4 Naturalism Animation Analysis

With the cross-dissolve method, the time needed to compute two pictures is quite short, whereas the per-picture deformation in mesh morphing and feature morphing takes very long. Decreasing the picture resolution to speed up the computation degrades the quality. For that reason, the Text-to-Video application uses the cross-dissolve method.

3.5 The Analysis of Frame Mistakes Animation Amount

The number of animation frames is determined by rounding the product of the diphone duration and the video speed. Round-off error and diphone overlapping make this number of animation frames incorrect. At a video speed of 15 fps, the rounding is commonly downward, because the fractional part of the diphone duration times the video speed does not reach one half, so the difference between the number of animation frames and the number of normalized text characters is usually small. However, if the count is taken from the total voice duration times the video speed, it exceeds the sum of the per-diphone animation frames. At 50 fps the reverse happens: the number of animation frames is always more than the real number of frames. This is because the rounding factor at that speed is large, so the rounding is usually upward, while the overlapping in the DSP process makes the total voice duration smaller than the sum of the individual diphone durations. Based on the experimental data, the video speed with the smallest frame error is 30 fps, even though at 25 fps and 40 fps the frame error is sometimes smaller than at 30 fps. The experimental data show that the frame error can reach 0 for some sentences. Comparing the frame errors generated at 25 fps, 30 fps, and 40 fps, the errors are generally smallest at 30 fps. From this analysis, the conclusion is that the Text-to-Video program produces video with the smallest frame error at a video speed of 30 fps.
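The per-diphone rounding described above can be sketched as follows; the durations are illustrative values, not measurements from the paper:

```python
def frame_count(diphone_durations, fps):
    # Each diphone is animated for round(duration * fps) frames, so the
    # rounding error accumulates once per diphone rather than once per
    # sentence -- the source of the frame-count mismatch analyzed above.
    return sum(round(d * fps) for d in diphone_durations)

# Illustrative diphone durations in seconds.
durations = [0.07, 0.11, 0.09]
for fps in (15, 30, 50):
    per_diphone = frame_count(durations, fps)
    whole_utterance = round(sum(durations) * fps)
    print(fps, per_diphone, whole_utterance)
```

Comparing the per-diphone sum against the whole-utterance count shows how the two disagree as the video speed changes.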