
Text-to-Video: Text to Facial Animation Video Convertion – Hamdani Winoto, Hadi Suwastio, Iwan Iwut T. ISSN 1858-1633 2005 ICTS 65

In the image generation process, the entire voice database is first read and stored in the picture variable. From the normalized input text, the pairs of source and target pictures are looked up in the picture variable. A cross-dissolve morph is then performed for every pair. To synchronize the morphing with the voice, the duration of each morph has to equal the duration of the corresponding diphone. However, if we look closely at how a word is pronounced, for example the phoneme "ba" in the word "baik", the duration of the letter "b" is not the same as the duration of the letter "a". Giving both letters the same duration therefore looks unnatural. For that reason, duration rules are needed for the animation between two adjacent letters. These rules are shown in the table below:

Table 1. Duration Rules

With these rules, the animation looks more natural.

3. RESULT

3.1 Voice Diphone Quality Analysis

In terms of quality, the unit databases used for concatenative synthesis can be ranked from best to worst as: word, phoneme, triphone, diphone, and phone. In terms of quantity (the number of units that must be stored), the order is reversed. Because it combines good quality with a comparatively small database, the Text-to-Speech system uses a diphone database.

3.2 NLP Module Analysis

1. Diphthongs ai, au, oi
2. Numeral reading
3. Abbreviations and acronyms
4. Full stop and comma punctuation
5. Dash "-"
6. Mathematical symbols
7. The letter "E"
8. Space duration and punctuation

3.3 DSP Module Analysis

The method used in the DSP module of the Text-to-Speech system is OLA (Overlap-Add). The window used is the Kaiser window, because its response can be adjusted through the parameter β. Experiments show that the optimal value is about β = 1.7 to 2.2, because in that range the signal level in the side lobe is half of the peak response. With larger values of β, the signal looks discontinuous at the joint. With smaller values of β, a burst appears in the joint region, because the two signals are added together there.
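
The effect of β on the joint can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to window each segment over its full length, and the one-sample overlap are assumptions made for illustration. Under these assumptions the edge value of the Kaiser window is 1/I0(β), which is close to one half for β around 1.8, so two overlapping edges sum to roughly the original level.

```python
import math


def bessel_i0(x, terms=25):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) / k          # term is now (x/2)^k / k!
        total += term * term
    return total


def kaiser_window(length, beta):
    """Kaiser window of the given length; beta = 0 gives a rectangular window."""
    if length == 1:
        return [1.0]
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0            # position in [-1, 1]
        arg = beta * math.sqrt(max(0.0, 1.0 - r * r))
        window.append(bessel_i0(arg) / denom)
    return window


def overlap_add(first, second, overlap, beta):
    """Window two segments and add them with `overlap` shared samples.

    Assumption: each segment is windowed over its full length.
    """
    a = [s * w for s, w in zip(first, kaiser_window(len(first), beta))]
    b = [s * w for s, w in zip(second, kaiser_window(len(second), beta))]
    out = a + [0.0] * (len(b) - overlap)
    start = len(a) - overlap
    for i, sample in enumerate(b):
        out[start + i] += sample                     # signals are summed at the joint
    return out


if __name__ == "__main__":
    # Hypothetical constant-amplitude segments, joined with a one-sample overlap.
    for beta in (0.5, 1.8, 4.0):
        joined = overlap_add([1.0] * 10, [1.0] * 10, overlap=1, beta=beta)
        print(beta, round(joined[9], 3))             # level at the joint sample
```

In this sketch a small β leaves the joint sample well above 1 (a burst), while a large β leaves it well below 1 (a dip), which is consistent with the behaviour described above.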

3.4 Animation Naturalness Analysis

With the cross-dissolve method, the computation for a pair of pictures is fairly fast. Mesh morphing and feature morphing, by contrast, take much longer because every picture has to be deformed. Reducing the picture resolution to speed up the computation would degrade the quality. For these reasons, the Text-to-Video application uses the cross-dissolve method.
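
A cross dissolve is a per-pixel linear blend between the source and target pictures, which is why it is cheap compared with methods that deform the pictures. The following Python sketch shows the idea on small grayscale images stored as nested lists. The function names, the image representation, and the rounding used to turn a diphone duration into a frame count are assumptions for illustration, not the authors' code.

```python
def frames_for_duration(duration_s, fps):
    """Number of frames covering one diphone (round half up, at least one frame)."""
    return max(1, int(duration_s * fps + 0.5))


def cross_dissolve(source, target, n_frames):
    """Blend two equally sized grayscale images into n_frames frames.

    The first frame equals the source and the last frame equals the target.
    """
    frames = []
    for i in range(n_frames):
        alpha = i / (n_frames - 1) if n_frames > 1 else 1.0
        frame = [
            [(1.0 - alpha) * s + alpha * t for s, t in zip(src_row, tgt_row)]
            for src_row, tgt_row in zip(source, target)
        ]
        frames.append(frame)
    return frames


if __name__ == "__main__":
    # Hypothetical 1x2 pictures and a hypothetical 0.2 s diphone at 30 fps.
    src = [[0.0, 100.0]]
    dst = [[100.0, 0.0]]
    n = frames_for_duration(0.2, 30)
    for frame in cross_dissolve(src, dst, n):
        print(frame)
```

Each output pixel needs only two multiplications and one addition, so the cost grows linearly with the number of pixels and frames.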

3.5 Analysis of the Number of Animation Frame Errors

The number of animation frames is determined by rounding the product of the diphone duration and the video speed. Round-off error and diphone overlapping make the resulting number of animation frames incorrect.

At a video speed of 15 fps, the product is usually rounded down, because the fractional part of the diphone duration multiplied by the video speed does not reach one half. As a result, the number of animation frames usually does not differ much from the number of characters in the normalized text. However, the frame count obtained by multiplying the total voice duration by the video speed is larger than the summed number of animation frames.

At a video speed of 50 fps, the reverse happens: the number of animation frames is always larger than the real number of frames. With this larger video speed the product is usually rounded up, while the overlapping in the DSP process makes the total voice duration shorter than the sum of the individual diphone durations.

Based on the experimental data, the video speed with the smallest frame error is 30 fps, even though the frame error at 25 fps or 40 fps is sometimes smaller than at 30 fps. The experimental data also show that the frame error can reach 0 for some sentences. When the frame errors at 25 fps, 30 fps, and 40 fps are compared, the error is generally smallest at 30 fps. The conclusion of this analysis is that the Text-to-Video program produces video with the smallest number of frame errors at a video speed of 30 fps.

Information and Communication Technology Seminar, Vol. 1 No. 1, August 2005 ISSN 1858-1633 2005 ICTS 66
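
The two sources of frame error described in this section, per-diphone rounding and DSP overlap, can be sketched in Python as follows. The durations, the overlap value, and the function names are hypothetical and chosen only to illustrate the arithmetic. The sketch also assumes round-half-up rounding and a fixed overlap at every joint between diphones.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with .5 rounded up."""
    return math.floor(value + 0.5)


def animation_frames(diphone_durations, fps):
    """Sum of the per-diphone frame counts, each rounded separately."""
    return sum(round_half_up(d * fps) for d in diphone_durations)


def real_frames(diphone_durations, overlap_s, fps):
    """Frame count from the total voice duration after overlap-add.

    Assumption: every joint between diphones shortens the total by overlap_s.
    """
    joints = max(0, len(diphone_durations) - 1)
    total = sum(diphone_durations) - joints * overlap_s
    return round_half_up(total * fps)


def frame_error(diphone_durations, overlap_s, fps):
    """Animation frames minus real frames; positive means too many frames."""
    return animation_frames(diphone_durations, fps) - real_frames(
        diphone_durations, overlap_s, fps
    )


if __name__ == "__main__":
    durations = [0.083, 0.112]   # hypothetical diphone durations in seconds
    overlap = 0.01               # hypothetical overlap per joint in seconds
    for fps in (15, 30, 50):
        print(fps, frame_error(durations, overlap, fps))
```

The error depends on the individual durations, so a single example like this one cannot reproduce the experimental comparison between video speeds. It only shows how the two frame counts are computed and why they can disagree.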

3.6 The Fitness Analysis Between Voice And Animation