Text-to-Video: Text to Facial Animation Video Convertion – Hamdani Winoto, Hadi Suwastio, Iwan Iwut T.
ISSN 1858-1633 2005 ICTS 65
In the image generation process, the whole voice database is first read and stored in picture variables.
From the normalized text input, pairs of source and target pictures are then identified among the
picture variables. Cross-dissolve morphing is applied to every pair.
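The cross-dissolve step for one source/target pair can be sketched as a linear blend of pixel values. This is only a minimal illustration, not the authors' implementation: the function name `cross_dissolve`, the grey-level list-of-rows image format, and the choice to make the first and last frames equal to the source and target pictures are all assumptions.

```python
def cross_dissolve(source, target, num_frames):
    """Return num_frames images blending linearly from source to target.

    Images are lists of rows of grey levels (0-255) with identical
    dimensions. This image format is an assumption for illustration.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for k in range(num_frames):
        t = k / (num_frames - 1)  # 0.0 at the source, 1.0 at the target
        frame = [
            [round((1.0 - t) * s + t * d) for s, d in zip(src_row, dst_row)]
            for src_row, dst_row in zip(source, target)
        ]
        frames.append(frame)
    return frames


# Hypothetical 1x2 pictures, for illustration only.
frames = cross_dissolve([[0, 100]], [[200, 100]], 3)
print(frames)
```

Because each output pixel depends only on the two corresponding input pixels, no geometric deformation of the pictures is needed.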
To synchronize the morphing with the voice, the duration of the morphing process has
to equal the diphone duration. However, if we look carefully at how a word is pronounced,
for example the diphone ”ba” in the word ”baik”, the duration of the letter ”b” is not the same as
the duration of the letter ”a”. Giving both letters the same animation duration therefore looks unnatural.
For that reason, animation duration rules are needed for each pair of chronologically adjacent
letters. These rules are shown in Table 1.
Table 1. Duration Rules
With these rules, the animation looks more natural.
3. RESULT
3.1 Voice Diphone Quality Analysis
In terms of quality, the kinds of database used in concatenative synthesis can be ranked from the best
to the lowest as: word, phoneme, triphone, diphone, and phone. In terms of the quantity of units
required, the order is reversed.
Because it combines good quality with a small quantity of units, the Text-to-Speech system uses a diphone database.
3.2 NLP Module Analysis
1. Diphthong letters ai, au, oi
2. Numeral reading
3. Abbreviations and acronyms
4. Point and comma punctuation
5. Dash “-”
6. Mathematical symbols
7. “E” letter
8. Space duration and punctuation
3.3 DSP Module Analysis
The method used in the DSP module of the Text-to-Speech system is OLA (Overlap-Add).
The window used is the Kaiser window, because its response can be adjusted through the parameter β.
Experiments show that the optimal value is about β = 1.7 to 2.2, because in that range the signal
level at the side lobe is about half of the peak response.
With larger values of β, the signal looks discontinuous at the joint; with smaller values
of β, a burst appears in the joint area, because the signals are added together in the
joint area.
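The effect of β can be explored with a small sketch of a Kaiser window and a windowed overlap-add joint. The Kaiser window formula is standard. Everything else here is an assumption for illustration rather than the authors' DSP module: the helper names, and the way the two halves of the window are applied as fade-in and fade-out at the joint.

```python
import math


def bessel_i0(x, terms=30):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total = 0.0
    term = 1.0
    for k in range(terms):
        if k > 0:
            term *= (x / 2.0) ** 2 / (k * k)
        total += term
    return total


def kaiser_window(length, beta):
    """Kaiser window of the given length, normalized to a peak of 1."""
    if length == 1:
        return [1.0]
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0  # position mapped to [-1, 1]
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / denom)
    return window


def overlap_add_join(left, right, overlap, beta):
    """Join two sample lists, summing `overlap` windowed samples at the joint.

    Assumed scheme: the falling half of the window fades out `left` and the
    rising half fades in `right`.
    """
    window = kaiser_window(2 * overlap, beta)
    fade_in, fade_out = window[:overlap], window[overlap:]
    head = left[:len(left) - overlap]
    joint = [
        left[len(left) - overlap + i] * fade_out[i] + right[i] * fade_in[i]
        for i in range(overlap)
    ]
    return head + joint + right[overlap:]


for beta in (1.7, 2.0, 2.2):
    edge = kaiser_window(65, beta)[0]
    print(f"beta={beta}: edge/peak = {edge:.2f}")
```

In this sketch the window edge is roughly 0.4 to 0.5 of the peak for β between 1.7 and 2.2. With β = 0 the window degenerates to a rectangle, so the overlapping samples are simply summed, which is one way to picture the burst described for small β.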
3.4 Naturalism Animation Analysis
With the cross-dissolve method, the computation for a pair of pictures is fairly fast.
In mesh morphing and feature morphing, by contrast, the deformation of every picture takes a very
long time, and decreasing the picture resolution to make the computation faster degrades the quality.
For that reason, the cross-dissolve method is used for the Text-to-Video application.
3.5 The Analysis of Frame Mistakes Animation Amount
The number of animation frames is determined by rounding the product of the diphone duration and
the video speed. Rounding error and diphone overlapping make the number of animation frames
incorrect.
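The frame count and its error can be sketched as follows. The function names, the constant overlap per joint, and the sample durations are hypothetical, and the error is taken here as the animation frame count minus the frame count implied by the total voice duration, which is one reading of the text.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with .5 going up (unlike round())."""
    return math.floor(value + 0.5)


def animation_frames(diphone_durations, fps):
    """Frames per diphone: the rounded product of duration and video speed."""
    return [round_half_up(d * fps) for d in diphone_durations]


def frame_error(diphone_durations, overlap, fps):
    """Animation frames minus the frames the joined voice signal spans.

    `overlap` is the time in seconds shared by each pair of neighbouring
    diphones after joining (assumed constant for illustration).
    """
    animated = sum(animation_frames(diphone_durations, fps))
    joins = max(len(diphone_durations) - 1, 0)
    total_voice = sum(diphone_durations) - joins * overlap
    return animated - round_half_up(total_voice * fps)


# Hypothetical diphone durations in seconds, for illustration only.
durations = [0.10, 0.12, 0.10, 0.12]
for fps in (15, 30, 50):
    print(fps, animation_frames(durations, fps), frame_error(durations, 0.01, fps))
```

A positive result means more animation frames than the voice signal needs; a negative result means too few.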
At a video speed of 15 fps, the values are commonly rounded down, because the fractional part
of the product of the diphone duration and the video speed does not reach one half. As a result, the
number of animation frames usually does not differ much from the number of characters in the normalized text.
However, if the frame count is computed from the product of the total voice duration and the video speed,
it is larger than the number of animation frames.
At a video speed of 50 fps, the reverse happens. The number of animation frames
is always larger than the real number of frames. This is because the video speed factor
is fairly large, so the values are commonly rounded up, while the overlapping
in the DSP process makes the total voice duration shorter than the sum of
the individual diphone durations.
Based on the experimental data, the video speed that gives the smallest frame error is 30 fps, even though
the frame error at 25 fps or 40 fps is sometimes smaller than at 30 fps. For some sentences, the frame
error reaches 0. When the frame errors at 25 fps, 30 fps, and 40 fps are compared, the error is
generally smallest at 30 fps. The analysis therefore indicates that the Text-to-Video program
produces video with the smallest frame error at a video speed of 30 fps.
Information and Communication Technology Seminar, Vol. 1 No. 1, August 2005
3.6 The Fitness Analysis Between Voice And Animation