Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

Oya Aran, Ismail Ari, Lale Akarun

Erinc Dikici, Siddika Parlak, Murat Saraclar

PILAB, Bogazici University, Istanbul, Turkey

BUSIM, Bogazici University, Istanbul, Turkey

Pavel Campr, Marek Hruz
University of West Bohemia, Pilsen,
Czech Republic

broadcast news is for the hearing impaired and consists of three
major information sources: sliding text, speech and sign. Fig. 1
shows an example frame from the recordings.

The objective of this study is to automatically extract annotated
sign data from the broadcast news recordings for the hearing
impaired. These recordings present an excellent source for automatically generating annotated data: In news for the hearing
impaired, the speaker also signs with the hands as she talks. On
top of this, there is also corresponding sliding text superimposed
on the video. The video of the signer can be segmented via the
help of either the speech or both the speech and the text, generating segmented, and annotated sign videos. We call this application as Signiary, and aim to use it as a sign dictionary where
the users enter a word as text and retrieves sign videos of the
related sign. This application can also be used to automatically
create annotated sign databases that can be used for training recognizers.
speech recognition – sliding text recognition – sign language
analysis – sequence clustering – hand tracking

Figure 1: An example frame from the news recordings. The three
information sources are the speech, sliding text, signs.

Sign language is the primary means of communication for deaf
and mute people. Like spoken languages, it emerges naturally
among the deaf people living in the same region. Thus, there exist a large number of sign languages all over the world. Some of
them are well known, like the American Sign Language (ASL),
and some of them are known by only a very small group of
deaf people who use it. Sign languages make use of hand gestures, body movements and facial expressions to convey information. Each language has its own signs, grammar, and word
order, which is not necessarily the same as the spoken language
of that region.
Turkish sign language (Turk Isaret Dili, TID) is a natural
full-fledged language used in the Turkish deaf community [1].
Its roots go back to the 16th century, to the time of the Ottoman
Empire. TID has many regional dialects, and is used throughout
Turkey. It has its own vocabulary and grammar, and its own system of fingerspelling and it is completely different from spoken
Turkish in many aspects, especially in terms of word order and
grammar [2].
In this work, we aim to exploit videos of the Turkish news
for the hearing impaired in order to generate usable data for sign
language education. For this purpose, we have recorded news
videos from the Turkish Radio-Television (TRT) channel. The

The three sources in the video convey the same information
via different modalities. The news presenter signs the words as
she talks. However, since it is not necessary to have the same
word ordering in a Turkish spoken sentence and in a Turkish
sign sentence [2], the signing in these news videos is not considered as TID but can be called as signed Turkish: the sign of
each word is from TID but their ordering would have been different in a proper TID sentence. Moreover, facial expressions
and head/body movements, which are frequently used in TID,
are not used in these signings. Since the signer also talks, no facial expression that uses lip movements can be done. In addition
to the speech and sign information, a corresponding sliding text
is superimposed on the video. Our methodology is to process
the video to extract the information content in the sliding text
and speech components and to use both the speech and the text
to generate segmented and annotated sign videos. The main goal
is to use this annotation to form a sign dictionary. Once the annotation is completed, unsupervised techniques are employed to
check consistency among the retrieved signs, using a clustering
of the signs.
The system flow of Signiary is illustrated in Fig. 2. The application receives the text input of the user and attempts to find


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

Figure 3: Block diagram of the spoken term detection system
2.1. Speech recognition
Prior to recognition, audio data is segmented into utterances
based on energy, using the method explained in [3]. The speech
signal corresponding to each utterance is converted into a textual
representation using an HMM based large vocabulary continuous speech recognition (LVCSR) system. The acoustic models
consist of decision tree clustered triphones and the output probabilities are given by Gaussian mixture models. As the language
model, we use word based n-grams. The recognition networks
and the output hypotheses are represented as weighted automata.
The details of the ASR system can be found in [4].
The HTK toolkit[5] is used to produce the acoustic feature
vectors and AT&T’s FSM and DCD tools[6] are used for recognition.
The mechanism of an ASR system requires searching through
a network of all possible word sequences. One-best output is obtained by finding the most likely hypothesis. Alternative likely
hypotheses can be represented using a lattice. To illustrate, lattice output of a recognized utterance is shown in Fig. 4. The
labels on arcs are words and the weights indicate the probability
of arcs [7]. In Fig. 4, the circles represent states, where state
’0’ is the initial state and state ’4’ is the final state. An utterance
hypothesis is a path between the initial state and one of the final
states. The probability of each path is computed by multiplying
the probabilities of the arcs on that path. For example, the probability of path iyi gUnler is 0.73 × 0.975 ≃ 0.71, which is the
highest of all hypotheses (one-best).

Figure 2: Modalities and the system flow

the word in the news videos by using the speech. At this step
the application returns several intervals from different videos
that contain the entered word. If the resolution is high enough
to analyze the lip movements, audio-visual analysis can be applied to increase speech recognition accuracy. Then, sliding text
information is used to control and correct the result of the retrieval. This is done by searching for the word in the sliding
text modality during each retrieved interval. If the word can
also be retrieved by the sliding text modality, the interval is assumed to be correct. The sign intervals are extracted by analyzing the correlation of the signs with the speech. However, the
signs in these intervals are not necessarly the same. First, there
can be false alarms of the retrieval corresponding to some unrelated signs and second, there are homophonic words that have
the same phonology but different meanings; thus, possibly different signs. Thus we need to cluster the signs that are retrieved
by the speech recognizer.
In Sections 2, 3 and 4, we explain the analysis techniques
and results of the recognition experiments for speech, text and
sign modalities, respectively. The overall assesment of the system is given in Section 5. In Section 6, we give the details about
the Signiary application and its graphical user interface. We
conlude and discuss further improvement areas of the system
in Section 7.










Spoken term detection (STD) is a subfield of speech retrieval,
which locates occurrences of a query in a spoken archive. In this
work, STD is used as a tool to segment and retrieve the signs in
the news videos based on speech information. After the location
of the query is extracted with STD, the sign video corresponding
to that time interval is displayed to the user. The block diagram
of the STD system is given in Fig. 3.
The three main components of the system are: speech recognition, indexation and retrieval. The speech recognizer converts
the audio data, extracted from videos, into a symbolic representation (in terms of weighted finite state automata). The index,
which is represented as a weighted finite state transducer, is built
offline, whereas the retrieval is performed after the query is entered. The retrieval module uses the index to return the time,
program and relevance information about the occurrences of the
query. Now, we will explain each of the components in detail.





Figure 4: An example lattice output, for the utterance ”iyi gUnler”


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

2.2. Indexation

all the words seen in manual transcriptions (excluding foreign
words and acronyms). Correct transcriptions are obtained manually and the time intervals are obtained by forced viterbi alignment. The acoustic model of the speech recognizer is trained on
our broadcast news corpus, which includes approximately 100
hours of speech. The language models are trained on a text corpus, consisting of 96M words [4].
We consider the retrieved occurrence as a correct match if
the time interval falls within ± 0.5 seconds of the correct interval, if not, we assume a miss or false alarm.

Weighted automata indexation is an efficient method in retrieval
of uncertain data. Since the output of the ASR (hence, input
to the indexer) is a collection of alternative hypotheses represented as weighted automata, it is advantageous to represent the
index as a finite state transducer. Therefore, the automata, (output of ASR) are turned into transducers such that the inputs are
words and the outputs are utterance numbers, which the word
appears in. By taking the union of these transducers, a single
transducer is obtained, which is further optimized via weighted
transducer determinization. The resulting index is optimal in
search complexity, the search time is linear in the length of the
input string. The weights in the index transducer correspond to
expected counts, where the expectation is taken under the probability distribution given by the lattice [8].

2.4.2. Results
We experimented with both lattice and one-best indexes. As
mentioned before, the lattice approach corresponds to a curve
in the plot, while one-best approach is represented with a point.
The resulting precision-recall graph is depicted as in Fig. 5

2.3. Retrieval
Having built the index, we are now able to transduce input queries
into utterance numbers. To accomplish this, the queries are represented as finite state automata and composed with the index
transducer (via weighted finite state composition) [8]. The output is a list of all utterance numbers which the query appears in,
as well as the corresponding expected counts. Next, utterances
are ranked based on expected counts, the ones higher than a particular threshold are retrieved. We apply forced alignment on
the utterances to identify the starting time and duration of each
term [9].
As explained in section 2.1, use of lattices introduces more
than one hypothesis for the same time interval, with different
probabilities. Indexation estimates the expected count using
these path probabilities. By setting a threshold on the expected
count, different precision-recall points can be obtained which results in a curve. On the other hand, one-best hypothesis can be
represented with only one point. Having a curve allows choosing an operating point by varying the threshold. Use of a higher
threshold improves precision but recall falls. Conversely, a lower
threshold value causes less probable documents to be retrieved.
This increases recall but decreases precision.
The opportunity of choosing the operating point is a great
advantage. Depending on the application, it may be desirable to
retrieve all of the related documents or only the most probable
ones. For our case, it is more convenient to operate at a point
where precision is high.

word lattice




1 X C(q)
Q q=1 R(q)





In the plot, we see that, use of lattices performs better than
one-best. The arrow points at the position where the maximum
F-measure is achieved for lattice. We also experimented with
setting a limit on the number of retrieved occurrences. When
only 10 occurrences are selected, precision-recall results were
similar on one-best vs lattice performance. However, the maxF measure obtained by this method outperforms the previous
by 1%. A comparison of lattice and one-best performances (in
F-measure) is given in Table 1, for both experiments. Use of
lattices introduces 1-1.5 % of improvement in F-measure. Since
the broadcast news corpora are fairly noiseless, the achievement
may seem minor. However for noisy data, this improvement is
much higher [7].

Speech recognition performance is evaluated by the word error
rate (WER) metric which was measured to be around 20% in
our previous experiments.
Retrieval part is evaluated via precision-recall rates and Fmeasure, which are calculated as follows: Given Q queries,
let the reference transcriptions include R(q) occurrences of the
query q, A(q) be the total number of retrieved documents and
C(q) be the number of correctly retrieved documents. Then:
Recall =

P = 90.9
R = 71.2
F = 80.3

Figure 5: Precision-Recall for word-lattice and one-best hypotheses when no limit is set on maximum number of retrieved

2.4.1. Evaluation

1 X C(q)
Q q=1 A(q)



2.4. Experiments and results

Precision =







Table 1: Comparison of lattice and one-best ASR outputs on
maximum F-measure performance

2 × Precision × Recall
Precision + Recall
Evaluation is done over 15 news videos, each with an approximate duration of 10 minutes. Our query set consists of
F =


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

3.1. Sliding Text Properties
The second modality consists of the overlaid sliding text, which
includes simultaneous transcription of the speech. We work with
20 DivX r encoded color videos, having two different resolutions (288x352 - 288x364), and 25 fps sampling rate. The sliding text band at the bottom of the screen contains a solid background and white characters with a constant font. Sliding text
speed does not change considerably throughout the whole video
sequence (4 pixels/frame). Therefore, each character appears on
the screen for at least 2.5 seconds. An example of a frame with
sliding text is shown in Fig. 6.

Figure 7: Binary image with horizontal and vertical projection

distinct parts of a single character to be combined, which complicates the segmentation of text into characters. We apply morphological opening with a 2x2 structuring element to remove
such effects of noise. To further smooth the appearance of characters, we horizontally align binary text images of the frames
between two samples, and for each pixel position, decide on a 0
or 1, by voting.
Once the text image is obtained, vertical projection histogram
is calculated again to find the start and end positions of every
character. Our algorithm assumes that two consecutive characters are perfectly separated by at least one black pixel column. Words are segmented by looking for spaces which exceed
an adaptive threshold, which takes into account the outliers in
character spacing. To achieve proper alignment, only complete
words are taken for transcription at each sample frame.
Each character is individually cropped from the text figure
and saved, along with its start and end horizontal pixel positions.

Figure 6: Frame snapshot of the broadcast news video

3.2. Baseline Method
Sliding text information extraction consists of three steps: extraction of the text line, character recognition and temporal alignment.
3.2.1. Text Line Extraction

3.2.2. Character Recognition

Since the size and position of the sliding text is constant throughout the video, it is deternined at the first frame and used in the
rest of the operations. To find the position of the text, first, we
convert the RGB image into a binary image, using grayscale
quantization and thresholding with the Otsu method [10]. Then
we calculate horizontal projection histogram of the binary image, i.e., the number of white pixels for each row. The text band
appears as a peak on this representation, separated from the rest
of the image. We apply a similar technique over the cropped text
band, this time on the vertical projection direction, to eliminate
the program logo. The sliding text is bounded by the remaining box, whose coordinates are defined as the text line position.
Fig. 7 shows an example of binary image with its horizontal and
vertical projection histograms.
Since there is redundant information in successive frames,
we do not extract text information from every frame. Experiments have shown that updating the text transcription once in
every 10 frames is optimal for achieving sufficient recognition
accuracy. We call these the ”sample frames”. The other frames
in between are used for noise removal and smoothing.
Noise in binary images stems mostly from quantization operations in color scale conversion. Considering the low resolution of our images, noise can cause two characters, or two

Since the font of the sliding text is constant in all videos, template matching method is implemented for character recognition. Normalized Hamming distance is used to compare each
binary character image to each template. The total number of
matching pixels are divided by the size of the character image
and used as a normalized similarity score: Let nij be the total
number of pixel positions, where binary template pixel has value
i and character pixel has value j. Then, the score is formulated
n00 + n11
sj =
n00 + n01 + n10 + n11
The character whose template gets the best score is assigned
to that position. Matched characters are stored as a string. Fig. 8
depicts the first three sample text band images, their transcribed
text and corresponding pixel positions.
3.2.3. Temporal Alignment
The sliding text is interpreted as a continuous band throughout the video. Since we process only selected sample frames,
the calculated positions of each character should be aligned in
space (and therefore in time) with their positions from the previous sample frame. This is done using frame shift and pixel


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008
¨ U),
¨ or
of some Turkish characters, such as erasing dots (˙I, O,
pruning hooks (C
¸ , S¸). Fig. 9 shows examples of such characters, which, after noise cancellation operations, look very much
the same. A list of top 10 commonly confused characters, along
with their confusion rates, is given in Table 2.

Figure 9: Examples to commonly confused characters
Table 2: Top 10 commonly confused characters


Figure 8: First three sample frames and their transcription results
shift values. For instance in Figure 8, the recognized ”G” which
appears in positions 159-167 in the first sample (a) and the one
in 119-127 of the second (b) refer to the same character, since
we investigate each 10th frame with 4 pixels of shift per frame.
Small changes in these values (mainly due to noise) are compensated using shift-alignment comparison checks. Therefore, we
obtain a unique pixel position (start-end) pair for each character
seen on the text band.
The characters in successive samples, which have the same
unique position pair may not be identical, due to recognition errors. Continuing the previous example, we see that the parts of
figures which corresponds to the positions in boldface are recognized as ”G”, ”G” and ”O”, respectively. The final decision is
made by majority voting; the character that is seen the most is
assigned to that position pair. Therefore, in our case, we decide
on ”G” for that position pair.



Rate (%)

As it can be seen in Table 2, the highest confusion rates
occur in the discrimination of similarly-shaped characters and
numbers, and Turkish characters with diacritics.
3.3. Improvements over the Baseline
We made two major improvements over the baseline system, to
improve recognition accuracy. The first one is to use Jaccard’s
coefficient as template match score. Jaccard uses pixel comparison with a slightly different formulation. Following the same
notation as in 3.2.2, we have [11]:
sj =

3.2.4. Baseline Performance

n11 + n10 + n01


A second approach to improve recognition accuracy is correcting the transcriptions using Transformation Based Learning
[12], [13].
Transformation Based Learning (TBL) is a supervised classification method, commonly used in natural language processing. The aim of TBL is to find the most common changes on a
training corpus and construct a rule set which leads to the highest classification accuracy. A corpus of 11M characters, collected from a news website, is used as training data for TBL.
The labeling problem is defined as choosing the most probable
character among a set of alternatives. For the training data the
alternatives for each character are randomly created using the
confusion rates presented in Table 2. Table 3 presents some of
the learned rules and their rankings. For example, in line 2, the
letter ”O”, which starts a new word and is succeeded by an ”E”
is suggested to be changed to a ”G”.
The joint effect of Jaccard’s score for comparison and TBL
corrections on the transcribed text lead to 98% of character and
89% of word recognition accuracies, respectively.

We compare the transcribed text with the ground truth data and
use character recognition and word recognition rates as performance criteria. Even if only one character of a word is misrecognized, we label this word as erroneous.
Applying the baseline method on the same corpora as speech
recognition, we achieved 94% character recognition accuracy,
and 70% word recognition accuracy.
3.2.5. Discussions
One of the most important challenges of character recognition
was the combined effect of low resolution and noise. We work
with frames of considerably low resolution, therefore, each character covers barely an area of 10-14 pixels in height and 2-10
pixels in width. In such a small area, any noise pixel distorts the
image considerably, thus making it harder to achieve a reasonable score by template comparison.
Noise cancellation techniques, described in subsection 3.2.1,
created another disadvantage since they removed distinctive parts


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008


Table 3: Learned rules by TBL training
Rule Rank




to be Changed

4.1. Skin color detection

Changed with

Skin color is widely used to aid segmentation in finding parts
of the human body [14]. We learn skin colors from a training
set and then create a model of them using a Gaussian Mixture
Model (GMM).

3.4. Integration of Sliding Text Recognition and Spoken Term
The speech and sliding text modalities are integrated via a cascading strategy. Output of the sliding text recognition is used
like a verifier on the spoken term detection.

Figure 11: Examples from the training data

The cascade is implemented as follows: In the STD part, the
utterances whose relevance scores exceed a particular threshold
are selected, as in section 2.3. In the sliding text part, spoken
term detection hypotheses are checked with sliding text recognition. Using the starting time information provided by STD,
the interval of starting time ± 4 seconds is scanned on the sliding text output. The word which is closest to the query (in terms
of normalized minimum edit distance) is assigned as the corresponding text result. The normalized distance is compared to
another threshold. Those below the distance threshold are assumed to be correct and returned to the user.

4.1.1. Training of the GMM
We prepared a set of training data by extracting images from
our input video sequences and manually selecting the skin colored pixels. In total, we processed six video segments. Some
example training images are shown in Fig. 11. We use the RGB
color space for color representation. In the current videos that
we process, there are a total of two speakers/signers. RGB color
space gives us an acceptable rate of detection with which we
are able to perform an acceptable tracking. However, we should
change the color space to HSV when we expect a wider range
of signers, from different ethnicities [15].
The collected data are processed by the Expectation Maximization (EM) algorithm to train the GMM. After inspecting the
spatial parameters of the data, we decided to use a five Gaussian
mixtures model.



Recall (%)




P = 98.5
R = 56.8

only speech
sliding text aided



P = 97.5
R = 50.8

Precision (%)



Figure 10: Precision-Recall curves using only speech information and using both speech and sliding text information

Evaluation of the sliding text integration is done as explained
in section 2.4.1. The resulting precision recall curves are shown
in Figure 10. To obtain the text aided curve, probability threshold in STD is set to 0.3 (determined empirically) and the threshold on the normalized minimum edit distance is varied from 0.1
to 1. Sliding text integration improves the system performance
by 1% in maximum precision, which is a considerable improvement in the high precision region. Using both text and speech,
the maximum attainable precision is 98.5%. This point also has
a higher recall than maximum precision of only speech.

Figure 12: Skincolor probability distribution. There are two
probability levels visible. The outside layer corresponds to probability of 0.5 and the inner layer correresponds to probability of
The straightforward way of using the GMM for segmentation is to compute the probability of belonging to a skin segment
for every pixel in the image. This computation takes a long time,
provided the information we have is the mean, variance and gain


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

of each Gaussian in the mixture. To reduce the time complexity of this operation, we have precomputed the probabilities and
used table look-up.
The values of the look-up table are computed from the probability density function given by the GMM and we obtain a 256
x 256 x 256 look-up table containing the probabilities of all the
colors from RGB color space. The resulting model can be observed in Fig. 12.

4.1.3. Segmentation
The segmentation is straightforward. We compare the color of
each pixel with the value in the look-up table and decide whether
it belongs to the skin segment or not. For better performance,
we blur the resulting probability image. That way, the segments
with low probability disappear and the low probability regions
near the high probability regions are strengthened. For each
frame, we create a mask of skin color segments by thresholding the probability image (See Fig. 14) to get a binary image
that indicates skin and non-skin pixels.

4.1.2. Adaptation of general skin color model for a given
video sequence
To handle different illuminations and different speakers accurately, we need to adapt the general look-up table, modeled from
several speakers, to form an adapted look-up table for each video.
Since manual data editing is not possible at this stage, a quick
automatic method is needed.

Figure 14: The process of image segmentation. From left to
right: original image, probability of skin color blurred for better
performance, binary image as a result of thresholding, original
image with applied mask.

4.2. Hand and face tracking
The movement of the face and the hands is an important feature
of sign language. The trajectory of the hands gives us a good
idea about the performed sign and the position of the head is
mainly used for creating a local coordinate system and for normalizing the hand position.
From the resulting bounding rectangle of the face detector
[16] we fit an ellipse around the face. The center of the ellipse is
assumed to be the center of mass of the head. We use this point
as the head’s tracked position.
For hand tracking, we aim to retrieve the hands from the
skin color image as separate blobs [17]. We define a blob as a
connected component of skin color pixels. However, if there are
skin color-like parts in the image, the algorithm retrieves some
blobs which do not belong to the signer. That means the blob
is neither the signer’s head nor the signer’s hand. These blobs
should be eliminated for better performance.
Figure 13: Examples of different lighting conditions resulting
in different skin color of the speaker and the corresponding
adapted skincolor probability distribution.

4.2.1. Blob filtering
We eliminate some of the blobs with the following rules:
• The hand area should not be very small. Eliminate the
blobs with small area.

We detect the face of the speaker [16] and store it in a separate image. This image is then segmented using the general
look up table. We generate a new look-up table by processing
all the pixels of the segmented image and store the pixels’ color
in the proper positions in the new look-up table. As the information from one image is not sufficient, there may be big gaps
in this look-up table. These gaps are smoothed by applying a
convolution with a Gaussian kernel. To save time, each color
component is processed separately. At the end of this step, we
obtain a specialized look-up table for the current video.
The general look-up table and the new specialized look up
table are combined by weighted averaging of the two tables.
The speaker is given more weight in the weighted averaging.
This way, we eliminate improbable colors from the general look
up table thus improving its effectiveness. Some examples of
adapted look-up tables can be seen in Fig. 13.

• Head and the two hands are the three biggest blobs in the
image, when there is no occlusion. Use only the three
biggest blobs.
• Eliminate blobs whose distances from the previous positions of already identified blobs exceeds a threshold. We
use a multiple of the signer’s head width as threshold.
At the end of filtering we expect to have blobs that correspond to the head and hands of the signer. However, when hands
occlude each other or the head, two or three objects form a single blob. We need to somehow know which objects of interest
are in occlusion or whether they are so close to each other that
they form a single blob.


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

intersection of this region and the combined blob where the occlusion is. We use the estimated position of the hand as in Fig.
16. The parameters of the ellipse are taken from the bounding
ellipse of the hand. As a last step we collect a new template for
the hand.

4.2.2. Occlusion prediction
We apply an occlusion prediction algorithm as a first step in
occlusion solving. We need to predict whether there will be an
occlusion and among which blobs will the occlusion be. For this
purpose, we use a simple strategy that predicts the new position,
pt+1 , of a blob from its velocity and acceleration. The velocity
and acceleration are calculated using:


pt − pt−1
vt − vt−1
pt + vt + 0.5 ∗ at


The size of the blob is predicted with the same strategy.
With the predicted positions and sizes, we check whether these
blobs intersect. If there is an intersection, we identify the intersecting ones and predict that there will be an occlusion between
those blobs.

Figure 16: An example of hand and head occlusion. From left
to right: original image, image with the ellipse drawn, result of

4.2.3. Occlusion solving

Both hands occlude the head. In this case, we apply the
template matching method, as described above, to each of the

We solve occlusions by first finding out the part of the blob that
belongs to one of the occluding objects and divide the blob into
two parts (or three in the case that both hands and head are in
occlusion). Thus we always obtain three blobs which can then
be tracked (unless one of the hands is out of the view). Let us
describe the particular cases of occlusion.
Two hands occlude each other. In this case we separate the
combined blob into two blobs with a line. We find the bounding
ellipse of the blob of the occluded hands. The minor axis of the
ellipse is computed and a black line is drawn along this axis,
forming two separate blobs (see Fig. 15). This approach only
helps us to find approximate positions of the hand blobs in case
of occlusion. The hand shape information on the other hand is
error-prone since the shape of the occluded hand can not be seen

4.2.4. Tracking of hands
The tracking starts by assuming that, for the first time a hand is
seen, it is assigned as the right hand if it is on the right side of
the face and vice versa for the left hand. This assumption is only
used when there is no hand in the previous frame.
After the two hands are identified, the tracking continues
by considering the previous positions of each hand. We always
assume that the biggest blob closest to the previous head position belongs to the signer’s head. The other two blobs belong to
the hands. Since the blobs that are far away from the previous
positions of hands are eliminated at the filtering step, we either
assign the closest blob to each hand or if there are no blobs, we
use the previous position of the corresponding hand. This is applied to compensate the mistakes of the skin color detector. If
the hand blob can not be found for a period of time (i.e half a
second), we assume that the hand is out of view.
4.3. Feature extraction
Features need to be extracted from processed video sequences
[19] for later stages such as sign recognition, or clustering. We
choose to use the hand position, motion and simple hand shape
The output of the tracking algorithm introduced at the previous section is the position and a bounding ellipse of the hands
and the head during the whole video sequence. The position
of the ellipse in time forms the trajectory and the shape of the
ellipse gives us some information about the hand or head orientation. The features extracted from the ellipse are its center of mass coordinates, x, y, width, w, height, a, and the angle between ellipse’s major axis and the x-axis, a. Thus, the
tracking algorithm provides five features per object of interest,
x, y, w, h, a, for each of the three objects, 15 features in total.
Gaussian smoothing of measured data in time is applied to
reduce noise. The width of the Gaussian kernel is five, i.e. we
calculate the smoothed value from two previous, present and
two following frames.
We normalize all the features such that they are speaker independent and invariant to the source resolution. We define a
new coordinate system. The average position of the head center
is the new origin and the average width of the head ellipse is the
new unit. Thus the normalization consists of translation to the

Figure 15: An example of hand occlusion. From left to right:
original image, image with the minor axis drawn, result of tracking.
One hand occludes the head. Usually the blob of the head
is much bigger than the blob of the hand. Therefore the same
strategy as in the two-hand-occlusion case would have an unwanted effect: there would be two blobs with equal sizes. Instead, we use a template matching method [18] to find the hand
template, collected at the previous frame, in the detected blob
region. The template is a gray scale image defined by the hand’s
bounding box. When occlusion is detected, a region around the
previous position of the hand is defined. We calculate the correlation
R(x, y) =
(T (x′ , y ′ ) − I(x + x′ , y + y ′ ))2


where T is the template, I is the image we search in, x and y
are the coordinates in the image, x′ and y ′ are the coordinates
in the template. We search for the minimum correlation at the


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

sign. The starting border of the sign has to be moved backwards
in order to compensate for the delay between speech and signing. A typical delay of starting instant is about 0.2 seconds backwards. In some cases, the sign was delayed against the speech,
so it’s suitable to move the ending instant of the sign forward.
To keep our sequence as short as possible, we shifted the ending instant about 0.1 second forward. If we increase the shift of
the beginning or the ending border, we increase the probability
that the whole sign is present in the selected time interval, at the
expense of including parts of previous and next signs into the

Left hand features behaviour in time


position / size

smoothed X
smoothed Y
smoothed width
smoothed height
measured X
measured Y
measured width
measured height







frame number

Figure 17: Smoothed features - example on 10 seconds sequence
of four left hand features (x and y position, bounding ellipse
width and height)

Figure 19: Timeline with marked borders from the speech recognizer

4.4.1. Distance Calculation

new origin and scaling. Figure 18 shows the trajectories of head
and the hands in the normalized coordinate system for a video
of four seconds.

We take two signs from the news whose accompanying speech
segments were recognized as the same word. We want to find
out whether those two signs are the same or are different, in case
of homonyms or mistakes of the speech recognizer. For this
purpose we use clustering. We have to consider that we do not
know the exact borders of the sign. After we extend the borders
of those signs, as described above, the goal is to calculate the
similarity of these two signs and to determine if they are two
different performances of the same sign or two totally different
signs. If the signs are considered to be the same, they should be
added to the same cluster.
We have evaluated our distance calculation methodology by
manually selecting a short query sequence, which contains a
single sign and searching for it in a longer interval. We calculate the distance between the query sequence and other equallength sequences which were extracted from the longer interval. Fig. 20 shows the distances between a 0.6 second-long
query sequence and the equal-length sequences extracted from
a 60 seconds-long interval. This way, we have extracted nearly
1500 sequences and calculated their distances to the query sequence. This experiment has shown that our distance calculation has high distances for different signs and low distances for
similar signs.

Head and hands trajectories - 4 seconds
left hand
right hand









Figure 18: Trajectories of head and hands in normalized coordinate system

distance of 0.6 second sequence and same length sequences
taken from 60 seconds (1500 frames) interval

The coordinate transformations are calculated for all 15 features, which can be used for dynamical analysis of movements.
We then calculate differences from future and previous frames
and include this in our feature vector:


(x(t) − x(t − 1)) + (x(t + 1) − x(t)) (9)
x(t + 1) −
x(t − 1)





In total, 30 features from tracking are provided, 15 smoothed
features obtained by the tracking algorithm and 15 differences.


4.4. Alignment and Clustering








Figure 20: Distance calculation between 0.6 seconds query sequence and equal-length sequences taken from 60 seconds interval

The sign news, which contains continuous sign speech, was split
into isolated signs by a speech recognizer (see Fig. 19). In the
case the pronounced word and the performed sign are shifted,
the recognized borders will not fit the sign exactly. We have examined that the spoken word usually precedes the corresponding

The distance of two sequences is calculated by tracking features (x and y coordinates of both hands), simple hand shape


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

Figure 21 shows the dendograms for two example signs,
‘¨ulke” and ‘bas¸bakan”. For sign ‘¨ulke”, nine of the retrievals
are put into the main cluster and one retrieval is put into another
cluster. That retrieval is a mistake of the speech recognizer and
the sign performed is completely different from the others. The
clustering algorithm correctly separates it from the other signs.
For sign ‘bas¸bakan”, three clusters are formed. The first cluster
is the one with the correct sign. The four retrievals in the second
and third clusters are from a single video where the tracking can
not be performed accurately as a result of bad skin detection.
The two signs are shown in Figure 22.

features (width and height of fitted ellipse) and approximate
derivatives of all these features. We calculate the distance of
two equal-length sequences as the mean value of squared differences between corresponding feature values in the sequences.
If we consider the feature sequence, represented as matrix S,
where s(m, n) is the n-th feature in the m-th frame, S1 and S2
represent two compared sequences, N is number of features and
M is number of frames, then the distance is calculated as:
D(S1 , S2 ) =

1 XX
(s1 (m, n) − s2 (m, n))2
M N m=1 n=1


The squared difference in Equation 11 suppresses the increase in the distance for small differences on one hand and emphasizes greater diferences on the other.
Inspired by the described experiment with searching for short
query sequence in longer interval, we extended the distance calculation for two different-length sequences. We take the shorter
of the two compared signs and go through the longer sign. We
find the time position where the distance between the short sign
and the corresponding same-length interval from the longer sign
is the lowest. The distance from this time position, where the
short sign and part of the longer sequence fit best, is considered
as the overall distance between the two signs.
We solve the effect of changes of the signing speed in the
following way: when the speed of the two same signs is different, the algorithm suppresses small variations and emphasizes larger variations as described before, so that the signs are
classified into different clusters. This feature is useful when
the desired behaviour is to separate even the same signs which
were performed at different speeds. Other techniques can be
used, such as multidimensional Dynamic Time Warping (DTW),
which can handle multidimensional time warped signals. Another approach for distance calculation and clustering is to use
a supervised method, such as Hidden Markov Models (HMM).
Every class (group of same signs) has one model trained from
manually selected signs. The unknown sign is classified into a
class which has the highest probability for features of the unknown sign. In addition, HMM can recognize sign borders in
the time interval and separate it from parts of previous and following signs which can be present in the interval. We plan to
use a supervised or a semi-supervised approach with labelled
ground truth data but currently we leave it as future work.


4.4.2. Clustering


To apply clustering for multiple signs, we use the following

Figure 21: Dendogram - grouping signs together at different
distances, for two example signs (a) ‘¨ulke”, and (b) ‘bas¸bakan”.
Red line shows the cluster separation level.

1. Calculate pairwise distances between all compared signs;
store those distances in an upper triangular matrix
2. Group the two most similar signs together and recalculate
the distance matrix. The new distance matrix will have
one less row and column and the scores of the new group
is calculated as the average of the scores of the two signs
in the new group.

The performance of the system is affected by several factors. In
this section we indicate each of these factors and present performance evaluation results based on ground truth data.
We extracted the ground truth data for a subset of 15 words.
The words are selected from the most frequent words in the news
videos. The speech retrieval system retrieves, at most, the best
10 occurences of the requested word. For each word we annotated each retrieval to indicate whether the retrieval is correct,
whether the sign is contained in the interval, whether the sign
is a homonym or a pronounciation difference, and whether the
tracking is correct.
For each selected word, 10 occurences are retrieved, except

3. Repeat step 2. until all signs are in one group
4. Mark the highest difference between two distances, at
which two signs were grouped together in previous steps,
as the distance up to which the signs are in the same cluster (see Fig. 21).
We set the maximum number of clusters to three: one cluster for the main sign, one cluster for the pronounciation differences or the homonyms, and one cluster for the unrelated signs,
due to mistakes of the speech recognizer, or the tracker.


Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

The clustering accuracy for the selected signs is 73.3%. This
is calculated by comparing the clustering result with the ground
truth annotation. The signs that are different from the signs in
the other retrievals should form a separate cluster. In 7.5% of
the correct retrievals, the performed sign is different from the
other signs, which can be either a pronounciation difference or
a sign homonym. Similarly, if the tracking for that sign is not
correct, it can not be analyzed properly and these signs should
be in a different cluster. The results show that the clustering performance is acceptable. Most of the mistakes occur as a result
of the insufficient hand shape information. Our current system
uses only the width, the height and the angle of the bounding
ellipse of the hand. However this information is not sufficient
to discriminate different finger configurations, or orientations.
More detailed hand shape features should be incorporated into
the system for better performance.




Figure 22: Two example signs (a) ‘¨ulke”, and (b) ‘bas¸bakan”.

The user interface and the core search engine are separately located and the communication is done via TCP/IP socket connections. This design is also expected to be helpful in the future
since a web application is planned to be built using this service.
This design can be seen in the Fig. 24.

Table 4: System performance evaluation

Correct retrieval
Correct tracking
Interval contains
Ext. interval contains


# of



two of them where we retrieve eight occurences. Among these
146 retrievals, 134 of them are annotated as correct, yielding a
91.8% correct retrieval accuracy for the selected 15 words. The
summary of the results can be seen in Table 4.
In 51.5% of the correct retrievals, the sign is contained in
the interval. However, as explained in Section 4.4, prior to sign
clustering, we extend the beginning and the end of the retrieved
interval, 0.2 seconds and 0.1 seconds respectively, in order to
increase the probability of containing the whole sign. Figure 23
shows the sign presence in the retrieved intervals. The number
of retrievals that contain the sign when the interval is extended
in different levels is also shown. If we consider these extended
intervals, the accuracy goes up to 78.4%.
The tracking accuracy is 77.6% for all of the retrievals. We
consider the tracking is erroreneus even if one frame in the interval is missed.

Figure 24: System Structure
In Fig. 25, the screenshot of the user interface is shown.
There are five sections in it. The first is the ‘Search” section
where the user inputs a word or some phrases using the letters
in the Turkish alphabet and sends the query. The application
communicates with the engine on the server side using a TCP/IP
socket connection, retrieves data and processes it to show the
results in the ‘Search Results” section. Each query and its results
are stored with the date and time in a tree structure for later
referral. A ‘recent searches” menu is added to the search box
aiming to cache the searches and to decrease the service time.
But the user can clear the search history to retrieve the results
from the server again for the recent searches. The relevance of
the returned results (with respect to the speech) is shown using
stars (0 to 10) in the tree structure to inform the user about the
reliability of the found results. Moreover, the sign clusters are
shown in parentheses, next to each result.
When the user selects a result, it is loaded and played in
the ‘Video Display” section. The original news video or the
segmented news video are shown in this section according to
the user’s ‘Show segmented video” selection. The segmented
video is added separately to the fi

Dokumen yang terkait

Dokumen baru