Aksara Jawa Text Detection in Scene Images using Convolutional Neural Network

Muhammad Labiyb Afakh∗, Anhar Risnumawan†, Martianda Erste Anggraeni‡, Mohamad Nasyir Tamara§, and Endah Suryawati Ningrum¶
∗Computer Engineering Division, †§¶Mechatronics Engineering Division, ‡Department of Creative Multimedia
Politeknik Elektronika Negeri Surabaya (PENS)
Kampus PENS, Jalan Raya ITS Sukolilo, Surabaya, 60111 Indonesia
Email: [email protected], {anhar, nasir_meka, martianda, endah}@pens.ac.id

Abstract—Aksara Jawa is an ancient Javanese script that has been in use since the 17th century. It is mostly written on stone to record history or for naming, such as places, weddings, and tombstones. The script is, however, gradually being ignored by people, so it is extremely important to preserve this nearly lost cultural heritage. In this paper, as a step toward preservation and toward converting visual information into text, we develop an Aksara Jawa text detection system for scene images that employs a deep convolutional neural network to localize occurrences of Aksara Jawa text. The method differs from existing Aksara Jawa works, which employ manually hand-crafted features and explicitly learn a classifier: here the features and the classifier are learned jointly, with back-propagation obtaining all parameters simultaneously. A text confidence map is produced, from which bounding boxes are estimated and formed to indicate the occurrence of text lines. Experimental results show promising performance on both Aksara Jawa and English text.

Keywords—Aksara Jawa, text detection, scene image, deep learning, convolutional neural network, joint learning

I. INTRODUCTION

Converting visual information into text has gained wide attention due to its usefulness in many real-world applications, such as tourist navigation, spotting place names, robot navigation, assisting visually impaired people, and enhancing safe vehicle driving [1], [2]. Aksara Jawa detection is highly useful for a tourist who visits Java to navigate to a location or quickly grasp the point of an ancient script, as shown in Fig. 1. It can likewise be used by Javanese people to understand the script, for research or non-research applications. Thus, there is a genuine need to develop Aksara Jawa text detection, both to preserve this nearly lost piece of Javanese cultural heritage and to help people obtain the meaning of the ancient script.

Fig. 1. Aksara Jawa text in scene images. The background is usually complex: text written on stone, cluttered non-text objects, and regular roof patterns that are not easily distinguishable from the text.

Aksara Jawa is an ancient Javanese script that has been used since the 17th century, during the Mataram kingdom. Much of Indonesian history is written in this script, typically carved on stone. Some urban people still use the script, for example for place names, tourist spots, weddings, and tombstones. Today, this cultural heritage is slowly being ignored as people become more educated. To preserve it, an effort toward digitizing Aksara Jawa text is extremely important.

Existing Aksara Jawa works [3]–[7] generally utilize many features that are cleverly integrated, followed by a classifier to distinguish text from non-text, and then post-processing such as bounding box formation. This is done to obtain a more discriminative system. Each feature is individually designed and manually tuned before being connected to a classifier and post-processing. The features are thus prone to error due to imperfections of the manual design, and they require tedious hand-tuning before the desired results are obtained. Moreover, the features and the classifier have only a one-way connection: the features influence the classifier, but not the other way round.

Toward the goal of digitizing Aksara Jawa text, in this work we consider detecting Aksara Jawa text in scene images. Differing from the setting of the existing Aksara Jawa works [3]–[7], where most characters typically lie on uniform backgrounds (e.g., a white background, as in documents), detecting Aksara Jawa text in scene images is more challenging due to the complexity of the background and the high variation of fonts, sizes, and colors. Despite this complexity, Aksara text has to be robustly detected, which we indicate by forming bounding boxes on the detected text.

In this paper, to solve those problems we employ a deep convolutional neural network (CNN) [8] to localize Aksara Jawa text in scene images. This method mainly differs from the existing works with manually hand-crafted features and an explicitly learned classifier in that it has a dual connection between features and classifier: the features and the classifier are jointly learned, with back-propagation obtaining all parameters simultaneously. A text confidence map is built from the input image using the trained CNN text detector. Then, bounding boxes are estimated and formed to indicate the occurrence of text lines. The experimental results of our proposed method show promising performance on both Aksara Jawa and English text.


Aksara Jawa has 20 base characters (ngelegena script), 20 paired characters that close vowels, 8 main characters (Murda script, some unpaired), 8 pairs of main characters, 5 rekan characters and their pairs, several sandhangan as vowel modifiers, some special characters, punctuation marks, and writing marks. Some of these characters are shown in Fig. 3. Unlike English text, Aksara Jawa has marks above and below the characters, which makes detection relatively difficult. We found that these marks can deteriorate detection accuracy, as they produce many bounding boxes for a single text line.

Conventional detection algorithms classify a rectangular region u as text or non-text. Those methods can be summarized as follows: a set of hand-crafted features Q(u) = (Q_1(u), Q_2(u), ..., Q_N(u)) is extracted, and a binary classifier k_l is then learned for each label l ∈ L = {text, non-text}, giving a posterior probability distribution P(l|u) = k_l(Q(u)) over the labels given the input. The classifier is a separate stage from the features. The primary objective is then to detect the label l contained in region u such that l = argmax_{l∈L} P(l|u).

Meanwhile, a CNN consists of multiple layers of features that are intertwined together. A convolutional layer consists of N linear filters followed by a non-linear activation function h. The convolution produces a feature map f_c^m(x, y), which is the input to the next convolutional layer, where (x, y) ∈ S_m are spatial coordinates on layer m and c indexes the channel.
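As a concrete illustration, the conventional pipeline described above (hand-crafted features Q(u) feeding a separately learned binary classifier k_l, with the label chosen as l = argmax P(l|u)) can be sketched as follows. This is a minimal sketch: the two features and the logistic-regression parameters are hypothetical stand-ins, not the actual features used by [3]–[7].

```python
import numpy as np

def extract_features(u):
    """Hand-crafted feature vector Q(u) = (Q_1(u), ..., Q_N(u)) for a patch.

    The two features here (mean intensity and mean gradient magnitude)
    are illustrative stand-ins for the many manually designed features a
    conventional pipeline would use."""
    gy, gx = np.gradient(u.astype(float))
    return np.array([u.mean() / 255.0, np.hypot(gx, gy).mean() / 255.0])

def posterior(q, w, b):
    """Binary classifier k_l giving P(l|u) = k_l(Q(u)) for l in {text, non-text}."""
    p_text = 1.0 / (1.0 + np.exp(-(w @ q + b)))  # logistic regression
    return {"text": p_text, "non-text": 1.0 - p_text}

# Hypothetical trained parameters; in the conventional pipeline these are
# learned separately from (and after) the feature design.
w, b = np.array([1.5, 4.0]), -2.0

u = np.random.default_rng(0).integers(0, 256, size=(32, 32))
p = posterior(extract_features(u), w, b)
label = max(p, key=p.get)  # l = argmax_{l in L} P(l|u)
print(label, p[label])
```

Because the features are fixed before the classifier is trained, any error in their manual design propagates to the classifier, which is exactly the one-way connection criticized in the text.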

Multiple classifiers connected in series are a common way to increase the performance of a detection system. Examples of such works are [22], [23], which utilize cascaded classifiers to boost accuracy.

In this work, a CNN is employed to learn the feature representation, optimizing the features and the classifier simultaneously. This is in contrast to existing manual feature design, which is optimized through a tedious trial-and-error process that tunes the features and the classifier individually.

  B. Text Detector CNN

The rest of this paper is organized as follows. Section II describes related work. Section III explains the methodology: III-A describes the Aksara Jawa characters, III-B the CNN structure, and III-C the bounding box formation. Sections IV and V describe the experiments and the conclusion, respectively.

II. RELATED WORK

Feature extraction based on image segmentation and skeletonization has been done by [3] for recognizing Aksara Jawa text in documents. After skeletonization, post-processing is applied, covering simple closed curves, straight lines, and curves.

III. AKSARA JAWA TEXT DETECTION USING CNN

An overall overview of the proposed system is shown in Fig. 2: the input image is rescaled to multiple scales, passed through the CNN, and post-processed. Given an input scene image, a response map is computed by the trained CNN, and bounding boxes are formed to indicate the detected text lines using non-maximal suppression. One can see that Aksara Jawa characters have relatively similar shapes, which can cause false detections, for example due to the regular pattern of a roof. This differs from English text detection works such as [20], [21], which are based on connected component analysis.

The above existing Aksara Jawa text works [3]–[7] have focused on document analysis, where the text is usually monotone and written on a uniform background such as white paper. Moreover, the features tend to be manually designed with many heuristic rules, followed by a classifier to distinguish text from non-text. The features are thus prone to error due to imperfections of the manual design and require tedious hand-tuning before the desired results are obtained. Moreover, the features and the classifier have only a one-way connection: the features influence the classifier, but not the other way round. In this work, by leveraging a deep network and a sliding-window technique, we show it is possible to solve those problems and build a robust Aksara Jawa text detection system for scene images.

Deep learning [9] has recently shown excellent results in various visual perception tasks, such as image classification [10], [11], image segmentation [12]–[14], and object detection [15]–[19]. Deep learning methods such as convolutional neural networks have the ability to automatically learn effective feature representations from the visual input, which makes them well suited to most visual detection tasks. The features are learned automatically, without manual design, so the parameters can be learned purely from the data.

A template matching technique was used by [7]. It is unclear how that work tests the performance of the system, as template matching has a well-known generalization problem; the method might not work on even slight variations of Aksara characters.

Image processing techniques such as character segmentation, normalization, grayscaling, and binarization have been employed by [6], together with a neural network classifier trained using standard back-propagation. The method, however, contains many heuristic rules that are difficult to satisfy in practice, and its effectiveness for Aksara Jawa detection is also unclear.

The Directional Element Feature (DEF) has been employed by [4], with an SVM classifier, to recognize Aksara Jawa text in documents. DEF is built by counting up edge neighborhood elements in each character; this feature becomes the input to an SVM classifier. It is, however, unclear how the method deals with noise, for example as the characters vary.

Several experiments were conducted using various parameters and Javanese characters in order to obtain the optimal parameters. In practice, this method is difficult to implement, as it contains many parameters.

  A. Aksara Jawa character

Pooling operates on a local neighborhood contained in the c-th channel of the feature maps. The process starts with f^1 equal to the resized box region u, performs convolution layer by layer, and ends by connecting the last feature map to a logistic regressor for classification, which yields the output probability of the correct label. All the model parameters are learned simultaneously, solely from the training data, commonly using Stochastic Gradient Descent (SGD) and back-propagation to minimize the loss over the training labels.

Fig. 3. Aksara Jawa base characters and vowels.

In order to localize text in scene images, a sliding window is applied to the multiscale input image. Each window is classified by our trained CNN, yielding a text confidence map. Based on this map, bounding boxes are estimated to acquire candidate text lines. The multiscale input is constructed at scales ranging from 10% to 150% of the original input image size, in steps of 10%.
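The multiscale sliding-window procedure above can be sketched as follows. The window size of 32 matches the CNN input patch and the 10%-150% scale range follows the text, while the stride of 8 pixels and the nearest-neighbour resize are assumed details not stated in the paper.

```python
import numpy as np

def sliding_windows(img, win=32, stride=8, scales=None):
    """Yield (scale, x, y, patch) for a multiscale sliding window.

    scales defaults to 10%..150% of the original size in 10% steps;
    win=32 matches the CNN input patch; stride=8 is an assumed step."""
    if scales is None:
        scales = [s / 100.0 for s in range(10, 151, 10)]
    for s in scales:
        h = max(win, int(img.shape[0] * s))
        w = max(win, int(img.shape[1] * s))
        # Nearest-neighbour resize via index sampling (a stand-in for a
        # proper image-resize routine).
        ys = (np.arange(h) * img.shape[0] / h).astype(int)
        xs = (np.arange(w) * img.shape[1] / w).astype(int)
        scaled = img[ys][:, xs]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                yield s, x, y, scaled[y:y + win, x:x + win]

img = np.zeros((100, 160), dtype=np.uint8)
patches = list(sliding_windows(img))
print(len(patches))  # number of candidate windows over all scales
```

Each yielded 32x32 patch would be classified by the trained CNN, and its score written back to the confidence map at the corresponding scale and position.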

Fig. 2. Overview of the Aksara Jawa text detection system on an input scene image: (a) block diagram of the overall system, (b) input image, (c) response map, (d) forming bounding boxes. The response map is shown as the probability output [0, 1] from the CNN, ranging from blue (lowest) through yellow to red (highest). Note that most of the probability values spread around the detected text line. Best viewed in color.

We employ the CNN architecture shown in Fig. 4. Each sliding window generates an image patch u ∈ R^{32×32}, which is used as the input. First, the input u is normalized by its mean and variance and then fed to two convolutional and pooling layers. Interestingly, this normalization is important for reducing the loss, as the input becomes centered and bounded to [0, 1]. The numbers of filters in the first and second layers are N_1 = 96 and N_2 = 128, respectively. A ReLU activation layer is intertwined with each convolutional layer and with the first fully-connected (FC) layer. ReLU simply passes a feature map value if it is not less than zero, h(f) = max{0, f}. In practice, ReLU helps to increase the convergence rate of learning compared to the sigmoid function, which can easily deteriorate the gradient.

In our experiments, we found empirically that a sigmoid function after the last fully-connected layer yields better performance. This is because the output is then bounded within the labels, {0, 1} for non-text and text, respectively. Therefore, we apply a sigmoid activation function followed by a softmax loss layer after the last fully-connected layer. At test time, the network provides the probability output value of each label.

Here f_c^m indicates the c-th channel feature map on layer m. A new feature map f^{m+1} is produced after each convolutional layer such that

    f_n^{m+1} = h^m(g_n^m), where g_n^m = W_n^m ∗ f^m + b_n^m,    (1)

and g_n^m, W_n^m, and b_n^m indicate the n-th net input, filter kernel, and bias on layer m, respectively. Normalization, subsampling, and pooling, which are usually intertwined with convolutional layers, are used to build translation invariance over local neighborhoods. This work uses an activation layer h^m such as the Rectified Linear Unit (ReLU), h^m(f) = max{0, f}.

Existing Aksara Jawa text works [3]–[7] extract several features from the input image in the hope of providing richer information to the classifier for a better representation; the classifier then predicts the correct label. Those classifiers basically contain complex non-linear transformations, such as Support Vector Machines, which map the features into a kernel space or onto a higher dimension in order to further apply linear operations.

The above existing frameworks can be viewed as a kind of CNN [24], where each stage corresponds to a layer. The extracted features can be represented by feature maps, including their channels, and the non-linear transformation as a convolutional operation as in Eq. 1, containing a non-linear activation of the convolution between the filter kernel and the feature maps. The main difference in a CNN, however, is that feature integration, pooling, and non-linear transformation are all served within single layers. The computation is more efficient and fully feed-forward, so the method can optimize the whole input-to-output mapping, comprising both the features and the classifier, and a new layer can easily be intertwined for a more discriminative system.

Fig. 4. Structure of the convolutional neural network employed in this work: input patch 32×32, convolution 25×25×96, pooling 5×5×96, convolution 4×4×128, pooling 2×2×128, fully connected 1×1×256, fully connected, text/non-text output.
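A minimal sketch of the convolutional layer of Eq. 1, assuming 'valid' convolution with unit stride and ReLU as the activation h; it is implemented as cross-correlation, as is conventional in CNN code, and the filter counts and sizes below are illustrative rather than the paper's exact configuration.

```python
import numpy as np

def conv_layer(f, W, b):
    """One convolutional layer as in Eq. 1:
    f^{m+1}_n = h(g^m_n), with g^m_n = W^m_n * f^m + b^m_n and h = ReLU.

    f: (C, H, W) feature maps of layer m; W: (N, C, k, k) filter kernels;
    b: (N,) biases. 'Valid' convolution with no stride, for clarity."""
    C, H, Wd = f.shape
    N, _, k, _ = W.shape
    out = np.empty((N, H - k + 1, Wd - k + 1))
    for n in range(N):
        for y in range(H - k + 1):
            for x in range(Wd - k + 1):
                # g^m_n at (y, x): filter applied over all input channels
                out[n, y, x] = np.sum(W[n] * f[:, y:y + k, x:x + k]) + b[n]
    return np.maximum(out, 0.0)  # h(f) = max{0, f}

rng = np.random.default_rng(1)
f1 = rng.standard_normal((1, 32, 32))  # f^1: the resized box region u
f2 = conv_layer(f1, rng.standard_normal((4, 1, 5, 5)), np.zeros(4))
print(f2.shape)  # (4, 28, 28)
```

Stacking such layers, with pooling in between, and training all W and b jointly by SGD and back-propagation is what gives the CNN its two-way feature-classifier connection.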

Fig. 5. Precision score comparison for each image: our method vs. the previous methods.

Fig. 6. Recall score comparison for each image: our method vs. the previous methods.

C. Generating Bounding Boxes

From the response maps, we apply non-maximal suppression (NMS), as in [25], to form bounding boxes. More specifically, for each row in the response maps, the possible overlapping line-level bounding boxes are scanned iteratively and scored by the probability contained in the response maps. NMS is then applied again to remove overlapping bounding boxes.
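The second NMS pass over candidate boxes can be sketched as a standard greedy suppression; the IoU overlap threshold of 0.3 is a hypothetical value, as the paper does not state one.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximal suppression over (x1, y1, x2, y2) boxes.

    Repeatedly keeps the highest-scoring remaining box and discards any
    box overlapping it by more than iou_thresh (an assumed threshold)."""
    order = np.argsort(scores)[::-1]  # indices, best score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        # Intersection of box i with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the overlapping second box is suppressed
```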

IV. EXPERIMENTAL RESULTS

The CNN is trained using the well-known ICDAR 2003 training set and synthetic data from [26], mixed with Aksara Jawa font text and scene images collected from the internet. We collected 90 scene images containing Aksara Jawa; 60 images are used for training and 30 for testing.

Some of the training data is used as a validation set. We set the parameters as follows: the maximum iteration count is 450 thousand, the momentum is 0.9, the SGD learning rate is 0.05, and every 100 thousand iterations the learning rate is multiplied by 0.1. The validation set is tested every 1000 iterations. On a desktop PC with a Core i5, 4 GB of RAM, and a GPU with 1 GB of memory, learning takes about one week. Performance is measured using precision and recall as in [19]. Precision is the ratio of correctly predicted bounding boxes to all predicted bounding boxes; recall is the ratio of correctly predicted bounding boxes to ground-truth bounding boxes. Predicting many bounding boxes therefore tends to produce low precision and high recall, whereas predicting few boxes tends to produce high precision and lower recall. The best performance is a good balance between high precision and high recall.
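The precision and recall measures described above can be computed as follows. The IoU threshold of 0.5 for matching a predicted box to a ground-truth box is an assumption, as the text does not state the matching criterion.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + \
            (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def precision_recall(pred, gt, thresh=0.5):
    """Precision = correct / predicted, recall = correct / ground truth.

    A prediction is correct if it overlaps a not-yet-matched ground-truth
    box with IoU >= thresh (0.5 is an assumed matching threshold)."""
    matched, correct = set(), 0
    for p in pred:
        for j, g in enumerate(gt):
            if j not in matched and iou(p, g) >= thresh:
                matched.add(j)
                correct += 1
                break
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gt) if gt else 0.0
    return precision, recall

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(1, 1, 10, 10), (40, 40, 50, 50)]
print(precision_recall(pred, gt))  # (0.5, 0.5): one hit, one miss, one missed GT
```

This makes the trade-off described above concrete: adding spurious predictions lowers precision without raising recall, while missing ground-truth boxes lowers recall.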

Since the existing Aksara Jawa works [3]–[7] provide no ready software for comparison, we implemented code representing their algorithms, that is, many cleverly integrated features followed by a classifier to discriminate text from non-text and then post-processing such as bounding box formation. More specifically, connected components are extracted from the gray image using the well-known MSER detector. Trivial non-text components are removed using geometrical properties: stroke width, aspect ratio, eccentricity, Euler number, extent, and solidity. The remaining non-trivial non-text components are further removed using an SVM classifier. Finally, post-processing merges the text components to form text-line bounding boxes.

The performance of our method is investigated in Figs. 5 and 6. Our method detects the Aksara Jawa text lines fairly well, as indicated by high precision and recall for each image. On the contrary, the previous methods based on manual features and a classifier seem to predict quite a number of bounding boxes, resulting in lower precision. This is in line with the qualitative results shown in Fig. 7. It could be attributed to the many extracted components, most of which are non-text while only a few are text components. Thus,

characters could be touching their neighboring components, and the character stroke widths are likely too small. These lead to misclassification of text, since connected component analysis tends to inspect a component's geometrical structure. Table I shows the overall precision and recall scores. It is interesting to note that our method shows higher precision without degrading recall much, while the previous works show high recall relative to their precision. This is because the previous works estimate quite a number of bounding boxes, and the method may not be precise enough to predict the occurrence of the bounding boxes.

TABLE I. OVERALL PRECISION AND RECALL

Method                                        Precision  Recall
Our Method                                    0.96       0.83
Previous Works [3]–[7] (Our Implementation)   0.63       0.74

Qualitative results of our method and the previous methods are shown in Fig. 7. The backgrounds are very complex, containing regular roof patterns, grass, trees, and blur, and some characters are far too small, with much noise in their surroundings. Despite this complexity, our method detects Aksara Jawa and English text lines relatively well, though it misclassifies a few low-resolution characters; those characters are too small even for a human to recognize correctly. It is interesting to note that the existing method could not detect the text lines correctly, for example in the middle images of Fig. 7. This is because of the imperfect geometrical structure of the extracted components, due to heavy noise, touching characters, or possibly a framework that is not deep enough to create a sufficiently discriminative system.

Fig. 7. Comparison results between our method (green bounding boxes) and previous works (yellow bounding boxes) for Aksara Jawa text detection in scene images. Best viewed in color.

V. CONCLUSION

In this paper, we have presented a method to detect Aksara Jawa text in scene images that addresses the problems of the existing works. Existing works tend to utilize many features that are manually designed and tuned, plus a classifier for inference; in practice, those works are hard to implement, and this can deteriorate the detection system. Interestingly, in this method we have shown that feature integration, pooling, and non-linear transformation are all packaged into single layers; the computation is more efficient and fully feed-forward. The experiments show encouraging results, and we believe they will bring many benefits for future work on Aksara Jawa text detection and text analysis.

REFERENCES

[1] K. Jung, K. In Kim, and A. K. Jain, "Text information extraction in images and video: a survey," Pattern Recognition, vol. 37, no. 5, pp. 977–997, 2004.
[2] J. Liang, D. Doermann, and H. Li, "Camera-based analysis of text and documents: a survey," International Journal of Document Analysis and Recognition (IJDAR), vol. 7, no. 2-3, pp. 84–104, 2005.
[3] R. Adipranata, M. Indrawijaya, G. S. Budhi et al., "Feature extraction for java character recognition," in International Conference on Soft Computing, Intelligence Systems, and Information Technology. Springer, 2015, pp. 278–288.
[4] M. D. Sulistiyo et al., "Pengenalan aksara jawa tulisan tangan menggunakan directional element feature dan multi class support vector machine [Handwritten Aksara Jawa recognition using the directional element feature and a multi-class support vector machine]," KNTIA, vol. 3, 2016.
[5] Y. I. Nurhasanah, I. A. Dewi et al., "Sistem pengenalan aksara sunda menggunakan metode modified direction feature dan learning vector quantization [Aksara Sunda recognition system using the modified direction feature and learning vector quantization methods]," Jurnal Teknik Informatika dan Sistem Informasi, vol. 3, no. 1, 2017.
[6] I. Al Farqi et al., "Aplikasi pengenalan aksara carakan madura dengan menggunakan metode back propagation [Application of Aksara Carakan Madura recognition using the back-propagation method]," Jurnal Ilmiah Teknologi Informasi Asia, vol. 9, no. 1, pp. 18–34, 2015.
[7] R. Y. Astuty and E. D. Kusuma, "Pengenalan aksara jawa menggunakan digital image processing [Aksara Jawa recognition using digital image processing]," in Proceedings of the Informatics Conference, vol. 2, no. 2, 2016.
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[12] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[13] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[14] A.-r. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[15] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[17] ——, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, 2016.
[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[19] I. A. Sulistijono and A. Risnumawan, "From concrete to abstract: Multilayer neural networks for disaster victims detection," in Electronics Symposium (IES), 2016 International. IEEE, 2016, pp. 93–98.
[20] A. Risnumawan and C. S. Chan, "Text detection via edgeless stroke width transform," in ISPACS. IEEE, 2014, pp. 336–340.
[21] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, "A robust arbitrary text detection system for natural scene images," Expert Systems with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
[22] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in CVPR. IEEE, 2012, pp. 3538–3545.
[23] ——, "On combining multiple segmentations in scene text recognition," in ICDAR, 2013.
[24] A. Risnumawan, I. A. Sulistijono, and J. Abawajy, "Text detection in low resolution scene images using convolutional neural network," in International Conference on Soft Computing and Data Mining. Springer, 2016, pp. 366–375.
[25] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in ICCV. IEEE, 2011, pp. 1457–1464.
[26] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," in ICPR. IEEE, 2012, pp. 3304–3308.