Text Detection in Low Resolution Scene Images using Convolutional Neural Network

Anhar Risnumawan(1), Indra Adji Sulistijono(2), Jemal Abawajy(3), and Younes Saadi(4)

(1,2) Mechatronics Engineering Division, Graduate School of Engineering Technology, Politeknik Elektronika Negeri Surabaya (PENS), Kampus PENS, Surabaya, Indonesia
(3) School of Information Technology, Deakin University, Australia
(4) Faculty of Computer Science & Information Technology, University of Malaya, Kuala Lumpur, Malaysia
Abstract. Text detection in scene images has gained increasing interest, especially due to the rise of wearable devices. However, these devices often acquire low resolution images, which makes text difficult to detect due to noise. Notable methods for detection in low resolution images generally rely on many cleverly integrated features and cascaded classifiers to form a more discriminative system. Such methods, however, require many hand-crafted features and much manual tuning, which are difficult to achieve in practice. In this paper, we show that the notable cascaded method is equivalent to a Convolutional Neural Network (CNN) framework for text detection in low resolution scene images. The CNN framework, in contrast, has a mutual interaction between layers from which the parameters are jointly learned without manual design, so its parameters can be better optimized from training data. Experimental results show the efficiency of the method for detecting text in low resolution scene images.

Keywords: Scene text detection, low resolution images, cascaded classifiers, Convolutional Neural Network, mutual interaction, joint learning.

1 Introduction

Detecting text in scene images has attracted wide interest due to its usefulness in a variety of real-world applications, such as robot navigation, assisting visually impaired people, tourist navigation, and enhancing safe vehicle driving [2, 4]. Unlike printed document analysis, where characters are typically monotone on uniform backgrounds, detecting text in scene images is highly complicated due to complex backgrounds and high variation in font, size, and color. Thus, text in scene images has to be robustly detected.


In most embedded-system-based platforms, such as robots and mobile devices, low resolution images are often acquired from a low resolution camera because of its small size, its modularity with the platform, and its compatibility with the low processing power of most embedded systems. Applications range from robot localization using road signs as a GPS alternative to wearable-device-based scene text detection, such as Google Goggles, vOICe, and Sypole, to help visually impaired people.

Detection in low resolution images is generally handled using many cleverly integrated features followed by cascaded classifiers, to obtain a more discriminative system than high-resolution-image-based systems [5, 11–13, 18–20]. Integrating many features and cascading classifiers are extremely important to reduce the misclassification rate caused by the noise in low resolution images. The features, however, are prone to error due to imperfections of manual design, while the cascaded classifiers require tedious hand-tuning before another classifier can be cascaded, as shown in Fig. 1.

Fig. 1: General cascaded framework for detection in low resolution images, where there is no mutual interaction between stages. Each stage usually needs careful design and tuning before another stage is cascaded.

The cascaded framework shown in Fig. 1 has no mutual interaction between stages. More specifically, the parameters of each stage are individually designed and tuned during learning, and then become fixed when a new stage is designed. For example, the Histogram of Oriented Gradients (HOG) feature was individually designed, with its parameters manually tuned for the linear SVM classifier used in [1]. The HOG feature then remained fixed when new classifiers were designed [6].

In this paper, we show that the most notable cascaded framework, which uses many integrated features and cascaded classifiers, is equivalent to a deep convolutional neural network (CNN) [3] for text detection in low resolution scene images. The method differs from existing approaches, which manually hand-craft features and explicitly learn each classifier: instead, it implicitly integrates many features and has mutual interaction between stages. This mutual interaction can be seen as the stages being connected and jointly learned, with back-propagation employed to obtain all parameters simultaneously.

(Google Goggles: http://www.google.com/mobile/goggles/#text; vOICe: http://www.artificialvision.com/android.htm)

2 Text Detection using CNN

An overview of the system is shown in Fig. 2. Given a low resolution input scene image, a response map is computed by the CNN, and bounding boxes are formed to indicate the detected text lines. The noise caused by low resolution can be seen in the zoomed image in Fig. 2b: the characters are not as smooth as in high resolution images, touching characters may be classified as non-text, characters may be missed due to imperfect geometrical properties, and non-text may be classified as text due to noise.

(a) Low resolution image, 250×138 pixels. (b) Zoomed image patch. (c) Response map from CNN. (d) Bounding box formation.

Fig. 2: Overview of the system using a low resolution input image. The response map shows the probability output [0, 1] from the CNN, ranging from blue (lowest) through yellow to red (highest). Note that most of the probability mass spreads around the detected text line. Best viewed in color.


Notable works that address text detection in low resolution images are Mirmehdi et al. [7] and Nguyen et al. [10], but they only work for documents, which differ from scene images. Sanketi et al. [14] used several stages to localize text in low resolution scene images; however, the method is only a proof of concept. It is also noted that a multi-stage system is hard to maintain, since errors from the first stage can propagate due to imperfect tuning, and it requires manually hand-crafted features.

The classifier is the main part of a scene text detection system. It has been shown that improving the classifier increases detection accuracy; for example, the works of [8, 9] applied cascaded classifiers to boost accuracy.

Conventionally, in order to classify whether an image patch u is text or non-text, a set of features Q(u) = (Q_1(u), Q_2(u), . . . , Q_N(u)) is extracted and a binary classifier k^l is learned for the text and non-text labels l. From a probabilistic point of view, this means classifiers are learned to yield a posterior probability distribution p(l|u) = k^l(Q(u)) over labels given the input. The objective is then to recognize the label l contained in image patch u such that l* = argmax_{l∈L} p(l|u), where L = {text, non-text}. The features Q often require hand-crafted manual design, optimized through a tedious trial-and-error cycle of adjusting the features and re-learning the classifiers. In this work, a CNN is applied instead to learn the representation, jointly optimizing the features as well as the classifier.
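As a hedged illustration, the conventional pipeline scores a patch with fixed features and a separately learned classifier; the feature functions and weights below are hypothetical placeholders, not those of any cited method:

```python
import numpy as np

def classify(u, features, w, b):
    """Conventional pipeline: extract hand-crafted features Q(u) = (Q_1(u), ..., Q_N(u)),
    apply a learned binary classifier k(Q(u)) = p(text | u), and return
    l* = argmax over L = {text, non-text} with its probability."""
    q = np.array([Q(u) for Q in features])         # feature vector Q(u)
    p_text = 1.0 / (1.0 + np.exp(-(w @ q + b)))    # logistic classifier as k
    return ("text", p_text) if p_text >= 0.5 else ("non-text", 1.0 - p_text)

# Toy placeholder features: mean intensity and mean gradient magnitude of the patch.
toy_features = [
    lambda u: float(u.mean()),
    lambda u: float(np.abs(np.diff(u, axis=0)).mean() + np.abs(np.diff(u, axis=1)).mean()),
]
```

In the cascaded framework, each Q and each classifier is designed and tuned by hand; the CNN instead merges both into a single jointly learned model.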

Multiple layers of features are stacked in a CNN. A convolutional layer consists of N^m linear filters followed by a non-linear activation function h^m. A feature map f^m(x, y) is the input to a convolutional layer, where (x, y) ∈ S^m are spatial coordinates on layer m. The feature map f^m(x, y) ∈ R^{C^m} contains C^m channels, with f^m_c(x, y) denoting the c-th channel. The output of the convolutional layer is a new feature map f^{m+1}_n such that

f^{m+1}_n = h^m(W^{mn} ∗ f^m + b^{mn}),   (1)

where W^{mn} and b^{mn} denote the n-th filter kernel and bias, respectively, and ∗ denotes the convolution operator. An activation layer h^m such as the Rectified Linear Unit (ReLU), h^m(f) = max{0, f}, is used in this work. In order to build translation invariance in local neighborhoods, convolutional layers can be intertwined with normalization, subsampling, and pooling layers. A pooling layer takes the maximum or average over a local neighborhood within channel c of the feature maps. The process starts with f^1 equal to the input image patch u, performs convolution layer by layer, and ends by connecting the last feature map to a logistic regressor for classification, yielding the probability of the correct label. All parameters of the model are jointly learned from the training data by minimizing the classification loss over a training set using Stochastic Gradient Descent (SGD) and back-propagation.

We apply the CNN structure shown in Fig. 3. An image patch u ∈ R^{32×32} is normalized and becomes the input to two convolutional and pooling layers. We found that normalization is important to reduce the loss, as the input is then centered and bounded to [0, 1]. We use N_1 = 96 filters in the first convolutional layer and N_2 = 128 in the second. Each convolutional layer and the first fully-connected (FC) layer is followed by a ReLU activation layer, which simply passes feature map values that are not less than zero, h^m(f) = max{0, f}. In practice we found that ReLU increases the convergence rate of learning compared with the sigmoid function, which can easily deteriorate the gradient.

Fig. 3: CNN structure for this work: 32×32 input → 25×25×96 → 5×5×96 → 4×4×128 → 2×2×128 → 1×1×256, followed by fully-connected layers and a text/non-text output.
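The convolution-plus-ReLU operation of Eq. 1 and the pooling step can be sketched naively in NumPy. The kernel and pooling sizes below are inferred from the feature-map sizes in Fig. 3 (8×8 for the first layer since 32 − 8 + 1 = 25, 2×2 for the second since 5 − 2 + 1 = 4, pooling windows of 5 and 2) and are an assumption, not a statement of the trained implementation:

```python
import numpy as np

def conv_relu(f, W, b):
    """One layer of Eq. 1: f_{m+1,n} = max(0, W_{mn} * f_m + b_{mn}).
    f: (H, W, C) feature map; W: (N, kh, kw, C) filter bank; b: (N,) biases.
    Valid convolution with stride 1; returns an (H-kh+1, W-kw+1, N) map."""
    H, Wd, C = f.shape
    N, kh, kw, _ = W.shape
    out = np.empty((H - kh + 1, Wd - kw + 1, N))
    for n in range(N):
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x, n] = np.sum(f[y:y + kh, x:x + kw, :] * W[n]) + b[n]
    return np.maximum(out, 0.0)  # ReLU activation h(f) = max{0, f}

def max_pool(f, s):
    """Non-overlapping s-by-s max pooling, applied per channel."""
    H, Wd, C = f.shape
    return f[:H // s * s, :Wd // s * s, :].reshape(H // s, s, Wd // s, s, C).max(axis=(1, 3))

# Feature-map sizes follow Fig. 3 (kernel/pool sizes inferred, see lead-in):
u = np.random.rand(32, 32, 1)                                       # normalized input patch
f2 = max_pool(conv_relu(u, 0.01 * np.random.randn(96, 8, 8, 1), np.zeros(96)), 5)
f3 = max_pool(conv_relu(f2, 0.01 * np.random.randn(128, 2, 2, 96), np.zeros(128)), 2)
print(f2.shape, f3.shape)  # (5, 5, 96) (2, 2, 128)
```

The triple loop makes the per-filter sum of Eq. 1 explicit; production implementations express the same operation as a batched matrix multiplication.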

After the last fully-connected layer we apply a sigmoid activation function followed by a softmax loss layer. In our experiments, we found empirically that the sigmoid after the last fully-connected layer yields superior performance. It can be thought of as bounding the output to lie within the labels {0, 1} for non-text and text, respectively. During testing, the output is the probability of each label.
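A minimal sketch of this output head, as a generic sigmoid-then-softmax composition rather than the paper's exact implementation:

```python
import numpy as np

def output_probs(z):
    """Sigmoid on the last FC scores bounds them to (0, 1); softmax then turns
    the two bounded scores into a distribution over {non-text, text}."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))  # sigmoid activation
    e = np.exp(s - s.max())                                # numerically stable softmax
    return e / e.sum()

p = output_probs([-1.3, 2.0])  # scores for (non-text, text)
```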

In order to form bounding boxes from the response maps, we apply non-maximal suppression (NMS) as in [15]. In particular, for each row in the response map, we scan possible overlapping line-level bounding boxes. Each bounding box is scored by the mean probability it contains in the response map. NMS is then applied again to remove overlapping bounding boxes.
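A hedged sketch of this scoring-and-suppression step, using greedy overlap-based NMS (the exact overlap criterion of [15] may differ):

```python
import numpy as np

def box_score(resp, box):
    """Score a box (x0, y0, x1, y1) by the mean probability inside it on the response map."""
    x0, y0, x1, y1 = box
    return float(resp[y0:y1, x0:x1].mean())

def overlap(a, b):
    """Intersection-over-union of two boxes in (x0, y0, x1, y1) form."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop any box that overlaps a kept one."""
    keep = []
    for i in np.argsort(scores)[::-1]:          # highest score first
        if all(overlap(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```

In use, candidate line-level boxes are first scored with `box_score` against the CNN response map and then filtered with `nms`.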

3 Relationship to the Existing Cascaded Framework

  We show that the cascaded framework can be seen as a Convolutional Neural Network. Fig. 4 shows an illustration.

In the cascaded framework, many features are extracted from the input image. These features provide rich information to the classifier for a better input representation. The classifier then decides the correct label of the input. Classifiers generally contain a non-linear transformation, as in the Support Vector Machine (SVM), where the input feature space is projected or mapped onto a higher dimension in which linear operations can be applied. In order to make the system more discriminative, several classifiers are usually cascaded.

The above discussion shows that the cascaded framework can be viewed as a kind of CNN, where the stages become layers: many features are represented by a feature map and its channels, and a classifier's non-linear transformation by a convolutional operation as in Eq. 1 with its non-linear activation function.


  Fig. 4: Analogy between cascaded framework and Convolutional Neural Network structure.

In the CNN, however, the integration of many features, the non-linear transformation, and the maximum or average pooling are all involved in one layer. It is fully feed-forward and can be computed efficiently, so the method is able to optimize an end-to-end mapping that comprises all of these operations. In addition, the last fully-connected (FC) layer can be thought of as a pixel-wise non-linear transformation. A new stage or layer can easily be cascaded to make the system more discriminative.

4 Experimental Results

We train the CNN using the ICDAR 2003 training set and synthetic data from [16].

Part of the training data is used as a validation set. The SGD learning rate is set to 0.05 and is multiplied by 0.1 every 100,000 iterations; the maximum number of iterations is 450,000, and the momentum is 0.9. The validation set is evaluated every 1,000 iterations. On a standard PC (Core i5, 4 GB RAM) with a 1 GB GPU, training takes about one week. The validation curves recorded every 1,000 iterations during training are shown in Fig. 5.

The performance of our method is investigated further by reducing the image resolution, as shown in Fig. 6. Our method is able to detect the text line, as indicated by the probability values contained in the response map.
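The step learning-rate schedule described above can be written as a small sketch; the parameter names here are ours, not from a specific training framework:

```python
def learning_rate(iteration, base_lr=0.05, gamma=0.1, step=100_000):
    """SGD step schedule: multiply base_lr by gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)
```

For example, the rate is 0.05 up to iteration 99,999, then 0.005, and so on until training stops at 450,000 iterations.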

Fig. 5: Test accuracy (a) and test loss (b) on the validation set during training.

Bounding boxes could then be easily formed using NMS, since the non-text responses do not follow a regular pattern (such as horizontal alignment).

In contrast, well-known methods based on connected component analysis, such as those based on Maximally Stable Extremal Regions (MSER), extract many components, as shown in Fig. 6c, most of them non-text and only a few text. Moreover, note that some characters touch their neighbors, e.g., S-a and r-i in the word 'Safari', and the character stroke widths are relatively small. These can hardly be classified as text, since most connected component analyses investigate the geometrical structure of each component. We compare with the result of Yin et al. [17] as a representative of other cascaded-framework-like methods: their method achieved the best performance on the ICDAR competition dataset, matches the definition of the cascaded framework in this work, provides an online demo that makes comparison easy, and, more importantly, uses MSER-based connected component analysis. As can be seen in Fig. 6e, the whole text line could not be detected, due to the imperfect geometrical structure of the components caused by noise.

Test results on other low resolution scene images are shown in Fig. 7. One can observe that the backgrounds are very complex and the characters are very small with a lot of noise. Our method detects the text lines fairly well, though it misses some very small characters; such characters are so small that even a human could not recognize them correctly. It is noted that the best-performing method on ICDAR could not correctly detect the text lines, as shown in the middle of Fig. 7. This could be attributed to the imperfect geometrical structure of the components in low resolution images, or to the framework not being deep enough to build a sufficiently discriminative system. In our experiments, the lowest character resolution our method can handle seems to be constrained to about half of the image patch size; below this value, the text lines may only be partly detected.


(a) Input image, 150×83 pixels. (b) Response map of our method. (c) MSER output; similar color indicates a connected component. (d) Detected text of our method. (e) Yin et al. [17].

Fig. 6: Result and comparison on a low resolution input image. Best viewed in color.

  5 Conclusion

In this paper, we have presented a framework with mutual interaction between stages, or layers, using a CNN to deal with text detection in low resolution images. The framework can be viewed as the conventional cascaded framework for building a more discriminative system. Interestingly, the integration of many features, the non-linear transformation, and the maximum or average pooling are all involved in one layer. It is fully feed-forward and can be computed efficiently; thus it is able to jointly optimize an end-to-end mapping that comprises all of the existing cascaded operations. The experiments show encouraging results, which should benefit future work on text detection in low resolution scene images.

  6 Acknowledgements

The authors would like to thank Pusat Penelitian dan Pengabdian Masyarakat (P3M) of Politeknik Elektronika Negeri Surabaya (PENS) for supporting this research through Local Research Funding FY 2016.


References

1. Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893. IEEE, 2005.



Fig. 7: Results on low resolution scene images: input (left side), our method (right side), and Yin et al. [17] (middle). The dimension values at the far left indicate the resolution of the corresponding images. Best viewed in color.



2. Keechul Jung, Kwang In Kim, and Anil K. Jain. Text information extraction in images and video: a survey. Pattern Recognition, 37(5):977–997, 2004.

3. Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

4. Jian Liang, David Doermann, and Huiping Li. Camera-based analysis of text and documents: a survey. International Journal of Document Analysis and Recognition (IJDAR), 7(2-3):84–104, 2005.

5. Mirko Mählisch, Matthias Oberländer, Otto Löhlein, Dariu Gavrila, and Werner Ritter. A multiple detector approach to low-resolution FIR pedestrian recognition. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV2005), Las Vegas, NV, USA, 2005.

6. Subhransu Maji, Alexander C. Berg, and Jitendra Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, pages 1–8. IEEE, 2008.

7. Majid Mirmehdi, Paul Clark, and Justin Lam. Extracting low resolution text with an active camera for OCR. In Spanish Symposium on Pattern Recognition and Image Processing IX, pages 43–48, 2001.

8. Lukas Neumann and Jiri Matas. Real-time scene text localization and recognition. In CVPR, pages 3538–3545. IEEE, 2012.

9. Lukas Neumann and Jiri Matas. On combining multiple segmentations in scene text recognition. In ICDAR, 2013.

10. Minh Hieu Nguyen, Soo-Hyung Kim, and Gueesang Lee. Recognizing text in low resolution born-digital images. In Ubiquitous Information Technologies and Applications, pages 85–92. Springer, 2014.

11. Anhar Risnumawan and Chee Seng Chan. Text detection via edgeless stroke width transform. In ISPACS, pages 336–340. IEEE, 2014.

12. Anhar Risnumawan, Palaiahankote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18):8027–8048, 2014.

13. Samir Sahli, Yueh Ouyang, Yunlong Sheng, and Daniel A. Lavigne. Robust vehicle detection in low-resolution aerial imagery. In SPIE Defense, Security, and Sensing, pages 76680G–76680G. International Society for Optics and Photonics, 2010.

14. Pannag Sanketi, Huiying Shen, and James M. Coughlan. Localizing blurry and low-resolution text in natural images. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 503–510. IEEE, 2011.

15. Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464. IEEE, 2011.

16. Tao Wang, David J. Wu, Andrew Coates, and Andrew Y. Ng. End-to-end text recognition with convolutional neural networks. In ICPR, pages 3304–3308. IEEE, 2012.

17. Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):970–983, May 2014.

18. Jianguo Zhang and Shaogang Gong. People detection in low-resolution video with non-stationary background. Image and Vision Computing, 27(4):437–443, 2009.

19. Tao Zhao and Ram Nevatia. Car detection in low resolution aerial images. Image and Vision Computing, 21(8):693–703, 2003.

20. Jiejie Zhu, Omar Javed, Jingen Liu, Qian Yu, Hui Cheng, and Harpreet Sawhney. Pedestrian detection in low-resolution imagery by learning multi-scale intrinsic motion structures (MIMS). In CVPR, 2014.