From Concrete to Abstract: Multilayer Neural Networks for Disaster Victims Detection

  

Indra Adji Sulistijono
Graduate School of Engineering Technology
Politeknik Elektronika Negeri Surabaya (PENS)
Kampus PENS, Surabaya, Indonesia
Email: indra@pens.ac.id

Anhar Risnumawan
Mechatronics Engineering Division
Politeknik Elektronika Negeri Surabaya (PENS)
Kampus PENS, Surabaya, Indonesia
Email: anhar@pens.ac.id

Abstract—The main objective of a search-and-rescue (SAR) team is to quickly locate victims in a post-disaster scenario. In such a scenario, images are usually complex, containing highly cluttered backgrounds such as debris, soil, gravel, ruined buildings, and clothes, which are difficult to distinguish from the victims. Previous methods, which only work on the nearly uniform backgrounds of indoor or yard scenes, are not suitable and can deteriorate the detection system. In this paper, we demonstrate the feasibility of multilayer neural networks for disaster victim detection on highly cluttered backgrounds. A theoretical justification of how deep learning learns from concrete to object abstraction is established. In order to build a more discriminative system, this theoretical justification leads us to perform pre-training on a data-rich dataset followed by finetuning only the last layers on a data-specific dataset while keeping the other layers fixed. A new Indonesian disaster victims dataset is also provided. Experimental results show the efficiency of the method for disaster victim detection in highly cluttered backgrounds.

I. INTRODUCTION

Disasters happen in people's lives and can never be predicted, and almost every disaster causes fatalities. In most disasters, many casualties occur because of delays in handling the victims, which can be due to late information about the location and the number of victims. The primary objective of any search-and-rescue (SAR) operation is to quickly locate human victims after a disaster. While employing ground robots for SAR applications has been highly developed, most of these robots have the disadvantage of limited mobility when exploring a disaster area. Today, with widely available and low-cost unmanned aerial vehicles (UAVs), quickly exploring a disaster area from the air to identify human victims is highly feasible [1]–[3]. Moreover, employing vision-based UAVs leads to a reduction in the number and weight of on-board sensors and thus can result in smaller and cheaper UAVs.

Fig. 1. Nearly uniform background of human victims detection from previous works (top row). Highly cluttered background of disaster victims detection and several results from our detection (middle and bottom rows). It can be observed that the background is highly cluttered and complex compared to previous human victims detection works.

State-of-the-art works for vision-based human victim detection have been proposed by [4]–[6], using images taken either indoors or in a yard. However, differing from the nearly uniform backgrounds of those works, victim detection on a disaster site is considerably more challenging. Images taken from a disaster site tend to have highly cluttered backgrounds such as debris, soil, gravel, ruined buildings, clothes, etc., which are difficult to distinguish from victims, as shown in Fig. 1. Moreover, state-of-the-art methods often use many hand-crafted features followed by cascaded classifiers. Those methods are, however, prone to errors due to the many manual designs.

Multilayer neural networks, or deep learning, have recently achieved significant performance in many visual perception tasks, such as image classification [7], [8] and object detection [9], [10]. Deep learning methods such as convolutional neural networks (CNNs) have the ability to automatically learn effective feature representations from the observed visual input, which makes them very appropriate for most visual detection tasks. The features are jointly learned without requiring much manual design and can thus be better optimized from empirical data. Feature descriptors automatically learned by CNNs can also be transferred to specific tasks; the features are typically generic and can be intertwined with simple classifiers.

To the best of our knowledge, we are the first to study deep learning, or multilayer neural networks, for disaster victim detection on highly cluttered backgrounds. Our contributions are threefold. First, we demonstrate the feasibility of multilayer neural networks for disaster victim detection on highly cluttered backgrounds. Second, a theoretical justification of the concrete-to-abstract mechanism using entropy is established. It is known empirically that deep learning learns from concrete to object abstraction, as shown by visualizing its kernel filters [7], [11]–[13]; this mechanism is, however, difficult to understand theoretically. Interestingly, in order to build a more discriminative system, this theoretical justification leads us to perform pre-training on a data-rich dataset such as ImageNet and then finetuning only the last two fully connected layers on a data-specific dataset, while keeping the other layers fixed. Third, we provide a new Indonesian Disaster Victims dataset called IDV-50 and its baseline performance for evaluation.


The rest of this paper is organized as follows. Section II describes related work. Section III explains the methodology: Section III-A establishes the theoretical justification of the concrete-to-abstract mechanism, Section III-B describes the possibility of finetuning, Section III-C describes the new IDV-50 dataset, and Section III-D explains the training procedure. Sections IV and V describe the experiments and the conclusion, respectively.

II. RELATED WORK

There is quite a large body of literature on people detection using other sensors. Since many mobile robots are equipped with laser range scanners, a lot of effort has been dedicated to using them for people detection and tracking [14]–[18]. The combination of visual and laser-based detectors has mainly been explored in [19], where a laser range scanner is employed to extract regions of interest in camera images and to improve the confidence of an AdaBoost-based visual detector. Thermal images have also been used for people detection, either with dedicated methods [20], [21] or by directly applying methods originally designed for detecting people in bright-day images [22].

Combining information from different types of sensors has recently been proposed for autonomous victim detection applications [23], [24]. The work by [23] is particularly similar to ours in that victim detection is performed from UAVs. The authors proposed to use a thermal camera to pre-filter promising image locations and subsequently verify them using a visual object detector. While in [23] people lying on the ground are assumed to be in an ideal and nearly uniform background, in this paper we address the significantly more complex problem of detecting people in highly cluttered backgrounds. Note that the results of our work can still be used in combination with thermal camera images, which, similarly to [23], can be used to restrict the search to image locations likely to contain people or to prune false positives that contain no thermal evidence.

The combination of multiple sensors for people detection is beneficial in many scenarios; however, it comes at the cost, especially for unmanned aerial vehicles, of an increased payload for the additional sensors. This paper therefore aims to evaluate disaster victim detection in highly cluttered backgrounds while also minimizing sensor requirements.

III. METHODOLOGY

Previous works require features and model parameters to be manually designed and optimized through a tedious trial-and-error cycle of re-adjusting the features and re-learning the classifiers. In this work, a CNN is employed instead to learn the feature representation, jointly optimizing the features as well as the classifiers.

The overall system of this work is shown in Fig. 2. It is assumed that the robots are able to continuously send images to a server for processing, and thus we mainly focus on the algorithm.

Fig. 2. Overall block diagram of disaster victims detection. Multiple UAV robots are spread around the disaster area, sending image data to a GPU-based server. Our method, which is installed on the server, then processes each image; the output is analyzed and provides a command for the robots. Note that the method is implemented on the server because of its high computational demand.

Conventionally, in order to classify whether a box region $u$ is victim or background, a set of features $Q(u) = (Q_1(u), Q_2(u), \ldots, Q_N(u))$ is manually extracted. Then a binary classifier $k_l$ is learned for each label $l$, as a block separate from the feature extraction. The objective is to recognize the label $l$ contained in the box region $u$ such that $l = \arg\max_{l \in L} P(l|u)$, where the labels are $L = \{\text{victim}, \text{background}\}$ and $P(l|u) = k_l(Q(u))$ is a posterior probability distribution over labels given the input.

A CNN comprises multiple layers of features stacked together. A layer consisting of $N$ linear filters followed by a non-linear activation function $h$ is called a convolutional layer. A feature map $f_m(x, y)$ is the input to a convolutional layer, where $(x, y) \in S_m$ are spatial coordinates on layer $m$. The feature map $f_m$ contains $C$ channels, and we write $f^c_m(x, y) \in \mathbb{R}$ to indicate its $c$-th channel. The output of a convolutional layer is a new feature map $f^n_{m+1}$ such that

$$f^n_{m+1} = h_m(g^n_m), \quad \text{where} \quad g^n_m = W^n_m * f_m + b^n_m, \qquad (1)$$

and $g^n_m$, $W^n_m$, and $b^n_m$ denote the $n$-th net input, filter kernel, and bias on layer $m$, respectively. An activation layer $h_m$ such as the Rectified Linear Unit (ReLU), $h_m(f) = \max\{0, f\}$, is used in this work. In order to build translation invariance in local neighborhoods, convolutional layers are usually intertwined with normalization, subsampling, and pooling layers. A pooling layer is obtained by taking the maximum or the average over a local neighborhood contained in the $c$-th channel of the feature maps. The process starts with $f_1$ equal to the resized box region $u$, performs convolution layer by layer, and ends by connecting the last feature map to a logistic regressor for classification, giving the label output probability. All the model parameters are jointly learned by minimizing the classification loss over the training data, commonly with Stochastic Gradient Descent (SGD) and back-propagation.
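As a concrete illustration of (1), the following is a minimal NumPy sketch of one convolutional layer followed by the ReLU activation. The toy sizes, the random weights, and the "valid" sliding-window loop are assumptions for the example only, not the architecture trained later in the paper.

```python
import numpy as np

def conv_layer(f_m, W, b):
    """Net input g^n_m = W^n_m * f_m + b^n_m of Eq. (1) for every filter n.

    f_m: (C, H, W) feature map with C channels; W: (N, C, k, k) filter
    bank; b: (N,) biases. Returns g of shape (N, H-k+1, W-k+1).
    """
    C, H, Wd = f_m.shape
    N, _, k, _ = W.shape
    g = np.zeros((N, H - k + 1, Wd - k + 1))
    for n in range(N):
        for y in range(g.shape[1]):
            for x in range(g.shape[2]):
                # correlate filter n with the k x k neighborhood, summing over channels
                g[n, y, x] = np.sum(W[n] * f_m[:, y:y + k, x:x + k]) + b[n]
    return g

def relu(g):
    """Activation layer h_m(f) = max{0, f}."""
    return np.maximum(0.0, g)

# f_1 is the resized box region u; here a random 3-channel toy input.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((3, 8, 8))
f2 = relu(conv_layer(f1, rng.standard_normal((4, 3, 3, 3)), np.zeros(4)))
print(f2.shape)  # (4, 6, 6): the new feature map f_{m+1}
```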

A. From Concrete to Abstract

Many works on computer vision have shown empirically, by visualizing the kernel filters, that deep learning learns from concrete to object abstraction [7], [11]–[13]. For example, over the beginning layers the network learns pixels, then motifs, then parts, and then objects consecutively with the increasing number of layers, as shown in Fig. 3. This mechanism is, however, relatively difficult to understand theoretically.

Fig. 3. Filter visualization (bottom row of each layer) with the corresponding image patches (top row of each layer) of a CNN trained on the ImageNet dataset using the deconvolution technique of [11]. A CNN structure consisting of 5 layers is used, as suggested by [7]. Note that the patterns of the filters progress from motif, to part, to object consecutively with the increasing number of layers. Best viewed in color.

In order to provide a theoretical justification of this mechanism, the entropy of the neural network is measured. Entropy is similar to information as defined by Shannon [25] and provides a basis for measuring information. A low value of this entropy indicates that the network layer is highly selective for a class, while a large value indicates that it generalizes to many classes. The entropy computation, however, requires explicit knowledge of the output probability density, which is not available in a neural network since the network learns solely from empirical data samples. We solve this by placing a probabilistic model on the network nodes so that the output probability density is explicitly available. Binarizing the network weights, $W \in \{0, 1\}$, leads to an insignificant performance drop for both classification and detection [13]; for simplicity, binarization is therefore applied to the network filters.

Without loss of generality, assume that the network feeds the input data forward from the first to the $M$-th layer and that each layer contains a single node. The net input $g$ is a scalar $\alpha > 0$ if the binary random variable of feature map $F$ is present and $-\alpha$ otherwise (the weights and bias can be adjusted to make this true), and additive Gaussian noise of standard deviation $\sigma$ is embedded in the net input, so that the distribution of $g$ is an even mixture of the two Gaussian densities

$$G_\bullet(g_m) \triangleq \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(g_m - \alpha)^2}{2\sigma^2}\right), \qquad G_\circ(g_m) \triangleq \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(g_m + \alpha)^2}{2\sigma^2}\right). \qquad (2)$$

Bayes' rule is then applied to find the posterior probability of $F$ given $g$ on layer $m$ with a uniform prior,

$$P(F|g)_m = \frac{P(g_m|F_m)\,P(F_m)}{\prod_{p=1}^{m}\left[P(g_p|F_p)\,P(F_p) + P(g_p|\neg F_p)\,P(\neg F_p)\right]} = \frac{G_\bullet(g_m)}{\prod_{p=1}^{m}\left[G_\bullet(g_p) + G_\circ(g_p)\right]}$$
$$= \frac{C_1 \exp(g_m\alpha/\sigma^2)}{\prod_{p=1}^{m}\left[\exp(g_p\alpha/\sigma^2) + \exp(-g_p\alpha/\sigma^2)\right]} = \frac{1}{1 + \exp(-2g_m\alpha/\sigma^2)} \cdot \frac{C_1}{\prod_{p=1}^{m-1}\left[\exp(g_p\alpha/\sigma^2) + \exp(-g_p\alpha/\sigma^2)\right]}, \qquad (3)$$

where the constant $C_1 = (m-1)\,\sigma\sqrt{2\pi}$.
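As an illustration, the sketch below evaluates the last expression of (3) numerically. The values $\alpha = 1$, $\sigma = 1$, and the example net inputs are assumptions chosen only to exercise the formula.

```python
import numpy as np

def posterior(g, alpha=1.0, sigma=1.0):
    """Last expression of Eq. (3): a logistic factor in the current net
    input g_m, scaled by C_1 over the product of the preceding layers'
    two-sided exponential factors."""
    g = np.asarray(g, dtype=float)
    m = len(g)
    C1 = (m - 1) * sigma * np.sqrt(2.0 * np.pi)
    logistic = 1.0 / (1.0 + np.exp(-2.0 * g[-1] * alpha / sigma**2))
    factors = np.exp(g[:-1] * alpha / sigma**2) + np.exp(-g[:-1] * alpha / sigma**2)
    return logistic * C1 / np.prod(factors)

# Net inputs on layers 1..m when the feature is present (each g_p near +alpha).
print(posterior([1.1, 0.9, 1.2]))
```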

The entropy of a binary random variable $F$ as a function of $P(F|g) = f$ is given by

$$H(F) = -P(F|g)\log P(F|g) - (1 - P(F|g))\log(1 - P(F|g)). \qquad (4)$$

A well-known quadratic approximation is

$$\tilde{H}(F) = \frac{8}{e}\,P(F|g)\,(1 - P(F|g)) \approx H(F). \qquad (5)$$

Because the product term in (3) drives $P(F|g)_m$ towards small values as $m$ grows, the factor $(1 - P(F|g))$ of (5) can be ignored. Thus, by substituting (3) into (5),

$$\tilde{H}(F)_m = \frac{8}{e}\,P(F|g)_m = \frac{8}{e} \cdot \frac{1}{1 + \exp(-2g_m\alpha/\sigma^2)} \cdot C \prod_{p=1}^{m-1} \exp(-g_p\alpha/\sigma^2), \qquad (6)$$

where the constant $C = (m-1)\,\sigma\sqrt{2\pi}$.
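The decreasing entropy predicted by (6) can be checked numerically. The sketch below evaluates the exact entropy (4), its quadratic approximation (5), and the layerwise form (6); fixing every net input to $g = 1$ with $\alpha = \sigma = 1$ mirrors the setting of Fig. 4 and is an assumption of this example.

```python
import numpy as np

def entropy(p):
    """Exact binary entropy H(F) of Eq. (4)."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def entropy_quad(p):
    """Quadratic approximation (8/e) p (1 - p) of Eq. (5)."""
    return (8.0 / np.e) * p * (1.0 - p)

def entropy_layer(m, g=1.0, alpha=1.0, sigma=1.0):
    """Layerwise approximation of Eq. (6) with all net inputs fixed to g.
    The constant C = (m - 1) sigma sqrt(2 pi) vanishes at m = 1."""
    C = (m - 1) * sigma * np.sqrt(2.0 * np.pi)
    logistic = 1.0 / (1.0 + np.exp(-2.0 * g * alpha / sigma**2))
    return (8.0 / np.e) * logistic * C * np.exp(-g * alpha / sigma**2) ** (m - 1)

print(entropy(0.5), entropy_quad(0.5))  # (4) vs (5) at p = 0.5
for m in range(2, 6):                   # the entropy shrinks with depth, as in Fig. 4
    print(m, entropy_layer(m))
```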

The entropy of formula (6) is plotted in Fig. 4. Interestingly, this theoretical justification clearly shows that the network learns from concrete to abstract, from less to more selective for a class, since the entropy value decreases with the increasing layer number. This is in line with the empirical results of [13].

Fig. 4. Entropy curve values of the binary neural network for $\sigma = 1$ and $m = 5$, with curves for $g = 1$ and $g = 0$. Note that with the increasing layer number, the network becomes more and more selective for a class, as the entropy value decreases.

B. Possibility of Finetuning

Overfitting on the empirical data samples is often encountered when a large network is trained on a small dataset. To solve this, Fig. 4 theoretically shows that the beginning layers can be trained using a large data-rich dataset, such as ImageNet, and the last layers can then be trained using a more data-specific dataset, in this case the disaster victims dataset; this process is called finetuning. The reason is that the beginning layers tend to generalize quite well to many classes (indicated by higher entropy values) and thus can be used for many tasks, while the last layers tend to be highly class selective (indicated by lower entropy values).

This motivates us to perform finetuning on the last layers. More specifically, the networks are first trained using the large data-rich ImageNet, and then the last two layers are trained again using the data-specific disaster victims dataset while keeping the other layers fixed. The last two layers are used for finetuning as suggested by [13].

C. Disaster Victim Dataset

We collected 50 images from the internet and from real disaster areas taken by camera; we call this the Indonesian Disaster Victims 50 (IDV-50) dataset. The dataset can be downloaded from http://anhar.lecturer.pens.ac.id/files/Publications/IDV 2016IES.tar.gz. It contains 19 images for testing and the rest for training. Differing from the victim detection of [4]–[6], whose images tend to have uniform backgrounds such as grass, streets, and roads, our dataset is more challenging: the image backgrounds are complex and highly cluttered, containing debris, soil, gravel, ruined buildings, clothes, etc., which are not easily distinguishable from victims.

D. Training

We build the CNN structure upon Krizhevsky et al. [7], [9]. More specifically, 227x227 RGB regions are subtracted by their mean values. Features are then computed by forward-propagating the mean-subtracted regions through five convolutional layers and two fully-connected layers, which eventually produces a 4096-dimensional feature vector for each box region, using the Caffe [26] implementation of the CNN. We employ the selective-search algorithm [27] to generate region box proposals; these proposals are resized to 227x227 to match the input size of the first convolutional layer.
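As an illustration of this feature-extraction pipeline, the sketch below resizes each proposed box region to 227x227, subtracts a per-channel mean, and forwards it through the CNN. The `propose_regions` placeholder, the `cnn_forward` callable, and the mean values are assumptions of this example; a real system would call a selective-search implementation [27] and the trained Caffe model [26].

```python
import numpy as np

MEAN_BGR = np.array([104.0, 117.0, 123.0])  # per-channel mean; an assumed value

def resize_nn(img, size=227):
    """Nearest-neighbour resize of an (H, W, 3) region to size x size."""
    h, w = img.shape[:2]
    return img[np.arange(size) * h // size][:, np.arange(size) * w // size]

def propose_regions(img):
    """Placeholder for the selective-search proposals of [27]; yields
    (x, y, w, h) boxes. A real system would call a selective-search library."""
    h, w = img.shape[:2]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w // 2, h // 2)]

def extract_features(img, cnn_forward):
    """Resize each box region to 227x227, subtract the mean, and forward
    it through the CNN to obtain one 4096-d feature vector per region."""
    feats = []
    for (x, y, w, h) in propose_regions(img):
        region = resize_nn(img[y:y + h, x:x + w].astype(np.float64))
        feats.append(cnn_forward(region - MEAN_BGR))
    return feats

# Example with a dummy image and a stand-in for the Caffe forward pass.
dummy = np.zeros((480, 640, 3))
feats = extract_features(dummy, cnn_forward=lambda r: np.zeros(4096))
print(len(feats), feats[0].shape)  # 2 (4096,)
```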

We perform two steps of pre-training and finetuning as follows.

Data-rich driven pre-training. We discriminatively pre-train all the CNN layers on the large data-rich ImageNet using the open-source Caffe CNN library [26].

Data-specific driven finetuning. In order to adapt the CNN parameters to the disaster victim detection task, we keep the CNN parameters of the first to fifth layers fixed, letting only the parameters of the last two fully connected layers change on the data-specific dataset. This dataset contains a fairly large number of images, mixing IDV-50 and VOC [28] training data, which helps prevent overfitting. We continue SGD training with a learning rate of 0.001 (1/10th of the initial pre-training rate), allowing finetuning to make progress while not severely damaging the initialization.
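A minimal sketch of this freeze-and-finetune schedule under the stated learning rate: only the last two layers receive SGD updates while the first five stay fixed. The toy parameters and the gradient placeholder are assumptions, not the actual Caffe model; in Caffe itself the same effect is commonly obtained by zeroing the learning-rate multipliers of the frozen layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the network parameters: conv1..conv5 stay frozen,
# fc6 and fc7 (the last two fully connected layers) are finetuned.
params = {name: rng.standard_normal((4, 4)) * 0.01
          for name in ("conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7")}
trainable = {"fc6", "fc7"}
lr = 0.001  # 1/10th of the initial pre-training rate

def gradients(params, batch):
    """Placeholder for back-propagation on a mixed IDV-50/VOC batch."""
    return {name: rng.standard_normal(p.shape) for name, p in params.items()}

for step in range(100):  # continued SGD, updating only the trainable layers
    grads = gradients(params, batch=None)
    for name in trainable:
        params[name] -= lr * grads[name]
```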

IV. EXPERIMENTS

For the experiments, a standard PC with a Core i5 processor and 8 GB of RAM, with a GPU with 1 GB of memory, is used. The publicly available CNN ImageNet model [26] is used as the pre-trained model. The last two fully-connected layers are then trained again using the data-specific dataset. The training takes roughly one week.

In order to evaluate on the IDV-50 dataset, bounding box matching is used; that is, the intersection between the bounding boxes of the method and of the ground truth is taken. We compare the set of ground truth bounding boxes $T$ with the set of predicted bounding boxes $D$ of each image, where a bounding box is defined by its top-left position together with its width and height, $(pos_x, pos_y, width, height)$. Precision and recall are then defined as follows,

$$\text{Precision} \triangleq \frac{\sum_j I(T_j, D, th)}{|D|}, \qquad \text{Recall} \triangleq \frac{\sum_j I(T_j, D, th)}{|T|}, \qquad (7)$$

where the scoring function $I : (T_j, D, th) \mapsto \{1 : (T_j \cap D) \geq th,\ 0 : \text{otherwise}\}$ gives the value 1 if a predicted bounding box overlaps the corresponding ground truth by not less than the threshold $th = 0.5$, and 0 otherwise, with no multiple scoring from the same ground truth. From formula (7), the precision is the number of correctly predicted bounding boxes relative to the total number of predicted bounding boxes, and the recall is the number of correctly predicted bounding boxes relative to the number of ground truth bounding boxes. Thus predicting many bounding boxes is likely to produce lower precision and higher recall, while predicting fewer bounding boxes produces higher precision and lower recall. A good method should have a high, balanced score between precision and recall.
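The evaluation of (7) can be sketched as follows; boxes are $(pos_x, pos_y, width, height)$ tuples and each ground truth is matched at most once. The overlap is computed here as intersection-over-union against $th = 0.5$, which is a common reading of the matching criterion and an assumption of this sketch.

```python
def iou(a, b):
    """Intersection-over-union of two (pos_x, pos_y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision_recall(truths, detections, th=0.5):
    """Precision and recall of Eq. (7), with single matching per ground truth."""
    matched = set()
    hits = 0
    for d in detections:
        for j, t in enumerate(truths):
            if j not in matched and iou(t, d) >= th:
                matched.add(j)  # no multiple scoring from the same ground truth
                hits += 1
                break
    precision = hits / len(detections) if detections else 0.0
    recall = hits / len(truths) if truths else 0.0
    return precision, recall

# One ground-truth victim box and two predictions, one of which overlaps.
T = [(10, 10, 50, 80)]
D = [(12, 14, 48, 76), (200, 50, 40, 40)]
print(precision_recall(T, D))  # (0.5, 1.0)
```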

Due to the unavailability of the datasets of [4]–[6] and their implementations, the performance of our method is tested on IDV-50, providing a baseline measurement for others to evaluate against. The performance is shown in Fig. 5 and Fig. 6. It can be seen that our method performs relatively well for detecting victims on highly cluttered backgrounds, showing quite high precision while not degrading the recall much. For some images, however, our method could not detect any victims, as shown in Fig. 7. This could be attributed to the victims' visual features not being obvious enough for detection. Adding such failure-case images to the training data could solve this problem; we will investigate this in future work.

Fig. 5. Precision result for each image. Note that for several images our method is able to detect the victims precisely, as indicated by the maximum precision value.

Fig. 6. Recall result for each image. Note that for many images our method is able to fully detect the victims correctly, as indicated by the maximum recall value.

Fig. 7. Failure cases when the victims are covered with clothes.

Qualitative results are shown in Fig. 8. It is interesting to note that even for small victims in highly cluttered backgrounds, our method still detects the victims fairly well. This could be because our model is likely deep enough to discriminatively detect the victims.

Fig. 8. Several results of our disaster victims detection on highly cluttered background. The detected victims are shown with their probability values. Best viewed in color.

V. CONCLUSION

In this paper, we have demonstrated the feasibility of deep learning for disaster victims detection on highly cluttered backgrounds, tested on the new IDV-50 dataset. Compared to previous works, all the parameters of our method are jointly optimized as an end-to-end system; it is fully feed-forward and can be computed efficiently. A theoretical justification that deep learning learns from concrete to object abstraction consecutively with the increasing number of layers is provided. In order to build a more discriminative system, this theoretical justification leads us to perform pre-training on a data-rich dataset followed by finetuning only the last two fully connected layers on a data-specific dataset while keeping the other layers fixed. The experiments show encouraging results that would be beneficial for SAR teams to quickly locate victims.

REFERENCES

[1] J. Cooper and M. A. Goodrich, "Towards combining UAV and sensor operator roles in UAV-enabled visual search," in Human-Robot Interaction (HRI), 2008 3rd ACM/IEEE International Conference on. IEEE, 2008, pp. 351–358.
[2] W. E. Green, K. W. Sevcik, and P. Y. Oh, "A competition to identify key challenges for unmanned aerial robots in near-earth environments," in ICAR'05, Proceedings, 12th International Conference on Advanced Robotics. IEEE, 2005, pp. 309–315.
[3] K. Nordberg, P. Doherty, G. Farnebäck, P.-E. Forssén, G. Granlund, A. Moe, and J. Wiklund, "Vision for a UAV helicopter," in International Conference on Intelligent Robots and Systems (IROS), Workshop on Aerial Robotics, Lausanne, Switzerland, 2002, pp. 29–34.
[4] M. Andriluka, P. Schnitzspan, J. Meyer, S. Kohlbrecher, K. Petersen, O. Von Stryk, S. Roth, and B. Schiele, "Vision based victim detection from unmanned aerial vehicles," in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on. IEEE, 2010, pp. 1740–1747.
[5] P. Blondel, A. Potelle, C. Pégard, and R. Lozano, "Fast and viewpoint robust human detection for SAR operations," in Safety, Security, and Rescue Robotics (SSRR), 2014 IEEE International Symposium on. IEEE, 2014, pp. 1–6.
[6] P. Blondel, A. Potelle, C. Pégard, and R. Lozano, "Human detection in uncluttered environments: From ground to UAV view," in Control Automation Robotics Vision (ICARCV), 2014 13th International Conference on, Dec 2014, pp. 76–81.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[10] R. B. Girshick, "Fast R-CNN," CoRR, vol. abs/1504.08083, 2015. [Online]. Available: http://arxiv.org/abs/1504.08083
[11] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision–ECCV 2014. Springer, 2014, pp. 818–833.
[12] A. Risnumawan, I. A. Sulistijono, J. Abawajy, and Y. Saadi, "Text detection in low resolution scene images using convolutional neural network," in SCDM, 2016, to be published.
[13] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," in Computer Vision–ECCV 2014. Springer, 2014, pp. 329–344.
[14] K. O. Arras, S. Grzonka, M. Luber, and W. Burgard, "Efficient people tracking in laser range data using a multi-hypothesis leg-tracker with adaptive occlusion probabilities," in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on. IEEE, 2008, pp. 1710–1715.
[15] K. O. Arras, O. M. Mozos, and W. Burgard, "Using boosted features for the detection of people in 2D range data," in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3402–3407.
[16] A. Carballo, A. Ohya, and S. Yuta, "Multiple people detection from a mobile robot using double layered laser range finders," in ICRA Workshop, 2009.
[17] A. Fod, A. Howard, and M. Mataric, "A laser-based people tracker," in Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on, vol. 3. IEEE, 2002, pp. 3024–3029.
[18] D. Schulz, W. Burgard, D. Fox, and A. B. Cremers, "People tracking with mobile robots using sample-based joint probabilistic data association filters," The International Journal of Robotics Research, vol. 22, no. 2, pp. 99–116, 2003.
[19] G. Gate, A. Breheret, and F. Nashashibi, "Centralized fusion for fast people detection in dense environment," in Robotics and Automation, 2009. ICRA'09. IEEE International Conference on. IEEE, 2009, pp. 76–81.
[20] J. W. Davis and V. Sharma, "Robust detection of people in thermal imagery," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 4. IEEE, 2004, pp. 713–716.
[21] Q.-C. Pham, L. Gond, J. Begard, N. Allezard, and P. Sayd, "Real-time posture analysis in a crowd using thermal imaging," in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–8.
[22] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, "Pedestrian detection using infrared images and histograms of oriented gradients," in 2006 IEEE Intelligent Vehicles Symposium. IEEE, 2006, pp. 206–212.
[23] P. Doherty and P. Rudol, "A UAV search and rescue scenario with human body detection and geolocalization," in Australasian Joint Conference on Artificial Intelligence. Springer, 2007, pp. 1–13.
[24] A. Kleiner and R. Kummerle, "Genetic MRF model optimization for real-time victim detection in search and rescue," in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2007, pp. 3025–3030.
[25] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[27] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[28] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, Jun. 2010.