
next. Bottom-up control is useful if domain-independent image processing is cheap, and input data is accurate and reliable. Marr (1982) and Ullman (1984) advocated the bottom-up approach on the basis that bottom-up processing of data occurs invariably in human vision. Marr saw this leading to an intermediate representation called the 2 1/2-D sketch, containing surface orientations and distances in a viewer-centred frame of reference, as well as discontinuities in surface distances and orientations. In addition, Ullman hypothesised higher-level processes called visual routines, which detect features of interest in the intermediate representation.

Top-down model-driven control is driven by expectations and predictions generated in the knowledge base. Thus, model-driven control attempts to perform internal model verification in a goal-directed manner. A common top-down technique is hypothesise-and-verify, which can normally control low-level processing. There appears to be support for the view that some aspects of human vision are not bottom-up, and the model-driven approach is motivated by this observation, as well as by the desire to minimise low-level processing.

In practice, computer vision systems tend to favour mixed top-down and bottom-up control that focuses attention efficiently and makes the computation practical. Either parallel or serial computation may be performed within any of these schemes.

Both top-down and bottom-up controls imply a hierarchy of processes. In heterarchical control, processes are viewed as a collection of cooperating and competing experts, and at any time, the ‘expert’ which can ‘help the most’ is chosen. Blackboard architectures are an example of this approach, in which modular knowledge sources communicate via a common blackboard (memory) to which they can write and from which they can read.

2.4. Modelling issues

In the model-based approach to computer vision, a priori models of possible objects in a class of images are defined and utilised for object recognition. The models encode external knowledge of the world and the application. Object models may be appearance models, shape models, physical models, etc. Each of these should capture the range of variation in the presentation of the object due to changes in viewpoint, in lighting, and even changes in shape in the case of flexible objects (Pope, 1995). In addition, variations due to the image acquisition process itself, as well as variations among individual members of the object class, should be accounted for.

Objects of interest may be 2-D or 3-D; they may also be rigid, articulated, or flexible. The images themselves may be range images or intensity images. Recognition is accomplished by determining a correspondence between some attributes of the image and comparable attributes of the model in a matching phase. Relevant attributes of a model (image) are represented using one of the representation schemes discussed earlier. Recognising a 3-D object in an intensity image of an unrestricted scene is the most difficult form of the problem, and aerial and space images fall into this category. Loss of depth due to projection, occlusion and cluttering of details are some of the problems occurring; further, image intensity is only indirectly related to object shape.
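As a rough illustration of this matching phase, the Python sketch below scores measured image attributes against a small library of a priori object models. The attribute names, value ranges and acceptance threshold are invented for illustration only and do not come from any of the systems cited here.

```python
# Minimal sketch of model-based matching: image attributes are compared
# against a priori object models and the best-fitting model is reported.
# All attribute names, models and tolerances are illustrative assumptions.

MODELS = {
    "building": {"compactness": (0.6, 1.0), "area_m2": (50.0, 5000.0)},
    "road":     {"compactness": (0.0, 0.2), "area_m2": (100.0, 1e6)},
}

def match_score(attributes, model):
    """Fraction of model attributes whose measured value falls in range."""
    hits = sum(
        1 for name, (lo, hi) in model.items()
        if name in attributes and lo <= attributes[name] <= hi
    )
    return hits / len(model)

def recognise(attributes, threshold=0.5):
    """Return the best-matching model label, or None if no model verifies."""
    label, score = max(
        ((name, match_score(attributes, m)) for name, m in MODELS.items()),
        key=lambda pair: pair[1],
    )
    return label if score >= threshold else None

# Example: a compact region of 800 square metres matches 'building'.
print(recognise({"compactness": 0.8, "area_m2": 800.0}))
```

The final thresholding step stands in for the verification by a decision procedure mentioned above; a real system would of course use far richer attributes and matching criteria.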

3. Automated feature extraction

The goal of most image interpretation systems is the extraction/recognition of objects in the scene. In the model-based approach, this is achieved by first extracting object properties and then matching them to a model.

3.1. Feature attribute representation

In computer vision, attributes or properties of objects and scenes that are extractable from the image are called features. These attributes are sometimes classified as either local or global. Within photogrammetry and remote sensing, however, the term ‘features’ refers to recognisable objects or structures in the image, such as a road or a building, and the classification of such features is application-dependent; for aerial images, for example, a global description may involve information about the area covered, such as rural or urban. Confusion may be minimised in the literature by avoiding overloading of names and definitions. In this paper, we use the term ‘features’ as in photogrammetry, viz. recognisable objects in images. To refer to properties of objects, we shall use the term ‘attributes’.

Global attributes of objects summarise information about the entire visible portion of the object, such as area, perimeter, length, etc. Ideally, such global attributes should be scale and translation invariant in order to cope with multiple resolutions and shifts in images; features should be non-overlapping, so that clutter and occlusion may be avoided; further, a separate model is necessary for each view of the object, so as to handle multiple-view (multiple-look angle) images. Local attributes in photogrammetry may be, for example, junctions and edge segments, which may be treated as independent attributes of features. However, within computer vision, it is more usual to treat such attributes in relation to each other, or in context. Relational attributes are usually structured into graphs.

A representation scheme for feature attributes is judged on the criteria of scope and sensitivity, stability, efficiency and uniqueness (Marr and Nishihara, 1978; Binford, 1982; Brady, 1983; Haralick et al., 1988; Mokhtarian and Mackworth, 1992). On these criteria, researchers conclude that a good representation for the model-based approach includes a combination of local attributes, each pertaining to a specific region of the image or object (Grimson, 1990; Pope, 1995). This is because local attributes may be computed efficiently based on a limited part of the input data; they are stable, since small changes in appearance affect only some of the features, and partial occlusion of objects will only partly affect local features. Edge junctions are an example of such a local attribute, based on edge analysis. Also, a multiple-scale representation is preferable, as two largely similar objects will then have similar descriptions, even if small-scale details differ. Such multi-scale representations are more readily obtainable for aerial and satellite images, either from image databases or by sub-sampling of high-resolution images. This option is not available for many computer vision applications. The uniqueness criterion for models is not of great importance in feature recognition, since the recognition algorithm could allow for some mismatch due to noise and occlusion.

Specifying the locations of local attributes is easier for aerial and satellite images than for the images usually considered in computer vision, because in the former, exterior orientation and camera parameters are either known or derivable.
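As a concrete illustration of the invariance requirements on global attributes noted above, the short Python sketch below computes the area and perimeter of a polygonal feature and a derived compactness measure. The polygon and the choice of attributes are illustrative assumptions, not drawn from the cited works.

```python
import math

def area_perimeter(polygon):
    """Global attributes of a polygon given as [(x, y), ...] vertices:
    area via the shoelace formula, perimeter as the sum of edge lengths."""
    n = len(polygon)
    area = 0.0
    perimeter = 0.0
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area) / 2.0, perimeter

def compactness(polygon):
    """4*pi*A / P**2: equals 1 for a circle, and is invariant to
    translation and scale, unlike raw area or perimeter."""
    a, p = area_perimeter(polygon)
    return 4.0 * math.pi * a / (p * p)

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(area_perimeter(square))   # (100.0, 40.0)
print(compactness(square))      # ~0.785, at any scale or position
```

Raw area and perimeter are translation invariant but change with image resolution; dimensionless ratios such as compactness satisfy both invariance requirements, which is why such derived attributes are attractive when multiple resolutions must be handled.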
For most computer vision studies of aerial imagery, 2-D representations have been found adequate (e.g., Shufelt and McKeown, 1993), but 3-D models and matching are often employed in photogrammetry in applications such as building shape extraction (Henricsson and Baltsavias, 1997).

Finally, what attributes are useful for feature extraction/recognition? Attributes should capture all distinctions needed to differentiate features from each other and from other parts of the scene; secondly, they should reflect regularities and structures in the external world. Thus, the choice of attributes is application-dependent. In remote sensing and photogrammetry, the characteristics of spectral images are fairly well known, through radiometric calibration, the spectral characteristics of the objects, and ground truthing. Some of the attribute regularities will arise from knowledge of these characteristics; for example, the spectral characteristics of various types of ground cover, such as vegetation, soil, minerals, water and some man-made structures, have been determined by extensive tests and ground truthing over a number of years. Other attributes will be shape- and appearance-based, just as in computer vision, such as ‘roads are long and narrow strips’, ‘buildings are closed regions’ and so on. Yet others will be context-based, such as ‘buildings are normally situated beside roads’ and ‘bridges span rivers’.

Features may be organised into some kind of structure. One way is to arrange them hierarchically into part/whole relations, as in semantic-network-based systems (details later). A second is to arrange them according to adjacency relations; the latter corresponds to spatial nearness, or context (e.g., Strat and Fischler, 1991). Both may be represented as graphs.

3.2. Recognition of features

Object recognition in computer vision corresponds to the term feature extraction in photogrammetry. To recognise a single object in an image, bottom-up data-driven control is usually sufficient, in which attributes are first detected and represented as symbols. New attributes are then identified by grouping more primitive attributes. The attributes are then used to select a likely model from a library of object models, a step also called indexing. The best match between image attributes and model attributes is then found. Finally, the match is verified using some decision procedure. The grouping, indexing and matching steps essentially involve search procedures.

Bottom-up control fails, however, in more complex images containing multiple objects with occlusion and overlap, as well as in the case of poor-quality images, in which noise creates spurious attributes. This is a very likely scenario for remotely sensed images. In this situation, top-down or hybrid control strategies are more useful. In the top-down approach, the hypothesis phase requires the organisation of models indexed by attributes, so that based on observed attributes, a small set of likely objects can be selected. The selected models are then used to recognise objects in the verification phase (Jain et al., 1995). A disadvantage of this approach is that the model control necessary in some parts of the image is too strong for other parts; for example, symmetry requirements imposed by the model could corrupt borders. In the hybrid approach, the two strategies are combined to improve processing efficiency.
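The bottom-up sequence just described (detect, group, index, match, verify) can be summarised as a pipeline. The following Python skeleton is a minimal sketch using invented stand-ins: edge angles as primitives, a ‘parallel pair’ grouping cue, and a one-entry model library. It is not the algorithm of any system cited here.

```python
# Skeleton of the bottom-up sequence: detect -> group -> index ->
# match -> verify. All data structures and cues are invented placeholders.

def detect_attributes(edge_angles):
    """Detect primitive attributes and represent them as symbols."""
    return [{"type": "edge", "angle": a} for a in edge_angles]

def group_attributes(primitives):
    """Group primitives into more informative composite attributes:
    here, pairs of near-parallel edges (a common grouping cue)."""
    groups = []
    for i, p in enumerate(primitives):
        for q in primitives[i + 1:]:
            if abs(p["angle"] - q["angle"]) < 5:
                groups.append({"type": "parallel_pair"})
    return groups

def index_models(groups, library):
    """Indexing: select the models whose cue occurs among the groups."""
    cues = {g["type"] for g in groups}
    return [m for m in library if m["cue"] in cues]

def match_and_verify(groups, candidates, min_support=1):
    """Match each candidate against the groups; verify the best match."""
    best, support = None, 0
    for m in candidates:
        s = sum(g["type"] == m["cue"] for g in groups)
        if s > support:
            best, support = m, s
    return best["name"] if best and support >= min_support else None

library = [{"name": "road", "cue": "parallel_pair"}]
prims = detect_attributes([0, 2, 90])    # toy 'image': three edge angles
groups = group_attributes(prims)         # yields one near-parallel pair
print(match_and_verify(groups, index_models(groups, library)))  # road
```

Each stage involves search over combinations of attributes or models, which is why grouping, indexing and matching dominate the cost of bottom-up recognition.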
Attributes are grouped whenever the resulting attribute is more informative than the individual attributes. This process is also called perceptual organisation. Lowe (1985, 1990) addressed this grouping question in object recognition and proposed objective criteria for grouping attributes: he looks for configurations of edge segments that are unlikely to happen by chance and are preserved under projection. Collinear and parallel edges are an example. Zerroug and Nevatia (1993) utilise regularities in the projections of homogeneous generalised cylinders into 2-D. Most other researchers have developed ad hoc criteria for grouping, e.g., Steger et al. (1997) for road extraction, and Henricsson and Baltsavias (1997) for building extraction. It seems obvious that local context will play a large part in attribute grouping, since one would expect a particular arrangement of local attributes in relation to each other to define a local context.

General knowledge about occlusion, perspective, geometry and physical support is also necessary for the recognition task. Brooks (1981) built a geometric reasoning system called ACRONYM for object recognition. The system SIGMA by Matsuyama and Hwang (1985) includes a geometric reasoning expert. McGlone and Shufelt (1994) have incorporated projective geometry into their system for building extraction, while Lang and Förstner (1996) have developed polymorphic features for use in building extraction procedures.

Context plays a significant role in image understanding. In particular, relaxation labelling methods use local and global context to perform semantic labelling of regions and objects in an image. After the segmentation phase, scene labelling should correspond with available scene knowledge, and the labelling should be consistent. This problem is usually solved using constraint propagation: local constraints result in local consistencies, and by applying an iterative scheme, the local consistencies adjust to global consistencies across the whole image. A full survey of relaxation labelling is available in Hancock and Kittler (1990). Discrete relaxation methods are oversimplified and cannot cope with incomplete or inaccurate segmentation. Probabilistic relaxation works on the basis that a locally inconsistent but very probable global interpretation may be more valuable than a consistent but unlikely explanation; see Rosenfeld et al. (1976) for an early example of this approach.

To handle uncertainty at the matching stage, various evidence-based techniques have been used. Examples include systems which utilise Dempster–Shafer theory (Wesley, 1986; Provan, 1990; Clarkson, 1992), reliability values (Haar, 1982), fuzzy logic (Levine and Nazif, 1985), the principle of least commitment (Jain and Haynes, 1982), confidence values (McKeown and Harvey, 1987), random closed sets (Quinio and Matsuyama, 1991) and Bayesian networks (Rimey, 1993; von Kaenel et al., 1993; Sarkar and Boyer, 1994).
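To make the iterative scheme concrete, the following Python sketch implements a minimal probabilistic relaxation update in the spirit of Rosenfeld et al. (1976) on an invented two-region scene. The compatibility coefficients, initial probabilities and label set are assumptions for illustration only.

```python
import numpy as np

# Probabilistic relaxation labelling sketch. Two neighbouring regions,
# labels 0='road' and 1='building'; the compatibility matrix and initial
# probabilities are invented for illustration.

labels = ["road", "building"]
# r[a, b] in [-1, 1]: compatibility of label a on a region with
# label b on its neighbour (e.g., building-beside-road is plausible).
r = np.array([[0.8, 0.3],
              [0.3, 0.6]])
# p[i, a]: current probability that region i carries label a.
p = np.array([[0.6, 0.4],
              [0.3, 0.7]])
neighbours = {0: [1], 1: [0]}

for _ in range(10):
    q = np.zeros_like(p)
    for i, ns in neighbours.items():
        for j in ns:
            q[i] += r @ p[j]           # neighbour support for each label
    p = p * (1.0 + q)                  # reinforce well-supported labels
    p /= p.sum(axis=1, keepdims=True)  # renormalise to probabilities

for i, row in enumerate(p):
    print(f"region {i}: {labels[int(row.argmax())]} (p={row.max():.2f})")
```

After a few iterations, the probabilities sharpen towards a mutually compatible labelling, which is the local-to-global adjustment that the constraint propagation description above refers to.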

4. Some examples of applications of modelling and representation