
are then used to select a likely model from a library of object models, a step also called indexing. The best match between image attributes and model attributes is then found. Finally, the match is verified using some decision procedure. The grouping, indexing, and matching steps essentially involve search procedures. Bottom-up control fails, however, in more complex images containing multiple objects with occlusion and overlap, as well as in the case of poor-quality images, in which noise creates spurious attributes. This is a very likely scenario for remotely sensed images. In this situation, top-down or hybrid control strategies are more useful. In the top-down approach, the hypothesis phase requires the organisation of models indexed by attributes so that, based on observed attributes, a small set of likely objects can be selected. The selected models are then used to recognise objects in the verification phase (Jain et al., 1995). A disadvantage of this approach is that the model control necessary in some parts of the image is too strong for other parts; for example, symmetry requirements imposed by the model could corrupt borders. In the hybrid approach, the two strategies are combined to improve processing efficiency.

Attributes are grouped whenever the resulting attribute is more informative than the individual attributes. This process is also called perceptual organisation. Lowe (1985, 1990) addressed this grouping question in object recognition and proposed objective criteria for grouping attributes: he looks for configurations of edge segments that are unlikely to occur by chance and are preserved under projection, collinear and parallel edges being an example. Zerroug and Nevatia (1993) utilise regularities in the projections of homogeneous generalised cylinders into 2-D. Most other researchers have developed ad hoc criteria for grouping, e.g., Steger et al. (1997) for road extraction, and Henricsson and Baltsavias (1997) for building extraction. It seems obvious that local context will play a large part in attribute grouping, since one would expect a particular arrangement of local attributes in relation to each other to define a local context.

General knowledge about occlusion, perspective, geometry and physical support is also necessary for the recognition task. Brooks (1981) built a geometric reasoning system called ACRONYM for object recognition. The system SIGMA by Matsuyama and Hwang (1985) includes a geometric reasoning expert. McGlone and Shufelt (1994) have incorporated projective geometry into their system for building extraction, while Lang and Förstner (1996) have developed polymorphic features for building extraction procedures.

Context plays a significant role in image understanding. In particular, relaxation labelling methods use local and global context to perform semantic labelling of regions and objects in an image. After the segmentation phase, scene labelling should correspond with the available scene knowledge, and the labelling should be consistent. This problem is usually solved using constraint propagation: local constraints result in local consistencies, and by applying an iterative scheme, the local consistencies adjust to global consistencies across the whole image. A full survey of relaxation labelling is available in Hancock and Kittler (1990). Discrete relaxation methods are oversimplified and cannot cope with incomplete or inaccurate segmentation. Probabilistic relaxation works on the basis that a locally inconsistent but very probable global interpretation may be more valuable than a consistent but unlikely explanation; see Rosenfeld et al. (1976) for an early example of this approach.

To handle uncertainty at the matching stage, various evidence-based techniques have been used. Examples include systems which utilise Dempster-Shafer theory (Wesley, 1986; Provan, 1990; Clarkson, 1992), reliability values (Haar, 1982), fuzzy logic (Levine and Nazif, 1985), the principle of least commitment (Jain and Haynes, 1982), confidence values (McKeown and Harvey, 1987), random closed sets (Quinio and Matsuyama, 1991) and Bayesian networks (Rimmey, 1993; von Kaenel et al., 1993; Sarkar and Boyer, 1994).
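The probabilistic relaxation scheme can be sketched as an iterative update of label probabilities in the spirit of Rosenfeld et al. (1976). The following is a minimal sketch only; the two regions, the "road"/"field" labels, and all probability and compatibility values are hypothetical, not taken from any cited system:

```python
import numpy as np

# Two neighbouring regions, two candidate labels: 0 = "road", 1 = "field".
# Initial label probabilities from local evidence alone (hypothetical values).
p = np.array([[0.60, 0.40],    # region 0 leans towards "road"
              [0.45, 0.55]])   # region 1 leans towards "field"

# Compatibility r[a, b]: support that label b on a neighbour lends to label a.
# Neighbouring regions are assumed to prefer agreeing labels.
r = np.array([[0.9, 0.2],
              [0.2, 0.9]])

for _ in range(50):
    new_p = np.empty_like(p)
    for i in range(2):
        j = 1 - i                     # the single neighbour in this toy scene
        support = r @ p[j]            # q_i(a) = sum_b r(a, b) * p_j(b)
        new_p[i] = p[i] * support     # weight current beliefs by neighbour support
        new_p[i] /= new_p[i].sum()    # renormalise to a probability distribution
    p = new_p

# The locally weaker "road" evidence at region 1 is overridden: both regions
# converge to a globally consistent "road" labelling.
```

The iteration illustrates the point made above: the locally most probable label for region 1 ("field") loses to the globally more consistent interpretation supported by its neighbour.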

4. Some examples of applications of modelling and representation

The applications of knowledge representation and modelling methods in machine vision, photogrammetry and remote sensing have incorporated most of the approaches described in the foregoing. The leaders in these applications have logically been researchers in machine vision. In the fields of photogrammetry and remote sensing, the approaches adopted have followed those in computer vision, adapted to the types of information being extracted. These applications demonstrate that there is a growing level of expertise in techniques of artificial intelligence amongst researchers in photogrammetry and remote sensing. The evolution of these methods has been from rule-based systems, to semantic networks and frames, to description logic. A review of some applications in machine vision, photogrammetry and remote sensing in this section will demonstrate these trends.

4.1. Logic

The first researchers to advocate the use of logic as a representation in computer vision systems were Reiter and Mackworth (1989). In their paper, they proposed a logical framework for the depiction and interpretation of image and scene knowledge, as well as a formal mapping between the two. They propose image axioms, scene axioms and depiction axioms, whose logical model forms an interpretation of an image. They illustrate their approach using a simple map-understanding system called Mapsee. The application is relatively limited, however, and newer systems have not been reported. One reason could be the computational complexity: while logic provides a consistent formalism to specify constraints, ad hoc search using logic is not efficient. Further, first-order logic by itself is not well suited to representing uncertainty or incompleteness in data, which is in the nature of image properties. The correspondence between image elements and scene objects is not usually one-to-one, and additional logical relations are necessary to model these. Matsuyama and Hwang (1990) adopt a logical framework in which new logical constants and axioms are generated dynamically.

4.2. Rule-based and production systems

Brooks (1981) developed ACRONYM, a model-based image understanding system for detecting 3-D objects, and tested it to extract aircraft in aerial images. 3-D models of aircraft are stored using a frame-based representation. Given an image to be analysed, ACRONYM extracts line segments and obtains 2-D generalised cylinders. Rules encoding geometric knowledge as well as knowledge of imaging conditions are used to generate expected 3-D models of the scene, which are then matched against the frames to identify aircraft.

SIGMA (Matsuyama and Hwang, 1985) is an aerial image understanding system that uses frames to represent knowledge, and both top-down and bottom-up control schemes to extract features. It consists of three subsystems: the Geometric Reasoning Expert (GRE), the Model Selection Expert (MSE), and the Low Level Vision Expert (LLVE). Information passes from the GRE to the MSE, which then communicates with the LLVE. The frames in SIGMA use slots storing the attributes of an object and its relationships to other objects. Based on the spatial knowledge in the frames, hypotheses are generated for objects and matched against image features. This is done by the MSE reasoning about the most likely appearance of an object and conveying this in image terms to the LLVE. This top-down selection of image attributes helps detect small attributes. The system was tested to extract houses and road segments from aerial images.

McKeown et al. (1985) present a rule-based system for the interpretation of airports in aerial images.
It was based on about 450 rules, divided into six classes: initialisation rules; region-to-interpretation rules for interpreting the original image fragments; local evaluation rules; consistency checks; functional-area rules for grouping image fragments into functional areas; and goal-generation rules for building the airport model.

McKeown and Harvey (1987) present a system for aerial image interpretation, with rules compiled from standard knowledge sets called schemata. They generated rules automatically from higher-level modules, which made for better error handling and more efficient execution. Their system contained about 100 schemata, each of which generated about five rules.

Strat and Fischler (1991) developed the knowledge-based system 'Condor' for the recognition of terrain scenes based on context. Context is defined by rules within context sets at various levels. The context sets are not infallible, and hence redundancy is built into them. The interpretation is based on three types of rules: candidate generation, candidate evaluation, and consistency determination. Candidate comparisons are based on the evaluation of likely candidates in the evaluation process, which scores the relative likelihood that a candidate is an instance of a given class. The authors state that this division of the knowledge keeps it at a manageable size.

Stilla et al. (1996) present a model-based system for the automatic extraction of buildings from aerial images, in which the objects to be recognised are modelled by production rules and depicted by a production set. The object model is both specific and generic: the specific model describes objects using a fixed topological structure, while the generic models are more general.

These systems illustrate that rule-based systems do not guarantee additivity of knowledge and consistency of reasoning.
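The forward-chaining control cycle shared by these production systems can be sketched as follows. This is a minimal illustration only; the facts and rules are hypothetical (though the three rule names echo Strat and Fischler's division), and no cited system is implemented here:

```python
# A minimal forward-chaining production system: each rule has a condition on
# the current set of facts and asserts new facts when it fires.
# Facts and rules are hypothetical, for illustration only.

facts = {"elongated_region", "high_contrast", "connects_regions"}

rules = [
    # (name, condition over the fact set, facts asserted when the rule fires)
    ("candidate_generation",
     lambda f: "elongated_region" in f and "high_contrast" in f,
     {"road_candidate"}),
    ("candidate_evaluation",
     lambda f: "road_candidate" in f and "connects_regions" in f,
     {"road_likely"}),
    ("consistency_determination",
     lambda f: "road_likely" in f,
     {"label:road"}),
]

# Fire rules until no rule adds anything new (a fixed point is reached).
changed = True
while changed:
    changed = False
    for name, condition, consequents in rules:
        if condition(facts) and not consequents <= facts:
            facts |= consequents
            changed = True

# facts now additionally contains "road_candidate", "road_likely", "label:road"
```

The sketch also makes the granularity problem visible: every refinement of behaviour requires splitting or adding rules, and the interactions between rules are implicit in the shared fact set rather than in any modular structure.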
Breaking up a rule base into multiple rules of varying granularity makes the program less modular and more difficult to modify. Draper et al. (1989) suggest blackboard and schema-based architectures to handle this.

4.3. Blackboard systems

Nagao and Matsuyama (1980) first addressed the problem of scene understanding using the blackboard model, and applied it to aerial images of suburban areas, involving the identification of cars, houses and roads. Their system consists of a global database (the blackboard) and a set of knowledge sources. The blackboard records data in a hierarchy consisting of elementary regions, characteristic regions and objects. The blackboard also stores a label picture, which links pixels in the original image to the regions in the database. Elementary regions are the result of an image segmentation process, and are characterised by grey level, size and location in the image. Characteristic features of the regions are then extracted, resulting in the identification of elementary regions with the following attributes:

1. Large, homogeneous regions, based on region size.
2. Elongated regions, based on shape.
3. Regions in shadow, based on region brightness.
4. Regions capable of causing shadows, based on the location of adjoining regions and the position of the sun.
5. Vegetation and water regions, from multispectral information.
6. High-contrast texture regions, from textural information.

These properties are stored on the blackboard by separate modules. The knowledge sources then identify a particular object, given the presence or absence of the characteristic features of various regions. Each knowledge source is a single rule, with a condition and a complex action part that performs various picture-processing operations to detect the object.
For example, the knowledge source to detect a crop field would look like:

   if large homogeneous region and vegetation region and not water region and not shadow-making region, then perform crop field identification.

Each knowledge source identifies an object independently, and this might lead to conflicting identifications for the same region (for example, crop field and grassland). To solve this, the system automatically calculates a reliability value for each identification, and uses it to discard all but the most reliable.

Füger et al. (1994) present a blackboard-based, data-driven system for the analysis of man-made objects in aerial images. Generic object models are represented symbolically in the blackboard, an individual object being described by several attributes. The models are controlled by numerous parameters, which are determined by a closed-loop system using 'evolution strategies'.

Stilla (1995) presents a blackboard-based production system for image understanding, which is suitable for the structural analysis of complex scenes in aerial images. Starting with primitive objects, a target object can be composed step by step by applying productions repeatedly. The compositions of objects are recorded and represented by a derivation graph. The map is modelled as a set of straight lines. The results of map analysis are one or more target objects with their corresponding derivation graphs. Image analysis may also be performed identically, by segmenting the binary image and approximating contours by straight lines.

Blackboard systems in general tend to have a centralised control structure, so that efficiency becomes an issue. Also, blackboards assume that knowledge sources will be available when needed and then vanish, whereas in vision applications they tend to persist as long as the image is being analysed.

4.4. Frames

Hanson and Riseman (1978) used frames as hypothesis generation mechanisms for vision systems.
Knowledge about classes of objects was represented as frames, and slots represented binary geometric relations between classes of objects. Slots also contained production rules for instantiating other object frames. Thus, frames are used both for control and representation. Ikeuchi and Kanade (1988) used frames to represent aspects of 3-D objects. When exact object models are available, processing is top-down, but given weaker models and more exact data, processing is bottom-up. However, using frames for both control and representation hides the procedural behaviour of the system and destroys its temporal coordination (Draper et al., 1989). Other systems which use frames include ACRONYM, SIGMA and Nagao and Matsuyama's system, already described above.

4.5. Semantic network

Nicolin and Gabler (1987) describe a system to analyse aerial images, using semantic nets to represent and interpret the image. The system consists of a Short Term Memory (STM), a Methodology Base (MB), and a Long Term Memory (LTM). The STM is conceptually equivalent to a blackboard and stores the partial interpretation of the image. The LTM stores the a priori knowledge of the scene and the domain-specific knowledge (i.e., the knowledge base). The system matches the contents of the STM against those of the LTM to produce an interpretation. This is accomplished using an inference mechanism that calls modules in the MB. The initial contents of the STM are established in a bottom-up way, and a model-driven phase generates and verifies the presence or absence of object attributes stored in the LTM.

Mayer (1994) has developed a semantic-network-based system for the knowledge-based extraction of objects from digitised maps. The system is based on a combined semantic network and frame representation, as well as a combination of model-driven and data-driven control.
The model is composed of three levels, which generally correspond to the respective layers of bottom-up image processing:

1. The image layer, e.g., the digitised map;
2. The image-graph and graphics-and-text layers;
3. The semantic objects.

The semantic network is built up from the 'part/part-of' elements in the graphs layer to the semantic objects, which comprise the 'specialisation/generalisation' relations between the graphics objects and the terrain objects. For example, an elongated area in the graphics objects layer is specialised into 'road-sides', 'pavements', 'road network', etc. Descriptions of other objects are not given, but the tests demonstrated the extraction of parcels and road networks. The frames are designed to analyse the various concepts and their properties. The object extraction is based on both model-driven and data-driven instantiation, with the initial search being based on a goal specified by the user. While the method is based on the extraction of well-defined information from maps, Mayer believes that the process should be useful for the extraction of information from images.

Tönjes (1996) has used semantic networks for modelling landscapes from overlapping aerial images. The output is a 3-D view of the terrain with appropriate representations of the vegetation. Tönjes states that semantic networks are suited to representing knowledge of structural objects. His semantic network is described by frames that include the relationships, attributes, and methods. The semantic net has three layers:

1. The sensor layer, which represents the segmentation layer, based on texture and stripes, as well as the image details;
2. The geometry and material layer, which represents the 3-D surface layer following the interpretation of the terrain cover from the sensor layer;
3. The scene layer, which contains the extracted objects.
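A semantic network of this layered kind can be sketched as concepts linked by typed relations, with specialisation and decomposition queried by traversing one relation type. The concepts and edges below are hypothetical illustrations in the part-of / is-a style these systems use, not a reproduction of any cited network:

```python
# A toy semantic network: nodes are concepts, edges are typed relations.
# All concepts and relations here are hypothetical, for illustration only.

from collections import defaultdict

edges = defaultdict(set)  # relation name -> set of (child, parent) pairs

def relate(relation, child, parent):
    edges[relation].add((child, parent))

# Scene-level specialisation ("is-a") and decomposition ("part-of").
relate("is-a", "motorway", "road")
relate("is-a", "road", "man-made object")
relate("part-of", "carriageway", "motorway")
relate("part-of", "lane marking", "carriageway")
# Link down towards the sensor layer: a concrete realisation in image data.
relate("con-of", "bright elongated stripe", "lane marking")

def ancestors(relation, node):
    """Transitive closure along one relation type, e.g. all is-a ancestors."""
    found, frontier = set(), {node}
    while frontier:
        nxt = {p for (c, p) in edges[relation] if c in frontier} - found
        found |= nxt
        frontier = nxt
    return found

# ancestors("is-a", "motorway") yields both "road" and "man-made object".
```

Keeping the relation types separate is what allows an interpretation system to ask different questions of the same network: specialisation queries follow is-a edges, while decomposition into detectable parts follows part-of edges down towards the image data.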
The semantic network is established between components in the three layers. The relationship 'con-of' denotes the concrete realisation of objects in the image data; 'part-of' describes the decomposition of objects into parts; while 'is-a' is the specialisation of the object. The object descriptions are tracked through each layer for reconstruction, which is based on both data-driven and model-driven processes.

Lang and Förstner (1996) have based their method for the extraction of buildings on polymorphic mid-level features. The approach involves semantic modelling using a 'part-of' hierarchical representation. Relations between the parts have not yet been included. The hypothesis generation of the building is based on a combination of a data-driven model for the original generation of the vertices, and subsequent model-driven approaches for hypothesis generation of object interpretation and verification, using four building types as the models: flat roof, non-orthogonal flat roof, gable roof, and hip roof. The approach successfully extracts buildings.

Schilling and Vögtle (1996) have developed a procedure for updating digital map bases, using existing map bases to aid the interpretation. The image is compared with the map to detect changes since the compilation of the map. New features are then analysed by semantic networks. Two networks are created, one for the scene and the other for the image, with the typical relationships established at different levels in the networks.

De Gunst (1996) has developed a combined data-driven and model-driven approach to recognising objects required for updating digital map data. The process is based on object-oriented models for road descriptions and a semantic network for the feature recognition, based on frames. The frames define such details as object relations, object definitions, alternative object definitions and preprocessing relations.
Road details include complex road junctions, which are described by the knowledge base. This is a very detailed study involving several different types of road features. The success of the investigations varied significantly, demonstrating the difficulty of understanding such details.

Quint and Sties (1996) and Quint (1997) present a model-based system called MOSES to analyse aerial images, which uses semantic networks as a modelling tool. Models are automatically refined using knowledge gained from topographical maps or GIS data. The generative model is the most general model, containing common-sense knowledge about the environment. Concepts in the generic models in the map and image domains are specialisations of the corresponding concepts in the generative model. A specific model is automatically generated by the system and is specific to the current scene; it is generated by combining the scene description obtained after map analysis with the generic model in the image domain. Initially, digitally available line segments are used for the structural analysis of the map, resulting in a structural description of the map scene. The scene description so obtained is then combined with the generic model in the image domain to yield the specific model, which is used for image analysis. For structural analysis, image primitives (currently line segments and regions) serve as input. The analysis is model-driven, resulting in the recognition of objects (parking places, in the project reported). A merit function is used to guide the search in the image analysis process.

To sum up, semantic networks have found wide acceptance and use in the interpretation of aerial images and digital maps.

4.6. Description logics

There are very few photogrammetric applications based on description logics. One such is Lange and Schröder's (1994) description-logic-based approach to the interpretation of changes in aerial images with respect to reference information extracted from a map. Knowledge about types of objects and types of possible changes is represented using a KL-ONE-like description logic (Brachman and Schmolze, 1985; Nebel, 1990), which permits the description of concepts in terms of necessary and sufficient conditions. Factual information about the scene and the interpretation is represented using the assertional component of the description logic. Geometric and topological constraints and relations between spatial objects are represented in the logic as object concepts and change concepts. An object is recognised as an instance of an object concept after the image has been preprocessed and the attributes extracted. The definitions of the change concepts are used in exactly the same way to recognise changes. The search for instantiation is goal-directed, and uses a number of heuristics. The examples in the paper, however, seem to be based on artificial images.
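Recognition by necessary and sufficient conditions can be sketched as follows. The concept definitions, attribute names and thresholds are entirely hypothetical, and a real KL-ONE-style reasoner would additionally compute subsumption between the concepts themselves; this only illustrates the instantiation step:

```python
# Toy description-logic-style recognition: each concept is defined by
# necessary and sufficient conditions on extracted attributes, and an image
# object is an instance of every concept whose conditions it satisfies.
# Concepts, attributes and thresholds are hypothetical.

concepts = {
    "building": lambda o: o["area"] > 50 and o["compactness"] > 0.5
                          and o["casts_shadow"],
    "road segment": lambda o: o["elongation"] > 5 and not o["casts_shadow"],
}

def recognise(obj):
    """Return all concepts whose defining conditions the object satisfies."""
    return {name for name, definition in concepts.items() if definition(obj)}

# A preprocessed image object with extracted attributes (hypothetical values).
candidate = {"area": 120, "compactness": 0.7, "casts_shadow": True,
             "elongation": 1.2}

# recognise(candidate) -> {"building"}
```

Because the conditions are both necessary and sufficient, classification is a pure membership test; change concepts can be checked in exactly the same way by applying definitions to attribute differences between the map and image descriptions.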

5. Conclusions