are then used to select a likely model in a library of object models, also called indexing. The best match
between image attributes and model attributes is then found. Finally, the match is verified using some
decision procedure. The grouping, indexing, and matching steps essentially involve search procedures.
Bottom-up control fails, however, in more complex images containing multiple objects with occlusion and overlap, as well as in the case of poor quality images, in which noise creates spurious attributes. This is a very likely scenario for remotely sensed images. In this situation, top-down or hybrid control strategies are more useful. In the top-down approach, the hypothesis phase requires the organisation of models indexed by attributes so that, based on observed attributes, a small set of likely objects can be selected. The selected models are then used to recognise objects in the verification phase (Jain et al., 1995). A disadvantage of this approach is that
the model control necessary in some parts of the image is too strong for other parts; for example,
symmetry requirements imposed by the model could corrupt borders. In the hybrid approach, the two
strategies are combined to improve processing efficiency.
Attributes are grouped whenever the resulting attribute is more informative than individual attributes.
This process is also called perceptual organisation. Lowe (1985, 1990) addressed this grouping question in object recognition and came up with some objective criteria for grouping attributes; he looks for configurations of edge segments that are unlikely to happen by chance and are preserved under projection. Collinear and parallel edges are an example.
Zerroug and Nevatia (1993) utilise regularities in the projections of homogeneous generalised cylinders into 2-D. Most other researchers have developed ad hoc criteria for grouping, e.g., Steger et al. (1997) for road extraction, and Henricsson and Baltsavias (1997) for building extraction. It seems obvious that local context will play a large part in attribute grouping, since one would expect a particular arrangement of local attributes in relation to each other to define a local context.
General knowledge about occlusion, perspective, geometry and physical support is also necessary for the recognition task. Brooks (1981) built a geometric reasoning system called ACRONYM for object recognition. The system SIGMA by Matsuyama and Hwang (1985) includes a geometric reasoning expert. McGlone and Shufelt (1994) have incorporated projective geometry into their system for building extraction, while Lang and Förstner (1996) have developed polymorphic features for the development of procedures for building extraction.
Context plays a significant role in image understanding. In particular, relaxation labelling methods use local and global context to perform semantic labelling of regions and objects in an image. After
the segmentation phase, scene labelling should correspond with available scene knowledge and the labelling should be consistent. This problem is usually solved using constraint propagation: local constraints result in local consistencies, and by applying an iterative scheme, the local consistencies adjust to
global consistencies in the whole image. A full survey of relaxation labelling is available in Hancock and Kittler (1990). Discrete relaxation methods are oversimplified and cannot cope with incomplete or inaccurate segmentation.
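A minimal sketch of such iterative, probability-weighted constraint propagation, in the spirit of probabilistic relaxation, is given below. The label set, compatibility coefficients and initial probabilities are all invented for the example; published schemes differ in their exact update rule.

```python
# Toy probabilistic relaxation labelling: two neighbouring regions must be
# labelled 'water' or 'vegetation'. Compatibilities and initial probabilities
# are invented for illustration.

LABELS = ["water", "vegetation"]

# compat[a][b]: how compatible label a on one region is with label b on its neighbour.
compat = {
    "water":      {"water": 0.9, "vegetation": 0.3},
    "vegetation": {"water": 0.3, "vegetation": 0.9},
}

# Initial (noisy) label probabilities for two adjacent regions.
probs = [
    {"water": 0.6, "vegetation": 0.4},
    {"water": 0.2, "vegetation": 0.8},
]

def relax(probs, iterations=10):
    """Iteratively adjust each region's label probabilities using its neighbour's."""
    for _ in range(iterations):
        new = []
        for i, p in enumerate(probs):
            neighbour = probs[1 - i]  # the only neighbour in this toy scene
            # Contextual support for each label from the neighbour's current beliefs.
            support = {
                lab: sum(compat[lab][nl] * neighbour[nl] for nl in LABELS)
                for lab in LABELS
            }
            z = sum(p[lab] * support[lab] for lab in LABELS)  # normalising factor
            new.append({lab: p[lab] * support[lab] / z for lab in LABELS})
        probs = new
    return probs

final = relax(probs)
```

After a few iterations, the mutually compatible interpretation dominates: the weak initial preference of the first region is overturned by its neighbour's strong evidence, illustrating how local consistencies propagate towards a global one.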
Probabilistic relaxation works on the basis that a locally inconsistent but very probable global interpretation may be more valuable than a consistent but unlikely explanation; see Rosenfeld et al. (1976) for an early example of this approach.

To handle uncertainty at the matching stage, various evidence-based techniques have been used. Examples include systems which utilise Dempster–
Shafer theory (Wesley, 1986; Provan, 1990; Clarkson, 1992), reliability values (Haar, 1982), fuzzy logic (Levine and Nazif, 1985), the principle of least commitment (Jain and Haynes, 1982), confidence values (McKeown and Harvey, 1987), random closed sets (Quinio and Matsuyama, 1991) and Bayesian networks (Rimmey, 1993; von Kaenel et al., 1993; Sarkar and Boyer, 1994).
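As an illustration of the evidence-combination idea, Dempster's rule for pooling two mass functions can be sketched as follows. The frame of discernment and the mass values are invented for the example and do not come from any of the cited systems.

```python
# Dempster's rule of combination for a two-hypothesis frame {building, road}.
# Mass values are invented for illustration.
from itertools import product

FRAME = frozenset(["building", "road"])

# Two independent evidence sources assign mass to subsets of the frame;
# mass on FRAME itself represents "don't know".
m1 = {frozenset(["building"]): 0.6, FRAME: 0.4}
m2 = {frozenset(["building"]): 0.5, frozenset(["road"]): 0.2, FRAME: 0.3}

def combine(m1, m2):
    """Dempster's rule: intersect focal elements, renormalise away conflict."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

m = combine(m1, m2)
```

The renormalisation by the conflict mass is what distinguishes this from a simple product of probabilities, and is also the step most often criticised when the sources disagree strongly.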
4. Some examples of applications of modelling and representation
The applications of knowledge representation and modelling methods in machine vision, and photogrammetry and remote sensing, have incorporated most of the approaches described in the foregoing. The leaders in these applications have logically been researchers in machine vision. In the fields of photogrammetry and remote sensing, the approaches adopted have followed those in the field of computer vision, and have been adapted for the types of information being extracted. These applications demonstrate that there is a growing level of expertise in techniques of artificial intelligence amongst the researchers in photogrammetry and remote sensing. The evolution of these methods has been from rule-based systems to semantic networks and frames to
description logics. A review of some applications in machine vision, photogrammetry and remote sensing
in this section will demonstrate these trends.
4.1. Logic

The first researchers to advocate the use of logic as a representation in computer vision systems are Reiter and Mackworth (1989). In their paper, they
proposed a logical framework for depiction and interpretation of image and scene knowledge, as well
as a formal mapping between the two. They propose image axioms, scene axioms and depiction axioms,
whose logical model forms an interpretation of an image. They illustrate their approach using a simple
map-understanding system called Mapsee. The ap- plication is relatively limited, however, and newer
systems have not been reported. One reason could be the computational complexity. While logic provides
a consistent formalism to specify constraints, ad hoc search using logic is not efficient. Further, first-order logic (FOL) by itself is not good for representing uncertainty or incompleteness in data, which is in the nature of
image properties. The correspondence between image elements and scene objects is not usually one-to-one, and additional logical relations are necessary to model these. Matsuyama and Hwang (1990) adopt a logical framework in which new logical constants and axioms are generated dynamically.
4.2. Rule-based and production systems

Brooks (1981) developed ACRONYM, a model-based image understanding system for detecting 3-D objects, and tested it to extract aircraft in aerial images. 3-D models of aircraft are stored using a frame-based representation. Given an image to be
analysed, ACRONYM extracts line segments and obtains 2-D generalised cylinders. Rules encoding
geometric knowledge as well as knowledge of imaging conditions are used to generate expected 3-D
models of the scene, which are then matched against the frames to identify aircraft.
SIGMA (Matsuyama and Hwang, 1985) is an aerial image understanding system that uses frames to represent knowledge, and both top-down and bottom-up control schemes to extract features. It consists of three subsystems: the Geometric Reasoning Expert (GRE), Model Selection Expert (MSE), and Low Level Vision Expert (LLVE). Information passes from the GRE to the MSE, which then communicates with the LLVE. The frames in SIGMA use slots storing attributes of an object and its relationships to other objects. Based on the spatial knowledge in the frames, hypotheses are generated for objects and matched against image features. This is done by the MSE reasoning about the most likely appearance of an object and conveying this in image terms to the LLVE. This top-down selection of image attributes helps detect small attributes. The system was tested to extract houses and road segments from aerial images.
McKeown et al. (1985) present a rule-based system for the interpretation of airports in aerial images. It was based on about 450 rules, divided into six classes for: initialisation, region-to-interpretation for interpreting the original image fragments, local evaluation, consistency checks, functional area rules for grouping of image fragments into functional areas, and goal-generation rules for building the airport
model.
McKeown and Harvey (1987) present a system for aerial image interpretation, with rules compiled from standard knowledge sets, called schemata. They generated rules automatically from higher level modules, which made for better error-handling and more efficient execution. Their system contained about 100 schemata, each of which generated about five rules.
Strat and Fischler (1991) developed the knowledge-based system called 'Condor' for the recognition of terrain scenes based on context. Context is defined by rules within context sets at various levels. The context sets are not infallible, and hence, redundancy is built into them. The interpretation is based on three types of rules: candidate generation, candidate evaluation, and consistency determination. Candidates are compared in the evaluation process, which scores the relative likelihood that a candidate is an instance of a given class. The authors state that this division keeps the knowledge in manageable units.
Stilla et al. (1996) present a model-based system for automatic extraction of buildings from aerial images, in which objects to be recognised are modelled by production rules and depicted by a production set. The object model is both specific and generic. The specific model describes objects using a fixed topological structure, while the generic models are more general.
These systems illustrate that rule-based systems do not guarantee additivity of knowledge and consistency of reasoning. Breaking up a rule base into multiple rules of varying granularity makes the program less modular and more difficult to modify. Draper et al. (1989) suggest blackboard and schema-based architectures to handle this.
4.3. Blackboard systems

Nagao and Matsuyama (1980) first addressed the problem of scene understanding using the blackboard model, and applied it to aerial images of suburban areas, involving identification of cars, houses and roads. Their system consists of a global database (the blackboard) and a set of knowledge sources. The blackboard records data in a hierarchy consisting of elementary regions, characteristic regions and objects. The blackboard also stores a label picture, which links pixels in the original image to the regions in the database. Elementary regions are the result of an image segmentation process, and are characterised by grey-level, size and location in the image. Characteristic features of the regions are then extracted, resulting in the identification of elementary regions with the following attributes:
1. Large, homogeneous regions, based on region size.
2. Elongated regions, based on shape.
3. Regions in shadow, based on region brightness.
4. Regions capable of causing shadows, based on location of adjoining regions and the position of the sun.
5. Vegetation and water regions, from multispectral information.
6. High contrast texture regions, from textural information.

These properties are stored on the blackboard by
separate modules. The knowledge sources then identify a particular object, given the presence or absence of the characteristic features of various regions. Each knowledge source is a single rule, with a condition and a complex action part that performs various picture processing operations to detect the object.
For example, the knowledge source to detect a crop field would look like:

if large homogeneous region and vegetation region
and not water region and not shadow-making region
then perform crop field identification.
Each knowledge source identifies an object independently, and this might lead to conflicting identifications for the same region (for example, crop field and grassland). To solve this, the system automatically calculates a reliability value for each identification, and uses it to discard all but the most reliable.
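The single-rule knowledge sources and the reliability-based conflict resolution can be sketched as follows. The attribute names, the rules and the reliability values are all invented for illustration and are not taken from Nagao and Matsuyama's system.

```python
# Sketch of blackboard knowledge sources in the style described above: each
# source is one rule whose condition tests region attributes and whose action
# proposes an object label with a reliability value. All names and numbers
# here are invented for illustration.

blackboard = {
    "region_7": {"large_homogeneous": True, "vegetation": True,
                 "water": False, "shadow_making": False},
}

def crop_field_ks(attrs):
    # Condition part mirrors the crop-field rule quoted in the text.
    if (attrs["large_homogeneous"] and attrs["vegetation"]
            and not attrs["water"] and not attrs["shadow_making"]):
        return ("crop field", 0.8)  # (label, reliability)
    return None

def grassland_ks(attrs):
    if attrs["vegetation"] and not attrs["water"]:
        return ("grassland", 0.5)
    return None

KNOWLEDGE_SOURCES = [crop_field_ks, grassland_ks]

def interpret(blackboard):
    """Run every knowledge source; keep only the most reliable label per region."""
    result = {}
    for region, attrs in blackboard.items():
        candidates = [ks(attrs) for ks in KNOWLEDGE_SOURCES]
        candidates = [c for c in candidates if c is not None]
        if candidates:
            result[region] = max(candidates, key=lambda c: c[1])[0]
    return result
```

Here both sources fire on the same region, and the reliability comparison resolves the conflict in favour of the crop-field interpretation.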
Füger et al. (1994) present a blackboard-based data-driven system for analysis of man-made objects in aerial images. Generic object models are represented symbolically in the blackboard, an individual object being described by several attributes. The models are controlled by numerous parameters, which are determined by a closed-loop system using 'evolution strategies'.
Stilla (1995) presents a blackboard-based production system for image understanding, which is suitable for the structural analysis of complex scenes in aerial images. Starting with primitive objects, a target object can be composed step-by-step using productions repeatedly. The compositions of objects are recorded and represented by a derivation graph. The map is modelled as a set of straight lines. The results of map analysis are one or more target objects with their corresponding derivation graphs. Image analysis may also be performed identically, by segmenting the binary image and approximating contours by straight lines.
Blackboard systems in general tend to have a centralised control structure, so that efficiency becomes an issue. Also, blackboards assume that knowledge sources will be available when needed and then vanish, whereas in vision applications, they tend to persist as long as the image is being analysed.
4.4. Frames

Hanson and Riseman (1978) used frames as hypothesis generation mechanisms for vision systems. Knowledge about classes of objects was represented as frames, and slots represented binary geometric relations between classes of objects. Slots also contained production rules for instantiating other object frames. Thus, frames are used both for control and representation. Ikeuchi and Kanade (1988) used frames to represent aspects of 3-D objects. When exact object models are available, processing is top-down, but given weaker models and more exact data, processing is bottom-up. However, using frames for both control and representation hides the procedural behaviour of the system and destroys its temporal coordination (Draper et al., 1989). Other systems which use frames include ACRONYM, SIGMA and Nagao and Matsuyama's system, already described above.
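The dual use of frames for representation and control described above can be sketched schematically. The class, the slot names and the attached rule below are invented for illustration and are not drawn from any of the cited systems.

```python
# Schematic frame: slots hold attributes and binary relations (representation),
# and an attached production rule instantiates another frame (control).
# All names here are illustrative only.

class Frame:
    def __init__(self, name, slots=None):
        self.name = name
        self.slots = dict(slots or {})  # attribute and relation slots

# A 'house' frame whose 'adjacent-to' slot carries a binary geometric relation
# to another object class, and whose 'if-instantiated' slot holds a control
# rule that hypothesises a related object.
house = Frame("house", {
    "shape": "rectangular",
    "adjacent-to": "road",  # binary relation between object classes
    "if-instantiated": lambda scene: scene.append(Frame("driveway")),
})

scene = []
scene.append(house)
house.slots["if-instantiated"](scene)  # control: the rule spawns a hypothesis
```

The sketch also makes the cited criticism visible: the control behaviour is buried inside a slot value, so the order in which hypotheses appear is hidden from any external scheduler.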
4.5. Semantic networks

Nicolin and Gabler (1987) describe a system to analyse aerial images, using semantic nets to represent and interpret the image. The system consists of a Short Term Memory (STM), a Methodology Base (MB), and a Long Term Memory (LTM). The STM is conceptually equivalent to a blackboard and stores the partial interpretation of the image. The LTM stores the a priori knowledge of the scene and the domain-specific knowledge (i.e., the knowledge base). The system matches the contents of the STM against those of the LTM to produce an interpretation. This is accomplished using an inference mechanism that calls modules in the MB. The initial contents of the STM are established in a bottom-up way, and a model-driven phase generates and verifies the presence or absence of object attributes stored in the LTM.
Mayer (1994) has developed a semantic-network-based system for knowledge-based extraction of objects from digitised maps. The system is based on a combined semantic network and frames representation, as well as a combination of model-driven and data-driven control. The model is composed of three levels which generally correspond to the respective layers of bottom-up image processing:

1. The image layer, e.g., the digitised map,
2. Image-graph and graphics and text layers,
3. Semantic objects.

The semantic network is built up from the concept of 'part/part-of' elements in the graphs layer to the semantic objects, which comprise the 'specialisation/generalisation' relations between the graphics objects and the terrain objects. For example, an elongated area in the graphics objects layer is specialised into 'road-sides', 'pavements', 'road network', etc. Descriptions of other objects are not given, but the tests demonstrated the extraction of parcels and road networks. The frames are designed to analyse the various concepts and their properties. The object extraction is based on both model-driven and data-driven instantiation, with the initial search being based on a goal specified by the user. While the method is based on the extraction of well-defined information on maps, Mayer believes that the process should be useful for the extraction of information from images.
Tönjes (1996) has used semantic networks for modelling landscapes from overlapping aerial images. The output is a 3-D view of the terrain with appropriate representations of the vegetation. Tönjes states that semantic networks are suited to representing knowledge of structural objects. His semantic network is described by frames that include the relationships, attributes, and methods. The semantic net has three layers:

1. Sensor layer, which represents the segmentation layer, based on texture and stripes, as well as the image details;
2. Geometry and material layer, which represents the 3-D surface layer following the interpretation of the terrain cover from the sensor layer;
3. Scene layer, which contains the extracted objects.

The semantic network is established between components in the three layers. The relationship 'con-of' is the concrete realisation of an object in the image data; 'part-of' describes the decomposition of objects into parts; while 'is-a' is the specialisation of an object. The object descriptions are tracked through each layer for reconstruction, which is based on both data-driven as well as model-driven processes.
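These three relation types can be sketched as labelled edges in a small graph. The node names are invented for illustration; only the relation names follow the description above.

```python
# Minimal semantic network using the three relation types described above.
# Node names are invented for illustration.

edges = []  # (subject, relation, object) triples

def add(subj, rel, obj):
    edges.append((subj, rel, obj))

add("forest", "is-a", "vegetation")         # specialisation
add("forest", "part-of", "landscape")       # decomposition into parts
add("forest", "con-of", "textured_region")  # concrete realisation in image data

def related(subj, rel):
    """All objects linked to subj by relation rel."""
    return [o for s, r, o in edges if s == subj and r == rel]
```

Traversing 'con-of' edges downwards corresponds to the data-driven direction (image evidence supporting a scene object), while traversing 'is-a' and 'part-of' edges corresponds to the model-driven direction.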
Lang and Förstner (1996) have based their method of extraction of buildings on polymorphic mid-level features. The approach involves semantic modelling using a 'part-of' hierarchical representation. Relations between the parts have not yet been included. The hypothesis generation of the building is based on a combination of a data-driven model for the original generation of the vertices, and subsequent model-driven approaches for hypothesis generation of object interpretation and verification, using four building types as the models: flat roof, non-orthogonal flat roof, gable roof, and hip roof. The approach successfully extracts buildings.
Schilling and Vögtle (1996) have developed a procedure for updating digital map bases using existing map bases to aid the interpretation. The image is compared with the map to detect changes since the compilation of the map. New features are then analysed by semantic networks. Two networks are created, one for the scene and the other for the image, with the typical relationships established at different levels in the networks.
De Gunst (1996) has developed a combined data-driven and model-driven approach to recognising objects required for updating digital map data. The process is based on object-oriented models for road descriptions and a semantic network for the feature recognition, based on frames. The frames define such details as object relations, object definition, alternative object definitions and preprocessing relations. Road details include complex road junctions which are described by the knowledge base. This is a very detailed study involving several different types of road features. The success of the investigations varied significantly, demonstrating the difficulty in understanding such details.
Quint and Sties (1996) and Quint (1997) present a model-based system called MOSES to analyse aerial images, which uses semantic networks as a modelling tool. Models are automatically refined by using knowledge gained from topographical maps or GIS data. The generative model is the most general model containing common sense knowledge about the environment. Concepts in the generic models in the map and image domain are specialisations of the corresponding concepts in the generative model. A specific model is automatically generated by the system and is specific to the current scene; it is generated by combining the scene description obtained after map analysis with the generic model in the image domain. Initially, digitally available line segments are used for the structural analysis of the map, resulting in a structural description of the map scene. The scene description so obtained is then combined with the generic model in the image domain to yield the specific model, which will be used for image analysis. For structural analysis, image primitives (currently line segments and regions) serve as input. The analysis is model-driven, resulting in recognition of objects (parking places in the project). A merit function is used to guide search in the image analysis process.

To sum up, semantic networks have found wide acceptance and use in the interpretation of aerial images and digital maps.
4.6. Description logics

There are very few photogrammetric applications based on description logics. One such is Lange and Schröder's (1994) description-logic-based approach to the interpretation of changes in aerial images with respect to reference information extracted from a map. Knowledge about types of objects and types of possible changes is represented using a KL-ONE-like description logic (Brachman and Schmolze, 1985; Nebel, 1990), which permits the description of concepts in terms of necessary and sufficient conditions. Factual information about the scene and the interpretation is represented using the assertional component of the description logic. Geometric and topological constraints and relations between spatial objects are represented in the logic as object concepts and change concepts. An object is recognised to be an instance of an object concept after the image has been preprocessed and the attributes extracted. The definitions of the change concepts are used in exactly the same way to recognise changes. Search for instantiation is goal-directed, and uses a number of heuristics. The examples in the paper, however, seem to be based on artificial images.
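The instance-recognition step can be illustrated with a toy sketch: a concept defined by a set of conditions, applied to attribute vectors extracted from an image. The 'building' concept, its conditions and the attribute names are all invented for the example; KL-ONE-style languages are of course far richer than simple predicate lists.

```python
# Toy concept-based recognition: a concept is a set of necessary and
# sufficient attribute conditions, and an extracted object is classified as
# an instance if it satisfies all of them. Concept and attributes invented.

CONCEPTS = {
    "building": {
        "rectangular": lambda o: o["rectangularity"] > 0.8,
        "casts_shadow": lambda o: o["has_shadow"],
        "min_area": lambda o: o["area"] > 50.0,
    },
}

def instances_of(concept, objects):
    """Return the objects satisfying every condition of the concept definition."""
    conditions = CONCEPTS[concept].values()
    return [o for o in objects if all(c(o) for c in conditions)]

# Attribute vectors as they might come out of preprocessing (invented values).
extracted = [
    {"id": 1, "rectangularity": 0.9, "has_shadow": True, "area": 120.0},
    {"id": 2, "rectangularity": 0.4, "has_shadow": True, "area": 200.0},
]
```

Change concepts would be handled the same way, with conditions ranging over pairs of objects from the map and image interpretations rather than single attribute vectors.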
5. Conclusions