Generation of Structural Patterns
8.1 Generation of Structural Patterns
We begin our definition of a generative grammar in the Chomsky model [47] by introducing a set of terminal symbols. Terminal symbols are expressions which are used for constructing sentences belonging to a language generated by a grammar. 2 Let us assume that we construct a grammar for a subset of the English language consisting of sentences which contain four subjects (male names): Hector, Victor, Hyacinthus, Pacificus, two predicates: accepts, rejects, and four objects: glob- alism, conservatism, anarchism, pacifism. For example, the sentences Hector accepts globalism and Hyacinthus rejects conservatism belong to the language. 3
Let us denote this language by L 1 . Then, a set of terminal symbols T 1 is defined as follows: T 1 ={ Hector, Victor, Hyacinthus, Pacificus, accepts, rejects, globalism, conservatism, anarchism, pacifism}. As we have mentioned above, sentences are generated with rewriting rules called productions . 4 Let us consider an application of a production with the following example. Let there be given a phrase:
Hector accepts B ,
(8.1) where B is an auxiliary symbol, called a nonterminal symbol, 5 which denotes an
object in a sentence. Let there be given a production of the form:
(8.2) An expression placed on the left-hand side of an arrow is called the left-hand side of
B→ globalism .
a production and an expression placed on the right-hand side of an arrow is called the right-hand side of a production. An application of the production to the phrase, denoted by =⇒, consists of replacing an expression of the phrase which is equivalent to the left-hand side of the production (in our case it is the symbol B) by the right- hand side of the production. Thus, an application of the production is denoted in the
2 Let us remember that the notions of word and sentence are treated symbolically in formal language theory. For example, if we define a grammar which generates single words of the English language,
then letters are terminal symbols. Then, English words which consist of these letters are called words (sentences) of a formal language. However, if we define a grammar which generates sentences of the English language, then English words can be treated as terminal symbols. Then sentences of the English language are called words (sentences) of a formal language. As we see later on, words (sentences) of a formal language generated by a grammar can represent any string structures, e.g., stock charts, charts in medicine (ECG, EEG), etc. Since in this section we use examples from a natural language, strings consisting of symbols are called sentences.
3 We omit the terminal symbol of a full stop in all examples in order to simplify our considerations. Of course, if we use generative grammars for Natural Language Processing (NLP), we should use
a full stop symbol. 4 We call them productions, because they are used for generating—“producing”—sentences of a
language. 5 Nonterminal symbols are usually denoted by capital letters.
8.1 Generation of Structural Patterns 105 following way:
(8.3) Now, we define a set of all productions which generate our language. Let us denote
Hector accepts B =⇒ Hector accepts globalism .
this set by P 1 :
( 1) S → Hector A ( 2) S → Victor A ( 3) S → Hyacinthus A ( 4) S → Pacificus A ( 5) A → accepts B ( 6) A → rejects B ( 7) B → globalism ( 8) B → conservatism ( 9) B → anarchism (
10) B → pacifism .
For example, the sentence Pacificus rejects globalism is generated as follows:
S =⇒ 4 Pacificus A 6 =⇒ Pacificus rejects B 7 =⇒ Pacificus rejects globalism . (8.4)
Indices of the productions applied are placed above double arrows. A sequence of production applications used for generating a sentence is called a derivation of this sentence. If we are not interested in presenting the sequence of derivational steps, then we can simply write:
(8.5) which means that we can generate (derive) a sentence Pacificus rejects globalism
S ∗ =⇒ Pacificus rejects globalism ,
with the help of grammar productions starting with a symbol S. As we can see our set of nonterminal symbols, which we denote by N 1 , consists of three auxiliary symbols S, A, B, which are responsible for generating a subject,
a predicate, and an object, i.e.,
N 1 = {S, A, B}.
In a generative grammar a nonterminal symbol which is used for starting any derivation is called the start symbol (axiom) and it is denoted by S. Thus, our grammar
G 1 can be defined as a quadruple G 1 = (T 1 , N 1 , P 1 , S) . A language generated by a grammar G 1 is denoted L(G 1 ) . The language L(G 1 ) is the set of all the sentences which can be derived with the help of productions of the grammar G 1 . It can be proved that for our language L 1 , which has been introduced in an informal way at the beginning of this section, the following holds: L(G 1 ) =L 1 .
A generative grammar has an interesting property: it is a finite “object” (it consists of finite sets of terminal and nonterminal symbols and a finite set of productions), however it can generate an infinite language, i.e., a language which consists of an
106 8 Syntactic Pattern Analysis infinite number of sentences. Before we consider this property, let us introduce the
following denotations. Let a be a symbol. By a n we denote an expression which consists of an n-element sequence of symbols a, i.e.,
where n ≥ 0. If n = 0, then it is a 0-element sequence of symbols a, called the empty word and denoted by λ. Thus: a 0 = λ, a 1 = a, a 2 = aa, a 3 = aaa, etc. For example, let us define a language L 2 in the following way:
(8.7) The language L 2 is infinite and consists of n-element sequences of symbols a, where
L 2 = {a n , n≥ 1} .
additionally a has to occur at least once. (The empty word does not belong to L 2 .) It
can be generated by the following simple grammar G 2 :
G 2 = (T 2 , N 2 , P 2 , S) ,
where T 2 = {a}, N 2 = {S}, and the set of productions P 2 contains the following productions:
( 1) S → a S ( 2) S → a .
A derivation of a sentence of a given length r ≥ 2 (i.e., n = r ) in G 2 is performed according to the following scheme:
1 1 1 S 1 =⇒ a S =⇒ aa S =⇒ . . . =⇒ a r− 1 S 2 =⇒ a r , (8.8) i.e., firstly the first production is applied (r − 1) times, then the second production
is applied once at the end. If we want to generate a sentence of a length n = 1, i.e., the sentence a, the we apply the second production once.
Let us notice that defining a language L 2 as an infinite set is possible due to the first production. This production is of an interesting form: S → a S, which means that it refers to itself (a symbol S occurs on the left- and right-hand sides of the production). Such a form is called recursive, from Latin recurrere—“running back”. This “running back” of the symbol S during a derivation each time after applying
the first production, makes the language L 2 infinite.
Now, we discuss a very important issue concerning generative grammars. There are a lot of classes (types) of generative grammars. These classes can be arranged in a hierarchy according to the criterion of their generative power. We introduce this criterion with the help of the following example. Let us define the next formal language as follows:
L 3 = {a n b m , n≥ 1 , m ≥ 2} .
8.1 Generation of Structural Patterns 107 The language L 3 consists of the subsequence of symbols a and the subsequence of
symbols b. Additionally, the symbol a has to occur at least once, and the symbol b has to occur at least twice. For example, the sentences abb, aabb, aabbb, aabbbb, ... , aaabb , etc. belong to this language. It can be generated by the following grammar:
G 3 = (T 3 , N 3 , P 3 , S) ,
where T 3 = {a , b}, N 3 = {S , A , B}, and the set of productions P 3 contains the following productions:
( 1) S → a A ( 2) A → a A ( 3) A → b B ( 4) B → b B ( 5) B → b .
The first production is used for generating the first symbol a. The second production generates successive symbols a in a recursive way. We generate the first symbol b with the help of the third production. The fourth production generates successive symbols
b in a recursive way (analogously to the second production). The fifth production is used for generating the last symbol b of the sentence. For example, a derivation of the sentence a 3 b 4 is performed as follows:
1 2 2 3 S 4 =⇒ a A =⇒ aa A =⇒ aaa A =⇒ aaab B =⇒
4 4 =⇒ aaabb B 5 =⇒ aaabbb B =⇒ aaabbbb .
(8.10) Now, let us define a language L 4 which is a modification of the language L 3 , in
the following way:
L 4 = {a m b m , m≥ 1} .
(8.11) The language L 4 differs from the language L 3 in demanding an equal number of
symbols a and b. The language L 4 cannot be generated with the help of grammars having productions of the form of grammar G 3 , since such productions do not ensure the condition of an equal number of symbols. It results from the fact that in such
a grammar we firstly generate a certain number of symbols a and then we start to generate symbols b, but the grammar “does not remember"how many symbols a have been generated. We say that a grammar having productions in the a form of the
productions of G 3 has too weak generative power to generate the language L 4 . Now, we introduce a grammar G 4 which is able to generate the language L 4 :
G 4 = (T 4 , N 4 , P 4 , S) ,
where T 4 = {a , b}, N 4 = {S}, and the set of productions P 4 contains the following productions:
108 8 Syntactic Pattern Analysis
( 1) S → a Sb ( 2) S → ab .
For example, a derivation of the sentence a 4 b 4 is performed in the following way:
(8.12) As we can see a solution to the problem of an equal number of symbols a and b is
1 1 1 S 2 =⇒ a Sb =⇒ aa Sbb =⇒ aaa Sbbb =⇒ aaaabbbb .
obtained by generating the same number of both symbols in each derivational step. Thus, the grammar G 4 has sufficient generative power to generate the language
L 4 . The generative power of classes of formal grammars results from the form of
their productions. Let us notice that the productions of grammars G 1 ,G 2 ,G 3 are of the following two forms:
< nonterminal symbol> → <terminal symbol><nonterminal symbol> or <nonterminal symbol> → <terminal symbol> .
Using productions of such forms, we can only “stick” a symbol onto the end of the phrase which has been derived till now. Grammars having only productions of such a form are called regular grammars. 6 In the Chomsky hierarchy such grammars have the weakest generative power. For grammars such as the grammar G 4 , we do not demand any specific form of the right-hand side of productions. We require only a single nonterminal symbol at the left-hand side of a production. Such grammars are called context-free grammars. 7 They have greater generative power than regular grammars. However, we have to pay a certain price for increasing the generative power of grammars. We discuss this issue in the next section.