5.3 Regular expressions
Regular expressions are a classical and convenient way to describe, for example, the structure of terminal words. This section defines regular expressions, defines the
language of a regular expression, and shows that regular expressions and regular grammars are equally expressive formalisms. We do not discuss implementations of (datatypes and functions for matching) regular expressions; implementations can be found in the literature, see [9, 6].
Definition 14: RE_T, regular expressions over alphabet T
The set RE_T of regular expressions over alphabet T is inductively defined as follows: for regular expressions R, S

Ø ∈ RE_T
ε ∈ RE_T
a ∈ RE_T
R + S ∈ RE_T
RS ∈ RE_T
R∗ ∈ RE_T

where a ∈ T. The operator + is associative, commutative, and idempotent; the concatenation operator, written as juxtaposition (so x concatenated with y is denoted by xy), is associative, and ε is its unit. In formulae this reads, for all regular expressions R, S, and V,

R + (S + V) = (R + S) + V
R + S = S + R
R + R = R
R(SV) = (RS)V
Rε = R = εR

Furthermore, the star operator, ∗, binds stronger than concatenation, and concatenation binds stronger than +. Examples of regular expressions are:
(bc)∗ + Ø
ε + b(ε∗)

The language (i.e. the “semantics”) of a regular expression over T is a set of T-sequences, defined compositionally on the structure of regular expressions as follows.
Definition 15: Language of a regular expression
The function Lre :: RE_T → {T∗} returns the language of a regular expression. It is defined inductively by:

Lre(Ø) = Ø
Lre(ε) = {ε}
Lre(b) = {b}
Lre(x + y) = Lre(x) ∪ Lre(y)
Lre(xy) = Lre(x) Lre(y)
Lre(x∗) = (Lre(x))∗
Since ∪ is associative, commutative, and idempotent, and set concatenation is associative with {ε} as its unit, the function Lre respects the laws of Definition 14 and is therefore well defined. Note that the language Lre(b∗) is the set consisting of zero, one, or more concatenations of b, i.e., Lre(b∗) = ({b})∗. As an example of a language of a regular expression, we compute the language of the regular expression (ε + bc)d.
Lre((ε + bc)d)
= (Lre(ε + bc)) (Lre(d))
= (Lre(ε) ∪ Lre(bc)) {d}
= ({ε} ∪ (Lre(b))(Lre(c))) {d}
= {ε, bc} {d}
= {d, bcd}

Regular expressions are used to describe the tokens of a language. For example, the list

if p then e1 else e2

contains six tokens, three of which are identifiers. An identifier is an element in the language of the regular expression

letter (letter + digit)∗

where

letter = a + b + ... + z + A + B + ... + Z
digit = 0 + 1 + ... + 9

(see subsection 2.3.1).
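The inductive definition of Lre can be tried out directly. The following is a minimal Python sketch, not an implementation taken from the text or from the cited literature. All class and function names (Empty, Eps, Sym, Alt, Cat, Star, concat, lre, alt_of) are assumptions introduced for illustration. Because Lre(x∗) is in general infinite, the sketch only computes the words of length at most max_len, and the identifier example uses a deliberately tiny alphabet instead of the full sets of letters and digits.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Empty:          # Ø
    pass


@dataclass(frozen=True)
class Eps:            # ε
    pass


@dataclass(frozen=True)
class Sym:            # a single terminal a ∈ T
    char: str


@dataclass(frozen=True)
class Alt:            # R + S
    left: object
    right: object


@dataclass(frozen=True)
class Cat:            # RS
    left: object
    right: object


@dataclass(frozen=True)
class Star:           # R∗
    body: object


def concat(xs, ys, max_len):
    """Set concatenation, keeping only words of length <= max_len."""
    return {x + y for x in xs for y in ys if len(x) + len(y) <= max_len}


def lre(r, max_len):
    """Words of Lre(r) whose length is at most max_len."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Sym):
        return {r.char} if len(r.char) <= max_len else set()
    if isinstance(r, Alt):
        return lre(r.left, max_len) | lre(r.right, max_len)
    if isinstance(r, Cat):
        return concat(lre(r.left, max_len), lre(r.right, max_len), max_len)
    if isinstance(r, Star):
        base = lre(r.body, max_len)
        result, frontier = {""}, {""}
        while frontier:  # each round appends one more factor from base
            frontier = concat(frontier, base, max_len) - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


def alt_of(chars):
    """The sum c1 + c2 + ... of the given terminals (chars must be non-empty)."""
    return reduce(Alt, [Sym(c) for c in chars])


# (ε + bc)d, the example computed in the text
example = Cat(Alt(Eps(), Cat(Sym("b"), Sym("c"))), Sym("d"))
print(sorted(lre(example, 5)))  # ['bcd', 'd']

# letter (letter + digit)∗ over the assumed tiny alphabet {a, b} and {0, 1}
letter, digit = alt_of("ab"), alt_of("01")
identifier = Cat(letter, Star(Alt(letter, digit)))
print("a1" in lre(identifier, 2), "1a" in lre(identifier, 2))  # True False
```

Bounding the word length keeps every set finite, so the star case can be computed by repeatedly appending one more factor until no new words appear. A real tokenizer would not enumerate languages like this; the sketch only mirrors the definition.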
At the beginning of this section we claimed that regular expressions and regular grammars are equivalent formalisms. We will prove this claim later, but first we illustrate the construction of a regular grammar from a regular expression with an example. Consider the following regular expression.
R = a∗ + ε + (a + b)∗

We aim at a regular grammar G such that Lre(R) = L(G), and again we take a top-down approach.
Suppose that nonterminal A generates the language Lre(a∗), nonterminal B generates the language Lre(ε), and nonterminal C generates the language Lre((a + b)∗). Suppose furthermore that the productions for A, B, and C satisfy the conditions imposed upon regular grammars. Then we obtain a regular grammar G with L(G) = Lre(R) by defining

S → A
S → B
S → C

where S is the start-symbol of G. It remains to construct productions for nonterminals A, B, and C.
• The nonterminal A with productions

A → aA
A → ε

generates the language Lre(a∗).

• Since Lre(ε) = {ε}, the nonterminal B with production

B → ε

generates the language {ε}.

• Nonterminal C with productions

C → aC
C → bC
C → ε

generates the language Lre((a + b)∗).

For a specific example it is not difficult to construct a regular grammar for a regular expression. We now give the general result.
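The grammar constructed in this example can be checked mechanically, up to a length bound. The following Python sketch is an illustration under stated assumptions, not part of the original construction. The names GRAMMAR and generate are hypothetical, nonterminals are single upper-case characters, terminals are single lower-case characters, and ε is written as the empty string. The sketch enumerates all words of L(G) with at most max_len terminals and compares them with the words over {a, b} of the same length, which is what Lre(a∗ + ε + (a + b)∗) contains up to that bound.

```python
from collections import deque
from itertools import product

# Productions of the example grammar; the empty string stands for ε.
GRAMMAR = {
    "S": ["A", "B", "C"],
    "A": ["aA", ""],
    "B": [""],
    "C": ["aC", "bC", ""],
}


def generate(grammar, start, max_len):
    """All words derivable from start that have at most max_len terminals."""
    words = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # In a regular grammar only the last symbol can be a nonterminal.
        if form and form[-1] in grammar:
            prefix, nonterminal = form[:-1], form[-1]
            for rhs in grammar[nonterminal]:
                new_form = prefix + rhs
                terminals = sum(1 for c in new_form if c not in grammar)
                if terminals <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
        else:
            words.add(form)  # no nonterminal left: a word of L(G)
    return words


# Every word over {a, b} of length at most 2.
expected = {"".join(w) for n in range(3) for w in product("ab", repeat=n)}
print(generate(GRAMMAR, "S", 2) == expected)  # True
```

Agreement up to a bound is only a sanity check, not a proof that Lre(R) = L(G).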
Theorem 16: Regular Grammar for Regular Expression
For each regular expression R there exists a regular grammar G such that
Lre(R) = L(G)
The proof of this theorem is given in Section 5.4.

To obtain a regular expression that generates the same language as a given regular grammar we go via an automaton. Given a regular grammar G, we can use the theorems from the previous sections to obtain a DFA D such that
L(G) = Ldfa(D)
So if we can obtain a regular expression for a DFA D, we have found a regular expression for a regular grammar. To obtain a regular expression for a DFA D, we interpret each state of D as a regular expression defined as the sum of the concatenations of outgoing terminal symbols with the resulting state. For our example DFA we obtain:

C = ε

It is easy to merge these four regular expressions into a single regular expression, partly because this is a simple example. Merging the regular expressions obtained from a DFA that may loop is more complicated, as we will briefly explain in the proof of the following theorem. In general, we have:
Theorem 17: Regular Expression for Regular Grammar
For each regular grammar G there exists a regular expression R such that
L(G) = Lre(R)
The proof of this theorem is given in Section 5.4.
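The reading of DFA states as regular expressions, described before Theorem 17, can be sketched in Python as follows. This is an illustration only. The DFA below is hypothetical and is not the example DFA of the text, and the names DFA, ACCEPTING, and state_equations are assumptions. The sketch also assumes the usual convention that an accepting state contributes the summand ε, and that a state with no summands denotes Ø.

```python
# A hypothetical loop-free DFA: state -> {terminal: next state}.
DFA = {
    "S": {"a": "A"},
    "A": {"b": "B", "c": "C"},
    "B": {"d": "C"},
    "C": {},
}
ACCEPTING = {"C"}


def state_equations(transitions, accepting):
    """One equation per state: the sum of 'terminal followed by target state'."""
    equations = {}
    for state in sorted(transitions):
        terms = [
            symbol + target
            for symbol, target in sorted(transitions[state].items())
        ]
        if state in accepting:
            terms.append("ε")  # assumed convention for accepting states
        equations[state] = " + ".join(terms) if terms else "Ø"
    return equations


for state, rhs in state_equations(DFA, ACCEPTING).items():
    print(f"{state} = {rhs}")
```

For this loop-free DFA the equations can be merged by substituting from the bottom up, which gives a(bd + c) for the start state S. When the DFA contains loops, plain substitution is not enough, as the text notes.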