Regular Languages, Expressions, and Automata

Item 2 – Regular Languages

2.1 – Languages on Literacy

Consider the alphabet = {a, b}. For all natural numbers n, there are only a finite number of words whose length is n. These strings can be ordered lexicographically, numbered 0, 1, 2… For example, for length 1: 0 is a, 1 is b. For length 2: 0 is aa, 1 is ab, and so on. So, assuming an arbitrary alphabet Q, since all alphabets are finite, we can assign an arbitrary order to the characters. Without loss of generality, we can write Q = {a0, a1, a2 … an} and number the words the same way: 0, a1 -> 1, a2 -> 2, … an -> n, a1a1 -> n+1 and so on. For each alphabet, the set of all words is countably infinite. The set of all languages is NOT.

2.2 Regular Languages and Regular Expressions

Regular Languages

Let Σ be an alphabet. The set of regular languages is defined recursively as follows:

  1. ∅ is a regular language (the empty language).
  2. {ε} is a regular language (the language containing only the empty string).
  3. For all a ∈ Σ, {a} is a regular language.
  4. If A and B are regular languages, then A ∪ B, A · B, and A* are regular languages.
  5. No other languages are regular.

Regular languages can only be obtained through a finite number of applications of the operations in point 4. Example with Σ = {a, b}:

  • ∅, {ε}, {a}, {b}
  • {a, b} = {a} ∪ {b}
  • {ab} = {a} · {b}
  • {a, b, ab}
  • {ai | i >= 0} = {a}*
  • L1 = {aibj | i >= 0, j >= 0} = {a}* · {b}*
  • {(ab)i | i >= 0} = ({a} · {b})*
  • {a, b}* is the largest language.

Non-Regular Languages: L2 = {aibi | i >= 0}. A subset of a regular language may not be regular. For example, L2 is not regular, even though it is a subset of L1. Example: Σ = {a, b, c}. Is the language L of all strings that do not contain the substring ‘ac’ regular? Yes: (c)*((a) ∪ (b)(c)*)*

Regular Expressions

Regular expressions denote regular languages. Example: A ∪ B denotes {a} ∪ {b} = {a, b}. The recursive definition of a regular expression over an alphabet is:

  1. ∅ and ε are regular expressions.
  2. a is a regular expression for all a ∈ Σ.
  3. If r and s are regular expressions, then r ∪ s, rs, and r* are also regular expressions.
  4. No other symbols are regular expressions.

Only those expressions obtained by applying points 1 and 2 with operations in 3 are regular expressions. L(r) is the language generated by r, and L(r) is always regular. Two regular expressions are equivalent if L(r) = L(s). Examples:

  • ab <—> {ab}
  • {a, b, ab} <—> a ∪ b ∪ ab
  • {ai | i >= 0} <—> a*
  • {a, b}* <—> (a ∪ b)*

Theorem (Simplified): Let r, s, and t be regular expressions over the same alphabet. Then:

  • r ∪ s = s ∪ r
  • r ∪ ∅ = ∅ ∪ r = r
  • r ∪ r = r
  • r · ε = ε · r = r
  • r ∅ = ∅ r = ∅
  • (rs)t = r(st)
  • r(s ∪ t) = rs ∪ rt
  • (r ∪ s)t = rt ∪ st
  • r** = r*
  • r*r* = r*
  • (ε ∪ r)* = r*
  • (r ∪ ε) = ε ∪ r*
  • rr* = r+
  • (r ∪ s)* = (r* ∪ s*)* = (r*s*)* = (r*s)*r* = r*(sr*)*
  • r(sr)* = (rs)*r
  • (r*s)* = ε ∪ (r ∪ s)*s
  • (rs*)* = ε ∪ r(r ∪ s)*
  • s(r ∪ ε)*(r ∪ ε) ∪ s = sr*
  • r · r* = r* · r = r+
  • r* = r+ ∪ ε
  • rr* = r+

2.3 Deterministic Finite Automata (DFA)

A DFA M = (Q, Σ, δ, s, F) is a collection of 5 elements:

  1. An input alphabet Σ.
  2. A finite set Q of states (non-empty finite set).
  3. An initial state s (or start state).
  4. A set F of final states (or accepting states).
  5. A function δ: Q x Σ -> Q that determines the next state for a unique pair (qi, a) for the current state and input (transition function).

A DFA can be represented by a state diagram, which is a directed graph whose nodes are states and whose edges are transitions. A final state is represented by a double circle. For any current state and input character, there is always a unique next state. So, for a pair (qi, a), there is one and only one value of the function (next state). The state is fully determined by following the information provided by the pair (qi, a), so it is deterministic. If the transition function of a DFA does not specify all cases, it is considered an unspecified transition. The transition function is represented by a transition table.

2.4 DFA and Languages

If M is a DFA, then the language accepted by M is L(M) = {w ∈ Σ* | w is accepted by M}. Hence, L(M) is the set of strings that, when processed by M, lead from the initial state to an accepting state. (Accepts strings that start at the initial state and finish in a final state).

Note: L(M) consists of all strings accepted by M, and it is not a set of strings that are all accepted by M. Two DFAs are equivalent if L(M1) = L(M2).

2.5 Non-deterministic Finite Automata (NFA)

If, from a state, zero, one, or more transitions are allowed with the same input symbol, we say that it is a non-deterministic FA. It is sometimes desirable to design NFAs instead of DFAs. We define an NFA as a collection of five objects M = (Q, Σ, δ, s, F), where:

  1. Q is a finite set of states.
  2. Σ is the input alphabet.
  3. s is a state in Q designated as the initial state.
  4. F is a set of final or accepting states.
  5. δ is a relation on (Q x Σ) x Q and is called the transition relation.

An NFA can be represented by a state diagram, as with DFAs, and also with a transition table. Note that in the transition table, cells can contain sets of states. Cells with ∅ indicate that there is no transition from the current state with the given input. If for a current state and input, there are multiple possible next states, it means that we can choose between various options, so it is non-deterministic. If M is an NFA, we define the language accepted by M as: L(M) = {w | w is a string accepted by M} = {w | (s, w) ∈ F}. A string w is accepted by M if M goes from its initial state to a final state after processing w (w is totally consumed). To determine if a string belongs to L(M), we should follow the transition diagram for M. We must find a path that ends in a final state when the entire string has been consumed. During the traversal, we should choose a non-deterministic transition from one state to another when there is more than one option for the same symbol. To say that a string is not in L(M), we must exhaust all possible paths in the state diagram for that string. NFAs do not do anything that DFAs cannot do. If a state is reached where no transition is specified, since the automaton is not completely specified, then we reject.

2.6 Equivalence of NFA and DFA

An automaton M is equivalent to another M’ if L(M) = L(M’). For every NFA accepting a language, there is always a DFA that accepts the same language.

2.7 Step to Convert an NFA to a DFA

It starts in the initial state, and those states are incorporated that are subsets of Q and that arise as a result of a transition from a previously added state (for each new state, we add its transitions). We stop when no more new states appear. An NFA with n states can become, at best, a DFA with 2n states, but in practice, the number of states increases slightly. Which are the final states of the new DFA? Any subset A contained in Q such that A ∩ F ≠ ∅. A is a new final state of the DFA (any subset of states in the DFA that contains a final state of the NFA is a final state).

2.8 NFA with Epsilon Transitions (NFA-ε)

We can extend the definition of an NFA to include transitions from state to state that do not depend on any input symbol -> epsilon transitions. They are called so because they do not consume any input symbol. In a transition table, it becomes like a symbol. To systematize the process, we can calculate the set of reachable states in an NFA with epsilon transitions. For every state q ∈ Q, we define the epsilon-closure of q as: ε-closure(q) = {p | p is reachable from q without consuming any input}.