Regular Languages, Expressions, and Automata

Posted on Dec 17, 2024 in Computers

Item 2 – Regular Languages

2.1 – Languages on Literacy

Consider the alphabet = {a, b}. For all natural numbers n, there are only a finite number of words whose length is n. These strings can be ordered lexicographically, numbered 0, 1, 2… For example, for length 1: 0 is a, 1 is b. For length 2: 0 is aa, 1 is ab, and so on. So, assuming an arbitrary alphabet Q, since all alphabets are finite, we can assign an arbitrary order to the characters. Without loss of generality, we can write Q = {a₀, a₁, a₂ … a_n} and number the words the same way: 0, a₁ -> 1, a₂ -> 2, … a_n -> n, a₁a₁ -> n+1 and so on. For each alphabet, the set of all words is countably infinite. The set of all languages is NOT.

2.2 Regular Languages and Regular Expressions

Regular Languages

Let Σ be an alphabet. The set of regular languages is defined recursively as follows:

∅ is a regular language (the empty language).
{ε} is a regular language (the language containing only the empty string).
For all a ∈ Σ, {a} is a regular language.
If A and B are regular languages, then A ∪ B, A · B, and A^* are regular languages.
No other languages are regular.

Regular languages can only be obtained through a finite number of applications of the operations in point 4. Example with Σ = {a, b}:

∅, {ε}, {a}, {b}
{a, b} = {a} ∪ {b}
{ab} = {a} · {b}
{a, b, ab}
{aⁱ | i >= 0} = {a}^*
L1 = {aⁱb^j | i >= 0, j >= 0} = {a}^* · {b}^*
{(ab)ⁱ | i >= 0} = ({a} · {b})^*
{a, b}^* is the largest language.

Non-Regular Languages: L2 = {aⁱbⁱ | i >= 0}. A subset of a regular language may not be regular. For example, L2 is not regular, even though it is a subset of L1. Example: Σ = {a, b, c}. Is the language L of all strings that do not contain the substring ‘ac’ regular? Yes: (c)^*((a) ∪ (b)(c)^*)^*

Regular Expressions

Regular expressions denote regular languages. Example: A ∪ B denotes {a} ∪ {b} = {a, b}. The recursive definition of a regular expression over an alphabet is:

∅ and ε are regular expressions.
a is a regular expression for all a ∈ Σ.
If r and s are regular expressions, then r ∪ s, rs, and r^* are also regular expressions.
No other symbols are regular expressions.

Only those expressions obtained by applying points 1 and 2 with operations in 3 are regular expressions. L(r) is the language generated by r, and L(r) is always regular. Two regular expressions are equivalent if L(r) = L(s). Examples:

ab <—> {ab}
{a, b, ab} <—> a ∪ b ∪ ab
{aⁱ | i >= 0} <—> a^*
{a, b}^* <—> (a ∪ b)^*

Theorem (Simplified): Let r, s, and t be regular expressions over the same alphabet. Then:

r ∪ s = s ∪ r
r ∪ ∅ = ∅ ∪ r = r
r ∪ r = r
r · ε = ε · r = r
r ∅ = ∅ r = ∅
(rs)t = r(st)
r(s ∪ t) = rs ∪ rt
(r ∪ s)t = rt ∪ st
r^** = r^*
r^*r^* = r^*
(ε ∪ r)^* = r^*
(r ∪ ε) = ε ∪ r^*
rr^* = r⁺
(r ∪ s)^* = (r^* ∪ s^*)^* = (r^*s^*)^* = (r^*s)^*r^* = r^*(sr^*)^*
r(sr)^* = (rs)^*r
(r^*s)^* = ε ∪ (r ∪ s)^*s
(rs^*)^* = ε ∪ r(r ∪ s)^*
s(r ∪ ε)^*(r ∪ ε) ∪ s = sr^*
r · r^* = r^* · r = r⁺
r^* = r⁺ ∪ ε
rr^* = r⁺

2.3 Deterministic Finite Automata (DFA)

A DFA M = (Q, Σ, δ, s, F) is a collection of 5 elements:

An input alphabet Σ.
A finite set Q of states (non-empty finite set).
An initial state s (or start state).
A set F of final states (or accepting states).
A function δ: Q x Σ -> Q that determines the next state for a unique pair (q_i, a) for the current state and input (transition function).

A DFA can be represented by a state diagram, which is a directed graph whose nodes are states and whose edges are transitions. A final state is represented by a double circle. For any current state and input character, there is always a unique next state. So, for a pair (q_i, a), there is one and only one value of the function (next state). The state is fully determined by following the information provided by the pair (q_i, a), so it is deterministic. If the transition function of a DFA does not specify all cases, it is considered an unspecified transition. The transition function is represented by a transition table.

2.4 DFA and Languages

If M is a DFA, then the language accepted by M is L(M) = {w ∈ Σ^* | w is accepted by M}. Hence, L(M) is the set of strings that, when processed by M, lead from the initial state to an accepting state. (Accepts strings that start at the initial state and finish in a final state).

Note: L(M) consists of all strings accepted by M, and it is not a set of strings that are all accepted by M. Two DFAs are equivalent if L(M1) = L(M2).

2.5 Non-deterministic Finite Automata (NFA)

If, from a state, zero, one, or more transitions are allowed with the same input symbol, we say that it is a non-deterministic FA. It is sometimes desirable to design NFAs instead of DFAs. We define an NFA as a collection of five objects M = (Q, Σ, δ, s, F), where:

Q is a finite set of states.
Σ is the input alphabet.
s is a state in Q designated as the initial state.
F is a set of final or accepting states.
δ is a relation on (Q x Σ) x Q and is called the transition relation.

An NFA can be represented by a state diagram, as with DFAs, and also with a transition table. Note that in the transition table, cells can contain sets of states. Cells with ∅ indicate that there is no transition from the current state with the given input. If for a current state and input, there are multiple possible next states, it means that we can choose between various options, so it is non-deterministic. If M is an NFA, we define the language accepted by M as: L(M) = {w | w is a string accepted by M} = {w | (s, w) ∈ F}. A string w is accepted by M if M goes from its initial state to a final state after processing w (w is totally consumed). To determine if a string belongs to L(M), we should follow the transition diagram for M. We must find a path that ends in a final state when the entire string has been consumed. During the traversal, we should choose a non-deterministic transition from one state to another when there is more than one option for the same symbol. To say that a string is not in L(M), we must exhaust all possible paths in the state diagram for that string. NFAs do not do anything that DFAs cannot do. If a state is reached where no transition is specified, since the automaton is not completely specified, then we reject.

2.6 Equivalence of NFA and DFA

An automaton M is equivalent to another M’ if L(M) = L(M’). For every NFA accepting a language, there is always a DFA that accepts the same language.

2.7 Step to Convert an NFA to a DFA

It starts in the initial state, and those states are incorporated that are subsets of Q and that arise as a result of a transition from a previously added state (for each new state, we add its transitions). We stop when no more new states appear. An NFA with n states can become, at best, a DFA with 2ⁿ states, but in practice, the number of states increases slightly. Which are the final states of the new DFA? Any subset A contained in Q such that A ∩ F ≠ ∅. A is a new final state of the DFA (any subset of states in the DFA that contains a final state of the NFA is a final state).

2.8 NFA with Epsilon Transitions (NFA-ε)

We can extend the definition of an NFA to include transitions from state to state that do not depend on any input symbol -> epsilon transitions. They are called so because they do not consume any input symbol. In a transition table, it becomes like a symbol. To systematize the process, we can calculate the set of reachable states in an NFA with epsilon transitions. For every state q ∈ Q, we define the epsilon-closure of q as: ε-closure(q) = {p | p is reachable from q without consuming any input}.