Finite Automata, Regular Expressions, and Lexical Analysis

Finite Automata and Regular Expressions

A finite state machine (FSM) or finite state automaton is a mathematical model of a system. It takes a string consisting of symbols from an alphabet and determines if the string belongs to the language that the automaton recognizes.

Formally, a finite state machine can be described as a quintuple (S, Σ, T, s, A), where:

  • S: is a set of states
  • Σ: is an alphabet
  • T: is the transition function
  • s: is the initial state
  • A: is the set of final or accepting states

Forms of Representation of Finite Automata

Besides being able to represent a finite state machine through its formal definition, it can also be represented by other notations that are more convenient and sometimes more intuitive. The most common are transition tables, transition diagrams, and regular expressions.

Lexical Analyzer

The main function of the lexical analyzer is to take a sequence of characters or symbols in the alphabet of a language and place them into categories known as lexical units. These units are used by the parser to determine if what is written in the source program is grammatically correct or not. Some of the lexical units are not used by the parser but are filtered out. Examples include comments, which document the program but do not affect grammar, and blanks, which serve to provide clarity to the code.

In the terminology used in the construction of a parser, the following terms are defined:

  • Pattern: Represents the rule for a sequence of characters to be considered a certain lexical unit.
  • Lexeme: Is the actual value of a character set that satisfies a pattern.
  • Token: Is the value associated with a lexical unit and is represented as an integer.
  • Alphabet: The set of valid characters for a language.
  • Lexical Units: The categories that classify the valid strings in a language.

The Role of the Lexical Analyzer

Although the lexical analyzer is the first stage of the compilation process, it is not the component that initiates the process. Instead, it processes input and sends its results to the parser. The parser actually starts the process by requesting a token from the lexical analyzer to begin its work. During these stages, close communication is maintained with the symbol table, which contains information about the entities used in the programs.

Description of Patterns

There are three ways to describe patterns: informal description, which uses natural language to describe the behavior of the lexical rule; regular expressions; and finite automata as transition diagrams.

Transition Diagrams

A transition diagram is the graphical representation of a set of states. These states may be initial, intermediate, or final. They may also have one or more outputs to other states.

The initial states are represented by a double circle and a zero, intermediate states by a circle and a number, and final states by a double circle and the last number. States are related to each other with an arc that has a name, which is the character or set of characters that cause the transition from one state to another.

Features of Transition Diagrams

  • Each state must be reached with the same set of characters every time there is a transition.
  • Each pattern has a character selector that allows a unique way to recognize the pattern to be applied.
  • To reach an accepting state, there must be a transition on the character that breaks the pattern of the lexical unit.

Informal Descriptions of Lexical Units

  • Identifier: A letter (uppercase or lowercase) optionally followed by other letters, digits, or underscores.
  • Comments: Begin with “/*” and end with “*/”. They can contain any symbol except for the end-of-file marker.
  • EOF: Indicates the end of the file.
  • Error: Any symbol or sequence that does not comply with the previous patterns.