Introduction to Lexical Analysis

Lexical analysis, also known as scanning or tokenization, is the first phase of the compilation process. Its primary task is to convert the input source code into a sequence of tokens for further processing by the compiler.

Working Principle:

1. Scanning: The lexical analyzer scans the source code character by character.

2. Tokenization: It groups characters into tokens based on predefined rules (e.g., regular expressions).

3. Ignoring Whitespace and Comments: Whitespace characters (spaces, tabs, newlines) are typically ignored. Comments may also be discarded.

4. Error Handling: It detects and reports lexical errors such as invalid characters or malformed tokens.

5. Token Output: It produces a stream of tokens to be used by the parser for further analysis, as shown in the sketch below.
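
A minimal sketch of these steps in C, assuming a toy language with identifiers, integer literals, and a few single-character operators (the scan function and token names here are hypothetical; production scanners are usually generated from regular-expression specifications by tools such as flex):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy scanner: reads the input character by character,
   groups characters into tokens, skips whitespace, and prints the
   resulting token stream. */
void scan(const char *src) {
    const char *p = src;
    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {        /* step 3: ignore whitespace */
            p++;
        } else if (isalpha((unsigned char)*p)) { /* identifier or keyword */
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            printf("ID(%.*s) ", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) { /* integer literal */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("NUM(%.*s) ", (int)(p - start), start);
        } else if (strchr("+-*/=;", *p)) {       /* operator / punctuation */
            printf("OP(%c) ", *p++);
        } else {                                 /* step 4: lexical error */
            fprintf(stderr, "error: invalid character '%c'\n", *p++);
        }
    }
    printf("\n");
}

int main(void) {
    scan("a = b + c;");  /* prints: ID(a) OP(=) ID(b) OP(+) ID(c) OP(;) */
    return 0;
}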

Suppose we pass the statement a = b + c; through a lexical analyzer. It will generate a token sequence like this:

id = id + id ;

where each id refers to its variable's entry in the symbol table, which records all of its details (a sketch of such a lookup follows the token list below). For example, consider the program:

int main()
{
  // 2 variables
  int a, b;
  a = 10;
  return 0;
}


All the valid tokens are:
'int'  'main'  '('  ')'  '{'  'int'  'a'  ','  'b'  ';'
'a'  '='  '10'  ';'  'return'  '0'  ';'  '}'
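
Here is a minimal sketch in C of how each id can carry an index into the symbol table (the intern function and fixed-size table are hypothetical; real compilers typically use hash tables):

#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size symbol table: each identifier token
   stores an index into this table instead of its spelling. */
#define MAX_SYMBOLS 64

static char table[MAX_SYMBOLS][32];
static int  count = 0;

/* Return the index of name, inserting it on first sight. */
int intern(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i], name) == 0)
            return i;
    strncpy(table[count], name, sizeof table[count] - 1);
    return count++;
}

int main(void) {
    /* The statement  a = b + c;  becomes  id1 = id2 + id3 ;  */
    int a = intern("a");
    int b = intern("b");
    int c = intern("c");
    printf("id%d = id%d + id%d ;\n", a + 1, b + 1, c + 1);
    return 0;
}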

What is a Token?
A lexical token is a sequence of characters that can be treated as a single unit in the grammar of a programming language.
 
Types of tokens:
1. Type tokens (id, number, real, . . .)
2. Punctuation tokens (IF, void, return, . . .)
3. Alphabetic tokens (keywords)

Examples of Tokens:
1. Keywords: if, else, while, int, return.
2. Identifiers: variableName, function_name, ClassName.
3. Literals: 123, 3.14, "Hello, World!".
4. Operators: +, -, *, /, =, ==.
5. Punctuation: (), {}, ;, ,.
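
In a hand-written lexer, the categories above typically become an enumeration; a minimal sketch (the TOK_* names are hypothetical):

#include <stdio.h>

/* Hypothetical token-type enumeration mirroring the examples above. */
typedef enum {
    TOK_KEYWORD,     /* if, else, while, int, return           */
    TOK_IDENTIFIER,  /* variableName, function_name, ClassName */
    TOK_LITERAL,     /* 123, 3.14, "Hello, World!"             */
    TOK_OPERATOR,    /* +  -  *  /  =  ==                      */
    TOK_PUNCTUATION  /* (  )  {  }  ;  ,                       */
} TokenType;

int main(void) {
    printf("'int' falls in category %d (TOK_KEYWORD)\n", TOK_KEYWORD);
    return 0;
}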

Examples of Non-Tokens:
Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.

Example: Count number of tokens:
int main()
{
  int a = 10, b = 20;
  printf("sum is:%d",a+b);
  return 0;
}
Answer: Total number of tokens: 27 (9 keywords and identifiers: int, main, int, a, b, printf, a, b, return; 4 literals: 10, 20, "sum is:%d", 0; 3 operators: =, =, +; and 11 punctuation tokens: the parentheses, braces, commas, and semicolons).

Lexeme: A lexeme is the sequence of input characters matched by a pattern to form a single token, e.g., "float", "abs_zero_Kelvin", "=", "-", "273", ";".
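
In code, a token is often represented as a pair of a token class and its lexeme; here is a minimal sketch using the lexemes above (the struct layout and class names are hypothetical):

#include <stdio.h>

/* Hypothetical token representation: the token class is the abstract
   unit ("keyword", "id"), the lexeme is the exact matched spelling. */
struct Token {
    const char *type;    /* token class */
    const char *lexeme;  /* characters matched in the source */
};

int main(void) {
    struct Token tokens[] = {
        { "keyword", "float"           },
        { "id",      "abs_zero_Kelvin" },
        { "op",      "="               },
        { "op",      "-"               },
        { "number",  "273"             },
        { "punct",   ";"               },
    };
    for (size_t i = 0; i < sizeof tokens / sizeof tokens[0]; i++)
        printf("<%s, \"%s\">\n", tokens[i].type, tokens[i].lexeme);
    return 0;
}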

Advantages:

1. Efficiency: Tokenization simplifies subsequent phases of compilation by reducing the complexity of code analysis.

2. Modularity: Separating lexical analysis from other phases allows for easier maintenance and modification of the compiler.

3. Error Detection: Lexical analysis catches errors early in the compilation process, improving debugging.

4. Flexibility: Allows for easy integration of features like syntax highlighting and code formatting in text editors and IDEs.

Disadvantages:

1. Complexity: Implementing a robust lexical analyzer can be complex, especially for languages with intricate syntax rules.

2. Performance Overhead: Tokenization adds an extra step to the compilation process, which may slightly slow down overall compilation time.

3. Context Sensitivity: Some languages have tokens whose meaning depends on context, making tokenization more challenging.

4. Regular Expression Limitations: Tokenization rules defined using regular expressions may not cover all edge cases or may lead to ambiguities.

