What is Parsing?

A brief introduction

Published: Aug 5, 2019 Last updated: Apr 23, 2022

Parsing is the process of converting formatted text into a data structure. A data structure type can be any suitable representation of the information engraved in the source text.

Tree type is a common and standard choice for XML parsing, HTML parsing, JSON parsing, and any programming language parsing. The output tree is called Parse Tree or Abstract Syntax Tree. In HTML context, it is called Document Object Model (DOM).
A CSV file parsing can result in a List of List of values or a List of Record objects.
Graph Type is a choice for natural language parsing.

A piece of program that does parsing is called Parser.

How it works

Parser analyses the source text against the defined format*.

If source text does not match against format, errors are thrown or returned.
If matches, then “data structure” is returned.

parsing defination

*format is coded inside Parser.

Format is the DNA of a parser.

Small Case Study

Consider an example of Date parsing from a string (source) in format DD-MM-YYYY to Date object:

class Date {
  int day;
  int month;
  int year;
}

parsing task as black box

Implementation Note

Regular Expression¹ (regex for short) would be my choice to parse Dates. Regex helps in matching and extracting parts of strings (if matched). It’s an example of “Proxy Parsing,” which involves delegating parsing to another abstract system.

Note: This will be a small-scale illustration of parsing with “Regex.” It might be a fair approach for one to two lines of input text. You may have to write a parser by hand (which is the most difficult) or use Parser Generator tools (moderately complex) for considerable input.

Code

The parsing and extracting date element is as:

String date = “20-05-2012”;
// 1. defining a regular expression
Pattern dateMatcher = Pattern.compile(“(\d{2})-(\d{2})-(\d{4})”);
// 2. matching a regular expression
Matcher matcher = dateMatcher.matcher(date);
// 3. if pattern matches
if (matcher.find()) {
  // extract day
  int day = Integer.parseInt(matcher.group(1));
  int month = Integer.parseInt(matcher.group(2));
  int year = Integer.parseInt(matcher.group(3));
  // 4. validate date : skipped code
  return new Date(day, month, year);
} else {
 // throw Error
 throw new DateParseException(date +” is not valid”)
}

The code is written in Java. How does this code work:

Date formatted as DD-MM-YYYY can be matched using regex:

(d{2})-(d{2})-(d{4})

where d matches any digit between 0 to 9,
{n} means how many times the last character type is expected
() is called capturing group and enclosed part of string that would be extracted.
Each group is numbered, the first group is assigned “1”, the second one is given number “2”, and so on

The given date as string is matched against the defined regex.
If the match is successful, then Day, month, and year are extracted from the source string using Group Constructprovided by Regular Expression API. It is a standard construct in any Regular Expression API in any language.
The extracted date parts can be validated; related code is skipped for you to exercise.

Regular Expression based parsing is limited to format, which regular expressions can define. A programming language or format like XML parsing is complex.

You can refer to the book Crafting Interpreters to get an idea about a full-scaled parsing.

You can also read about Lezer : Code Mirror’s Parsing System

Phases of Parsing

Parsing can be scoped as a composition of Scanning and Syntactic Analysis or just Syntactic Analysis.

Scanning

Scanning is a process of converting a stream of characters into tokens. scanning process illustration A token represents a “concept” introduced by format and it token can be considered as a label assigned to one or more characters. From a processing point of view: A token is an object, can contain type, lexeme³, location information, and more.

In Java language: if, while, int are examples of tokens. In date parsing, tokens are defined by Regex; \d{2} (day, month), — (separator), \d{4} (year) are tokens. Note: Day and month are the same token type. A token is defined by “the pattern of characters” not by locations of characters.

Scanning is also called Tokenization or Lexical Analysis. A part of the program which does scanning is called Lexer or Scanner or Tokenizer.

Syntactic Analysis

Syntactic Analysis examines the structure formed as “keeping tokens as they appeared.” It also validates and extracts engraved data to create the preferred Data Structure. syntactic analysis illustration In date parsing example: “day is followed by month and year.” The order is checked by Regex Engine, also extraction is done based on order and matches.

Errors reported by this phase are called Syntactic Errors. Going back to Date parsing example, 99-JAN-2021 is an invalid Date; however, 99–99–9999 is a valid Date because rule (\d{2})-(\d{2})-(\d{4}) says so. It may seem absurd, but parsers generally validate Syntactic Correctness.

Closing Notes

The scale of parsing determines the inclusion or exclusion of Scanning as part of it:

It is beneficial for big-scale parsing like language (natural or programming) parsing to keep Scanning, and Syntactic Analysis separated. In this case, Syntactic Analysis is called parsing.
For small-scale parsing like Date, it may not be beneficial to distinct Lexer and Syntactic Analysis. In this case, it is called Scannerless Parsing.

Many times, parsing tasks are delegated to parser code generators like Antlr or Lex . These tools require a set of rules or grammar and generate parser code.

parser and parser generator defination

The generated parsers produce a tree upon parsing, which may not be the desired data structure; however, these libraries provide sufficient APIs to convert the constructed tree to the desired data structure.

There is more in Parser world; the type of parsing styles: top-down or bottom-up, Leftmost or Rightmost derivation. These design styles limit parsing capabilities of the parser. You may visit the following link to get a synoptic view of these design choices and their comparison: A Guide To Parsing: Algorithms And Terminology

Footnotes

Regular Expression is an algebraic expression² representing a group of strings.
An algebraic expression utilizes symbols and operators.
A Lexeme is a raw string version of Token.