Syntactic analysis, also known as syntax analysis or parsing, is the third phase of Natural Language Processing (NLP). As the name suggests, it analyzes the syntax of the text. This step aims to extract the precise, dictionary-like meaning of the text by comparing it against the rules of a formal grammar to check that it is meaningful. The phrase “heated ice cream,” for example, would be rejected by a semantic analyzer.
A parser is used to carry out the parsing process. It is a software component that takes input data (text), checks it for valid syntax against a formal grammar, and converts it into a structural representation. That representation is a data structure such as a parse tree, an abstract syntax tree, or another hierarchical structure.
The primary functions of a parser include:
To report any errors in syntax.
To recover from a frequently recurring error so that the rest of the program may be processed.
To make a parse tree.
To make a symbol table.
To create intermediate representations (IR).
Parsing is divided into two categories according to how the derivation is built:
Top-down parsing: The parser starts from the start symbol and builds the parse tree downward, trying to transform the start symbol into the input. The most common form of top-down parsing, recursive descent, uses recursive procedures to process the input. The main drawback of recursive descent parsing is backtracking.
Bottom-up parsing: The parser begins with the input symbols and attempts to build the parse tree up to the start symbol.
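As a rough illustration of top-down parsing, here is a minimal recursive descent sketch with backtracking. The toy grammar, the tag names used as tokens, and the function names (`parse`, `expand`, `recognize`) are assumptions made for this example and are not part of any library.

```python
# Toy grammar (hypothetical): each non-terminal maps to a list of alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "VP": [["V", "NP"], ["V"]],
}


def parse(symbol, tokens, pos):
    """Expand `symbol` starting at tokens[pos].

    Yields every (tree, next_pos) pair that succeeds. Trying the
    alternatives one after another is the backtracking step.
    """
    if symbol not in GRAMMAR:  # terminal: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(production, tokens, pos, (symbol,))


def expand(production, tokens, pos, partial):
    """Match the symbols of one production in order, growing `partial`."""
    if not production:
        yield partial, pos
        return
    for subtree, next_pos in parse(production[0], tokens, pos):
        yield from expand(production[1:], tokens, next_pos, partial + (subtree,))


def recognize(tokens):
    """Return a parse tree covering all tokens, or None on a syntax error."""
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(recognize(["DT", "NN", "V", "DT", "NN"]))
print(recognize(["DT", "NN", "V"]))   # succeeds only after backtracking in VP
print(recognize(["NN", "V"]))         # no parse
```

The parser starts from the start symbol `S` and works toward the input. For the second input it first tries `VP -> V NP`, fails to find a noun phrase, and backtracks to `VP -> V`, which is the cost the text refers to.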
We need a sequence of production rules to obtain the input string. Such a sequence of production-rule applications is known as a derivation. During parsing, we must decide which non-terminal to replace and which production rule to replace it with.
There are two kinds of derivation, which differ in which non-terminal is replaced next by a production rule.
Left-most derivation: The sentential form is scanned and replaced from left to right, so the left-most non-terminal is expanded at each step. A sentential form derived this way is called a left-sentential form.
Right-most derivation: The sentential form is scanned and replaced from right to left, so the right-most non-terminal is expanded at each step. A sentential form derived this way is called a right-sentential form.
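The difference between the two derivations can be sketched in a few lines. The rule set below is a made-up toy grammar with exactly one production per non-terminal, and `derive` is a hypothetical helper, so the only choice left is which non-terminal to expand next.

```python
# Toy rules (hypothetical): one production per non-terminal.
RULES = {
    "S": ["NP", "VP"],
    "NP": ["DT", "NN"],
    "VP": ["V", "NP"],
}


def derive(start, leftmost=True):
    """Return the list of sentential forms from `start` to a terminal string."""
    form = [start]
    steps = [" ".join(form)]
    while True:
        # Positions of all non-terminals in the current sentential form
        positions = [i for i, sym in enumerate(form) if sym in RULES]
        if not positions:
            return steps
        # Left-most derivation expands the first one, right-most the last one
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(" ".join(form))


for step in derive("S", leftmost=True):
    print("left :", step)
for step in derive("S", leftmost=False):
    print("right:", step)
```

Both derivations end in the same string, `DT NN V DT NN`, but pass through different intermediate sentential forms: the left-most one reaches `DT NN VP` while the right-most one reaches `NP V NP`.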
A parse tree may be characterized as a graphical representation of a derivation. The root node of the parse tree is the start symbol of the derivation. In every parse tree the leaf nodes are terminals, while the interior nodes are non-terminals. A property of parse trees is that reading the leaves from left to right (an in-order traversal) yields the original input string.
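The leaf-reading property can be checked with a small sketch. The tree below is hand-built for illustration (it is one plausible analysis, not the output of any parser), and the nested-tuple encoding and the `leaves` helper are assumptions of this example.

```python
# Hand-built parse tree (hypothetical analysis): (label, child, child, ...).
# Interior nodes are non-terminals; plain strings are the terminal leaves.
TREE = (
    "S",
    ("NP", ("DT", "This"), ("NN", "tree")),
    ("VP",
     ("VBZ", "is"),
     ("VBG", "illustrating"),
     ("NP", ("DT", "the"), ("NN", "constituency"), ("NN", "relation"))),
)


def leaves(node):
    """Collect the terminal leaves of `node` from left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


print(" ".join(leaves(TREE)))
```

Joining the leaves in left-to-right order gives back the sentence the tree was built for.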
Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation, which is why it is also known as constituency grammar. It stands in contrast to dependency grammar.
Before presenting an example, we must first grasp the fundamentals of the constituency relation and constituency grammar.
All related frameworks view sentence structure in terms of the constituency relation.
The constituency relation derives from the subject-predicate division of Latin and Greek grammar.
The basic clause structure is understood in terms of the noun phrase (NP) and the verb phrase (VP).
We will use the sentence “This tree is illustrating the constituency relation” to see how syntactic analysis works with the help of code.
    import nltk
    from nltk import pos_tag, word_tokenize, RegexpParser

    # Requires NLTK's tokenizer and POS-tagger models (installed with
    # nltk.download(); the resource names depend on the NLTK version).

    # String to parse
    to_parse = "This tree is illustrating the constituency relation"

    # Tag each word in the sentence with its part of speech
    tagged_parts = pos_tag(word_tokenize(to_parse))

    # Chunk grammar that defines the phrases to extract
    grammar = r"""
        NP: {<DT>?<JJ>*<NN>}
        P: {<IN>}
        V: {<V.*>}
        PP: {<P> <NP>}
        VP: {<V> <NP|PP>*}
    """

    # Build a chunk parser from the grammar
    parser = RegexpParser(grammar)

    # Parse the tagged sentence and print the resulting tree
    output = parser.parse(tagged_parts)
    print("\nAfter Extracting the parts\n\n", output, "\n")
Note: NLTK (Natural Language Toolkit) is a Python library for NLP.
Output
Highlighted words are parts of a sentence we passed to construct a syntax tree.
Some of the syntactic categories of a natural language, several of which are used in the grammar in the code above, are as follows:
Sentence (S)
Noun Phrase (NP)
Determiner (DT)
Verb Phrase (VP)
Prepositional Phrase (PP)
Verb (V)
Noun (N)
To get a graphical representation of the tree, run one more line of code: output.draw()
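Context-free grammar (CFG) is a method for describing languages, and it is a superset of regular grammar: every regular language can be described by a CFG, but not the other way round. This relationship may be seen in the diagram below.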
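A standard textbook example of this gap is the language of n copies of "a" followed by n copies of "b". The sketch below is only an illustration: `matches_anbn` is a hypothetical recognizer written directly from the context-free rule S -> a S b | (empty), and the regular expression `a*b*` stands in for what a regular grammar can express.

```python
import re


def matches_anbn(s):
    """Recognize the language of the CFG  S -> 'a' S 'b' | ''."""
    if s == "":
        return True
    # Peel one 'a' from the front and one 'b' from the back, then recurse
    return len(s) >= 2 and s[0] == "a" and s[-1] == "b" and matches_anbn(s[1:-1])


# A regular pattern can require "some a's, then some b's",
# but it cannot require that the two counts are equal.
REGULAR = re.compile(r"a*b*")

for text in ["aaabbb", "aab", "ba"]:
    print(text, matches_anbn(text), REGULAR.fullmatch(text) is not None)
```

The string "aab" is accepted by the regular pattern but rejected by the context-free rule, because only the recursive rule can pair each "a" with a matching "b".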