Natural Language Processing (NLP) is the process of getting a computer to understand natural language. The data can come in the form of text documents, images, audio, or video. NLP also refers to the field of computer science and artificial intelligence concerned with extracting linguistic information from such data. Much of an NLP pipeline can be automated using modern software libraries, modules, and packages.
There are various applications of NLP, such as sentiment analysis, chatbots, speech recognition, machine translation, spell checking, information extraction, keyword search, and advertisement matching. Real-world examples are Google Assistant and Google Translate.
NLTK is an open-source toolkit for natural language processing. It is one of the most powerful NLP libraries, containing packages that help machines understand human language and respond appropriately. Using NLTK, we can perform operations such as data cleaning, visualization, and vectorization that help us classify text.
Here, I am using Google Colab, which is a free development tool. Using Google Colab, a programmer can write and execute Python code, create/upload/share notebooks, and import/save notebooks from/to Google Drive.
We can perform various tasks using NLTK, as discussed in the detailed example below.
A corpus is the training data needed for the task; here we read it from the file chatbot.txt.
import numpy as np
import nltk

f = open("chatbot.txt", 'r', errors='ignore')
raw_doc = f.read()
raw_doc
Convert all the input data to either uppercase or lowercase. This avoids misinterpreting the same word as two different words just because it is spelled with different cases.
raw_doc = raw_doc.lower()  # converts text to lowercase
Tokenization is the process of splitting text into an individual collection of elements called tokens. It also helps in understanding the importance of each word with respect to the sentence. These tokens can be words, numbers, or punctuation marks.
Sentence Tokenization: the process of tokenizing a text into sentences. To do this, we use the method sent_tokenize, as shown in the code below.
Word Tokenization: the process of tokenizing sentences or text into words and punctuation. To do so, we use the method word_tokenize, as shown in the code below. A quick toy example first illustrates the difference.
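As a small illustrative sketch (the sample sentence here is made up for demonstration and is not part of the chatbot corpus), note how sentence tokens and word tokens differ, and how punctuation marks become tokens of their own:

import nltk
nltk.download('punkt')  # the Punkt tokenizer models

sample = "Hello there! NLP is fun."
print(nltk.sent_tokenize(sample))  # ['Hello there!', 'NLP is fun.']
print(nltk.word_tokenize(sample))  # ['Hello', 'there', '!', 'NLP', 'is', 'fun', '.']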
nltk.download('punkt')    # using the Punkt tokenizer
nltk.download('wordnet')  # using the WordNet dictionary
from nltk.tokenize import sent_tokenize, word_tokenize

sent_tokens = nltk.sent_tokenize(raw_doc)  # converts doc to list of sentences
word_tokens = nltk.word_tokenize(raw_doc)  # converts doc to list of words
Next, we clean the data (the word tokens) using regular expressions, stripping out punctuation.
import re

clean_data = []
for words in word_tokens:
    result = re.sub(r"[^\w\s]", "", words)  # remove anything that is not a word character or whitespace
    if result != "":
        clean_data.append(result)
Words that carry very little meaning on their own are considered stop words, and they are usually removed. For example, words such as “a”, “the”, “an”, and “is” are stop words.
Example:
nltk.download('stopwords')
from nltk.corpus import stopwords

clean_data_1 = []
for words in clean_data:
    if words not in stopwords.words('english'):
        clean_data_1.append(words)
clean_data_1
Stemming is the process of reducing words to a common root form by chopping off affixes. Stemming acts on words without knowing the context; therefore, it’s faster but doesn’t always yield a valid word or the desired result.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_data = [stemmer.stem(word) for word in clean_data_1]
stemmed_data
Lemmatization is the process of grouping together the different inflected forms of a word under its base form, called the lemma. The output of lemmatization is a proper word. It uses a lexical database called WordNet to derive the root form, so it is more time-consuming than stemming but more likely to yield accurate results.
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
lemmer = WordNetLemmatizer()

lemm_data = []
for word in clean_data_1:
    lemm_data.append(lemmer.lemmatize(word))
lemm_data
Whether to use stemming or lemmatization will depend on the situation: if speed is required, it’s better to resort to stemming, but if accuracy is required it’s best to use lemmatization. The snippet below illustrates the difference.
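As a rough sketch of this trade-off (the sample words here are chosen purely for illustration), we can run both tools on the same inputs:

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('wordnet')  # WordNet is needed by the lemmatizer

stemmer = PorterStemmer()
lemmer = WordNetLemmatizer()

for word in ["studies", "better", "running"]:
    print(word, "->", stemmer.stem(word), "vs", lemmer.lemmatize(word))
# studies -> studi vs study    (the stemmer produces a non-word)
# better  -> better vs better
# running -> run vs running    (the lemmatizer defaults to treating words as nouns)

print(lemmer.lemmatize("running", pos="v"))  # 'run'  -- given the right part of speech
print(lemmer.lemmatize("better", pos="a"))   # 'good' -- given the right part of speech

Note that the lemmatizer only maps a word to its lemma when it knows the part of speech, which is one reason POS tagging (covered next) pairs naturally with lemmatization.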
In this process, known as part-of-speech (POS) tagging, each word in the text is assigned a label corresponding to its part of speech, such as noun, verb, adjective, or adverb. The tagger takes features like the previous word, the next word, and capitalization of the first letter into consideration when assigning a POS tag to a word.
nltk.download('averaged_perceptron_tagger')  # the default NLTK POS tagger
from nltk import pos_tag

pos_data = pos_tag(lemm_data)
pos_data
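If a tag in the output is unfamiliar, NLTK ships a lookup helper for the Penn Treebank tag set (this requires downloading the 'tagsets' resource, which is not part of the original walkthrough):

nltk.download('tagsets')
nltk.help.upenn_tagset('NN')  # prints the definition and examples for the NN (singular noun) tag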
Chunking basically means picking up individual pieces of information and grouping them into bigger pieces (phrases). It is a process that takes POS-tagged input and produces chunks of phrases as output. Similar to POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP) and Verb Phrase (VP).
chunkGram = "NP: {<DT>?<JJ>*<NN>}"  # NP = optional determiner + any number of adjectives + a noun
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(pos_data)
print(chunked)
### CREATE VIRTUAL DISPLAY ###
!apt-get install -y xvfb  # install X Virtual Frame Buffer
import os
os.system('Xvfb :1 -screen 0 1600x1200x16 &')  # create virtual display of size 1600x1200 with 16-bit color (can be changed to 24 or 8)
os.environ['DISPLAY'] = ':1.0'  # tell X clients to use our virtual display :1.0
%matplotlib inline

### INSTALL GHOSTSCRIPT (required to display NLTK trees) ###
!apt install ghostscript python3-tk

from nltk.tree import Tree
from IPython.display import display

tree = Tree.fromstring(str(chunked))
display(tree)
(The resulting tree is too big to display in full in this article; the image above is just a screenshot of part of it.)
Displaying the subtrees:
for subtree in tree.subtrees():
    display(subtree)
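As a side note (not part of the original Colab setup), if you are running locally and just want a quick look at the structure, NLTK’s ASCII rendering avoids the Xvfb/Ghostscript machinery entirely:

tree.pretty_print()  # prints an ASCII-art drawing of the tree to stdout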
In this article, we have covered the essential steps of NLP: importing text for analysis, preprocessing it, and analyzing it step by step. These concepts can be applied to tasks such as sentiment analysis, building a chatbot, and much more.