Natural Language Processing (NLP) is the process of getting a computer to understand natural language. The data can come in the form of text documents, images, audio, or video. NLP also refers to the field of computer science and artificial intelligence concerned with extracting linguistic information from such data. Much of an NLP pipeline can be automated using modern software libraries, modules, and packages.
There are various applications of NLP, such as sentiment analysis, chatbots, speech recognition, machine translation, spell checking, information extraction, keyword search, and advertisement matching. Real-world examples are Google Assistant and Google Translate.
NLTK is an open-source toolkit for natural language processing. It is one of the most powerful NLP libraries, containing packages that help machines understand human language and respond appropriately. Using NLTK, we can perform operations such as data cleaning, visualization, and vectorization that help us classify text.
Here, I am using Google Colab, which is a free development tool. Using Google Colab, a programmer can write and execute Python code, create/upload/share notebooks, and import/save notebooks from/to Google Drive.
We can perform various tasks using NLTK, as discussed in the detailed example below.
A corpus is the training data needed for the task; here we read it from the file chatbot.txt.
import numpy as np
import nltk

f = open("chatbot.txt", 'r', errors='ignore')
raw_doc = f.read()
raw_doc
Convert all the input data to either uppercase or lowercase. This avoids misinterpreting the same word as two different words just because it is spelled with different cases.
raw_doc = raw_doc.lower()  # converts text to lowercase
Tokenization is the process of splitting text into an individual collection of elements called tokens. It also helps in understanding the importance of each word with respect to the sentence. These tokens can be words, numbers, or punctuation marks.
Sentence Tokenization: the process of tokenizing a text into sentences. To do this, we use the method sent_tokenize, as shown in the code below.
Word Tokenization: the process of tokenizing sentences or text into words and punctuation. To do so, we use the method word_tokenize, as shown in the code below. A quick toy example first illustrates the difference.
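As a small illustrative sketch (the sample sentence here is made up for demonstration and is not part of the chatbot corpus), note how sentence tokens and word tokens differ, and how punctuation marks become tokens of their own:

import nltk
nltk.download('punkt')  # the Punkt tokenizer models

sample = "Hello there! NLP is fun."
print(nltk.sent_tokenize(sample))  # ['Hello there!', 'NLP is fun.']
print(nltk.word_tokenize(sample))  # ['Hello', 'there', '!', 'NLP', 'is', 'fun', '.']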
nltk.download('punkt')    # using the Punkt tokenizer
nltk.download('wordnet')  # using the WordNet dictionary
from nltk.tokenize import sent_tokenize, word_tokenize

sent_tokens = nltk.sent_tokenize(raw_doc)  # converts doc to list of sentences
word_tokens = nltk.word_tokenize(raw_doc)  # converts doc to list of words
Next, we clean the data (the word tokens) using regular expressions, stripping out punctuation.
import re

clean_data = []
for words in word_tokens:
    result = re.sub(r"[^\w\s]", "", words)  # remove anything that is not a word character or whitespace
    if result != "":
        clean_data.append(result)
Words that carry very little meaning on their own are considered stop words, and they are usually removed. For example, words such as “a”, “the”, “an”, and “is” are stop words.
Example:
nltk.download('stopwords')
from nltk.corpus import stopwords

clean_data_1 = []
for words in clean_data:
    if words not in stopwords.words('english'):
        clean_data_1.append(words)
clean_data_1
Stemming is the process of reducing words to a common root form by chopping off affixes. Stemming acts on words without knowing the context; therefore, it’s faster but doesn’t always yield a valid word or the desired result.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_data = [stemmer.stem(word) for word in clean_data_1]
stemmed_data
Lemmatization is the process of grouping together the different inflected forms of a word under its base form, called the lemma. The output of lemmatization is a proper word. It uses a lexical database called WordNet to derive the root form, so it is more time-consuming than stemming but more likely to yield accurate results.
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
lemmer = WordNetLemmatizer()

lemm_data = []
for word in clean_data_1:
    lemm_data.append(lemmer.lemmatize(word))
lemm_data
Whether to use stemming or lemmatization will depend on the situation: if speed is required, it’s better to resort to stemming, but if accuracy is required it’s best to use lemmatization. The snippet below illustrates the difference.
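As a rough sketch of this trade-off (the sample words here are chosen purely for illustration), we can run both tools on the same inputs:

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('wordnet')  # WordNet is needed by the lemmatizer

stemmer = PorterStemmer()
lemmer = WordNetLemmatizer()

for word in ["studies", "better", "running"]:
    print(word, "->", stemmer.stem(word), "vs", lemmer.lemmatize(word))
# studies -> studi vs study    (the stemmer produces a non-word)
# better  -> better vs better
# running -> run vs running    (the lemmatizer defaults to treating words as nouns)

print(lemmer.lemmatize("running", pos="v"))  # 'run'  -- given the right part of speech
print(lemmer.lemmatize("better", pos="a"))   # 'good' -- given the right part of speech

Note that the lemmatizer only maps a word to its lemma when it knows the part of speech, which is one reason POS tagging (covered next) pairs naturally with lemmatization.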
In this process, known as part-of-speech (POS) tagging, each word in the text is assigned a label corresponding to its part of speech, such as noun, verb, adjective, or adverb. The tagger takes features like the previous word, the next word, and capitalization of the first letter into consideration when assigning a POS tag to a word.
nltk.download('averaged_perceptron_tagger')  # the default NLTK POS tagger
from nltk import pos_tag

pos_data = pos_tag(lemm_data)
pos_data
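If a tag in the output is unfamiliar, NLTK ships a lookup helper for the Penn Treebank tag set (this requires downloading the 'tagsets' resource, which is not part of the original walkthrough):

nltk.download('tagsets')
nltk.help.upenn_tagset('NN')  # prints the definition and examples for the NN (singular noun) tag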
Chunking basically means picking up individual pieces of information and grouping them into bigger pieces (phrases). It is a process that takes POS-tagged input and produces chunks of phrases as output. Similar to POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP) and Verb Phrase (VP).
chunkGram = "NP: {<DT>?<JJ>*<NN>}"  # NP = optional determiner + any number of adjectives + a noun
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(pos_data)
print(chunked)
### CREATE VIRTUAL DISPLAY ###
!apt-get install -y xvfb  # install X Virtual Frame Buffer
import os
os.system('Xvfb :1 -screen 0 1600x1200x16 &')  # create virtual display of size 1600x1200 with 16-bit color (can be changed to 24 or 8)
os.environ['DISPLAY'] = ':1.0'  # tell X clients to use our virtual display :1.0
%matplotlib inline

### INSTALL GHOSTSCRIPT (required to display NLTK trees) ###
!apt install ghostscript python3-tk

from nltk.tree import Tree
from IPython.display import display

tree = Tree.fromstring(str(chunked))
display(tree)
(The resulting tree is too big to display in full in this article; the image above is just a screenshot of part of it.)
Displaying the subtrees:
for subtree in tree.subtrees():
    display(subtree)
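As a side note (not part of the original Colab setup), if you are running locally and just want a quick look at the structure, NLTK’s ASCII rendering avoids the Xvfb/Ghostscript machinery entirely:

tree.pretty_print()  # prints an ASCII-art drawing of the tree to stdout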
In this article, we have covered the essential steps of NLP: importing text for analysis, preprocessing it, and analyzing it step by step. These concepts can be applied to tasks such as sentiment analysis, building a chatbot, and much more.