d

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore.

15 St Margarets, NY 10033
(+381) 11 123 4567
ouroffice@aware.com

 

KMF

Review Analysing Application using NLP — Part 1 | by Dulan Wirajith

In my previous article, I talked about how machines communicate with humans. In there, I have mentioned the NLP — Natural Language Processing. NLP did a significant role in communication with computers and humans. Today I am going to implement a Simple Review Annalysing Application using NLP and python. I used PyCharm IDE for coding.

Review Analysing Application is a Text Classification application. This article has four sections. they are

  1. Import dataset
    2. Cleaning dataset
    3. Create bag_of_words Model
    4. Create a Classification Model
    5. Test Accuracy

CSV means Comma-Separated Values, and TSV means Tab-Separated Values. Simply if the dataset is in .csv format, the values in the dataset are separated by comma(“,”). If the dataset is in .tsv format, the values in the dataset are separated by a tab. Usually, for NLP, we are getting .tsv form. Because in NLP, we are dealing with sentences. In sentences, there have most commas. See the example text,

Loved it...friendly servers, great food, wonderful and imaginative menu.

In this example, we have more commas. So if we used .csv, the separated values might be incorrect. But in here no tabs. Therefore we usually get the dataset format .tsv for the NLP.

#Importing the dataset

import pandas as pd# Importing the dataset
dataset = pd.read_csv('DataSet/review_dataset.tsv', delimiter ='t', quoting = 3)

To import the dataset, I used pandas library’s read_csv function. I set the delimiter as tab because the dataset is in .tsv format. And also, I set quoting to 3 because I want to ignore double quotes also.

fig 1 — After importing the dataset

Here I have 1000 reviews. That dataset has two columns. They are Review and Liked or Not. The review section has the String text given by someone. And Liked or Not section has a Boolean value about his satisfaction.

1. How Chatbots and Email Marketing Integration Can Help Your Business

2. Why Chatbots could be the next big thing for SMEs

3. My Journey into Conversation Design

4. Practical NLP for language learning

2. Clean the dataset

# get one review
review = dataset['Review'][0]
fig 2 — After getting one review

For clean the reviews, I use a library called re.

# Remove all things except letters

For the review analysing I want only words. I don’t want any punctuation marks or any numbers. Therefore I remove all unwanted stuff. And I store that sentence again in the review variable. For that, I used sub function in re library.

# remove all things except letters
import re
review = re.sub('[^a-zA-Z]', ' ', review)
fig 3 — After Remove all things except letters

Now you can see there haven’t any punctuations or numbers.

# Set all letters to lowercase

and then I change all letters to lowercase. because I want to number of words in the bag_of_words

# set all letters to lowercase
review = review.lower()
fig 4 — After Set all letters to lowercase

# Convert the string to a list

I convert the string value to a list because another step I’m going to remove non-significant words. For that, I use a for a loop. therefore I convert string to a list

# convert the string to a list
review_words = review.split()
fig 5 — After Convert the line to a list

# Remove non-significant words

In the review, there have words to unwanted for the review analysing purpose. Such as I, am, are, this, that… So I also remove those words. For that, I used the nltk — Natural Language Toolkit package. In nltk have a list named ‘stopwords’. Using that word list, I can remove non-significant words.

# remove non significant words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
review = [word for word in review_words if not word in set(stopwords.words('english'))]
fig 6 — After Remove non-significant words

# Stemming => taking the root of the words

we are ding stemming because we want to remove unwanted words from the bag_of_words. Directly stemming means taking the origin of the word. As an example, I get a sentence,

Loved it...friendly servers, great food, wonderful and imaginative menu.

In this sentence have a word named ‘loved’. But this is not the root of that word. The origin of that word is ‘love’. ‘love/loved/loving’ these all words are given the same sense about the review positive or negative. Therefore we don’t want those all words. We can use only the root word for that. If we use the origin word only, our bag_of_words also will be small. We use stemming to reduce the number of words in the bag_of_words.

Simply we,

loved/loves/loving/love   =>   love

for that, I used PorterStemmer in nltk.stem.porter library.

# stemming => taking the root of the words
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()def stemming(word):
return stemmer.stem(word)
review_after_stemming = [stemming(word) for word in review]
fig 7 — After Stemming

# Convert list into a string

Now the cleaning process is over. Therefore I again convert that list to a string.

# convert list into string
review = ' '.join(review_after_stemming)
fig 8 — After Convert list into a string

# Doing this method to all reviews

Now I’m doing that cleaning process to all reviews.

# doing this method to all reviews
corpus = []
for i in range(0,1000):
review = dataset['Review'][i]
review = re.sub('[^a-zA-Z]', ' ', review)
review = review.lower()
review_words = review.split()
review = [word for word in review_words if not word in set(stopwords.words('english'))]
stemmer = PorterStemmer()
review_after_stemming = [stemming(word) for word in review]
review = ' '.join(review_after_stemming)
corpus.append(review)
fig 9 — After Cleaning all reviews

Now 1 – Import dataset and 2 – Cleaning dataset is done. We will see other sections in Review Analysing Application using NLP — Part 2.

Credit: Source link

Previous Next
Close
Test Caption
Test Description goes like this