NLP

In this project we use scikit-learn pipelines to compare the performance of two robust models for NLP: Multinomial Naive Bayes and Random Forest.

Data: Yelp Review Data Set from Kaggle.

** Data Head **

	business_id	date	review_id	stars	text	type	user_id	cool	useful
0	9yKzy9PApeiPPOUJEtnvkg	2011-01-26	fWKvX83p0-ka4JS3dc6E5A	5	My wife took me here on my birthday for breakf...	review	rLtl8ZkDX5vH5nAx9C3q5Q	2	5
1	ZRJwVLyzEJq1VAihDhYiow	2011-07-27	IjZ33sJrzXqU-0X6U8NwyA	5	I have no idea why some people give bad review...	review	0a2KyEL0d3Yb1V6aivbIuQ	0	0
2	6oRAC4uyJCsJl1X0WZpVSA	2012-06-14	IESLBzqUCLdSzSqm0eCSxQ	4	love the gyro plate. Rice is so good and I als...	review	0hT2KtfLiobPvh6cDC8JQg	0	1
3	_1QQZuf4zZOyFCvXc0o6Vg	2010-05-27	G-WvGaISbqqaMHlNnByodA	5	Rosie, Dakota, and I LOVE Chaparral Dog Park!!...	review	uZetl9T0NcROGOyFfughhg	1	2
4	6ozycU1RpktNG2-1BroVtw	2012-01-05	1uJFq2r5QfJG_6ExMRCaGw	5	General Manager Scott Petello is a good egg!!!...	review	vYmM4KTsC8ZfQBg-j5MWkw	0	0

** Crude Data Metrics **

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
type           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(6)
memory usage: 781.3+ KB

	stars	cool	useful	funny
count	10000.000000	10000.000000	10000.000000	10000.000000
mean	3.777500	0.876800	1.409300	0.701300
std	1.214636	2.067861	2.336647	1.907942
min	1.000000	0.000000	0.000000	0.000000
25%	3.000000	0.000000	0.000000	0.000000
50%	4.000000	0.000000	1.000000	0.000000
75%	5.000000	1.000000	2.000000	1.000000
max	5.000000	77.000000	76.000000	57.000000

** Data Exploration **

A FacetGrid from the seaborn library shows a grid of 5 histograms of text length, one per star rating.

A boxplot shows text length for each star category.

A countplot shows the number of reviews for each star rating.

** Mean values of the numerical columns, grouped by star rating **

	cool	useful	funny	text length
stars
1	0.576769	1.604806	1.056075	826.524700
2	0.719525	1.563107	0.875944	842.265372
3	0.788501	1.306639	0.694730	758.505133
4	0.954623	1.395916	0.670448	712.944129
5	0.944261	1.381780	0.608631	625.015583

Correlations in the grouped dataframe:

	cool	useful	funny	text length
cool	1.000000	-0.743329	-0.944939	-0.857651
useful	-0.743329	1.000000	0.894506	0.699895
funny	-0.944939	0.894506	1.000000	0.843463
text length	-0.857651	0.699895	0.843463	1.000000
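The grouped means and their correlations can be sketched with pandas as below. The toy dataframe and its values are hypothetical; only the column names come from the tables above, and `text length` is again assumed to be the character count of each review.

```python
import pandas as pd

# Hypothetical stand-in for the Yelp dataframe.
yelp = pd.DataFrame({
    "stars":  [1, 1, 3, 3, 5, 5],
    "cool":   [0, 1, 1, 0, 2, 1],
    "useful": [3, 2, 1, 2, 1, 0],
    "funny":  [2, 1, 1, 0, 0, 0],
    "text": [
        "terrible food and very rude staff, never again",
        "cold food and a long wait",
        "average place, nothing special",
        "it was fine",
        "great food",
        "loved it",
    ],
})

# Assumption: text length is the number of characters in the review.
yelp["text length"] = yelp["text"].apply(len)

# Mean of each numerical column per star rating.
numeric_cols = ["cool", "useful", "funny", "text length"]
stars_mean = yelp.groupby("stars")[numeric_cols].mean()

# Pairwise correlations between the columns of the grouped dataframe.
stars_corr = stars_mean.corr()
print(stars_mean)
print(stars_corr)
```

Note that correlations computed this way are taken over the per-star means (five rows in the real data), not over the individual reviews, so they describe how the averages move together across star ratings.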

** Visualizing correlations as a heatmap **

A heatmap of the correlation matrix above.

NLP Classification

For this project, we create a dataframe called yelp_class that contains the columns of the yelp dataframe, but only for the 1-star and 5-star reviews. We create two objects, X and y: X is the 'text' column of yelp_class and y is the 'stars' column of yelp_class. We use CountVectorizer to vectorize X, and start with a simple Multinomial Naive Bayes classifier on this X, y dataset.
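The steps in this paragraph might look like the sketch below. The eight toy reviews are hypothetical and stand in for the real data; the names `yelp`, `yelp_class`, `X` and `y` follow the text, while `cv`, `nb` and `new_reviews` are illustrative. The train/test split used for the reported results is not shown here.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the Yelp dataframe.
yelp = pd.DataFrame({
    "stars": [1, 5, 3, 5, 1, 4, 5, 1],
    "text": [
        "terrible food and rude staff",
        "great food and friendly staff",
        "average place nothing special",
        "great service loved the food",
        "terrible service cold food",
        "good food nice staff",
        "amazing place great staff",
        "rude staff terrible place",
    ],
})

# Keep only the 1-star and 5-star reviews.
yelp_class = yelp[(yelp["stars"] == 1) | (yelp["stars"] == 5)]
X = yelp_class["text"]
y = yelp_class["stars"]

# Turn each review into a vector of raw word counts.
cv = CountVectorizer()
X_counts = cv.fit_transform(X)

# Fit Multinomial Naive Bayes on the count vectors.
nb = MultinomialNB()
nb.fit(X_counts, y)

# New text must go through the same fitted vectorizer before prediction.
new_reviews = ["terrible food", "great service"]
predictions = nb.predict(cv.transform(new_reviews))
print(predictions)
```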

Model Evaluations

** Confusion matrix and classification report using MNB**

[[156  77]
 [ 37 956]]


             precision    recall  f1-score   support

          1       0.81      0.67      0.73       233
          5       0.93      0.96      0.94       993

avg / total       0.90      0.91      0.90      1226
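The per-class figures in the report follow directly from the confusion matrix. The sketch below recomputes them, assuming the scikit-learn convention that rows are true labels and columns are predicted labels (the row sums 233 and 993 match the support column). The helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix from the MNB run above: rows = true class, columns = predicted.
confusion = [[156, 77],
             [37, 956]]
class_labels = [1, 5]


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at position `index`."""
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                 # true class, predicted otherwise
    fp = sum(row[index] for row in matrix) - tp  # predicted class, truly otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for i, label in enumerate(class_labels):
    p, r, f = per_class_metrics(confusion, i)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Overall accuracy: correct predictions over all predictions.
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
print(f"accuracy={accuracy:.3f}")
```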

Really not bad. Next, we apply a TF-IDF weighting scheme to X using scikit-learn's Pipeline, and evaluate it with Multinomial Naive Bayes and Random Forest models.
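A pipeline of this kind could be set up roughly as follows. This is a sketch under assumptions: the step names (`bow`, `tfidf`, `classifier`), the helper `build_pipeline`, the Random Forest settings and the toy reviews are all illustrative and not taken from the original notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical 1-star and 5-star reviews standing in for the real X and y.
texts = [
    "terrible food and rude staff",
    "great food and friendly staff",
    "great service loved the food",
    "terrible service cold food",
    "amazing place great staff",
    "rude staff terrible place",
]
labels = [1, 5, 5, 1, 5, 1]


def build_pipeline(classifier):
    """Chain word counts -> TF-IDF weights -> classifier in one estimator."""
    return Pipeline([
        ("bow", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("classifier", classifier),
    ])


pipelines = {
    "mnb": build_pipeline(MultinomialNB()),
    # n_estimators and random_state are illustrative choices.
    "rf": build_pipeline(RandomForestClassifier(n_estimators=50, random_state=0)),
}

new_reviews = ["terrible cold food", "great friendly staff"]
predictions = {}
for name, pipe in pipelines.items():
    pipe.fit(texts, labels)  # the pipeline takes raw text, not count vectors
    predictions[name] = pipe.predict(new_reviews)
    print(name, predictions[name])
```

Because the vectorizer lives inside the pipeline, the raw text column is passed to `fit` and `predict`; passing an already vectorized matrix to such a pipeline would fail.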

Model Evaluations

** Classification report and confusion matrix (Multinomial Naive Bayes):**

[[  1 232]
 [  0 993]]


             precision    recall  f1-score   support

          1       1.00      0.00      0.01       233
          5       0.81      1.00      0.90       993

avg / total       0.85      0.81      0.73      1226

** Classification report and confusion matrix (Random Forest):**

[[ 30 203]
 [  1 992]]


             precision    recall  f1-score   support

          1       0.97      0.13      0.23       233
          5       0.83      1.00      0.91       993

avg / total       0.86      0.83      0.78      1226

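TF-IDF with MNB performed worse: it classified almost every review as 5 stars, giving a recall of 0.00 for class 1. Random Forest raised class 1 recall to 0.13, still well below the 0.67 of the count-based MNB model. For this project, the model without TF-IDF weighting performed better.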
