Abstract
This report addresses two questions. 1. Does the use of swear words vary by region, producing heterogeneity in the data? 2. Does the use of swear words in a customer’s review affect the rating given, and can it predict the stars awarded to a business? The main analyses are a one-way ANOVA on metropolis and a multiple regression on ratings. The results indicate that swear-word usage differs by region and that 25 of 45 swear words are retained as predictors of the rating a customer gave. All code and files can be obtained from the link at the end.
Introduction
Background
We use swear words far more often than we imagine. When we swear at someone, it is usually in an extreme situation: for instance, a staff member knocks your coffee onto your new shirt in the morning, or a waiter with an awful attitude shouts at you. In such situations a person often cannot help swearing to express emotion. Swear words are therefore often a marker of strong emotion.
When we do swear, we should do so less often in the review text of a business (e.g., a restaurant) than in everyday conversation, because reviews are usually not written immediately after an event but after the dinner is finished or after one gets home. For this reason, swear words in a review should represent a more extreme case: the emotion either lasted until the customer left a comment or pushed him or her to leave one containing swear words.
Aim and Questions
Based on the reasoning above, this project assumes that the usage of swear words reflects a customer’s preference and can predict the stars given (primary question). Before testing this, I first examine whether the usage of swear words has any region-specific features.
Specifically, the aim of this project is to answer the following two questions. 1. Does the use of swear words vary by region, producing heterogeneity in the data? 2. Does the use of swear words in a customer’s review affect the rating given, and can it predict the stars awarded to a business?
Methods
Swear Words
Dictionary
To construct a dictionary of swear words, I merged two lists found online, badwordslist and Full List of Bad Words and Top Swear Words Banned by Google; the files can be downloaded from here and here.
The merged list contains 824 words before cleaning and 621 words after cleaning. The full list can be obtained from the cloud.
Term Matrix
To build the term matrix of swear words, the tm package was used to generate the corpus. For efficiency and accuracy, a data frame of 20000 rows was sampled at random from the original review dataset.
After the corpus and the term matrix of swear words were generated, the columns for swear words that never appeared were removed, leaving a 20000 by 45 matrix. A final column, total, was then added to hold the total occurrence of swear words in each review.
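The steps above can be sketched in a few lines. This is only an illustration in standard-library Python, not the tm-based R code used for the report; the function name `swear_term_matrix`, the simple tokenizer, and the three toy reviews are all assumptions made for the example.

```python
import re
from collections import Counter

def swear_term_matrix(reviews, dictionary):
    """Count dictionary terms per review, drop unused terms, add a total column."""
    vocab = sorted(set(w.lower() for w in dictionary))
    vocab_set = set(vocab)
    rows = []
    for text in reviews:
        # Hypothetical tokenizer: lowercase runs of letters and apostrophes
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(t for t in tokens if t in vocab_set)
        rows.append([counts[w] for w in vocab])
    # Keep only the terms that occur at least once in the sample
    keep = [j for j in range(len(vocab)) if any(r[j] for r in rows)]
    columns = [vocab[j] for j in keep] + ["total"]
    # Append the per-review total as the last column
    matrix = [[r[j] for j in keep] for r in rows]
    matrix = [r + [sum(r)] for r in matrix]
    return columns, matrix

# Toy data, invented for illustration only
reviews = [
    "The food was crap and the waiter was a jerk.",
    "Lovely place, great coffee.",
    "Crap service. Crap parking.",
]
dictionary = ["crap", "jerk", "turd"]
columns, matrix = swear_term_matrix(reviews, dictionary)
print(columns)
for row in matrix:
    print(row)
```

The term turd is in the toy dictionary but in none of the toy reviews, so its column is dropped, which mirrors how the 621-word dictionary shrinks to 45 columns in the sample.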
Exploratory Figure
The figure below shows the occurrence of the first 30 swear words that appear in the sampled review text.
Regional Analysis
Exploratory Figure
According to the original documentation of the dataset, the businesses are from ten metropolises: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Montreal, Waterloo, Edinburgh and Karlsruhe. An exploratory figure was drawn to confirm this. As the figure below shows, there are ten metropolises in four countries.
K-means Clustering
However, the original dataset does not contain this information explicitly. To assign each business to a metropolitan region, K-means clustering was performed on the latitude and longitude variables of the business dataset before further analysis. The result gives 10 cleanly separated clusters, as expected.
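A minimal sketch of the idea, assuming plain Lloyd-style K-means with Euclidean distance on raw latitude and longitude. The report does not state the settings actually used, so the function name `kmeans`, the farthest-point seeding (chosen here only to make the example deterministic), and the six invented business locations near three of the cities are all assumptions.

```python
import math

def kmeans(points, k, iterations=50):
    """Lloyd-style K-means on (latitude, longitude) tuples."""
    # Deterministic farthest-point seeding (an assumption for this sketch)
    centers = [points[0]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Invented locations near Pittsburgh, Las Vegas and Edinburgh
points = [
    (40.44, -80.00), (40.46, -79.95),
    (36.17, -115.14), (36.12, -115.17),
    (55.95, -3.19), (55.94, -3.21),
]
labels, centers = kmeans(points, k=3)
print(labels)
```

Because the metropolises are hundreds of kilometres apart, treating degrees as planar coordinates is good enough to separate them; it would not be appropriate for distinguishing nearby neighbourhoods.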
Heterogeneity Analysis
To determine whether the usage of swear words differs by region, a one-way ANOVA (analysis of variance) was performed with metropolis as the independent variable and the total occurrence of swear words as the dependent variable.
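For readers who want to see what the test computes, the F statistic of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below is a standard-library Python illustration with invented numbers, not the R call used for the report; the function name `one_way_anova` is hypothetical, and the p-value is omitted because it needs the F distribution.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares: values around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Invented per-review totals for three hypothetical regions
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_anova(groups)
print(f_value, df_between, df_within)
```

With ten metropolises and 20000 sampled reviews, the same formula gives 9 and 19990 degrees of freedom.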
Prediction through Regression
To answer the first part of the second question, namely whether the use of swear words affects customers’ ratings, a Welch two-sample t-test was performed. Two groups were formed according to whether the review text contains swear words or not.
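The Welch test differs from the classical t-test in that it does not assume equal variances and uses the Welch-Satterthwaite approximation for the degrees of freedom, which is why the reported df is not an integer. A small standard-library Python sketch of both quantities follows; the function name `welch_t` and the star ratings are invented for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error contributed by each group
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Invented star ratings: reviews without and with swear words
without_swear = [5.0, 4.0, 4.0, 5.0, 3.0, 4.0]
with_swear = [1.0, 3.0, 2.0, 5.0, 1.0]
t, df = welch_t(without_swear, with_swear)
print(t, df)
```

Putting the group without swear words first gives a positive t when that group has the higher mean, which matches the sign convention of the reported result.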
Second, to predict the rating a user gave from his or her usage of swear words in different areas, a multiple regression analysis was performed with stars as the outcome and all of the swear words as predictors, by region. Further analysis, including variable selection, was also carried out.
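As an illustration of the regression step, the sketch below fits ordinary least squares by solving the normal equations in standard-library Python. It is not the R lm call used for the report and it leaves out variable selection; the function name `ols` and the five toy reviews (counts of two words, with ratings constructed to be exactly linear) are assumptions.

```python
def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented system [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda i: abs(A[i][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for i in range(p):
            if i != col:
                factor = A[i][col]
                A[i] = [vi - factor * vc for vi, vc in zip(A[i], A[col])]
    return [A[i][p] for i in range(p)]

# Toy counts of two words per review, with constructed ratings
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 0]]
stars = [4.0, 3.0, 4.5, 3.5, 2.0]
coefficients = ols(X, stars)
print(coefficients)
```

A negative coefficient means each additional occurrence of the word is associated with a lower predicted rating; a positive one means the opposite.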
Results
Regional Analysis
The figure below shows the occurrence of the top 6 swear words by region in detail.
The ANOVA table is shown below; the result indicates that the total usage of swear words differs by region, F(9, 19990) = 4.27, p < .001.
Analysis of Variance Table

Response: total
             Df Sum Sq Mean Sq F value    Pr(>F)
metro         9    6.6 0.73539  4.2663 1.498e-05 ***
Residuals 19990 3445.7 0.17237
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
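As a quick consistency check on the table, the F value is the ratio of the two mean squares, evaluated on 9 and 19990 degrees of freedom:

```latex
F = \frac{MS_{\text{metro}}}{MS_{\text{Residuals}}}
  = \frac{0.73539}{0.17237} \approx 4.266
```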
Prediction through Regression
The result of the Welch two-sample t-test suggests that the two groups differ in their ratings, t(1384.7) = 12.78, p < .001. Reviews that contain swear words (M = 3.228) have significantly lower ratings than reviews that do not (M = 3.781).
Welch Two Sample t-test

data:  stars by containSwearWord
t = 12.781, df = 1384.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.4682011 0.6379796
sample estimates:
mean in group FALSE  mean in group TRUE
           3.781412            3.228321
The result of the multiple regression after variable selection is shown below. 25 of the 45 swear words are retained as predictors of the stars given, and 17 of them are significant at the .05 level. The coefficients range from -2.276 to 1.321, and the model explains about 2.2% of the variance in stars (R-squared = 0.022). Note that three variables (facial, orgasm and fuckin) have positive coefficients.
Call:
lm(formula = stars ~ crap + ass + blow + screw + shit + wtf +
piss + jerk + facial + bitch + orgasm + dick + porn + prick +
bastard + pawn + retard + cox + lmao + fart + fuckin + knob +
turd + dickhead + goddamn, data = toLm)
Residuals:
Min 1Q Median 3Q Max
-3.5712 -0.7755 0.2245 1.2245 4.6778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.775543 0.009372 402.874 < 2e-16 ***
crap -0.821638 0.092590 -8.874 < 2e-16 ***
ass -0.226521 0.097300 -2.328 0.019918 *
blow -0.139465 0.098550 -1.415 0.157034
screw -0.791378 0.107348 -7.372 1.74e-13 ***
shit -0.467141 0.113559 -4.114 3.91e-05 ***
wtf -1.048961 0.141033 -7.438 1.07e-13 ***
piss -1.283521 0.157910 -8.128 4.61e-16 ***
jerk -0.304784 0.140930 -2.163 0.030578 *
facial 0.244782 0.085305 2.869 0.004116 **
bitch -0.434923 0.190998 -2.277 0.022790 *
orgasm 0.795622 0.233910 3.401 0.000672 ***
dick -0.669238 0.287728 -2.326 0.020032 *
porn -0.342546 0.213906 -1.601 0.109307
prick -0.534046 0.360898 -1.480 0.138950
bastard -0.620077 0.392868 -1.578 0.114504
pawn -0.344840 0.211046 -1.634 0.102284
retard -1.553321 0.433623 -3.582 0.000342 ***
cox -0.375285 0.180246 -2.082 0.037348 *
lmao -1.197120 0.492058 -2.433 0.014988 *
fart -0.978203 0.653471 -1.497 0.134426
fuckin 1.321374 0.780335 1.693 0.090406 .
knob -1.775543 0.750940 -2.364 0.018067 *
turd -0.423289 0.250306 -1.691 0.090835 .
dickhead -1.940924 0.930837 -2.085 0.037069 *
goddamn -2.275543 0.919686 -2.474 0.013359 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.301 on 19974 degrees of freedom
Multiple R-squared: 0.02211, Adjusted R-squared: 0.02089
F-statistic: 18.06 on 25 and 19974 DF, p-value: < 2.2e-16
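As a reading aid, the fitted coefficients can be turned into a predicted rating. Assuming each predictor is the count of that word in a review (as in the term matrix described in the Methods), a review that contains crap once and none of the other retained words has a predicted rating of:

```latex
\hat{y} = 3.775543 + (-0.821638)(1) = 2.953905
```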
The figure below shows the counts of the six swear words with the greatest predictive power for stars.
Discussion
In the first part, the result reveals regional differences among the ten metropolises. One may question the result because of the gap between the sample size (20000) and the original data (~1600000); however, the sampling was fully random and can be reproduced with different settings. Also, some regions show markedly fewer swear words, which may be due to differences in the number of observations per city, in particular for Karlsruhe and Waterloo. Further analysis is needed to confirm the result.
In the second part, 25 of the 45 variables are retained as predictors of the rating a customer gave. This result corroborates my hypothesis that swear words can predict the stars in a review. In addition, another analysis not shown here indicates that swear words can also predict the votes a review received, which would be interesting to investigate further.
These analyses suggest that 1. Swear words in customers’ reviews can be used to predict the stars given to a business. 2. The use of swear words shows regional specificity, that is, it differs between regions.
Citation
- All data are from the Yelp Dataset Challenge.
- The Rmd file for this report, containing all code, is available here.
—20160625