

Featured Articles Abstract

This is a list of my featured articles; some are abstracted below. If you would like a translated version, please don’t hesitate to contact me.
For reference, I was 20 years old (a sophomore) in 2014 and 16 in 2010.

Statistics Related

    • This report aims to answer the following two questions. 1. Does the use of swear words have any regiospecificity that results in heterogeneity in the data? 2. Does the use of swear words in customers’ reviews have an impact on the ratings they gave? Can it predict the stars they gave to a business? The main analyses are an ANOVA by metropolis and a multiple regression on ratings. Results indicate that the usage of swear words differs by region and that 25 of 45 swear words have predictive value for the rating a customer gave. All code and files can be obtained from the link at the end.
    • Over six months, I posted different types of posts on Facebook, covering every sort of content I could think of, recorded the number of likes, and categorized the content. Using linear regression and dummy coding, I confirmed that some factors are significantly correlated with the number of likes, such as love-related content, the number of tags, and the time of posting (10:30 PM is the best time).

Psychology Related

    • I tried to develop a theory that makes two theories from different fields coherent: causal cognition in psychology and the causal process theory in philosophy developed by Dowe (2000). I proposed a prerequisite such that, if it is satisfied, psychological causation (defined by causal cognition) contains physical causation (defined by causal process theory). This condition (c) is: sufficient information and logically correct causal reasoning. In this article, I gave specific definitions from previous literature, examples for different conditions, and possible replies to critics.
    • A series of articles that aim to evoke cognitive dissonance by pointing out the irrationality of human behavior and the discrepancy between beliefs and behaviors. Topics included: Gap of wealth (people are only willing to help the poor when they see the need, and ignore them otherwise), Social feedback (people take as many resources as they can from society, but most of them refuse or have no ability to pay back), Happiness, Games of rules, Interest, Greed, Life value, The scheme of moral.

Literature Reviews

    • A review of eight controversial theories that try to explain the evolutionary meaning of the female orgasm. This cannot be simply explained by reinforcement. Also, female orgasms are significantly different from, and more complex than, male orgasms.
    • Exclusiveness is a necessary trait for a religion to grow, as a meme.
    • Humor is an important trait in mate selection. Because it is a trait that reflects intellectual ability, its weight differs between men and women.
    • A discussion of the origin and the essence of life, existence, from the perspective of a little program, the Game of Life.


  • 2014. [Humanities] Evolution, Nihilism and Tragic Pleasure
    • An explanation of the origin of nihilism from the perspective of evolution, linked to tragic pleasure, the pleasure that comes from watching a tragedy.
  • 2010. [Information Theory] Loss of Information in Communication
    • Defined dichotomous correctness of information. Approximated the change in correctness under information transmission using statistics. Found that the correctness of information converges rapidly to 0.5.
  • 2010. [Physics] Infinitely Elastic collision
    • Defined the state of rest under Heisenberg’s uncertainty principle and, under this definition, calculated the time a ball needs to stop bouncing on the ground.
  • 2010. [Physics] How to Pause the Time
    • Defined time as the observation of movement. Proposed that a pause of time is a state in which all energy is removed.


Swear Words in Review: Regiospecificity and Predictability




This report aims to answer the following two questions. 1. Does the use of swear words have any regiospecificity that results in heterogeneity in the data? 2. Does the use of swear words in customers’ reviews have an impact on the ratings they gave? Can it predict the stars they gave to a business? The main analyses are an ANOVA by metropolis and a multiple regression on ratings. Results indicate that the usage of swear words differs by region and that 25 of 45 swear words have predictive value for the rating a customer gave. All code and files can be obtained from the link at the end.



We use swear words a lot more than we imagine. When we direct swear words at someone, it is mostly in an extreme circumstance: for instance, a staff member accidentally knocks your coffee onto your new shirt in the morning, or a waiter with an awful attitude shouts at you. In these situations, a person often can’t help but use swear words to express emotion. As a result, swear words often stand in for emotional words.
Swear words should appear less often in the review text of a business (e.g., a restaurant) than in everyday conversation, because reviews are mostly not written immediately after an event but after the dinner is finished or after one gets home. For this reason, the appearance of swear words in a review should represent a more extreme case, in which the emotion lasts until one leaves a comment, or pushes him/her to leave one with swear words.

Aim and Questions

Following the inference above, in this project I assume that the usage of swear words reflects the preference of a customer and can predict the stars they gave (the primary question). Before addressing that, I will first examine whether there are any region-specific features in the usage of swear words.
Specifically, the aim of this project is to answer the following two questions. 1. Does the use of swear words have any regiospecificity that results in heterogeneity in the data? 2. Does the use of swear words in customers’ reviews have an impact on the ratings they gave? Can it predict the stars they gave to a business?


Swear Words


To construct a dictionary of swear words, I merged two lists found online, badwordslist and Full List of Bad Words and Top Swear Words Banned by Google; the files can be downloaded from here and here.
The merged list contains 824 words before cleaning and 621 words after cleaning. The full list can be obtained from the cloud.
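
The merge-and-clean step can be sketched roughly as follows. The two short vectors are hypothetical stand-ins for the downloaded lists, and the cleaning rules (lower-casing, trimming whitespace, dropping empty entries and duplicates) are an assumption about what "cleaning" involved, not the exact procedure used for the report.

```r
# Hypothetical stand-ins for the two downloaded word lists.
list_a <- c("Crap", "ass ", "wtf", "shit")
list_b <- c("crap", "bitch", "", "WTF", "jerk")

# Assumed cleaning rules: normalise case and whitespace,
# drop empty entries, then remove duplicates.
clean_words <- function(words) {
  words <- tolower(trimws(words))
  words <- words[nzchar(words)]
  sort(unique(words))
}

merged <- c(list_a, list_b)          # raw merged list
swear_words <- clean_words(merged)   # cleaned dictionary

length(merged)        # size before cleaning
length(swear_words)   # size after cleaning
```

With the real lists, the two lengths would correspond to the 824 and 621 counts reported above.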

Term Matrix

To build the term matrix of swear words, the tm package was used to generate the corpus. Considering efficiency and accuracy, a data frame of 20000 rows was sampled randomly from the original review dataset.
After the corpus and the term matrix of swear words were generated, the columns for swear words that never appear were removed, leaving a 20000 by 45 matrix. A final column, total, was then added to hold the total occurrence of swear words in each review.
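
The shape of this step can be illustrated with a small base R sketch. It does not use tm, so it is not the code behind the report; the three reviews and the four-word dictionary are invented for illustration, and the tokenisation (splitting on anything that is not a letter) is a simplifying assumption.

```r
# Invented example data: three reviews and a tiny dictionary.
reviews <- c("The service was crap and the waiter was a jerk.",
             "Lovely brunch, will come back.",
             "wtf, crap food and crap coffee")
swear_words <- c("crap", "jerk", "wtf", "turd")

# Split each review into lower-case word tokens.
tokens <- strsplit(tolower(reviews), "[^a-z]+")

# Rows are reviews, columns are dictionary words, cells are counts.
term_matrix <- vapply(
  swear_words,
  function(w) vapply(tokens, function(tok) sum(tok == w), numeric(1)),
  numeric(length(tokens))
)

# Drop words that never occur, then add the per-review total.
term_matrix <- term_matrix[, colSums(term_matrix) > 0, drop = FALSE]
term_matrix <- cbind(term_matrix, total = rowSums(term_matrix))
term_matrix
```

On the real sample the same steps would produce the 20000-row matrix with 45 word columns plus total.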

Exploratory Figure

The figure below shows the occurrence of the first 30 swear words that appear in the sampled review text.
[Figure: occurrence of the first 30 swear words (chunk swOccurence)]

Regional Analysis

Exploratory Figure

According to the documentation of the dataset, the businesses are from ten metropolises: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Montreal, Waterloo, Edinburgh and Karlsruhe. An exploratory figure was drawn for confirmation. As the figure below shows, there are ten metropolises from four countries.
[Figure: map of business locations (chunk businessLocationUSMap)]

K-means Clustering

However, the original dataset doesn’t contain this information explicitly. To classify businesses into metropolitan regions, K-means clustering was performed on the latitude and longitude variables in the business dataset before further analysis. The result gives 10 cleanly separated clusters, as expected.
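
A minimal sketch of the clustering step, using simulated coordinates around three of the cities rather than the real business file. The city centres are approximate, the noise level and the nstart value are arbitrary choices, and the real analysis would use centers = 10.

```r
set.seed(42)

# Approximate centres of three of the ten metropolises (illustrative only).
centres <- data.frame(
  metro = c("Pittsburgh", "Phoenix", "Edinburgh"),
  lat   = c(40.44, 33.45, 55.95),
  lon   = c(-80.00, -112.07, -3.19)
)

# Simulate 50 businesses scattered around each centre.
business <- data.frame(
  latitude  = rep(centres$lat, each = 50) + rnorm(150, sd = 0.1),
  longitude = rep(centres$lon, each = 50) + rnorm(150, sd = 0.1)
)

# K-means on the two coordinate columns; several random starts
# reduce the chance of ending in a poor local optimum.
fit <- kmeans(business[, c("latitude", "longitude")], centers = 3, nstart = 20)

# Use the cluster label as the metropolis factor for later analysis.
business$metro <- factor(fit$cluster)
table(business$metro)
```

Because the metropolises are far apart relative to the spread within each one, the clusters separate cleanly, which is consistent with the result described above.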

Heterogeneity Analysis

To determine whether the usage of swear words differs by region, a one-way ANOVA (analysis of variance) is performed using metropolis as the independent variable and the total occurrence of swear words as the dependent variable.
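
A minimal sketch of such a one-way ANOVA on simulated data. The variable names metro and total follow the output shown in this report, while the data frame name toAov, the three groups and the Poisson rates are made up for illustration.

```r
set.seed(1)

# Simulated data: three regions with different swear-word rates.
toAov <- data.frame(
  metro = factor(rep(c("A", "B", "C"), each = 100)),
  total = rpois(300, lambda = rep(c(0.1, 0.2, 0.4), each = 100))
)

# One-way ANOVA: total occurrence explained by metropolis.
fit <- lm(total ~ metro, data = toAov)
tab <- anova(fit)
tab
```

The first row of the table carries the between-region degrees of freedom (number of regions minus one) and the second row the residual degrees of freedom, which is how the test is reported in the form F(df1, df2).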

Prediction through Regression

To answer the first part of the second question, that is, whether the use of swear words has an impact on customers’ ratings, a Welch Two Sample t-test will be performed. The two groups are defined by whether or not the review text contains swear words.
Secondly, to predict the rating a user gave from his/her usage of swear words, multiple regression analysis will be performed using stars as the outcome and all of the swear words as predictors. Further analysis, including variable selection, will also be done.
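
Both steps can be sketched on simulated data as follows. The names stars, containSwearWord and toLm follow the output shown in this report, but the data are invented (the simulated stars are continuous rather than 1 to 5), and backward selection with step() is an assumption: the report does not state which variable selection method was used.

```r
set.seed(2)
n <- 500

# Simulated indicator columns for three words and a rating that
# depends on two of them (effect sizes are arbitrary).
toLm <- data.frame(
  crap = rbinom(n, 1, 0.10),
  wtf  = rbinom(n, 1, 0.05),
  pawn = rbinom(n, 1, 0.05)
)
toLm$stars <- 3.8 - 0.8 * toLm$crap - 1.0 * toLm$wtf + rnorm(n, sd = 1.2)
toLm$containSwearWord <- rowSums(toLm[, c("crap", "wtf", "pawn")]) > 0

# Welch two-sample t-test (t.test does not assume equal variances by default).
welch <- t.test(stars ~ containSwearWord, data = toLm)
print(welch)

# Multiple regression on all words, then backward selection by AIC.
full <- lm(stars ~ crap + wtf + pawn, data = toLm)
selected <- step(full, direction = "backward", trace = 0)
summary(selected)
```

A predictor can survive AIC-based selection without being significant at the .05 level, so the number of retained variables and the number of significant ones need not match.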


Regional Analysis

The figure below shows the occurrence of the top 6 swear words in detail by region.
[Figure: occurrence of the top 6 swear words by region (chunk regionalAnalysis_summaryTable_Occurence)]
Further, the ANOVA table is shown below; the result indicates that the total usage of swear words differs by region, F(9, 19990) = 4.27, p < .001.

Analysis of Variance Table

Response: total
             Df Sum Sq Mean Sq F value    Pr(>F)
metro         9    6.6 0.73539  4.2663 1.498e-05 ***
Residuals 19990 3445.7 0.17237
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prediction through Regression

The result of the Welch Two Sample t-test suggests that the two groups differ in their ratings, t(1385) = 12.78, p < .001. The mean rating of reviews that contain swear words (M = 3.228) is significantly lower than that of reviews that do not (M = 3.781).

Welch Two Sample t-test

data: stars by containSwearWord
t = 12.781, df = 1384.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.4682011 0.6379796
sample estimates:
mean in group FALSE mean in group TRUE
3.781412 3.228321

The result of the multiple regression after variable selection is shown below. 25 of the 45 swear words are retained as predictors of the stars a customer gave, and 17 of them are significant at the .05 level. The coefficients of these predictors range from -2.276 to 1.321. Note that three variables (facial, orgasm and fuckin) have positive coefficients, which will be discussed later.

lm(formula = stars ~ crap + ass + blow + screw + shit + wtf +
piss + jerk + facial + bitch + orgasm + dick + porn + prick +
bastard + pawn + retard + cox + lmao + fart + fuckin + knob +
turd + dickhead + goddamn, data = toLm)

Residuals:
Min 1Q Median 3Q Max
-3.5712 -0.7755 0.2245 1.2245 4.6778

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.775543 0.009372 402.874 < 2e-16 ***
crap -0.821638 0.092590 -8.874 < 2e-16 ***
ass -0.226521 0.097300 -2.328 0.019918 *
blow -0.139465 0.098550 -1.415 0.157034
screw -0.791378 0.107348 -7.372 1.74e-13 ***
shit -0.467141 0.113559 -4.114 3.91e-05 ***
wtf -1.048961 0.141033 -7.438 1.07e-13 ***
piss -1.283521 0.157910 -8.128 4.61e-16 ***
jerk -0.304784 0.140930 -2.163 0.030578 *
facial 0.244782 0.085305 2.869 0.004116 **
bitch -0.434923 0.190998 -2.277 0.022790 *
orgasm 0.795622 0.233910 3.401 0.000672 ***
dick -0.669238 0.287728 -2.326 0.020032 *
porn -0.342546 0.213906 -1.601 0.109307
prick -0.534046 0.360898 -1.480 0.138950
bastard -0.620077 0.392868 -1.578 0.114504
pawn -0.344840 0.211046 -1.634 0.102284
retard -1.553321 0.433623 -3.582 0.000342 ***
cox -0.375285 0.180246 -2.082 0.037348 *
lmao -1.197120 0.492058 -2.433 0.014988 *
fart -0.978203 0.653471 -1.497 0.134426
fuckin 1.321374 0.780335 1.693 0.090406 .
knob -1.775543 0.750940 -2.364 0.018067 *
turd -0.423289 0.250306 -1.691 0.090835 .
dickhead -1.940924 0.930837 -2.085 0.037069 *
goddamn -2.275543 0.919686 -2.474 0.013359 *
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.301 on 19974 degrees of freedom
Multiple R-squared: 0.02211, Adjusted R-squared: 0.02089
F-statistic: 18.06 on 25 and 19974 DF, p-value: < 2.2e-16

The figure below shows the counts of the six swear words with the greatest predictive value for stars.
[Figure: boxplot of stars for six swear words (chunk lmBoxplot)]


In the first part, the result reveals regiospecificity across the ten metropolises. One may question the result because of the difference between the sample size (20000) and the original data (~1600000); however, the sampling process was completely random and can be reproduced with different settings. Also, some regions have significantly fewer occurrences of swear words, which may be due to the difference in the number of observations between cities, in particular Karlsruhe and Waterloo. Further analysis is needed to confirm the result.
In the second part, 25 of the 45 variables have predictive value for the rating a customer gave. This result corroborates my hypothesis that swear words can predict the stars in a review. In addition, in another analysis not shown here, swear words can also predict the votes a review received, which would be interesting to investigate further.
These results suggest that 1. swear words in customers’ reviews can be used to predict the stars they gave to a business; 2. the use of swear words has regiospecificity, that is, it differs between regions.



Data Science Capstone Quiz


    Question 1

    After untarring the dataset, how many files are there (including the documentation pdfs)?

    [1] 7

    Your Answer Score Explanation
    7 Correct 1.00
    Total 1.00 / 1.00

    Question 2

    The data files are in what format?
    Your Answer Score Explanation
    json Correct 1.00
    Total 1.00 / 1.00

    Question 3

    How many lines of text are there in the reviews file (in orders of magnitude)?

    [1] 1569264

    Your Answer Score Explanation
    One million Correct 1.00
    Ten thousand
    Ten million
    One hundred thousand
    Total 1.00 / 1.00

    Question 4

    Consider line 100 of the reviews file. “I’ve been going to the Grab n Eat for almost XXX years”

    review[100, ]$text
    [1] "I have been coming to Gab n Eat for almost 20 years and They have never let me down. I get a typical breakfast if eggs, ham, toast, and home fries. Delicious as usual. The ambience however is usually lacking. The walls are dark, with writing and signatures of semi famous people all over the place. Pictures of local people hang on the walls(i secretly want mine up there) along with posters galore. While its fun to look at the first 10 times, it gets a little boring after awhile. So today when I arrived I expected the same old experience. Wow was I wrong! As soon as I looked at the door I knew something was different. The place seemed lighter and brighter. To my pleasant surprise, they painted and got new counter tops!! They're not quite done yet but the place has a new Happy vibe to it. The awesome breakfast, the new decor and the 5 guys sitting at the counter making me laugh are why I will be back( maybe for lunch)."

    Your Answer Score Explanation
    20 Correct 1.00
    Total 1.00 / 1.00

    Question 5

    What percentage of the reviews are five star reviews (rounded to the nearest percentage point)?

    nrow(review[review$stars == 5, ])/nrow(review)
    [1] 0.3692986

    Your Answer Score Explanation
    37% Correct 1.00
    Total 1.00 / 1.00

    Question 6

    How many lines are there in the businesses file?

    [1] 61184

    Your Answer Score Explanation
    Around 15 million
    Around 60 thousand Correct 1.00
    Around 1.5 million
    Around 55 million
    Total 1.00 / 1.00

    Question 7

    Conditional on having a response for the attribute “Wi-Fi”, how many businesses are reported for having free wi-fi (rounded to the nearest percentage point)?

    x <- business$attributes$`Wi-Fi`
    x <- x[!is.na(x)]
    length(x[x == "free"])/length(x)
    [1] 0.4091519

    Your Answer Score Explanation
    40% Correct 1.00
    Total 1.00 / 1.00

    Question 8

    How many lines are in the tip file?

    [1] 495107

    Your Answer Score Explanation
    About 55 million
    About 60 thousand
    About 500 thousand Correct 1.00
    About 1.5 million
    Total 1.00 / 1.00

    Question 9

    In the tips file on the 1,000th line, fill in the blank: “Consistently terrible ______”

    tip[1000, ]$text
    [1] "Consistently terrible service. What's with the attitudes?"

    Your Answer Score Explanation
    service Correct 1.00
    Total 1.00 / 1.00

    Question 10

    What is the name of the user with over 10,000 compliment votes of type “funny”?

    x <- user[user$compliments$funny >= 10000, ]$name
    [1] "Brian"

    Your Answer Score Explanation
    Brian Correct 1.00
    Total 1.00 / 1.00

    Question 11

    Create a 2 by 2 cross tabulation table of when a user has more than 1 fans to if the user has more than 1 compliment vote of type “funny”. Treat missing values as 0 (fans or votes of that type). Pass the 2 by 2 table to fisher.test in R. What is the P-value for the test of independence?

    condition1 <- user$fans >= 1 & !is.na(user$fans)
    condition2 <- user$compliments$funny >= 1 & !is.na(user$compliments$funny)
    ta <- table(condition1, condition2)
    fisher.test(ta)

    Fisher's Exact Test for Count Data

    data: ta
    p-value = 0.00146
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
    17.03876 17.85392
    sample estimates:
    odds ratio

    Your Answer Score Explanation
    around 0.05
    around 0.01 Correct 1.00
    around 0.20
    less than .001
    Total 1.00 / 1.00