Introduction

All quiz questions are from the Coursera Data Science Capstone course.
All .json files are provided by Yelp.
The data sources are hidden for privacy reasons.

Question 1

After untarring the dataset, how many files are there (including the documentation PDFs)?

length(dir("../yelp_dataset_challenge_academic_dataset"))

[1] 7

Your Answer                     Score         Explanation
2
3
5
7                     Correct   1.00
Total                           1.00 / 1.00

Question 2

The data files are in what format?

Your Answer                     Score         Explanation
json                  Correct   1.00
.RData
csv
.xlsx
Total                           1.00 / 1.00
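
The answers in this write-up refer to data frames named review, business, tip and user without showing how they were created. The following is a minimal sketch of one way to load such a file, assuming each file stores one JSON object per line and that the jsonlite package is installed. The helper name load_ndjson and the demo records are hypothetical, and this is not necessarily how the data was loaded here.

```r
library(jsonlite)

# Hypothetical helper: read a file holding one JSON object per line
# into a data frame. Nested objects become nested data-frame columns.
load_ndjson <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  stream_in(con, verbose = FALSE)
}

# Demo on a small temporary file (made-up records, not Yelp data).
demo_path <- tempfile(fileext = ".json")
writeLines(c(
  '{"name": "A", "stars": 5, "attributes": {"Wi-Fi": "free"}}',
  '{"name": "B", "stars": 3, "attributes": {"Wi-Fi": "no"}}'
), demo_path)

demo <- load_ndjson(demo_path)
nrow(demo)
demo$attributes$`Wi-Fi`
```

Keeping the nested structure (rather than flattening it) is what makes access patterns such as business$attributes$`Wi-Fi` possible.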

Question 3

How many lines of text are there in the reviews file (in orders of magnitude)?

nrow(review)

[1] 1569264

Your Answer                     Score         Explanation
One million           Correct   1.00
Ten thousand
Ten million
One hundred thousand
Total                           1.00 / 1.00
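
nrow(review) matches the number of lines only if every record occupies exactly one line of the file. As a cross-check, the lines can be counted directly without parsing. The sketch below reads the file in chunks to limit memory use; the helper name count_lines and the demo file are hypothetical.

```r
# Hypothetical helper: count lines in a text file without parsing it,
# reading a fixed number of lines at a time to limit memory use.
count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break
    total <- total + length(chunk)
  }
  total
}

# Demo on a temporary file with 2500 made-up lines.
demo_path <- tempfile()
writeLines(rep("{}", 2500), demo_path)
n_lines <- count_lines(demo_path, chunk_size = 1000L)
n_lines
floor(log10(n_lines))  # order of magnitude as a power of ten
```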

Question 4

Consider line 100 of the reviews file. “I’ve been going to the Grab n Eat for almost XXX years”

review[100, ]$text

[1] "I have been coming to Gab n Eat for almost 20 years and They have never let me down. I get a typical breakfast if eggs, ham, toast, and home fries. Delicious as usual. The ambience however is usually lacking. The walls are dark, with writing and signatures of semi famous people all over the place. Pictures of local people hang on the walls(i secretly want mine up there) along with posters galore. While its fun to look at the first 10 times, it gets a little boring after awhile. So today when I arrived I expected the same old experience. Wow was I wrong! As soon as I looked at the door I knew something was different. The place seemed lighter and brighter. To my pleasant surprise, they painted and got new counter tops!! They're not quite done yet but the place has a new Happy vibe to it. The awesome breakfast, the new decor and the 5 guys sitting at the counter making me laugh are why I will be back( maybe for lunch)."

Your Answer                     Score         Explanation
20                    Correct   1.00
10
2
5
Total                           1.00 / 1.00

Question 5

What percentage of the reviews are five star reviews (rounded to the nearest percentage point)?

nrow(review[review$stars == 5, ])/nrow(review)

[1] 0.3692986

Your Answer                     Score         Explanation
37%                   Correct   1.00
30%
10%
14%
Total                           1.00 / 1.00
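
The same share can be computed without subsetting the data frame, because the mean of a logical vector is the proportion of TRUE values. A small sketch on made-up ratings (the stars vector below is illustrative, not Yelp data):

```r
# Made-up ratings, not Yelp data.
stars <- c(5, 4, 5, 1, 3, 5, 2, 5)

# The mean of a logical vector is the share of TRUE values.
share_five <- mean(stars == 5)
share_five

# Same share via counts, mirroring the row-count approach above.
share_rows <- sum(stars == 5) / length(stars)

# Full distribution of ratings as proportions.
prop.table(table(stars))

# Rounded to the nearest percentage point.
round(100 * share_five)
```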

Question 6

How many lines are there in the businesses file?

nrow(business)

[1] 61184

Your Answer                     Score         Explanation
Around 15 million
Around 60 thousand    Correct   1.00
Around 1.5 million
Around 55 million
Total                           1.00 / 1.00

Question 7

Conditional on having a response for the attribute “Wi-Fi”, what percentage of businesses are reported as having free Wi-Fi (rounded to the nearest percentage point)?

x <- business$attributes$`Wi-Fi`
x <- x[!is.na(x)]
length(x[x == "free"])/length(x)

[1] 0.4091519

Your Answer                     Score         Explanation
2%
57%
1%
40%                   Correct   1.00
Total                           1.00 / 1.00
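
Removing the NA values before counting matters here: indexing with a logical vector that contains NA returns NA elements, which would inflate the count of "free". A small sketch on made-up attribute values (the wifi vector is illustrative only):

```r
# Made-up attribute values; NA marks a business with no response.
wifi <- c("free", "no", NA, "paid", "free", NA, "no", "free")

# Indexing with a logical vector that contains NA yields NA elements,
# so this length includes the NA entries as well as the "free" ones.
length(wifi[wifi == "free"])

# Share of "free" among businesses that gave any response.
answered <- wifi[!is.na(wifi)]
share_free <- mean(answered == "free")

# Equivalent one-liner: na.rm drops the NA comparisons.
share_free_alt <- mean(wifi == "free", na.rm = TRUE)
share_free
```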

Question 8

How many lines are in the tip file?

nrow(tip)

[1] 495107

Your Answer                     Score         Explanation
About 55 million
About 60 thousand
About 500 thousand    Correct   1.00
About 1.5 million
Total                           1.00 / 1.00

Question 9

In the tips file on the 1,000th line, fill in the blank: “Consistently terrible ______”

tip[1000, ]$text

[1] "Consistently terrible service. What's with the attitudes?"

Your Answer                     Score         Explanation
service               Correct   1.00
food
desserts
atmosphere
Total                           1.00 / 1.00

Question 10

What is the name of the user with over 10,000 compliment votes of type “funny”?

x <- user[user$compliments$funny >= 10000, ]$name
x[!is.na(x)]

[1] "Brian"

Your Answer                     Score         Explanation
Jeff
Roger
Ira
Brian                 Correct   1.00
Total                           1.00 / 1.00
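
Subsetting rows with a logical condition that contains NA returns NA rows, which is why the answer above filters the names afterwards. Wrapping the condition in which() drops the NA comparisons up front. The sketch below uses a made-up data frame with a nested compliments column and a strict comparison for "over 10,000"; all names and counts are invented.

```r
# Made-up users with a nested data-frame column, mimicking the layout
# that user$compliments$funny implies. Names and counts are invented.
user_demo <- data.frame(name = c("Ann", "Bo", "Cy"), stringsAsFactors = FALSE)
user_demo$compliments <- data.frame(funny = c(12000, NA, 40))

# which() returns only the positions that are TRUE, so NA comparisons
# are dropped and no NA names appear in the result.
top <- user_demo$name[which(user_demo$compliments$funny > 10000)]
top
```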

Question 11

Create a 2 by 2 cross tabulation table of whether a user has more than 1 fan against whether the user has more than 1 compliment vote of type “funny”. Treat missing values as 0 (fans or votes of that type). Pass the 2 by 2 table to fisher.test in R. What is the P-value for the test of independence?

condition1 <- user$fans >= 1 & !is.na(user$fans)
condition2 <- user$compliments$funny >= 1 & !is.na(user$compliments$funny)
ta <- table(condition1, condition2)
fisher.test(ta)


 Fisher's Exact Test for Count Data

data: ta
p-value = 0.00146
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
17.03876 17.85392
sample estimates:
odds ratio
17.43834

Your Answer                     Score         Explanation
around 0.05
around 0.01           Correct   1.00
around 0.20
less than .001
Total                           1.00 / 1.00
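
Note that the question says "more than 1" while the code above compares with >= 1. The sketch below makes both the NA-to-0 step and the threshold explicit so that either reading can be tried. The helper name flag, its at_least argument and the counts are hypothetical, and the resulting p-value has no relation to the quiz answer.

```r
# Made-up counts; NA means the value was missing in the record.
fans  <- c(0, 3, NA, 1, 0, 5, 2, NA, 0, 4)
funny <- c(NA, 2, 0, 1, 0, 7, NA, 0, 1, 3)

# Hypothetical helper: replace NA with 0, then flag values at or above
# a threshold. Fixed factor levels guarantee a 2 by 2 table even if one
# outcome never occurs.
flag <- function(x, at_least = 1) {
  x[is.na(x)] <- 0
  factor(x >= at_least, levels = c(FALSE, TRUE))
}

tab <- table(has_fans = flag(fans), has_funny = flag(funny))
tab

result <- fisher.test(tab)
result$p.value
```

Calling flag(fans, at_least = 2) instead corresponds to the strict "more than 1" reading for whole-number counts.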

TLL

Data Science Capstone Quiz