Data Science Capstone Quiz


Question 1

After untarring the dataset, how many files are there (including the documentation PDFs)?

[1] 7
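
One way to arrive at a count like this from R is to unpack the archive and list its files recursively. The sketch below is an assumption about the approach, not the command used for the answer above. It builds a throwaway archive so it runs without the real dataset; the names `toy_dataset`, `a.json`, `b.json` and `notes.pdf` are hypothetical.

```r
# Build a small throwaway archive (hypothetical file names, not the real data).
src <- file.path(tempdir(), "toy_dataset")
dir.create(src, showWarnings = FALSE)
for (f in c("a.json", "b.json", "notes.pdf")) {
  writeLines("placeholder", file.path(src, f))
}

archive <- file.path(tempdir(), "toy_dataset.tar")
old_wd <- setwd(tempdir())            # tar() stores paths relative to the working directory
tar(archive, files = "toy_dataset", tar = "internal")
setwd(old_wd)

# Unpack into a fresh directory and count regular files in all subdirectories.
out_dir <- file.path(tempdir(), "unpacked")
untar(archive, exdir = out_dir, tar = "internal")
n_files <- length(list.files(out_dir, recursive = TRUE))
n_files
```

With the real archive, only the `untar()` and `list.files()` lines would be needed, pointed at the downloaded file.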

Question 2

The data files are in what format?

Question 3

How many lines of text are there in the reviews file (in orders of magnitude)?

[1] 1569264

Ten thousand
Ten million
One hundred thousand
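
A minimal sketch of counting lines and reducing the count to an order of magnitude, assuming a file with one record per line. The temporary file stands in for the reviews file, and `readLines()` loads the whole file into memory, which may be slow for very large inputs.

```r
# Toy stand-in for a data file: one record per line.
path <- tempfile(fileext = ".json")
writeLines(c('{"id": 1}', '{"id": 2}', '{"id": 3}'), path)

# Count the lines, then round down to the nearest power of ten.
n_lines <- length(readLines(path))
magnitude <- 10^floor(log10(n_lines))
n_lines
magnitude
```

Applied to the count reported above, `10^floor(log10(1569264))` gives one million.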

Question 4

Consider line 100 of the reviews file. “I’ve been going to the Grab n Eat for almost XXX years”

review[100, ]$text
[1] "I have been coming to Gab n Eat for almost 20 years and They have never let me down. I get a typical breakfast if eggs, ham, toast, and home fries. Delicious as usual. The ambience however is usually lacking. The walls are dark, with writing and signatures of semi famous people all over the place. Pictures of local people hang on the walls(i secretly want mine up there) along with posters galore. While its fun to look at the first 10 times, it gets a little boring after awhile. So today when I arrived I expected the same old experience. Wow was I wrong! As soon as I looked at the door I knew something was different. The place seemed lighter and brighter. To my pleasant surprise, they painted and got new counter tops!! They're not quite done yet but the place has a new Happy vibe to it. The awesome breakfast, the new decor and the 5 guys sitting at the counter making me laugh are why I will be back( maybe for lunch)."

Question 5

What percentage of the reviews are five star reviews (rounded to the nearest percentage point)?

nrow(review[review$stars == 5, ])/nrow(review)
[1] 0.3692986
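
The same proportion can be written more compactly as the mean of a logical vector. This is a sketch on a made-up data frame, `review_toy`, which is hypothetical and only mirrors the `stars` column used above.

```r
# Hypothetical toy data with the same column name as the reviews data.
review_toy <- data.frame(stars = c(5, 4, 5, 1, 3, 5, 2, 5))

# mean() of a logical vector is the share of TRUE values.
share_five <- mean(review_toy$stars == 5)

# Convert to a percentage rounded to the nearest point.
pct_five <- round(100 * share_five)
share_five
pct_five
```

For the value above, `round(100 * 0.3692986)` is 37.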

Question 6

How many lines are there in the businesses file?

[1] 61184

Around 15 million
Around 1.5 million
Around 55 million

Question 7

Conditional on having a response for the attribute “Wi-Fi”, what percentage of businesses are reported as having free Wi-Fi (rounded to the nearest percentage point)?

x <- business$attributes$`Wi-Fi`
x <- x[!is.na(x)]
length(x[x == "free"])/length(x)
[1] 0.4091519
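
A small sketch of what "conditional on having a response" means in practice: missing values are dropped before the share is computed, so they count in neither the numerator nor the denominator. The vector `wifi` is made up for illustration.

```r
# Hypothetical responses; NA means the business has no Wi-Fi attribute recorded.
wifi <- c("free", "no", NA, "free", "paid", NA, "no", "free")

# Keep only businesses that answered, then take the share reporting "free".
answered <- wifi[!is.na(wifi)]
share_free <- mean(answered == "free")
length(answered)
share_free
```

Without the filter, `wifi == "free"` would contain NA entries and the proportion would not be well defined.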

Question 8

How many lines are in the tip file?

[1] 495107

About 55 million
About 60 thousand
About 1.5 million

Question 9

In the tips file on the 1,000th line, fill in the blank: “Consistently terrible ______”

tip[1000, ]$text
[1] "Consistently terrible service. What's with the attitudes?"

Question 10

What is the name of the user with over 10,000 compliment votes of type “funny”?

user[which(user$compliments$funny > 10000), ]$name
[1] "Brian"
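
When a column contains missing values, a plain logical subset returns NA entries alongside the real matches; wrapping the condition in `which()` avoids that. The sketch below uses a made-up nested data frame, `user_toy`, whose names and values are hypothetical.

```r
# Hypothetical users with a nested compliments data frame, as in the user data.
user_toy <- data.frame(name = c("Ann", "Bo", "Cy"), stringsAsFactors = FALSE)
user_toy$compliments <- data.frame(funny = c(12, NA, 10500))

# Plain logical indexing keeps an NA entry for the user with a missing count.
with_na <- user_toy$name[user_toy$compliments$funny > 10000]

# which() keeps only positions where the condition is TRUE.
top <- user_toy$name[which(user_toy$compliments$funny > 10000)]
with_na
top
```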

Question 11

Create a 2 by 2 cross-tabulation of whether a user has more than 1 fan against whether the user has more than 1 compliment vote of type “funny”. Treat missing values as 0 (fans or votes of that type). Pass the 2 by 2 table to fisher.test in R. What is the P-value for the test of independence?

condition1 <- user$fans >= 1 & !is.na(user$fans)
condition2 <- user$compliments$funny >= 1 & !is.na(user$compliments$funny)
ta <- table(condition1, condition2)
fisher.test(ta)

 Fisher's Exact Test for Count Data

data: ta
p-value = 0.00146
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
17.03876 17.85392
sample estimates:
odds ratio

around 0.05
around 0.20
less than .001
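
The steps in the question can be sketched end to end on toy data: replace missing values with 0, build the two indicators, cross-tabulate, and run the test. The vectors below are made up, and the strict `> 1` threshold follows the question's wording; it is an assumption about the intended cutoff.

```r
# Hypothetical counts for ten users; NA means the value is missing.
fans  <- c(0, 3, NA, 5, 0, 2, 0, 7, NA, 4)
funny <- c(NA, 2, 0, 9, 0, 4, 1, 6, 0, NA)

# Treat missing values as 0, as the question asks.
fans0  <- ifelse(is.na(fans), 0, fans)
funny0 <- ifelse(is.na(funny), 0, funny)

# Indicators for "more than 1", then the 2 by 2 cross-tabulation.
has_fans  <- fans0 > 1
has_funny <- funny0 > 1
tab <- table(has_fans, has_funny)

# Fisher's exact test of independence; the p-value is stored in the result.
result <- fisher.test(tab)
tab
result$p.value
```

Reading `result$p.value` directly avoids relying on the rounded figure in the printed summary.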
