Data Science Capstone Quiz

Introduction

    Question 1

    After untarring the dataset, how many files are there (including the documentation PDFs)?

    length(dir("../yelp_dataset_challenge_academic_dataset"))
    [1] 7

    Your Answer Score Explanation
    2
    3
    5
    7 Correct 1.00
    Total 1.00 / 1.00

    Question 2

    The data files are in what format?
    Your Answer Score Explanation
    json Correct 1.00
    .RData
    csv
    .xlsx
    Total 1.00 / 1.00
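
    The post never shows how the `review`, `business`, `tip` and `user` objects were loaded. Since each data file holds one JSON record per line, one plausible route is `stream_in()` from the jsonlite package. The sketch below is an assumption, not the author's code, and it uses a made-up two-record file in place of the real data.

```r
# Sketch only: assumes the jsonlite package is installed.
# The records below are invented stand-ins for the real review file.
library(jsonlite)

# Write a tiny newline-delimited JSON file (one object per line).
path <- tempfile(fileext = ".json")
writeLines(c(
  '{"stars": 5, "text": "Great"}',
  '{"stars": 3, "text": "Okay"}'
), path)

# stream_in() parses one record per line and binds them into a data frame.
toy_review <- stream_in(file(path), verbose = FALSE)

nrow(toy_review)    # number of records read
toy_review$stars    # one column per JSON key
```

    With the real files, `path` would point at the corresponding `.json` file inside the extracted dataset directory.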

    Question 3

    How many lines of text are there in the reviews file (in orders of magnitude)?

    nrow(review)
    [1] 1569264

    Your Answer Score Explanation
    One million Correct 1.00
    Ten thousand
    Ten million
    One hundred thousand
    Total 1.00 / 1.00
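
    `nrow(review)` requires the whole file to be parsed first. If only the line count is needed, the lines can be counted directly in chunks. The following is a minimal base R sketch; the helper name `count_lines` and the toy file are hypothetical.

```r
# Toy stand-in file with 2500 one-line JSON records (invented data).
path <- tempfile(fileext = ".json")
writeLines(sprintf('{"id": %d}', 1:2500), path)

# Hypothetical helper: count lines in fixed-size chunks so the whole
# file never has to sit in memory at once.
count_lines <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  total
}

n <- count_lines(path)
n                  # 2500 for the toy file
floor(log10(n))    # 3, i.e. the count is in the thousands
```

    Applied to the count reported above, `floor(log10(1569264))` gives 6, which matches the "one million" answer.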

    Question 4

    Consider line 100 of the reviews file. “I’ve been going to the Grab n Eat for almost XXX years”

    review[100, ]$text
    [1] "I have been coming to Gab n Eat for almost 20 years and They have never let me down. I get a typical breakfast if eggs, ham, toast, and home fries. Delicious as usual. The ambience however is usually lacking. The walls are dark, with writing and signatures of semi famous people all over the place. Pictures of local people hang on the walls(i secretly want mine up there) along with posters galore. While its fun to look at the first 10 times, it gets a little boring after awhile. So today when I arrived I expected the same old experience. Wow was I wrong! As soon as I looked at the door I knew something was different. The place seemed lighter and brighter. To my pleasant surprise, they painted and got new counter tops!! They're not quite done yet but the place has a new Happy vibe to it. The awesome breakfast, the new decor and the 5 guys sitting at the counter making me laugh are why I will be back( maybe for lunch)."

    Your Answer Score Explanation
    20 Correct 1.00
    10
    2
    5
    Total 1.00 / 1.00

    Question 5

    What percentage of the reviews are five star reviews (rounded to the nearest percentage point)?

    nrow(review[review$stars == 5, ])/nrow(review)
    [1] 0.3692986

    Your Answer Score Explanation
    37% Correct 1.00
    30%
    10%
    14%
    Total 1.00 / 1.00
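
    The same share can be computed without subsetting the data frame, because the mean of a logical vector is the fraction of `TRUE` values. A small sketch on invented ratings (the `toy_review` data is not from the dataset):

```r
# Made-up ratings, including one missing value.
toy_review <- data.frame(stars = c(5, 4, 5, 1, NA, 5, 3, 5))

# Mean of a logical vector = share of TRUE; na.rm = TRUE drops the
# missing rating instead of turning the whole result into NA.
share_five <- mean(toy_review$stars == 5, na.rm = TRUE)

round(100 * share_five)    # 57, since 4 of the 7 known ratings are 5
```

    This form also sidesteps a quirk of `review[review$stars == 5, ]`: a logical index containing `NA` returns extra all-`NA` rows, which would inflate the numerator if any ratings were missing.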

    Question 6

    How many lines are there in the businesses file?

    nrow(business)
    [1] 61184

    Your Answer Score Explanation
    Around 15 million
    Around 60 thousand Correct 1.00
    Around 1.5 million
    Around 55 million
    Total 1.00 / 1.00

    Question 7

    Conditional on having a response for the attribute “Wi-Fi”, what percentage of businesses are reported as having free Wi-Fi (rounded to the nearest percentage point)?

    x <- business$attributes$`Wi-Fi`
    x <- x[!is.na(x)]
    length(x[x == "free"])/length(x)
    [1] 0.4091519

    Your Answer Score Explanation
    2%
    57%
    1%
    40% Correct 1.00
    Total 1.00 / 1.00
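
    An alternative that shows the share of every response at once is `prop.table(table(...))`; `table()` drops `NA` by default, so the result is already conditional on having a response. The values below are invented for illustration.

```r
# Made-up Wi-Fi responses; NA means the business gave no response.
toy_wifi <- c("free", "no", NA, "free", "paid", NA, "no")

# table() ignores NA by default, so the proportions are computed
# over the five businesses that actually responded.
wifi_share <- prop.table(table(toy_wifi))
wifi_share

free_share <- as.numeric(wifi_share["free"])
free_share    # 0.4 here: 2 "free" out of 5 responses
```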

    Question 8

    How many lines are in the tip file?

    nrow(tip)
    [1] 495107

    Your Answer Score Explanation
    About 55 million
    About 60 thousand
    About 500 thousand Correct 1.00
    About 1.5 million
    Total 1.00 / 1.00

    Question 9

    In the tips file on the 1,000th line, fill in the blank: “Consistently terrible ______”

    tip[1000, ]$text
    [1] "Consistently terrible service. What's with the attitudes?"

    Your Answer Score Explanation
    service Correct 1.00
    food
    desserts
    atmosphere
    Total 1.00 / 1.00

    Question 10

    What is the name of the user with over 10,000 compliment votes of type “funny”?

    x <- user[user$compliments$funny > 10000, ]$name
    x[!is.na(x)]
    [1] "Brian"

    Your Answer Score Explanation
    Jeff
    Roger
    Ira
    Brian Correct 1.00
    Total 1.00 / 1.00
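
    The `x[!is.na(x)]` step is needed because a logical index containing `NA` yields `NA` rows. Wrapping the condition in `which()` avoids that clean-up, since `which()` returns only the positions that are `TRUE`. A sketch on an invented user table with the same nested `compliments` column:

```r
# Invented users; the nested data frame mimics user$compliments.
toy_user <- data.frame(
  name = c("Ann", "Bob", "Cy", "Dee"),
  stringsAsFactors = FALSE
)
toy_user$compliments <- data.frame(funny = c(12000, NA, 10000, 3))

# which() keeps only positions where the condition is TRUE, so the
# NA for "Bob" is skipped without a separate is.na() filter.
hits <- which(toy_user$compliments$funny > 10000)
toy_user$name[hits]    # "Ann"; exactly 10000 is not "over 10,000"
```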

    Question 11

    Create a 2 by 2 cross-tabulation of whether a user has more than 1 fan against whether the user has more than 1 compliment vote of type “funny”. Treat missing values as 0 (fans or votes of that type). Pass the 2 by 2 table to fisher.test in R. What is the P-value for the test of independence?

    condition1 <- user$fans >= 1 & !is.na(user$fans)
    condition2 <- user$compliments$funny >= 1 & !is.na(user$compliments$funny)
    ta <- table(condition1, condition2)
    fisher.test(ta)

    Fisher's Exact Test for Count Data

    data: ta
    p-value = 0.00146
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
    17.03876 17.85392
    sample estimates:
    odds ratio
    17.43834

    Your Answer Score Explanation
    around 0.05
    around 0.01 Correct 1.00
    around 0.20
    less than .001
    Total 1.00 / 1.00
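
    The "treat missing values as 0" step can also be made explicit before building the table. The sketch below uses invented counts and keeps the `>= 1` cutoff from the code above, although the question text says "more than 1"; swap in `> 1` to follow the wording literally.

```r
# Invented fan and "funny" compliment counts with some missing values.
fans  <- c(0, 3, NA, 1, 5, 0, 2, NA)
funny <- c(NA, 4, 0, 1, 2, 0, NA, 1)

# Treat missing values as 0, as the question asks.
fans[is.na(fans)]   <- 0
funny[is.na(funny)] <- 0

# Fixing the factor levels guarantees a 2 x 2 table even if one
# category happens to be empty. Cutoff mirrors the post's code.
has_fans  <- factor(fans  >= 1, levels = c(FALSE, TRUE))
has_funny <- factor(funny >= 1, levels = c(FALSE, TRUE))

tab <- table(has_fans, has_funny)
tab

result <- fisher.test(tab)
result$p.value    # p-value of the test of independence
```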
