Dataset

The data file McDonalds-Yelp-Sentiment-DFE.csv was obtained from https://www.crowdflower.com

Description of Variables

The dataset contains 1525 observations of 10 variables:

  1. unit_id: ID of the record
  2. golden: always FALSE
  3. unit_state: always finalized
  4. trusted_judgments: always 3
  5. last_judgment_at: time of the last judgment. Example: 2/21/15 0:36
  6. policies_violated: the types of policy violated, one per line. Example: RudeService
  7. policies_violated.confidence: the confidence of each policy violation, one value per line. Example: 1.0\n0.6667\n0.6667
  8. city: city name
  9. policies_violated_gold: always NA
  10. review: text of the review

Problem Statement

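This is a sentiment analysis of negative McDonald’s reviews. Contributors were given reviews culled from low-rated McDonald’s locations in random metro areas and asked to classify why the locations received low reviews. The options appear in the data as the labels of the ‘policies_violated’ column: RudeService, OrderProblem, Filthy, SlowService, BadFood, ScaryMcDs, MissingFood, Cost and na.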

Data Pre-Processing

The column names of the data:

##  [1] "X_unit_id"                    "X_golden"                    
##  [3] "X_unit_state"                 "X_trusted_judgments"         
##  [5] "X_last_judgment_at"           "policies_violated"           
##  [7] "policies_violated.confidence" "city"                        
##  [9] "policies_violated_gold"       "review"
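
The ‘X_’ prefix on the first five names is not in the variable list above. A likely explanation is that the CSV headers start with an underscore and read.csv() repairs them. The sketch below uses a made-up two-row CSV (not the real file) to show that behaviour; the object names csv_text and reviews are hypothetical.

```r
# Toy two-row CSV standing in for McDonalds-Yelp-Sentiment-DFE.csv (made-up content).
csv_text <- paste(
  "_unit_id,_golden,policies_violated,city",
  "679455653,FALSE,RudeService,Atlanta",
  "679455654,FALSE,SlowService,Atlanta",
  sep = "\n"
)

# check.names = TRUE (the default) passes the headers through make.names(),
# which prepends "X" to names that are not syntactically valid, such as "_unit_id".
# stringsAsFactors = TRUE matches the factor columns seen in the outputs of this report.
reviews <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(names(reviews))
```

For the real file, the call would presumably be read.csv("McDonalds-Yelp-Sentiment-DFE.csv", stringsAsFactors = TRUE), assuming the file sits in the working directory.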

Summarize the data. The content of the ‘review’ column is too long, so we leave it out of the summary.

##    X_unit_id          X_golden          X_unit_state  X_trusted_judgments
##  Min.   :679455653   Mode :logical   finalized:1525   Min.   :3          
##  1st Qu.:679456040   FALSE:1525                       1st Qu.:3          
##  Median :679456428   NA's :0                          Median :3          
##  Mean   :679456981                                    Mean   :3          
##  3rd Qu.:679456819                                    3rd Qu.:3          
##  Max.   :679501402                                    Max.   :3          
##                                                                          
##     X_last_judgment_at    policies_violated policies_violated.confidence
##  2/21/15 0:36: 108     na          :295     1            :679           
##  2/21/15 0:38:  81     RudeService :177     1.0\n1.0     : 93           
##  2/21/15 0:22:  72     SlowService :127                  : 54           
##  2/21/15 0:29:  72     OrderProblem:116     0.6667       : 18           
##  2/21/15 0:25:  54     BadFood     :101     1.0\n0.6667  : 17           
##  2/21/15 0:13:  45     ScaryMcDs   : 71     1.0\n1.0\n1.0: 11           
##  (Other)     :1093     (Other)     :638     (Other)      :653           
##           city     policies_violated_gold
##  Las Vegas  :409   Mode:logical          
##  Chicago    :219   NA's:1525             
##  Los Angeles:167                         
##  New York   :165                         
##  Atlanta    :130                         
##  Houston    :105                         
##  (Other)    :330
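
One way to produce such a summary without the ‘review’ column is sketched below on a small made-up data frame; the names reviews and summary_input are hypothetical.

```r
# Made-up stand-in for the full data set.
reviews <- data.frame(
  city = factor(c("Atlanta", "Atlanta", "Chicago")),
  policies_violated = factor(c("RudeService", "na", "RudeService")),
  review = c("long text ...", "long text ...", "long text ..."),
  stringsAsFactors = FALSE
)

# Keep every column except 'review', then summarize what is left.
summary_input <- reviews[, setdiff(names(reviews), "review")]
print(summary(summary_input))
```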

The first rows of the data:

##   X_unit_id X_golden X_unit_state X_trusted_judgments X_last_judgment_at
## 1 679455653    FALSE    finalized                   3       2/21/15 0:36
## 2 679455654    FALSE    finalized                   3       2/21/15 0:27
## 3 679455655    FALSE    finalized                   3       2/21/15 0:26
## 4 679455656    FALSE    finalized                   3       2/21/15 0:27
## 5 679455657    FALSE    finalized                   3       2/21/15 0:27
## 6 679455658    FALSE    finalized                   3       2/21/15 0:13
##                   policies_violated policies_violated.confidence    city
## 1 RudeService\nOrderProblem\nFilthy          1.0\n0.6667\n0.6667 Atlanta
## 2                       RudeService                            1 Atlanta
## 3         SlowService\nOrderProblem                     1.0\n1.0 Atlanta
## 4                                na                       0.6667 Atlanta
## 5                       RudeService                            1 Atlanta
## 6              BadFood\nSlowService               0.7111\n0.6444 Atlanta
##   policies_violated_gold
## 1                     NA
## 2                     NA
## 3                     NA
## 4                     NA
## 5                     NA
## 6                     NA

And the content of the ‘review’ column:

## [1] I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care.                                                                                                                                                                                          
## [2] Terrible customer service. ξI came in at 9:30pm and stood in front of the register and no one bothered to say anything or help me for 5 minutes. ξThere was no one else waiting for their food inside either, just outside at the window. ξ I left and went to Chickfila next door and was greeted before I was all the way inside. This McDonalds is also dirty, the floor was covered with dropped food. Obviously filled with surly and unhappy workers.
## 1518 Levels: "And on the seventh day, he forsook rest, but opened THE FIRST McDONALDS to quest his famished soul, yea." ξI may be exaggerating on the power of Mickey Ds, but only cause they can sue. Anyway, this particular McDonalds is inside the Wal Mart on Forest Ln., and forms a nice reprieve before or after your purchases. No drive-thru, but there is another McDonalds just down the road, on Abrams, that does. Anywho, don't go here often, folks. It's nice for a forgotten lunch or a celebratory match, but don't make it a habit. Deuces, doc. ...

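Some fields carry no useful information: X_golden, X_unit_state, X_trusted_judgments and policies_violated_gold are constant, while X_unit_id and X_last_judgment_at only identify the record and the time of judgment. We remove these six fields.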

Let’s look at the dimensions and column names after cleaning:

## [1] 1525    4
## [1] "policies_violated"            "policies_violated.confidence"
## [3] "city"                         "review"
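
The cleaning step can be sketched as a simple column selection. The two rows below are copied from the first rows shown earlier, with placeholder review text; the names keep and reviews_clean are hypothetical.

```r
# Two rows with the same ten columns as the raw data (review text is a placeholder).
reviews <- data.frame(
  X_unit_id = c(679455653, 679455654),
  X_golden = c(FALSE, FALSE),
  X_unit_state = c("finalized", "finalized"),
  X_trusted_judgments = c(3, 3),
  X_last_judgment_at = c("2/21/15 0:36", "2/21/15 0:27"),
  policies_violated = c("RudeService\nOrderProblem\nFilthy", "RudeService"),
  policies_violated.confidence = c("1.0\n0.6667\n0.6667", "1"),
  city = c("Atlanta", "Atlanta"),
  policies_violated_gold = c(NA, NA),
  review = c("first review text", "second review text"),
  stringsAsFactors = FALSE
)

# Keep only the four informative columns, in their original order.
keep <- c("policies_violated", "policies_violated.confidence", "city", "review")
reviews_clean <- reviews[, keep]

print(dim(reviews_clean))
print(names(reviews_clean))
```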

Data Exploration

Summarize the data:

##     policies_violated policies_violated.confidence          city    
##  na          :295     1            :679            Las Vegas  :409  
##  RudeService :177     1.0\n1.0     : 93            Chicago    :219  
##  SlowService :127                  : 54            Los Angeles:167  
##  OrderProblem:116     0.6667       : 18            New York   :165  
##  BadFood     :101     1.0\n0.6667  : 17            Atlanta    :130  
##  ScaryMcDs   : 71     1.0\n1.0\n1.0: 11            Houston    :105  
##  (Other)     :638     (Other)      :653            (Other)    :330

Data Visualization

How many records does each city have?
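
A minimal sketch of the count behind this question, using a made-up vector of city names rather than the real ‘city’ column; the names city and city_counts are hypothetical.

```r
# Made-up city values standing in for the 'city' column.
city <- c("Las Vegas", "Chicago", "Las Vegas", "Atlanta", "Chicago", "Las Vegas")

# Count the records per city and order the cities from most to fewest records.
city_counts <- sort(table(city), decreasing = TRUE)
print(city_counts)
```

Passing the result to barplot(city_counts, las = 2) would draw the counts as a bar chart with rotated axis labels.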

Creating New Features

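There is a relationship between ‘policies_violated’ and ‘policies_violated.confidence’: each label listed in ‘policies_violated’ has a matching confidence value in ‘policies_violated.confidence’. We should create one new field per label in ‘policies_violated’, filled with the corresponding value from ‘policies_violated.confidence’.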

First, we remove the missing values. Records where policies_violated is ‘na’ or the empty string ‘’ carry no information, so we remove them. The dimensions of the two groups of removed records are:

## [1] 295   4
## [1] 54  4
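
The removal can be sketched as a logical filter. The data frame below is made up, and the names is_missing and reviews_labeled are hypothetical.

```r
# Made-up rows: two labeled records, one 'na' record and one empty record.
reviews <- data.frame(
  policies_violated = c("RudeService", "na", "", "SlowService\nOrderProblem"),
  city = c("Atlanta", "Atlanta", "Chicago", "Chicago"),
  stringsAsFactors = FALSE
)

# Count each kind of missing label before dropping them.
print(sum(reviews$policies_violated == "na"))
print(sum(reviews$policies_violated == ""))

# Keep only the rows whose label is neither 'na' nor empty.
is_missing <- reviews$policies_violated %in% c("na", "")
reviews_labeled <- reviews[!is_missing, ]
print(dim(reviews_labeled))
```

Note that the original row names are kept after filtering, which matches the gap in the row numbers (1, 2, 3, 5, ...) in the later outputs.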

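Let’s see the result. The new fields are first initialized to 0; the ‘na’ field is then dropped and the remaining fields are filled with the confidence values: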

##  [1] "policies_violated"            "policies_violated.confidence"
##  [3] "city"                         "review"                      
##  [5] "RudeService"                  "OrderProblem"                
##  [7] "Filthy"                       "SlowService"                 
##  [9] "BadFood"                      "ScaryMcDs"                   
## [11] "MissingFood"                  "Cost"                        
## [13] "na"
##   RudeService OrderProblem Filthy SlowService BadFood ScaryMcDs
## 1           0            0      0           0       0         0
## 2           0            0      0           0       0         0
## 3           0            0      0           0       0         0
## 5           0            0      0           0       0         0
## 6           0            0      0           0       0         0
## 7           0            0      0           0       0         0
##   MissingFood Cost na
## 1           0    0  0
## 2           0    0  0
## 3           0    0  0
## 5           0    0  0
## 6           0    0  0
## 7           0    0  0
##  [1] "policies_violated"            "policies_violated.confidence"
##  [3] "city"                         "review"                      
##  [5] "RudeService"                  "OrderProblem"                
##  [7] "Filthy"                       "SlowService"                 
##  [9] "BadFood"                      "ScaryMcDs"                   
## [11] "MissingFood"                  "Cost"
##   RudeService OrderProblem Filthy SlowService BadFood ScaryMcDs
## 1         1.0       0.6667 0.6667           0       0         0
## 2           1            0      0           0       0         0
## 3           0          1.0      0         1.0       0         0
## 5           1            0      0           0       0         0
## 6           0            0      0      0.6444  0.7111         0
## 7           0            0      0      0.6562       0    0.6562
##   MissingFood Cost
## 1           0    0
## 2           0    0
## 3           0    0
## 5           0    0
## 6           0    0
## 7           0    0
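
The construction of these fields can be sketched as follows, assuming the j-th label in ‘policies_violated’ pairs with the j-th value in ‘policies_violated.confidence’ (the rows above are consistent with this). The three rows are copied from the data shown earlier; the names labels, scores and categories are hypothetical, and unlike the output above this sketch stores the confidences as numbers rather than text.

```r
# Three rows copied from the first rows of the data.
reviews <- data.frame(
  policies_violated = c("RudeService\nOrderProblem\nFilthy", "RudeService", "BadFood\nSlowService"),
  policies_violated.confidence = c("1.0\n0.6667\n0.6667", "1", "0.7111\n0.6444"),
  stringsAsFactors = FALSE
)

# Split the newline-separated labels and confidences into per-row vectors.
labels <- strsplit(reviews$policies_violated, "\n", fixed = TRUE)
scores <- strsplit(reviews$policies_violated.confidence, "\n", fixed = TRUE)

# Create one new field per distinct label, initialized to 0.
categories <- unique(unlist(labels))
for (category in categories) {
  reviews[[category]] <- 0
}

# Fill each field with the confidence found at the same position as its label.
for (i in seq_len(nrow(reviews))) {
  for (j in seq_along(labels[[i]])) {
    reviews[i, labels[[i]][j]] <- as.numeric(scores[[i]][j])
  }
}

print(reviews[, categories])
```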

Next, we summarize the new fields for each city.

Let’s see the result:

##   RudeService OrderProblem    Filthy SlowService   BadFood ScaryMcDs
## 1   0.8370684    0.8590107 0.9361200   0.8636744 0.8655926 0.7032000
## 2   0.8856359    0.8812910 0.8800533   0.8869545 0.8445385 0.8957571
## 3   0.8353583    0.8759167 0.7656571   0.8304500 0.9057643 0.8073900
## 4   0.8446567    0.8720238 0.9164625   0.8834364 0.8355577 0.7997000
## 5   0.8641813    0.8832980 0.8514467   0.8560000 0.8407037 0.8460267
## 6   0.8886724    0.8328750 1.0000000   0.8966667 0.9103333 0.6654000
##   MissingFood      Cost      city
## 1   0.7445800 0.8905667   Atlanta
## 2   0.7487538 0.7504200 Las Vegas
## 3   0.6752333 0.7980600    Dallas
## 4   0.6909000 0.7322167  Portland
## 5   0.6639600 0.8399625   Chicago
## 6   0.8921333 0.8996000 Cleveland
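
The report does not show how this per-city table was computed. One reading that fits the values (none of them is close to 0) is the mean of the non-zero confidences per city and field. The sketch below implements that assumption on made-up numbers; mean_nonzero, features and by_city are hypothetical names.

```r
# Made-up feature table: one row per review, 0 means the label was not assigned.
features <- data.frame(
  city = c("Atlanta", "Atlanta", "Atlanta", "Chicago"),
  RudeService = c(1.0, 1.0, 0, 0.6667),
  SlowService = c(0, 0.6444, 1.0, 0),
  stringsAsFactors = FALSE
)

# Assumption: average only the reviews where the label was assigned (value > 0).
mean_nonzero <- function(x) {
  x <- x[x > 0]
  if (length(x) == 0) NA_real_ else mean(x)
}

# One row per city, one averaged column per feature.
by_city <- aggregate(
  features[, c("RudeService", "SlowService")],
  by = list(city = features$city),
  FUN = mean_nonzero
)
print(by_city)
```

A city with no review carrying a given label gets NA here rather than 0, so that absent labels are not mistaken for low confidence.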

Modeling:

In-progress!