Archive for February, 2015

Playing with R and SQLite

There are multiple questions:

  1. The operating regions are indicated by region_id. Generate a report of the average hourly_charge in each operating region as well as the overall average.

  2. Assuming that a booking is completed if it is not cancelled by the customer and has no reschedule events, generate a report based on the calendar week (running Sun-Sat) of the number of bookings done, number of bookings done using coupons, total hours booked, and number of bookings which were cancelled by the customer.

  3. Recurring bookings are bookings which happen on a regularly scheduled basis and are indicated by recurring_id and a frequency (freq) indicating how many weeks pass between each booking in the series. Determine the distribution of bookings based on the frequency of the recurring booking to which they belong across the days of the week on which they were completed.

  4. Say we have a problem with customers canceling and rescheduling bookings. Assuming all the bookings are from different users, pull metrics which you believe would give a general profile of these problem users.

Input:

  • Question asked before
  • handy_bookings_train.csv and handy_bookings_test.csv

Output:

  • This document
#Set up
setwd("~/R projects/handy")
#load sqldf to be able to run SQL queries on the csv files
#the R library uses SQLite under the hood
library(sqldf)
#Some graphs
library(ggplot2)
#load data
ds.train<-read.csv(file="data/handy_bookings_train.csv",header=T,sep=",",na.strings=c("NULL",""))
ds.test<-read.csv(file="data/handy_bookings_test.csv",header=T,sep=",",na.strings=c("NULL",""))

#Question 1

The operating regions are indicated by region_id. Generate a report of the average hourly_charge in each operating region as well as the overall average.

#average per region
avg.region<-sqldf("SELECT region_id as region,avg(hourly_charge) as average_charge  FROM 'ds.train' GROUP BY region_id")
## Loading required package: tcltk
#overall average
#I wasn't sure if you wanted the overall average weighted by region or just the overall average, so I wrote both queries
avg.overall<-sqldf("select avg(hourly_charge) from 'ds.train'")
#average overall weighted by region
avg.region.overall<-sqldf("select avg(t.average_charge) from (SELECT avg(hourly_charge) as average_charge from 'ds.train' group by region_id) t")
#This outputs the same result, but it's kind of cheating since I'm using R and not pure SQL
#avg.region.overall<-sqldf("select avg(average_charge) FROM 'avg.region'")
#If I had to pick one, I think it makes more sense to use the avg.overall one.
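#To make the difference concrete, here is a tiny toy illustration (made-up numbers, not from the dataset):
#two bookings at 1000 in region A and one booking at 4000 in region B
toy<-data.frame(region=c("A","A","B"),hourly_charge=c(1000,1000,4000))
mean(toy$hourly_charge)                          #overall average: 2000
mean(tapply(toy$hourly_charge,toy$region,mean))  #average of the per-region averages: 2500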

avg.region
##    region average_charge
## 1       4       2981.250
## 2       5       2853.342
## 3       6       2838.129
## 4       8       2969.583
## 5       9       2886.047
## 6      10       2936.000
## 7      11       2928.125
## 8      12       2945.614
## 9      13       3000.000
## 10     14       3146.032
## 11     15       3008.125
## 12     16       3108.333
## 13     17       3088.298
## 14     18       3077.778
## 15     20       3165.385
## 16     21       3003.175
## 17     22       3144.000
## 18     23       3053.704
## 19     24       3108.730
## 20     25       3068.831
## 21     26       3108.511
## 22     27       3047.826
## 23     28       2929.592
## 24     29       3095.455
## 25     31       3032.895
## 26     32       2906.061
## 27     33       3013.115
## 28     34       2868.750
## 29     35       3059.677
## 30     36       1007.380
## 31     38       3098.485
## 32     41       1000.000
## 33     42       1000.000
ggplot(data=avg.region,aes(x=region,y=average_charge))+geom_bar(stat="identity")+scale_x_continuous(breaks=seq(min(avg.region$region),max(avg.region$region),1))+xlab("Region ID")+ylab("Average Charge")+ggtitle("Average charge per region")

avg.overall
##   avg(hourly_charge)
## 1           2674.819
avg.region.overall
##   avg(t.average_charge)
## 1              2832.674

#Question 2

Assuming that a booking is completed if it is not cancelled by the customer and has no reschedule events, generate a report based on the calendar week (running Sun-Sat) of the number of bookings done, number of bookings done using coupons, total hours booked, and number of bookings which were cancelled by the customer.

#The way I understood the question, it asks for the day of the week (Sun-Sat) within each week number (of the year). I am not sure that is exactly what you meant.
#Additionally, it wasn't clear to me whether the report should use the day of the week the booking was added or the day it starts. I figured that added made more sense.
#This question is kind of tricky because I did not know exactly what one row represents and what each column means. I guess that was part of the exercise, which is why I didn't ask.
#I noticed that some cancellation dates were after the start date, and date_start was always after date_added. That made me assume that each row is one customer, date_added is when they placed their order, date_start is when the service first starts, and freq indicates the frequency of the task. Therefore, a customer can cancel after the start if the frequency is more than one.
#I assumed that a booking was cancelled when it had a cancellation date, and not rescheduled when reschedule_events_count was 0.
#I also assumed this report is meant to describe "what has been done", which means the booked hours should only count bookings that weren't cancelled.
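#A quick sanity check of those assumptions (just a sketch; it assumes the date columns are ISO-formatted strings so they compare correctly as text)
sanity.check<-sqldf("select count(customer_cancelation_date) as cancelled,
  count(case when reschedule_events_count>0 then 1 end) as rescheduled,
  count(case when customer_cancelation_date > date_start then 1 end) as cancelled_after_start,
  count(*) as total
  from 'ds.train'")
sanity.check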




report<-sqldf("select cast (strftime('%W', date_start) as integer) as weeknumb, case cast (strftime('%w', date_added) as integer)
  when 0 then 'Sunday'
  when 1 then 'Monday'
  when 2 then 'Tuesday'
  when 3 then 'Wednesday'
  when 4 then 'Thursday'
  when 5 then 'Friday'
  when 6 then 'Saturday'
  else 'Unknown' end as dow,
  count(case when (reschedule_events_count=0 OR customer_cancelation_date is null) then 1 end) as booking_done,
  count(case when (reschedule_events_count=0 OR customer_cancelation_date is null AND coupon is not null) then 1 end) as booking_done_coupon,
  sum(case when (reschedule_events_count=0 OR customer_cancelation_date is null) then hrs else 0 end) as booking_hours,
  count(case when (customer_cancelation_date is not null AND reschedule_events_count>0) then 1 end) as booking_cancelled
  from 'ds.train'
  group by weeknumb,cast (strftime('%w', date_added) as integer) order by weeknumb, cast (strftime('%w', date_start) as integer)")

##Please note that I could definitely have broken this query down into string variables to avoid repeating myself. However, the instructions said to use SQL, so I assumed I should proceed as if I could only run SQL in a terminal.
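##For illustration only (kept commented out to stay in pure SQL): the repeated pieces could be stored in R strings and pasted into the query, e.g.
#dow.case<-"case cast(strftime('%w', date_added) as integer) when 0 then 'Sunday' when 1 then 'Monday' when 2 then 'Tuesday' when 3 then 'Wednesday' when 4 then 'Thursday' when 5 then 'Friday' when 6 then 'Saturday' end"
#done.cond<-"(reschedule_events_count=0 OR customer_cancelation_date is null)"
#report<-sqldf(paste0("select cast(strftime('%W', date_start) as integer) as weeknumb, ",dow.case," as dow, count(case when ",done.cond," then 1 end) as booking_done from 'ds.train' group by weeknumb, dow"))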
report
##    weeknumb       dow booking_done booking_done_coupon booking_hours
## 1        48    Sunday            7                   7          19.0
## 2        48  Thursday          108                  97         287.5
## 3        48  Saturday           57                  51         145.0
## 4        48    Monday          291                 275         771.0
## 5        48 Wednesday          141                 128         390.0
## 6        48   Tuesday          259                 238         658.5
## 7        48    Friday           89                  72         236.5
## 8        49    Monday          261                 240         716.5
## 9        49  Saturday          163                 148         449.5
## 10       49    Sunday          162                 149         448.0
## 11       49   Tuesday          316                 286         844.5
## 12       49  Thursday          243                 215         695.5
## 13       49    Friday          221                 202         617.5
## 14       49 Wednesday          250                 224         688.0
## 15       50    Sunday          151                 139         422.5
## 16       50    Monday          295                 273         815.0
## 17       50   Tuesday          267                 248         742.5
## 18       50 Wednesday          215                 192         596.5
## 19       50  Thursday          148                 125         431.0
## 20       50  Saturday          106                  96         314.0
## 21       50    Friday          119                 107         344.0
##    booking_cancelled
## 1                  0
## 2                  7
## 3                  2
## 4                  5
## 5                  4
## 6                  6
## 7                  4
## 8                  6
## 9                  3
## 10                 4
## 11                11
## 12                 6
## 13                18
## 14                10
## 15                 4
## 16                 9
## 17                13
## 18                 9
## 19                 3
## 20                 1
## 21                 6

#Question 3

Recurring bookings are bookings which happen on a regularly scheduled basis and are indicated by recurring_id and a frequency (freq) indicating how many weeks pass between each booking in the series. Determine the distribution of bookings based on the frequency of the recurring booking to which they belong across the days of the week on which they were completed.

#On this question, I am really confused. Do you want me to give you a classification model using machine learning (I see there are train and test data)?
#Do you want me to give you the shape of the data (normal distribution, linear...)?
#Do you want me to give you some descriptive information such as the range, mean, variance...?
#Since the first line says SQL, I will just give you some general information, but please tell me if you want more. I can definitely do it if that's what you are looking for.


distribution<-sqldf("select cast (avg(freq) as integer) as freq,case cast (strftime('%w', date_start) as integer)
  when 0 then 'Sunday'
  when 1 then 'Monday'
  when 2 then 'Tuesday'
  when 3 then 'Wednesday'
  when 4 then 'Thursday'
  when 5 then 'Friday'
  when 6 then 'Saturday'
  else 'Unknown' end as dow,
  count(*) as freq_count
  from 'ds.train' where (recurring_id is not null AND freq is not NULL) AND (reschedule_events_count=0 OR customer_cancelation_date is null)
  group by cast (strftime('%w', date_start) as integer),freq order by cast (strftime('%w', date_added) as integer)")


ggplot(data=distribution,aes(x=as.factor(freq),y=freq_count))+geom_bar(stat="identity")+facet_wrap(~dow)+xlab("Frequency")+ylab("Number of completed bookings")+ggtitle("Number of completed bookings per frequency across the days of the week")

#Question 4

Say we have a problem with customers canceling and rescheduling bookings. Assuming all the bookings are from different users, pull metrics which you believe would give a general profile of these problem users.

#We can do many things to try to find correlations in the data
#One thing we could look at is the dispersion of the cancellation dates
#Did they usually cancel on a particular day of the week?
#We could use machine learning again with a binary response (cancelled/rescheduled vs. normal), or look for a linear correlation using the Pearson product-moment correlation, for example.
#However, the way I understood the challenge was to only use pure SQL (without help from any programming language, as if you were in a SQL terminal) for each question, which highly limits the possibilities.
#What we want to know is why these users are different from the ones who don't cancel or reschedule
#Therefore, our metrics should be able to compare both types of users.
#Moreover, there is the notion of time. Was there a special event that made customers leave? A timeline of the percentage of users leaving vs. total users would be interesting.
#Also, it would be interesting to look at acquisition vs. retention. Are the people who cancel first-time users, or people who have had services every week (looking at freq)?

metric.retention<-sqldf("select (cast (strftime('%W', date_added) as integer) +(cast (strftime('%w', date_added) as integer))/7.0) as date_event, (100.0*(count(case when(reschedule_events_count>0 AND customer_cancelation_date is not null) then 1 end))/count(*)) as perc,
             (avg(case when(reschedule_events_count>0 AND customer_cancelation_date is not null) then freq  end)) as freq_leaving,
           (avg(case when(reschedule_events_count=0 OR customer_cancelation_date is null) then freq end)) as freq_not_leaving

            from 'ds.train' group by cast (strftime('%W', date_added) as integer),cast (strftime('%w', date_added) as integer) order by cast (strftime('%W', date_added) as integer), cast (strftime('%w', date_added) as integer)")

ggplot(data=metric.retention,aes(x=date_event,y=perc))+geom_point()+scale_x_continuous(breaks=seq(min(metric.retention$date_event),max(metric.retention$date_event),0.5))+geom_smooth()+xlab("Time")+ylab("Percentage of uncompleted bookings")+ggtitle("Percentage of uncompleted bookings across time")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

#We can see that there is a peak during week 49 (Saturday)
#It would be interesting to check whether we had a server problem or some other cause that the data itself wouldn't tell us


ggplot(data=metric.retention,aes(x=date_event,y=freq_not_leaving))+geom_point()+scale_x_continuous(breaks=seq(min(metric.retention$date_event),max(metric.retention$date_event),0.5))+geom_line()+geom_point(aes(x=date_event,y=freq_leaving,color="red"))+geom_line(aes(x=date_event,y=freq_leaving,color="red"))+xlab("Time")+ylab("Average frequency")+ggtitle("Average frequency of uncompleted (red) vs completed booking")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).

#We can see that while the average frequency for customers who stay varies little, the variation for those who leave is much higher. We can conclude that the ones who stay usually have a freq of 3 while the ones who leave have other frequencies. Therefore, we should promote a frequency of 3 to retain them.

#Now, let's look at other interesting variables such as whether or not they have a coupon (a deeper analysis of which coupons trigger retention could be done as well), whether they clicked for extra home cleaning, the total charge, the region, whether there was a campaign, the number of providers, whether a provider was requested, and whether it was peak priced.
#If I had to do it only with SQL, I would look at the average and the variance of each variable for the leaving group and the staying group. This would give me a good global idea of whether there is any odd behavior and which metrics make the most sense.
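#A sketch of that idea in pure SQL for one variable (hrs); the variance is computed as avg(x*x)-avg(x)*avg(x) since SQLite has no built-in variance function
profile.hrs<-sqldf("select (case when customer_cancelation_date is not null AND reschedule_events_count>0 then 'leaving' else 'staying' end) as grp,
  avg(hrs) as avg_hrs,
  avg(hrs*hrs)-avg(hrs)*avg(hrs) as var_hrs,
  count(*) as n
  from 'ds.train' group by grp")
profile.hrs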

#However, I believe that the best way to know whether someone is going to leave or not is to do predictive analysis. Therefore, I'll be using a bit of machine learning.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Cleaning the data: removing meaningless variables
rf.train<- ds.train %>% mutate(has_coupon=as.factor(as.numeric(is.na(coupon))),total_charge=as.numeric(hrs*hourly_charge),cancel=as.factor(as.numeric(!is.na(customer_cancelation_date) & reschedule_events_count>0))) %>% select_("-row_id","-customer_cancelation_date","-date_start","-date_added","-hrs","-hourly_charge","-coupon","-recurring_id","-reschedule_events_count","-user_bookings_count","-user_cancelled_bookings_count","-region_id")

rf.test<- ds.test %>% mutate(has_coupon=as.factor(as.numeric(is.na(coupon))),total_charge=as.numeric(hrs*hourly_charge),cancel=as.factor(as.numeric(!is.na(customer_cancelation_date) & reschedule_events_count>0))) %>% select_("-row_id","-customer_cancelation_date","-date_start","-date_added","-hrs","-hourly_charge","-coupon","-recurring_id","-reschedule_events_count","-user_bookings_count","-user_cancelled_bookings_count","-region_id")

rf.test$freq[is.na(rf.test$freq)]<-0
rf.train$freq[is.na(rf.train$freq)]<-0



rf.train$num_providers<-as.factor(rf.train$num_providers)
rf.train$has_campaign<-as.factor(rf.train$has_campaign)
rf.train$clicked_extra_home_cleaning<-as.factor(rf.train$clicked_extra_home_cleaning)
rf.train$peak_priced<-as.factor(rf.train$peak_priced)

rf.test$num_providers<-as.factor(rf.test$num_providers)
rf.test$has_campaign<-as.factor(rf.test$has_campaign)
rf.test$clicked_extra_home_cleaning<-as.factor(rf.test$clicked_extra_home_cleaning)
rf.test$peak_priced<-as.factor(rf.test$peak_priced)


clf<-randomForest(cancel~.,data=rf.train,na.action=na.omit,importance=T)
clf$confusion
##      0 1 class.error
## 0 3948 0           0
## 1   52 0           1
#We can see that the confusion matrix is really bad; it is unnecessary to look at the predictions, the classifier is just bad
#pred<-predict(clf,rf.test %>% select_("-cancel"))
#varImpPlot(clf)
#Let's try with adaboost
############      AdaBoost   ###############
#I picked AdaBoost because it is supposed to be a good mix of accuracy and speed for high-dimensional predictors with NAs.
library(ada)
## Loading required package: rpart
#Helper function that fits an AdaBoost classifier on a train/test/validation split
#Input: dataframe, name of y in dataframe, type of boosting you want to apply, number of iterations and depth
#Output: classifier
clfAda<-function(data=ds,y,adatype="gentle",adaiter=70,cdepth=14){
  n <- nrow(data)
  indy<-which(names(data)==y)
  ind <- sample(1:n)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  trainval <- ceiling(n * .5)
  testval <- ceiling(n * .3)
  train <- data[ind[1:trainval],]
  test <- data[ind[(trainval + 1):(trainval + testval)],]
  valid <- data[ind[(trainval + testval + 1):n],]
  control <- rpart.control(cp = -1, maxdepth = cdepth,maxcompete = 1,xval = 0)
  clf <- ada(fmla, data = train, test.x = test[,-indy], test.y = test[,indy], type =adatype, control = control, iter = adaiter)
  clf <- addtest(clf, valid[,-indy], valid[,indy])
  clf
}

##Now we compute our AdaBoost model and produce the prediction and confusion matrix
#Compute the classifier, the prediction, the confusion matrix and the variable importance
  xtest<-rf.test %>% select_("-cancel")
  ytest<-rf.test %>% select_("cancel")
  #classifier
  ada.clf<-clfAda(data=rf.train,y="cancel")
  #prediction
  ada.pred<-predict(ada.clf,xtest)
  #confusion matrix
  ada.conf<-table(ada.pred,ytest[,1])
  #var importance
  ada.vip<-varplot(ada.clf,plot.it=FALSE,type='scores')
ada.clf$confusion
##           Final Prediction
## True value    0
##          0 1969
##          1   31
#Again the confusion matrix is really bad; we can't predict whether a booking is cancelled
#We can't easily predict it
#Even if we could, the testing set actually consists almost entirely of completed bookings, which gives us an unbalanced set
#I will then just look at the classification tree
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
rf.train$freq<-as.factor(rf.train$freq)
plot(ctree(cancel~.,rf.train))

#We can see that the best metric for knowing whether a booking will be completed or not is whether we have a provider. Beyond that,
#we could go further and look at the total charge, frequency, has_campaign, clicked_extra_home_cleaning, peak_priced, is_provider_requested, has_coupon.
#However, this is mostly descriptive analysis, which gives less accurate results than the classification done above.

Playing with R – A bit of machine learning

##Introduction
The question is: Tell us something interesting about the ping backs we receive from videos.
Input:
*Question asked before
*tsv data file
*pdf file with data format
Output:
*This document

I picked R for this analysis as it appeared to me that this is mainly an exploratory data analysis, and R Markdown + ggplot2 are very convenient for that in my opinion.

We will be proceeding in this order:
*Import the data
*Clean the data
*Explore the data
*Try to find a predictive model
*Explore the predictive features that were found important
*Make assumptions from the findings

###1. Import the data

Quite easy here. The data being small, I can just use read.csv without needing data.table.
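For a much bigger file, data.table::fread would be the usual alternative; a minimal sketch with the same file (kept commented out since read.csv is used below):

#fread returns a data.table and keeps strings as character by default
#ds<-data.table::fread("data/support_ping_hw_ua.tsv",header=F,sep="\t")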

#library import
library(party)
library(MASS)
library(e1071)
library(softImpute)
library(ggplot2)
library(plyr)
library(dplyr)
library(gridExtra)
library(chron)
library(reshape2)
library(choroplethr)
library(zipcode)
library(choroplethrMaps)
library(stringi)
library(httr)
library(tools)
library(caret)
library(randomForest)
library(rpart)
library(ada)
library(ade4)
library(mlbench)
library(rJava)
library(FSelector)
library(gbm)
library(verification)
library(Amelia)
library(rworldmap)
ds<-read.csv("data/support_ping_hw_ua.tsv",header=F,sep="t",stringsAsFactors=F)

##2. Clean the data

#Name the columns according to the PDF, plus the user agent details added by the Python script
names(ds)<-c("id","date","user.agent","browser","country","domain","page.url","title","video.url","video.id","ping.embed","ping.play","ping.complete","ua.device.brand","ua.device.model","ua.device.family","ua.os.major","ua.os.patch_minor","ua.os.minor","ua.os.family","ua.os.patch","ua.ua.major","ua.ua.minor","ua.ua.family","ua.ua.patch")

##Do some basic lookups on the fields to see if they are useful
#I am omitting some parts as it would make this unnecessarily long
#It was said to be from the website
#Therefore let's check if all the data are from it
#table(ds[,"domain"])
#ok, the table tells us that all the data are from the same domain
#Therefore, it doesn't give us any useful information, let's remove it.
ds=ds[-which(names(ds)=="domain")]
#urls give us lots of useful information such as
#port, hostname of the video url, extension, the path to the video, query (if there is one), params,
#username, fragment, password, scheme as defined in http://tools.ietf.org/html/rfc1808.html
#let's break down the page url and video url

#This is a helper function to break down the URLs
#Input: ds => data frame and field => field of data frame that contain url
#Output: a data frame of 10 variables of format field.variable where variable is:
#scheme,hostname,port,path,query,params,fragment,username,password,ext
geturl<-function(ds,field){
  url.struct<-sapply(ds[,field],parse_url)
  url.struct[sapply(url.struct, is.null)] <- NA 
  url.df<-data.frame(matrix(unlist((url.struct)), nrow=ncol(url.struct), byrow=T),stringsAsFactors=F)
  names(url.df)<-rownames(url.struct)
  url.df<-url.df %>% mutate(ext=(file_ext(as.character(path))))
  names(url.df)<-sapply(names(url.df),function(x) paste0(field,".",x))
  url.df

}

ds<-cbind.data.frame(ds,geturl(ds,"page.url"),geturl(ds,"video.url"))
ds$date<-as.numeric(as.POSIXct(ds$date))
#We tell R that empty string is NA
ds[ds==''] <- NA 
#One interesting thing that we can now extract from the data is whether it is mobile or not
#It can be useful because the dataset is very small
ds<-ds%>%mutate(mobile=(as.numeric(!is.na(ua.device.brand))))
##A quick look at the PDF, together with this command,
#table((ds%>%mutate(n=(ping.embed+ping.play+ping.complete)))$n)
##shows me that the ping was coded as dummy variables; let's regroup them and add ping.not.embed for ease of use
ds<-ds%>%mutate(ping.not.embed=as.numeric((ping.embed+ping.play+ping.complete)==0),ping=(ifelse(ping.embed==1,"embed",ifelse(ping.play==1,"play",ifelse(ping.complete==1,"complete","not.embed")))))
##Finally, let's remove all the empty columns (the ones that only contain NA)
notEmpty<-names(which(sapply(ds[],function(x) !all(is.na(x)))))
ds<-ds[notEmpty]
ds[,notEmpty] <- lapply(ds[,notEmpty] , function(x) factor(x,exclude=NA))

Here are all our final features after the data cleaning:
id, date, user.agent, browser, country, page.url, title, video.url, video.id, ping.embed, ping.play, ping.complete, ua.device.brand, ua.device.model, ua.device.family, ua.os.major, ua.os.minor, ua.os.family, ua.os.patch, ua.ua.major, ua.ua.minor, ua.ua.family, ua.ua.patch, page.url.scheme, page.url.hostname, page.url.path, page.url.query, page.url.fragment, page.url.ext, video.url.scheme, video.url.hostname, video.url.port, video.url.path, video.url.query, video.url.ext, mobile, ping.not.embed, ping
Some of them are redundant (or the same but in more detail), but it helps to have a direct accessor when we want to explore them in more depth.

The last decision is a tough one: should I keep empty observations or not? To answer that, we first need to estimate how important they are.
Let's look at them.
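One way to take that look (a sketch; not necessarily the exact code behind the figures in the original post):

#percentage of missing values per column
sort(round(colMeans(is.na(ds))*100,3),decreasing=T)
#visual overview of the missingness pattern (Amelia is loaded above)
missmap(ds)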

We can see that most of the missing data is in the fields I generated. This is not actually a problem, since those fields are mainly here to give extra information if we want to be very specific about a certain observation.
Additionally, we can clearly see a pattern in the missing values. If we take the column video.id, we see that it has 0.895% missing data, which is really small. The video extension has 2.805% missing data, which is also relatively small.
These 348 observations actually correspond to all the www.youtube.com videos. Removing them wouldn't allow us to predict any ping from www.youtube.com.
Where it becomes trickier is the title.
57.493% of the video titles are missing.
What to do then? A quick glance at the data tells me that there is no video with the same id but a different title. Therefore, when predicting, I'll just omit title, which would otherwise overfit the model. As for video.id, I think it should be investigated separately (using a kNN or another model) to see if there is any special pattern that could explain it.
The trickiest question is whether or not to remove the YouTube videos, which have no extension; doing that would deprive the model of potentially useful information. I decided not to.
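The id/title check mentioned above can be done along these lines (a sketch):

#count the distinct non-missing titles per video.id; any row here would contradict the assumption
title.check<-ds %>% group_by(video.id) %>% summarise(n.titles=length(unique(na.omit(title)))) %>% filter(n.titles>1)
nrow(title.check) #expected to be 0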

ds.original<-ds

ds<-ds %>% filter(!is.na(video.id))

##3. Explore the data

#Set the seed for a reproducible example
set.seed(2015)

####We will be looking at the ping. Can we predict it from our dataset?

The idea is to see whether we can reasonably predict some ping state. If so, what are the important variables that drive our prediction? To do that, I am going to use multiple classifiers: AdaBoost, SVM, GBM and random forest.

First we reduce the number of features to keep only the ones that are significant, using a wrapper approach to subset selection.

##Note:
#This wrapper approach wasn't actually giving me better results. I assumed it was because of the highly unbalanced data. I ended up selecting features myself by trial and error.
#Wrapper approach for feature reduction using hill climbing search
#Input: data, y (binary)
#Output: list of selected factor reducing the error rate
feature.reducer<-function(data=ds,y){
  evaluator <- function(subset) {
    #k-fold cross validation
    k <- 5
    splits <- runif(nrow(data))
    results = sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- data[test.idx, , drop=F]
    train <- data[train.idx, , drop=F]
    tree <- rpart(as.simple.formula(subset, y), train)
    error.rate = sum(with(test, get(y)) != predict(tree, test,type="c")) / nrow(test)
    return(1 - error.rate)
    })  
    return(mean(results))
  }
  subset <- hill.climbing.search(names(data)[-which(names(data)==y)], evaluator)
  subset
}

I also chose not to use a separate training sample for the feature selection, as the data is already very small.
Finally, one of the main issues with wrapper methods is overfitting, and some features are obviously overfitting/irrelevant.
Therefore, I decided to remove all the ua.* fields before computing it (except the OS; the user agent gave me more levels about the browser, but the dataset being too small, I decided not to use them since they would create too much unbalanced data).
I am also removing information that just wouldn't make sense (i.e. the date: the data covers only one day, and we don't have the "local" time but only the standardized time +0000 for all observations).

Remove other factors that are irrelevant for the analysis:
*title (explained before)
*page.url.scheme has only one level, which is http
*page.url.path has no meaning for prediction
*page.url.ext has only one level, which is html
*page.url.fragment
*page.url.query
*video.url.port is present for only 2.9363917% of the data and is not relevant
*date is more debatable, but the data covering only one day and not being in local time, I don't see it making any sense. The only thing we could get out of it is whether there was no ping at all during a certain period, which could mean a server failure, but that belongs to observation rather than prediction (we can't predict a server failure from this data), which is what we are trying to achieve now
*ping.embed, ping.play, ping.complete, ping, which are our labels
*id: even though this one is interesting, the high number of factor levels (1355) makes it a bad predictor (even using SVM or GBM).

#Get the ua names index that was generated
notua<-notEmpty[!grepl("^ua.",notEmpty)]
#remove other factors that have very few meaning

ind.features.filter<-notua[is.na(match(notua,c("page.url.scheme","page.url.fragment","page.url.query","page.url.path","page.url.ext","date","video.url.port","video.url.path","video.url.query","ping.embed","ping.play","ping.complete","ping.not.embed","ping","title","video.url","page.url","id")))]

ind.features.filter<-c(ind.features.filter,"ua.os.family")

#I went through several trials and realized that this step wasn't necessary for this analysis
#However, I didn't remove it as I think the feature selection is key to the process. Therefore, I wanted to show that I actually did work on the feature selection.
#feature.ping.reduced.ping<-ind.features.filter
#feature.ping.reduced.ping.not.embed<-feature.reducer(ds[c(ind.features.filter,"ping.not.embed")],"ping.not.embed")
#feature.ping.reduced.ping.embed<-feature.reducer(ds[c(ind.features.filter,"ping.embed")],"ping.embed")
#feature.ping.reduced.ping.play<-feature.reducer(ds[c(ind.features.filter,"ping.play")],"ping.play")
#feature.ping.reduced.ping.complete<-feature.reducer(ds[c(ind.features.filter,"ping.complete")],"ping.complete")

Here are our features:
user.agent, browser, country, video.id, page.url.hostname, video.url.scheme, video.url.hostname, video.url.ext, mobile, ua.os.family
If we look into their levels, we can see that we have features with very high cardinality.
To deal with such high-dimensional data, we will be using three classifiers that handle high-cardinality categorical features nicely, and use binary classification where needed:
*Adaboost
*Generalized Boosted Regression Model (GBM)
*SVM

##Setting training and testing set vars

#Creating a function to build my train/test samples from a subset of the data
#Input: data, names of features to keep, labels (what we want to predict), training.percentage
#Output: assignment of global variables train/test with the pattern train.labelname, test.labelname
#Warning: will override any existing var with the same name
createSets<-function(data,ind.features.filter,labels,training.percentage=0.8){
n <- nrow(data)
ind <- sample(1:n)
trainval <- ceiling(n * training.percentage)
  lapply(labels,function(x){
    train<-data[ind[1:trainval],c(ind.features.filter,x)]
    test <- data[ind[(trainval + 1):n],c(ind.features.filter,x)]
    namTrain <- paste("train", x, sep = ".")
    namTest <- paste("test", x, sep = ".") 
    assign(namTrain,train,envir = .GlobalEnv)
    assign(namTest,test,envir = .GlobalEnv)
    })
}
labels<-c("ping.not.embed","ping.embed","ping.play","ping.complete")
invisible(createSets(ds,ind.features.filter,labels,0.8))

####Adaboost

############      AdaBoost   ###############
#I picked AdaBoost because it is supposed to be a good mix of accuracy and speed for high-dimensional predictors.

#Helper function that fits an AdaBoost classifier on a train/test/validation split
#Input: dataframe, name of y in dataframe, type of boosting you want to apply, number of iterations and depth
#Output: classifier
clfAda<-function(data=ds,y,adatype="gentle",adaiter=70,cdepth=14){
  n <- nrow(data)
  indy<-which(names(data)==y)
  ind <- sample(1:n)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  trainval <- ceiling(n * .5)
  testval <- ceiling(n * .3)
  train <- data[ind[1:trainval],]
  test <- data[ind[(trainval + 1):(trainval + testval)],]
  valid <- data[ind[(trainval + testval + 1):n],]
  control <- rpart.control(cp = -1, maxdepth = cdepth,maxcompete = 1,xval = 0)
  clf <- ada(fmla, data = train, test.x = test[,-indy], test.y = test[,indy], type =adatype, control = control, iter = adaiter)
  clf <- addtest(clf, valid[,-indy], valid[,indy])
  clf
}

#AdaBoost expects a binary response, therefore we use our dummy variables only

##Now we compute our AdaBoost models and produce the predictions and confusion matrices
#TODO use parallel to make it faster
#Take each label and assign variables after computing AdaBoost, the prediction, the confusion matrix and the variable importance
invisible(lapply(labels,function(x){
  xtest<-select(get(paste("test",x,sep=".")),-matches(x))
  ytest<-select(get(paste("test",x,sep=".")),matches(x))
  #classifier
  ada.clf<-clfAda(data=get(paste("train",x,sep=".")),y=x)
  #prediction
  ada.pred<-predict(ada.clf,xtest)
  #confusion matrix
  ada.conf<-table(ada.pred,ytest[,1])
  #var importance
  ada.vip<-varplot(ada.clf,plot.it=FALSE,type='scores')
  #Create the variables names
  name.clf<-paste("ada.clf",x,sep=".")
  name.pred<-paste("ada.pred",x,sep=".")
  name.conf<-paste("ada.conf",x,sep=".")
  name.vip<-paste("ada.vip",x,sep=".")
  #store it
  assign(name.clf,ada.clf,envir = .GlobalEnv)
  assign(name.pred,ada.pred,envir = .GlobalEnv)
  assign(name.conf,ada.conf,envir = .GlobalEnv)
  assign(name.vip,ada.vip,envir = .GlobalEnv)
}))

#Let's check if we found correct models

Note that usually we try to maximize the accuracy by adding/removing features when we don't have a specific question (if the classifier doesn't do it for you).

Result for:
*Not Embed

## Call:
## ada(fmla, data = train, test.x = test[, -indy], test.y = test[, 
##     indy], type = adatype, control = control, iter = adaiter)
## 
## Loss: exponential Method: gentle   Iteration: 70 
## 
## Training Results
## 
## Accuracy: 0.767 Kappa: 0.539 
## 
## Testing Results
## 
## Accuracy: 0.718 Kappa: 0.442 
## 
## Accuracy: 0.709 Kappa: 0.428

We can see that the model is decent (moderately high accuracy/Kappa).

Let’s check if we had enough iterations:

Yes, it's OK.

Let's look at the confusion matrix on the test set:

## Confusion Matrix and Statistics
## 
##         
## ada.pred   0   1
##        0 812 181
##        1 508 957
##                                           
##                Accuracy : 0.7197          
##                  95% CI : (0.7015, 0.7374)
##     No Information Rate : 0.537           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4472          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6152          
##             Specificity : 0.8409          
##          Pos Pred Value : 0.8177          
##          Neg Pred Value : 0.6532          
##              Prevalence : 0.5370          
##          Detection Rate : 0.3303          
##    Detection Prevalence : 0.4040          
##       Balanced Accuracy : 0.7281          
##                                           
##        'Positive' Class : 0               
## 

The model prediction is correct.
Let’s look then at the variable importance.

Therefore, it seems that being on mobile, the protocol (scheme) and the OS used influence the most whether or not we will load the video. This is common sense, since the video might load (or not) depending on whether our mobile/computer can read a specific protocol.

*Embed

I am going to go a bit faster here. I did check the iterations/Kappa/accuracy of the classifier, which are correct. The confusion matrix is also correct. Therefore, I can show you the variable importance.

We can see that the country plays an important role, which surprised me at first; then I figured out that it was most likely because the data is unbalanced. After that, the classifier picked the scheme and the operating system.

*Play

Kappa is really low, so the test fails to give any meaningful result.

*Complete

Same thing.

An interesting fact is that the sensitivity (with 0 (not) being the positive class) was really high for both of them.
This tells us that the model is good at knowing when a video won't be played or completed, which is consistent with the two other states. Additionally, we can explain it by the high imbalance of the data.

We have seen that some predictions were inaccurate. We have two main options:
*Change the features (we could use a feature-ranking approach such as chi-squared or information gain, since the wrapper approach didn't give effective results; a sketch follows below) or transform the features
*Use another model (which can also help with feature selection)
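For reference, here is a sketch of that feature-ranking idea with the FSelector package loaded earlier (using ping.not.embed as the example label):

#rank the candidate features by chi-squared and by information gain
weights.chi<-chi.squared(ping.not.embed~.,ds[c(ind.features.filter,"ping.not.embed")])
weights.ig<-information.gain(ping.not.embed~.,ds[c(ind.features.filter,"ping.not.embed")])
#keep, say, the 5 best-ranked features
cutoff.k(weights.chi,5)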

AdaBoost was a good, detailed example; I'm going to go faster now for the other classifiers.

####GBM

############      GBM   ###############
##I picked GBM because it is robust and relatively fast
#It also allows features to have up to 1024 levels (unlike randomForest)
#Finally, it has won several Kaggle competitions thanks to its high accuracy

invisible(createSets(ds,ind.features.filter,"ping",0.8))
labels.all<-c("ping",labels)

xTrain<-train.ping %>% select(-ping,-user.agent)
xTest<-test.ping %>% select(-ping,-user.agent)

yTrain<-train.ping[,which(names(train.ping)=="ping")]
yTest<-test.ping[,which(names(train.ping)=="ping")]

#a high guess of how many trees we'll need
ntrees = 500


#An optimization would be to do cross-validation to set the parameters
gbm.model <- gbm.fit(
x = xTrain 
, y = yTrain 
#x and y instead of using a formula
, distribution = "multinomial"
#use bernoulli for binary outcomes
#other values are "gaussian" for GBM regression
#or "adaboost"
, n.trees = ntrees
#Chose this value to be large; we will prune the
#tree after running the model
, shrinkage = 0.01
#smaller values of shrinkage typically give slightly better performance
#the cost is that the model takes longer to run for smaller values
, interaction.depth = 3
#TODO: use cross validation to choose interaction depth
, n.minobsinnode = 300
#n.minobsinnode has an important effect on overfitting
#decreasing this parameter increases the in-sample fit,
#but can result in overfitting
, nTrain = round(nrow(xTrain) * 0.8)
#select the number of trees at the end
# ,var.monotone = c()
#can help with overfitting, will smooth bumpy curves
, verbose = F #set to TRUE to print the preliminary output
) 

#Verify we had enough trees
gbm.perf(gbm.model, plot.it = TRUE)
## Using test method...

## [1] 500
#Relative influence among the variables can be used for variable selection or for seeing if
#parameter tuning is needed (i.e. if there is overfitting)
summary(gbm.model)

##                                   var    rel.inf
## video.id                     video.id 67.0141245
## country                       country 23.7720645
## page.url.hostname   page.url.hostname  5.3680470
## ua.os.family             ua.os.family  3.6998816
## browser                       browser  0.1458824
## video.url.scheme     video.url.scheme  0.0000000
## video.url.hostname video.url.hostname  0.0000000
## video.url.ext           video.url.ext  0.0000000
## mobile                         mobile  0.0000000
#Print tree
#pretty.gbm.tree(gbm.model)


##The rest is commented out since it was more for exploration than for drawing conclusions
#look at the effects of each variable
#for(i in 1:length(gbm.model$var.names)){
#  print(i)
#plot(gbm.model, i.var = i
#, ntrees = gbm.perf(gbm.model, plot.it = FALSE) #optimal number of trees
#, type = "response" #to get fitted probabilities
#)
#}
#Try to predict our test variables on the best tree
#gbm.test.predict<- predict(object = gbm.model,newdata =select(test.ping,-ping,-user.agent)
#, n.trees = gbm.perf(gbm.model, plot.it = FALSE)
#, type = "response")

#roc.area(select(test.ping.embed,ping.embed)[,1], gbm.test.predict)$A
#prop.table(table(matrix(gbm.test.predict)[,1],(select(test.ping.embed,ping.embed)[,1])))

#table(matrix(gbm.test.predict))

#str(gbm.test.predict)
#dim()
#str(data.frame(gbm.test.predict))
#str(as.data.frame(as.matrix(table(gbm.test.predict))))
#str(as.matrix(gbm.test.predict))

GBM is actually not working well at all here. Looking at its tree and summary, we can see too much importance placed on the first variable. I tried to tweak the parameters a lot and change the features, but it just would not work for this model. I therefore turned to SVM, which is also well suited to a problem with little data and many dimensions.

####SVM

#I did include my code for SVM, but I actually mostly used it during my exploratory phase, where I was trying to do rapid feature selection
##Still need to do cross-validation, parameter tuning and refactoring of the code
clfSvm<-function(data,y){
  indy<-which(names(data)==y)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  svm(formula=fmla,data=data,type="C-classification",na.action=na.exclude)  
}

invisible(lapply(c(labels,"ping"),function(x){
  xtest<-select(get(paste("test",x,sep=".")),-matches(x))
  ytest<-select(get(paste("test",x,sep=".")),matches(x))
  #classifier
  svm.clf<-clfSvm(data=get(paste("train",x,sep=".")),y=x)
  #prediction
  svm.pred<-predict(svm.clf,xtest,na.action=na.exclude)
  #confusion matrix
  svm.conf<-table(matrix(svm.pred)[,1],ytest[,1])
  #Create the variables names
  name.clf<-paste("svm.clf",x,sep=".")
  name.pred<-paste("svm.pred",x,sep=".")
  name.conf<-paste("svm.conf",x,sep=".")
  #store it
  assign(name.clf,svm.clf,envir = .GlobalEnv)
  assign(name.pred,svm.pred,envir = .GlobalEnv)
  assign(name.conf,svm.conf,envir = .GlobalEnv)

}))



confusionMatrix(svm.conf.ping.not.embed)
## Confusion Matrix and Statistics
## 
##    
##       0   1
##   0 906 397
##   1 377 734
##                                          
##                Accuracy : 0.6794         
##                  95% CI : (0.6603, 0.698)
##     No Information Rate : 0.5315         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.3555         
##  Mcnemar's Test P-Value : 0.4946         
##                                          
##             Sensitivity : 0.7062         
##             Specificity : 0.6490         
##          Pos Pred Value : 0.6953         
##          Neg Pred Value : 0.6607         
##              Prevalence : 0.5315         
##          Detection Rate : 0.3753         
##    Detection Prevalence : 0.5398         
##       Balanced Accuracy : 0.6776         
##                                          
##        'Positive' Class : 0              
## 
#confusionMatrix(svm.conf.ping.embed)
#summary(svm.clf.ping)
#str(svm.clf.ping)

#Things I didn't do: tune and tweak it.

Looking at the predictions (confusion matrices), SVM offers slightly better results than AdaBoost. Finally, I was wondering what random forest would give me, just to be sure which features to look into in more detail.

## Random Forest
##The idea behind using random forest was that it is a great tool for feature selection as well, since the built-in algorithm will automatically put aside predictors it doesn't like.
##However, you have to encode dummy variables because some predictors have too many levels
##It is still really heavy because it creates a lot of dummy variables, and my old computer
##really did not enjoy it.
##The main variable of interest was country; I could have transformed it into continent to reduce the dimensionality.
##However, I feel the final result would give information that is way too broad
#The alternative I used was to drop it and only keep variables with fewer than 50 levels (randomForest cannot handle categorical predictors with more than 53 categories)
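##A sketch of the continent idea using the countrycode package (not loaded or used in this analysis, shown only as an illustration)
#library(countrycode)
#ds$continent<-as.factor(countrycode(as.character(ds$country),origin="iso2c",destination="continent"))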

##dummy creation unused
#train.rf.dummy<-acm.disjonctif(train[,-c("ping.embed","ping.play","ping.complete")])
#train.rf.dummy<-cbind.data.frame(train[,c("ping.embed","ping.play","ping.complete")],train.rf.dummy)

#reduce the subset
ind.features.filter.rf<-names(ds[,unlist(lapply(ds,function (x)length(levels(x))<50))])
ind.features.filter.rf<-intersect(ind.features.filter.rf,ind.features.filter)

#random forest classifier
clfRf<-function(data,y){
  indy<-which(names(data)==y)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  randomForest(formula=fmla,data=data,na.action=na.roughfix,importance=T)
}

invisible(lapply(c(labels,"ping"),function(x){
  xtest<-get(paste("test",x,sep="."))[,c(ind.features.filter.rf,x)]
  xtest<-select(get(paste("test",x,sep=".")),-matches(x))
  ytest<-get(paste("test",x,sep="."))[,c(ind.features.filter.rf,x)]
  ytest<-select(get(paste("test",x,sep=".")),matches(x))
  train<-get(paste("train",x,sep="."))[,c(ind.features.filter.rf,x)]
  #classifier
  rf.clf<-clfRf(data=train,y=x)
  #prediction
  rf.pred<-predict(rf.clf,xtest,na.action=na.exclude)
  #confusion matrix
  rf.conf<-table(matrix(rf.pred)[,1],ytest[,1])
  #Create the variables names
  name.clf<-paste("rf.clf",x,sep=".")
  name.pred<-paste("rf.pred",x,sep=".")
  name.conf<-paste("rf.conf",x,sep=".")
  #store it
  assign(name.clf,rf.clf,envir = .GlobalEnv)
  assign(name.pred,rf.pred,envir = .GlobalEnv)
  assign(name.conf,rf.conf,envir = .GlobalEnv)
}))


#Observations
varImpPlot(rf.clf.ping.embed)

confusionMatrix(rf.conf.ping.embed)
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1562  289
##   1  214  349
##                                           
##                Accuracy : 0.7916          
##                  95% CI : (0.7749, 0.8077)
##     No Information Rate : 0.7357          
##     P-Value [Acc > NIR] : 9.952e-11       
##                                           
##                   Kappa : 0.4432          
##  Mcnemar's Test P-Value : 0.0009686       
##                                           
##             Sensitivity : 0.8795          
##             Specificity : 0.5470          
##          Pos Pred Value : 0.8439          
##          Neg Pred Value : 0.6199          
##              Prevalence : 0.7357          
##          Detection Rate : 0.6471          
##    Detection Prevalence : 0.7668          
##       Balanced Accuracy : 0.7133          
##                                           
##        'Positive' Class : 0               
## 

My observations from the confusion matrix and the variable importance for its best prediction (which was ping.embed) show me that random forest thinks the OS, page domain and browser are of interest.

#####Prediction conclusions:

After many iterations in R with different ranking methods, I found that it wasn't possible to predict accurately whether someone plays a video or completes it. I even transformed the response by creating a new variable "launched" (which was play OR complete), but the results weren't better (using naive Bayes or AdaBoost with many different feature combinations). I concluded that the information given wasn't enough and I could only predict whether the video was embedded or not embedded. This is due to a really unbalanced dataset.
Nevertheless, I found out that the OS, scheme, page hostname and country were of interest.
The next step is to dig into these variables to find some fun information.
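For reference, the "launched" response mentioned above was built along these lines (a sketch; it assumes ping.play and ping.complete are the 0/1 dummies created during cleaning):

ds.launched<-ds %>% mutate(launched=as.factor(as.numeric(ping.play==1 | ping.complete==1)))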

Since this analysis was supposed to be just a quick glance, I decided to plot only two or three charts and not go into detail.

##Representations

Let's start with a world map. We have seen that the country was supposed to be related to embed.

map.country<-ds %>% group_by(country) %>% summarize(n=n())
jc<-joinCountryData2Map(map.country,nameJoinColumn="country",joinCode="ISO2")
## 103 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 140 codes from the map weren't represented in your data
mapCountryData(jc, nameColumnToPlot="n", mapTitle="Number of Observation per country",
  addLegend=TRUE,catMethod='pretty',
  oceanCol="lightblue", missingCountryCol="lightgrey")
## You asked for 7 categories, 6 were used due to pretty() classification

We see that our data is actually very highly unbalanced and therefore must mostly reflect American behavior.
If you look in more detail and take the countries that have more than 1% of the data, you find out that there are only Indonesia and the US:

ggplot(data=filter(map.country,n>.1*nrow(ds)),aes(x=country,y=n))+geom_bar(stat="identity")+ggtitle("Number of observations for countries that have more than 1% of the data")+ylab("Number of observations")

A lot more could be done, and now comes the most fun part. I quickly noticed that the US would either not load the video at all or play/complete it.

Note: I didn't have time to finish the analysis, but here are some tables that are interesting, even though we should filter them first because sometimes we have only one observation.

table(ds$ua.ua.family)
## 
##           Android            Chrome     Chrome Mobile Chrome Mobile iOS 
##                28              3971                49                 2 
##          Chromium           Firefox    Firefox Mobile         Iceweasel 
##                27              6306                 1                 8 
##                IE         IE Mobile              Iron     Mobile Safari 
##              1029                13                 1               178 
##             Opera      Opera Mobile            Puffin            Safari 
##                52                 1                 1               625 
##        UC Browser 
##                 2
prop.table(table(ds$browser,ds$ping),1)
##                              
##                                 complete      embed  not.embed       play
##   Android Stock Browser       0.03571429 0.50000000 0.25000000 0.21428571
##   Chrome                      0.04543210 0.46543210 0.32098765 0.16814815
##   ChromeiOS                   0.00000000 1.00000000 0.00000000 0.00000000
##   Firefox                     0.13792557 0.11718131 0.58353127 0.16136184
##   Microsoft Internet Explorer 0.11228407 0.23704415 0.44817658 0.20249520
##   Opera                       0.01923077 0.57692308 0.25000000 0.15384615
##   Other                       0.00000000 0.75000000 0.00000000 0.25000000
##   Safari                      0.06491885 0.39575531 0.33957553 0.19975031
prop.table(table(ds$video.url.ext,ds$ping),1)
##       
##           complete       embed   not.embed        play
##   flv  0.000000000 1.000000000 0.000000000 0.000000000
##   m3u8 0.013145540 0.399061033 0.399061033 0.188732394
##   m4a  0.033426184 0.635097493 0.169916435 0.161559889
##   mp3  0.009009009 0.252252252 0.576576577 0.162162162
##   mp4  0.114104291 0.226476814 0.490090437 0.169328459
##   smil 0.017391304 0.408695652 0.460869565 0.113043478
prop.table(table(ds$video.url.hostname,ds$ping),1)
##                                
##                                   complete      embed  not.embed
##   content.bitsontherun.com      0.07005348 0.47593583 0.27219251
##   content.jwplatform.com        0.09513606 0.33104778 0.39168631
##   d1s3yn3kxq96sy.cloudfront.net 0.04687500 0.54687500 0.23437500
##   demo.jwplayer.com             0.14666352 0.01438340 0.68380099
##   dev.wowza.longtailvideo.com   0.01379310 0.42758621 0.35862069
##   fms.12E5.edgecastcdn.net      0.00937500 0.50937500 0.37500000
##   playertest.longtailvideo.com  0.01785714 0.32142857 0.53571429
##   testseedbox.tk                0.00000000 0.00000000 0.50000000
##   videos-jp.jwpsrv.com          0.00000000 0.50000000 0.00000000
##   wowza.longtailvideo.com       0.00000000 0.73364486 0.00000000
##   www.att.com                   0.04000000 0.32000000 0.36000000
##   www.longtailvideo.com         0.01739130 0.40869565 0.46086957
##   www.youtube.com               0.04641350 0.60337553 0.19831224
##                                
##                                       play
##   content.bitsontherun.com      0.18181818
##   content.jwplatform.com        0.18212985
##   d1s3yn3kxq96sy.cloudfront.net 0.17187500
##   demo.jwplayer.com             0.15515209
##   dev.wowza.longtailvideo.com   0.20000000
##   fms.12E5.edgecastcdn.net      0.10625000
##   playertest.longtailvideo.com  0.12500000
##   testseedbox.tk                0.50000000
##   videos-jp.jwpsrv.com          0.50000000
##   wowza.longtailvideo.com       0.26635514
##   www.att.com                   0.28000000
##   www.longtailvideo.com         0.11304348
##   www.youtube.com               0.15189873
prop.table(table(ds$video.url.scheme,ds$ping),1)
##       
##         complete     embed not.embed      play
##   http 0.1021380 0.2568064 0.4696843 0.1713713
##   rtmp 0.0093750 0.5093750 0.3750000 0.1062500
prop.table(table(ds$mobile,ds$ping),1)
##    
##       complete      embed  not.embed       play
##   0 0.10120884 0.25752397 0.47186328 0.16940392
##   1 0.04013378 0.49832776 0.28093645 0.18060201

Playing with R and Facebook Graph API

Playing with facebook

Tuesday, December 09, 2014

##This is a quick example of some data exploration with facebook

We will get my posts/stories and try to find out whether there is some kind of relationship between the number of comments and the number of likes. The collection of my posts was inspired by https://gist.github.com/simecek/1634662, with some modifications to adapt to the latest (v2.2) Graph API queries.

library(RCurl)
library(rjson)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(lazyeval)
# go to 'https://developers.facebook.com/tools/explorer' to get your access token
access_token<-readRDS("access_token.rds")
### MY FACEBOOK QUERY 
facebook <- function(query,token){
  myresult <- list()
  i <- 0 
  next.path<-sprintf( "https://graph.facebook.com/v2.2/%s&access_token=%s",query, access_token)
  # download all my posts
  while(length(next.path)!=0) {
    i<-i+1
    #You might get some unexpected escapes (with warnings); you should keep them
    myresult[[i]]<-fromJSON(getURL(next.path, ssl.verifypeer = FALSE, useragent = "R" ),unexpected.escape = "keep")
    next.path<-myresult[[i]]$paging$'next'
    }   
  return (myresult)  
  }
myposts<-facebook("me/posts?fields=story,message,comments.limit(1).summary(true),likes.limit(1).summary(true),created_time",access_token)
## Warning in fromJSON(getURL(next.path, ssl.verifypeer = FALSE, useragent =
## "R"), : unexpected escaped character 'm' at pos 89. Keeping value.
## Warning in fromJSON(getURL(next.path, ssl.verifypeer = FALSE, useragent =
## "R"), : unexpected escaped character 'o' at pos 51. Keeping value.
# parse the list, extract number of likes/comments and the corresponding text (status) and id
parse.master <- function(x, f)
  sapply(x$data, f)
parse.likes <- function(x) if(!is.na(unlist(x)['likes.summary.total_count'])) (as.numeric(unlist(x)['likes.summary.total_count'])) else 0
mylikes <- unlist(sapply(myposts, parse.master, f=parse.likes))
parse.comments <- function(x) if(!is.na(unlist(x)['comments.summary.total_count'])) (as.numeric(unlist(x)['comments.summary.total_count'])) else 0
mycomments <- unlist(sapply(myposts, parse.master, f=parse.comments))
parse.messages <- function(x) if(!is.null(x$message)){ x$message} else{if(!is.null(x$story)){x$story} else {NA}}
mymessages <- unlist(sapply(myposts, parse.master, f=parse.messages))
parse.id <- function(x) if(!is.null(x$id)){ x$id} else{NA}
myid <- unlist(sapply(myposts, parse.master, f=parse.id))
parse.time <- function(x) if(!is.null(x$created_time)){x$created_time} else{NA}
mytime <- unlist(sapply(myposts, parse.master, f=parse.time))
mytime<-(as.POSIXlt(mytime,format="%Y-%m-%dT%H:%M:%S"))
#put everything into a data.frame
fbPosts<-data.frame(postId=myid,message=mymessages,likes.count=mylikes,comments.count=mycomments,time=mytime,year=mytime$year+1900,dom=mytime$mday,hour=mytime$hour,wd=weekdays(mytime),month=months(mytime))

Now we can play with the data 🙂

#most commented
fbPosts[which.max(fbPosts$comments.count),]
##                                postId
## 525 10152254267528671_128457423841248
##                                     message likes.count comments.count
## 525 sera à grenoble l'an prochain. Fuck it.           0             18
##                    time year dom hour        wd month
## 525 2010-06-09 16:38:02 2010   9   16 Wednesday  June
#most liked
fbPosts[which.max(fbPosts$likes.count),]
##                                 postId
## 39 10152254267528671_10152119139903671
##                                                message likes.count
## 39 Internship found. Apartment found. What's next ? :)          18
##    comments.count                time year dom hour      wd month
## 39              5 2014-05-20 00:41:34 2014  20    0 Tuesday   May

Let’s look into the relationship between likes and comments.
Let’s group the posts by number of comments and sum their likes to get a better picture of the relationship.

fbPosts.group.comments <- fbPosts %>% group_by(comments.count) %>% summarise(sumLikes=sum(likes.count),posts=n()) %>% ungroup() %>% arrange(comments.count)
g1<-ggplot(data=fbPosts.group.comments,aes(x=comments.count,y=sumLikes))+geom_point(aes(size=posts))+geom_smooth(method="glm", family="quasipoisson")+xlab("Number of comments")+ylab("Number of likes")+ggtitle("Comments and likes per post")+scale_x_continuous(breaks=seq(0,50,5))+scale_y_continuous(breaks=seq(0,200,25))
grid.arrange(g1,ncol=1)

groupFb_fc<-function(ds,fc){
  return (ds %>% group_by_(fc) %>% summarise_(sumComment=interp(~sum((comments.count))),posts=interp(~n()),ratioC=interp(~sumComment/posts),sumLike=interp(~sum(likes.count)),ratioL=interp(~sumLike/posts)) %>% ungroup() %>% arrange_(fc))  
  }
plot.sum.likeComment<-function(ds,xvar,...){ 
  return (
    ggplot(data=ds,environment = environment())+geom_point(aes_string(size="posts",x=(xvar),y=("sumLike")))+geom_line(aes_string(x=(xvar),y="sumLike"))+geom_point(aes_string(x=(xvar),y="sumComment",size="posts"),color="red")+geom_line(aes_string(x=(xvar),y="sumComment"),color="red")+xlab(paste(xvar))+ylab("Number of likes and comments")+ggtitle(paste("Number of likes (black) and comments (red) per",xvar))
    )
  }
plot.ratio.likeComment<-function(ds,xvar,...){ 
  return (
    ggplot(data=ds,environment = environment())+geom_point(aes_string(size="posts",x=(xvar),y=("ratioL")))+geom_line(aes_string(x=(xvar),y="ratioL"))+geom_point(aes_string(x=(xvar),y="ratioC",size="posts"),color="red")+geom_line(aes_string(x=(xvar),y="ratioC"),color="red")+xlab(paste(xvar))+ylab("Ratio likes/post and comments/post")+ggtitle(paste("Number of likes (black) and comments (red) per post per",xvar))
    )
  }
fbPosts.group.year<-groupFb_fc(fbPosts,"year")
fbPosts.group.dom<-groupFb_fc(fbPosts,"dom")
fbPosts.group.hour<-groupFb_fc(fbPosts,"hour")
fbPosts.group.wd<-groupFb_fc(fbPosts,"wd")
fbPosts.group.month<-groupFb_fc(fbPosts,"month")
g2.sum<-plot.sum.likeComment(fbPosts.group.year,"year")
g3.sum<-plot.sum.likeComment(fbPosts.group.dom,"dom")
g4.sum<-plot.sum.likeComment(fbPosts.group.hour,"hour")
g5.sum<-plot.sum.likeComment(fbPosts.group.wd,"wd")
g6.sum<-plot.sum.likeComment(fbPosts.group.month,"month")
g2.ratio<-plot.ratio.likeComment(fbPosts.group.year,"year")
g3.ratio<-plot.ratio.likeComment(fbPosts.group.dom,"dom")
g4.ratio<-plot.ratio.likeComment(fbPosts.group.hour,"hour")
g5.ratio<-plot.ratio.likeComment(fbPosts.group.wd,"wd")
g6.ratio<-plot.ratio.likeComment(fbPosts.group.month,"month")
grid.arrange(g2.sum,g2.ratio,g3.sum,g3.ratio,g4.sum,g4.ratio,ncol=2)

grid.arrange(g5.sum,g5.ratio,g6.sum,g6.ratio,ncol=2)
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

In my case, we can see that there is a higher concentration of posts with few comments.
There also seems to be an inverse relationship between the number of likes and the number of comments, and some interesting patterns in the time series.
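
To put a rough number on that inverse relationship, a quick rank correlation on the raw per-post counts is enough (a sketch, using the fbPosts data frame built above):

#Spearman rank correlation between likes and comments per post
#a negative value is consistent with the inverse relationship described above
cor(fbPosts$likes.count,fbPosts$comments.count,method="spearman")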

Why do you think that is?

Playing with R – EDA analysis for a job

Set up

library(knitr)
opts_knit$set(upload.fun = imgur_upload, base.url = NULL) # upload all images to imgur.com

This is the global set up for the code.

library(ggplot2)
library(data.table)
library(dplyr)
library(gridExtra)
library(chron)
library(reshape2)
library(choroplethr)
library(zipcode)
library(choroplethrMaps)


setwd("~/df")
epCountry<-fread("episode_country.csv",header=T,sep=",")
epRegion<-fread("episode_region.csv",header=T,sep=",")
ds<-read.csv("df_dataset.csv",header=T,sep=",")
epSerie<-read.csv("episode_series.csv")
epReview<-fread("episode_series_review.csv",header=T,sep=",")
## Warning in fread("episode_series_review.csv", header = T, sep = ","):
## Stopped reading at empty line 83846 of file, but text exists afterwards
## (discarded): GRACIASSSSS POR TRADUCIRLOOOO AL ESPAÑOLLLLL. PD:AUNQUE ME
## UBIESE GUSTADO MÁS SUBTITULADO AL COREANO. ES MUCHO MAS BONITOOO."
epReview<-read.csv("episode_series_review.csv")
prem<-read.csv("premium_log.csv",header=T,sep=",")
#setwd("E:/df_dataset")
# df <- read.csv("watched",nrows=20000,header=T,sep="|")

##Various Fixes

#fixing gender
ds$gender.fix<-ds$gender
ds$gender.fix[ds$gender.fix=="f"]  <- "F" 
ds$gender.fix[ds$gender.fix=="m"]  <- "M"
ds$gender.fix[ds$gender.fix==""]  <- NA 
#Fixing empty string
ds$promo_code.fix<-ds$promo_code
ds$promo_code.fix[ds$promo_code.fix==""]  <- NA 

#fixing id to name
ds$country_id <- epCountry$name[match(ds$country_id, epCountry$id)]
ds$region_id<-epRegion$name[match(ds$region_id,epRegion$id)]
#fixing dates
options(chron.year.expand = 
     function (y, cut.off = 13, century = c(1900, 2000), ...) {
        chron:::year.expand(y, cut.off = cut.off, century = century, ...)
     }
)
ds$premium_since.date<-as.Date(ds$premium_since,format="%m/%d/%y")
ds$dob.date<-ds$dob
ds$dob.date<-as.Date(chron(format(as.Date(ds$dob, "%m/%d/%y"), "%m/%d/%y")))
ds$date_joined.date<-as.Date(ds$date_joined,format="%m/%d/%y")
ds$premium_signup_date.date<-as.Date(ds$premium_signup_date,format="%m/%d/%y")
prem$date<-as.Date(prem$create_dt,format="%Y-%m-%d")
ds$free_trial_no_cc_first_apply_datetime.date<-as.Date(ds$free_trial_no_cc_first_apply_datetime,format="%m/%d/%y")
ds$premium_free_trial_end_date.date<-as.Date(ds$premium_free_trial_end_date,format="%m/%d/%y")
ds$last_login.date<-as.Date(ds$last_login,format="%m/%d/%y")
ds$last_active.date<-as.Date(ds$last_active,format="%m/%d/%y")
ds$premium_last_pay_date.date<-as.Date(ds$premium_last_pay_date,format="%m/%d/%y")
ds$premium_next_pay_date.date<-as.Date(ds$premium_next_pay_date,format="%m/%d/%y")

ds<-mutate(ds,pay_delay_day=as.numeric(premium_next_pay_date.date-premium_last_pay_date.date))

ds<-mutate(ds,premium_join_delay=as.numeric(premium_since.date-date_joined.date))

Looking at the progression of premium worldwide

ds.fc_by_date_region<- ds %>% filter(!is.na(premium_since.date)) %>% group_by(premium_since.date,region_id) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
(region<-ggplot(aes(x = premium_since.date, y = n),data = ds.fc_by_date_region) +   geom_line()+facet_wrap(~region_id)+xlab("Date become premium")+ylab("Number of premium")+ggtitle("Premium joined per region"))

#ggsave(file="region_id_premium.pdf",region)

In terms of raw premium numbers, Latin America is doing the best. It is interesting to note that a number of countries are not referenced. Let’s focus on Latin America.

Latin America momentum

ds.la<-subset(ds,region_id=="Latin America")
#ds.la$premium_since.date<-as.Date(ds.la$premium_since,format="%m/%d/%y")
ds.la.fc_by_date_region<- ds.la %>% filter(!is.na(country_id)& !is.na(premium_since.date)) %>% group_by(premium_since.date,country_id) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
(momentum<-ggplot(aes(x = premium_since.date, y = n),data = ds.la.fc_by_date_region) +   geom_line()+facet_wrap(~country_id)+ggtitle("Customers who became premium")+ylab("Number of premium")+xlab("Date became premium in Latin America"))

#ggsave(file="momentum_LA.pdf",momentum)

We notice that Peru and Mexico are leading; let’s look in more detail.

g1_Mexi<-ggplot(aes(x = premium_since.date, y = n),data = filter(ds.la.fc_by_date_region,country_id=="Mexico",premium_since.date>c("2014-05-01"))) +   geom_point(alpha=1/2)+geom_smooth()+ggtitle("Customers who became premium for Mexico")+ylab("Number of premium")+xlab("2014 Date became premium")+coord_cartesian()
#ggsave(file="momentum_mexic.pdf",g1_Mexi)
g1_Peru<-ggplot(aes(x = premium_since.date, y = n),data = filter(ds.la.fc_by_date_region,country_id=="Peru")) +   geom_point()+geom_smooth()+ggtitle("Customers who became premium for Peru")+ylab("Number of premium")+xlab("2014 Date became premium")
grid.arrange(g1_Mexi,g1_Peru,ncol=1)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

#ggsave(file="momentum_peru.pdf",g1_Peru)

We can see that both countries have more and more premium customers. These two countries are interesting because their growth is significant. If we look back at the Latin America chart, we can see that these are the ones with a significant amount of data and the highest growth. Let’s look at gender.
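
As a rough check that the growth is real and not just visual, a simple linear trend can be fitted to the daily counts of new premium customers; a clearly positive slope supports the claim (a sketch only, reusing ds.la.fc_by_date_region built above):

#rough linear trend of daily new-premium counts for Mexico and Peru
trend.mexico<-lm(n ~ as.numeric(premium_since.date),data=filter(ds.la.fc_by_date_region,country_id=="Mexico"))
trend.peru<-lm(n ~ as.numeric(premium_since.date),data=filter(ds.la.fc_by_date_region,country_id=="Peru"))
summary(trend.mexico)$coefficients
summary(trend.peru)$coefficients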

##Focusing on Mexico and Peru

ds.mexico<-filter(ds,country_id=="Mexico",premium_since.date>c("2014-05-01"))
ds.peru<-filter(ds,country_id=="Peru",premium_since.date>c("2014-05-01"))

ds.mexico.by_gender<- ds.mexico %>% filter(!is.na(premium_since.date)) %>% group_by(premium_since.date,gender.fix) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)

ds.peru.by_gender<- ds.peru %>% filter(!is.na(premium_since.date)) %>% group_by(premium_since.date,gender.fix) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)


g2_Mexi<-ggplot(aes(x = premium_since.date, y = n,color=gender.fix),data = ds.mexico.by_gender) +   geom_point(alpha=1/2)+ggtitle("Customers who became premium for Mexico")+ylab("Number of premium")+xlab("2014 Date became premium")+coord_cartesian()+scale_y_continuous(breaks=seq(0,300,25))
g2_Peru<-ggplot(aes(x = premium_since.date, y = n,color=gender.fix),data = ds.peru.by_gender) +   geom_point(alpha=1/2)+ggtitle("Customers who became premium for Peru")+ylab("Number of premium")+xlab("2014 Date became premium")+coord_cartesian()+scale_y_continuous(breaks=seq(0,300,25))


grid.arrange(g2_Mexi,g2_Peru,ncol=1)

We can’t see it clearly, but it seems that women are leading and the gap stays constant; let’s look into it.

ds.peru.fc_gender.all<-ds.peru %>% group_by(premium_since.date,gender.fix) %>% filter(!is.na(gender.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
ds.peru.fc_gender.all.wide<-dcast(ds.peru.fc_gender.all,premium_since.date~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.peru.fc_gender.all.wide<-ds.peru.fc_gender.all.wide %>% group_by(premium_since.date) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(premium_since.date)


ds.peru.fc_gender.nopremanymore<-ds.peru %>% group_by(premium_since.date,gender.fix) %>% filter(!is.na(gender.fix),is_premium==0,!is.na(premium_since.date)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
ds.peru.fc_gender.nopremanymore.wide<-dcast(ds.peru.fc_gender.nopremanymore,premium_since.date~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.peru.fc_gender.nopremanymore.wide<-ds.peru.fc_gender.nopremanymore.wide %>% group_by(premium_since.date) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(premium_since.date)

#g.fc_gender.prem<-ggplot(data=ds.age.fc_gender.prem.wide,aes(x=age,y=percentF))+geom_smooth()+geom_hline(yintercept=1,alpha=0.2,linetype=2)+coord_cartesian(ylim=c(0,1),xlim=c(15,50))+ylab("Premium %girls")+scale_x_continuous(breaks=seq(0,50,2))+geom_point(alpha=1/50)
g.peru.fc_gender<-ggplot(data=ds.peru.fc_gender.all.wide,aes(x=premium_since.date,y=percentF))+geom_line()+geom_hline(yintercept=1,alpha=0.2,linetype=2)+coord_cartesian(ylim=c(0,1))+ylab("%girls")+scale_y_continuous(breaks=seq(0,1,.05))+geom_line(data=ds.peru.fc_gender.nopremanymore.wide,aes(x=premium_since.date,y=percentF),colour="green")+ggtitle("%girls by date joined as premium for Peru (green=was_premium,black=all premium)")+xlab("Date joined as premium")






ds.mexico.fc_gender.all<-ds.mexico %>% group_by(premium_since.date,gender.fix) %>% filter(!is.na(gender.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
ds.mexico.fc_gender.all.wide<-dcast(ds.mexico.fc_gender.all,premium_since.date~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.mexico.fc_gender.all.wide<-ds.mexico.fc_gender.all.wide %>% group_by(premium_since.date) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(premium_since.date)


ds.mexico.fc_gender.nopremanymore<-ds.mexico %>% group_by(premium_since.date,gender.fix) %>% filter(!is.na(gender.fix),is_premium==0,!is.na(premium_since.date)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
ds.mexico.fc_gender.nopremanymore.wide<-dcast(ds.mexico.fc_gender.nopremanymore,premium_since.date~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.mexico.fc_gender.nopremanymore.wide<-ds.mexico.fc_gender.nopremanymore.wide %>% group_by(premium_since.date) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(premium_since.date)

#g.fc_gender.prem<-ggplot(data=ds.age.fc_gender.prem.wide,aes(x=age,y=percentF))+geom_smooth()+geom_hline(yintercept=1,alpha=0.2,linetype=2)+coord_cartesian(ylim=c(0,1),xlim=c(15,50))+ylab("Premium %girls")+scale_x_continuous(breaks=seq(0,50,2))+geom_point(alpha=1/50)
g.mexico.fc_gender<-ggplot(data=ds.mexico.fc_gender.all.wide,aes(x=premium_since.date,y=percentF))+geom_line()+geom_hline(yintercept=1,alpha=0.2,linetype=2)+coord_cartesian(ylim=c(0,1))+ylab("%girls")+scale_y_continuous(breaks=seq(0,1,.05))+geom_line(data=ds.mexico.fc_gender.nopremanymore.wide,aes(x=premium_since.date,y=percentF),colour="green")+ggtitle("%girls by date joined as premium for Mexico (green=was_premium,black=all premium)")+xlab("Date joined as premium")

grid.arrange(g.mexico.fc_gender,g.peru.fc_gender)

We don’t see any noticeable difference, except recently.

tail(ds.mexico.fc_gender.nopremanymore.wide)
## Source: local data frame [6 x 7]
## 
##   premium_since.date   F  M N  percentF  percentM percentN
## 1         2014-10-08 102 19 0 0.8429752 0.1570248        0
## 2         2014-10-09  66 13 0 0.8354430 0.1645570        0
## 3         2014-10-11   1  0 0 1.0000000 0.0000000        0
## 4         2014-10-13   1  0 0 1.0000000 0.0000000        0
## 5         2014-10-15   1  0 0 1.0000000 0.0000000        0
## 6         2014-10-22   1  0 0 1.0000000 0.0000000        0
tail(ds.peru.fc_gender.nopremanymore.wide)
## Source: local data frame [6 x 7]
## 
##   premium_since.date   F  M N  percentF  percentM percentN
## 1         2014-10-08 151 18 0 0.8934911 0.1065089        0
## 2         2014-10-09  61 14 0 0.8133333 0.1866667        0
## 3         2014-10-12   1  0 0 1.0000000 0.0000000        0
## 4         2014-10-14   3  0 0 1.0000000 0.0000000        0
## 5         2014-10-15   1  0 0 1.0000000 0.0000000        0
## 6         2014-10-18   1  0 0 1.0000000 0.0000000        0

We can see that we actually don’t have much data at the end, which explains the difference there.
Therefore, we can say that there is a majority of women, and no particular peak in gender around the time they became premium. Let’s now evaluate how long they generally stay premium. There are different ways to do this. We can look at the next pay date: if the next pay date is earlier than max(last_active, last_login) (we assume someone either logged in or was active on the last day the data were taken), it means they are no longer premium because they stopped paying for the premium membership (assuming people cannot pay more than once for other products). Another way of looking at it is to look at people who joined as premium (premium_since.date not NA) but are not premium anymore (is_premium=0).

la<- max(c(max(ds$last_active.date,na.rm = T),max(ds$last_login.date,na.rm = T)))
nrow(filter(ds,premium_next_pay_date.date<la))
## [1] 6249

We can see some issues with the data right away

nrow(filter(ds,is_premium==1,premium_next_pay_date.date<la))
## [1] 3932

We have 3932 records that are reported as premium even though their next pay date is earlier than the last day the data were taken.

nrow(filter(ds,!is.na(premium_since.date),is_premium==0))
## [1] 124247

Additionally, we have 124247 records that were reported as premium at some point (premium_since.date is not NA) but have is_premium set to false, which means they are not premium anymore.
It is hard to make sense of these data, but we will assume that the boolean is_premium and premium_since.date are more reliable than the first approach, since the first approach assumes that people pay every month, when they might pay once, leave, and come back later. In that case it would not be possible to know when they will pay next.
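
To make that assumption explicit in the code, it could be captured once as a flag (a sketch; churned is a hypothetical helper column, not in the original data):

#capture the adopted churn definition as a single flag (hypothetical column name)
ds.flagged<-mutate(ds,churned=!is.na(premium_since.date) & is_premium==0)
table(ds.flagged$churned)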

table(filter(ds,!is.na(premium_since.date),is_premium==0)$pay_delay_day)
## 
##  30  31 365 
##   1   5   3

We can see that we don’t have much information about their subscription habits in terms of payment frequency.
However, we have seen that in Latin America, Peru and Mexico have a majority of women who stay premium at the same rate (in percentage) as men, in a market with a growing number of premium customers. Therefore, advertising specifically in these two countries could be interesting. Let’s see whether they are more attracted by one promo rather than another, and what kind of subscription they sign up for.

##Promo related to acquisition

ds.promo.active<- ds %>% filter(!is.na(premium_since.date),is_premium==1,premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo=!is.na(promo_code.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
promoForActualPremium<-ggplot(aes(x = premium_since.date, y = n,color=promo),data = ds.promo.active) +   geom_point()+ggtitle("Effect of promotion on premium customers who are still premium (true=joined with a promo code, false=without)")+xlab("Joined date  as premium in 2014")+ylab("Number of premium")

ds.promo.inactive<- ds %>% filter(!is.na(premium_since.date),is_premium==0,premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo=!is.na(promo_code.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
promoForInactivePremium<-ggplot(aes(x = premium_since.date, y = n,color=promo),data = ds.promo.inactive) +   geom_point()+ggtitle("Effect of promotion on premium customers who are not premium anymore (true=joined with a promo code, false=without)")+xlab("Joined date  as premium in 2014")+ylab("Number of premium")
grid.arrange(promoForActualPremium,promoForInactivePremium,ncol=1)

We can clearly see that the users who decided to stop being premium are the ones who didn’t join with a promo code. Therefore, in order to retain our users, we should make sure that they join with a promo code.
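
A quick way to put a number on that claim is to compare the share of still-premium customers between those who joined with and without a promo code (a rough sketch; still_premium here is just mean(is_premium), using the promo_code.fix column created in the fixes section):

#share of customers still premium, split by whether they joined with a promo code
ds %>% filter(!is.na(premium_since.date)) %>% group_by(promo=!is.na(promo_code.fix)) %>% summarise(customers=n(),still_premium=mean(is_premium))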

Let’s look into Peru and Mexico.

ds.promo.mexico.active<- ds %>% filter(!is.na(premium_since.date),is_premium==1,country_id=="Mexico",premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo=!is.na(promo_code.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
ds.promo.mexico.inactive<- ds %>% filter(!is.na(premium_since.date),is_premium==0,country_id=="Mexico",premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo=!is.na(promo_code.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)


ds.promo.peru.active<- ds %>% filter(!is.na(premium_since.date),is_premium==1,country_id=="Peru",premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo=!is.na(promo_code.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
ds.promo.peru.inactive<- ds %>% filter(!is.na(premium_since.date),is_premium==0,country_id=="Peru",premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo=!is.na(promo_code.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)


g.promo.mexico.active<-ggplot(aes(x = premium_since.date, y = n,color=promo),data = ds.promo.mexico.active) +   geom_point()+xlab("Date joined as premium in 2014")+ylab("Number of premium")+ggtitle("Number of premium who are still premium by date joined in Mexico")
g.promo.mexico.inactive<-ggplot(aes(x = premium_since.date, y = n,color=promo),data = ds.promo.mexico.inactive) +   geom_point()+xlab("Date joined as premium in 2014")+ylab("Number of premium")+ggtitle("Number of premium who are not premium anymore by date joined in Mexico")


g.promo.peru.active<-ggplot(aes(x = premium_since.date, y = n,color=promo),data = ds.promo.peru.active) +   geom_point()+xlab("Date joined as premium in 2014")+ylab("Number of premium")+ggtitle("Number of premium who are still premium by date joined in Peru")
g.promo.peru.inactive<-ggplot(aes(x = premium_since.date, y = n,color=promo),data = ds.promo.peru.inactive) +   geom_point()+xlab("Date joined as premium in 2014")+ylab("Number of premium")+ggtitle("Number of premium who are not premium anymore by date joined in Peru")

grid.arrange(g.promo.mexico.active,g.promo.mexico.inactive,g.promo.peru.active,g.promo.peru.inactive,ncol=2)

We can see that not one person who joined with a promo code has stopped being premium. However, the reach of promo codes in Mexico and Peru is very small. Assuming we haven’t run a dedicated marketing campaign in this area, we should consider doing so specifically for these two countries, because globally this is how we keep customers from leaving.

Let’s look at the successful promos.

#ggsave(file="promoForPremCanada.pdf",promoForPremCanada)
#same

ds.promo.group<- ds %>% filter(!is.na(premium_since.date),premium_since.date>c("2014-04-01")) %>% group_by(premium_since.date,promo_code) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)


#allCampaign<-ggplot(aes(x = premium_since.date, y = n,color=promo_code),data = ds.promo.group) +   geom_point()
#ggsave(file="allCampaign.pdf",allCampaign)
#organic are leading


bottomCampaign<-ggplot(aes(x = premium_since.date, y = n,color=promo_code),data = filter(ds.promo.group,n<100,n>10,promo_code!="")) +   geom_point() + ggtitle("Effect of promos that attracted between 10 and 100 customers in one day, per date joined")+xlab("Date joined as premium in 2014")+ylab("Number of premium")
#ggsave(file="bottomCampaign.pdf",bottomCampaign)

organic<-ggplot(aes(x = premium_since.date, y = n,color=promo_code),data = filter(ds.promo.group,promo_code=="")) +   geom_point()+ggtitle("Organic (no promo code) premium sign-ups per day, by date joined")+xlab("Date joined as premium in 2014")+ylab("Number of premium")
#ggsave(file="organicacquirement.pdf",organic)

grid.arrange(bottomCampaign,organic)

We can see that the 5-year promo was the most successful promotion, and organic premium sign-ups seem to be affected by the season (fewer in summer, when people go on vacation). Therefore, I’d recommend running more marketing campaigns like the 5YEARS promo, plus targeted marketing for Peru and Mexico that respects their values.
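
To check which promo codes actually bring users who stay, they can be ranked by volume and by the share still premium (a sketch; the 500-user cutoff is arbitrary):

#retention by promo code, keeping only codes used by at least 500 customers
ds %>% filter(!is.na(promo_code.fix)) %>% group_by(promo_code.fix) %>% summarise(customers=n(),still_premium=mean(is_premium)) %>% filter(customers>=500) %>% arrange(desc(still_premium))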

##Let’s look at their age.

length(filter(ds,Sys.Date()<dob.date | is.na(dob.date))$dob.date)/length(ds$dob.date)
## [1] 0.1840954

Looking at the dates of birth, we can note that many of them are inaccurate.
About 18.4% of the dates are either missing or in the future.

nrow(filter(ds,is.na(dob.date),gender.fix=="F"))/nrow(filter(ds,is.na(dob.date),gender.fix=='M'))
## [1] 4.880942
nrow(filter(ds,gender.fix=="F"))/nrow(filter(ds,gender.fix=='M'))
## [1] 4.204354

Moreover, there isn’t a big gap between the gender ratio of the records we keep and the records we ignore (roughly 4.9 women per man among the missing dates of birth versus 4.2 overall), so the 18.4% we drop is not skewed toward one gender.

Let’s now calculate their age and look at the global trends.

ds.age<-filter(ds,Sys.Date()>dob.date,!is.na(dob.date))
sysD<-rep(Sys.Date(),times=length(ds.age$dob.date))
#We truncate to whole years: fractional ages add no value here, and whole years make it easier to group the data
ds.age$age<-trunc(as.numeric(sysD-ds.age$dob.date)/365)
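#NB: dividing by 365 ignores leap years; a slightly more precise alternative (just a suggestion, not used below) would be:
#ds.age$age<-floor(as.numeric(difftime(sysD,ds.age$dob.date,units="days"))/365.25)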
#fixing gender
#ds.age$gender.fix<-ds.age$gender
#ds.age$gender.fix[ds.age$gender.fix=="f"]  <- "F" 
#ds.age$gender.fix[ds.age$gender.fix=="m"]  <- "M"
#ds.age$gender.fix[ds.age$gender.fix==""]  <- NA 
g.age.all<-ggplot(aes(x=age,fill=gender.fix),data=ds.age)+geom_histogram(binwidth=1)+scale_x_continuous(breaks=seq(0,100,5))+xlab("Age in years")+ylab("Number of customer")+ggtitle("Age related to number of customers")
g.age.trunc<-ggplot(aes(x=age,fill=gender.fix),data=ds.age)+geom_histogram(binwidth=1)+coord_cartesian(xlim=c(15,30))+scale_x_continuous(breaks=seq(0,100,1))+xlab("Age in years")+ylab("Number of customer")+ggtitle("Age related to number of customers")
grid.arrange(g.age.all,g.age.trunc,ncol=1)

summary(ds.age$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   20.00   24.00   27.93   32.00  102.00

We can clearly see that women are more attracted to our product and that most of our customer base is between 18 and 25.

Let’s take a broader demographic look.

ageD<-filter(ds.age,!is.na(gender.fix))
ggplot(aes(x=age,fill=gender.fix),data=ageD)+geom_histogram(binwidth=1)+coord_cartesian()+facet_wrap(~region_id)+scale_x_continuous(breaks=seq(0,100,5))

We can see that Latin America has a much higher percentage of women than the US/Canada (a quick numeric check follows below).
There is an invalid region id that we should ignore; since North America is supposed to be active, we could assume the NAs belong to it (id=8). Let’s try to find out whether we can recover the country id for our NAs.
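
Here is the quick numeric check of the gender share within each region (a sketch on the ageD subset defined above; rows sum to 1):

#share of each gender within each region
prop.table(table(ageD$region_id,ageD$gender.fix),1)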

unique(filter(ds,is.na(region_id))$country_id)
## [1] NA
length(filter(ds,is.na(region_id))$country_id)
## [1] 243536

We can see that we can’t recover the country id. However, there are too many rows to simply ignore them. Since we want to see globally whether some countries have more men than women, let’s keep them.
Let’s take a demographic look to see if there are major differences between countries that have enough data (more than 10k observations).

#ggplot(aes(x=age,fill=gender.fix),data=filter(ds.age,!is.na(gender.fix)))+geom_histogram(binwidth=1)+coord_cartesian()+facet_wrap(~country_id)
#No country stands out in particular
countryList<-unique((ds.age %>% group_by(country_id)%>%filter(!is.na(gender.fix),n()>10000)%>%ungroup())$country_id)

ggplot(aes(x=age,fill=gender.fix),data=filter(ds.age,!is.na(gender.fix),country_id %in% countryList))+geom_histogram(binwidth=1)+coord_cartesian()+facet_wrap(~country_id)

We can see that our main customers are from Canada, Indonesia, Malaysia, Mexico, Peru, the Philippines and the USA, with a significant number of unknowns.

There isn’t a major difference globally.

Now let’s look at monetization.

g.age.noprem<-ggplot(aes(x=age,fill=gender.fix),data=filter(ds.age,!is.na(gender.fix),is_premium==0))+geom_histogram(binwidth=1)+coord_cartesian()+ylab("Number of customer not premium")+scale_x_continuous(breaks=seq(0,100,5))+xlab("Age in year")+ggtitle("Number of customer not premium per age")

g.age.prem<-ggplot(aes(x=age,fill=gender.fix),data=filter(ds.age,!is.na(gender.fix),is_premium==1))+geom_histogram(binwidth=1)+coord_cartesian()+ylab("Number of customer premium")+scale_x_continuous(breaks=seq(0,100,5))+ggtitle("Number of customer premium per age")
grid.arrange(g.age.prem,g.age.noprem,ncol=1)

We can see that globally the trend is the same, except that the share of men seems higher among premium customers than among the non-premium ones. Let’s look in more detail.

library(reshape2)
summary(filter(ds.age,!is.na(gender.fix),is_premium==1)$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   19.00   23.00   27.37   31.00  102.00
summary(filter(ds.age,!is.na(gender.fix),is_premium==0)$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   20.00   23.00   26.96   30.00  102.00

As we said, there is no major difference in age between premium and non-premium customers. Let’s now look at it in terms of frequencies (conversion rates).

ds.age.fc_convert<-ds.age %>% group_by(age,is_premium) %>% filter(!is.na(gender.fix)) %>% summarise(n=n()) %>% ungroup() %>% arrange(age)
ds.age.fc_convert.wide<-dcast(ds.age.fc_convert,age~is_premium,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.age.fc_convert.wide$prem<-ds.age.fc_convert.wide$'1'
ds.age.fc_convert.wide$nonprem<-ds.age.fc_convert.wide$'0'
ds.age.fc_convert.wide<-ds.age.fc_convert.wide %>% group_by(age) %>% mutate(percentConversionRate=prem/(nonprem+prem)) %>% ungroup() %>% arrange(age)

g.fc_convert.all<-ggplot(data=ds.age.fc_convert.wide,aes(x=age,y=percentConversionRate),breaks=seq(10,60,5))+geom_smooth()+coord_cartesian(ylim=c(0,0.40),xlim=c(10,80))+ylab("% of premium conversion")+scale_x_continuous(breaks=seq(0,80,2))+geom_line()+ggtitle("Conversion rate by age for male and female")+scale_y_continuous(breaks=seq(0,0.4,0.05))




ds.age.fc_convert.male<-ds.age %>% group_by(age,is_premium) %>% filter(!is.na(gender.fix),gender.fix=="M") %>% summarise(n=n()) %>% ungroup() %>% arrange(age)
ds.age.fc_convert.male.wide<-dcast(ds.age.fc_convert.male,age~is_premium,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.age.fc_convert.male.wide$prem<-ds.age.fc_convert.male.wide$'1'
ds.age.fc_convert.male.wide$nonprem<-ds.age.fc_convert.male.wide$'0'
ds.age.fc_convert.male.wide<-ds.age.fc_convert.male.wide %>% group_by(age) %>% mutate(percentConversionRate=prem/(nonprem+prem)) %>% ungroup() %>% arrange(age)

g.fc_convert.male<-ggplot(data=ds.age.fc_convert.male.wide,aes(x=age,y=percentConversionRate),breaks=seq(10,60,5))+geom_smooth()+coord_cartesian(ylim=c(0,0.40),xlim=c(10,80))+ylab("% of premium conversion")+scale_x_continuous(breaks=seq(0,80,2))+geom_line()+ggtitle("Conversion rate by age for male")+scale_y_continuous(breaks=seq(0,0.4,0.05))





ds.age.fc_convert.female<-ds.age %>% group_by(age,is_premium) %>% filter(!is.na(gender.fix),gender.fix=="F") %>% summarise(n=n()) %>% ungroup() %>% arrange(age)
ds.age.fc_convert.female.wide<-dcast(ds.age.fc_convert.female,age~is_premium,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.age.fc_convert.female.wide$prem<-ds.age.fc_convert.female.wide$'1'
ds.age.fc_convert.female.wide$nonprem<-ds.age.fc_convert.female.wide$'0'
ds.age.fc_convert.female.wide<-ds.age.fc_convert.female.wide %>% group_by(age) %>% mutate(percentConversionRate=prem/(nonprem+prem)) %>% ungroup() %>% arrange(age)

g.fc_convert.female<-ggplot(data=ds.age.fc_convert.female.wide,aes(x=age,y=percentConversionRate),breaks=seq(10,60,5))+geom_smooth()+coord_cartesian(ylim=c(0,0.40),xlim=c(10,80))+ylab("% of premium conversion")+scale_x_continuous(breaks=seq(0,80,2))+geom_line()+ggtitle("Conversion rate by age for female")+scale_y_continuous(breaks=seq(0,0.4,0.05))

grid.arrange(g.fc_convert.female,g.fc_convert.male,g.fc_convert.all,ncol=3)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
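
Since is_premium is already a 0/1 flag, the male/female conversion-rate curves above can also be computed in a single grouped summary; a more compact, equivalent sketch:

#conversion rate by age and gender in one pass (mean of a 0/1 flag = conversion rate)
ds.age.conv<-ds.age %>% filter(!is.na(gender.fix)) %>% group_by(age,gender.fix) %>% summarise(n=n(),percentConversionRate=mean(is_premium)) %>% ungroup() %>% arrange(age)
ggplot(data=ds.age.conv,aes(x=age,y=percentConversionRate,colour=gender.fix))+geom_line()+coord_cartesian(ylim=c(0,0.40),xlim=c(10,80))+xlab("Age in years")+ylab("% of premium conversion")+ggtitle("Conversion rate by age and gender (compact version)")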

The 19-31 age group are our main customers, and there is no significant change in conversion rate or gender within this group. However, we do not convert them as well as the older customers. We also see a high conversion rate for under-18s (with a peak at 12), which might be either fake data or parents paying for their children. Therefore, the 19-28 group is harder to convert, but we have a much bigger pool of data for them.

It also seems that gender has no effect on the conversion rate.

Now let’s see whether there is a notable difference in the percentage of women among those who are or were premium and those who are not.

ds.age.fc_gender.noprem<-ds.age %>% group_by(age,gender.fix) %>% filter(!is.na(gender.fix),is_premium==0) %>% summarise(n=n()) %>% ungroup() %>% arrange(age)
ds.age.fc_gender.noprem.wide<-dcast(ds.age.fc_gender.noprem,age~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.age.fc_gender.noprem.wide<-ds.age.fc_gender.noprem.wide %>% group_by(age) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(age)


ds.age.fc_gender.prem<-ds.age %>% group_by(age,gender.fix) %>% filter(!is.na(gender.fix),is_premium==1) %>% summarise(n=n()) %>% ungroup() %>% arrange(age)
ds.age.fc_gender.prem.wide<-dcast(ds.age.fc_gender.prem,age~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.age.fc_gender.prem.wide<-ds.age.fc_gender.prem.wide %>% group_by(age) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(age)


ds.age.fc_gender.nopremanymore<-ds.age %>% group_by(age,gender.fix) %>% filter(!is.na(gender.fix),is_premium==0,!is.na(premium_since.date)) %>% summarise(n=n()) %>% ungroup() %>% arrange(age)
ds.age.fc_gender.nopremanymore.wide<-dcast(ds.age.fc_gender.nopremanymore,age~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)
ds.age.fc_gender.nopremanymore.wide<-ds.age.fc_gender.nopremanymore.wide %>% group_by(age) %>% mutate(percentF=F/(F+M+N),percentM=M/(F+M+N),percentN=N/(F+M+N)) %>% ungroup() %>% arrange(age)

#g.fc_gender.prem<-ggplot(data=ds.age.fc_gender.prem.wide,aes(x=age,y=percentF))+geom_smooth()+geom_hline(yintercept=1,alpha=0.2,linetype=2)+coord_cartesian(ylim=c(0,1),xlim=c(15,50))+ylab("Premium %girls")+scale_x_continuous(breaks=seq(0,50,2))+geom_point(alpha=1/50)
g.fc_gender.noprem<-ggplot(data=ds.age.fc_gender.noprem.wide,aes(x=age,y=percentF),breaks=seq(10,60,5))+geom_line()+geom_hline(yintercept=1,alpha=0.2,linetype=2)+coord_cartesian(ylim=c(0,1),xlim=c(10,80))+ylab("%girls")+scale_x_continuous(breaks=seq(0,80,2))+scale_y_continuous(breaks=seq(0,1,.05))
g.fc_gender.noprem+geom_line(data=ds.age.fc_gender.prem.wide,aes(x=age,y=percentF),colour="red")+geom_line(data=ds.age.fc_gender.nopremanymore.wide,aes(x=age,y=percentF),colour="green")+ggtitle("%girls by age (red=premium,green=was_premium,black=not premium)")

We can see that, proportionally, more women than men are premium. Moreover, women tend to convert more than men, especially as they get older; women who try our product are therefore more likely to become premium members. Finally, we can see that more men than women join our website at older ages. One assumption could be that they register for their children, and that the men do it because (regretfully) men still earn more than women on average in couples (at least in Europe: http://www.theguardian.com/world/2014/mar/06/french-married-women-earn-less-male-partners).

Therefore, we can conclude that we should also target our advertisements at couples who have young children, and especially at the man in the couple when he is 50+.

##Where is our conversion rate the most important?

(sort(table(filter(ds)$country_id),decreasing = T))[1:3]
## 
## United States   Philippines        Canada 
##        170203         30696         24130
(sort(table(filter(ds,is_premium==1)$country_id),decreasing = T))[1:3]
## 
## United States          Peru        Mexico 
##          9457          3210          2545
(sort(table(filter(ds,is_premium==0)$country_id),decreasing = T))[1:3]
## 
## United States   Philippines      Malaysia 
##        160746         30678         23208

We can see that the USA, Peru, Mexico and Canada are our main markets, as we said earlier.
Let’s look at who has the best conversion rate (the best ratio of premium customers to customer base). To do this, we will focus on the countries with enough observations for the ratio to make sense (>20000).

ds.prem_fc<-ds %>% group_by(country_id) %>% summarise(premiumPerc=sum(is_premium)/n(),number_observation=n()) %>% ungroup() %>% arrange(country_id)

g.premium.conversion<-ggplot(data=filter(ds.prem_fc,number_observation>20000),aes(x=country_id,y=premiumPerc))+geom_histogram(stat="identity")+xlab("Country Name")+ylab("Percentage of conversion to premium")+ggtitle("Conversion rate by country for countries with more than 20k observations")
g.premium.nbObservation<-ggplot(data=filter(ds.prem_fc,number_observation>20000),aes(x=country_id,y=number_observation))+geom_histogram(stat="identity")+xlab("Country Name")+ylab("Number of customers")+ggtitle("Number of customers by country for countries with more than 20k observations")

grid.arrange(g.premium.conversion,g.premium.nbObservation,ncol=1)

As we have seen so far, Peru and Mexico have very high potential since their conversion rates are the highest, with Canada and the USA following. However, the USA has far more customers than the others, so focusing on it first makes sense. Additionally, we should look at where these users are located in order to run targeted campaigns for specific cities.

##Looking into USA states and cities

data(state.regions)
data(zipcode)

ds.zipmap.premium<-ds %>% filter(!is.na(zip),zip!="",country_id=="United States",is_premium==1) %>% group_by(zip) %>% summarise(valuePostal=n())
ds.zipmap.premium$zip.fix<-clean.zipcodes(ds.zipmap.premium$zip)
ds.zipmap.premium<-ds.zipmap.premium %>% filter(!is.na(zip.fix)) %>% group_by(zip.fix) %>% summarise(value=sum(valuePostal))
ds.zipmap.premium$region<-ds.zipmap.premium$zip.fix
g.zip.premium.focus<-zip_map(ds.zipmap.premium,buckets=9,zoom    = c("california","florida","new york","texas"))+ggtitle("Premium in California, Florida, New York and Texas")+scale_color_brewer(name="Customer", palette=8)
## Scale for 'colour' is already present. Adding another scale for 'colour', which will replace the existing scale.
g.zip.premium.all<-zip_map(ds.zipmap.premium,buckets=9)+ggtitle("United State premium")+scale_color_brewer(name="Customer", palette=8)
## Scale for 'colour' is already present. Adding another scale for 'colour', which will replace the existing scale.
ds.zipmap.premium$city <- zipcode$city[match(as.character(ds.zipmap.premium$region), as.character(zipcode$zip))]
ds.zipmap.premium.plot<-ds.zipmap.premium %>% group_by(city) %>% summarise(value=sum(value)) %>% ungroup() %>% arrange(city)

g.zip.premium.city<-ggplot(data=filter(ds.zipmap.premium.plot,value>100),aes(x=city,y=value))+geom_histogram(stat="identity")+scale_y_continuous(breaks=seq(0,30000,50))+ggtitle("Number of premium for the top cities")+xlab("City name")+ylab("Number of premium")

ds.zipmap.premium$state.abb <- zipcode$state[match(as.character(ds.zipmap.premium$region), as.character(zipcode$zip))]
ds.zipmap.premium$state <- state.regions$region[match(as.character(ds.zipmap.premium$state.abb), as.character(state.regions$abb))]
ds.statemap.premium<-ds.zipmap.premium %>% group_by(state) %>% summarise(value=sum(value))
ds.statemap.premium$region<-ds.statemap.premium$state

g.state.premium<-state_choropleth(ds.statemap.premium,buckets = 9)+ggtitle("Premium per state")
g.state.premium.state<-ggplot(data=filter(ds.statemap.premium,value>300),aes(x=state,y=value))+geom_histogram(stat="identity")+scale_y_continuous(breaks=seq(0,30000,250))+ggtitle("Number of premium for the top states")+xlab("State name")+ylab("Number of premium")




ds.zipmap.not_prem<-ds %>% filter(!is.na(zip),zip!="",country_id=="United States",is_premium==0) %>% group_by(zip) %>% summarise(valuePostal=n())
ds.zipmap.not_prem$zip.fix<-clean.zipcodes(ds.zipmap.not_prem$zip)
ds.zipmap.not_prem<-ds.zipmap.not_prem %>% filter(!is.na(zip.fix)) %>% group_by(zip.fix) %>% summarise(value=sum(valuePostal))
ds.zipmap.not_prem$region<-ds.zipmap.not_prem$zip.fix
g.zip.not_prem.focus<-zip_map(ds.zipmap.not_prem,buckets=9,zoom    = c("california","florida","new york","texas"))+ggtitle("not_prem in California, Florida, New York and Texas")+scale_color_brewer(name="Customer", palette=8)
## Scale for 'colour' is already present. Adding another scale for 'colour', which will replace the existing scale.
g.zip.not_prem.all<-zip_map(ds.zipmap.not_prem,buckets=9)+ggtitle("United State not premium")+scale_color_brewer(name="Customer", palette=8)
## Scale for 'colour' is already present. Adding another scale for 'colour', which will replace the existing scale.
ds.zipmap.not_prem$city <- zipcode$city[match(as.character(ds.zipmap.not_prem$region), as.character(zipcode$zip))]

ds.zipmap.not_prem.plot<-ds.zipmap.not_prem %>% group_by(city) %>% summarise(value=sum(value)) %>% ungroup() %>% arrange(city)
g.zip.not_prem.city<-ggplot(data=filter(ds.zipmap.not_prem.plot,value>1500),aes(x=city,y=value))+geom_histogram(stat="identity")+scale_y_continuous(breaks=seq(0,30000,500))+ggtitle("Number of not premium for the top cities")+xlab("City name")+ylab("Number of not premium")

ds.zipmap.not_prem$state.abb <- zipcode$state[match(as.character(ds.zipmap.not_prem$region), as.character(zipcode$zip))]
ds.zipmap.not_prem$state <- state.regions$region[match(as.character(ds.zipmap.not_prem$state.abb), as.character(state.regions$abb))]
ds.statemap.not_prem<-ds.zipmap.not_prem %>% group_by(state) %>% summarise(value=sum(value))
ds.statemap.not_prem$region<-ds.statemap.not_prem$state

g.state.not_prem<-state_choropleth(ds.statemap.not_prem,buckets = 9)+ggtitle("not premium per state")
g.state.not_prem.state<-ggplot(data=filter(ds.statemap.not_prem,value>4000),aes(x=state,y=value))+geom_histogram(stat="identity")+scale_y_continuous(breaks=seq(0,30000,2500))+ggtitle("Number of not premium for the top states")+xlab("State name")+ylab("Number of not premium")



ds.zipmap.all<-ds.zipmap.not_prem
ds.zipmap.all$not_prem_value<-ds.zipmap.all$value
ds.zipmap.all$prem_value<-ds.zipmap.premium$value[match(ds.zipmap.all$city,ds.zipmap.premium$city)]
ds.zipmap.all$prem_value[is.na(ds.zipmap.all$prem_value)]<-0
ds.zipmap.all<-ds.zipmap.all %>% group_by(city) %>% summarise(percentPrem=sum(prem_value)/(sum(prem_value)+sum(not_prem_value)),observation=sum(prem_value)+sum(not_prem_value)) %>% ungroup() %>% arrange(city)

g.zip.conversion<-ggplot(data=(filter(ds.zipmap.all,observation>2000)),aes(x=city,y=percentPrem))+geom_histogram(stat="identity")+ggtitle("Conversion rate in city with more than 2k customers")+xlab("City name")+ylab("Conversion rate")

g.zip.observation<-ggplot(data=(filter(ds.zipmap.all,observation>2000)),aes(x=city,y=observation))+geom_histogram(stat="identity")+ggtitle("Number of customer in city with more than 2k customers")+xlab("City name")+ylab("Number of customer")


grid.arrange(g.state.not_prem,g.state.premium,g.state.not_prem.state,g.state.premium.state,ncol=2)

grid.arrange(g.zip.not_prem.all,g.zip.premium.all,g.zip.not_prem.focus,g.zip.premium.focus,ncol=2)

grid.arrange(g.zip.not_prem.city,g.zip.premium.city,ncol=2)

grid.arrange(g.zip.conversion,g.zip.observation)

We can see that in the USA we should focus on Houston and New York, which have respectively the highest conversion rate among “big cities” and the highest number of customers.
If we keep following our logic, we should try to attract young women with promotions.
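
The same ranking can be read directly from the aggregated table instead of the bars (a sketch on ds.zipmap.all built above):

#cities with more than 2k customers, ranked by conversion rate and then by raw volume
head(arrange(filter(ds.zipmap.all,observation>2000),desc(percentPrem)),10)
head(arrange(filter(ds.zipmap.all,observation>2000),desc(observation)),10)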

Churn

prem.fc_by_date_platform<- prem %>% filter(!is.na(date) & !is.na(platform)) %>% group_by(date,platform) %>% summarise(n=n()) %>% ungroup() %>% arrange(date)
churnplat<-ggplot(aes(x = date, y = n),data = prem.fc_by_date_platform) +   geom_line()+facet_wrap(~platform)+ggtitle("Premium log records over time per platform")+ylab("Number of records")+xlab("Date for year 2014")
#ggsave(file="churnByPlat.pdf",churnplat)
churn_www<-ggplot(aes(x = date, y = n),data = filter(prem.fc_by_date_platform,platform=="www",date>c("2014-04-01"))) +   geom_point()+ggtitle("Premium log records over time for www")+ylab("Number of records")+xlab("Date for year 2014")
grid.arrange(churnplat,churn_www)
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

#ggsave(file="churn_www.pdf",churn_www)

The churn data tells us that we have to focus on our www platform, as it is by far our biggest.
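
To back the www claim with a number, the premium-log records can simply be totalled per platform (a minimal sketch on the prem table):

#total number of premium_log records per platform
prem %>% filter(!is.na(platform)) %>% group_by(platform) %>% summarise(records=n()) %>% arrange(desc(records))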

Focusing on Canada and USA

ds.canada<-subset(ds,country_id=="Canada")
#canada premium vs world
sum(ds.canada$is_premium)/length(ds.canada$is_premium)*100
## [1] 4.218815
#premium vs all
sum(ds$is_premium)/length(ds$is_premium)*100
## [1] 3.847
ds.fc_by_date_canada<- ds.canada %>% filter(!is.na(premium_since.date)) %>% group_by(premium_since.date,is_premium) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_since.date)
#promocode ranking
sort(table(ds$promo_code),decreasing = T)[1:10]
## 
##                              Awards2012              5YEARS 
##              628797                5229                3906 
##           FLASHSALE       DFLovesCanada       ThanksMom2013 
##                1972                1821                 674 
##            DFSale25 2011_08_email_promo      DFAnnualSale25 
##                 667                 620                 524 
##     studentdiscount 
##                 503
#5YEARS and Awards2012 are the most popular


ggplot(data=ds,aes(x=date_joined.date,fill=gender.fix))+geom_histogram()+facet_wrap(~region_id)  
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(data=filter(ds,country_id=="United States" | country_id=="Canada" ),aes(x=date_joined.date,fill=gender.fix))+geom_histogram()+facet_wrap(~country_id)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

We can see that there is a peak for the United States and Canada. We saw this peak previously and found that it wasn’t related to a promo code. It would be interesting to investigate further what happened during this period (was there a launch of a mobile app?).

ds.prem.joined <- ds %>% filter(premium_join_delay>0,!is.na(gender.fix)) %>% group_by(premium_join_delay,gender.fix) %>% summarise(n=n()) %>% ungroup() %>% arrange(premium_join_delay)

ds.prem.joined.wide<-dcast(ds.prem.joined,premium_join_delay~gender.fix,value.var="n",fun.aggregate = sum, na.rm = TRUE)

g.prem.joined<-ggplot(data=ds.prem.joined.wide,aes(x=premium_join_delay, y=(M)))+geom_point(color="red")+geom_point(data=ds.prem.joined.wide,aes(x=premium_join_delay, y=F),color="blue")+ggtitle("Days between joining and becoming premium (blue=women, red=men)")+xlab("Time in number of days")+ylab("Number of premium converted")

g.prem.joined.zoom<-g.prem.joined+scale_y_continuous(breaks=seq(0,750,25))+scale_x_continuous(breaks=seq(0,750,25))+coord_cartesian(xlim=c(0,450),ylim=c(0,750))

grid.arrange(g.prem.joined,g.prem.joined.zoom,ncol=1)

We can see that customers become premium within their first year, and the longer we wait, the less likely they are to convert. Therefore, we should focus our advertising heavily on new users.
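
A quick number to back this up: the share of conversions that happen within the first year of membership (a sketch, using the premium_join_delay column computed earlier):

#share of premium conversions happening within 365 days of joining
delays<-filter(ds,premium_join_delay>0)$premium_join_delay
mean(delays<=365,na.rm=TRUE)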

In conclusion, we have seen that:

  • We should target women
  • We should target women aged 20 to 30, or 50+ family men in couples with young daughters
  • We should target Mexico, Peru and the Philippines
  • We should target Houston and New York
  • We should target people who have just registered
  • We should run more promo codes like 5YEARS or Awards2012 to retain users, as users who join with a promo code are less likely to leave