##Introduction
The question is: Tell us something interesting about the ping backs we receive from videos.
Input:
*Question asked before
*tsv data file
*pdf file with data format
Output:
*This document

I picked R for this analysis because it appeared to me that this is mainly an exploratory data analysis, and R Markdown + ggplot2 are, in my opinion, very convenient for that.

We will be proceeding in this order:
*Import the data
*Clean the data
*Explore the data
*Try to find a predictive model
*Explore the predictive features that were found important
*Draw assumptions from the findings

##1. Import the data

Quite easy here. The data being small, I can just use read.csv without resorting to data.table.

#library import
library(party)
library(MASS)
library(e1071)
library(softImpute)
library(ggplot2)
library(plyr)
library(dplyr)
library(gridExtra)
library(chron)
library(reshape2)
library(choroplethr)
library(zipcode)
library(choroplethrMaps)
library(stringi)
library(httr)
library(tools)
library(caret)
library(randomForest)
library(rpart)
library(ada)
library(ade4)
library(mlbench)
library(rJava)
library(FSelector)
library(gbm)
library(verification)
library(Amelia)
library(rworldmap)
ds<-read.csv("data/support_ping_hw_ua.tsv",header=F,sep="\t",stringsAsFactors=F)

##2. Clean the data

#Name the columns according to the pdf, with the addition of the python script output for the user agent detail
names(ds)<-c("id","date","user.agent","browser","country","domain","page.url","title","video.url","video.id","ping.embed","ping.play","ping.complete","ua.device.brand","ua.device.model","ua.device.family","ua.os.major","ua.os.patch_minor","ua.os.minor","ua.os.family","ua.os.patch","ua.ua.major","ua.ua.minor","ua.ua.family","ua.ua.patch")

##do some basic lookup of fields to see if they are useful
#I am omitting some parts as it would make this unnecessarily long
#It was said to be from the website
#Therefore let's check if all the data are from it
#table(ds[,"domain"])
#ok the table tells us that all data are from the same domain
#Therefore, it doesn't give us any useful information, let's remove it.
ds=ds[-which(names(ds)=="domain")]
#urls give us lots of useful information such as
#port, hostname of the video url, extension, the path to the video, query (if there is one), params,
#username, fragment, password, scheme as defined in http://tools.ietf.org/html/rfc1808.html
#let's break down the page url and video url

#This is a helper function to break down the urls
#Input: ds => data frame and field => field of data frame that contain url
#Output: a data frame of 10 variables of format field.variable where variable is:
#scheme,hostname,port,path,query,params,fragment,username,password,ext
geturl<-function(ds,field){
  url.struct<-sapply(ds[,field],parse_url)
  url.struct[sapply(url.struct, is.null)] <- NA 
  url.df<-data.frame(matrix(unlist((url.struct)), nrow=ncol(url.struct), byrow=T),stringsAsFactors=F)
  names(url.df)<-rownames(url.struct)
  url.df<-url.df %>% mutate(ext=(file_ext(as.character(path))))
  names(url.df)<-sapply(names(url.df),function(x) paste0(field,".",x))
  url.df

}

ds<-cbind.data.frame(ds,geturl(ds,"page.url"),geturl(ds,"video.url"))
ds$date<-as.numeric(as.POSIXct(ds$date))
#We tell R that empty string is NA
ds[ds==''] <- NA 
#One interesting thing that we can now extract from the data is whether the device is mobile or not
#It can be useful because the dataset is very small
ds<-ds%>%mutate(mobile=(as.numeric(!is.na(ua.device.brand))))
##A quick look at the pdf and at this command
#table((ds%>%mutate(n=(ping.embed+ping.play+ping.complete)))$n)
##shows me that it was coded as dummy variables; let's regroup them and add ping.not.embed for ease of use
ds<-ds%>%mutate(ping.not.embed=as.numeric((ping.embed+ping.play+ping.complete)==0),ping=(ifelse(ping.embed==1,"embed",ifelse(ping.play==1,"play",ifelse(ping.complete==1,"complete","not.embed")))))
##Finally, let's remove all the empty columns (the ones that only contain NA)
notEmpty<-names(which(sapply(ds[],function(x) !all(is.na(x)))))
ds<-ds[notEmpty]
ds[,notEmpty] <- lapply(ds[,notEmpty] , function(x) factor(x,exclude=NA))

Here are all our final features after the data cleaning:
id, date, user.agent, browser, country, page.url, title, video.url, video.id, ping.embed, ping.play, ping.complete, ua.device.brand, ua.device.model, ua.device.family, ua.os.major, ua.os.minor, ua.os.family, ua.os.patch, ua.ua.major, ua.ua.minor, ua.ua.family, ua.ua.patch, page.url.scheme, page.url.hostname, page.url.path, page.url.query, page.url.fragment, page.url.ext, video.url.scheme, video.url.hostname, video.url.port, video.url.path, video.url.query, video.url.ext, mobile, ping.not.embed, ping
Some of them are redundant (or the same but in more detail), but it helps to have a direct accessor when we want to explore them in more depth.

The last decision is a tough one: should I keep the empty observations or not? To answer that, we first need to estimate their importance.
Let's look at them.
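The missingness overview was presumably produced along these lines, using Amelia's missmap (the package is loaded above); the second line reproduces the per-column missing rates quoted below. A sketch:

#Visual map of the missing values per column
missmap(ds, main="Missingness map of the cleaned dataset")
#Percentage of missing values per column
round(100*colMeans(is.na(ds)), 3)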

We can see that most of the missing data come from the fields I generated. This is not actually a problem, since they are mainly here to give extra information if we want to be very specific about a certain observation.
Additionally, we can clearly see a pattern in the missing values: if we take the column video.id, 0.895% of the data is missing, which is really small. Likewise, for the video extension 2.805% of the data is missing, which is relatively small.
These 348 observations actually correspond to all the www.youtube.com videos. Removing them wouldn't allow us to predict any ping from www.youtube.com.
Where it becomes trickier is for the title.
57.493% of the video titles are missing.
What to do then? A quick glance at the data tells me that there is no video with the same id but a different title (see the check below). Therefore, when I predict the data, I'll just omit the title, which would otherwise overfit the model. As for video.id, I think it should be examined separately (using a knn or another model) to see if there is any special pattern that could explain the missingness.
The trickiest question is whether or not to remove the youtube videos which have no extension; doing that would deprive the model of potentially useful information. I have decided not to.
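A sketch of the title-consistency check mentioned above: count the distinct non-missing titles per video id; any count above one would contradict the claim.

#Check that no video.id maps to two different titles
title.check <- ds %>% group_by(video.id) %>% summarize(n.titles = n_distinct(title, na.rm=TRUE))
table(title.check$n.titles)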

ds.original<-ds

ds<-ds %>% filter(!is.na(video.id))

##3. Explore the data

#Set the seed for reproductible example
set.seed(2015)

####We will be looking at the ping. Can we predict it from our dataset?

The idea is to see whether we can reasonably predict some state of ping and, if so, which important variables drive our prediction. In order to do that I am going to use multiple classifiers: AdaBoost, SVM, GBM and random forest.

First we reduce the number of features to keep only the ones that have significance, using a wrapper approach for subset selection.

##Note:
#This wrapper approach wasn't actually giving me better results. I assumed it was because of the highly unbalanced data. I ended up doing the selection myself by trial and error.
#Wrapper approach for feature reduction using hill climbing search
#Input: data, y (binary)
#Output: list of selected factors reducing the error rate
feature.reducer<-function(data=ds,y){
  evaluator <- function(subset) {
    #k-fold cross validation
    k <- 5
    splits <- runif(nrow(data))
    results = sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- data[test.idx, , drop=F]
    train <- data[train.idx, , drop=F]
    tree <- rpart(as.simple.formula(subset, y), train)
    error.rate = sum(with(test, get(y)) != predict(tree, test,type="c")) / nrow(test)
    return(1 - error.rate)
    })  
    return(mean(results))
  }
  subset <- hill.climbing.search(names(data)[-which(names(data)==y)], evaluator)
  subset
}

I also chose not to use a separate training sample for the feature selection, as the data is already very small.
Finally, one of the main issues with wrapper methods is overfitting, and there are features that are obviously overfitting/irrelevant.
Therefore, I decided to remove all the ua fields before computing it (except the OS; the browser fields would give me more levels, but the dataset being too small I decided not to use them since they would create too much imbalance in the data).
I am also removing information that just wouldn't make sense (i.e. the date: the data covers only one day, and we don't have the "local" time but only the standardized time +0000 for all observations).

Remove other factors that are irrelevant for the analysis:
*title (explained before)
*page.url.scheme has only one level, which is http
*page.url.path has no meaning for predicting
*page.url.ext has only one level, which is html
*page.url.fragment
*page.url.query
*video.url.port covers only 2.9363917% of the data and is not relevant
*date is more debatable, but the data covering only one day and lacking a local time zone, I don't see it making any sense. The only thing we could get out of it is whether we have no ping at all during a certain period, which could mean a server failure; but this belongs to observation rather than prediction (we can't predict a server failure from this data), which is what we are trying to achieve now
*ping.embed, ping.play, ping.complete, ping, which are our labels
*id: even though this one is interesting, the high number of levels (1355) makes it a bad predictor (even using svm or gbm).

#Get the ua names index that was generated
notua<-notEmpty[!grepl("^ua\\.",notEmpty)]
#remove other factors that have very few meaning

ind.features.filter<-notua[is.na(match(notua,c("page.url.scheme","page.url.fragment","page.url.query","page.url.path","page.url.ext","date","video.url.port","video.url.path","video.url.query","ping.embed","ping.play","ping.complete","ping.not.embed","ping","title","video.url","page.url","id")))]

ind.features.filter<-c(ind.features.filter,"ua.os.family")

#I went through several trials and realized that this step wasn't necessary for this analysis
#However, I didn't remove it, as I think the feature selection is key to the process and I wanted to show that I actually did work on it.
#feature.ping.reduced.ping<-ind.features.filter
#feature.ping.reduced.ping.not.embed<-feature.reducer(ds[c(ind.features.filter,"ping.not.embed")],"ping.not.embed")
#feature.ping.reduced.ping.embed<-feature.reducer(ds[c(ind.features.filter,"ping.embed")],"ping.embed")
#feature.ping.reduced.ping.play<-feature.reducer(ds[c(ind.features.filter,"ping.play")],"ping.play")
#feature.ping.reduced.ping.complete<-feature.reducer(ds[c(ind.features.filter,"ping.complete")],"ping.complete")

Here are our features:
user.agent, browser, country, video.id, page.url.hostname, video.url.scheme, video.url.hostname, video.url.ext, mobile, ua.os.family
If we look into their levels we can see that we have features with very high dimensionality (see the count below).
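A quick sketch to show the cardinality problem, counting the factor levels of each selected feature:

#Number of levels per selected feature
sapply(ds[ind.features.filter], function(x) length(levels(x)))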
In order to deal with high-dimensional data, we will be using three classifiers that deal nicely with high-cardinality categorical features and use binary classification if needed:
*Adaboost
*Generalized Boosted Regression Model (GBM)
*SVM

##Setting training and testing set vars

#Creating a function to build my training and test samples from a subset of the data
#Input: data, name of features to keep, labels (what we want to predict), training.percentage
#Output: assignment of global variables train,test with this pattern train.labelname,test.labelname
#Warning: will override any existing var with the same name
createSets<-function(data,ind.features.filter,labels,training.percentage=0.8){
n <- nrow(data)
ind <- sample(1:n)
trainval <- ceiling(n * training.percentage)
  lapply(labels,function(x){
    train<-data[ind[1:trainval],c(ind.features.filter,x)]
    test <- data[ind[(trainval + 1):n],c(ind.features.filter,x)]
    namTrain <- paste("train", x, sep = ".")
    namTest <- paste("test", x, sep = ".") 
    assign(namTrain,train,envir = .GlobalEnv)
    assign(namTest,test,envir = .GlobalEnv)
    })
}
labels<-c("ping.not.embed","ping.embed","ping.play","ping.complete")
invisible(createSets(ds,ind.features.filter,labels,0.8))

####Adaboost

############      AdaBoost   ###############
#I picked adaboost because it is supposed to be a great mix of accuracy and speed for high-dimensional predictors.

#Helper function that trains an AdaBoost classifier on a train/test/validation split
#Input: dataframe, name of y in dataframe, type of boost you want to apply, number of iterations and depth
#Output: classifier
clfAda<-function(data=ds,y,adatype="gentle",adaiter=70,cdepth=14){
  n <- nrow(data)
  indy<-which(names(data)==y)
  ind <- sample(1:n)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  trainval <- ceiling(n * .5)
  testval <- ceiling(n * .3)
  train <- data[ind[1:trainval],]
  test <- data[ind[(trainval + 1):(trainval + testval)],]
  valid <- data[ind[(trainval + testval + 1):n],]
  control <- rpart.control(cp = -1, maxdepth = cdepth,maxcompete = 1,xval = 0)
  clf <- ada(fmla, data = train, test.x = test[,-indy], test.y = test[,indy], type =adatype, control = control, iter = adaiter)
  clf <- addtest(clf, valid[,-indy], valid[,indy])
  clf
}

#AdaBoost expects a binary response, therefore we use our dummy variables only

##Now we compute our adaboost and do our prediction and confusion matrix
#TODO use parallel to make it faster
#Take labels and assign variables after doing adaboost, prediction, confusion matrix and the variable importance
invisible(lapply(labels,function(x){
  xtest<-select(get(paste("test",x,sep=".")),-matches(x))
  ytest<-select(get(paste("test",x,sep=".")),matches(x))
  #classifier
  ada.clf<-clfAda(data=get(paste("train",x,sep=".")),y=x)
  #prediction
  ada.pred<-predict(ada.clf,xtest)
  #confusion matrix
  ada.conf<-table(ada.pred,ytest[,1])
  #var importance
  ada.vip<-varplot(ada.clf,plot.it=FALSE,type='scores')
  #Create the variables names
  name.clf<-paste("ada.clf",x,sep=".")
  name.pred<-paste("ada.pred",x,sep=".")
  name.conf<-paste("ada.conf",x,sep=".")
  name.vip<-paste("ada.vip",x,sep=".")
  #store it
  assign(name.clf,ada.clf,envir = .GlobalEnv)
  assign(name.pred,ada.pred,envir = .GlobalEnv)
  assign(name.conf,ada.conf,envir = .GlobalEnv)
  assign(name.vip,ada.vip,envir = .GlobalEnv)
}))

#Let's check if we found correct models

Note that usually, when we don't have a specific question, we try to maximize the accuracy by adding/removing features (if the classifier doesn't do it for you).

Result for:
*Not Embed

## Call:
## ada(fmla, data = train, test.x = test[, -indy], test.y = test[, 
##     indy], type = adatype, control = control, iter = adaiter)
## 
## Loss: exponential Method: gentle   Iteration: 70 
## 
## Training Results
## 
## Accuracy: 0.767 Kappa: 0.539 
## 
## Testing Results
## 
## Accuracy: 0.718 Kappa: 0.442 
## 
## Accuracy: 0.709 Kappa: 0.428

We can see that the model is correct (moderately high Accuracy/Kappa).

Let’s check if we had enough iterations:
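The iteration plot isn't reproduced here; presumably it came from ada's plot method, along these lines (a sketch):

#Plot accuracy/kappa across boosting iterations for the train and test sets
plot(ada.clf.ping.not.embed, kappa=TRUE, test=TRUE)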

Yes, it's ok.

Let's look into the confusion matrix on the test set.
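The statistics below presumably come from caret's confusionMatrix applied to the table stored earlier, as is done explicitly in the SVM section:

confusionMatrix(ada.conf.ping.not.embed)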

## Confusion Matrix and Statistics
## 
##         
## ada.pred   0   1
##        0 812 181
##        1 508 957
##                                           
##                Accuracy : 0.7197          
##                  95% CI : (0.7015, 0.7374)
##     No Information Rate : 0.537           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4472          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6152          
##             Specificity : 0.8409          
##          Pos Pred Value : 0.8177          
##          Neg Pred Value : 0.6532          
##              Prevalence : 0.5370          
##          Detection Rate : 0.3303          
##    Detection Prevalence : 0.4040          
##       Balanced Accuracy : 0.7281          
##                                           
##        'Positive' Class : 0               
## 

The model prediction is correct.
Let's then look at the variable importance.
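The importance plot isn't reproduced; a sketch of it, assuming the scores stored earlier in ada.vip.ping.not.embed (a named vector returned by varplot):

#Plot AdaBoost variable-importance scores
vip <- data.frame(var=names(ada.vip.ping.not.embed), score=as.numeric(ada.vip.ping.not.embed))
ggplot(vip, aes(x=reorder(var,score), y=score)) + geom_bar(stat="identity") + coord_flip() + xlab("Variable") + ylab("Importance score")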

It therefore seems that being on mobile, the protocol (scheme) and the OS used are what most influence whether or not we will load the video. This is common sense, since a mobile/computer may or may not be able to read a specific protocol.

*Embed

I am going to go a bit faster here. I did check the iterations/kappa/accuracy of the classifier, which are correct. The confusion matrix is also correct. Therefore, I can show you the variable importance.

We can see that the country plays an important role, which surprised me at first; I then figured out that it was most likely because the data is unbalanced. After that, the classifier ranked scheme and operating system.

*Play

Kappa is really low, and therefore the test fails to give any meaningful result.

*Complete

Same thing.

An interesting fact is that the sensitivity (with 0 (not) being the positive class) was really high for both of them.
This tells us that the model is good at knowing when a video won't be played or completed, which is consistent with the two other states. Additionally, we can explain it by the high imbalance of the data.

We have seen that some predictions were inaccurate. We have two main possibilities:
*Change the features (we could use a feature-ranking approach such as chi-squared or information gain, since the wrapper one didn't give effective results; see the sketch below) or transform the features
*Use another model (which can also help with feature selection)
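A sketch of such a filter-ranking step with FSelector (loaded above); the label and the cutoff of 5 are assumptions for illustration:

#Rank features by chi-squared statistic against the not-embed label
weights <- chi.squared(ping.not.embed ~ ., data=train.ping.not.embed)
print(weights)
#Keep the 5 best-ranked features (arbitrary cutoff)
cutoff.k(weights, 5)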

AdaBoost was a great example; I'm going to go faster now for the other classifiers.

####GBM

############      GBM   ###############
##I picked GBM because it is robust and relatively fast
#It also allows features to have up to 1024 levels (unlike randomForest)
#Finally, it has won several kaggle competitions thanks to its high accuracy

invisible(createSets(ds,ind.features.filter,"ping",0.8))
labels.all<-c("ping",labels)

xTrain<-train.ping %>% select(-ping,-user.agent)
xTest<-test.ping %>% select(-ping,-user.agent)

yTrain<-train.ping[,which(names(train.ping)=="ping")]
yTest<-test.ping[,which(names(train.ping)=="ping")]

#a high guess of how many trees we'll need
ntrees = 500


#An optimization would be to do cross validation to set the parameters
gbm.model <- gbm.fit(
x = xTrain 
, y = yTrain 
#x and y instead of using a formula
, distribution = "multinomial"
#use bernoulli for binary outcomes
#other values are "gaussian" for GBM regression
#or "adaboost"
, n.trees = ntrees
#We chose this value to be large; we will prune the
#trees after running the model
, shrinkage = 0.01
#smaller values of shrinkage typically give slightly better performance
#the cost is that the model takes longer to run for smaller values
, interaction.depth = 3
#TODO: use cross validation to choose interaction depth
, n.minobsinnode = 300
#n.minobsinnode has an important effect on overfitting
#decreasing this parameter increases the in-sample fit,
#but can result in overfitting
, nTrain = round(nrow(xTrain) * 0.8)
#select the number of trees at the end
# ,var.monotone = c()
#can help with overfitting, will smooth bumpy curves
, verbose = F #print the preliminary output
) 

#Verify we had enough trees
gbm.perf(gbm.model, plot.it = TRUE)
## Using test method...

## [1] 500
#Relative influence among the variables can be used for variable selection or for seeing if
#parameter tuning is needed (i.e. if there is overfitting)
summary(gbm.model)

##                                   var    rel.inf
## video.id                     video.id 67.0141245
## country                       country 23.7720645
## page.url.hostname   page.url.hostname  5.3680470
## ua.os.family             ua.os.family  3.6998816
## browser                       browser  0.1458824
## video.url.scheme     video.url.scheme  0.0000000
## video.url.hostname video.url.hostname  0.0000000
## video.url.ext           video.url.ext  0.0000000
## mobile                         mobile  0.0000000
#Print tree
#pretty.gbm.tree(gbm.model)


##The rest is commented out since it was more for exploration than for drawing conclusions
#look at the effects of each variable
#for(i in 1:length(gbm.model$var.names)){
#  print(i)
#plot(gbm.model, i.var = i
#, ntrees = gbm.perf(gbm.model, plot.it = FALSE) #optimal number of trees
#, type = "response" #to get fitted probabilities
#)
#}
#Try to predict our test variables on the best tree
#gbm.test.predict<- predict(object = gbm.model,newdata =select(test.ping,-ping,-user.agent)
#, n.trees = gbm.perf(gbm.model, plot.it = FALSE)
#, type = "response")

#roc.area(select(test.ping.embed,ping.embed)[,1], gbm.test.predict)$A
#prop.table(table(matrix(gbm.test.predict)[,1],(select(test.ping.embed,ping.embed)[,1])))

#table(matrix(gbm.test.predict))

#str(gbm.test.predict)
#dim()
#str(data.frame(gbm.test.predict))
#str(as.data.frame(as.matrix(table(gbm.test.predict))))
#str(as.matrix(gbm.test.predict))

GBM is actually not working well at all here. Looking at its tree and summary, we can see too much importance placed on the first variable. I tried to tweak the parameters a lot and change the features, but it just would not work for this model. I therefore turned to SVM, which is also well suited to a problem with little data and many dimensions.

####SVM

#I did include my code for svm, but I actually mostly used it during my exploratory phase when I was trying to do a rapid feature selection
##Need to do cross validation, parameter tuning and refactoring of the code (see the tuning sketch below)
clfSvm<-function(data,y){
  indy<-which(names(data)==y)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  svm(formula=fmla,data=data,type="C-classification",na.action=na.exclude)  
}
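A sketch of the parameter tuning mentioned above, using e1071's tune.svm; the grid values are assumptions for illustration (left commented, as it was not run):

#Grid-search gamma and cost by cross validation on the not-embed training set
#tuned <- tune.svm(ping.not.embed ~ ., data=train.ping.not.embed, gamma=10^(-3:0), cost=10^(0:2))
#summary(tuned)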

invisible(lapply(c(labels,"ping"),function(x){
  xtest<-select(get(paste("test",x,sep=".")),-matches(x))
  ytest<-select(get(paste("test",x,sep=".")),matches(x))
  #classifier
  svm.clf<-clfSvm(data=get(paste("train",x,sep=".")),y=x)
  #prediction
  svm.pred<-predict(svm.clf,xtest,na.action=na.exclude)
  #confusion matrix
  svm.conf<-table(matrix(svm.pred)[,1],ytest[,1])
  #Create the variables names
  name.clf<-paste("svm.clf",x,sep=".")
  name.pred<-paste("svm.pred",x,sep=".")
  name.conf<-paste("svm.conf",x,sep=".")
  #store it
  assign(name.clf,svm.clf,envir = .GlobalEnv)
  assign(name.pred,svm.pred,envir = .GlobalEnv)
  assign(name.conf,svm.conf,envir = .GlobalEnv)

}))



confusionMatrix(svm.conf.ping.not.embed)
## Confusion Matrix and Statistics
## 
##    
##       0   1
##   0 906 397
##   1 377 734
##                                          
##                Accuracy : 0.6794         
##                  95% CI : (0.6603, 0.698)
##     No Information Rate : 0.5315         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.3555         
##  Mcnemar's Test P-Value : 0.4946         
##                                          
##             Sensitivity : 0.7062         
##             Specificity : 0.6490         
##          Pos Pred Value : 0.6953         
##          Neg Pred Value : 0.6607         
##              Prevalence : 0.5315         
##          Detection Rate : 0.3753         
##    Detection Prevalence : 0.5398         
##       Balanced Accuracy : 0.6776         
##                                          
##        'Positive' Class : 0              
## 
#confusionMatrix(svm.conf.ping.embed)
#summary(svm.clf.ping)
#str(svm.clf.ping)

#Things I didn't do: tune and tweak it.

Looking at the prediction (confusion matrix), SVM offers slightly better results than AdaBoost (a quick side-by-side is sketched below). Finally, I was wondering what random forest would give me, just to be sure which features to look into in more detail.
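A quick comparison of overall accuracy computed from the stored confusion matrices (diagonal over total); a sketch:

#Overall accuracy of AdaBoost vs SVM on the not-embed label
sapply(list(ada=ada.conf.ping.not.embed, svm=svm.conf.ping.not.embed), function(m) sum(diag(m))/sum(m))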

####Random Forest

##The idea behind using random forest was that it is also a great tool for feature selection, since the built-in algorithm will automatically put aside predictors it doesn't like.
##However, you have to create dummy variables because some predictors have too many levels.
##It still is really heavy because it creates a lot of dummy variables, and my old computer
##really did not enjoy it.
##The main variable of interest was country; I could have transformed it into continent to reduce the dimension.
##However, I feel like the final result would give information that is way too broad.
#Another alternative, which I used, was to drop it and only keep variables with less than 54 classes

##dummy creation unused
#train.rf.dummy<-acm.disjonctif(train[,-c("ping.embed","ping.play","ping.complete")])
#train.rf.dummy<-cbind.data.frame(train[,c("ping.embed","ping.play","ping.complete")],train.rf.dummy)

#reduce the subset
ind.features.filter.rf<-names(ds[,unlist(lapply(ds,function (x)length(levels(x))<50))])
ind.features.filter.rf<-intersect(ind.features.filter.rf,ind.features.filter)

#random forest classifier
clfRf<-function(data,y){
  indy<-which(names(data)==y)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  randomForest(formula=fmla,data=data,na.action=na.roughfix,importance=T)
}

invisible(lapply(c(labels,"ping"),function(x){
  test<-get(paste("test",x,sep="."))[,c(ind.features.filter.rf,x)]
  xtest<-select(test,-matches(x))
  ytest<-select(test,matches(x))
  train<-get(paste("train",x,sep="."))[,c(ind.features.filter.rf,x)]
  #classifier
  rf.clf<-clfRf(data=train,y=x)
  #prediction
  rf.pred<-predict(rf.clf,xtest,na.action=na.exclude)
  #confusion matrix
  rf.conf<-table(matrix(rf.pred)[,1],ytest[,1])
  #Create the variables names
  name.clf<-paste("rf.clf",x,sep=".")
  name.pred<-paste("rf.pred",x,sep=".")
  name.conf<-paste("rf.conf",x,sep=".")
  #store it
  assign(name.clf,rf.clf,envir = .GlobalEnv)
  assign(name.pred,rf.pred,envir = .GlobalEnv)
  assign(name.conf,rf.conf,envir = .GlobalEnv)
}))


#Observations
varImpPlot(rf.clf.ping.embed)

confusionMatrix(rf.conf.ping.embed)
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1562  289
##   1  214  349
##                                           
##                Accuracy : 0.7916          
##                  95% CI : (0.7749, 0.8077)
##     No Information Rate : 0.7357          
##     P-Value [Acc > NIR] : 9.952e-11       
##                                           
##                   Kappa : 0.4432          
##  Mcnemar's Test P-Value : 0.0009686       
##                                           
##             Sensitivity : 0.8795          
##             Specificity : 0.5470          
##          Pos Pred Value : 0.8439          
##          Neg Pred Value : 0.6199          
##              Prevalence : 0.7357          
##          Detection Rate : 0.6471          
##    Detection Prevalence : 0.7668          
##       Balanced Accuracy : 0.7133          
##                                           
##        'Positive' Class : 0               
## 

My observations from the confusion matrix and the variable importance for its best prediction (which was ping.embed) showed me that random forest thinks that the OS, page domain and browser are of interest.

#####Prediction conclusions:

After many iterations in R with different ranking methods, I found out that it wasn't possible to predict whether someone plays a video or completes it at an accurate rate. I even transformed the response by creating a new variable "launched" (which was play OR complete), but the results weren't better (using naive Bayes or AdaBoost with many different feature combination sets). I concluded that the information given wasn't enough, and I could only predict whether the video was embedded or not embedded. This is due to a really unbalanced dataset.
Nevertheless, I found out that the OS, scheme, page hostname and country were of interest.
The next step is to dig into these variables to find some fun information.
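For reference, a sketch of how the "launched" response mentioned above could be built (ping.play and ping.complete are factors at this point, hence the string comparison):

#launched = the video was at least played (played OR completed)
ds.launched <- ds %>% mutate(launched = factor(as.numeric(ping.play == "1" | ping.complete == "1")))
table(ds.launched$launched)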

Since this analysis was supposed to be just a quick glance, I decided to plot only two or three charts and not go into detail.

##Representations

Let's start with a world map. We have seen that the country was supposed to be related to embed.

map.country<-ds %>% group_by(country) %>% summarize(n=n())
jc<-joinCountryData2Map(map.country,nameJoinColumn="country",joinCode="ISO2")
## 103 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 140 codes from the map weren't represented in your data
mapCountryData(jc, nameColumnToPlot="n", mapTitle="Number of Observation per country",
  addLegend=TRUE,catMethod='pretty',
  oceanCol="lightblue", missingCountryCol="lightgrey")
## You asked for 7 categories, 6 were used due to pretty() classification

We see that our data is actually highly unbalanced and therefore must be dominated by American behavior.
If you look in more detail and take the countries that have more than 1% of the data, you find that there are only Indonesia and the US:

ggplot(data=filter(map.country,n>.01*nrow(ds)),aes(x=country,y=n))+geom_bar(stat="identity")+ggtitle("Number of observations for countries that have more than 1% of the data")+ylab("Number of observations")

A lot more could be done, and now comes the most fun part. I quickly noticed that the US would either not load the video at all or play/complete it (see the quick check below).
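A sketch of a quick way to see this, comparing the ping distribution for the US against all other countries:

#Ping distribution: US vs the rest of the world
round(prop.table(table(ifelse(ds$country=="US","US","other"), ds$ping), 1), 3)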

Note: I didn't have time to finish the analysis, but here are some tables that are interesting, even if we should filter them first because sometimes we have only one observation.

table(ds$ua.ua.family)
## 
##           Android            Chrome     Chrome Mobile Chrome Mobile iOS 
##                28              3971                49                 2 
##          Chromium           Firefox    Firefox Mobile         Iceweasel 
##                27              6306                 1                 8 
##                IE         IE Mobile              Iron     Mobile Safari 
##              1029                13                 1               178 
##             Opera      Opera Mobile            Puffin            Safari 
##                52                 1                 1               625 
##        UC Browser 
##                 2
prop.table(table(ds$browser,ds$ping),1)
##                              
##                                 complete      embed  not.embed       play
##   Android Stock Browser       0.03571429 0.50000000 0.25000000 0.21428571
##   Chrome                      0.04543210 0.46543210 0.32098765 0.16814815
##   ChromeiOS                   0.00000000 1.00000000 0.00000000 0.00000000
##   Firefox                     0.13792557 0.11718131 0.58353127 0.16136184
##   Microsoft Internet Explorer 0.11228407 0.23704415 0.44817658 0.20249520
##   Opera                       0.01923077 0.57692308 0.25000000 0.15384615
##   Other                       0.00000000 0.75000000 0.00000000 0.25000000
##   Safari                      0.06491885 0.39575531 0.33957553 0.19975031
prop.table(table(ds$video.url.ext,ds$ping),1)
##       
##           complete       embed   not.embed        play
##   flv  0.000000000 1.000000000 0.000000000 0.000000000
##   m3u8 0.013145540 0.399061033 0.399061033 0.188732394
##   m4a  0.033426184 0.635097493 0.169916435 0.161559889
##   mp3  0.009009009 0.252252252 0.576576577 0.162162162
##   mp4  0.114104291 0.226476814 0.490090437 0.169328459
##   smil 0.017391304 0.408695652 0.460869565 0.113043478
prop.table(table(ds$video.url.hostname,ds$ping),1)
##                                
##                                   complete      embed  not.embed
##   content.bitsontherun.com      0.07005348 0.47593583 0.27219251
##   content.jwplatform.com        0.09513606 0.33104778 0.39168631
##   d1s3yn3kxq96sy.cloudfront.net 0.04687500 0.54687500 0.23437500
##   demo.jwplayer.com             0.14666352 0.01438340 0.68380099
##   dev.wowza.longtailvideo.com   0.01379310 0.42758621 0.35862069
##   fms.12E5.edgecastcdn.net      0.00937500 0.50937500 0.37500000
##   playertest.longtailvideo.com  0.01785714 0.32142857 0.53571429
##   testseedbox.tk                0.00000000 0.00000000 0.50000000
##   videos-jp.jwpsrv.com          0.00000000 0.50000000 0.00000000
##   wowza.longtailvideo.com       0.00000000 0.73364486 0.00000000
##   www.att.com                   0.04000000 0.32000000 0.36000000
##   www.longtailvideo.com         0.01739130 0.40869565 0.46086957
##   www.youtube.com               0.04641350 0.60337553 0.19831224
##                                
##                                       play
##   content.bitsontherun.com      0.18181818
##   content.jwplatform.com        0.18212985
##   d1s3yn3kxq96sy.cloudfront.net 0.17187500
##   demo.jwplayer.com             0.15515209
##   dev.wowza.longtailvideo.com   0.20000000
##   fms.12E5.edgecastcdn.net      0.10625000
##   playertest.longtailvideo.com  0.12500000
##   testseedbox.tk                0.50000000
##   videos-jp.jwpsrv.com          0.50000000
##   wowza.longtailvideo.com       0.26635514
##   www.att.com                   0.28000000
##   www.longtailvideo.com         0.11304348
##   www.youtube.com               0.15189873
prop.table(table(ds$video.url.scheme,ds$ping),1)
##       
##         complete     embed not.embed      play
##   http 0.1021380 0.2568064 0.4696843 0.1713713
##   rtmp 0.0093750 0.5093750 0.3750000 0.1062500
prop.table(table(ds$mobile,ds$ping),1)
##    
##       complete      embed  not.embed       play
##   0 0.10120884 0.25752397 0.47186328 0.16940392
##   1 0.04013378 0.49832776 0.28093645 0.18060201