Thursday, January 1, 2015


Abstract:

In this project we tackled the Click-Through Rate (CTR) prediction problem, one of the most exciting and important applications of Machine Learning: given an advertisement, predict whether or not users will click on it. The task is easy to define but a bit more complicated than it sounds.
As a team we favored more than one approach to this problem, so our strategy was to decentralize the task: each member worked individually and reported progress back to the group at regular intervals. This way every team member could make a fair contribution without giving up the programming languages and algorithms they were most comfortable with. My preferred languages are Java and MATLAB.
I used Java for pre-processing the training data (which is genuinely Big Data) and then used MATLAB scripts to implement a Machine Learning algorithm called the 'Naive Bayes classifier' for the CTR prediction!



Motivation:

Prof. Fei Sha and Prof. Yan Liu, who teach the Machine Learning course at the University of Southern California (USC), came across a competition on Kaggle whose problem statement was exactly our topic, and so we participated in this world-wide Kaggle competition.



Approach:

The training data was Big Data: around 40 million rows with 24 columns each. One of the 24 columns was the advertisement ID and one was the binary label (clicked: 1, not clicked: 0); the other 22 columns represented the 22 features of the advertisement to be considered in our prediction model. The test data had around 4 million rows with 23 columns: the label column was excluded, and it was our job to fill it in with as high accuracy as possible.
I chose to go with the generative approach. I created two probabilistic models out of the training data, one for all the clicked ads and another for the ads that were not clicked. These models are the backbone of my Naive Bayes classifier. Let Y represent the label and X the features; then the two models are P(Y=0|X) and P(Y=1|X). Under the Naive Bayes assumption of strong independence between the features, P(Y|X) is proportional to P(Y) times the product of P(X_j|Y) over the 22 features, so both models can be estimated from simple counts. We then go to each row in the test data, see which of the two models it matches more closely, label that row accordingly, and move on to the next row.
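
To make that decision rule concrete, here is a minimal MATLAB sketch for scoring a single test row. It assumes two hypothetical variables that are built later from the counts: prior (2x1, the probabilities of the two labels) and condProb (22 x number-of-unique-values x 2, the conditional probabilities P(X_j = value | Y)); x(j) holds the unique-value index of feature j for the test row. The comparison is done in log space to avoid numerical underflow when multiplying 22 small probabilities.

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
score = log(prior);                        % start from log P(Y=0) and log P(Y=1)
for j = 1:22
    for y = 1:2
        score(y) = score(y) + log(condProb(j, x(j), y));   % add log P(X_j | Y)
    end
end
[~, best] = max(score);                    % pick the closer model
predicted_label = best - 1;                % back to the 0/1 labels
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>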


Handling the Big Data: 

Training the Naive Bayes classifier on 40 million rows of training data in a single go would be a hectic job for a program running on a system with an AMD quad-core processor and 4 GB of RAM. With 750 GB of hard disk capacity, space complexity is not as much of an issue as time complexity. One feasible way to tackle this is to borrow help from an open-source framework, Apache Hadoop, which is designed specifically to process Big Data. I, however, decided to train my classifier the hard way, i.e., to adopt a divide-and-conquer technique: build the models on each divided section and finally combine the results recursively.
I wrote a basic Java program to divide the approximately 8 GB file holding 40 million rows of data into around 200 files of 200,000 rows each. The Naive Bayes classifier can then be trained on these divided files individually and their results combined. Note that Hadoop does the same task behind the scenes, and maybe even more efficiently, but I opted to do it myself for the sake of learning.
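
My splitter was written in Java, but just to illustrate the idea in the same language as the rest of the sketches in this post, the split can also be done in MATLAB by streaming the file line by line and starting a new output file every 200,000 rows (the file names here are hypothetical):

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
fin = fopen('train.csv', 'r');
header = fgetl(fin);                        % keep the header row for every part
rowsPerFile = 200000;
fileNum = 0;
line = fgetl(fin);
while ischar(line)                          % fgetl returns -1 (not a char) at end of file
    fileNum = fileNum + 1;
    fout = fopen(sprintf('train_part_%03d.csv', fileNum), 'w');
    fprintf(fout, '%s\n', header);
    count = 0;
    while ischar(line) && count < rowsPerFile
        fprintf(fout, '%s\n', line);
        count = count + 1;
        line = fgetl(fin);
    end
    fclose(fout);
end
fclose(fin);
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>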



Naive Bayes Classifier:

The strong independence assumption between the features is its only vulnerable point. However, once it has been trained on 40 million rows of data, I believe it can predict with higher accuracy than other generative algorithms.
Here I created a 3-dimensional data structure, which I call matrix 'A', and another vector of size 2, which I call 'B'. Matrix A and vector B are described as follows.

Matrix A has 3 dimensions (feature, unique value of that feature, label) and vector B has size 2x1. Both are initialized with all zeros and hold counts only. A count in A is incremented with respect to 3 factors, i.e. the feature, its unique value, and the label; the counts in B are simply the counts of the two labels. For each row of training data we pick the appropriate cells and increment their counts. The procedure, written as a MATLAB sketch, is as follows:

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
% Counting step for one divided file of 200,000 rows. Assumes the file has
% already been loaded: features is a 200,000 x 22 cell array of feature values,
% labels is a 200,000 x 1 vector of 0/1 labels, uniqueVals{j} lists the unique
% values of feature j, and maxUnique is the largest number of unique values.
A = zeros(22, maxUnique, 2);            % counts per (feature, unique value, label)
B = zeros(2, 1);                        % counts per label
for i = 1:size(features, 1)             % 200,000 rows of data
    y = labels(i) + 1;                  % MATLAB is 1-indexed: label 0 -> 1, label 1 -> 2
    B(y) = B(y) + 1;                    % count the label once per row
    for j = 1:22                        % 22 features
        % match the value of feature j to its unique value and keep its index
        unique_index = find(strcmp(uniqueVals{j}, features{i, j}), 1);
        A(j, unique_index, y) = A(j, unique_index, y) + 1;
    end
end
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

These matrices A and B were calculated for every divided file of 200,000 rows, and then all the counts were combined recursively. Once the classifier had been trained on all 40 million rows, matrix A represented the two probabilistic models in terms of counts. For a better understanding, consider the front 2D frame (label index 1) as the data holding the information behind P(Y=0|X) and the frame behind it (label index 2) as the data holding the information behind P(Y=1|X).
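
Combining the per-file results is just element-wise addition of the count matrices, and the combined counts convert directly into the probabilities the classifier needs. A minimal sketch, assuming each divided file's counts were saved as counts_1.mat through counts_200.mat (hypothetical file names) containing variables A and B:

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
A_total = zeros(22, maxUnique, 2);
B_total = zeros(2, 1);
for k = 1:200
    S = load(sprintf('counts_%d.mat', k));   % per-file counts: S.A and S.B
    A_total = A_total + S.A;
    B_total = B_total + S.B;
end

prior = B_total / sum(B_total);              % P(Y=0) and P(Y=1)
condProb = zeros(size(A_total));             % P(X_j = value | Y)
for y = 1:2
    condProb(:, :, y) = A_total(:, :, y) / B_total(y);
end
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

These prior and condProb tables are exactly what the scoring sketch in the Approach section assumes.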



5-fold Cross Validation:

While making progress it was important to assure myself that I was going in the right direction, so for the first few files I decided to use k-fold cross validation with k = 5. Out of 200,000 rows of data, I trained my classifier on a randomly chosen 80% and tested it on the remaining 20%. While testing, I matched the predicted labels against the actual labels to get the accuracy rate. After carrying out many such cross validations, my highest accuracy rate was 72%. On completing the cross validation I was assured that my Naive Bayes classifier wouldn't disappoint me.
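
A minimal sketch of one such 80/20 split and accuracy check, where trainCounts and predictLabel are hypothetical helpers that wrap the counting and scoring steps shown earlier:

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
n = size(features, 1);                       % 200,000 rows in this file
idx = randperm(n);                           % random shuffle of the row indices
nTrain = round(0.8 * n);
trainIdx = idx(1:nTrain);
testIdx  = idx(nTrain+1:end);

[A, B] = trainCounts(features(trainIdx, :), labels(trainIdx));   % hypothetical helper
predicted = predictLabel(A, B, features(testIdx, :));            % hypothetical helper
accuracy = mean(predicted == labels(testIdx));                   % fraction of matches
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>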



Results:

My Naive Bayes classifier predicted labels for all of the test data, and at the same time all the team members came up with their own predicted values for all the test rows. For the Kaggle competition we had to upload a file with two columns: the advertisement ID, followed by the calculated probability of it being clicked by users. Kaggle ranked all participants on the log loss scale. We uploaded our outputs one by one and retained the output whose log loss was better.
In the end, one of my team members' output had the best log loss among us. He had used Python scripts to implement an online learning algorithm with an adaptive learning rate. I will post the link to his blog about it once he has completed and published it!
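
For reference, the log loss Kaggle uses to rank submissions can be computed from a 0/1 label vector y and the predicted click probabilities p as below; the predictions are clipped away from 0 and 1 so a single confident wrong answer does not produce an infinite loss:

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
p = min(max(p, 1e-15), 1 - 1e-15);                      % clip predictions
logloss = -mean(y .* log(p) + (1 - y) .* log(1 - p));   % lower is better
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>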




Evaluation:

What I learnt using this approach:

  • MATLAB scripts take longer to run the same algorithm than other programming languages such as C/C++ or Python, given the same system properties
  • The drawback of the 3D structure (matrix A) is that some features have fewer unique values than others, so the remaining columns hold zeros and go unused. However, this is an issue we can afford, trading an insignificant amount of space for structured data that is easy to visualize
  • Other algorithms, such as neural networks, have a greater chance of achieving higher accuracy in terms of learning ability and prediction. Also, an online learning algorithm suits huge training data (Big Data) better, because its adaptive learning rate is updated every time a new row of data is processed, making it a better classifier with every input it reads
  • With the Naive Bayes classifier, the strong independence assumption between the features is not very reassuring. Nevertheless, implementing Laplace smoothing for this classifier is always a good idea, to avoid surprising zero probabilities for feature values never seen with a given label when generating the probabilistic models (see the sketch after this list)
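
A minimal sketch of Laplace (add-one) smoothing applied to the combined counts from the sketches above, where numUniqueVals(j) is a hypothetical vector holding the number of unique values of feature j:

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
for y = 1:2
    for j = 1:22
        V = numUniqueVals(j);              % number of unique values of feature j
        % add 1 to every count so unseen values still get a small probability
        condProb(j, 1:V, y) = (A_total(j, 1:V, y) + 1) / (B_total(y) + V);
    end
end
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>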





Contact:
Kedar Prabhu
Los Angeles, CA

Email: trojanguy31@gmail.com