Machine Learning

I’ll be taking a statistical learning class next semester, so I wanted to get a head start on some of the algorithms. I’m currently enrolled in an online class called Predictive Analytics, offered through Coursera, and this post is my “partial” answer to one of the assignments. I hope to use this code as a template for future assignments. The code fits a decision tree, a random forest and a support vector machine to flow cytometer data and compares the accuracy of the three models.

A flow cytometer delivers a flow of particles through a capillary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, you can determine how large each particle is and get some information about its color and other properties, allowing you to detect it.

While there are a number of challenging analytics tasks associated with this data, a central task is classification of particles. Based on the optical measurements of the particle, it can be identified as one of several populations.

coursera_pm_1

courera_pm_2

It looks like CRYPTO is missing from the decision tree! Let’s look at the accuracy of the decision tree, random forest and SVM.

courera_pm_3

The accuracies of DT, RF and SVM were 0.854, 0.923 and 0.921, respectively. The most important variables in the data set, as suggested by the Gini impurity measure, were pe and chl_small: the higher the index, the more important the variable. Upon deleting one of the observations, the accuracy of the SVM model went up by 0.05. Here are the confusion matrices from the three models.

courera_pm_4
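For anyone who wants to see how an accuracy figure and the per-class mistakes fall out of a confusion matrix, here is a minimal Python sketch. It is not the code used for the assignment; the labels below are made up for illustration and the function names are my own.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(counts):
    """Share of observations whose predicted label equals the actual label."""
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / sum(counts.values())

def misclassified_as(counts, label):
    """For one true class, return {predicted_label: count} over its errors only."""
    return {p: n for (a, p), n in counts.items() if a == label and p != label}

# Made-up labels for illustration; these are NOT the assignment data.
actual    = ["ultra", "ultra", "ultra", "pico", "pico", "nano", "synecho", "synecho"]
predicted = ["ultra", "pico",  "nano",  "pico", "pico", "nano", "synecho", "pico"]

counts = confusion_counts(actual, predicted)
acc = accuracy(counts)                            # diagonal total / grand total
ultra_errors = misclassified_as(counts, "ultra")  # where true "ultra" rows went wrong
```

Reading the off-diagonal counts for one true class at a time is what makes a pattern such as one population being confused with its neighbours easy to spot.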

All three models show Ultra being misclassified in higher proportion as Pico and Nano. Synecho is also misclassified as Pico.
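Coming back to the Gini impurity measure behind the variable ranking: the rough idea is that a split is useful when it makes the resulting groups purer. Below is a small Python sketch of that idea, assuming the usual definition of Gini impurity (one minus the sum of squared class proportions). The toy labels and function names are hypothetical, and this is not how the random forest package computes its importance scores internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a group of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Drop in Gini impurity from splitting `parent` into `left` and `right`,
    with each child weighted by its share of the parent's observations."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical toy node; labels are illustrative only.
parent = ["pico", "pico", "nano", "nano"]
pure_split = impurity_decrease(parent, ["pico", "pico"], ["nano", "nano"])
useless_split = impurity_decrease(parent, ["pico", "nano"], ["pico", "nano"])
```

A variable whose splits produce large decreases, like the first split here, ends up with a high importance score; a variable whose splits leave the groups as mixed as before contributes nothing.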

This post aims to build predictive models using a Generalized Linear Model (GLM) and Principal Component Analysis (PCA). In simple linear regression (SLR), the error terms are assumed to be normally distributed, so the response, given the predictors, is normally distributed as well. What if the response is binary, or follows some other non-normal distribution? This is where the GLM kicks in.
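To make the binary case concrete, here is a minimal sketch of logistic regression, which is a GLM with a binary response and a logit link. It is written in plain Python with a simple gradient ascent loop purely for illustration; the data, learning rate and step count are assumptions of mine, and a real analysis would use a proper GLM routine instead.

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the average Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus fitted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Made-up data: larger x tends to go with y = 1, with some overlap in the middle.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 1.0)    # fitted probability of y = 1 at x = 1.0
p_high = sigmoid(b0 + b1 * 4.0)   # fitted probability of y = 1 at x = 4.0
```

The fitted values are probabilities between 0 and 1, which is exactly what a straight-line fit with normal errors cannot guarantee.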

PCA is a statistical procedure that transforms correlated predictors into linearly uncorrelated ones called principal components. The first principal component is the linear combination of the x-variables that accounts for the maximum possible variability in the data. There’s a lot more to it, but that’s the basic idea. Let’s focus on building two models using these two methods.
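As a rough illustration of the "maximum possible variability" idea, the sketch below finds the first principal component of a tiny two-variable data set by power iteration on the sample covariance matrix. The data and the function name are made up, the code only handles two variables, and in practice you would call a library PCA routine rather than write this by hand.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (unit direction, share of total variance) for the first
    principal component of 2-D data given as (x, y) pairs."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    centered = [(r[0] - mx, r[1] - my) for r in rows]
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalise; this converges to the eigenvector with the largest eigenvalue.
    v = (1.0, 1.0)
    for _ in range(iters):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    # Variance along v (the largest eigenvalue), as a share of total variance.
    eigenvalue = v[0] * (sxx * v[0] + sxy * v[1]) + v[1] * (sxy * v[0] + syy * v[1])
    return v, eigenvalue / (sxx + syy)

# Made-up, strongly correlated data (roughly y = 2x).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, share = first_principal_component(data)
```

Because the two variables are almost perfectly correlated here, a single component captures nearly all of the variance, which is why PCA can shrink a set of correlated predictors to a few components.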

First, the data set is split into two parts: a training set and a test set. You build your model on the training set and evaluate it on the test set. Follow along with the code below to understand what’s going on.
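The split step on its own might look something like this minimal Python sketch. It is an illustrative stand-in, not the code used for this post; the 70/30 proportion, the seed and the function name are assumptions of mine.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and return (train, test).
    A fixed seed keeps the split reproducible between runs."""
    rng = random.Random(seed)
    shuffled = list(rows)          # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in "data set": ten row identifiers.
rows = list(range(10))
train, test = train_test_split(rows)
```

The model should only ever see `train` while it is being fitted; `test` is held back so that the reported accuracy reflects data the model has not seen.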

I have collected the resulting tables in a list. Metrics such as accuracy, sensitivity and specificity can also be compared between the two models.
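If you want those extra metrics, they can be read straight off a 2x2 table of predictions against actual values. Here is a small Python sketch; the counts are hypothetical and are not the results of the two models in this post.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from the four cells of a
    2x2 table (true/false positives and negatives)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual positives caught
        "specificity": tn / (tn + fp),   # share of actual negatives caught
    }

# Hypothetical counts for two models evaluated on the same 100 test rows.
model_a = binary_metrics(tp=40, fp=10, fn=5, tn=45)
model_b = binary_metrics(tp=35, fp=5, fn=10, tn=50)
```

In this made-up example both models have the same accuracy, but one is better at catching positives and the other at catching negatives, which is why it is worth looking beyond accuracy alone.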

Note: Part of the code above comes from one of the questions in the PML class offered on Coursera by JHU.


DataTweet

Analytics, Computing and Visualization Adventures

Predictive Analytics: Decision Tree, Random Forest and SVM

Building Predictive Models (GLM vs. PCA)
