Wednesday, May 10, 2017

Data Science isn't Rocket Science

Science is intimidating. It conjures up images of lab coats, telescopes and microscopes, or petri dishes and chemicals. But Data Science isn’t rocket science. Science is about discovery and Data Science is discovery in data.

You don’t need an army of PhDs for successful data science. Instead you need a basic understanding of statistics and a little computing power. Then follow this recipe of five steps to put Data Science to work for your business.

Step 1. Decide what to predict

First and foremost, you must know what you want to predict. Data Science in business is about making predictions. That prediction might be which customers will buy a product, which patients will be readmitted to a hospital, or when to buy shares of stock.

Everyone knows the story of Target sending coupons for pre-natal items to a teenage girl, only to surprise her father. But that case didn’t happen by chance. Instead, someone at Target decided to specifically focus on pregnant women. That person decided to predict which of their customers were pregnant.

Make your prediction on a single thing. That thing will be represented by a prediction variable. The variable might simply be “Yes” or “No”, such as in the Target pregnancy example. It could be an item in a list, such as a day of the week. It frequently will be a number, such as the price of a barrel of crude oil.

Step 2. Make your hypothesis

Your hypothesis is the key to the puzzle. And yes, the word hypothesis comes right out of the scientific method, because that’s what we’re doing: applying the scientific method to data to make predictions. You decided in the previous step what to predict; now it’s time to guess how to make that prediction.

Some insight into your problem is helpful here. For example, in the case of Target above, someone had made the presumption that items a person purchased could indicate whether they are pregnant. They very likely narrowed the items down to a specific list of items or types of items. The hypothesis may have been as simple as “someone who buys pre-natal vitamins is likely to be pregnant”. Or even more specific, “a woman between 18 and 45 who buys pre-natal vitamins is likely to be pregnant”.
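
One way to keep a hypothesis honest is to write it down as a rule that can be checked against a single record. Here is a minimal Python sketch of the second hypothesis. The Customer record and its field names are hypothetical, invented for this illustration, and say nothing about how Target actually did it.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical fields, chosen only for this illustration.
    age: int
    gender: str
    bought_prenatal_vitamins: bool


def hypothesis_flags(customer: Customer) -> bool:
    """Return True when the hypothesis says this customer is likely pregnant."""
    return (
        customer.gender == "F"
        and 18 <= customer.age <= 45
        and customer.bought_prenatal_vitamins
    )


print(hypothesis_flags(Customer(age=29, gender="F", bought_prenatal_vitamins=True)))  # True
print(hypothesis_flags(Customer(age=52, gender="F", bought_prenatal_vitamins=True)))  # False
```

Writing the rule out this way also makes it obvious which data points you will have to collect.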

Step 3. Get the data

Of course, this whole exercise assumes the data is available to make these predictions. You will need to collect data points on any attribute you are testing, as well as the values you wish to predict.
Going back to our example of finding customers who are expecting, assume we want to build a model on the hypothesis that “a woman between 18 and 45 who buys pre-natal vitamins is likely to be pregnant”. We will need the following data:
  • A list of customers who bought products.
  • A list of the products they bought.
  • For each customer, we need their age and gender.
  • For the products, we need to know which are pre-natal vitamins.
  • And most importantly, we need to know who is pregnant.
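
In practice, the items above are a handful of separate lists that get joined into one row per customer. Here is a rough Python sketch of that join. Every customer, product, and survey answer in it is made up, and the field names are assumptions.

```python
# All names and values below are invented for illustration.
customers = {
    1: {"age": 29, "gender": "F"},
    2: {"age": 41, "gender": "M"},
    3: {"age": 35, "gender": "F"},
}
purchases = [(1, "prenatal vitamins"), (1, "bread"), (2, "razor blades"), (3, "milk")]
prenatal_products = {"prenatal vitamins"}
pregnant = {1: True, 2: False, 3: False}  # e.g. gathered by a survey


def build_rows(customers, purchases, prenatal_products, pregnant):
    """Join the separate lists into one row per customer."""
    buyers = {cid for cid, product in purchases if product in prenatal_products}
    rows = []
    for cid, info in sorted(customers.items()):
        rows.append({
            "customer_id": cid,
            "age": info["age"],
            "gender": info["gender"],
            "bought_prenatal": cid in buyers,
            "pregnant": pregnant[cid],  # the value we want to predict
        })
    return rows


rows = build_rows(customers, purchases, prenatal_products, pregnant)
for row in rows:
    print(row)
```

Each row now holds the attributes being tested alongside the value to be predicted, which is the shape most modeling software expects.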

In this age of big data, you may have all the data available: every customer, and every product they purchased. If you have all the data, great; you can build your model on the population, which is the statistician’s way of saying “all the data”. Otherwise you will use a sample, which is the statistician’s way of saying “some of the data”.

Some data points may be difficult to obtain. In our example, the fact that a customer is expecting a child may not be readily available. In this case, special steps will be required to obtain the data. Target may have performed a customer survey or used some other means of gathering the information directly from the individual.

Step 4. Build a model

Building a model is where the fun starts. This is the statistical model, or predictive model. A basic understanding of statistics and some knowledge of modeling software are necessary.

Before building a model, you should run some analysis to see if your hypothesis is worth pursuing. The typical first step is testing a null hypothesis for statistical significance. The null hypothesis states that the predictor variables have no effect on the thing you are predicting, and the test checks whether your data contradicts it. In our case, the null hypothesis would be “knowing a person is a woman, between 18 and 45, who bought pre-natal vitamins has no impact on their being pregnant.”

In short, we compare the pregnancy rate in a random sample of customers against the rate among 18 to 45-year-old women who bought pre-natal vitamins. If a difference as large as the one we observe would turn up by chance less than 5% of the time (a p-value below 0.05), we reject the null hypothesis and consider our result statistically significant.

Some caution is needed here, because even when the null hypothesis is true, about one test in twenty will pass a 5% threshold purely because of an improbable random sample. You can protect yourself against improbable results by repeating the test against additional random samples.
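
For the curious, here is what such a comparison can look like in code. This Python sketch uses a two-proportion z-test, which is one common choice for comparing two rates and not necessarily what any retailer used. It assumes the two groups are independent samples, and the counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for the difference between two proportions (z-test)."""
    rate_a, rate_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Invented counts: 60 of 200 vitamin buyers are pregnant,
# versus 40 of 1,000 randomly sampled customers.
p = two_proportion_p_value(60, 200, 40, 1000)
print(p < 0.05)  # True means the null hypothesis is rejected at the 5% level
```

With rates this far apart the p-value is effectively zero. With two identical rates it would be 1, meaning the data gives no reason to doubt the null hypothesis.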

Once you have your data and your hypothesis is sound, building a predictive model is relatively simple. For example, the R code for creating a model looks like this:

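library(caret)  # assumed source of train(); method "rf" also needs the randomForest package
# Fit a random forest that predicts `class` from every other column in trainingCV
modelFit <- train(class ~ ., data = trainingCV, method = "rf", prox = TRUE)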

And the code for making predictions looks like this:

# Predict the class of each row in the held-out test data
prediction <- predict(modelFit, newdata = testCV)

This code is only illustrative; it shows that modeling does not require complicated commands.

Step 5. Check your results

Before creating your model, divide your data into two sets: a training set and a test set. The training set is used to create the model. The test set is used to demonstrate that it works. Typically, you would split the original data such that 80% of it is used for training, with the remaining 20% used for testing.

In our example, let’s say we have 1,000 customers in our data. We would randomly select 800 for training and 200 for testing. But there is nothing sacred about an 80/20 split. In fact, if your dataset is very large, say 100,000 records or more, you could hold out two separate sets using a 60/20/20 split: 60% for training, 20% for tuning the model (often called a validation set), and 20% for the final test.
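
The random split itself takes only a few lines. Here is a minimal Python sketch, assuming the data is a simple list of records. The function name, the fixed seed, and the stand-in customer IDs are all illustrative choices.

```python
import random


def split_data(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


customer_ids = list(range(1000))  # stand-in for 1,000 customer records
train, test = split_data(customer_ids)
print(len(train), len(test))  # 800 200
```

Shuffling before cutting matters: if the records are sorted by date or by customer, taking the first 800 would not give you a random sample.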

When your data is prepared, create your model; if using R, run the train() function. Then take the created model and run it against the test data; if using R, run the predict() function. The predict() function will predict whether each customer in the test set is pregnant or not.

The results of the prediction are compared against the original data to determine if our model makes reliable predictions. Using our example, we would make predictions for 200 people. If the model correctly classifies 150 of them, pregnant or not, then our predictions were 75% accurate.
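
The accuracy arithmetic can be spelled out in a short Python sketch. The outcomes and predictions below are invented so that exactly 150 of 200 match.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Invented test set: 100 pregnant customers followed by 100 who are not.
actual = [True] * 100 + [False] * 100
# The model gets 75 right in each group, for 150 correct out of 200.
predicted = [True] * 75 + [False] * 25 + [False] * 75 + [True] * 25
print(accuracy(actual, predicted))  # 0.75
```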

Your software should be able to provide you with an Area Under the Curve (AUC) analysis, where the curve in question is the ROC curve. AUC values close to 1 indicate a very good model; those near .5 are no better than flipping a coin. You should strive for AUCs of .80 or more.
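
If you are wondering what that number means, AUC can be read as the chance that the model gives a randomly chosen positive case a higher score than a randomly chosen negative one. The Python sketch below computes it that way, by brute force over every pair. It is fine for small examples, but your modeling software will use something faster. The scores are invented.

```python
def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for a, s in zip(actual, scores) if a]
    negatives = [s for a, s in zip(actual, scores) if not a]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # positive ranked above negative
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


actual = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # invented model scores
print(auc(actual, scores))  # 8 of 9 pairs ranked correctly, about 0.89
```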

And now we’re done

Well, maybe we’re not done. If your AUC is poor, then you should start the process over. But it’s not a complete loss, because even a bad model gives knowledge; knowing what doesn’t work is important too.


Of course, this is a blog post, and I have over-simplified every step. Still, creating a predictive model is not as intimidating as one might think. It’s most definitely not like putting a man on the moon.
