Science is intimidating. It conjures up images of lab coats,
telescopes and microscopes, or petri dishes and chemicals. But
Data Science isn’t rocket science.
Science is about discovery, and Data Science is discovery in data.
You don’t need an army of PhDs for successful data science.
Instead you need a basic understanding of statistics and a little computing power.
Then follow this recipe of five steps to put Data Science to work for your
business.
Step 1. Decide what to predict
First and foremost, you must know what you want to predict.
Data Science in business is about making predictions. The prediction might be which
customers will buy a product, which patients will be readmitted to a
hospital, or when to buy shares of stock.
Everyone knows the story of Target sending coupons for
pre-natal items to a teenage girl, only to surprise her father. But that case
didn’t happen by chance. Instead, someone at Target decided to specifically focus
on pregnant women. That person decided to predict which of their customers were
pregnant.
Make your prediction on a single thing. That thing will be
represented by a prediction variable. The variable might simply be “Yes” or
“No”, such as in the Target pregnancy example. It could be an item in a list,
such as a day of the week. It frequently will be a number, such as the price of
a barrel of crude oil.
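To make the three kinds of prediction variable concrete, here is a minimal sketch in base R. The variable names and every value are made up for illustration; they are not taken from any real dataset.

```r
# Hypothetical examples of the three kinds of prediction variable.
# All names and values below are invented for illustration.

# 1. A "Yes"/"No" outcome, stored as a two-level factor
is_pregnant <- factor(c("Yes", "No", "No"), levels = c("No", "Yes"))

# 2. An item in a list, such as a day of the week
busiest_day <- factor(c("Sat", "Mon", "Sat"),
                      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# 3. A number, such as a (made-up) price per barrel of crude oil
oil_price <- c(71.25, 69.80, 72.10)

nlevels(is_pregnant)   # number of possible outcomes for the yes/no variable
nlevels(busiest_day)   # number of possible outcomes for the list variable
is.numeric(oil_price)  # the numeric variable has no fixed list of outcomes
```

Which kind you have matters later, because it determines whether you are building a classification model (the first two) or a regression model (the third).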
Step 2. Make your hypothesis
Your hypothesis is the key to the puzzle. And yes, the word hypothesis comes right out of the
scientific method because that’s what we’re doing; we’re applying the
scientific method to data to make predictions. You decided in the previous step
what to predict; now it’s time to guess how to make that prediction.
Some insight into your problem is helpful here. For example,
in the case of Target above, someone had made the presumption that items a
person purchased could indicate whether they are pregnant. They very likely
narrowed the items down to a specific list of items or types of items. The
hypothesis may have been as simple as “someone who buys pre-natal vitamins is
likely to be pregnant”. Or even more specific, “a woman between 18 and 45 who
buys pre-natal vitamins is likely to be pregnant”.
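A hypothesis like the more specific one above can be written down as a simple rule. The sketch below is only an illustration: the `customers` table, its column names, and the `matches_hypothesis` helper are all hypothetical.

```r
# Hypothetical customers; every value here is made up for illustration
customers <- data.frame(
  id = 1:4,
  age = c(25, 52, 31, 17),
  gender = c("F", "F", "M", "F"),
  bought_prenatal_vitamins = c(TRUE, TRUE, TRUE, FALSE)
)

# The hypothesis as a rule: a woman between 18 and 45 who bought pre-natal vitamins
matches_hypothesis <- function(df) {
  df$gender == "F" & df$age >= 18 & df$age <= 45 & df$bought_prenatal_vitamins
}

customers$likely_pregnant <- matches_hypothesis(customers)

# IDs of the customers the hypothesis flags
customers$id[customers$likely_pregnant]
```

At this stage the rule is only a guess. The later steps check whether it holds up against data.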
Step 3. Get the data
Of course, this whole exercise assumes the data is available
to make these predictions. You will need to collect data points on any
attribute you are testing, as well as the values you wish to predict.
Going back to our example of finding customers who
are expecting, assume we want to build a model on the hypothesis that “a
woman between 18 and 45 who buys pre-natal vitamins is likely to be pregnant”.
We will need the following data:
- A list of customers who bought products.
- A list of the products they bought.
- For each customer, we need their age and gender.
- For the products, we need to know which are pre-natal vitamins.
- And most importantly, we need to know who is pregnant.
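The items in the list above can be sketched as three small tables that get joined into one row per customer. This is a minimal base R illustration; the table names, column names, and all values are assumptions made up for the example.

```r
# Hypothetical tables; all names and values are made up for illustration
customers <- data.frame(customer_id = c(1, 2, 3),
                        age = c(29, 41, 35),
                        gender = c("F", "F", "M"),
                        pregnant = c(TRUE, FALSE, FALSE))  # e.g. from a survey
purchases <- data.frame(customer_id = c(1, 1, 2, 3),
                        product_id = c(10, 11, 11, 10))
products <- data.frame(product_id = c(10, 11),
                       name = c("pre-natal vitamins", "coffee"),
                       is_prenatal_vitamin = c(TRUE, FALSE))

# Join purchases to products, then flag customers who ever bought the vitamins
bought <- merge(purchases, products, by = "product_id")
flag <- aggregate(is_prenatal_vitamin ~ customer_id, data = bought, FUN = any)
names(flag)[2] <- "bought_prenatal_vitamins"

# One row per customer: attributes, the predictor, and the value to predict
model_data <- merge(customers, flag, by = "customer_id")
model_data
```

Note that `merge()` keeps only customers who appear in both tables, which is fine here because the first item in the list is already limited to customers who bought products.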
In this age of big data, you may have all the data available: every customer and every
product they purchased. If you have all the data, great; you can build your
model based on the population, which
is the statistician’s way of saying “all the data”. Otherwise you will use a sample, which is
the statistician’s way of saying “some of the data”.
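Drawing a sample from a population takes one call in base R. In this sketch the population is just a vector of hypothetical customer IDs, and the sample size of 500 is an arbitrary choice for illustration.

```r
set.seed(42)  # fix the random seed so the same sample is drawn each run

population <- 1:10000                         # hypothetical customer IDs
sample_ids <- sample(population, size = 500)  # random sample, without replacement

length(sample_ids)  # how many customers ended up in the sample
```

Because `sample()` draws without replacement by default, no customer appears in the sample twice.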
Some data points may be difficult to obtain. In our example,
the fact that a customer is expecting a child may not be readily available. In
this case, special steps will be required to obtain the data. Target may have
performed a customer survey or used some other means of gathering the
information directly from the individual.
Step 4. Build a model
Building a model is where the fun starts. This is the
statistical model, or predictive model. A basic understanding of statistics and
knowledge of modeling software is necessary.
Before building a model, you should run some analysis to see
if your hypothesis is worth pursuing. The typical first step is testing a null hypothesis for statistical significance. The null hypothesis states that the
predictor variable has no effect on the outcome. In our case, the null hypothesis
would be “knowing a person is a woman, between 18 and 45, who bought pre-natal
vitamins has no impact on their being pregnant.”
In short, we compare the pregnancy rate in a random
sample of customers against the rate among 18 to 45-year-old women buying pre-natal vitamins. A
significance test then gives a p-value: the probability of seeing a difference at least this
large if the null hypothesis were true. If the p-value is below 5% (the conventional threshold),
the null hypothesis is rejected, and the difference is considered
statistically significant.
Some caution is needed here, because at a 5% threshold roughly one test in twenty
will look significant purely because of an improbable random sample. You can protect yourself against
improbable results by repeating the test against additional random samples.
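As a rough sketch of such a test, base R’s `prop.test()` compares the proportions in two groups. The counts below are made up for illustration, and the sketch assumes the two groups are independent samples.

```r
# Made-up counts, for illustration only
pregnant_counts <- c(vitamin_buyers = 120, random_customers = 30)
group_sizes     <- c(vitamin_buyers = 200, random_customers = 1000)

# Two-sample test of equal proportions.
# Null hypothesis: both groups have the same pregnancy rate.
result <- prop.test(x = pregnant_counts, n = group_sizes)

result$estimate  # observed pregnancy rate in each group
result$p.value   # probability of a gap this large if the null hypothesis were true

# Reject the null hypothesis at the conventional 5% significance level?
result$p.value < 0.05
```

With counts this lopsided the p-value is far below 5%. Real data will rarely be so clear-cut, which is why repeating the test on additional samples is worthwhile.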
Once you have your data and your hypothesis is sound,
building a predictive model is relatively simple. For example, the R code for
creating a model looks like this:
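    library(caret)  # provides train(); method = "rf" also needs the randomForest package
    # Fit a random forest that predicts the "class" column from all other columns
    modelFit <- train(class ~ ., method = "rf",
                      data = trainingCV, prox = TRUE)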
And the code for making predictions looks like this:
    # Apply the fitted model to the held-out test data
    prediction <- predict(modelFit, testCV)
This code is only illustrative; it shows that modeling does not
require complicated commands.
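The caret call above depends on data that is not shown here, so the following is a self-contained stand-in rather than the same method. It swaps the random forest for a logistic regression fitted with base R’s `glm()`, and it runs on simulated data. The column names and the simulated pregnancy rates (60% for vitamin buyers, 3% otherwise) are assumptions chosen purely for illustration.

```r
set.seed(1)  # make the simulation reproducible

# Simulated customers; every number here is made up
n <- 500
bought   <- rbinom(n, 1, 0.2)                # 1 = bought pre-natal vitamins
age      <- sample(18:70, n, replace = TRUE)
pregnant <- rbinom(n, 1, ifelse(bought == 1, 0.6, 0.03))
d <- data.frame(pregnant, bought, age)

# Logistic regression: a simple model for a yes/no prediction variable
fit <- glm(pregnant ~ bought + age, data = d, family = binomial)

# Predicted probability of pregnancy for two hypothetical 30-year-old customers:
# the first bought pre-natal vitamins, the second did not
new_customers <- data.frame(bought = c(1, 0), age = c(30, 30))
p <- predict(fit, new_customers, type = "response")
p
```

The shape is the same as in the caret example: one call to fit the model, one call to predict.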
Step 5. Check your results
Before creating your model, divide your data into two sets:
a training set and a test set. The training data is used to
create the model. The test set is used to demonstrate that it works. Typically,
you would split the original data such that 80% of it is used for training with
the remaining 20% used for testing.
In our example, let’s say we have 1,000 customers in our data.
We would randomly select 800 for training and 200 for testing. But there is
nothing sacred about an 80/20 split. In fact, if your dataset is very large,
say 100,000 records or more, you could create multiple test sets using a 60/20/20
split: 60% for training and two separate 20% sets for testing.
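An 80/20 split like the one described can be sketched in a few lines of base R. The `customers` data frame here is a placeholder with only an ID column; in practice it would hold the real attributes.

```r
set.seed(123)  # make the random split reproducible

customers <- data.frame(id = 1:1000)  # stand-in for the real customer data

# Randomly pick 80% of the row numbers for training
n <- nrow(customers)
train_rows <- sample(n, size = round(0.8 * n))

training <- customers[train_rows, , drop = FALSE]
testing  <- customers[-train_rows, , drop = FALSE]  # everything not in training

nrow(training)
nrow(testing)
```

Selecting the test set with `-train_rows` guarantees that no customer lands in both sets.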
When your data is prepared, create your model; if using R,
run the train() function. Then take the created model and run it against
the test data; if using R, run the predict() function. The predict()
function will predict whether each customer is pregnant or not.
The results of the prediction are compared against the
original data to determine if our model makes reliable predictions. Using our
example, we would make predictions for 200 people. If the model’s prediction
matches the actual outcome for 150 of them, then our model was 75% accurate.
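The accuracy calculation is just the share of predictions that match the actual outcomes. The sketch below uses made-up outcomes for 200 test customers and deliberately flips 50 predictions to simulate a model’s mistakes.

```r
# Made-up actual outcomes for 200 test customers
actual <- c(rep("Yes", 60), rep("No", 140))

# Simulate a model's predictions by copying the truth and flipping 50 of them
predicted <- actual
wrong <- c(1:20, 61:90)  # 20 missed pregnancies, 30 false alarms
predicted[wrong] <- ifelse(actual[wrong] == "Yes", "No", "Yes")

# Accuracy: the fraction of predictions that match the actual outcome
accuracy <- mean(predicted == actual)
accuracy  # 150 correct out of 200

# A cross-tabulation shows where the errors fall
table(Predicted = predicted, Actual = actual)
```

The table is worth a look because accuracy alone hides the difference between missing a pregnancy and raising a false alarm.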
Your software should be able to provide you with an Area Under the Curve (AUC) analysis, where the curve is the ROC curve. AUC
values close to 1 indicate a very good model; those near .5 are little better
than flipping a coin. You should strive for AUCs of .80 or more.
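If you are curious what the AUC number means, it can be computed by hand as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The `auc` helper below is a hypothetical base R sketch of that rank-based calculation, and the scores and labels are made up.

```r
# Rank-based AUC: the chance that a random positive outscores a random negative.
# labels: TRUE for the positive class; scores: higher means "more likely positive".
auc <- function(scores, labels) {
  r <- rank(scores)  # ties receive average ranks
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up model scores for six customers, three of whom are actually pregnant
labels <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
scores <- c(0.9, 0.8, 0.4, 0.5, 0.3, 0.1)

auc(scores, labels)  # 8 of the 9 positive/negative pairs are ordered correctly
```

In practice your modeling software or an established package will report this for you; the sketch is only meant to show that nothing mysterious is going on.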
And now we’re done
Well, maybe we’re not done. If your AUC is poor, then you
should start the process over. But it’s not a complete loss, because even a bad
model gives knowledge; knowing what doesn’t work is important too.
Of course, this is a blog post, and I have over-simplified
every step. Still, creating a predictive model is not as intimidating as one
might think. It’s most definitely not like putting a man on the moon.