Saturday, May 13, 2017

Why Data Science Can't Find the Needle in the Haystack

A colleague of mine wants a predictive model. He is trying to determine which people on a health insurance plan will visit the hospital in the next couple of months. He has pretty good data for making this type of prediction. He knows who has visited the hospital in the past, and he knows those people are more likely to return. He knows what illnesses these people have, and which illnesses are likely to result in hospital visits. He knows what drugs have been prescribed, and whether patients are taking them.
Even with this data, and even with very good models, he still complains that the predictions are not good enough. His problem is too many false positives. He simply doesn’t have enough employees to review every patient the model flags.

Data science is a great tool, but it is not perfect. And if you are going to wield the data science tool, you should be aware of its shortcomings. Data science is simply not very good at finding a needle in a haystack.

This concept can be illustrated with an example. Say a banker is managing the mortgages of 500,000 homeowners. He knows from experience that roughly 1,000 of these homeowners will default on their loans. There is plenty of data to help zero in on these 1,000 people: zip code, income, payment history, and credit rating. He knows that if he can get the right people some assistance, they may not default on their loans.

We have the data to build a predictive model. But regardless of how good the model is, it will not be perfect. When the model is run, each homeowner will be classified as at-risk for default or not at-risk. In this scenario, there are four possible outcomes for each homeowner. The homeowner is at-risk and is properly identified by the model; or the homeowner is not at-risk and is properly identified by the model. These are the two accurate predictions.

Every predictor gets some wrong too. When the homeowner is not at-risk, but the model says he is, that is a false positive. If the homeowner is at-risk and was not identified by the model, that is a false negative. In statistics, false positives are also referred to as “type I errors”. False negatives are referred to as “type II errors”.

Now let’s say that we construct a model that is 80% accurate, which is a rule-of-thumb threshold for a good prediction. With 80% accuracy on 500,000 loans, 400,000 will be correctly predicted and 100,000 will not. At an 80% prediction rate, 800 of the 1,000 at-risk homeowners would be predicted correctly.

Of course, an 80% success rate means there is also a 20% failure rate. I mentioned that 100,000 loans are not predicted correctly. Of those, 200 are false negatives: the balance of the 1,000 at-risk loans. Subtracting the 200 false negatives from the 100,000 people identified incorrectly leaves 99,800 false positives. That is an extraordinary 125 times as many false positives as correctly predicted at-risk loans.

Even if the model can predict at the incredible rate of 99%, the false positives will still outnumber the correctly identified at-risk cases by about five to one.
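A minimal sketch in R of the arithmetic, using the numbers from the example above:

# 500,000 loans, of which 1,000 will actually default
loans    <- 500000
at_risk  <- 1000
accuracy <- 0.80                                  # change to 0.99 to see the 99% case

true_pos  <- at_risk * accuracy                   # at-risk loans the model catches (800)
false_neg <- at_risk - true_pos                   # at-risk loans the model misses (200)
false_pos <- (loans - at_risk) * (1 - accuracy)   # healthy loans flagged anyway (99,800)

false_pos / true_pos                              # false positives per true positive (about 125)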


The problem here isn’t with data science or prediction methods. It’s simple math. When you try to use statistics to find a very small number among a very large number, the false positives will always greatly outnumber the actual positives. This type of problem is truly a needle in a haystack.

Wednesday, May 10, 2017

Data Science isn't Rocket Science

Science is intimidating. It conjures up images of lab coats, telescopes and microscopes, or petri dishes and chemicals. But Data Science isn’t rocket science. Science is about discovery and Data Science is discovery in data.

You don’t need an army of PhDs for successful data science. Instead you need a basic understanding of statistics and a little computing power. Then follow this recipe of five steps to put Data Science to work for your business.

Step 1. Decide what to predict

First and foremost, you must know what you want to predict. Data Science in business is about making predictions. That prediction might be which customers will buy a product, which patients will be readmitted to a hospital, or when to buy shares of stock.

Everyone knows the story of Target sending coupons for pre-natal items to a teenage girl, only to surprise her father. But that case didn’t happen by chance. Instead, someone at Target decided to specifically focus on pregnant women. That person decided to predict which of their customers were pregnant.

Predict a single thing. That thing will be represented by a prediction variable. The variable might simply be “Yes” or “No”, as in the Target pregnancy example. It could be an item in a list, such as a day of the week. It frequently will be a number, such as the price of a barrel of crude oil.

Step 2. Make your hypothesis

Your hypothesis is the key to the puzzle. And yes, the word hypothesis comes right out of the scientific method, because that’s what we’re doing: we’re applying the scientific method to data to make predictions. You decided in the previous step what to predict; now it’s time to guess how to make that prediction.

Some insight into your problem is helpful here. For example, in the case of Target above, someone presumed that the items a person purchased could indicate whether she was pregnant. They very likely narrowed this down to a specific list of items or types of items. The hypothesis may have been as simple as “someone who buys pre-natal vitamins is likely to be pregnant”. Or, even more specifically, “a woman between 18 and 45 who buys pre-natal vitamins is likely to be pregnant”.

Step 3. Get the data

Of course, this whole exercise assumes the data is available to make these predictions. You will need to collect data points on any attribute you are testing, as well as the values you wish to predict.
Again, going back to our example of finding customers who are expecting, assume we want to build a model on the hypothesis that “a woman between 18 and 45 who buys pre-natal vitamins is likely to be pregnant”. We will need the following data (a sketch of what it might look like follows the list):
  • A list of customers who bought products.
  • A list of the products they bought.
  • For each customer, their age and gender.
  • For the products, which ones are pre-natal vitamins.
  • And most importantly, who is pregnant.
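To make that concrete, here is a sketch in R of what a few rows of that data might look like; every column name and value is invented purely for illustration:

# Hypothetical customer-level data combining purchases, demographics, and the outcome
customers <- data.frame(
  customer_id     = c(101, 102, 103),
  age             = c(34, 27, 51),
  gender          = c("F", "F", "M"),
  bought_prenatal = c(TRUE, FALSE, FALSE),   # did this customer buy pre-natal vitamins?
  pregnant        = c("Yes", "No", "No")     # the value we want to predict
)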

In this age of big data, you may have all the data available; that is, every customer and every product they purchased. If you have all the data, great: you can build your model based on the population, which is the statistician’s way of saying “all the data”. Otherwise, you will use a sample, which is a way of saying some of the data.

Some data points may be difficult to obtain. In our example, the fact that a customer is expecting a child may not be readily available. In this case, special steps will be required to obtain the data. Target may have performed a customer survey or used some other means of gathering the information directly from the individual.

Step 4. Build a model

Building a model is where the fun starts. This is the statistical model, or predictive model. A basic understanding of statistics and knowledge of modeling software are all that is necessary.

Before building a model, you should run some analysis to see if your hypothesis is worth pursuing. The typical first step is testing a null hypothesis for statistical significance. The null hypothesis states that the predictor variables have no effect on the prediction. In our case, the null hypothesis would be “knowing a person is a woman, between 18 and 45, who bought pre-natal vitamins has no impact on whether she is pregnant.”

In short, we compare the pregnancy rate in a random sample of customers against the rate among 18 to 45-year-old women buying pre-natal vitamins. If the difference between the two groups is large enough that it is unlikely to be due to chance (conventionally, a p-value below 0.05), the null hypothesis is rejected, and our hypothesis is considered statistically significant.

Some caution is needed here, because even a statistically significant result can be produced by an improbable random sample. You can protect yourself against such flukes by repeating the test against additional random samples.
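As a sketch of how that test might be run, base R’s prop.test() compares two proportions; all of the counts below are made up purely for illustration:

# Hypothetical counts: pregnancies among 200 women aged 18-45 who bought
# pre-natal vitamins, versus pregnancies in a random sample of 1,000 customers
pregnant <- c(120, 35)         # number of pregnancies observed in each group
sampled  <- c(200, 1000)       # size of each group

prop.test(pregnant, sampled)   # a p-value below 0.05 rejects the null hypothesis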

Once you have your data and your hypothesis is sound, building a predictive model is relatively simple. For example, the R code for creating a model looks like this:

library(caret)   # supplies the train() and predict() wrappers used below
# Fit a random forest that predicts the class column from all of the other columns
modelFit <- train(class ~ ., method = "rf", data = trainingCV, prox = TRUE)

And the code for making predictions looks like this:

prediction <- predict(modelFit, testCV)   # predicted class for each row of the test set

This code is purely illustrative; the point is that modelling does not require complicated commands.

Step 5. Check your results

Before creating your model, divide your data into two sets: a training set and a test set. The training set is used to create the model. The test set is used to demonstrate that it works. Typically, you would split the original data so that 80% of it is used for training, with the remaining 20% used for testing.

In our example, let’s say we have 1,000 customers in our data. We would randomly select 800 for training and 200 for testing. But there is nothing sacred about an 80/20 split. In fact, if your dataset is very large, say 100,000 records or more, you could carve out a separate validation set as well, using a 60/20/20 split.
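As a sketch, caret’s createDataPartition() can make that split; the customers data frame and its pregnant column are just placeholder names from the running example:

library(caret)                                # provides createDataPartition()

set.seed(42)                                  # make the random split reproducible
inTrain    <- createDataPartition(customers$pregnant, p = 0.8, list = FALSE)
trainingCV <- customers[inTrain, ]            # 80% of the rows, used to fit the model
testCV     <- customers[-inTrain, ]           # the held-out 20%, used only for testing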

When your data is prepared, create your model; if using R, run the train() function. Then take the created model and run it against the test data; if using R, run the predict() function. The predict() function will predict whether or not each customer is pregnant.

The results of the prediction are compared against the original data to determine whether our model makes reliable predictions. Using our example, we would make predictions for the 200 customers in the test set. If the model correctly determines pregnancy for 150 of them, then our prediction was 75% accurate.
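caret’s confusionMatrix() does this bookkeeping for you. A minimal sketch, assuming prediction holds the predicted classes from above and testCV$pregnant holds the known answers (a placeholder column name):

# Compare the predicted classes against the actual values in the held-out test set
confusionMatrix(prediction, as.factor(testCV$pregnant))
# The output includes overall accuracy plus the false-positive and false-negative counts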

Your software should be able to provide you with an Area Under the Curve (AUC) analysis. AUC values close to 1 indicate a very good model; those near .5 are little better than flipping a coin. You should strive for an AUC of .80 or more.
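One way to get an AUC in R is the pROC package, fed with the class probabilities that caret’s predict() can produce. A sketch, assuming the model’s positive class is labelled "Yes":

library(pROC)                                  # provides roc() and auc()

probs    <- predict(modelFit, testCV, type = "prob")    # class probabilities, not hard labels
rocCurve <- roc(response = testCV$pregnant, predictor = probs[["Yes"]])
auc(rocCurve)                                  # aim for 0.80 or better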

And now we’re done

Well, maybe we’re not done. If your AUC is poor, then you should start the process over. But it’s not a complete loss, because even a bad model gives knowledge; knowing what doesn’t work is important too.


Of course, this is a blog post, and I have over-simplified every step. Still, creating a predictive model is not as intimidating as one might think. It’s most definitely not like putting a man on the moon.

Tuesday, May 02, 2017

Beware of Expert Blindness

More and more companies are looking to big data to help them market their products or improve their services. Of course that means more companies are seeking out data scientists and statisticians. But to truly take advantage of big data means the firm must commit to the principles of data science. That is often easier said than done.

Enter the Subject Matter Expert and the common trap of “Subject Expert Blindness”.

Yes, the Subject Matter Expert: the person who has spent a career building knowledge of their business. These are the people who drive a company’s offering, or whose stamp of approval is necessary on any significant project. They believe their experience and learning have given them special insight that others simply do not have.

If you are a specialist in data science, it is unlikely that you have spent years accumulating experience in any particular industry. Automotive. Healthcare. Insurance. Finance. It doesn’t matter, because your expertise is data. Data is data. And you tell your story with the data.

The expert does not rely on data, or needs it only to confirm a preconceived insight. The expert, then, becomes blind to alternatives hidden in the data.

Take the case of a recent project of mine. I was approached by a firm looking to find groups of people who were likely to be the most expensive customers to serve. The expert provided a list of twenty such groups and asked that we demonstrate that these groups are statistically more likely to consume services than an “average” customer.

But the notion of pre-determined groups is silly in the world of data science. Why not have the data tell us what the highest-risk groups are? If the results match the expert’s groups, then great, her hunches are confirmed. But the expert will never find the hidden gems that the data often exposes. The expert is simply blind to the alternatives.

The bottom line: let the data tell the story. Don’t force the story onto the data. Resist the temptation to rely on personal experience to shape the story before the data is even crunched.
