Saturday, May 13, 2017

Why Data Science Can't Find the Needle in the Haystack

A colleague of mine wants a predictive model. He is trying to determine which people on a health insurance plan will visit the hospital in the next couple of months. He has pretty good data for making this type of prediction. He knows who has visited the hospital in the past, and he knows those people are more likely to visit again. He knows what illnesses these people have, and which illnesses are likely to result in hospital visits. He knows what drugs have been prescribed, and whether patients are taking them.
Even with this data and even with very good models, he still complains that the predictions are not good enough. His problem is too many false positives. He simply doesn’t have enough employees to review every patient the model flags.

Data science is a great tool, but it is not perfect. And if you are going to wield the data science tool, you should be aware of its shortcomings. Data science is simply not very good at finding a needle in a haystack.

This concept can be illustrated with an example. Say a banker is managing the mortgages of 500,000 homeowners. He knows from experience that roughly 1,000 of these homeowners will default on their loans. There is plenty of data to help zero in on these 1,000 people: zip code, income, payment history, and credit rating. He knows that if he can get the right people some assistance, they may not default on their loans.

We have the data to build a predictive model. But regardless of how good the model is, it will not be perfect. When the model is run, each homeowner will be classified as at-risk for default or not at-risk. In this scenario, there are four possible outcomes for each homeowner. The homeowner is at-risk and is properly identified by the model; or the homeowner is not at-risk and is properly identified by the model. These are the two accurate predictions.

Every predictor gets some wrong too. When the homeowner is not at-risk, but the model says he is, that is a false positive. If the homeowner is at-risk and was not identified by the model, that is a false negative. In statistics, false positives are also referred to as “type I errors”. False negatives are referred to as “type II errors”.
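
To make the four outcomes concrete, here is a minimal Python sketch that tallies true positives, true negatives, false positives, and false negatives. The actual and predicted labels are made-up toy data, not real loan records:

```python
# Minimal sketch of the four prediction outcomes.
# The two lists below are made-up toy data, purely for illustration.
actual    = [1, 0, 1, 0, 0, 1, 0, 0]  # 1 = homeowner really is at-risk
predicted = [1, 0, 0, 1, 0, 1, 0, 0]  # 1 = model flags homeowner as at-risk

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # at-risk, correctly flagged
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # not at-risk, correctly cleared
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # flagged but fine (type I error)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # at-risk but missed (type II error)

print(tp, tn, fp, fn)  # 2 4 1 1 for the toy data above
```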

Now let’s say that we construct a model that is 80% accurate, which is a rule-of-thumb threshold for a good prediction. With 80% accuracy on 500,000 loans, 400,000 will be correctly predicted and 100,000 will not. At an 80% prediction rate, 800 of the 1,000 at-risk homeowners would be correctly identified.

Of course, an 80% success rate means there is also a 20% failure rate. I mentioned that 100,000 loans are not predicted correctly. Of the 1,000 at-risk loans, 200 are missed by the model; those are the false negatives. Subtracting the 200 false negatives from the 100,000 incorrect predictions leaves 99,800 false positives. That is roughly 125 times as many false positives as correctly predicted at-risk loans.
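
This arithmetic is easy to reproduce in a few lines of Python. The numbers rest on the same simplifying assumption as the text: the model gets 20% of each group wrong, at-risk and not at-risk alike.

```python
# Reproducing the 80%-accuracy arithmetic from the text.
# Simplifying assumption (same as the text): the model is wrong about 20%
# of the at-risk loans and 20% of the not-at-risk loans.
total_loans = 500_000
at_risk     = 1_000
accuracy    = 0.80

wrong_total     = round((1 - accuracy) * total_loans)  # 100,000 incorrect predictions
false_negatives = round((1 - accuracy) * at_risk)      # 200 at-risk loans missed
true_positives  = at_risk - false_negatives            # 800 at-risk loans caught
false_positives = wrong_total - false_negatives        # 99,800 healthy loans flagged

print(false_positives)                    # 99800
print(false_positives / true_positives)   # ~125 false positives per true positive
```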

Even if the model can predict at the incredible rate of 99%, the false positives will still outnumber the correctly identified at-risk cases by roughly five to one.
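
The same reasoning can be wrapped in a small helper function to try out any accuracy rate. Under the same even-error assumption, even a 99%-accurate model produces roughly five false positives for every at-risk loan it actually catches:

```python
# False positives per true positive, assuming (as above) that the error rate
# applies equally to the at-risk and not-at-risk groups.
def fp_per_tp(total: int, positives: int, accuracy: float) -> float:
    false_positives = (1 - accuracy) * (total - positives)
    true_positives = accuracy * positives
    return false_positives / true_positives

print(fp_per_tp(500_000, 1_000, 0.80))  # ~125 false positives per hit
print(fp_per_tp(500_000, 1_000, 0.99))  # ~5 false positives per hit
```

Under this simplified model, the ratio only falls to one-to-one when the error rate drops to roughly the prevalence of defaults itself, about 0.2% here.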


The problem here isn’t with data science or prediction methods. It’s simple math. When you use statistics to find a very small number of cases among a very large population, the false positives will always greatly outnumber the true positives. This type of problem is truly a needle in a haystack.
