A colleague of mine wants a predictive model. He is trying
to determine which people on a health insurance plan will visit the hospital in
the next couple of months. He has pretty good data for making this type of
prediction. He knows who visited the hospital in the past, and he knows those
people are more likely to revisit. He knows what illnesses these people have,
and which illnesses are likely to result in hospital visits. He knows what
drugs have been prescribed, and whether patients are taking them.
Even with this data and even with very good models, he still
complains that the predictions are not good enough. His problem is too many
false positives. He simply doesn’t have enough employees to review every
patient the model flags.
Data science is a great tool, but it is not perfect. And if
you are going to wield the data science tool, you should be aware of its shortcomings.
Data science is simply not very good at finding a needle in a haystack.
This concept can be illustrated with an example. Say a
banker is managing the mortgages of 500,000 homeowners. He knows from
experience that roughly 1,000 of these homeowners will default on their loan.
There is plenty of data to help zero in on these 1,000 people: zip code,
income, payment history, and credit rating. He knows that if he can get the
right people some assistance, they may not default on their loans.
We have the data to build a predictive model. But regardless
of how good the model is, it will not be perfect. When the model is run, each
homeowner will be classified as at-risk for default or not at-risk. In this
scenario, there are four possible outcomes for each homeowner. The homeowner is
at-risk and is properly identified by the model, or the homeowner is not at-risk
and is properly identified by the model. These are the two accurate predictions.
Every predictor gets some cases wrong, too. When the homeowner is
not at-risk, but the model says he is, that is a false positive. If the
homeowner is at-risk and was not identified by the model, that is a false
negative. In statistics, false positives are also referred to as “type I errors”.
False negatives are referred to as “type II errors”.
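To make that vocabulary concrete, here is a minimal sketch in Python. The function and its labels are made up for illustration; they are not part of any real scoring model.

```python
# A minimal sketch of the four outcomes; the names are hypothetical.
def outcome(actual_at_risk, predicted_at_risk):
    if actual_at_risk and predicted_at_risk:
        return "true positive"    # at-risk and correctly flagged
    if not actual_at_risk and not predicted_at_risk:
        return "true negative"    # not at-risk and correctly passed over
    if predicted_at_risk:
        return "false positive"   # flagged, but not actually at-risk (type I error)
    return "false negative"       # at-risk, but the model missed it (type II error)

print(outcome(actual_at_risk=False, predicted_at_risk=True))   # false positive
print(outcome(actual_at_risk=True, predicted_at_risk=False))   # false negative
```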
Now let’s say that we construct a model that is 80%
accurate, which is a rule-of-thumb threshold for a good prediction. With 80%
accuracy on 500,000 loans, 400,000 will be correctly predicted and 100,000 will
not. At an 80% prediction rate, 800 of the 1,000 at-risk homeowners would be
identified correctly.
Of course, an 80% success rate means there is also a 20%
failure rate. I mentioned that 100,000 loans are not predicted correctly. Of
those, 200 are false negatives, the balance of the 1,000 target loans.
Subtracting the 200 false negatives from the 100,000 incorrect predictions
leaves 99,800 false positives. That is an extraordinary 125 times as many false
positives as correctly predicted at-risk loans.
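If you want to check the arithmetic, here is a rough sketch in Python. It assumes, as the example does, that the 80% accuracy applies equally to the at-risk and not-at-risk groups.

```python
# Back-of-the-envelope arithmetic for the 80% accuracy example.
total_loans = 500_000
at_risk = 1_000
accuracy = 0.80

true_positives = round(at_risk * accuracy)               # 800 at-risk loans caught
false_negatives = at_risk - true_positives               # 200 at-risk loans missed
wrong_predictions = round(total_loans * (1 - accuracy))  # 100,000 wrong calls overall
false_positives = wrong_predictions - false_negatives    # 99,800 false alarms

print(false_positives)                    # 99800
print(false_positives / true_positives)   # ~125 false alarms per at-risk loan caught
```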
Even if the model can predict at the incredible rate of 99%,
the false positives will still outnumber the correctly identified at-risk
cases by nearly five to one.
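Rerunning the same back-of-the-envelope calculation at 99% shows where the five-to-one figure comes from.

```python
# The same arithmetic, rerun at 99% accuracy.
total_loans, at_risk, accuracy = 500_000, 1_000, 0.99

true_positives = round(at_risk * accuracy)                               # 990
false_negatives = at_risk - true_positives                               # 10
false_positives = round(total_loans * (1 - accuracy)) - false_negatives  # 4,990

print(false_positives / true_positives)   # ~5 false positives per at-risk loan caught
```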
The problem here isn’t with data science or prediction methods.
It’s simple math. When you try to use statistics to find a very small group
within a very large one, the false positives will always greatly outnumber the
actual positive predictions. This type of problem is truly a needle in a haystack.