The Potential Pitfalls of Machine Learning Algorithms in Medicine

The ability of machine learning programs to diagnose medical conditions and predict outcomes is discussed in an opinion article.

Back in the 1990s, an intrepid group of researchers out of the University of Pittsburgh set out to write a computer program that could do a better job than doctors of predicting whether serious complications would develop in patients who presented with pneumonia.1 Success may have been a long shot, but it was definitely a shot worth taking. After all, the researchers figured that if they pulled it off, they could both lower costs and improve patient outcomes in one fell swoop. So they built a neural network (basically a computer program that learns to map inputs to outputs from examples rather than from hand-written rules) and turned it loose on a database covering three-quarters of a million patients in 78 hospitals across 23 states.

The results were curious, to say the least. The program seemed to have determined that patients with pneumonia and asthma had better outcomes than those who did not have asthma. Asthma, it appeared, was somehow providing some sort of protection.2 The neural net, which was by many measures outperforming both traditional software and expert doctors, would often recommend admission for a patient with pneumonia, yet conclude that the same patient with pneumonia and a history of asthma could safely be sent home.

Of course, any second-year medical student could tell you that this is dangerous nonsense. Far from being protective, asthma is a serious condition that can greatly complicate the management of any number of pulmonary illnesses, pneumonia included. It is hard to imagine a better reason for automatic admission than an acute presentation of pneumonia symptoms with a history of asthma attacks. So what went wrong with the neural net?

The singular goal of the software in the pneumonia study was to maximize the accuracy of its judgments. In this pursuit, it leveraged a concept called machine learning. As an idea, machine learning — the field of computer science concerned with developing algorithms that can both learn from data and improve themselves without being explicitly programmed — represents something of an ideal for medical problem solving.

In theory, these programs can ingest enormous quantities of information, discern the subtlest patterns (while simultaneously disregarding any distracting noise), and make meaningful and actionable predictions about future behaviors and outcomes. More than that, as it is used more frequently, the software only gets smarter. That machine learning algorithms are not more pervasive in contemporary medical practice probably has more to do with inadequate data sets and insufficient computing power than anything else. The concept remains sound. If the future is not now, then it is surely breathing down our necks.

Machine learning scientists are obsessed with the inevitable tradeoff between accuracy and intelligibility.3 Accuracy is exactly what you think it is: how often the software arrives at the correct answer. Intelligibility, on the other hand, has to do with our ability to understand why or how the algorithm arrived at its solution. Intelligibility is often compromised in hyperaccurate machine learning software precisely because the code doesn’t depend on explicit instruction to learn or improve; as the software becomes more intelligent, it doesn’t stop to tell us how. As a consequence, the most accurate machine learning models tend to be the least intelligible, and vice versa. Because of this, machine learning software designers are forced to decide how much accuracy they should be willing to trade away in the quest for increased intelligibility.

I can tell you this much: the answer is not zero, and not just because we would do well to avoid a 2001: A Space Odyssey situation. Unintelligible models, accurate though they may be, are inevitably limited by their inability to distinguish between correlation and causation. This is not anyone’s fault, least of all the fault of the algorithms themselves. The explicit, solitary objective of the software is to find patterns that exist in the data; whether those patterns arise by chance, from a true causal relationship, or from something in between is completely irrelevant to the program’s objectives. As a consequence, any causal link uncovered by an unintelligible algorithm is presumptively suspect: we simply cannot tell whether the relationship reflects overfitting, an interaction or correlation with other, possibly unmeasured, variables, or a true underlying phenomenon. Half of being smart, as they say, is knowing what you are dumb about.
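To see how a pattern can "exist in the data" purely by chance, consider a rough sketch in Python. Everything in it is an assumption made for illustration: the data are coin flips, the sizes (30 patients, 200 candidate features) are arbitrary, and the "model" is simply the single feature that best matches the outcome. None of it comes from the pneumonia study.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N_TRAIN, N_TEST, N_FEATURES = 30, 10_000, 200  # hypothetical sizes


def random_dataset(n_rows):
    """Coin-flip features and a coin-flip outcome: there is no real signal."""
    features = [[rng.random() < 0.5 for _ in range(N_FEATURES)]
                for _ in range(n_rows)]
    outcomes = [rng.random() < 0.5 for _ in range(n_rows)]
    return features, outcomes


def accuracy(features, outcomes, column):
    """Fraction of rows where the chosen feature equals the outcome."""
    hits = sum(row[column] == y for row, y in zip(features, outcomes))
    return hits / len(outcomes)


train_x, train_y = random_dataset(N_TRAIN)
test_x, test_y = random_dataset(N_TEST)

# "Learning" here means picking the feature that best matches the training data.
best_column = max(range(N_FEATURES),
                  key=lambda c: accuracy(train_x, train_y, c))
train_acc = accuracy(train_x, train_y, best_column)
test_acc = accuracy(test_x, test_y, best_column)

print(f"accuracy on the 30 patients it learned from: {train_acc:.2f}")
print(f"accuracy on fresh patients:                  {test_acc:.2f}")
```

With enough candidate features and few enough patients, some feature will look predictive on the training data, while on fresh data it should do no better than a coin toss. The program has no way of knowing the difference; it only knows that the pattern was there.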

Which brings us back to the pneumonia study. The uber-accurate neural net was, as you would guess, also unintelligible. The researchers had no idea how or why it reached its conclusions. Luckily, they also used the same data set to simultaneously test a number of rule-based algorithms that didn’t employ machine learning principles. While these approaches were not quite as accurate (although they still did quite well, all things considered), they were substantially more intelligible. By analyzing the outputs from those algorithms, the researchers discovered that patients with pneumonia and comorbid asthma had better outcomes because they were often automatically admitted to intensive care as a matter of policy. This higher-level care was apparently enough to overcome whatever deleterious effects the asthma might otherwise have caused.4 It stands to reason that this is also why the unintelligible neural net advocated sending people with asthma home; it correctly surmised that those with asthma somehow achieved better outcomes, but never bothered to ask why.
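The asthma paradox is easy to reproduce in a toy simulation. The sketch below is only that: every number in it (the base risk, the extra risk from asthma, the benefit of intensive care, the share of patients with asthma) is made up for illustration and is not taken from the study. The one thing it borrows from the story is the policy that patients with asthma go straight to intensive care, a fact the data set never records.

```python
import random

# All numbers are hypothetical, chosen only to illustrate the mechanism.
BASE_RISK = 0.10          # risk of a dire outcome with ordinary care
ASTHMA_EXTRA_RISK = 0.10  # added risk from comorbid asthma
ICU_RISK_FACTOR = 0.3     # intensive care cuts the risk to 30% of its value


def dire_outcome_risk(has_asthma, in_icu):
    """True (causal) risk for one patient under this toy model."""
    risk = BASE_RISK + (ASTHMA_EXTRA_RISK if has_asthma else 0.0)
    return risk * ICU_RISK_FACTOR if in_icu else risk


def simulate(n_patients=100_000, asthma_rate=0.2, seed=0):
    """Generate records under the policy 'asthma means automatic ICU'."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_patients):
        has_asthma = rng.random() < asthma_rate
        in_icu = has_asthma  # the admission policy; not stored in the record
        bad = rng.random() < dire_outcome_risk(has_asthma, in_icu)
        records.append((has_asthma, bad))
    return records


def observed_rate(records, has_asthma):
    """Fraction of dire outcomes among patients with the given asthma status."""
    outcomes = [bad for asthma, bad in records if asthma == has_asthma]
    return sum(outcomes) / len(outcomes)


records = simulate()
rate_with_asthma = observed_rate(records, True)
rate_without_asthma = observed_rate(records, False)

print(f"observed risk, pneumonia with asthma:    {rate_with_asthma:.3f}")
print(f"observed risk, pneumonia without asthma: {rate_without_asthma:.3f}")
print(f"true risk if an asthma patient is sent home: "
      f"{dire_outcome_risk(True, False):.3f}")
```

A program that sees only asthma status and outcome will conclude, correctly as far as the records go, that asthma patients fare better. Act on that conclusion by sending them home, and you remove the very intensive care that produced the pattern, leaving them at the highest risk of anyone in the model.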

To my mind, the critical insight from this story is not about machine learning software at all. It is about the irreplaceable value of expertise. It is impossible to make responsible policy on the basis of correlations alone; the paradoxical results of the pneumonia study teach us as much. As intelligent as machine learning software promises to be, only human experts — doctors who have spent a lifetime making all the mistakes that can be made in their narrow field — are capable of preventing us from blindly following a misguided algorithm down the wrong rabbit hole. Experts remain the only ones who can ask, and answer, the critical question — why?


  1. Ambrosino R, Buchanan BG, Cooper GF, Fine MJ. The use of misclassification costs to learn rule-based decision support models for cost-effective hospital admission strategies. Proc Annu Symp Comput Appl Med Care. 1995:304-308.
  2. Cooper GF, Abraham V, Aliferis CF, et al. Predicting dire outcomes of patients with community acquired pneumonia. J Biomed Inform. 2005;38(5):347-366.
  3. Caruana R. Intelligible machine learning models for health care. Microsoft Research. Published June 19, 2015. Accessed November 14, 2017.
  4. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. Presented at: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 10-13, 2015, Sydney, Australia.