Disease-detecting AI flounders when applied to images from other hospitals
Mount Sinai researchers have added to evidence that disease-detecting artificial intelligence systems may not be consistently reliable across data from different healthcare systems.
In 60% of assessments, artificial neural networks trained to diagnose pneumonia using X-rays from one hospital system were less effective when turned on images from another healthcare network.
The findings add to concerns that internal test performance overstates the accuracy of neural networks, potentially leading to patient harm if external healthcare systems rely too heavily on their results.
Computer-aided medical image analysis has been hailed as a way to offset shortages of radiologists and bring top-tier diagnostic abilities to remote regions. However, the field is in its infancy. Research suggests neural networks trained and tuned on a subset of images from one hospital system perform well when tested on another subset of data from the same healthcare network, but few studies have assessed whether these neural networks analyze external images accurately.
Research in the broader field of artificial intelligence suggests confounding factors could impede the generalizability of neural networks to external data. A paper published in JAMA last year showed that the specificity of a system for detecting diabetic retinopathy fell from more than 90% to 73% when it moved from internal to external data. Experts disagreed with the system's findings one-third of the time.
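To make the specificity figures concrete, the short Python sketch below shows what such a drop would mean in practice. Specificity is the share of disease-free patients a system correctly clears. The patient counts and the function names are hypothetical and chosen only for illustration; they are not taken from the JAMA paper, and 0.90 stands in for the "more than 90%" internal figure.

```python
def specificity(true_negatives, false_positives):
    """Fraction of disease-free patients correctly reported as negative."""
    return true_negatives / (true_negatives + false_positives)


def false_alarms(n_healthy, spec):
    """Expected number of healthy patients wrongly flagged at a given specificity."""
    return round(n_healthy * (1 - spec))


# Hypothetical cohort of 1,000 disease-free patients (not from the study).
n_healthy = 1000
internal_alarms = false_alarms(n_healthy, 0.90)  # illustrative internal specificity
external_alarms = false_alarms(n_healthy, 0.73)  # external specificity cited above

print(internal_alarms, external_alarms)
```

Under these assumed numbers, the same system would wrongly flag 100 healthy patients internally and 270 externally.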
To further probe the topic, researchers from the Icahn School of Medicine at Mount Sinai collected 158,000 chest X-rays taken at three medical institutions. Neural networks trained to spot pneumonia in images from one or two of the institutions were then tested on data from another healthcare system, where they performed significantly worse in 60% of the comparisons.
The finding has implications for the spread of neural networks from research settings into real-world use at multiple healthcare systems.
"Our findings should give pause to those considering rapid deployment of such systems without first assessing their performance in a variety of real-world clinical settings," the researchers wrote in PLOS Medicine. "If external test performance of a system is inferior to internal test performance, clinicians may erroneously believe systems to be more accurate than they truly are in the deployed context, creating the potential for patient harm."
Efforts to understand the cause of the divergent performance are complicated by the scale of neural networks — the architecture used in the study has 7 million parameters — but there is evidence that the system used variables other than underlying pathology to make its predictions.
X-ray images contain a multitude of details that reveal where they were taken. Some of these details, such as the metallic tokens used to indicate laterality, are obvious to the human eye. The Mount Sinai research suggests neural networks also pick up on subtler differences, for example in the processing and compression of images. If one of these features indicates the image comes from a site or scanner associated with greater disease prevalence, the neural network will use it in its predictions.
Reliance on these non-pathological features increases the accuracy of the neural network when it analyzes images from the same hospital system. However, the same features in images taken at other facilities may not correlate with disease prevalence, leading to poor generalizability.
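The effect can be illustrated with a toy Python sketch. It is not the researchers' model or data: the site markers, image counts, and the `fit_shortcut` and `evaluate` helpers are all hypothetical. The "model" here never looks at pathology. It scores each image purely by the pneumonia prevalence of the site marker it carries, which is the kind of shortcut the study suggests a neural network can learn.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_site(marker, n_pos, n_neg):
    """Build (marker, label) pairs; label 1 means pneumonia is present."""
    return [(marker, 1)] * n_pos + [(marker, 0)] * n_neg


def fit_shortcut(data):
    """'Train' a model that only memorizes disease prevalence per site marker."""
    prevalence = {}
    for marker in {m for m, _ in data}:
        labels = [y for m, y in data if m == marker]
        prevalence[marker] = sum(labels) / len(labels)
    return prevalence


def evaluate(model, data):
    """Score every image by its marker alone and report the AUC."""
    scores = [model[m] for m, _ in data]
    labels = [y for _, y in data]
    return auc(scores, labels)


# Hypothetical training data: marker "A" comes from a high-prevalence site
# (40% pneumonia), marker "B" from a low-prevalence site (5% pneumonia).
train = make_site("A", 40, 60) + make_site("B", 5, 95)
model = fit_shortcut(train)

# Internal test set: same mix of sites and prevalences as the training data.
internal = make_site("A", 40, 60) + make_site("B", 5, 95)
# External test set: both markers appear, but neither is tied to prevalence.
external = make_site("A", 5, 45) + make_site("B", 5, 45)

internal_auc = evaluate(model, internal)
external_auc = evaluate(model, external)
print(f"internal AUC: {internal_auc:.3f}, external AUC: {external_auc:.3f}")
```

With these made-up counts, the shortcut alone yields an AUC of about 0.75 on the internal data and exactly 0.5, no better than chance, on the external data, even though the model has learned nothing about what pneumonia looks like.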
- PLOS Medicine: Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
- JAMA: Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes