In one of the first such warnings about the role of artificial intelligence in making sense of critical health data, a team of US researchers has said that AI in the medical space must be carefully tested for performance across a wide range of populations, as deep learning models may fall short.
The findings should give pause to those considering rapid deployment of AI platforms without rigorously assessing their performance in the real-world clinical settings where they will be deployed, observed the team from the Icahn School of Medicine at Mount Sinai.
AI tools trained to detect pneumonia on chest X-rays suffered significant decreases in performance when tested on data from outside health systems, according to the study published in a special issue of PLOS Medicine on machine learning and health care.
These findings suggest that the deep learning models may not perform as accurately as expected.
“Deep learning models trained to perform medical diagnosis can generalise well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions,” said senior author Eric Oermann, MD, an instructor in neurosurgery at the Icahn School of Medicine at Mount Sinai.
To reach this conclusion, the researchers assessed how AI models identified pneumonia in 158,000 chest X-rays from three medical institutions — the National Institutes of Health, The Mount Sinai Hospital and Indiana University Hospital.
In three out of five comparisons, the convolutional neural networks’ (CNNs’) performance in diagnosing disease on X-rays from hospitals outside their own network was significantly lower than on X-rays from the original health system.
However, the CNNs were able to detect the hospital system where an X-ray was acquired with a high degree of accuracy, and cheated at their predictive task by relying on the prevalence of pneumonia at the training institution rather than on features of the disease itself.
“If AI systems are to be used for medical diagnosis, they must be tailored to carefully consider clinical questions, tested for a variety of real-world scenarios, and carefully assessed to determine how they impact accurate diagnosis,” explained the study’s first author, John Zech.