AI Mammogram Diagnosis Under Real-World Conditions
Two studies test Google's breast cancer detection models in clinics

[Figure: Map of UK study sites and a flowchart of the mammogram study steps, highlighting AI's role alongside doctors.]

Introduced in 2020, Google's AI system for detecting breast cancer in mammograms has yet to be used to diagnose patients in routine care. Two studies evaluated how well it would integrate with protocols at UK clinics.

What’s new: In a test on real-world data, Google’s breast-cancer detection system identified slightly more cancers, with fewer false positives, than examinations by the first of two expert doctors. More significantly, it identified a quarter of the cancers that human doctors initially missed but that became apparent later. In a companion study, the system performed about as well as a second expert (who considered the first’s opinion). However, some doctors reported distrust of the system’s output. The studies were conducted by Christopher J. Kelly, Marc Wilson, and colleagues at Google, Imperial College London, University of Surrey, Royal Surrey National Health Service Foundation Trust, and several National Health Service Breast Screening Centres.

How it works: Google’s system uses three convolutional neural networks that were trained on a mammography database to produce embeddings, determine potential cancerous regions, and classify the probability of cancer.
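To make the three-stage structure concrete, here is a minimal sketch of how such a pipeline composes. The real system uses three trained convolutional neural networks; each stage below is a toy stand-in (a random projection, a brightness-threshold region proposal, and a logistic scoring head) invented for illustration, not Google's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(image: np.ndarray, dim: int = 16) -> np.ndarray:
    """Stage 1: map an image to a fixed-length embedding (toy random projection)."""
    w = rng.standard_normal((image.size, dim)) / np.sqrt(image.size)
    return image.reshape(-1) @ w

def propose_regions(image: np.ndarray, thresh: float = 0.8) -> list:
    """Stage 2: flag coordinates of unusually bright pixels (toy region proposal)."""
    ys, xs = np.where(image > thresh)
    return list(zip(ys.tolist(), xs.tolist()))

def cancer_probability(embedding: np.ndarray, n_regions: int) -> float:
    """Stage 3: score malignancy from the embedding and region count (toy logistic head)."""
    logit = embedding.mean() + 0.5 * n_regions - 1.0
    return float(1.0 / (1.0 + np.exp(-logit)))

image = rng.random((32, 32))          # stand-in for a preprocessed mammogram
e = embed(image)
regions = propose_regions(image)
p = cancer_probability(e, len(regions))
print(f"{len(regions)} candidate regions, cancer probability {p:.3f}")
```

The point is the interface, not the internals: each stage consumes the previous stage's output, and the final stage reduces everything to a single probability of cancer.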

Tests and results: In the two studies, the AI system identified more cancers, and identified them faster and earlier, within a typical UK diagnostic workflow.

  • A retrospective test evaluated the system’s ability to detect cancers in 116,000 mammograms of women ages 50 to 70, taken at five hospitals in 2016. The authors selected scans of the same women taken up to 39 months apart and compared their system’s diagnoses with those of human experts. The system achieved a sensitivity of 0.541 (the proportion of positives correctly identified), significantly higher than the 0.437 achieved by the first of two human evaluations. Its specificity was 0.943 (the proportion of negatives correctly identified) compared to the human rate of 0.952, lower but statistically equivalent. The AI system also identified 25 percent of cancers that humans had missed initially but that became apparent three years later.
  • Using 46,000 scans, the authors simulated replacing the second of two human evaluators with the system. The system achieved slightly better sensitivity and specificity, indicating that using AI for the second evaluation could save time while improving accuracy. Under the clinics’ protocols, cases in which cancer was detected or the AI disagreed with the human were sent to an arbitration panel for a final determination. The AI sent 1,800 more cases to arbitration than the all-human workflow (an absolute increase of 4 percentage points; 5,300 total). Assuming that arbitration takes five times as much human effort as a reading, the authors concluded that, despite the additional arbitrations, the system would reduce human effort by roughly 40 percent.
  • A live test evaluated the system’s ability to integrate with real-world National Health Service infrastructure. The system labeled around 9,250 fresh scans as high or low risk; the scans came from women ages 50 to 70 at 12 clinics over a few months in 2023 and 2024. (The test did not affect patient care: patients were diagnosed by doctors in the usual way, and neither doctors nor patients were informed of the AI system’s diagnosis.) The system was much faster than human doctors, achieving a median processing time of 17.7 minutes from screening to interpretation, compared to more than two days for the first of two human evaluations. The authors followed up three months later to determine the ground truth of whether each patient had cancer. As in the retrospective study, the system achieved better sensitivity than the first human evaluation and lower but statistically equivalent specificity.
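The metrics and the workload claim above can be checked with simple arithmetic. The confusion-matrix counts below are illustrative (chosen to reproduce the reported rates; the studies' raw counts aren't given here), and the effort model, one unit per human read and five per arbitration case, is the assumption stated in the study, applied in a simplified way.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Proportion of actual positives correctly identified (true-positive rate)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of actual negatives correctly identified (true-negative rate)."""
    return tn / (tn + fp)

# Illustrative counts, not the studies' actual confusion matrix.
print(round(sensitivity(541, 459), 3))   # 0.541
print(round(specificity(943, 57), 3))    # 0.943

# Back-of-envelope workload model for the second-reader simulation.
scans = 46_000
arbitration_human = 5_300 - 1_800        # cases arbitrated under the all-human workflow
arbitration_ai = 5_300                   # cases arbitrated with AI as second reader
effort_human = 2 * scans + 5 * arbitration_human   # two human reads per scan
effort_ai = 1 * scans + 5 * arbitration_ai         # one human read per scan
reduction = 1 - effort_ai / effort_human
print(f"estimated reduction in human effort: {reduction:.0%}")
```

This crude model yields a reduction of about a third, in the same ballpark as the authors' roughly 40 percent; their accounting of effort differs in detail.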

Behind the news: Efforts to use AI for breast cancer detection began with earlier computer-aided detection (CAD) systems in the 1990s and 2000s, but the field accelerated in the mid 2010s as deep-learning models trained on large mammography datasets began outperforming older methods. In 2020, researchers at Google showed that an AI system could match or exceed expert radiologists in screening mammograms while reducing both false positives and false negatives. In late 2022, Google licensed the system to iCAD, which offers a breast-imaging platform, for deployment in real-world clinics. In 2023, Google and iCAD expanded their partnership into a 20-year worldwide commercialization agreement aimed at using Google’s AI as an independent “second reader” of 2D mammography. The partnership currently aims to secure regulatory approval for potential deployment in breast-cancer screening systems that use double-reading workflows.

Why it matters: Around 2.3 million women are diagnosed with breast cancer annually worldwide, and 760,000 don’t survive. Early diagnosis is critical. Yet the diagnostic system is overburdened. In the UK, for instance, a consultant breast radiologist has only four hours available weekly to look at the 5,000 scans they must read annually to maintain their certification. These studies show that AI can ease diagnostic workloads and improve outcomes by helping to prioritize scans or serving as a default co-reader. But they also highlight a need to build trust in the technology among doctors. This may require educating physicians in how AI systems work and making the systems’ output more explainable.

We’re thinking: As AI systems find their way into medicine, they raise important questions about the steps needed to build trust in the technology, and what checks and balances will yield the best outcomes. Developers can talk directly with doctors about what they need to gain trust in an AI system's output.
