Resaro: Evaluating the Performance & Robustness of an AI System for Chest X-ray (CXR) Assessments

An assessment of the feasibility of deploying a commercial AI solution to evaluate Chest X-ray (CXR) images.

Background & Description

A healthcare organisation in Singapore engaged Resaro to evaluate the feasibility of deploying a commercial AI solution for evaluating Chest X-ray (CXR) images.

This evaluation was done within the context of a triage framework that takes a two-tiered approach. The first tier classifies CXRs as either “normal” or “abnormal”; the second separates the “abnormal” cases into critical abnormalities, which need urgent attention, and non-critical abnormalities. The main purpose of the evaluation was to establish the overall performance and robustness of the CXR AI system and to produce deployment recommendations, such as optimal thresholds for AI predictions.

For this evaluation, we used CXR images from participating healthcare facilities in Singapore. We corrected for potential deviations between the evaluation dataset and the representative population of the healthcare institution using statistical reweighting methods. The analysis assessed the performance of the AI solution on these CXR images through metrics such as sensitivity and specificity. Additionally, we identified and calculated operationally relevant metrics and thresholds to offer a well-rounded view of the model’s utility and reliability in real-world applications. To contextualise the AI’s performance, we compared its results against the diagnostic accuracy of experienced human radiologists.
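The reweighting step can be illustrated with a minimal sketch. It assumes each case is weighted by the ratio of its finding category's prevalence in the target population to that category's share of the evaluation sample; the function names, category names, prevalences and toy labels below are hypothetical and are not taken from the evaluation itself.

```python
from collections import Counter


def prevalence_weights(categories, target_prevalence):
    """Weight each case by target prevalence / observed share of its category."""
    counts = Counter(categories)
    n = len(categories)
    return [target_prevalence[c] / (counts[c] / n) for c in categories]


def weighted_sens_spec(y_true, y_pred, weights):
    """Weighted sensitivity and specificity for binary labels (1 = abnormal)."""
    rows = list(zip(y_true, y_pred, weights))
    tp = sum(w for t, p, w in rows if t == 1 and p == 1)
    fn = sum(w for t, p, w in rows if t == 1 and p == 0)
    tn = sum(w for t, p, w in rows if t == 0 and p == 0)
    fp = sum(w for t, p, w in rows if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy evaluation set (hypothetical): critical cases are over-sampled
# relative to the assumed target population.
categories = ["normal"] * 4 + ["non_critical"] * 2 + ["critical"] * 4
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]

# Assumed prevalences in the representative population (illustrative only).
target = {"normal": 0.7, "non_critical": 0.2, "critical": 0.1}

weights = prevalence_weights(categories, target)
sens, spec = weighted_sens_spec(y_true, y_pred, weights)
print(f"weighted sensitivity={sens:.3f}, weighted specificity={spec:.3f}")
```

On this toy data the unweighted sensitivity is 5/6, while the weighted sensitivity is 2/3, because the missed non-critical case carries more weight once the over-sampled critical cases are down-weighted. The result is only as good as the assumed prevalences.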

Resaro’s analysis extended to assessing fairness, investigating performance differences between subpopulations. While the primary investigation showed no bias across gender, performance differences between age groups flagged the need for further analysis of possible hidden confounders. The analysis also included robustness testing, in which the team measured the AI model’s performance on inputs algorithmically augmented with image corruptions encountered in typical CXR imaging.
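A robustness test of this kind can be sketched as follows. The two corruptions shown (a brightness shift and additive Gaussian noise) and the `toy_model` scorer are illustrative assumptions, not the corruptions or the AI system used in the evaluation; the point is only the pattern of applying a corruption and comparing performance against the clean baseline.

```python
import random


def clip01(value):
    """Keep a pixel value inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))


def adjust_brightness(image, delta):
    """Shift every pixel of a grayscale image by a constant amount."""
    return [[clip01(px + delta) for px in row] for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[clip01(px + rng.gauss(0.0, sigma)) for px in row] for row in image]


def toy_model(image):
    """Hypothetical stand-in for the AI system: score = mean pixel intensity."""
    pixels = [px for row in image for px in row]
    return sum(pixels) / len(pixels)


def accuracy(images, labels, corrupt=None, threshold=0.5):
    """Accuracy of the toy model, optionally after corrupting each image."""
    correct = 0
    for image, label in zip(images, labels):
        if corrupt is not None:
            image = corrupt(image)
        predicted = int(toy_model(image) >= threshold)
        correct += int(predicted == label)
    return correct / len(images)


# Four constant 4x4 toy "images" with labels (0 = normal, 1 = abnormal).
images = [[[v] * 4 for _ in range(4)] for v in (0.2, 0.4, 0.6, 0.8)]
labels = [0, 0, 1, 1]

rng = random.Random(0)
clean_acc = accuracy(images, labels)
bright_acc = accuracy(images, labels, lambda im: adjust_brightness(im, 0.15))
noisy_acc = accuracy(images, labels, lambda im: add_gaussian_noise(im, 0.05, rng))
print(clean_acc, bright_acc, noisy_acc)
```

Comparing each corrupted score with the clean baseline shows which corruption types the model is most sensitive to; in a real study the corruption severities would be chosen with clinical input.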

This evaluation has been pivotal in building clinical confidence and fostering trust in the effectiveness of the CXR AI solution when deployed in the local setting. Resaro used the results to inform decision makers of the trade-offs involved in choosing the operational threshold.

Relevant Cross-Sectoral Regulatory Principles

Safety, Security & Robustness

Ensuring AI safety and robustness is critical in the context of CXR analysis, as the accuracy and reliability of these systems directly impact clinical decision-making and the overall trust in deploying AI in high-stakes healthcare environments.

This evaluation process enhances safety by ensuring the AI solution reliably prioritises patients who are in need of urgent care, preventing critical cases from being overlooked and minimising unnecessary workload for radiologists on non-critical cases.

Comparing AI performance with human radiologists confirms that the system meets high operational standards and is feasible in clinical settings. Additionally, the robustness assessment tested the system’s resilience against common image corruptions, revealing strong robustness overall while identifying specific types of corruptions to which the system is more sensitive.

Appropriate Transparency & Explainability

The performance analysis provided clear indicators of the system’s performance across various metrics, allowing the clinicians to understand the strengths and weaknesses of the system. A study on operational thresholds during the analysis helped prioritise various levels of sensitivity and specificity at different triage levels. The robustness analysis informed operators on which types of corruptions to monitor and filter for consistent performance in real-world conditions.
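One way to study operational thresholds is to scan candidate thresholds and report the sensitivity and specificity at each. The sketch below assumes the AI system outputs a score per image and that the operating point is the highest threshold that still meets a minimum sensitivity; the selection rule, function names and toy scores are assumptions for illustration, not the procedure used in the evaluation.

```python
def sens_spec_at(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are flagged abnormal."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(scores, labels, min_sensitivity):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity meets the target; candidates are the observed scores."""
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sens_spec_at(scores, labels, threshold)
        if sens >= min_sensitivity:
            return threshold, sens, spec
    raise ValueError("no threshold reaches the requested sensitivity")


# Toy scores (hypothetical): four normal cases followed by four abnormal cases.
scores = [0.05, 0.2, 0.35, 0.6, 0.3, 0.55, 0.7, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

strict = pick_threshold(scores, labels, min_sensitivity=1.0)
relaxed = pick_threshold(scores, labels, min_sensitivity=0.75)
print("strict:", strict)
print("relaxed:", relaxed)
```

Requiring higher sensitivity pushes the threshold down and costs specificity, which is the trade-off decision makers weigh at each triage level.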

Fairness

As part of the evaluation, Resaro conducted bias audits to assess performance variations across sub-populations defined by sensitive attributes such as gender and age. The evaluation did not identify any evidence of bias across genders. However, some performance differences were observed between age groups. We recommended investigating additional factors in future evaluations to better understand and address any potential age-related bias.
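A subgroup comparison of this kind can be sketched by computing the same metric per group and looking at the gap. The age bands, labels and predictions below are hypothetical toy values, and sensitivity is used only as an example metric; a gap on its own does not establish bias, since confounders may explain it.

```python
from collections import defaultdict


def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity (true positive rate) computed separately for each subgroup."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            true_pos[group] += int(pred == 1)
    return {g: true_pos[g] / positives[g] for g in positives}


# Toy data (hypothetical age bands; 1 = abnormal).
groups = ["under_60"] * 5 + ["60_plus"] * 5
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

per_group = sensitivity_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

In practice each per-group estimate should come with a confidence interval, because small subgroups can show large gaps by chance.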

Why we took this approach

Performance metrics (such as accuracy, sensitivity and specificity) reported by AI vendors may not fully reflect the system’s effectiveness in a local setting, as distribution shifts in patient or health data can significantly affect results. This highlights the importance of conducting independent evaluations tailored to the specific context of use.

To ensure the AI system’s feasibility for the use case of reviewing CXRs for triaging, we conducted evaluations using real CXR images closely aligned with those encountered in routine operation. Comparing the AI’s performance with that of working radiologists allowed us to identify any performance gains or losses from modifying the CXR review pipeline. Additionally, robustness testing focused on typical image corruptions verified by medical professionals, ensuring that the system remains reliable under realistic conditions and that any performance fluctuations are well-understood.

Benefits to the organisation using the technique

This evaluation helped to build trust in the AI triage solution, ensuring that performance remains consistent when transitioning to an AI-assisted workflow rather than relying solely on human radiologists. It enabled the organisation to set the right thresholds to maintain appropriate sensitivity and specificity across different triage levels, ensuring accurate prioritisation of patient cases. Additionally, the evaluation helped identify potential biases in the system, facilitating deeper analysis and adjustments for fair treatment across patient demographics. By revealing specific types of image corruption that can impact AI predictions, it also guides the implementation of filters in the pipeline to flag CXRs that are overly degraded. These measures collectively contribute to more reliable, equitable, and accurate AI-supported patient care.

Limitations of the approach

A possible limitation of our approach lies in the assumptions made about the prevalence of various findings in a representative population in order to reweight the evaluation dataset. The validity of these assumptions should be reviewed as the system is rolled out at scale to more healthcare institutions in Singapore. Additionally, we recommend a larger and more diverse dataset for future evaluations to identify other potential confounders in AI predictions, such as variations introduced by different X-ray machines. Lastly, we propose adopting a more fine-grained triaging approach in the future to gain a more detailed understanding of the AI system’s performance.

Updates to this page

Published 5 December 2024