Deliverable 1: principles for the evaluation of artificial intelligence or machine learning-enabled medical devices to assure safety, effectiveness and ethicality
Published 30 December 2021
Introduction
As part of the G7’s health track artificial intelligence (AI) governance workstream 2021, member states committed to the creation of 2 deliverables on the subject of governance:
- The first paper seeks to define good practice for clinically evaluating artificial intelligence or machine learning (AI/ML)-enabled medical devices.
- The second focuses on how to assess the suitability of AI/ML-enabled medical devices developed in one G7 country for deployment in another G7 country.
These papers are complementary and should therefore be read in combination to gain a more complete picture of the G7’s stance on the governance of AI in health.
This paper is the result of a concerted effort by G7 nations to contribute to the creation of harmonised principles for the evaluation of AI/ML-enabled medical devices, and the promotion of their effectiveness, performance, safety and ethicality. It builds on the efforts of existing international work led by the:
- Global Digital Health Partnership (GDHP)
- Institute of Electrical and Electronics Engineers (IEEE)
- International Medical Device Regulators Forum (IMDRF)
- International Organization for Standardization (ISO)
- International Telecommunication Union (ITU)
- World Health Organization (WHO)
- Organisation for Economic Co-operation and Development (OECD)
A total of 3 working group sessions were held to reach consensus on the content of this paper.
Current evaluation landscape
The use of AI/ML-enabled medical devices presents significant potential benefit to patients and health systems in a wide range of clinical applications including:
- screening
- diagnosis
- decision support for treatment selection
The rapid emergence of AI/ML-enabled medical devices provides novel challenges to current regulatory and governance systems, which are based on more traditional forms of Software as a Medical Device (SaMD).
The IMDRF describes AI[footnote 1] as:
a branch of computer science, statistics and engineering that uses algorithms or models to perform tasks and exhibit behaviours such as learning, making decisions and making predictions.
Machine learning (ML) is described as a subset of AI[footnote 1] that:
allows computer algorithms to learn through data, without being explicitly programmed, to perform a task.
Regulators, international standards bodies[footnote 2] and health technology assessors across the world are grappling with how they can provide assurance that AI/ML-enabled medical devices are safe, effective and performant – not just under test conditions but in the real world.
Specific challenges include:
- generalisability (that is, avoiding failure when a technology is deployed in a population or setting with different characteristics to its training environment)
- continuous assurance (that is, ensuring safety in a system that is frequently or even continuously updating)
- inclusion and fairness (considering the whole pathway from training to wide-scale deployment)
We note the work of the ITU-WHO and the GDHP in calling for the development and international recognition of a framework for the clinical evaluation of health AI, based on the democratic values we share as G7 nations and our shared commitment to responsible stewardship of trustworthy AI as set out in the OECD AI principles.
We recognise the importance of this work for aligning regulatory requirements for AI/ML-enabled medical devices being spearheaded by the IMDRF and by regional and country-specific regulatory bodies, as well as the work by the ISO on medical device software and quality management standards.
We also recognise the existence of frameworks to validate AI technologies developed in some of the G7 jurisdictions.
As G7 nations, we recognise the need to work together to define and promote good practices in the evaluation of AI/ML-enabled medical devices in the health and care setting to promote high standards, patient safety and privacy.
We also recognise the need to provide assurances to users and promote the trustworthiness of this technology while also clarifying expectations for innovators.
In addition, the harmonisation of standards across G7 countries strengthens our trade relations and the attractiveness of our markets (as innovations can move across countries with reduced costs of entry), and provides global leadership.
Vision for the future
We are committed to enabling the development and adoption of safe, effective, performant and ethical AI/ML-enabled medical devices in health and care systems across the world. The G7 are well positioned to contribute to the leadership needed to fulfil this mission.
We follow the WHO's key ethical principles for the use of AI for health, set out below, to define the scope of what we mean by ethical. The breadth of this definition goes beyond the remit of medical device regulators.
We commend the work of the WHO in defining 6 principles[footnote 3] to ensure AI works to the public benefit of all countries, maximising the promise of technology, and holding stakeholders accountable to those who will rely on the technology and whose health will be affected by its use.
The principles are:
- protecting human autonomy
- promoting human wellbeing and safety, and the public interest
- ensuring transparency, explainability and intelligibility
- fostering responsibility and accountability
- ensuring inclusiveness and equity
- promoting AI that is responsive and sustainable
We will endeavour to implement the WHO principles in our own countries and through major international initiatives around the world.
To enable responsible innovation that allows patients to benefit from safe, effective, performant and ethical AI/ML-enabled medical devices, we aim to build evaluation systems in our countries for these technologies that:
1. Champion inclusiveness, fairness and transparency
We aim to champion inclusiveness, fairness and transparency, building on the democratic values we share as G7 nations. We should provide assurance that AI/ML-enabled medical devices are designed to be inclusive, enabling widespread benefit, and supporting deployment and adoption across various population groups.
2. Foster a patient-centred approach
We acknowledge the importance of patient and public voices in the development and deployment of AI/ML-enabled medical devices in health. We intend to provide assurance that AI/ML-enabled medical devices are co-designed with stakeholders from diverse backgrounds.
3. Provide proportionate continuous evaluation
We recognise the need for a clinical evaluation and monitoring system that is rigorous, independent, transparent, and proportionate to allow safe, effective and performant technologies to benefit patients, while balancing risks and benefits of deploying technologies.
Key principles for clinical evaluation
We recognise the need for international alignment on the evaluation of AI/ML-enabled medical devices for healthcare to protect the safety of patients across the world and drive responsible trustworthy innovation.
The principles laid out in this paper cover applications of AI/ML-enabled medical devices in health and care. The evaluation of this type of medical device should be rigorous, independent[footnote 4], continuous and proportionate.
To develop this shared understanding, we propose the following key principles, which may be considered intrinsic to any framework for the evaluation of AI/ML-enabled medical devices and which help to foster a safety culture among manufacturers. These should be thought of as running in parallel rather than as a sequential plan.
1. Assessing the suitability of the AI/ML-enabled medical devices to the need
1.1. Manufacturers of AI/ML-enabled medical devices should be sufficiently transparent and provide information to users about the rationale and intended purpose of the device, including defining:
- the clinical need that it aims to address
- the intended use population and setting for which it is designed
- the intended outcomes for patients and health systems[footnote 5]
1.2. Users of AI/ML-enabled medical devices should be able to use the information provided by manufacturers (see 1.1) to understand the technology and assess its suitability for the intended clinical need. Users should be able to clearly understand the following (a sketch of how this information might be recorded follows this list):
- the performance of the model for appropriate subgroups at the intersections of demographics and clinical status
- the characteristics of the data used to train and test the model
- acceptable inputs
- known limitations
- how to interpret the model's user interface and how the model integrates into the clinical workflow
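Footnote 10 cites model cards and 'model facts' labels as emerging ways of conveying exactly this kind of information. A minimal sketch of such a machine-readable record is below; the schema and every example value are illustrative assumptions, not a prescribed format:

```python
# Illustrative 'model facts' record in the spirit of the model cards and
# 'model facts' labels cited in footnote 10. Every field name and value
# here is a hypothetical example, not a prescribed schema.
from dataclasses import dataclass

@dataclass
class ModelFacts:
    intended_use: str                  # clinical need, population and setting (1.1)
    acceptable_inputs: list[str]       # inputs the device is validated for
    known_limitations: list[str]       # situations where the device should not be relied on
    training_data_summary: str         # provenance and characteristics of training/test data
    subgroup_performance: dict[str, dict[str, float]]  # metrics per demographic/clinical subgroup

example = ModelFacts(
    intended_use="Triage of adult chest X-rays for suspected pneumothorax in secondary care",
    acceptable_inputs=["PA chest X-ray, DICOM format, patients aged 18 and over"],
    known_limitations=["Not validated for portable (AP) films or paediatric patients"],
    training_data_summary="Hypothetical: 120,000 studies from 4 hospitals, 2015 to 2019",
    subgroup_performance={
        "female, 18-40": {"sensitivity": 0.91, "specificity": 0.88},
        "male, over 65": {"sensitivity": 0.89, "specificity": 0.90},
    },
)
```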
1.3. A comprehensive set of ex-ante requirements should be defined. The technology should demonstrate user-centred design, with understanding and integration of user needs to improve safe usage, effective and performant implementation, and successful adoption.[footnote 6]
1.4. Regulators (or approved or notified bodies) appointed by relevant jurisdictions should be made appropriately aware of clinically relevant or significant device modifications and updates to establish whether the model has maintained its performance, safety and an acceptable benefit-risk profile (see 4.1).
Users should be made aware of these types of modifications and updates from real-world performance monitoring, and should have a means to communicate product concerns to the developer. In particular, users should be made aware of any updates or modifications that impact the intended use of the device or lead to differing levels of performance on certain population subgroups.
2. Evaluating the technical performance of the AI/ML-enabled medical devices
2.1. To promote technical robustness, manufacturers of AI/ML-enabled medical devices should test performance by comparing it to existing benchmarks, ensuring that the results are reproducible in different settings and reported using standard performance metrics.[footnote 7]
Regulators should put in place a thorough validation process to provide assurance that claims are tested using cross-validation or independent data sets reflecting the intended purpose (including the diversity of the intended population and setting) before manufacturers can place the product on the market. Potential sources of dependence, including patient, data acquisition and site factors, should be considered to provide assurance of independence.
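As a hedged illustration of the kind of testing described above, the sketch below reports standard performance metrics under cross-validation using scikit-learn; the data set and model are placeholders. Grouping folds by patient or site (for example with scikit-learn's GroupKFold) is one way to address the sources of dependence mentioned above:

```python
# Sketch: standard performance metrics under cross-validation
# (placeholder data and model, not a real device).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data
model = LogisticRegression(max_iter=1000)  # stand-in for the device's model

# Stratified 5-fold cross-validation, reporting widely used metrics. For real
# clinical data, group folds by patient or site (e.g. GroupKFold) so correlated
# records never appear in both the training and test folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "recall", "precision"])

for metric in ("test_roc_auc", "test_recall", "test_precision"):
    print(f"{metric}: mean {scores[metric].mean():.3f}, sd {scores[metric].std():.3f}")
```

Internal cross-validation of this kind complements, but does not replace, evaluation on truly independent external data sets reflecting the intended population and setting.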
This is discussed further in Deliverable 2: principles to support the development and deployment of AI/ML-enabled medical devices across jurisdictions.
3. Evaluating the clinical performance of the AI/ML-enabled medical devices
3.1. The effects of AI/ML-enabled medical devices should be evaluated in clinically relevant conditions. Regulators should contribute to developing standards for the design of clinical studies. These standards need to balance collecting robust evidence with the need to minimise risk to patients and minimise negative impact on clinical workflow.
In their study design, manufacturers should proactively take into account the effects their studies would have on healthcare organisations and, where appropriate, explore the possibility of retrospective studies in order to minimise disruption.
Manufacturers should follow these standards and should strive to have a clear comparator, whether this be a parallel comparator group or a sequential comparator (before-after comparison).
3.2. The level of intervention in clinical studies may range from studies in which the technology is run in ‘shadow’ or ‘silent mode’ (where the technology is run in parallel with the existing service provision) to full interventional clinical trials or investigations where it is deployed as per its intended purpose.
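A minimal sketch of what 'shadow mode' operation might look like in software is given below. The `model`, `case_id` and `case_features` names are placeholders for a real device and its inputs; the key property is that outputs are logged for later comparison with ground truth but never returned to the clinical workflow:

```python
# Hedged sketch of 'shadow mode' evaluation (3.2): the model runs in parallel
# with existing service provision; its outputs are recorded for later analysis
# but cannot influence the care the patient actually receives.
import json
import logging
from datetime import datetime, timezone

shadow_log = logging.getLogger("shadow_evaluation")

def shadow_evaluate(model, case_id: str, case_features) -> None:
    """Run the model silently and record its output; nothing is returned."""
    prediction = model.predict(case_features)  # placeholder device interface
    shadow_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "case_id": case_id,  # joined later against the observed clinical outcome
        "prediction": str(prediction),
    }))
```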
3.3. The outcomes assessed should be pre-defined by manufacturers and consider patient, user and system elements.
Manufacturers and users of AI/ML-enabled medical devices should provide assurance that metrics of effectiveness and safety include outcomes that are meaningful to patients and clinical outcome assessments, including patient-reported outcomes where possible.
Manufacturers should ensure that user and system elements include assessments of:
- acceptability
- human-computer interactions (including how the output is interpreted and actioned)
- wider impact on care pathways
3.4. Manufacturers should consider, and regulators should monitor, the scale and speed of deployment and the issue of generalisability. This will vary depending on the nature of the intervention and the evidence provided that performance will not deteriorate across populations and settings (see Deliverable 2: principles to support the development and deployment of AI/ML-enabled medical devices across jurisdictions).
3.5. Manufacturers should transparently report studies of AI in health using the most appropriate internationally recognised standard, such as CONSORT-AI for reporting randomised controlled trials of AI interventions, and equivalent tools for diagnostic test accuracy studies (STARD-AI) and prediction model studies (TRIPOD-AI).
4. Evaluating the long-term safety and wider impact of AI/ML-enabled medical devices
4.1. As with other medical devices, the evaluation of an AI/ML-enabled medical device necessitates a life cycle-driven approach from manufacturers.
Data quality, methods of data acquisition (including instrument models), population characteristics, and clinical practice all change over time in ways that can impact the validity and utility of AI/ML-enabled medical devices.
Additionally, as AI/ML-enabled medical devices in deployment become more frequently updated – and, in due course, ‘continuously-learning’[footnote 8] – assurance of performance will require continuous monitoring from manufacturers[footnote 9]. Depending on the nature of the changes made to the device through ‘continuous learning’ or updates, manufacturers might need to consider a new conformity assessment.
A delicate balance will have to be established with regulators in order to avoid continually undertaking conformity assessment.
4.2. Long-term monitoring through the product life cycle by manufacturers should not only provide assurance of no significant decline in the performance of the technology itself, but also evaluate the wider impact of deploying the technology, including actual patient and health system benefit, and attempted identification of any unintended harms and unintended bias.
Manufacturers should put in place internal procedures for monitoring scientific developments and changes in medical practice relevant to the AI/ML-enabled medical devices they have developed, so as to provide assurance of safety and performance.
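One hedged sketch of such long-term monitoring is below: rolling real-world performance is compared against the level established at approval, and a significant decline is flagged for investigation. The accuracy metric, window size and tolerance are illustrative assumptions; a real system would monitor the metrics relevant to the device's intended purpose:

```python
# Sketch: flag a significant decline in real-world performance against the
# baseline established at approval (4.1 and 4.2). All thresholds below are
# illustrative assumptions.
from collections import deque

class PerformanceMonitor:
    def __init__(self, baseline_accuracy: float, tolerance: float = 0.05,
                 window_size: int = 500):
        self.baseline = baseline_accuracy        # performance claimed at approval
        self.tolerance = tolerance               # maximum acceptable absolute decline
        self.window = deque(maxlen=window_size)  # most recent labelled cases

    def record(self, prediction, outcome) -> None:
        """Add one case once its ground-truth outcome becomes available."""
        self.window.append(prediction == outcome)

    def degraded(self) -> bool:
        """True if rolling accuracy has fallen more than `tolerance` below baseline."""
        if len(self.window) < self.window.maxlen:
            return False  # wait for a full window of evidence before flagging
        rolling = sum(self.window) / len(self.window)
        return (self.baseline - rolling) > self.tolerance
```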
5. Evaluating ethical aspects of AI/ML-enabled medical devices, including inclusion and fairness
Manufacturers and users of AI/ML-enabled medical devices should consider ethical implications from design to deployment, including:
5.1. Design: is the technology designed to be inclusive and support deployment across a wide range of population groups?
5.2. Data: is the technology trained and tested on a sufficiently diverse population that provides assurance that its use would be inclusive and equitable? The training and testing data sets should be transparently and accurately documented, including demographics of those data sets and the performance of the technology against those demographics.[footnote 10]
5.3. Algorithm: to what extent is the technology explainable and intelligible? Answering this question is complex, as there are several dimensions to the interpretability of algorithms[footnote 11], and good methodologies for assessing interpretability remain scarce.[footnote 12]
5.4. Discrimination: is there a performance drop across a given subpopulation that leads to discrimination? Or does discrimination arise even though performance is maintained across that given subpopulation group?[footnote 13] (A sketch of a simple subgroup performance check follows item 5.5.)
5.5. Wider impact: does deployment into health systems and adoption for routine use lead to meaningful improvements in patient health outcomes in the long term, and avoidance of widening health disparities or other unintended adverse consequences, such as directing low-income patients to AI/ML-enabled medical devices and preventing them from accessing medical staff? Does the deployment of AI/ML-enabled medical devices present any challenges to clinical workflow or challenges regarding adoption due to varying levels of digital literacy of the workforce?
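As a hedged illustration of the subgroup checks implied by 5.2 and 5.4, the sketch below computes a standard metric per demographic subgroup and flags the largest gap. The column names and threshold are assumptions; and, as 5.4 notes, parity on such a metric does not by itself rule out discrimination:

```python
# Sketch: per-subgroup sensitivity and the largest gap between subgroups.
# Column names ('subgroup', 'y_true', 'y_pred') and the 0.05 threshold are
# illustrative assumptions, not a prescribed fairness test.
import pandas as pd
from sklearn.metrics import recall_score

def subgroup_sensitivity_gap(df: pd.DataFrame, threshold: float = 0.05) -> dict:
    """`df` holds one row per case with binary 'y_true'/'y_pred' labels."""
    per_group = {
        name: recall_score(group["y_true"], group["y_pred"])
        for name, group in df.groupby("subgroup")
    }
    gap = max(per_group.values()) - min(per_group.values())
    return {"per_group": per_group, "gap": gap, "flag_for_review": gap > threshold}
```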
Conclusion
We have brought together our collective strengths as the G7 to generate some level of harmonisation of principles for the clinical evaluation of AI/ML-enabled medical devices, and to support responsible and trustworthy innovation.
This work will also support the efficient and safe translation of the opportunity presented by AI/ML-enabled medical devices into real, lasting benefit for patients across the world.
We are committed to continuing to work together and with other international initiatives, and to using this work as a stepping stone for potential further work on international harmonisation of standards.
Footnotes

1. IMDRF Artificial Intelligence Medical Devices Working Group. 'Machine Learning-enabled Medical Devices – A subset of Artificial Intelligence-enabled Medical Devices: Key Terms and Definitions.' IMDRF/AIMD WG (PD1)/N67. 29 September 2021:4.
2. ISO. 'ISO/IEC JTC 1/SC 42 - Artificial Intelligence' (viewed on 23 December 2021).
3. WHO. 'Ethics and governance of artificial intelligence for health: WHO guidance.' Geneva, 2021. Licence: CC BY-NC-SA 3.0 IGO.
4. We recognise that this does not apply to Class I medical devices under EU and UK law.
5. There are already regulations in place in several G7 member countries that cover this (for example, intended use under the Medical Devices Regulation).
6. International standards already exist in this space.
7. This is already covered in some countries by the Medical Devices Regulation.
8. There are currently no truly continuously-learning devices on the market. Models are currently 'batch trained' and updated.
9. There are already international standards in existence that can help deal with the iterative nature of software, such as ISO's 'ISO/IEC/IEEE 12207:2017 Systems and software engineering — Software life cycle processes' (viewed on 23 December 2021) or the US Food and Drug Administration's total product life cycle regulatory approach for AI/ML-based SaMD.
10. There is growing literature on this topic, including: Gebru T, Morgenstern J, Vecchione B and others. 'Datasheets for Datasets.' ArXiv 2021: 1803.09010; Sendak MP, Gao M, Brajer N and others. 'Presenting machine learning model information to clinical end users with model facts labels.' npj Digital Medicine 2020: volume 3, issue 41; and Mitchell M, Wu S, Zaldivar A and others. 'Model Cards for Model Reporting.' Proceedings of the Conference on Fairness, Accountability, and Transparency 2019. We also recognise that there are initiatives within some of the G7 jurisdictions that aim to enable researchers, innovators, policy makers and regulators to make the most of the available health data while ensuring trust and security. These include the development of frameworks on the quality of health data and associated labels.
11. Lipton ZC. 'In Machine Learning, the Concept of Interpretability Is Both Important and Slippery.' Machine Learning: volume 28; Guidotti R and others. 'A Survey of Methods for Explaining Black Box Models.' ACM Computing Surveys 2019: volume 51, no. 5.
12. Doshi-Velez F and Kim B. 'Towards A Rigorous Science of Interpretable Machine Learning.' ArXiv 2017: 1702.08608.
13. Existing international standards could be used to cover this, such as IEC 62304 in conjunction with ISO 13485, or IEC 82304 for health software.