NICE: Priority Screening Algorithm
Takes a set of titles and abstracts of research papers manually screened by the user and learns from those decisions to rank the unscreened papers, assisting staff in identifying the more relevant papers faster.
Tier 1 Information
Name
Priority Screening Algorithm
Description
This algorithm takes a set of titles and abstracts of research papers manually screened by the user and learns from those decisions to rank the unscreened papers in order of relevance. This tool assists staff in identifying the more relevant papers faster.
Website URL
Page 5 onwards of the following document provides a further description of how the tool works: https://eppi.ioe.ac.uk/CMS/Portals/35/machine_learning_in_eppi-reviewer_v_7_web_version.pdf
Contact email
DITSL@nice.org.uk
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
National Institute for Health and Care Excellence
1.2 - Team
Professional Team
1.3 - Senior responsible owner
Associate Director
1.4 - External supplier involvement
Yes
1.4.1 - External supplier
EPPI Centre, University College London (UCL)
1.4.2 - Companies House Number
N/A
1.4.3 - External supplier role
The UCL team worked with NICE to implement the screening classifier into EPPI R5 (an application for systematic reviewing). The underpinning algorithms for the screening classifier are managed and maintained by UCL staff.
1.4.4 - Procurement procedure type
The services of UCL were not procured, but provided to NICE without charge in the spirit of collaboration.
Tier 2 - Description and Rationale
2.1 - Detailed description
The way in which priority screening works is through a process known as ‘active learning’. Briefly put, ‘active learning’ is an iterative process whereby the accuracy of the predictions made by the machine is improved through interaction with users (reviewers). When used in a review, active learning involves the reviewer screening a small number of records manually; the machine then ‘learns’ from these decisions and generates a list of records for the reviewer to look at next. This cycle continues, with the number of reviewer decisions growing, until a given stopping criterion is reached and the process ends (e.g. the reviewer has identified all the relevant records they had expected, they have run out of time, or they have screened all the records manually).
The modelling algorithm is logistic regression. The algorithm takes a set of titles and abstracts that are labelled as ‘included’ or ‘excluded’. It vectorizes them into tf-idf ‘bag-of-words’ vectors and builds a logistic regression model. This model is then used to score unlabelled records. It is hosted on the Microsoft Azure platform in a way that is scalable, reducing the risk of overloading primary servers.
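A minimal sketch of one iteration of this cycle, assuming scikit-learn, is shown below. The example records, field layout and parameters are illustrative assumptions rather than the production code: screened titles and abstracts are vectorised into tf-idf vectors, a logistic regression model is fitted to the reviewer's include/exclude decisions, and the unscreened records are scored so they can be re-ranked.

```python
# Illustrative sketch of one priority-screening iteration (not NICE's code).
# Assumes scikit-learn; the example records below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Records already screened by the reviewer (title + abstract, with a decision).
screened_texts = [
    "Statins for primary prevention of cardiovascular disease in adults",
    "A qualitative study of hospital catering preferences",
]
screened_labels = [1, 0]  # 1 = included, 0 = excluded

# Records not yet screened.
unscreened_texts = [
    "Randomised trial of statin therapy in adults with raised cholesterol",
    "Survey of staff car parking at a district general hospital",
]

# Vectorise titles and abstracts into tf-idf 'bag-of-words' vectors.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_screened = vectorizer.fit_transform(screened_texts)
X_unscreened = vectorizer.transform(unscreened_texts)

# Build a logistic regression model from the reviewer's decisions so far.
model = LogisticRegression(max_iter=1000)
model.fit(X_screened, screened_labels)

# Score the unscreened records; higher-scoring records are screened sooner.
scores = model.predict_proba(X_unscreened)[:, 1]
for score, text in sorted(zip(scores, unscreened_texts), reverse=True):
    print(f"{score:.3f}  {text}")
```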
2.2 - Scope
This tool’s purpose is to prioritise the scientific papers that could possibly include the content that users are looking for during the screening phase of a systematic review. The model is actively trained based on the manual screening decisions of the systematic reviewer. It ‘learns’ to apply a label and then labels ‘unseen’ records. With this information it then ranks (and re-ranks) records that are being screened in systematic reviews. Applying this model to the ‘unseen’/unreviewed scientific papers has the effect of bringing the more relevant papers to the top of the list to be screened next.
2.3 - Benefit
As it ‘learns’, the ranking can be used to recommend those records most likely to be relevant for screening next. It thus enables reviewers to locate the most likely relevant records earlier in the screening process than they otherwise would have done.
2.4 - Previous process
Prior to this model being implemented, staff had to go through documents manually in an unsorted order, meaning much time was wasted reading papers that were not relevant.
2.5 - Alternatives considered
Multiple other algorithmic solutions were considered for this task, such as support vector machines, decision trees and BERT models. This specific model was chosen because it generalises well and could be quickly and easily deployed to this task.
Tier 2 - Decision making Process
3.1 - Process integration
The priority classifier is available as a standalone option for sorting studies in EPPI R5. It does not run automatically; using the classifier is a choice made by the reviewer. It ‘learns’ to apply a label and then labels ‘unseen’ records with a priority ordering for reviewing. It can be used in many situations, but is used to rank (and re-rank) records that are being screened for the systematic reviews process.
3.2 - Provided information
It is used to rank (and re-rank) records that are being screened in systematic reviews. This means that, in real time, studies that are more likely to be relevant are pushed to the top of the dataset. However, the model does not make screening decisions and it remains the role of the reviewer to decide whether to retain or exclude a study. Please see Figure 4 on page 8 of the following document for an overview of the model and user interface: https://eppi.ioe.ac.uk/CMS/Portals/35/machine_learning_in_eppi-reviewer_v_7_web_version.pdf
3.3 - Frequency and scale of usage
This tool is generally not used very often in evidence reviews, mainly due to users’ previous experiences of using the tool. Reviewers have highlighted that they do not always enjoy screening a dataset with the relevant studies pushed to the top, which leaves all the irrelevant studies to the end and makes the process feel longer.
3.4 - Human decisions and review
This tool is used to rank (and re-rank) records that have not yet been reviewed and are due to be screened in systematic reviews. However, it remains the role of the reviewer to decide whether to retain or exclude a study.
3.5 - Required training
No specific training is required as the tool is available as an option embedded into EPPI R5 and there are instructions in the user manual.
3.6 - Appeals and review
The decision made by the tool does not directly affect the public. It is used to improve the efficiency of our processes. All approaches to screening are accompanied by a 10% Quality Assurance check, which involves another reviewer blind screening 10% of the studies, after which a comparison is made of the screening decisions. Any discrepancies are discussed to agree whether the study should be included or excluded, with disagreements escalated to a third independent reviewer.
Tier 2 - Tool Specification
4.1.1 - System architecture
A logistic regression algorithm that is hosted on the Azure Machine Learning platform. EPPI R5 accesses it through Azure Data Factory, with data exchange managed through Blob storage.
4.1.2 - Phase
Production
4.1.3 - Maintenance
There is no model as such to maintain
4.1.4 - Models
There are no models retained - the algorithm simply builds a model, applies it, and then the model itself is discarded.
Tier 2 - Model Specification
4.2.1 - Model name
Priority screening
4.2.2 - Model version
v3.00
4.2.3 - Model task
To rank records during eligibility assessment in systematic reviews
4.2.4 - Model input
Titles and abstracts of research documents
4.2.5 - Model output
Scores of relevance for each unread research document
4.2.6 - Model architecture
Logistic regression
4.2.7 - Model performance
https://eppi.ioe.ac.uk/CMS/Portals/35/machine_learning_in_eppi-reviewer_v_7_web_version.pdf
Results have been reported by several teams. Wallace et al. report that their technique might have reduced screening effort by between 40 and 50% in three reviews, and by between 67 and 92% in four examples of review updates. Aaron Cohen and colleagues present similarly promising results, and at least three groups are building systems which use machine learning to facilitate the retrieval of studies in reviews. However, the advisability of truncating screening in specific situations is unknown at present, and is the subject of current empirical work.
Because this is not a single model trained on a representative dataset (a new model is built many times during screening, e.g. every time the user screens 25 more papers), the performance metrics will not correspond directly with those reported in the research papers. The main advantage of the tool is getting to the important papers faster.
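As an illustration of how such effort savings are typically estimated, the sketch below (hypothetical code and data, not NICE's evaluation) replays a completed, ranked screening and asks what proportion of records had to be screened to find 95% of the relevant ones; the remainder is the saved effort.

```python
# Hypothetical sketch of a retrospective screening-effort calculation.
def proportion_screened_for_recall(ranked_labels, target_recall=0.95):
    """ranked_labels: include (1) / exclude (0) decisions in ranked order."""
    total_relevant = sum(ranked_labels)
    found = 0
    for i, label in enumerate(ranked_labels, start=1):
        found += label
        if total_relevant and found / total_relevant >= target_recall:
            return i / len(ranked_labels)
    return 1.0

# Invented ranked dataset: most relevant records are surfaced early.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(proportion_screened_for_recall(ranked))  # 0.35, i.e. ~65% effort saved
```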
4.2.8 - Datasets
The model is trained in real time on the screening decisions made so far and applied to the current screening.
Tier 2 - Data Specification
4.3.1 - Source data name
The model is built in real time while a person is screening; it is therefore trained on the screening decisions made so far and applied to the current screening.
4.3.2 - Data modality
Text
4.3.3 - Data description
Title and Abstract of scientific publications are the ‘predictors’, and the screening decisions made by the reviewer are the ‘target’ variable for the logistic regression algorithm.
4.3.4 - Data quantities
In EPPI-Reviewer, the length of iteration grows as the number of documents screened increases. When relatively few records have been screened, the algorithm re-learns and re-scores the remaining records every 25 items screened. Once hundreds, and then thousands, of records have been screened, the algorithm runs less frequently (every 100, 500 and then 1,000 records) in order to conserve server resources.
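The schedule below is a sketch of that adaptive retraining interval; the exact thresholds are assumptions inferred from the description above rather than EPPI-Reviewer's actual settings.

```python
# Sketch of an adaptive retraining interval (threshold values are assumed).
def retrain_interval(n_screened):
    """How many further records to screen before re-learning and re-scoring."""
    if n_screened < 1_000:
        return 25        # early on: re-rank every 25 items screened
    if n_screened < 5_000:
        return 100
    if n_screened < 20_000:
        return 500
    return 1_000         # very large reviews: re-rank least often

for n in (200, 2_500, 12_000, 40_000):
    print(n, "->", retrain_interval(n))
```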
4.3.5 - Sensitive attributes
None
4.3.9 - Data cleaning
Title and Abstract are converted into vectors using a tf-idf ‘bag-of-words’ approach, a technique that represents text as an unordered collection, or ‘bag’, of words weighted by their frequency.
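As a worked illustration of the weighting (a toy corpus and the simplest tf-idf variant; real implementations differ in smoothing and normalisation):

```python
import math

# Toy corpus of 'title + abstract' strings (invented for illustration).
docs = [
    "statin therapy reduces cholesterol",
    "statin therapy in older adults",
    "hospital catering survey",
]

def tfidf(term, doc, corpus):
    """Term frequency in the document, scaled down for terms common across documents."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# 'statin' appears in 2 of 3 documents, so its weight is modest;
# 'catering' appears in only 1, so it is weighted more heavily there.
print(round(tfidf("statin", docs[0], docs), 3))    # 0.101
print(round(tfidf("catering", docs[2], docs), 3))  # 0.366
```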
4.3.10 - Data sharing agreements
NICE data is sent to servers hosted by the supplier (UCL) to rank the research papers according to relevance. NICE has a data sharing arrangement with UCL. In brief, this agreement attributes the ownership of data (input by NICE staff) to NICE. UCL does not store, copy, disclose or use NICE data except as necessary for delivering the EPPI-Reviewer services. The systems that hold NICE data comply with UCL security policy. In case of a corruption or loss of data, UCL will inform NICE immediately and propose remedial action.
4.3.11 - Data access and storage
Only titles and abstracts of research papers are transferred to UCL. There is no personal data involved. The data is not stored; it is discarded as soon as the model is built.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
There are no formal impact assessments conducted.
5.2 - Risks and mitigations
There is a risk that a potentially relevant study could be missed. However, it should be noted that this could also be the case with manual screening. The NICE manual outlines that there is currently no published guidance on setting thresholds for stopping screening where priority screening has been used. Any methods used should be documented in the review protocol and agreed in advance with the team with responsibility for quality assurance. Any thresholds set should, at minimum, consider the following: • the number of references identified so far through the search, and how this identification rate has changed over the review (for example, how many candidate papers were found in each 1,000 screened) • the overall number of studies expected, which may be based on a previous version of the guideline (if it is an update), published systematic reviews, or the experience of the guideline committee • the ratio of relevant/irrelevant records found at the random sampling stage (if undertaken) before priority screening. The actual thresholds used for each review question should be clearly documented, either in the guideline methods chapter or in the evidence review documents. Mitigations to using this approach include the fact that the evidence synthesis is discussed with committee members who can flag if they think a potentially relevant study has been missed. The evidence synthesis is also consulted on with stakeholders who can also flag potentially relevant evidence. In all cases, any evidence flagged through sources beyond the search and sift stage would be considered against the protocol for inclusion in the review.