Summary of the final report
Published 16 May 2013
Applies to England, Northern Ireland and Wales
0.1 Introduction
Students in England take many public examinations: tests at age 11 in English, science and mathematics, perhaps 8 or more GCSEs at age 16, 3 or more A levels at age 18, as well as a wide range of vocational qualifications taken by candidates in schools, colleges and at work.
Public examinations have to be fair - it is Ofqual’s job to make sure that candidates get the results they deserve, and that their qualifications are valued and understood in society. Ensuring examination reliability is a key part of this - making sure that candidates obtain a fair result, irrespective of who marks their paper, what types of question are used (for example, multiple-choice or essay questions), which topics are set or chosen to be answered on a particular year’s paper, and when the examination is taken. This consistency of examination results is referred to as reliability, and from 2008 to 2010 Ofqual conducted the reliability of results programme: investigating the ‘repeatability’ of candidates’ results from one test to the next, in national tests, public examinations and other qualifications.
The reliability programme brought together a range of experts in public examinations, from research organisations and exam boards, to undertake a range of research activities in order to better understand examination reliability, and to help Ofqual develop its policy around regulation on reliability. This report provides summaries of each of the 20 major pieces of work undertaken in the project, including:
- studies of examination reliability in national curriculum tests at key stage 2 (tests for 11 year olds in English, science and mathematics), GCSEs and A levels, as well as consideration of the reliability of teacher assessment and workplace qualifications
- investigations into the different statistical methods which can be used to look at examination reliability
- how information about examination reliability is currently provided to the public in the UK and other countries
- investigations of the English public’s perception of unreliability in examinations
- advice to Ofqual on whether and how it should regulate for reliability in examinations
0.2 How is examination reliability measured and reported?
Reliability calculations are made using statistical models developed and refined by statisticians and mathematicians over the course of the last 100 years or so. Modelling, in this sense, means that they use a mathematical representation (i.e. one or more equations) to describe (or model) assessment reliability. In doing so, they make assumptions about the nature of the assessments, the markers and the candidates (some make bigger assumptions than others, and it is almost always impossible to work out whether the assumptions are actually valid in practice).
The mathematical principles and formulae underpinning these approaches are fairly inaccessible to non-mathematicians (some of the introductory documents describe the principles), but the outcomes are generally comprehensible.
Research often considers 2 areas:
- internal reliability - looking at whether the test or tests are internally consistent (an individual candidate tending to do well, or not so well, consistently across questions on the same topic) and whether the test content covers the domain fairly.
- external reliability - looking at factors which would affect a candidate’s performance from one test occasion to the next (eg marker variability, different test papers from year to year).
It is important to note that calculations of reliability are always estimates, and so care has to be taken when interpreting reliability analyses - researchers often use several models to analyse particular test features to increase confidence in the findings.
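As a concrete illustration of the kind of model involved, one widely used internal-consistency statistic is Cronbach’s alpha, which compares how much candidates’ scores vary on individual questions with how much their total scores vary. The sketch below is purely illustrative - the marks are invented and this is not necessarily one of the specific models used in the Ofqual studies.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a candidates-by-questions matrix of marks."""
    k = scores.shape[1]                          # number of questions
    item_vars = scores.var(axis=0, ddof=1)       # variance of each question's marks
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of candidates' total marks
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented marks: 6 candidates x 4 questions, each marked out of 5.
marks = np.array([
    [4, 5, 3, 4],
    [2, 1, 2, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 1],
    [4, 4, 4, 5],
])
print(f"Cronbach's alpha: {cronbach_alpha(marks):.2f}")
```

Values closer to 1 suggest that candidates perform consistently across questions; real analyses use far larger samples and, as noted above, usually several different models.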
The reporting of reliability in assessment varies considerably around the world. Relatively few public examinations have reliability information published alongside results, perhaps because this requires careful handling to avoid public and media misinterpretation. Reliability reporting is most prevalent in the USA where it is provided more commonly for educational examinations than for professional or licensure examinations. Many US state school examinations provide parents with information about their child’s grade (or score band), the score itself and a measure of imprecision in the score, along with guidance on how to interpret this. Contrast this with the UK (for example), where GCSE candidates simply receive grade information.
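The ‘measure of imprecision’ reported in such US score reports is typically something like the standard error of measurement from classical test theory, although formats vary between programmes. A minimal sketch of how such a figure is derived, using an invented score spread and reliability estimate:

```python
import math

def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical SEM: the typical spread of observed scores around a notional 'true' score."""
    return score_sd * math.sqrt(1 - reliability)

# Invented figures: a test with a standard deviation of 12 marks and reliability 0.90.
sem = standard_error_of_measurement(score_sd=12.0, reliability=0.90)
observed = 61
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.1f} marks; approximate 95% band for a score of {observed}: {low:.0f} to {high:.0f}")
```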
1. Research into the reliability of examinations in England
1.1 Key stage 2 national curriculum tests
In each year from 1995 to 2008, 11-year-olds took the national science test (along with national tests in English and mathematics), and each child was given a level for each subject showing their progress against the national curriculum.
For the science tests, as a trial in each year, a small number of children also took the test that was to be used in the subsequent year. This helped the test writers ensure that tests were of comparable difficulty each year, but it also allowed researchers to see whether children would have received the same level on both exams - a key measure of fairness. In 2004, 72% of children received the same level on the live test and the 2005 pre-test. In 2008, the figure had risen to 79% of children receiving the same result on the 2008 live test and the 2009 pre-test.
Why wouldn’t all the children receive the same level from both tests? It is difficult in practice to produce tests which are absolutely ‘equal’ in this respect – writers have to choose topics from the curriculum to test on, and some topics may be harder than others (there are too many topics to cover them all and, if the same topics came up every year, then students and teachers might not cover the topics that don’t). Markers try to mark tests consistently, but longer answers leave room for subjectivity, no matter how tight the mark scheme. Even the day and time of the examination could make a difference to how students perform.
For the 2009 pre-test, the researchers calculated that an estimated 11% of candidates were ‘misclassified’, i.e. given one level higher or lower than their ‘actual’ level. It is important to note here that ‘misclassified’ doesn’t mean that something has gone wrong in the marking process; it just means that the candidate would have got a different result on a test from a different year.
Separate work on the live 2009 and 2010 national tests in English, science, and mathematics used six different statistical methods to calculate classification accuracy. These different methods produced largely consistent results (with the small differences relating to the different assumptions that each of the methods make). Estimates of classification accuracy were around 90% for mathematics, 87% for science, and 85% for English. The classification accuracy for mathematics is higher than for English and science probably because mathematics answers tend to be right or wrong, reducing marker variation. The amount of estimated misclassification has reduced over the time of national testing, reflecting the fact that the assessment process becomes more reliable with experience (as, for example, mark schemes become clearer).
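At their simplest, the agreement figures quoted above are the proportion of children awarded the same level on the two occasions. The sketch below shows that basic calculation with invented levels for 10 pupils; the published studies used much larger samples and the more sophisticated methods mentioned above.

```python
import numpy as np

def classification_agreement(levels_live, levels_pretest) -> float:
    """Proportion of candidates awarded the same level on both tests."""
    live = np.asarray(levels_live)
    pretest = np.asarray(levels_pretest)
    return float((live == pretest).mean())

# Invented levels for 10 pupils on the live test and the following year's pre-test.
live_levels = [4, 5, 3, 4, 4, 5, 3, 4, 5, 4]
pretest_levels = [4, 5, 4, 4, 4, 5, 3, 3, 5, 4]
print(f"Same level on both tests: {classification_agreement(live_levels, pretest_levels):.0%}")
```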
1.2 GCSEs and A levels
GCSE and A level assessments are made up of units and components, with candidates taking several different examinations/assessment activities, from which their scores are aggregated to produce a final grade. Research into a range of qualifications from November 2008 to June 2009 showed that classification accuracy for a range of units (the proportion of the candidature placed at the correct grade) ranged from 50% to 70% (with 90%+ placed in either the correct grade or the adjacent grade above or below), but that classification accuracy for the qualification (i.e. when the various unit results are combined) would be substantially higher. The research also found that for units consisting mostly of short-answer or structured-response items with little room for marker interpretation, test-related unreliability was higher than marker-related unreliability. In other units, with longer answers and more complex mark schemes, marking unreliability (i.e. that one marker might award more or fewer marks than another for a candidate’s response) may be a greater factor.
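The reason accuracy at qualification level is higher than at unit level is that independent errors on separate units partly cancel when results are aggregated. The simulation below is a rough sketch of that effect only - the score distribution, error size, number of units and grade boundaries are all invented, and it is not a model of any actual GCSE or A level.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented set-up: each candidate has a 'true' score, each of 4 units measures it
# with independent error, and grades are set by fixed mark boundaries.
n_candidates, n_units = 100_000, 4
true_score = rng.normal(50, 10, n_candidates)
unit_scores = true_score[:, None] + rng.normal(0, 6, (n_candidates, n_units))

boundaries = np.array([40, 50, 60, 70])          # five grades in total

def grade(scores):
    return np.searchsorted(boundaries, scores)   # 0..4, higher = better grade

true_grade = grade(true_score)
unit_accuracy = (grade(unit_scores[:, 0]) == true_grade).mean()
qual_accuracy = (grade(unit_scores.mean(axis=1)) == true_grade).mean()
print(f"Single unit placed at the 'true' grade: {unit_accuracy:.0%}")
print(f"Aggregated qualification at the 'true' grade: {qual_accuracy:.0%}")
```

With these assumed figures, the aggregated grade matches the ‘true’ grade noticeably more often than any single unit does, mirroring the pattern described above.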
1.3 Workplace-based qualifications
Workplace qualifications work differently from GCSEs and A levels. Candidates tend only to be entered when they are ready, and assessments tend to be based around ‘competency’ – the candidate is expected to complete almost all the assessment activities correctly, with the outcome being a pass or fail (not graded). Workplace assessment tends also to be based on either observing the candidate performing tasks, or looking at evidence of their performance, with no limit on the numbers of attempts, and encompassing a wide variety of settings and performance activities, all of which introduce potential unreliability of assessment.
The research looked at a small number of National Vocational Qualifications (NVQs), gathering additional data (over and above that normally produced for assessment) for the analysis. This showed that assessors have a very high level of agreement about candidates’ performance, but it also highlighted that much more data would need to be collected for vocational qualifications to allow these types of analyses to take place routinely.
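The report does not name the statistics used to summarise assessor agreement, but one common choice for this kind of analysis is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch with invented pass/fail judgements from two assessors:

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b) -> float:
    """Chance-corrected agreement between two assessors' decisions."""
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    observed = (a == b).mean()
    categories = np.union1d(a, b)
    expected = sum((a == c).mean() * (b == c).mean() for c in categories)
    return (observed - expected) / (1 - expected)

# Invented pass/fail judgements by two assessors on the same 12 candidates.
assessor_1 = ["pass", "pass", "fail", "pass", "pass", "pass",
              "fail", "pass", "pass", "fail", "pass", "pass"]
assessor_2 = ["pass", "pass", "fail", "pass", "pass", "fail",
              "fail", "pass", "pass", "fail", "pass", "pass"]
print(f"Cohen's kappa: {cohens_kappa(assessor_1, assessor_2):.2f}")
```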
2. Practicalities
There are unavoidable trade-offs in the design of an examination system. It may not be surprising, for example, that longer tests are shown to provide more reliable results, but awarding organisations have to make practical decisions both about fairness to candidates taking many exams and about the costs of marking and setting long papers. Similarly, the more grades that are available for an exam, the greater the unreliability associated with the grade (because each grade covers a smaller mark range and is therefore more susceptible to misclassification by, for example, marking error).
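The link between test length and reliability is often illustrated with the Spearman-Brown formula from classical test theory, which predicts how reliability changes when a test is lengthened or shortened, assuming the added questions behave like the existing ones. The figures below are invented and purely illustrative:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by the given factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Invented starting reliability of 0.80.
print(f"Doubled test: {spearman_brown(0.80, 2.0):.2f}")   # about 0.89
print(f"Halved test:  {spearman_brown(0.80, 0.5):.2f}")   # about 0.67
```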
Finally, a theme that occurs repeatedly throughout the reliability programme is the balance between reliability and authenticity (often called ‘validity’). For qualifications to be valued, they must test the skills and knowledge they claim to test, and these skills and knowledge must be valued by society. Multiple choice tests can be very ‘reliable’, in that their content can be statistically balanced and marker variation is eliminated, but simply requiring a candidate to pick one of four options is not authentic for many situations - patients don’t turn up at clinics with a definitive list of possible ailments for the doctor to pick from. From a validity perspective, the experts working in the programme place great value on essay and long-answer questions (including, for example, worked problems in mathematics), particularly for assessments of academic subjects like GCSEs, A levels and national curriculum tests (the main subject of the Ofqual reliability programme), even though they present challenges for reliability.
3. Public perceptions of unreliability in examinations
The reliability programme looked at the public perception of unreliability in examinations, talking to teachers, parents and students through a series of workshops and surveys. It is clear that the public has a degree of understanding of unreliability, distinguishing for example between the various factors that can introduce measurement error into examinations. It is also clear that, although there is tolerance for inevitable variability in the process (for example, topic sampling and a degree of subjectivity in marking), there is little tolerance for ‘preventable errors’ such as markers not following the mark scheme, or adding up the marks wrongly, especially where these errors result in a student getting the wrong grade (not just the wrong score). Students and parents show a high degree of trust in the system - teachers less so, particularly where their involvement in examination appeals has shown them where assessment errors can occur.
Throughout the discussions, technical terminology presents problems - ‘measurement error’ in common meaning suggests a mistake has been made in the examination process, whereas the statistical meaning points to the difference between the observed and notional ‘true’ scores. Similarly, in common meaning ‘reliability’ is perceived as an absolute (a test is either reliable or not), not a sliding scale of confidence as it is in statistics.
It is clear that providing information to the public about assessment reliability in an effective way is difficult. The concepts are hard to explain well, and unreliability can seem like an intrinsically bad news story with plenty of opportunity for misinterpretation. If assessment reliability information is to be published, it needs to be accompanied by resources to help the public understand what the information means.
4. Supporting the development of Ofqual’s policy on reliability
The reliability programme was created in recognition of the fact that there had been little sustained and systematic evaluation of the reliability of results from England’s assessment systems, and little understanding of the public’s knowledge of and attitudes towards unreliability in these results. The programme’s technical advisory group and policy advisory group made a number of recommendations which will be used as a basis to develop Ofqual’s policy on reliability:
- Ofqual should outline the primary purpose of each qualification and regulate against that purpose.
- Awarding organisations should publish their standard setting practices in order to make the regulation of reliability in standard setting more transparent.
- Awarding organisations should report on the reliability of assessments, using different measures for different types of assessments, but with consistency of approach between awarding organisations.
The programme has also made recommendations for further work:
- Assessment reliability information should be available in the public domain, provided both by awarding organisations as a routine part of their assessment monitoring and from investigative research by Ofqual. However, this information needs to be accompanied by public education activity to help with understanding the difficult concepts, and there needs to be capability within Ofqual to manage media coverage of the topic.
- Reliability information should be focused at the level that has impact for the public: reliability around qualification grades, for example, is more important than reliability around unit results or assessment scores.
- Ofqual should lead work to look at reliability measurement in non-examination assessment methods such as teacher assessment and workplace observation.