Summary of the technical seminar report
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. The participants
The participants in this seminar were mostly people already familiar with the research evidence on the reliability of educational assessments. Despite that shared knowledge, they clearly held different views about the most acceptable ways of estimating different kinds of unreliability in UK public examinations and national curriculum assessments, about the desirability of communicating such matters to the general public, and about the possible mechanisms for doing so. The very fact that Ofqual had initiated the reliability programme as ‘an investigation into factors that can undermine the ability of examiners and assessors to produce a grade that truly reflects a candidate’s achievement or mastery of a subject’ signalled a move, endorsed by Kathleen Tattersall, the first Chair of Ofqual, towards greater openness about the uncertainties associated with educational assessments.
2. The semantics surrounding reliability
At this seminar there was a great deal of discussion about the semantics associated with this topic, with reliability, error, variability, misclassification and chance all being advocated as words that can help the public to understand the issues under consideration. A number of those present were keen to distinguish between outright mistakes, which can occur during any assessment process, and less tangible sources of variability that also affect results. Examples of the former include marking errors, computational errors when adding up marks, and issuing the wrong grade to a candidate. The more indirect sources of unreliability reflect the fact that candidates perform at different levels on different occasions, and perform differently when confronted by different but apparently equivalent examination or assessment tasks, as the sketch below illustrates.
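To illustrate that second, less tangible kind of unreliability, the following minimal Python sketch simulates how ordinary occasion-to-occasion variation in a candidate's performance can move a result across a grade boundary even when every mark is recorded and added up correctly. The boundary mark, true score and size of the day-to-day variation are illustrative assumptions, not figures from any Ofqual study.

    import random

    random.seed(1)

    BOUNDARY = 60        # assumed grade boundary (marks)
    TRUE_SCORE = 58      # candidate's assumed long-run average performance
    DAY_TO_DAY_SD = 4    # assumed spread of occasion-to-occasion variation
    TRIALS = 100_000

    # Simulate sitting many 'equivalent' papers on different occasions:
    # each observed mark is the true score plus random occasion variation.
    above = sum(
        (TRUE_SCORE + random.gauss(0, DAY_TO_DAY_SD)) >= BOUNDARY
        for _ in range(TRIALS)
    )

    print(f"Chance of reaching the boundary: {above / TRIALS:.1%}")

Under these assumptions a candidate whose typical performance sits just below a boundary clears it on a substantial minority of occasions, with no marking or clerical error involved.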
Some participants at the seminar supported the release of information about the unreliability of UK assessments, but wanted that information broken down by the different sources of unreliability. Others felt that the emphasis on reliability mattered less than an emphasis on validity: reliability concerns the consistency of results, whereas validity concerns whether an assessment measures what it is intended to measure, so a highly reliable test or assessment can still have very low validity. A more pragmatic response, from one of the awarding organisations, pointed to the need to decide how much risk people wanted removed from assessment results, because removing risk generally requires putting more time and resources into the assessment process.
3. The impact of releasing information on reliability
Another theme generating a good deal of discussion at the seminar was the impact on the general public of releasing more information about the unreliability of assessment results. Most participants appeared to support releasing more information, but agreeing exactly how to do so was far more complicated and contentious. If Ofqual were to demand the release of such information on a routine basis, it would need to establish clear guidelines to be followed by all awarding organisations and assessment agencies. The general view was that the public could cope with the release of such information, and that any immediate negative reactions would dissipate over time.
4. Research into reliability
During the seminar a number of references were made to past and current research into the reliability of assessments, and several participants shared detailed findings from studies conducted within specific assessment contexts. Some of these were based on theoretical modelling; others had involved the collection of empirical data, for example by having the same group of students take equivalent versions of the same test, as sketched below.
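As a hedged sketch of that empirical approach, the reliability of two equivalent forms of a test can be estimated as the correlation between the two sets of scores obtained by the same students. The data below are simulated, and the group size, ability spread and noise level are illustrative assumptions only.

    import random
    from statistics import correlation  # Python 3.10+

    random.seed(2)

    # Simulate one group of students sitting two equivalent forms of a test:
    # each observed score = the student's underlying ability + form-specific noise.
    abilities = [random.gauss(50, 10) for _ in range(200)]
    form_a = [a + random.gauss(0, 5) for a in abilities]
    form_b = [a + random.gauss(0, 5) for a in abilities]

    # Parallel-forms reliability: correlation between scores on the two forms.
    print(f"Estimated reliability: {correlation(form_a, form_b):.2f}")

The closer the coefficient is to 1, the more consistently the two supposedly equivalent versions rank the same students.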
5. Summary
The seminar did not seek to make decisions or draw conclusions. Rather, it exposed the ongoing work of the Ofqual reliability programme to a wider audience and, in doing so, allowed a wide variety of insights to inform the future work of the programme.