Conceptualising and interpreting reliability: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. Introduction
Perhaps because of our familiarity with examinations, we may not realise that it is actually very difficult to assess skills, knowledge and understanding accurately. Why? Well, often we can’t even define precisely what skills and abilities we are trying to assess. To use an example from this report, how can we assess someone’s historical knowledge accurately? Of course, we could ask them a series of questions (dates of battles, names of kings and queens, etc.) which would give us some idea, but what if we asked a different set of questions? Would we get the same result? Intuitively, we know that there would be some variation. Changing the type of question, and asking students to reason about historical events – to explain why certain events played out the way they did – might give us a quite different picture of an individual’s historical knowledge. And what would be the effect of using multiple-choice questions instead of essay questions? Or, if we did use essay questions, what influence would individual markers have on the grade awarded for the assessment? And would the results be exactly the same if we ran the test on a Monday morning instead of a Friday afternoon?
All of these questions touch upon the issue of assessment reliability – how well we can assess what we set out to assess, and whether the result of our assessment would be different if we changed the way we assess or the circumstances of the assessment in some way. Obviously we want to make our public examinations as consistent and reliable as possible, but how do we actually measure the reliability of an assessment? That is the topic this report addresses. Chapter one starts by pointing out that it is not obvious how to measure reliability, nor what conclusions to draw from the measures used. The various techniques currently in use to estimate the reliability of assessments have been developed and refined by psychologists and statisticians over the course of the last hundred years or so, and although they differ, they all attempt to model reliability in some way. Modelling, in this sense, means that they use a statistical representation (i.e. one or more equations) to describe (or model) assessment reliability.
2. Using statistical models to estimate assessment reliability
Chapter two of the report looks in detail at the oldest statistical model, known as ‘classical test theory’ or ‘true score theory’. According to this theory, the mark awarded to a candidate in an examination contains a (hopefully, small) amount of error. That is, if you were a solid A-grade student who in a perfect world would have received 75% in your maths exam, you may in fact have scored a mark somewhere between 73% and 77% on a particular test. The difference between your idealised (or true) score (75%) and the score you were actually given (73% or 77% or …) is the error. Classical test theory represents this model of the assessment process mathematically: each observed score is treated as the sum of a true score and an error term, and further equations have been derived which can be used to estimate the size of that error. In simple terms, assessments where the error is low (across all candidates and sittings of the examination) have high reliability. Using the equations of classical test theory, we can therefore get some measure of the reliability of an assessment.
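As a rough illustration of the true-score model (our sketch, not taken from the report), the short Python simulation below generates candidates whose observed marks are their hypothetical true scores plus random error, and estimates reliability as the share of observed-score variance that comes from the true scores. The score scale and the size of the error are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_candidates = 10_000
true_scores = rng.normal(60, 12, n_candidates)   # hypothetical 'true' percentage marks
errors = rng.normal(0, 4, n_candidates)          # measurement error, assumed to average zero
observed = true_scores + errors                  # classical test theory: observed = true + error

# Reliability is the ratio of true-score variance to observed-score variance.
reliability = true_scores.var() / observed.var()
print(f"simulated reliability ≈ {reliability:.2f}")
```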
Classical test theory provides just one mathematical model that can be used to estimate assessment reliability. There are several others, and there are variations of each, and the report provides a useful historical overview of the development of the main ones. Much of the discussion in the report is, by necessity, of a mathematical nature. Some of the concepts will, however, be familiar, even if the mathematical treatment is not. Any football fan, for example, will know the truism that ‘the best team wins the league’. This phrase implies that, even though the better teams may occasionally draw or even lose matches to lesser teams, over the course of a season the best team will accumulate the most points, the second-best team the next most points, and so on. Football fans implicitly understand that, although the result of individual matches can be unreliable, the more matches a team plays over the course of the season, the more it will find its ‘true’ position in the league, and the more the effect of the unexpected results will be minimised. Unlikely as it may seem, there is in fact a direct analogy here with mathematical modelling of assessment reliability. The games are a form of assessment, and the more assessments that a candidate (team) takes, the more reliable the overall grading will be. Mathematicians might explain this phenomenon in terms of random variables converging to an expected value, or by saying that ‘the expected value of the error of measurement is zero’. But, even though the mathematical language and formulae they use may not be accessible to all, the underlying concepts will be familiar.
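To make the football analogy concrete, a small simulation of our own (not from the report) shows the same effect: because the errors have an expected value of zero, averaging over more independent ‘assessments’ of the same candidate pulls the result ever closer to the true score.

```python
import numpy as np

rng = np.random.default_rng(1)
true_score = 75.0   # the candidate's idealised score (illustrative)

for n_assessments in (1, 5, 20, 100):
    # each assessment adds an error whose expected value is zero
    observed = true_score + rng.normal(0, 5, size=(100_000, n_assessments))
    averages = observed.mean(axis=1)
    # the spread of the averaged result shrinks as more assessments are combined
    print(f"{n_assessments:>3} assessments: average typically within "
          f"±{2 * averages.std():.1f} marks of the true score")
```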
3. Cronbach’s alpha and item response theory (IRT)
From classical test theory, the report moves on to discuss methods developed more recently, including ‘Cronbach’s alpha’, which emerged from classical test theory and is widely used and very familiar to many assessment practitioners. Cronbach’s alpha gives an internal consistency reliability figure for a test by comparing the variances of the marks on its individual questions with the variance of the total test score.
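For readers who want to see the calculation itself, a minimal Python sketch of Cronbach’s alpha is given below; the candidate-by-question marks are invented purely for illustration.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (candidates x questions) score matrix."""
    k = item_scores.shape[1]                                  # number of questions
    item_variances = item_scores.var(axis=0, ddof=1).sum()    # sum of per-question variances
    total_variance = item_scores.sum(axis=1).var(ddof=1)      # variance of total test scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Illustrative data: 5 candidates' marks on 4 questions (entirely made up)
scores = np.array([
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [4, 3, 4, 4],
])
print(f"alpha ≈ {cronbach_alpha(scores):.2f}")
```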
The report also highlights the impact that the advent of cheap and widely available computing power has had on the measurement of assessment reliability. Modern computers make it feasible to apply complex statistical calculations to large datasets, and so have made approaches like item response theory (IRT) possible. IRT is now perhaps the dominant approach to modelling candidates’ interactions with assessment questions and examinations.
IRT takes a quite different approach from classical test theory, in that it attempts to model the interaction between the test taker and each individual question. Widespread as this technique now is (particularly in the USA), the authors of the current report do not believe that it is particularly useful in the measurement of reliability. This is because the fundamental principles of IRT do not address concepts such as sampling error and replicability, which are central to the definition of reliability used in the Ofqual programme.
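To give a flavour of what such a model looks like, here is a simple Rasch-style example of our own (the report does not prescribe this particular form): each candidate has an ability parameter and each question a difficulty parameter, and the model predicts the probability of a correct answer from the difference between the two.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch model: probability that a candidate of the given ability
    answers a question of the given difficulty correctly."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# An able candidate facing an easy question versus a hard one (illustrative values)
print(f"easy question: {rasch_probability(ability=1.0, difficulty=-0.5):.2f}")
print(f"hard question: {rasch_probability(ability=1.0, difficulty=2.0):.2f}")
```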
4. Generalisability theory
Chapter three introduces another modern approach, generalisability theory (G-theory), which was developed from classical test theory but offers more in the way of flexibility and sophistication. The main difference between the two is that in G-theory the measurement error (i.e. the difference between your idealised – or true – score and the score you were actually given) can be broken down into multiple constituent parts, and the contribution of each part to measurement error separately quantified. The chapter illustrates how G-theory provides a conceptual framework for reliability that encompasses and extends all of true score theory.
For example, G-theory allows you to run ‘what-if’ analyses with multiple error sources rather than a single one, to explore ways of reducing the overall error (unreliability) of the assessment. G-theory further extends true score theory by introducing the concept of ‘absolute’ reliability to complement the classical ‘relative’ reliability measures. G-theory can provide numerical measures of reliability in the range 0 to 1, where 1 is perfectly reliable, and values of around 0.8 and above are generally considered good. So, although the mathematical theory surrounding the concept may be inaccessible to non-mathematicians, the outcomes are easily digestible.
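A simplified single-facet (candidates by questions) example, written by us to illustrate the idea rather than taken from the report, shows how variance components feed into the ‘relative’ and ‘absolute’ reliability coefficients, and how a what-if analysis simply re-evaluates them with a different number of questions. The variance components are invented; in practice they would be estimated from score data.

```python
# Illustrative variance components for a candidates-by-questions design (made up)
var_candidates = 25.0   # genuine differences between candidates
var_questions = 4.0     # some questions are harder than others
var_residual = 16.0     # candidate-question interaction and other error

def relative_coefficient(n_questions: int) -> float:
    """'Relative' reliability: only error that changes candidates' rank order counts."""
    return var_candidates / (var_candidates + var_residual / n_questions)

def absolute_coefficient(n_questions: int) -> float:
    """'Absolute' reliability: differences in question difficulty also count as error."""
    error = (var_questions + var_residual) / n_questions
    return var_candidates / (var_candidates + error)

# A what-if analysis: how does lengthening the test change reliability?
for n in (5, 10, 20):
    print(f"{n:>2} questions: relative ≈ {relative_coefficient(n):.2f}, "
          f"absolute ≈ {absolute_coefficient(n):.2f}")
```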
Perhaps the most outwardly obvious aspect of (un)reliability in assessment is the influence of the markers on the marks awarded. For longer text answers and essays in particular, there is often a subjective element to the marking, which may lead to (possibly significant) differences between the marks awarded by different markers. Of course, awarding organisations and Ofqual are acutely aware of the need to manage this potential source of unreliability, and have developed rigorous moderation procedures designed to identify and minimise inconsistencies in marking. This moderation process is often referred to as ‘standardisation’, and the report provides a useful overview of the procedure as well as illustrating the potential of G-theory for the design and analysis of marker reliability studies.
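In the same spirit, a marker-reliability what-if can be sketched by treating markers as the source of error. Again, this is our illustration rather than a study from the report, and the variance components are invented.

```python
# Hypothetical variance components from a candidates-by-markers study (made up)
var_candidates = 30.0   # genuine differences between candidates' work
var_markers = 2.0       # some markers are systematically severe or lenient
var_residual = 10.0     # marker-candidate interaction and other error

def reliability_with_markers(n_markers: int) -> float:
    """Absolute reliability when each script is marked by n_markers markers
    and their marks are averaged."""
    error = (var_markers + var_residual) / n_markers
    return var_candidates / (var_candidates + error)

for n in (1, 2, 3):
    print(f"{n} marker(s): reliability ≈ {reliability_with_markers(n):.2f}")
```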
5. Applying generalisability theory
Chapter four explains how G-theory can be applied to complex real-world scenarios, examining the influence of questions, markers, etc. on assessment reliability, and even examining the interaction between markers, pupils, questions, and other factors. The mathematical analysis may be complex but the report has been written in plain English and these illuminating scenarios are therefore readily accessible. This section also contains explanations and summaries of various international research projects that have applied G-theory to large-scale assessment programmes.
6. Conclusions
After this detailed explanation of the various measures and models of reliability, the final section of the report considers what the availability of these tools means for UK public examinations, beginning with a discussion of the best way to report reliability to the general public. This leads naturally to a discussion of which of the possible mathematical indicators of reliability should be used (the authors arguing for G-theory). The report goes on to recommend that, whether or not the results are reported to the public, awarding organisations should make more use of G-theory reliability measures to quality assure their assessments.