Estimates of reliability of qualifications: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. The reliability of GCSEs and A levels
The fact that GCSEs and A levels are not awarded on the basis of a single examination but are made up of a number of different assessments (termed ‘units’ or ‘components’) introduces a raft of practical considerations into the calculation of assessment reliability. If we then try to compare reliability across a range of GCSEs and A levels, as this report does, then the choice and application of suitable techniques is even more problematic, and the data even more difficult to interpret. To address this last point, the report often takes the view that ‘a picture is worth a thousand words’, and makes liberal use of graphical representations of assessment reliability for different GCSE and A level results. This can be a good way to make complex statistical relationships more accessible to the layperson, and can also be useful when there is no widely accepted statistical measure to quote.
As with some of the other reports, this one starts with an overview of the statistical models of assessment reliability used in the analyses. Modelling, in this sense, means using a mathematical representation (i.e. one or more equations) to describe (or model) assessment reliability. The report also talks frequently about estimating reliability, because assessment reliability is not something we can measure directly, like the length of an object. Instead, we use statistical models to try to represent the underlying processes, and statistical measures derived from the model are considered to be estimates of the real-world phenomena. In this report, the authors base much of their analysis on Cronbach’s alpha, the most widely used estimate of assessment reliability, chosen here for pragmatic reasons: Cronbach’s alpha and the associated Standard Error of Measurement (SEM) are relatively easy to calculate, an important consideration for a report, such as this, which covers a large number of tests.
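Purely as an illustration of what these two statistics involve (the calculation below is an addition to this summary, not material from the report, and the marks are invented), Cronbach’s alpha and the SEM for a small set of item-level marks could be computed as follows:

```python
import numpy as np

# Toy data: marks awarded to 6 candidates on a 4-item test
# (rows = candidates, columns = items). Values are invented.
marks = np.array([
    [2, 3, 1, 4],
    [1, 2, 0, 2],
    [3, 3, 2, 4],
    [0, 1, 1, 1],
    [2, 2, 2, 3],
    [3, 4, 2, 5],
], dtype=float)

k = marks.shape[1]                         # number of items
item_vars = marks.var(axis=0, ddof=1)      # variance of each item
total_var = marks.sum(axis=1).var(ddof=1)  # variance of candidates' total scores

# Cronbach's alpha: internal-consistency estimate of reliability
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Standard Error of Measurement, derived from alpha and the
# standard deviation of the total scores
sem = np.sqrt(total_var) * np.sqrt(1 - alpha)

print(f"Cronbach's alpha: {alpha:.3f}")
print(f"SEM: {sem:.2f} marks")
```

The SEM is expressed in the same units as the test marks, which is why it is often quoted alongside alpha when discussing how far an individual candidate’s observed mark might lie from their ‘true’ mark.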
2. Estimating the reliability of units and components
Section 1 of the report presents an analysis of Cronbach’s alpha and SEM for different units/components of GCSEs and A levels. The real analysis in the report, however, is the consideration given to the various factors that affect interpretation of the raw (statistical) data. For example, the report explores the relationship of Cronbach’s alpha and SEM with test length (i.e. the number of questions or the total number of marks in the test). This examines a basic premise of Classical Test Theory (CTT), that a longer test will in general be more reliable than a shorter one. The report makes good use of graphs here to present the data, which helps to illuminate what might otherwise have been a rather opaque subject. The graphs help to show that Cronbach’s alpha and SEM cannot be properly interpreted across different assessments without considering the number of questions in, and maximum mark of, the test. The authors go on to suggest the use of a different (but related) statistic, the ‘Bandwidth:SEM ratio’. The advantage of this statistic, explained fully in the report, is that it is relatively unaffected by the number of questions or maximum mark of the test.
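The classical result behind the premise that a longer test is generally more reliable is the Spearman-Brown formula, which predicts how reliability changes as a test is lengthened or shortened. The short sketch below is an illustration added for this summary, not a calculation from the report:

```python
# Spearman-Brown prophecy formula: the classical result behind the
# premise that a longer test is, other things being equal, more reliable.
# 'factor' is how many times longer the new test is than the original.
def spearman_brown(reliability: float, factor: float) -> float:
    return factor * reliability / (1 + (factor - 1) * reliability)

# Example: a test with reliability 0.70, doubled and halved in length
print(spearman_brown(0.70, 2.0))   # about 0.82
print(spearman_brown(0.70, 0.5))   # about 0.54
```

Doubling the length of a test with reliability 0.70, for example, is predicted to raise it to about 0.82, while halving it lowers it to about 0.54, other things (such as the quality of the questions) being equal.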
The report goes on to describe Item Response Theory (IRT), a technique for modelling questions and assessments that is very popular, particularly in the USA. IRT models the interaction between the test taker and each individual question (item). IRT does not specifically model test reliability in the same way that CTT does, but from it a statistical measure related to reliability – called the ‘Person Separation Reliability’ index – can be derived. The authors tabulate the Person Separation Reliability index for twelve GCE/GCSE units or components, and compare the values with Cronbach’s alpha calculated on the same tests. The IRT-based index gives slightly higher values for the coefficients of reliability, but the relative ordering of the two sets of results is very similar. The fact that two quite different statistical models produce such similar estimates for assessment reliability gives confidence that these different models of reliability are measuring similar things.
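The Person Separation Reliability index itself is straightforward to calculate once an IRT (or Rasch) analysis has produced an ability estimate and a standard error for each test taker. The sketch below assumes such estimates are already available; the figures, and the calculation itself, are illustrative additions to this summary rather than material from the report:

```python
import numpy as np

# Assume person ability estimates (theta) and their standard errors
# have already been obtained from an IRT/Rasch calibration; the
# numbers below are invented for illustration.
theta = np.array([-1.2, -0.4, 0.1, 0.6, 1.3, 2.0])
se = np.array([0.45, 0.40, 0.38, 0.39, 0.42, 0.50])

observed_var = theta.var(ddof=1)        # variance of the person estimates
error_var = np.mean(se ** 2)            # average measurement-error variance
true_var = observed_var - error_var     # estimated 'true' person variance

# Person Separation Reliability: proportion of observed variance
# that is not attributable to measurement error
psr = true_var / observed_var
print(f"Person Separation Reliability: {psr:.3f}")
```

The index has the same broad interpretation as Cronbach’s alpha: the proportion of the observed variation between candidates that reflects genuine differences in ability rather than measurement error.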
3. Estimating the reliability of GCSEs and A levels
The analysis so far has estimated reliability for units or components only. The units/components analysed were, however, part of a whole GCSE or A level, so the report now moves on to consider the larger and more complex question of assessment reliability for a complete GCSE or A level. Though this is obviously the more important measure for the learner and the general public, the report soon makes it plain that it is very difficult to estimate the reliability of a whole GCSE or A level. For ‘unitised’ qualifications, such as today’s GCSEs, different learners can choose different units within the same qualification, and each of these can be assessed differently (some use coursework, some a written paper, some a practical examination, etc); the maximum marks available in each unit can be different too (e.g. out of 25, or out of 40). There is a detailed explanation of the complex process involved in aggregating the marks from these different units on pages 25 to 27 of the report. This complexity in the assessment structures makes it very unclear how the reliability of a unitised assessment such as a GCSE is to be estimated. The authors of the report do, however, use both CTT and IRT to calculate reliability measures for a small number of GCSEs and A levels. In each case, the reliability of the whole assessment was estimated to be higher than that of the individual units or components.
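The report’s own aggregation and estimation procedures are described on pages 25 to 27 and are not reproduced here. As an indication of the kind of calculation involved, one common CTT approach to a multi-unit composite is ‘stratified alpha’; the sketch below uses invented unit reliabilities and variances purely for illustration:

```python
import numpy as np

# Illustration only: one common CTT approach to a multi-unit composite
# is 'stratified alpha'. The unit reliabilities, unit score variances and
# composite variance below are invented, not taken from the report.
unit_alphas = np.array([0.82, 0.75, 0.88])   # reliability of each unit
unit_vars = np.array([30.0, 18.0, 45.0])     # variance of each unit's scores
composite_var = 160.0                        # variance of aggregated totals
                                             # (> sum of unit variances because
                                             # unit scores are positively correlated)

stratified_alpha = 1 - np.sum(unit_vars * (1 - unit_alphas)) / composite_var
print(f"Stratified alpha for the composite: {stratified_alpha:.3f}")
```

Because the units measure related abilities while their measurement errors are largely independent, the composite estimate typically comes out higher than the estimates for the individual units, which is consistent with the pattern the report finds.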
4. Marker agreement and marker error
Section 2 of the report discusses how the influence of markers affects variability in the marks awarded. This would seem to be the aspect of assessment reliability that is most easily understood by schools and the wider public alike; it is certainly the aspect they are most concerned about. Awarding organisations apparently report that the vast majority of appeals from examinees or schools relate to marking reliability. The public seem willing to accept that examinees might have got a different result if they had done a slightly different test, or in different conditions, or on a different day, putting this down to ‘the luck of the draw’. They are much less likely, however, to accept that their answers were given a lower mark than they deserved.
As a starting point, the report summarises what is actually meant by marker ‘agreement’ and marker ‘error’, and how these can be measured (typically by having a number of scripts marked by different markers). This leads into an overview of the research on marker reliability in GCSEs and A levels – an area where the outcomes are fairly consistent. Marker agreement is higher on assessments containing shorter or more-structured questions than it is on examinations containing essays. In general, the less subjective the mark scheme, the higher the agreement between markers: marker agreement in subjects like mathematics is very high, and predictably lower in more-subjective subjects such as English. An interesting aspect of marker reliability studies is whether the second marker is aware of the marks awarded by the first marker or not. Research typically shows that, where the first marker’s marks are visible to the second marker, agreement between the markers is higher. No student of human nature would be surprised by this result.
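The statistics used in such studies are generally simple ones. The sketch below, using invented marks given by two markers to the same ten scripts, shows the kinds of agreement measure typically reported: the proportion of scripts given identical marks, the average size of the discrepancies, and the correlation between the two sets of marks. It is an illustration added for this summary, not an analysis from the report:

```python
import numpy as np

# Toy illustration of the kind of statistics used in marker-reliability
# studies: two markers' marks on the same ten scripts (invented data).
marker_1 = np.array([14, 20, 9, 17, 12, 25, 18, 7, 22, 15])
marker_2 = np.array([14, 19, 9, 16, 13, 25, 17, 7, 21, 15])

exact_agreement = np.mean(marker_1 == marker_2)        # identical marks
mean_abs_diff = np.mean(np.abs(marker_1 - marker_2))   # average discrepancy
correlation = np.corrcoef(marker_1, marker_2)[0, 1]    # linear agreement

print(f"Exact agreement:          {exact_agreement:.0%}")
print(f"Mean absolute difference: {mean_abs_diff:.1f} marks")
print(f"Correlation:              {correlation:.3f}")
```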
What might not be widely appreciated by the general public is the way in which many examination scripts are marked these days. Until about six years ago, paper scripts were posted out to markers (usually teachers) in batches for marking. Markers would typically mark whole scripts (i.e. all the questions in the paper), and monitoring was carried out by a team leader who would mark a sample of each marker’s work, with procedures in place to deal with any discrepancies that arose. Nowadays much of the marking of scripts is carried out on-screen. Scripts are scanned, markers use their own computer to log in to the awarding organisation’s online system, view candidate answers on-screen, and enter marks directly into the computer. One advantage of this is that the total marks for the assessment can now be calculated automatically by the computer, which removes one source of human error. The method of quality assurance is also different for on-screen marking: markers may now mark individual questions rather than whole scripts, and their marking allocation contains ‘seeded’ questions (questions which have been pre-marked by senior examiners), enabling automatic reports to be generated that detail the agreement of each marker with the ‘definitive’ marks of the senior examiners.
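The sketch below illustrates the seeding idea in its simplest form: each marker’s marks on the seeded questions are compared with the senior examiners’ definitive marks, and any marker whose average discrepancy exceeds a tolerance is flagged for review. The marker names, marks and tolerance are all invented, and real on-screen marking systems are considerably more sophisticated than this:

```python
# Sketch of the seeding idea: compare each marker's marks on pre-marked
# ('seeded') items against the senior examiners' definitive marks and
# flag large discrepancies. All names, data and the tolerance are invented.
definitive = {"seed_01": 6, "seed_02": 3, "seed_03": 9}

marker_returns = {
    "marker_A": {"seed_01": 6, "seed_02": 3, "seed_03": 8},
    "marker_B": {"seed_01": 4, "seed_02": 5, "seed_03": 9},
}

TOLERANCE = 1.0  # maximum acceptable mean absolute deviation, in marks

for marker, marks in marker_returns.items():
    deviations = [abs(marks[s] - definitive[s]) for s in definitive]
    mad = sum(deviations) / len(deviations)
    status = "OK" if mad <= TOLERANCE else "REVIEW"
    print(f"{marker}: mean absolute deviation {mad:.2f} -> {status}")
```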
The report provides details of both paper-based and on-screen marking studies using GCSE and A level data. In general, marker agreement is fairly high: the data would seem to suggest that paper-based marking of public examinations has become more reliable over the years, and marker agreement for on-screen marking, albeit on papers with fewer subjective questions to mark, is impressively high.
5. The reliability of grades
Of course in a GCSE or an A level, the outcome for the examinee is generally a grade, rather than a raw score, and so Section 3 of the report considers grade-related variability. This is obviously a headline issue for the general public. There is an understandable demand that the grade which a learner receives for a GCSE or A level should be exactly the grade they deserve, since this can affect the life chances of the learner. There is also, of course, the concern (perhaps more for universities and employers than for the general public) that these results are repeatable year after year – in other words, that exams are not seen to be getting easier (it never seems to be reported that exams are getting more difficult).
Grade boundaries are important when considering the reliability of grades. Grade boundaries indicate the minimum marks needed to achieve a certain grade. A learner might, for example, in one GCSE examination subject in 2011 have to achieve at least 56% to be awarded a B, and 65% to be awarded an A: so 56% is the grade boundary for a B, and 65% the grade boundary for an A. Awarding organisations manage the grade boundaries – for each subject and each time it is examined – to try to ensure that the grade a learner achieves in a subject represents the same level of achievement as in previous years.
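Turning a candidate’s mark into a grade is then a straightforward lookup against the boundaries, as the sketch below shows using the worked example above (the C boundary, and the code itself, are additions for illustration only):

```python
# The worked example from the text: 56% is the boundary for a B and
# 65% the boundary for an A (the C boundary below is invented).
boundaries = [("A", 65), ("B", 56), ("C", 45)]  # percent thresholds, descending

def grade_for(percent: float) -> str:
    for grade, threshold in boundaries:
        if percent >= threshold:
            return grade
    return "U"  # unclassified, below the lowest boundary shown

print(grade_for(67))  # A
print(grade_for(58))  # B
print(grade_for(40))  # U
```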
So an important issue in terms of grade reliability is the setting of the grade boundaries, since a small change either way (eg plus or minus 1%) could affect the grade awarded to thousands of candidates. As the report makes clear, however, there is no such thing as a ‘correct’ value for a grade boundary. Grade boundaries are set by panels of experts within awarding organisations. Each panel will have knowledge of the subject, curriculum, assessment specification, mark schemes, issues that have arisen during marking, etc. They use their experience and professional judgement in setting the grade boundaries for each assessment (ie GCSE or A level) each year. This is part of the quality assurance process which awarding organisations undertake to try to ensure that candidates emerge with the ‘correct’ grades, and that this happens consistently year after year. This is a difficult area to investigate quantitatively, but the authors of the report examined the impact that slightly different decisions at unit or component level could have on the grade boundaries at the overall assessment level. They conclude that it would be worthwhile to present some indication of how assessment outcomes would change with minor changes to the unit boundaries. This would help put fluctuations in pass rates over time into context.
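To give a concrete sense of that sensitivity, the sketch below simulates an invented distribution of marks for 50,000 candidates and counts how many would gain or lose an A grade if the boundary moved by a single mark. The figures are illustrative only and are not drawn from the report:

```python
import numpy as np

# Sketch of the sensitivity question raised in the report: how many
# candidates change grade if a boundary moves by one mark either way?
# The mark distribution and boundary below are invented.
rng = np.random.default_rng(1)
marks = rng.normal(loc=62, scale=14, size=50_000).round().clip(0, 100)

boundary_a = 65  # current boundary for an A, in marks out of 100

for shift in (-1, 0, +1):
    n_awarded_a = int((marks >= boundary_a + shift).sum())
    print(f"Boundary at {boundary_a + shift}: {n_awarded_a} candidates awarded an A")

# Candidates whose grade flips if the boundary moves up by one mark
affected = int(((marks >= boundary_a) & (marks < boundary_a + 1)).sum())
print(f"Candidates on the boundary mark itself: {affected}")
```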
Unusually perhaps, this report finishes with a summary section, which provides a concise statement of the main results of the report. This is useful in that the reader can begin by reading the summary, and then dip into the main body of the report to get more detail.
*[CTT]: Classical Test Theory
*[IRT]: Item Response Theory