Component reliability in GCSE and GCE: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. Overview
Overview by: Mike Cresswell, AlphaPlus Consultancy Ltd.
This report exemplifies the use of Generalisability Theory (G-theory) to study the reliability of examinations using operational examination data. The study analysed operational data for 20 examination components from most of the GCSE and GCE awarding bodies to explore the sorts of issues and findings which arise when generalisability analysis is applied to operational data.
2. Purpose
The purpose of the project was to exemplify the use of Generalisability Theory (G-theory) to study the reliability of examinations using operational examination data. Analyses based upon G-theory make it possible to evaluate the contribution of various known factors to the variation in a measurement. In this case, the focus was on the impact of test-related and marker-related factors on reliability.
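For readers unfamiliar with the approach, the simplest G-theory model treats each candidate’s mark on each question as the sum of independent effects. The notation below is standard textbook G-theory (for example, Brennan’s), not an equation reproduced from the report:

```latex
% Persons (p) crossed with questions (i): each observed mark decomposes as
X_{pi} = \mu + \nu_p + \nu_i + \nu_{pi,e}
% with corresponding variance components
\sigma^2(X_{pi}) = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e}
% Relative-measurement (rank-order) reliability for an n_i-question paper:
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i}
```

Estimating the variance components from operational marks, and then varying n_i in the final expression, is what allows the ‘what if’ projections discussed below.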
3. Method
The study analysed operational data for 20 examination components, drawn from most of the GCSE and GCE awarding bodies and covering 11 different subjects ranging from mathematics to music, to explore the sorts of issues and findings which arise when generalisability analysis is applied to operational data. The data analysed consisted of candidates’ marks on each question in the component concerned, where each candidate’s work was marked only once, as is normal practice for GCSE and GCE. As a result, no analyses of marker effects were possible for these components, and the work therefore focused on estimating the effects of question sampling and component design on reliability. One additional dataset was available from a research study into examiner standardisation in one of the awarding bodies; in this case, it was possible to explore the impact of both question sampling and marker variation.
4. Findings
Many examples from the study illustrate how G-theory can illuminate the effects of questions and question paper design on reliability. Three main influences on reliability were identified:
- the number of questions in the test
- the extent of inconsistency in candidates’ responses to different questions
- the shape of the overall component mark distribution
These are the factors which would be expected to emerge in any analysis of the influences of question sampling and examination component design on reliability. The major benefit of a G-theory approach is that it makes it easy to go further and estimate the impact on reliability of design variations, such as the use of different numbers of questions.
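A minimal sketch of such a ‘what if’ (decision-study) projection, assuming a simple persons-by-questions crossed design with equal mark tariffs; the function names are illustrative, and the ANOVA-based estimators are standard G-theory rather than code from the report:

```python
import numpy as np

def g_study(marks):
    """Estimate variance components for a persons x questions (p x i)
    crossed design from the classical ANOVA expected mean squares.
    `marks` is an (n_persons, n_questions) array of question marks."""
    n_p, n_i = marks.shape
    grand = marks.mean()
    person_means = marks.mean(axis=1)
    question_means = marks.mean(axis=0)
    ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * ((question_means - grand) ** 2).sum() / (n_i - 1)
    resid = marks - person_means[:, None] - question_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))
    var_res = ms_res                              # sigma^2_pi,e
    var_p = max((ms_p - ms_res) / n_i, 0.0)       # sigma^2_p
    var_i = max((ms_i - ms_res) / n_p, 0.0)       # sigma^2_i
    return var_p, var_i, var_res

def d_study(var_p, var_i, var_res, n_questions):
    """Project relative (E rho^2) and absolute (phi) reliability
    coefficients for a hypothetical paper of `n_questions` questions."""
    rel_err = var_res / n_questions
    abs_err = var_i / n_questions + var_res / n_questions
    return var_p / (var_p + rel_err), var_p / (var_p + abs_err)
```

Calling `d_study` with `n_questions` doubled shows directly what a longer paper would buy in reliability terms, without collecting any new data.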
The exceptional analysis, where it was possible to look at marker variation as well as the effect of question sampling, demonstrates this strength of G-theory when more data are available and a more complex analytical design is therefore possible. In particular, the potential reliability improvement from using multiple marking can be estimated and set alongside the improvements which would come from using more questions. In the case analysed, doubling the number of questions from three to six would produce an improvement in reliability similar to, but slightly better than, that achieved from the introduction of double marking. Clearly, information like this has great potential use during the design and development of new examinations.
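Under the fully crossed persons x questions x markers design that this dataset permitted, the same projection logic extends to markers. A brief sketch with hypothetical variance components (the figures are invented to mirror the pattern described above, not taken from the report):

```python
def projected_reliability(var_p, var_pq, var_pm, var_res, n_q, n_m):
    """Relative-measurement coefficient for a crossed persons (p) x
    questions (q) x markers (m) design, projected to a paper of n_q
    questions each marked n_m times."""
    rel_err = var_pq / n_q + var_pm / n_m + var_res / (n_q * n_m)
    return var_p / (var_p + rel_err)

# Hypothetical components: person 25, person-question 9,
# person-marker 2, residual 6.
print(round(projected_reliability(25, 9, 2, 6, n_q=3, n_m=1), 3))  # 0.781 baseline
print(round(projected_reliability(25, 9, 2, 6, n_q=6, n_m=1), 3))  # 0.847 double questions
print(round(projected_reliability(25, 9, 2, 6, n_q=3, n_m=2), 3))  # 0.833 double marking
```

With these illustrative components, doubling the questions beats double marking slightly, echoing the pattern reported for the case analysed.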
G-theory also allows the estimation of reliability coefficients which reflect the use of fixed cut scores to partition the mark scale into grades – ‘absolute candidate measurement’. This would be the case if, for example, parallel forms were used with the same pre-determined cut scores. In this case, the absolute level of candidates’ marks must be reliable as well as their rank order, and generalisability analysis allows this version of reliability to be easily estimated. These absolute measurement reliability coefficients are generally a little lower than the more conventional relative measurement ones, because they impose more severe demands on the consistency of marks from the test, although the extent of the difference varies between components.
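In standard notation, the absolute coefficient simply adds the question main-effect variance to the error term, which is why it can never exceed the relative coefficient (again, textbook G-theory rather than an equation from the report):

```latex
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_i/n_i + \sigma^2_{pi,e}/n_i}
\;\le\;
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i},
\qquad \text{with equality only when } \sigma^2_i = 0.
```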
5. Conclusions
Some designs of GCSE and GCE examination components are not amenable to effective internal analysis – most obviously, those which consist of a single task such as an essay or project. This limitation is not one which affects G-theory alone: it is clearly not possible to analyse the effect of examination component structure when any such structure is masked by the use of only a single mark to encode each candidate’s performance. A less restrictive limitation concerns components whose questions carry significantly different mark allocations. Generalisability analysis remains possible in such cases, but variations in design, such as changing the number of questions, cannot be explored meaningfully unless there are at least two questions with the same mark tariff, in which case multivariate generalisability analysis can be carried out.
When G-theory is used to investigate the effects of question sampling and component structure, the results will be affected by the dimensionality of the component. In the simplest cases considered in the report, that is, designs involving only questions as a source of measurement error, G-theory produces a reliability statistic identical to Cronbach’s alpha (the sketch below verifies this identity numerically). Where examination components, such as sectioned tests, are not specifically designed to measure a single dimension of attainment, but rather to sample a number of potential dimensions, the test data can be analysed using multivariate G-theory, as illustrated in Chapter 5 of the report.

In current GCSE and GCE practice, cut scores are set afresh for each year’s examination. The highest and lowest cut scores are fixed by a process which takes account of the distribution of candidates’ outcomes, and the intermediate ones move accordingly because they are defined in terms of mark intervals between the highest and lowest. Since this is neither pure norm referencing nor pure criterion referencing, it is not obvious whether reliability measures based upon absolute or relative measurement are more appropriate.
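On the alpha equivalence noted above, the identity is easy to check numerically; this is a standard result (Hoyt’s ANOVA formulation of alpha), not a calculation from the report, and the simulated marks are purely illustrative:

```python
import numpy as np

def cronbach_alpha(marks):
    """Cronbach's alpha from a persons x questions mark matrix."""
    n_i = marks.shape[1]
    item_var_sum = marks.var(axis=0, ddof=1).sum()
    total_var = marks.sum(axis=1).var(ddof=1)
    return (n_i / (n_i - 1)) * (1.0 - item_var_sum / total_var)

def relative_g_coefficient(marks):
    """E rho^2 for the same p x i design via Hoyt's identity:
    E rho^2 = (MS_persons - MS_residual) / MS_persons."""
    n_p, n_i = marks.shape
    grand = marks.mean()
    ms_p = n_i * ((marks.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
    resid = (marks - marks.mean(axis=1, keepdims=True)
             - marks.mean(axis=0, keepdims=True) + grand)
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))
    return (ms_p - ms_res) / ms_p

rng = np.random.default_rng(0)
ability = rng.normal(50, 10, size=(500, 1))         # true person effects
marks = ability + rng.normal(0, 5, size=(500, 8))   # question-level noise
print(cronbach_alpha(marks))            # the two values agree to
print(relative_g_coefficient(marks))    # floating-point precision
```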
A general finding from the analysis is that margins of error for relative measurement tended to have values in the range 5% to 18% of the underlying mark scale, and most typically 15% to 17%. This means that, under Normal distribution assumptions, 95% confidence intervals around candidates’ total paper marks will be of the order of 10% to 40% of the mark scale, clearly spanning several grade boundaries in some cases.
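To make the arithmetic concrete, take a hypothetical 100-mark paper with a margin of error of 15 marks (the ‘most typical’ region above), treating the margin as the 95% half-width under Normal assumptions; the candidate’s mark is invented for illustration:

```python
# Hypothetical 100-mark paper; all figures illustrative, not report data.
mark_scale = 100
margin = 15                 # margin of error: 95% half-width, in marks
observed = 58               # an illustrative candidate total
low, high = observed - margin, observed + margin
width_pct = 2 * margin / mark_scale
print(f"95% interval: {low} to {high} marks ({width_pct:.0%} of the scale)")
# -> 95% interval: 43 to 73 marks (30% of the scale)
```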
The benefits of generalisability analysis for assessing reliability are greatest during examination design and development, especially if trials are designed to produce data that make it possible to investigate different factors affecting reliability. Generalisability analysis could also be useful in monitoring and studying reliability in existing examinations if monitoring studies were designed to add to operational data by collecting collateral information about candidates, schools and colleges and, where appropriate, by involving multiple marking.
The report recommends that G-studies, and follow-on ‘what if’ analyses, should be set up and analysed as part of an ongoing research and development programme, in which marker effects and paper structures are investigated simultaneously, along with gender and other candidate-related effects.