Classification accuracy and consistency in GCSE and A Level examinations offered by the Assessment and Qualifications Alliance (AQA) November 2008 to June 2009: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. Overview
Overview by: Mike Cresswell, AlphaPlus Consultancy Ltd.
The report describes work to investigate the levels of classification accuracy and classification consistency measured for a range of individual components of GCSE and A level examinations. The focus of the work was not on marking reliability, so the components analysed all involved objective, short-answer or structured-response questions where it was judged reasonable to assume perfectly reliable marking.
2. Purpose
The purpose of the work was to investigate the levels of classification accuracy and classification consistency measured for a range of individual components of GCSE and A level examinations. Classification accuracy is the extent to which reported grades agree with the grade which truly reflects each candidate’s attainment. Classification consistency is the extent to which two equivalent assessments of the same candidates produce the same reported grades.
Since GCSE and A level examinations are high-stakes assessments and candidates’ grades are used in important selection decisions, their classification accuracy and consistency can be seen as important indicators of assessment quality. These statistics have a more direct interpretation than conventional reliability measures but are not without their own issues and assumptions. The motivation for the work was to inform discussion about whether classification measures might be used in routine reporting of the reliability of GCSE and A level examinations.
3. Method
The focus of the work was not on marking reliability, so the components analysed all involved objective, short-answer or structured-response questions where it was judged reasonable to assume perfectly reliable marking. Two approaches to measuring classification accuracy and classification consistency were used:
- Item Response Theory (IRT) modelling, specifically Rasch modelling.
- An approach due to Livingston and Lewis.
The technical details of these two approaches differ considerably, as do the underlying technical assumptions. For example, the Rasch model makes a strong assumption that all the questions in a test assess the same variable, representing the extent of the candidates’ attainment (knowledge, skills and understanding), whereas the Livingston and Lewis approach makes no such assumption. Different tests might therefore fit one approach better than the other, as may the answer patterns of different groups of candidates.
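To make the Rasch assumption concrete: the dichotomous Rasch model expresses the probability of a correct answer in terms of a single ability parameter per candidate and a single difficulty parameter per question, both on the same latent scale. The sketch below is a minimal illustration of that idea only; the parameter values are hypothetical and are not taken from the report.

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: the probability of a correct response depends
    only on the difference between the candidate's ability and the item's
    difficulty, both measured on the same single latent scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative values only: a candidate whose ability sits half a logit above
# the item's difficulty answers it correctly with probability of about 0.62.
print(rasch_p_correct(ability=0.5, difficulty=0.0))
```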
However, in both cases the aim of the modelling is the same. Because no measurement can be perfect, candidates will not, in general, always get the mark which perfectly matches their true attainment: some will score above and some below their true mark. Although they are most likely to score their true mark, there is a finite chance that they will instead score any other possible mark on the test. That chance falls rapidly for marks further and further from the candidate’s true mark, but it is never quite zero. The models used in this study make it possible to evaluate, for each candidate, the probability that they will get each possible mark on the test, given their estimated true mark. From this, it is possible to calculate the probability that they will get each possible grade, because a grade is simply a range of marks.
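As a rough sketch of that last step: collapsing a candidate’s probability distribution over marks into grade probabilities only requires summing the probabilities of the marks within each grade’s boundaries. The mark scale and grade boundaries below are hypothetical, chosen purely for illustration.

```python
import numpy as np

# Hypothetical grade boundaries for a 60-mark component (illustrative only):
# marks 50-60 earn grade A, 40-49 grade B, 30-39 grade C, 0-29 grade D.
GRADE_BANDS = {"A": (50, 60), "B": (40, 49), "C": (30, 39), "D": (0, 29)}

def grade_probabilities(mark_probs: np.ndarray) -> dict:
    """Collapse a candidate's probability distribution over marks
    (index = mark, values summing to 1) into a probability for each grade,
    since a grade is simply a contiguous range of marks."""
    return {grade: float(mark_probs[lo:hi + 1].sum())
            for grade, (lo, hi) in GRADE_BANDS.items()}
```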
Once those calculations are complete, the results can be averaged over all the candidates to generate statistics showing how likely it is in general that candidates will actually get the grade corresponding to their true mark (classification accuracy) or get the same grade (whether or not it is their true grade) from two different tests (classification consistency).
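The aggregation step can be sketched in the same spirit. The code below assumes per-candidate grade probabilities are already available (for example from a function like grade_probabilities above) along with the grade implied by each candidate’s estimated true mark; it illustrates the averaging only and is not the report’s actual computation.

```python
import numpy as np

def classification_accuracy(grade_probs: list, true_grades: list) -> float:
    """Average, over candidates, of the probability that the awarded grade
    equals the grade implied by the candidate's true mark."""
    return float(np.mean([probs[true] for probs, true in zip(grade_probs, true_grades)]))

def classification_consistency(grade_probs: list) -> float:
    """Average, over candidates, of the probability that two independent
    sittings of the test would place the candidate in the same grade
    (the sum over grades of the squared grade probability, assuming the
    two sittings are independent)."""
    return float(np.mean([sum(p * p for p in probs.values()) for probs in grade_probs]))
```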
4. Findings
Across a range of GCSE and A level examination components, classification accuracy statistics between 45% and 80% were obtained. The values based upon the Livingston and Lewis model were slightly lower than those based upon the IRT (Rasch) model. The differences between the models are not, however, large enough to have any substantive importance.
Classification consistency statistics were inevitably slightly smaller, since they represent the agreement between two fallible measures (two tests), rather than the agreement between one fallible measure and the true result.
The reasons for the variations in classification quality between different examination components include:
- the average number of marks in each grade (wider grades produce better classification quality)
- the location of the set of grade boundaries on the mark distribution (the mark distribution for an assessment shows how many candidates achieved each mark): grades which span the central part of the mark distribution produce better classification quality
- the degree to which the components contained questions which were either rather easy or rather difficult for the candidates taking them, and how easy or difficult those questions were (more questions with difficulty well matched to the candidates’ abilities produce better classification quality)
- the length of the components (longer tests produce better classification quality)
Because the classification quality statistics are based upon the analysis of candidates’ scores, they are as strongly influenced by the candidates as by the quality of the examination. This was illustrated within the study when the quality measures for one component rose sharply because a much more able group of candidates took it on the second occasion.

As well as computing the classification quality measures, the study also looked at the probability that candidates were awarded the grade corresponding to their true score or one adjacent to it. Since no measurement can be perfect, no examination result can ever claim to be accurate to better than plus or minus one grade. For the GCSE and A level examination components analysed, at least 89% of all candidates awarded a particular grade have true marks within that grade or one immediately adjacent to it. For some components the figure is much higher, up to 100%.
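A corresponding “true grade or adjacent grade” figure can be read off the same per-candidate grade probabilities. The sketch below is illustrative only; it assumes grades are supplied in order from lowest to highest and reuses the hypothetical structures from the earlier sketches.

```python
import numpy as np

def within_one_grade(grade_probs: list, true_grades: list, grade_order: list) -> float:
    """Average probability that a candidate is awarded either the grade
    corresponding to their true mark or one immediately adjacent to it."""
    position = {grade: i for i, grade in enumerate(grade_order)}
    return float(np.mean([
        sum(p for grade, p in probs.items() if abs(position[grade] - position[true]) <= 1)
        for probs, true in zip(grade_probs, true_grades)
    ]))
```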
5. Conclusions
If meaningful comparisons are to be made between classification indices for examinations set at different times and by different awarding organisations (or which are different in other ways), then there must be consistency of choice of model and index. However, the differences between the different models and measures are not so great that substantive differences of interpretation will arise from different choices. The choice is not crucial, therefore, as long as the same measures are always used.
Classification indices must be interpreted in the context of each assessment. The figures are not meaningful in themselves. For example, there may be good reasons (perhaps relating to the validity of the assessment) why examination components are not designed simply to maximise classification quality.
Classification indices may set up false expectations of what can be achieved. Although classification accuracy indices between 45% and 85% seem low, in practice this equates to between 89% and 100% of candidates with true marks in the grade they are awarded, or an adjacent grade. Since no examination result can be accurate to better than ± one grade, these are broadly acceptable figures.
Until those directly involved in the qualifications system have gained experience and understanding of the classification indices – how they should be interpreted and what influences their values – it would be unwise to publish them routinely.