A focus on teacher assessment reliability in GCSE and GCE: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. Method
The research reviewed material published by Ofqual and the GCE and GCSE awarding organisations in England, Wales and Northern Ireland. Discussions were held with relevant personnel within those awarding organisations, and a review was carried out of published and unpublished research, including reports from previous reviews, on the topic of teacher assessment and, especially, its reliability.
2. Findings
The report describes the development of teacher assessment in national curriculum assessments, the workplace, and GCE and GCSE examinations. A characteristic of these developments, the report finds, is that policy is not concerned with reliability, emphasising validity instead. The report acknowledges the practical and theoretical difficulties of designing reliability studies of teacher assessment and also establishes that the UK examination system has not developed a more modern approach based on generalisability theory analysis. The absence of any policy pressure on the GCE and GCSE examination system to investigate reliability is seen as a major factor in the almost complete absence of empirical research in the UK on the reliability of teacher assessment in the past 20 years. This lack of empirical evidence means that the report is largely theoretical, relying mainly on the application to teacher assessment of known principles of assessment – such as the reliability benefits of standardisation of procedures. In this, it is no different from most of the papers it reviews.
3. The work assessed and the conditions under which it is completed
The report points out that, even under the new GCSE regime (in which teacher-assessed work is done under controlled conditions, rather than as traditional coursework) there is great scope for variation in the nature of the tasks the students tackle and the conditions under which they are performed. The report identifies the following issues, all of which are likely to affect reliability adversely, although their relevance varies according to the details of individual specifications and the degree of control required by the criteria for each subject:
- the breadth of choice given to teachers and students in terms of theme, topic or task, enabling them to select those which suit their learning or interests – essentially biasing the sample of the student’s knowledge, skills and understanding that is provided
- the high likelihood that different, supposedly equivalent, tasks facilitate different levels of performance and therefore produce different teacher assessment outcomes
- teachers’ foreknowledge of the tasks and the potential conflict of interest between their concern about the results of their pupils and their role in preparing them for the tasks
- the difficulty of assessing an individual through a task carried out collectively by a group of students
- the possibility of differing but legitimate interpretations by different teachers of precisely what constitutes controlled conditions
In addition, in GCE examinations, where controlled conditions are not typically in place, there is scope for students to receive significantly differing levels of assistance from peers, parents and others, or to engage in plagiarism. All of this adversely affects the validity of the assessment outcomes and thereby reduces their reliability as measures of the individual’s attainment in the subject.
There is, however, a complete lack of empirical work on the effects of any of these issues on the reliability of teacher assessment. This means that it is impossible to quantify the impact of any of these factors on teacher assessments in practice in GCSE and GCE examinations.
4. Teacher judgements, marking schemes and reliability
The report notes that the judgements required of teachers assessing their students for formal purposes are very complex. There are insurmountable difficulties in evaluating reliability where the evidence which is being assessed is not recorded, as is the case with oral performance. Where the evidence is physical (for example, portfolios, artwork, models of designs, etc.), a variety of types of marking scheme can be used and these are devised to reflect the assessment objectives defined for the teacher-assessed component.
In most cases, physical products are produced for teacher assessment. Multiple marking studies are therefore theoretically possible – although not, by definition, with perfect replication, since only one participant can be any individual student’s own teacher.
The report quotes some reliability measures for the rating of writing by teachers in English national curriculum assessments and also in the Scottish Survey of Achievement. The coefficients range from 0.7 to 0.9, but there is no indication of the types of marking scheme involved or the extent of any standardisation processes used in these studies. It is therefore impossible to interpret the implications of these coefficients for reliability in GCE and GCSE teacher assessment.
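Coefficients of this kind are typically inter-rater correlations between two markers' scores for the same pieces of work. As a minimal sketch only (the marks below are invented for illustration; the underlying data from these studies are not published), a Pearson correlation between a teacher's marks and a second marker's marks might be computed like this:

```python
# Illustrative sketch: an inter-rater reliability coefficient computed
# as the Pearson correlation between two markers' scores for the same
# set of scripts. All marks here are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical marks awarded to eight scripts by a teacher
# and by an independent second marker
teacher = [12, 15, 9, 18, 14, 11, 16, 13]
second  = [13, 14, 10, 17, 15, 10, 17, 12]

r = pearson(teacher, second)
```

A value near 1.0 would indicate close agreement between the two markers; the 0.7 to 0.9 range quoted in the report indicates substantial but imperfect agreement, and without knowing the marking scheme or standardisation procedures behind each coefficient the figures cannot be compared directly.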
Practical difficulties ruled out the possibility of carrying out quantitative studies of reliability in portfolio assessment.
The report quotes evidence of a general lack of internal standardisation within schools and colleges, and a low level of expertise, knowledge and motivation among teachers in carrying it out. It describes approaches taken to tackle these issues in assessment systems within the UK and internationally, among them:
- involvement in assessment criteria development
- use of exemplar or benchmark work and assessments
- training meetings run by awarding organisations
- agreement trials using pre-rated work
However, the extent to which these approaches are used varies and, once again, there is no empirical evidence available to evaluate the effectiveness in practice of these approaches to improving the reliability of teacher assessment outcomes.
5. External moderation and mark adjustment
The report reviews the system all the GCSE and GCE awarding organisations use for sampling teachers’ assessments. There are some differences in practice between them but the sample sizes and criteria for action are the same, as are the possible actions which can be taken for an individual centre (school or college) within any particular subject: acceptance of the teachers’ marks; adjustment of the teachers’ marks; or complete replacement of the teachers’ marks by marks awarded by a moderator.
There is an obvious question about the effectiveness of moderation, and the mark adjustments it involves, in providing reliable marks from teacher assessments. Moreover, for any given subject, the size of the sample of teacher-assessed work taken from each centre during external moderation is small. As always, however, any such concerns remain largely theoretical, because empirical evidence on the reliability of external moderation is sparse. The only study that has been identified indicates that the proportion of candidates whose final grades would be different if a moderator’s mark replaced their adjusted mark can be as high as 40% (in history) and as low as 10% (in psychology). Differences between subjects of this kind raise many questions in themselves, but the relevant study was carried out in 1992 and has not been replicated. Recent Australian work suggests that teacher assessment might be slightly better than test results in terms of classification accuracy.
6. Conclusions
There is an urgent need for empirical work to be done on the reliability of teacher assessment generally, and especially in the context of the high-stakes GCE and GCSE examinations. The report recommends a design involving multiple assessors and using generalisability theory.
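The recommended design can be illustrated with a small sketch. Assuming a fully crossed students-by-raters design (every rater marks every student), a generalisability study estimates how much of the variation in marks is attributable to genuine differences between students rather than to raters. All numbers below are invented for illustration; a real study would use many more students and raters:

```python
# Sketch of a generalisability (G) study for a fully crossed
# students x raters design. All marks are invented for illustration.

def g_study(scores):
    """Variance components and G coefficient for a students-x-raters table.

    scores: one row per student; each row holds that student's mark
    from each rater.
    """
    n_p, n_r = len(scores), len(scores[0])
    n = n_p * n_r
    grand = sum(sum(row) for row in scores) / n
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Two-way ANOVA sums of squares
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_tot - ss_p - ss_r  # residual: interaction plus error

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Expected-mean-squares estimates (negative estimates clamped to zero)
    var_pr = ms_pr                          # interaction/error variance
    var_p = max(0.0, (ms_p - ms_pr) / n_r)  # true between-student variance
    var_r = max(0.0, (ms_r - ms_pr) / n_p)  # rater severity variance

    # Generalisability coefficient for relative decisions with n_r raters
    g = var_p / (var_p + var_pr / n_r)
    return var_p, var_r, var_pr, g

# Hypothetical marks: 5 students each assessed by 3 raters
marks = [
    [10, 11, 10],
    [14, 15, 13],
    [8, 9, 8],
    [17, 16, 18],
    [12, 12, 11],
]
var_p, var_r, var_pr, g = g_study(marks)
```

A design of this kind would allow the separate contributions of students, raters and their interaction to be quantified, and the generalisability coefficient to be projected for different numbers of raters – precisely the empirical evidence the report finds lacking.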