Parallel universes and parallel measures: estimating the reliability of test results: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. Educational assessment and the concept of reliability
In everyday use, ‘reliable’ means ‘that which can be relied on’, but the technical definition in educational assessment is narrower: ‘the extent to which a candidate would get the same test result if the testing procedure was repeated’. This technical definition describes a sliding scale – not black or white – and encourages us to consider why a candidate’s result might be different from one instance to the next. Here are some possible reasons:
- The candidate’s performance might vary a little from one day to the next, particularly if the conditions of the exam change (morning or afternoon, who is administering the test, whether the caretaker is mowing the lawn outside, whether the candidate has a headache or not, etc.)
- One exam marker might be more or less lenient on particular questions than the next (or even the same marker might be more or less lenient from one day to the next)
- One exam paper may include different test questions from the next, eliciting different facets of the candidate’s understanding (tests usually sample from the curriculum because there is not enough time to test everything, and candidates may choose to revise one topic but not another)
Even if the tests are of good quality, a candidate who takes several tests with no opportunity for learning or improvement in between (impossible to achieve, or to be certain of, in practice) is likely to get slightly different results from one test to the next. Some of this variation will be effectively random, but some of it could be identified and removed, if we knew where to look.
2. Measurement error
In statistical terms, this variation from one test instance to the next is described in terms of the candidate having a notional ‘true score’ – the score that represents their actual ability – plus or minus an ‘error score’ (different for each test instance), the total of the two being the ‘actual test score’. Statisticians often report the error as the Standard Error of Measurement (SEM), a statistical term describing how the candidate’s scores on repeated tests would be distributed around their true score. So a score on the KS2 test might be reported as 34 ± 3 (SEM) out of 100. The SEM for the KS2 test is three marks either way, and a range of ± one SEM corresponds to a confidence interval of one standard deviation, or around 68%. So this shorthand says that the candidate’s actual score on the test was 34, and there is roughly a 68% chance that their true score lies in the range 31–37. The more reliable a test, the lower the SEM.
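In symbols, this ‘true score plus error’ model and the 68% interval can be sketched as follows (a simplified illustration, assuming the measurement error is roughly normally distributed with mean zero):

```latex
X = T + E \qquad \text{(actual score = true score + error score)}, \qquad E \sim N(0,\ \mathrm{SEM}^2)

P\left(X - \mathrm{SEM} \le T \le X + \mathrm{SEM}\right) \approx 0.68
\qquad \text{e.g. } X = 34,\ \mathrm{SEM} = 3 \ \Rightarrow\ T \in [31, 37] \text{ with roughly } 68\% \text{ confidence}
```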
This approach provides a single ‘amount of error’ for a test, whereas other approaches look at whether the errors are likely to be bigger or smaller for different test scores. For example, a test full of medium-difficulty questions is probably going to give more accurate information about candidates of medium ability and more error-prone information about the strongest and weakest candidates.
Statisticians use a number of methods to measure the error in a test. One method involves asking candidates, where possible, to sit more than one test (with no time for learning or practice in between): the difference in performance between the two tests for each candidate is a measure of the error in both tests. Usually candidates would be divided into two groups, with each group sitting the two tests in a different sequence (i.e. group A sits test 1 then test 2, and group B sits test 2 then test 1), so as to deal with the effects of practice and possibly tiredness/boredom. This approach is used for key stage tests: candidates in 2009 (for example) sat the 2009 test, and a small sample then sat the 2010 test. The data provides information about error, and also helps calibrate the 2010 test to allow standards to be maintained.
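As a rough illustration of this kind of estimate (the scores below are hypothetical, not taken from the key stage analyses), the reliability of a pair of tests can be approximated by correlating each candidate’s scores on the two administrations:

```python
# Sketch of a parallel-forms reliability estimate with hypothetical data.
# Requires Python 3.10+ for statistics.correlation (Pearson correlation).
from statistics import correlation

# Hypothetical scores for the same candidates on two comparable tests,
# sat with no opportunity for learning in between.
test_1 = [34, 52, 47, 61, 28, 73, 55, 40]
test_2 = [31, 55, 45, 64, 30, 70, 58, 37]

# The correlation between the two sets of scores estimates the reliability
# of the tests: the closer to 1, the smaller the measurement error.
reliability_estimate = correlation(test_1, test_2)
print(f"Parallel-forms reliability estimate: {reliability_estimate:.3f}")
```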
As a variant, test results could be compared with a teacher’s estimate of the student’s level at the time they sat the test. In the KS2 reading test, the correlation between candidates’ scores on the 2007 and 2008 tests was 0.812, and the correlation with teacher assessment was 0.766 (the difference between the live test and the pre-test might be down to students being more motivated for the live test!). The differences between the teacher assessment and the pre-test might be due to the teacher considering skills which the test can’t or doesn’t measure.
Measurement error on a test can also be modelled statistically. There are various methods used for this, all of them quite complicated, and all involving making some assumptions about the nature of the test and/or the candidates. Of course, for many projects, both methods of error estimation can be used and the results compared.
3. Internal reliability
When undertaking reliability investigations, researchers often start by looking for internal reliability in a test. Consider a test that is designed to measure a single educational concept, for example fractions, or reading (concepts here are usually called ‘constructs’ in the recognition that we create these categories of knowledge). We would expect a good candidate to do better than a weaker candidate on all items. If, for a particular item, they don’t, then either the item is measuring something else, or the construct is so broad that candidates can have differing profiles of skills for different aspects of the construct. A fractions question might be hard for a good maths student because it includes a lot of reading and they happen to be a poor reader. Another student might be good at adding and subtracting fractions, but poor at multiplying them.
It is relatively easy to look at how internally consistent tests are – researchers split the test questions into two halves and look at how candidates do on each half. They then repeat this for every possible combination of ‘halves’ and come up with an average correlation between the two halves – this measure is called Cronbach’s alpha and is the starting point for much reliability work. In the case of the KS2 reading test, the Cronbach’s alpha of 0.883 implies that 88% of the variation in pupils’ scores is due to variation in their true score, and 12% to measurement error. Some people doubt the effectiveness of Cronbach’s alpha as a secure method: it assumes that the test is measuring precisely the skills that it is intended to measure – and it is possible for a test to have a high Cronbach’s alpha but in fact be measuring the wrong construct very well. It also doesn’t cope so well with marker variation.
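As a rough illustration of the calculation, the sketch below uses the usual variance-based formula for Cronbach’s alpha (equivalent, under certain assumptions, to the average split-half approach described above) with hypothetical item-level scores:

```python
# Minimal sketch of Cronbach's alpha with hypothetical data (not from the report).
# alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one list per candidate, holding their score on every item."""
    k = len(item_scores[0])                          # number of items on the test
    totals = [sum(candidate) for candidate in item_scores]
    item_variances = [
        pvariance([candidate[i] for candidate in item_scores]) for i in range(k)
    ]
    return k / (k - 1) * (1 - sum(item_variances) / pvariance(totals))

# Hypothetical item-level scores for five candidates on a four-item test.
scores = [
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
print(f"Cronbach's alpha: {cronbach_alpha(scores):.3f}")
```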
4. Classification accuracy
Looking at reliability in test scores, as the above methods do, allows researchers to produce confidence intervals around results. In England, however, exam results tend to be reported as classifications, with candidates receiving a level or grade that covers a range of marks – for example, in the KS2 test, candidates receive a level from 2 to 5.
Measurement errors in marks are more tolerable provided they do not risk changing the grade allocated to a candidate. Clearly, candidates with scores near the grade boundaries are more at risk of misclassification than those with scores in the middle of the band (e.g. a score of 32 is close to the level 4/5 boundary of 31, the minimum score for a level 5 award). Researchers often also consider the difference between classification accuracy (how well the test assigns grades to candidates compared with how they would be graded according to their notional true score) and classification consistency (how similar two test instances, each with their own measurement errors, are in their classification of the same group of candidates). Based on modelling the error in the tests, the probability of misclassification of a student in the KS2 2007 test was 17% overall but, for students scoring near the grade boundaries, the misclassification probability rose to 40%.
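A simplified sketch of this boundary effect, assuming normally distributed measurement error (the boundary of 31 and SEM of 3 echo the worked example above, but the calculation is illustrative rather than the report’s actual model):

```python
# Sketch of misclassification risk near a grade boundary, assuming a normal
# error model with the SEM as the standard deviation (illustrative only).
from math import erf, sqrt

def prob_true_score_below(boundary, observed_score, sem):
    """Probability that the true score falls below the boundary, given the
    observed score and SEM, under a normal error model."""
    z = (boundary - observed_score) / sem
    return 0.5 * (1 + erf(z / sqrt(2)))          # standard normal CDF at z

# A candidate scoring 32, just above the level 4/5 boundary of 31:
print(f"Chance the true score is below the boundary: "
      f"{prob_true_score_below(31, 32, 3):.0%}")

# A candidate scoring 45, well clear of the boundary:
print(f"For a score of 45: {prob_true_score_below(31, 45, 3):.0%}")
```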
5. Terminology
‘Reliability’ and ‘measurement error’ both mean different things in their everyday use from their meaning in technical psychometric use.
| Terminology | Everyday usage | Technical psychometric usage |
| --- | --- | --- |
| Reliability | Black or white – something is either reliable or not | A quantification of the extent to which a measurement is repeatable |
| Measurement error | Something has gone wrong in running the exams, and somebody is probably to blame | A recognition that tests try to measure a candidate’s actual ability but inevitably end up with a degree of approximation, even when they have been run as well as possible |
The implications to a non-professional audience of the term reliability are perhaps too favourable, while those of measurement error are too negative, so the report considers some alternatives outside the normal range. There is much to be said for signal to noise ratio (SNR): it is intuitive and avoids the value judgement associated with terms like reliability. However, its use of decibels as a scale is unfamiliar in assessment. Would, perhaps, proportion relevant variance (PRV) be better? It is more precise, but much harder to understand. To replace standard error of measurement, an expression indicating the diffuse nature of our knowledge of an individual’s score is required, such as measurement sharpness. But none of these terms is without its problems.
6. The public’s perception of measurement error
A number of research projects (both in the Ofqual Reliability Programme and elsewhere) have looked at how the public respond to the issue of examination reliability. It is clear that, although public understanding is limited overall, there is acceptance of the random element in assessment (e.g. ‘I revised this topic but it didn’t come up’, or ‘I had a bad day because of a headache/cold/personal circumstances’). There is less acceptance of variability in marking, even where the marking is inherently subjective. While examination professionals generally accept a degree of variability in marking essays, the public expects further marking to be undertaken until an agreed result is found.
One other difference between providers and consumers is observed in how they think about variation from one occasion to another. Measurement specialists conceptualise this in terms of a true score and an error score relating to the assessment instance, whereas non-professionals tend to think more in terms of a true score, but one which is frequently not attained because of negative outside influences – i.e. failings in the assessment process. These differences, and discussions about responsibility for variation, detract from the more important question: ‘Is an assessment good enough and, if not, what do we want to do about it?’
As an example, research suggests that two highly trained markers have a 40% chance of agreeing on an essay score (marked on a scale of 1 to 5). This doesn’t seem too good, but the markers are already trained, so further training doesn’t look a promising avenue to explore. A larger sample of written work would almost certainly give a more stable result, so perhaps more exam-based essays are needed? But this will increase marking cost and exam time (both of which many regard as already too high). Coursework offers an alternative but raises concerns about authenticity – is it all the student’s own work? Or might the reaction be that the scoring of a single essay was only one aspect of the exam, and that things were likely to even out over a wider span of assessments?
7. Summary
Although the concepts of reliability are relatively simple, the statistical analysis to measure reliability is complex. The fact that specialist use of assessment terminology differs somewhat from everyday use of many of the same terms only adds to the confusion. Perhaps as a result of this, awarding organisations in England have tended not to publish reliability information. Nevertheless, transparency in reporting public examination results is now seen as essential to fostering an open and inclusive debate so that decisions can be taken on the best available evidence. For this reason, the report recommends that awarding organisations and the regulator should publish figures for the reliability of public examinations.