Estimating the reliability of composite scores: Summary
Published 16 May 2013
Applies to England, Northern Ireland and Wales
1. How are component scores combined?
Reliability in composite scores is important in an exam system such as England’s, where so many qualifications are awarded on the basis of scores from multiple components. The interpretation of composite scores requires a high level of understanding of the psychometric properties of the individual components and the way the components are combined.
Consider, as an example, a qualification made up of two units worth a maximum of 50 marks each. The simplest way to combine the scores is to add them up. Some examinations in England, however, place a greater emphasis on assessment components with higher reliability and/or validity. An essay paper might, for example, be scaled to be worth 70% of the total marks. In some examinations (A levels and GCSEs, for example), marks from individual units are also transposed onto a non-linear scale to allow for variations in difficulty between the exams taken by candidates in different years.
In fact, there are potentially many other ways of combining scores, each using different weightings for different components. Statistical evidence can be used to give more weight to higher-quality questions or sections (for example, items that best discriminate between the strongest and weakest candidates), to harder questions, or to more reliable items.
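The two simplest schemes above, adding the marks up and scaling one unit to be worth 70% of the total, can be sketched as a weighted sum of percentage scores. This is an illustrative sketch only: the function name `combine_scores` and the candidate's marks are hypothetical, and the non-linear scaling used for A levels and GCSEs is not modelled.

```python
from math import isclose


def combine_scores(marks, max_marks, weights):
    """Return a composite on a 0-100 scale as a weighted sum of percentage scores.

    Each component mark is divided by its maximum, multiplied by its weight,
    and the results are summed. Weights must sum to 1.
    """
    if not (len(marks) == len(max_marks) == len(weights)):
        raise ValueError("marks, max_marks and weights must have the same length")
    if not isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return 100 * sum(w * m / mx for m, mx, w in zip(marks, max_marks, weights))


# Hypothetical candidate: 40/50 on unit 1 (say, the essay paper), 30/50 on unit 2.
marks = [40, 30]
max_marks = [50, 50]

# With two 50-mark units, equal weighting is the same as adding the raw marks.
simple_total = combine_scores(marks, max_marks, [0.5, 0.5])
# Unit 1 scaled to be worth 70% of the total.
essay_heavy = combine_scores(marks, max_marks, [0.7, 0.3])

print(f"Equal weighting: {simple_total:.1f} / 100")
print(f"70/30 weighting: {essay_heavy:.1f} / 100")
```

Because this hypothetical candidate did better on unit 1, giving that unit more weight raises the composite. A candidate with the opposite profile would lose marks under the same scheme.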
2. Reliability of composite scores
If we know that the internal reliability of a unit is 0.83 and the reliability for a second unit is 0.91, what is the reliability of the composite?
Generally speaking, the reliability of a composite depends on the properties of its components, but it is also affected by the weighting and by the extent to which the components are correlated with each other. The three main models that statisticians use for reliability analysis can each be extended to cover composite scores. They are outlined below and set out in mathematical detail in the full report.
Statistical model | Advantages | Disadvantages |
---|---|---|
Classical test theory or true score theory: This is technically the simplest method, and is based on the assumption that in any test the candidate’s test score is made up of their true score (the score they would have got if the test were perfect) and an error score. | Simple (relatively!). Most widely used and well understood. Possible before the advent of computing. | Provides no way of working out what factors are contributing to the reliability. Makes assumptions (although weak ones) about the educational domain being tested. |
Generalisability theory or G theory: This is technically the most complex and aims to separate out the individual contributing factors in the measurement error. It is an extension of classical test theory. | Allows different sources of variability in testing to be isolated and estimated, enabling test designers to target the areas where improvements will yield the best gains. Makes fewer assumptions about the test and educational domain. | Very complex and potentially expensive to implement, and not intuitive. Makes weak assumptions. Requires computer processing. |
Item response theory: Models candidates’ ability and item difficulty on a scale so that, if a candidate’s ability is known, we can predict how they will do on an item before they attempt it. | Supports optimal test design. | Makes strong assumptions about the educational domain being tested. Requires computer processing. |
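
The classical test theory assumption in the first row of the table is often written in the following form. The symbols are assumptions introduced here for illustration and are not taken from the summary: X is the observed score, T the true score, E the error score (assumed uncorrelated with T), and the reliability coefficient is the share of observed-score variance attributable to true scores.

```latex
\begin{align}
  X &= T + E, \qquad \operatorname{Cov}(T, E) = 0 \\
  \sigma_X^2 &= \sigma_T^2 + \sigma_E^2 \\
  \rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
\end{align}
```
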
The statistics used to calculate composite reliability are complex and make substantial assumptions about the tests and candidates, so all the models provide results that must be regarded as estimates. For the trial dataset used in the paper, the three main methods of calculation produce very similar results, although each has particular advantages in certain settings. For the worked example in the report, the composite reliability for the two units ranges from 0.86 to 0.94, depending on how the two units are weighted and which models are used for the analysis. Note that, even at the lower end of the range, the composite is more reliable than the less reliable component.
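To show how weighting and inter-component correlation enter such an estimate, the sketch below uses a standard classical test theory formula for the reliability of a weighted composite (often attributed to Mosier), which assumes that measurement errors are uncorrelated across components. It is not necessarily the calculation used in the full report. The unit reliabilities of 0.83 and 0.91 come from the question above; the standard deviations and the correlation between units are hypothetical values chosen for illustration, so the outputs should not be read as the report's results.

```python
from itertools import combinations


def composite_reliability(weights, sds, reliabilities, correlations):
    """Classical-test-theory reliability of a weighted composite.

    reliability = 1 - (error variance of composite) / (total variance of composite)

    weights, sds, reliabilities: one value per component.
    correlations: dict mapping (i, j) with i < j to the observed correlation
    between components i and j.
    Assumes measurement errors are uncorrelated between components.
    """
    n = len(weights)
    # Error variance contributed by each component: w^2 * sd^2 * (1 - reliability).
    error_var = sum(
        w ** 2 * s ** 2 * (1 - r)
        for w, s, r in zip(weights, sds, reliabilities)
    )
    # Total composite variance: weighted variances plus weighted covariances.
    total_var = sum(w ** 2 * s ** 2 for w, s in zip(weights, sds))
    for i, j in combinations(range(n), 2):
        total_var += 2 * weights[i] * weights[j] * sds[i] * sds[j] * correlations[(i, j)]
    return 1 - error_var / total_var


unit_reliabilities = [0.83, 0.91]   # from the example in the text
unit_sds = [10.0, 10.0]             # ASSUMED: equal spread of marks on both units
unit_correlations = {(0, 1): 0.6}   # ASSUMED: correlation between the two units

equal = composite_reliability([0.5, 0.5], unit_sds, unit_reliabilities, unit_correlations)
heavy_unit1 = composite_reliability([0.7, 0.3], unit_sds, unit_reliabilities, unit_correlations)
heavy_unit2 = composite_reliability([0.3, 0.7], unit_sds, unit_reliabilities, unit_correlations)

print(f"Equal weights:       {equal:.3f}")
print(f"70% on unit 1 (0.83): {heavy_unit1:.3f}")
print(f"70% on unit 2 (0.91): {heavy_unit2:.3f}")
```

Under these assumed values, shifting weight towards the more reliable unit raises the composite reliability, and shifting it towards the less reliable unit lowers it. Different assumed standard deviations or correlations would give different figures.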
3. Conclusions
Relatively little research has been carried out into the reliability of composite scores for public tests and examinations in England. More work is needed to better understand how combining scores affects reliability, and also to allow consideration of how awarding organisations could publish this information to inform public debate about qualifications.