Aggregating to improve ethnicity data quality
Published 30 June 2023
1. Introduction
Data on the Ethnicity facts and figures website shows outcomes for different ethnic groups. It covers areas such as crime and policing, education and employment.
Users sometimes want to know how different analysis methods can make data more reliable. This report looks at how aggregating data for more than one time period can improve data quality.
This report is part of a series that looks at the quality and reliability of ethnicity data. Others include:
- Using relative likelihoods to compare ethnic disparities
- How different or similar aggregated ethnic groups are
- Stop and search data and the effect of geographical differences
- Gypsy, Roma and Irish Traveller ethnic group
- Differences in the quality of ethnicity data reported by individuals and third parties
- Which differences in ethnic group data are real?
This report uses similar techniques to the final report here, including calculating confidence intervals and understanding significant differences between ethnic groups.
2. Summary
This report focuses on aggregating data over more than one time period. It also looks at:
- aggregation to improve data quality, especially for ethnic groups with small populations
- the benefits and limitations of different methods of aggregating, such as aggregating ethnic groups and aggregating different geographies
3. Recommendations
We recommend that data providers:
- give details about why they have aggregated data, and describe any limitations to the aggregation method
- explain why they aggregated for a certain number of time periods
- show the impact on the data in using more than one time period
- provide advice for users on appropriate ways that data might be aggregated
- use appropriate techniques when comparing aggregated data – this can be more complex when time periods are being compared that overlap
- publish data to allow users to produce aggregations – for example, weighted numerators and denominators
- understand user needs for reliable ethnicity data for small groups against the need for more up-to-date data
4. Ethnic groups with relatively small populations
Sometimes we use survey data to produce estimates for ethnic groups. The smaller the size of the survey sample, generally the less reliable the estimate for that group will be.
This can be an issue when we produce estimates for ethnic groups with small populations. Table 1 shows the 10 ethnic groups with the smallest populations.
Table 1: Ethnic groups from the 2021 England and Wales Census with less than 1% of the total population
Ethnicity | Percentage | Number |
---|---|---|
Gypsy or Irish Traveller | 0.1 | 67,800 |
Roma | 0.2 | 101,000 |
Mixed white and black African | 0.4 | 249,600 |
Black other | 0.5 | 297,800 |
Arab | 0.6 | 331,900 |
Chinese | 0.7 | 445,600 |
Mixed other | 0.8 | 467,100 |
Mixed white and Asian | 0.8 | 488,200 |
White Irish | 0.9 | 507,500 |
Mixed white and black Caribbean | 0.9 | 513,000 |
The groups with the smallest percentage of the population were:
- Gypsy or Irish Traveller (0.1%)
- Roma (0.2%)
- mixed white and black African (0.4%)
We can multiply the number of respondents in a data collection by these proportions. This can give us an idea of the sample size we can expect for an ethnic group.
The Annual Population Survey (APS) 2017 to 2019 dataset used for this report had 257,000 records of people of working age in England and Wales. This means we should expect around 260 responses from the white Gypsy or Irish Traveller population.
However, 65 people of working age in England and Wales in this dataset identified as being in the white Gypsy or Irish Traveller classification. This summary of statistics discusses data quality issues for this ethnic classification in more detail.
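As a rough check, the expected and actual sample sizes above can be reproduced in a few lines of Python (the figures are those quoted in this report):

```python
# Expected sample size for a group, assuming the survey sample mirrors
# the population proportions.
total_respondents = 257_000   # APS 2017 to 2019, working age, England and Wales
population_share = 0.001      # Gypsy or Irish Traveller group: 0.1% of the population

expected = total_respondents * population_share
actual = 65                   # respondents who actually identified with this group

print(f"Expected: about {expected:.0f}, actual: {actual}")
```

The shortfall (65 against roughly 260) illustrates why a large overall sample does not guarantee a usable sample for every group.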
So, even when the data collection has a large number of respondents, the sample size for ethnic groups might still lead to unreliable estimates. It could also mean that data is suppressed because it is disclosive or unreliable, making it harder to understand differences over time or between groups.
Unlike sample surveys, administrative data collections do not have the same issue of sampling variability. But a small number of respondents in administrative data might be problematic too. For example, sometimes we have to suppress administrative data to avoid disclosing the identity of individual respondents.
This means we might want to use different ways to increase the number of respondents either:
- during data collection
- during the analysis
5. Increasing the number of respondents
This report focuses on aggregating data during analysis. However, increasing the number of respondents in a data collection method can also help improve the reliability of estimates taken from it. It can also help reduce the need for aggregation in the future.
There are 3 main ways to do this:
- improve response rates
- use survey boosts or oversampling
- link datasets together
5.1 Improve response rates
Ways to improve response rates include:
- designing high quality data collections
- choosing the correct survey mode
- helping respondents understand the uses of their data
- providing translated survey documents
- offering incentives (or higher incentives) for participation
- sending follow-up emails to people who have not responded
5.2 Use survey boosts or oversampling
You can increase the number of people with different characteristics in your survey. This is known as a “survey boost” or “oversampling”. Surveys are sometimes boosted:
- for the whole country
- in smaller areas, like local authorities
- in different types of areas, like rural areas
- for specific groups of people, such as people with different ethnicities
- for a combination of all these factors
A survey boost can happen proportionately across groups or areas. Sometimes groups or areas are boosted disproportionately (or oversampled) relative to their proportion in the population.
For example, 3.1% of the population in England and Wales are from the Indian ethnic group. To improve the reliability of data for that population in a survey, we could ensure that 5% of our survey sample was from the Indian ethnic group.
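A minimal sketch of the arithmetic behind that example, using an illustrative overall sample size (the 35,000 figure is borrowed from the CSEW example later in this report, purely for illustration):

```python
# Effect of oversampling on the expected sample size for one group.
sample_size = 35_000        # illustrative overall survey sample
population_share = 0.031    # Indian ethnic group share of England and Wales
boosted_share = 0.05        # oversampled design share from the example above

proportionate = round(sample_size * population_share)  # no boost
boosted = round(sample_size * boosted_share)           # with the boost

print(proportionate, boosted)
```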
The ONS has regularly boosted samples for their social surveys since the start of the COVID-19 pandemic.
5.3 Link datasets together
Linking datasets will not usually increase the number of respondents in a data collection. But missing ethnicity records might be obtained if a dataset with many missing records is linked to one that is more complete.
6. Different ways of aggregating
Aggregating data can improve its reliability. We describe the following 3 main ways to aggregate data:
- aggregating years that do not overlap
- aggregating years that overlap
- aggregating by another characteristic
6.1 Aggregating years that do not overlap
For example, with a dataset that has data from 2009 to 2020, aggregating:
- 2009 to 2012
- 2013 to 2016
- 2017 to 2020
The time periods do not overlap – each year only appears in one aggregation.
Sometimes non-consecutive time periods are aggregated, for example 2010, 2012 and 2014.
6.2 Aggregating years that overlap
For example, for the same dataset as above, aggregating:
- 2014 to 2017
- 2015 to 2018
- 2016 to 2019
- 2017 to 2020
In this example, each 4-year period shares 3 years with:
- the 4-year period before it
- the 4-year period after it
These are called ‘rolling’ or ‘moving averages’. Using moving averages is one way of smoothing data to see underlying trends.
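A minimal Python sketch of a 4-year moving average, using made-up annual values purely to illustrate the smoothing:

```python
# 4-year moving (rolling) average over annual figures.
# The values are illustrative only.
years = list(range(2009, 2021))
values = [52, 55, 49, 58, 61, 57, 63, 66, 62, 68, 71, 69]

window = 4
rolling = [
    (years[i], years[i + window - 1], sum(values[i : i + window]) / window)
    for i in range(len(values) - window + 1)
]

for start, end, avg in rolling:
    print(f"{start} to {end}: {avg:.1f}")
```

Each successive window drops the earliest year and adds the next one, which is what smooths out year-on-year fluctuations.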
6.3 Aggregating by another characteristic
Users might also combine ethnic groups. An RDU report showed how aggregating this way can hide large differences in outcomes between smaller groups.
For this reason, the government has stopped using the term ‘BAME’ in its own communications. It is encouraging other public sector bodies to do the same.
We encourage people to follow 4 broad principles as often as possible:
- use harmonised standards for ethnicity
- collect data in as much detail as possible
- analyse and report data in as much detail as possible
- avoid bespoke aggregations
If aggregation is required, we also recommend avoiding aggregating ethnic group data outside of the 5 broad groups:
- Asian
- black
- mixed
- white
- other
Aggregating across these broad groups reduces comparability with other datasets. For example, avoid aggregating the white Irish and Arab groups.
7. Benefits of aggregation
There are benefits to aggregating more than one time period.
7.1 More reliable estimates
Confidence intervals are often used on Ethnicity facts and figures. They help determine the reliability of estimates based on survey data. Previous RDU research describes how to calculate different types of confidence intervals for ethnicity data.
A wider confidence interval means a less reliable estimate from a survey. Less reliable estimates have less analytical value, and wide intervals can be the result of either or both of:
- a small sample in the survey
- a large variation within survey responses
Analysts usually use 95% confidence intervals. 90% and 99% confidence intervals are also sometimes used.
A 95% confidence interval means that if we could take 100 random samples, and create 100 confidence intervals around a particular estimate, we would expect the true value of that estimate to fall within 95 of the 100 intervals.
For a simple random sample[footnote 1] survey, there is a relationship between the confidence interval and the sample size. The confidence interval varies inversely with the square root of the sample size. For example, if you quadruple your sample size, it reduces the width of the confidence interval by half.
If we aggregate time periods, we usually increase the sample size. This should then reduce the width of the confidence interval for an ethnic group. This will increase the reliability of estimates, as measured by confidence intervals.
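A short Python sketch of this relationship, using the standard formula for a 95% confidence interval around a proportion from a simple random sample (the proportion and sample sizes here are illustrative):

```python
import math

# Width of a 95% confidence interval for a proportion p from a simple
# random sample of size n: 2 * 1.96 * sqrt(p * (1 - p) / n).
def ci_width(p, n, z=1.96):
    return 2 * z * math.sqrt(p * (1 - p) / n)

p = 0.7                      # illustrative proportion
w1 = ci_width(p, 1_000)      # single-period sample
w4 = ci_width(p, 4_000)      # 4 periods aggregated (quadrupled sample)

print(round(w1 / w4, 2))
```

Because the width varies with 1/√n, quadrupling the sample exactly halves the interval width, whatever the proportion.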
More reliable estimates might show more statistically significant differences between different ethnic groups, and different aggregated time periods for the same group.
An example of aggregating data
This example uses data from the Crime Survey for England and Wales (CSEW).
The survey asks people about their experiences of some criminal offences in the 12 months before the interview. It is run by the Office for National Statistics (ONS), and around 35,000 households take part each year. It covers a broad range of victim-based crimes experienced by the people interviewed. The crimes do not have to be reported to and recorded by the police.
The Ethnicity facts and figures website contains statistics from the survey. One page shows the percentage of people who said they had confidence in their local police.
The analysis below shows the impact of aggregating years for the 5 aggregated ethnic groups. We also show data for people who did not give their ethnicity. This unknown group has the:
- smallest sample size
- widest confidence intervals
- lowest proportion of respondents who had confidence in the police
Figure 1: Percentage of each ethnic group that had confidence in their local police. England, number of survey years aggregated to the year ending March 2020
The charts show that, in every ethnic group, the confidence intervals get narrower the more years of survey data you combine. For each number of years combined (1, 3, 5 or 7), the white ethnic group has the narrowest confidence intervals and the unknown ethnic group the widest.
For example, 3 years aggregated means the aggregation of survey years for the years ending March 2018, March 2019 and March 2020.
Figure 1 and table 2 show how sample size increases can affect the width of the confidence intervals. As we aggregate more years, the width of the interval for each ethnic group reduces. This means we might be able to detect more significant differences between groups.
For example:
- using one year of data, there are no significant differences between the white group and the mixed ethnic group
- if we aggregate 3 or more years, then the difference in the percentages of people who had confidence in their local police between these 2 groups is significant
If the CSEW was a simple random sample with the same sample size in each year, using 3 years of data instead of one would reduce the width of the confidence interval by a factor of √3. That is, the 3-year interval would be about 58% of the width of the single-year interval.
The CSEW is a more complex survey[footnote 2], but the sample sizes for each ethnic group do not vary much from year to year. The 1/√3 approximation works well: for the known ethnic groups, the 3-year confidence interval widths are between 52% and 62% of the single-year widths.
Table 2: summary aggregated statistics for confidence in the local police
Number of years aggregated | Ethnicity | Lower confidence bound (%) | Estimate (%) | Upper confidence bound (%) | Confidence interval width (%) | Sample size |
---|---|---|---|---|---|---|
1 | White | 73.7 | 74.4 | 75.0 | 1.3 | 29,918 |
3 | White | 75.7 | 76.0 | 76.4 | 0.7 | 91,345 |
5 | White | 76.9 | 77.1 | 77.4 | 0.5 | 154,983 |
7 | White | 76.6 | 76.8 | 77.0 | 0.4 | 217,175 |
1 | Mixed | 65.5 | 70.6 | 75.7 | 10.2 | 416 |
3 | Mixed | 66.5 | 69.6 | 72.8 | 6.3 | 1,163 |
5 | Mixed | 68.2 | 70.6 | 73.1 | 4.9 | 1,897 |
7 | Mixed | 67.8 | 69.9 | 72.1 | 4.3 | 2,545 |
1 | Asian | 74.5 | 76.7 | 78.9 | 4.3 | 2,065 |
3 | Asian | 77.2 | 78.4 | 79.7 | 2.4 | 6,156 |
5 | Asian | 77.9 | 78.9 | 79.8 | 1.9 | 9,902 |
7 | Asian | 77.9 | 78.8 | 79.6 | 1.6 | 13,375 |
1 | Black | 60.7 | 64.5 | 68.3 | 7.6 | 957 |
3 | Black | 67.7 | 69.7 | 71.8 | 4.1 | 2,855 |
5 | Black | 69.0 | 70.6 | 72.2 | 3.1 | 4,783 |
7 | Black | 69.2 | 70.6 | 71.9 | 2.7 | 6,644 |
1 | Other | 68.6 | 74.9 | 81.1 | 12.5 | 277 |
3 | Other | 74.8 | 78.1 | 81.3 | 6.5 | 862 |
5 | Other | 75.6 | 78.1 | 80.6 | 5.0 | 1,453 |
7 | Other | 75.6 | 77.8 | 80.0 | 4.4 | 1,923 |
1 | Unknown | 50.4 | 62.1 | 73.7 | 23.3 | 101 |
3 | Unknown | 47.7 | 55.9 | 64.1 | 16.4 | 231 |
5 | Unknown | 51.1 | 58.1 | 65.1 | 14.0 | 336 |
7 | Unknown | 50.1 | 56.4 | 62.7 | 12.6 | 413 |
Note: Number of years aggregated to March 2020. For example, 3 years in the table is an aggregation of years ending March 2018, March 2019 and March 2020.
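Using the confidence interval widths in table 2, a short Python sketch shows that the 1-year to 3-year reductions sit close to the 1/√3 approximation for a simple random sample:

```python
import math

# Widths of the 95% confidence intervals from table 2 (1 year vs 3 years).
widths = {
    "White": (1.3, 0.7),
    "Mixed": (10.2, 6.3),
    "Asian": (4.3, 2.4),
    "Black": (7.6, 4.1),
    "Other": (12.5, 6.5),
}

expected = 1 / math.sqrt(3)  # about 0.58 under a simple random sample

for group, (one_year, three_year) in widths.items():
    print(f"{group}: {three_year / one_year:.2f} (expected about {expected:.2f})")
```

The unknown group is excluded because its sample size does not triple over the 3 years, so the approximation does not apply.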
7.2 Reducing the amount of suppressed data
Data is sometimes suppressed to protect the identity of individuals, and to prevent users drawing the wrong conclusions from unreliable data.
The Department for Education (DfE) publishes adoption scorecards which show the number of adoptions of children from ethnic minority backgrounds by local authority as a 3-year average. Aggregation in this dataset reduces the amount of suppression and increases the amount of available data.
For timeliness indicators[footnote 3] on the scorecard, counts of 10 or fewer are suppressed. In this case, the suppression of data is more about not disclosing information about individual children – a very sensitive topic – than about data reliability.
DfE also published data on adopted and looked-after children. Data for the number of children leaving care due to adoption was shown for 151 local authorities.
For the ‘other than white’ ethnic group:
- 46 local authorities (30%) had data suppressed on the number and proportion of children leaving care who are adopted for 3 years to March 2020 combined
- 86 local authorities (57%) had data suppressed for one year of data (2020)
A 3-year average is used because:
- the majority of local authorities (70%) have data available, and trends can be tracked over time
- 2-year averages still need a lot of suppression and may need secondary suppression where single year figures are also shown
- 4-year averages reduce the ability to track trends over time
The ONS suppresses estimates from the APS based on fewer than 100 respondents for data broken down by ethnicity. This is done to protect people’s confidentiality and because the numbers involved are too small to make reliable generalisations.
Figure 2 shows how the number of estimates for analysis increases in the APS. We have used the Indian ethnic group for an age and National Statistics socio-economic classification (NS-SEC) analysis for:
- 2019 data
- 2017, 2018 and 2019 data combined
Figure 2: Estimates available from the APS for the Indian ethnic group, for 1 year (2019) and 3 years combined (2017, 2018 and 2019) [footnote 4]
The charts show the number of available estimates for the Indian ethnic group from the Annual Population Survey for 1 year, and for 3 years, for an analysis of 12 age groups and 8 socio-economic classifications (96 possible estimates). An available estimate is one with a sample size of 100 or more. For 1 year of data, there are 12 available estimates. For 3 years, there are 24 available estimates.
The number of data points available (the blue circles) increases as we combine 3 years together (figure 2). This does not necessarily mean these data points are fit for purpose for the needs of a user – they might have wide confidence intervals around them, for example. However, the potential for more analysis goes up as we have more data points.
8. Limitations of aggregating data
Aggregating for more than one time period always requires underlying numerators and denominators. This is to calculate correct estimates, for example of a percentage, from aggregated data.
We recommend that Ethnicity facts and figures data providers supply this underlying data. Users can then aggregate different time periods.
Some datasets on Ethnicity facts and figures already have this information. Data from the Active Lives Survey is a good example.
Averaging percentages across time periods is incorrect and not advised, unless you know the denominators are the same for each year you are trying to aggregate.
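A small Python example of why this matters, using illustrative numerators and denominators of very different sizes:

```python
# Why aggregation needs numerators and denominators: averaging percentages
# is only correct when the denominators are equal. Figures are illustrative.
numerators = [70, 30]        # e.g. people with confidence in the police, per year
denominators = [100, 300]    # respondents in each year

# Correct: pool the counts, then calculate the percentage.
correct = sum(numerators) / sum(denominators) * 100

# Incorrect: average the yearly percentages (70% and 10%).
yearly_pcts = [n / d * 100 for n, d in zip(numerators, denominators)]
naive = sum(yearly_pcts) / len(yearly_pcts)

print(correct, naive)
```

Here the naive average (40%) gives the small first year the same influence as a year with 3 times as many respondents, while the pooled figure (25%) weights each respondent equally.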
8.1 Not analysing the most recent data
Sometimes users are interested in the most recent statistics in a dataset, for example, the most recent year. Any aggregation over more than one time period means it is not possible to draw conclusions from the most recent data on its own.
This is the issue with aggregating time periods – the person analysing or using the data is often making a trade-off between 2 factors:
- timeliness of the data – analysing the most recent data
- detail – such as being able to analyse data for smaller ethnic groups
Decisions about which data to aggregate are easier to make if an analyst understands the needs of their users.
A related issue is that aggregating time periods reduces the number of data points you have available – for example, to track trends over time.
Even using moving averages means there is at least one fewer data point available for analysis.
8.2 Survey design
Some surveys, such as the Labour Force Survey (LFS), use a ‘wave’ design. This means the same people are surveyed more than once in consecutive time periods.[footnote 5] This can create problems for aggregation, as the same people will appear more than once.
The APS is itself an aggregation of 4 successive quarters of the LFS. It takes account of this issue by only combining different people from certain waves to ensure no-one appears more than once.
Some other datasets are not as suitable for aggregation over time. An example is a longitudinal dataset that collects data for the same people across a number of time periods.
We recommend that ethnicity data providers help users decide which comparisons are recommended and how to aggregate data. For example, the ONS recommends comparing the latest LFS quarter with the previous non-overlapping quarter, rather than the 3 months ending with the previous month.
8.3 Weighting survey data
Weighting for non-response might be problematic in some aggregations. In practice, the individual survey weights from each period are quite often reused after aggregation. A more correct approach is to aggregate the data and then recalculate the non-response weights, but the availability of data is often a practical limitation here.
8.4 Describing data
It can be more difficult to describe data based on aggregated time periods than a single time period. Any commentary needs to be clear about timeliness. Charts, tables and metadata should also clearly explain which time periods have been aggregated.
Some statistical tests can also be more complex when analysing moving averages. If 2 aggregated-year estimates are being compared and there are no overlapping years, the statistical testing is straightforward.
The standard error of the difference in the 2 estimates is the square root of the sum of the 2 variances.
A user can then apply a standard statistical test (for example, a t-test) of the difference between the 2 sets of data:
|diff / se(diff)|
where se(diff) = √(var(x2) + var(x1))
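A minimal Python sketch of this test for 2 non-overlapping aggregated estimates, using illustrative estimates and standard errors:

```python
import math

# Significance test for the difference between 2 independent estimates.
# The estimates (%) and standard errors below are illustrative.
x1, se1 = 74.4, 0.33   # aggregated period 1
x2, se2 = 76.0, 0.18   # aggregated period 2

se_diff = math.sqrt(se1**2 + se2**2)
test_stat = abs(x2 - x1) / se_diff

# 1.96 is the 5% critical value for a two-sided test.
print(round(test_stat, 2), "significant" if test_stat > 1.96 else "not significant")
```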
Comparing aggregate years in moving averages is more complex. This is because the datasets are not independent – the same people will appear in more than one year.
If we have created 3-year combined estimates for 2018 to 2020 and for 2019 to 2021, this means 2019 and 2020 are overlapping years.
If we make the assumption that the weights are the same in each year, the 2018 to 2020 estimate is:
x̄ 2018-2020 = (x2018 / 3) + (x2019 / 3) + (x2020 / 3)
The 2019 to 2021 estimate is:
x̄ 2019-2021 = (x2019 / 3) + (x2020 / 3) + (x2021 / 3)
The difference between the 2 estimates is:
x̄ 2019-2021 - x̄ 2018-2020 = (x2021 / 3) - (x2018 / 3)
The difference between the 2 combined year statistics would be a third of the difference between the first and last years (2021 and 2018). It makes no use of the data from the overlapping years.
To test the difference for significance, it is only necessary to test whether the first and last years are significantly different.
With complex datasets like the LFS, survey weighting and rotation introduce different covariances and make the formula more complex.
The formula for the standard error of the difference in this case (not assuming independence) is:
se(diff) = √(var(x2) + var(x1) - 2cov(x2, x1))
Sometimes the underlying data might not be available to calculate this precisely. In some cases an approximation might be fit for purpose.
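As a sketch of the adjusted formula, with illustrative variances and a positive covariance (positive because the overlapping periods share respondents):

```python
import math

# Standard error of the difference between 2 overlapping aggregated
# estimates, with and without the covariance adjustment. Values illustrative.
var_x1, var_x2 = 0.09, 0.08
cov_x1_x2 = 0.05   # shared respondents make the estimates positively correlated

se_naive = math.sqrt(var_x1 + var_x2)                    # assumes independence
se_adjusted = math.sqrt(var_x1 + var_x2 - 2 * cov_x1_x2)  # accounts for overlap

print(round(se_naive, 3), round(se_adjusted, 3))
```

With a positive covariance the adjusted standard error is smaller, so ignoring the overlap makes the test conservative: some real differences would be missed.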
Whichever method users choose, the Equality Hub will support those using a method that attempts to understand the uncertainty around an estimate, rather than doing nothing at all.
8.5 Missing underlying trends
Some statistics might be less appropriate for aggregation over time. These include:
- statistics on a steep upward or downward trend
- those that fluctuate year on year
Aggregated time series data can lose the link to the underlying trend. It might be important to understand whether data for a particular series is changing rapidly or not.
Time series data that are more stable might be more suitable for aggregation, for example, to improve their robustness for smaller ethnic groups.
The stop and search figures for England and Wales (excluding data for the British Transport Police and Greater Manchester Police) are used as an example here. These figures are from an administrative dataset. We might need to aggregate them over time to be comparable with another aggregated dataset.
Stop and search rates for the black Caribbean group went down from 173 stop and searches for every 1,000 people in the year ending March 2009, to 59 in the year ending March 2014. The average stop and search rate for black Caribbean people for that 6-year period is 118 – twice the rate for the year ending March 2014, and two-thirds the rate for the year ending March 2009.
By averaging over these years, we lose important knowledge about the trend over that time period. However, the average from the years ending March 2015 to 2020 is 32. This is more representative of the figures for each year in that period. The rate for the year ending March 2016 was 32 stop and searches for every 1,000 people and for the year ending March 2020 it was 38. We might feel more comfortable aggregating across that time period.
Figure 3: Stop and search rate for black Caribbean people, years ending March 2007 to 2020, England and Wales excluding British Transport Police and Greater Manchester Police
The stop and search rate for black Caribbean people went down sharply every year from the years ending March 2009 until 2015. It has been more stable between the years ending 2015 and 2020, at between 27 and 38 incidents per 1,000 people.
Deciding whether to aggregate in these cases is often a matter of judgement. The aim should be to not misreport the data. Factors to consider include:
- the relative size of the aggregated figure compared to the figures for the start and end points
- how the aggregate figure changes if different start and end points are used
- whether you are happy aggregating overlapping time periods
- which aggregated time periods you want to analyse - to be consistent with other data, for example
8.6 Ethnicity classifications change over time
If ethnicity classifications have changed over time, aggregation for more than one time period might not be possible.
Without the underlying data, aggregating the ‘other’ or Asian groups for datasets that use the 2001 and 2011 Census classifications might not be possible. This is because the Chinese group moved from the ‘other’ ethnic group to the Asian group between those 2 classifications, in estimates produced for the 5 main ethnic groups.
Data providers should be clear in their metadata about whether and how different ethnic groups can be aggregated over time.
9. Conclusions
This report has demonstrated that there are benefits to aggregating data over more than one time period, or aggregating by another classification – for example, combining more than one ethnic group.
These can mean more reliable data (as measured by confidence intervals) and less suppression. However, there are some limitations such as:
- not having the most up-to-date data to analyse
- needing the underlying data to do the aggregation
- complex surveys making the aggregation more difficult
- losing information about underlying trends
Data producers should provide as much information as possible to help users wanting to aggregate data to understand and overcome some of these limitations.
10. Acknowledgements
We acknowledge the help and assistance from analysis colleagues in the:
- ONS Crime Survey for England and Wales
- ONS Data Quality Hub
- Home Office
- Department for Education
Data from the Crime Survey for England and Wales was sourced from the UK Data Service.
We also cite the Welsh Government report on generating aggregate statistics which has provided some invaluable background information.
1. A simple random sample is a sample in which each member of the population has an exactly equal chance of being selected. ↩
2. The survey package in R has been used to account for this when calculating confidence intervals. ↩
3. For example, the average time between a child entering care and moving in with their adoptive family. ↩
4. 2017 to 2019 data has been used for this example as the years are pre-COVID-19 pandemic. ↩
5. In the LFS, people are interviewed for 5 successive waves at 3-monthly intervals. See figure 3.1 in LFS guidance Volume 1: Background and methodology. ↩