Which differences in ethnic group data are real?
Published 21 December 2021
1. Introduction
The Race Disparity Unit’s (RDU) Ethnicity facts and figures website shows data about the experiences and outcomes for different ethnic groups in areas including crime and policing, education and employment.
Our users need to understand whether observed differences between ethnic groups or time periods are reliable and reflect real differences in the whole population, or if they may be due to natural variations in the data we have collected.
In this report we will talk about statistical significance. This has recently been a source of discussion and controversy in statistics.
A statistically significant result does not imply causality, or that the differences seen are directly due to people being from different ethnic groups. Also, determining that something is statistically significant does not, on its own, show that the finding is of interest scientifically or for policy.
However, we use statistical significance to inform what’s published on Ethnicity facts and figures.
This report shows how real differences can be determined using different analytical techniques. It explains some of the strengths and limitations of each.
It covers:
- why we want to be confident about estimates for different ethnic groups
- how we can use confidence intervals to assess our confidence in estimates
- 2 different ways of producing confidence intervals that are applicable in different circumstances
- some of the methods we can use to understand differences between estimates
2. Conclusions
There are 3 main conclusions to this report:
- it is important to understand significant differences between estimates to be confident that the differences are not likely to be due to natural variations in data alone
- some ways of understanding differences are more useful in some circumstances than others – for example, where sample sizes are small, or where estimates are extreme values close to 0% or 100%
- it is important that data suppliers provide confidence intervals and other supporting information where possible to help users understand the uncertainty around estimates for ethnic groups
3. Confidence in data
When using survey data from the measures on Ethnicity facts and figures we need to understand how confident we can be in an estimate.
We also need to know whether things are different or not for estimates:
- between ethnic groups
- for the same ethnic group over time
- between geographies like regions, or other characteristics like men and women
We need to be clear that the data we report in statistical surveys are estimates of real figures in the population and will differ from them for a variety of reasons. When we see differences or changes based on those estimates, we want to understand if the differences or changes are more likely to be real, rather than the sort of variation we would expect to see.
For example, data on smoking from the 2019 Annual Population Survey shows that 14.4% of white adults in England smoked cigarettes. This is an estimate based on a sample of around 147,000 people in total, of which around 129,000 people were in the white group.
We might ask if:
- this is representative of white people as a whole (of 37.5 million people aged 16 and over in England in 2011)
- this is statistically significantly different from the estimated 15.6% of people in the ‘other’ ethnic group who smoke, which was based on 2,155 responses
- it could just be variation because of who has been surveyed in that particular year for those 2 groups
We are interested in how representative the percentages are of the whole population. We cannot include everyone each time we do a survey (the exception to this is the Census), and in the example about the proportion of people who smoked cigarettes, we might have sampled more smokers in the ‘other’ ethnic group and fewer smokers in the white group by chance. If we repeated the survey, that might be reversed and the percentages might be closer together. So, we need to use different analysis techniques to answer the 3 questions above.
Understanding how estimates could vary across samples and then understanding the differences between ethnic groups is important because analysts can then derive accurate insights into data. They can then make better decisions using that information.
4. Confidence in estimates
4.1 Confidence intervals
To understand how precise an estimate is, confidence intervals can be used. A confidence interval can be thought of as a range of values around an estimate that the ‘true’ value – the value we would get if we could survey everybody in the group – is highly likely to be in. The estimate is often the central point of this range.
The wider the confidence interval, the less reliable the estimate. Wide confidence intervals can be the result of a small sample in the survey, large variation in survey responses, or both.
Usually, a 95% confidence interval is used. However, others like 90% or 99% are also sometimes used.
A confidence interval of 95% means that if we take 100 random samples and create 100 confidence intervals, the true value would be expected to fall within 95 of the 100 confidence intervals.
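The idea of repeated sampling can be illustrated with a small simulation. This is a minimal sketch only: the population proportion (15%), the sample size (1,000) and the number of repeats are made-up values, each sample is treated as a simple random sample, and the interval uses the standard formula described in section 5.

```python
import math
import random

def standard_interval(successes, n, z=1.96):
    """Symmetric interval p +/- z * se, assuming a simple random sample."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def coverage(true_p, n, repeats, seed=1):
    """Share of simulated samples whose interval contains the true value."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(repeats):
        # Draw one sample of n people; each has the outcome with chance true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        lower, upper = standard_interval(successes, n)
        if lower <= true_p <= upper:
            hits += 1
    return hits / repeats

# Hypothetical population in which 15% of people have the characteristic.
share_covering = coverage(true_p=0.15, n=1000, repeats=2000)
print(f"{share_covering:.1%} of intervals contain the true value")
```

The share printed should be close to, but not exactly, 95%, because the simulation itself is subject to random variation.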
Figure 1: Percentage of adults who smoke, with 95% confidence intervals, by ethnicity (England, 2019)
Source: Annual Population Survey
Figure 1 shows the confidence intervals for 5 aggregated ethnic groups for data from the Annual Population Survey on adult smoking. The 95% confidence interval for white people is between 14.2% and 14.7%. For the ‘other’ ethnic group it is between 13.6% and 17.6%.
We can use these confidence intervals as one way of deciding whether or not we can be confident that the proportions of people who smoke in these groups are different.
The use of confidence intervals would usually apply to a survey based on a sample of respondents. In general, we would not use them for a census of the population, or for data derived from an administrative process. However, similar techniques can be used for administrative data to measure natural fluctuations in measurements.
5. Calculating confidence intervals
5.1 Standard method
The confidence interval for an estimate of a proportion is usually calculated by the formula:
- p + z * se for the upper bound of the confidence interval
- p - z * se for the lower bound of the confidence interval
Where ‘p’ is the estimated proportion from the sample. For example, the estimated proportion of white people who smoke.
Where ‘se’ is the standard error of ‘p’. The standard error tells you how precise the proportion from any given sample from that population is likely to be compared to the actual population proportion. A larger standard error indicates that the proportion estimates from different possible samples are more spread out and less a reflection of the true proportion from the population. For a simple random sample, the standard error is calculated by √(p(1 − p)/n), where ‘p’ is the estimate expressed as a proportion (between 0 and 1) and ‘n’ is the sample size. A simple random sample is a randomly selected subset of a population. In this sampling method, each possible sample of a given size has an exactly equal chance of being selected.
Where ‘z’ is a number related to the level of confidence we want the interval to represent. For a 95% confidence interval, this number will be 1.96. This number varies depending on the level of confidence you require. The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean.
This formula assumes a simple random sample. It does not take into account complex survey design and estimation. The impact of these complexities – things like stratification, multi-stage sampling and calibration – can be represented in a design factor (DEFT). If a DEFT is available, this should be included when calculating the standard errors.
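To account for this complex survey design, the standard error formula becomes DEFT * √(p * (100 − p)/n) when ‘p’ is expressed as a percentage, or DEFT * √(p * (1 − p)/n) when ‘p’ is expressed as a proportion.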
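The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used to produce the published figures: the function name `proportion_ci` is our own, the smoking estimate is treated as if it came from a simple random sample, and the design factor of 1.2 is a made-up value used only to show the effect of including a DEFT.

```python
import math

def proportion_ci(p, n, z=1.96, deft=1.0):
    """Return (lower, upper) for a proportion p (0 to 1) based on n responses.

    deft=1.0 corresponds to a simple random sample. A larger design
    factor inflates the standard error and widens the interval.
    """
    se = deft * math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Smoking estimate for the white group: 14.4%, around 129,000 respondents.
lower, upper = proportion_ci(0.144, 129_000)
print(f"Simple random sample: {lower:.2%} to {upper:.2%}")

# Hypothetical design factor, for illustration only.
lower_d, upper_d = proportion_ci(0.144, 129_000, deft=1.2)
print(f"With DEFT = 1.2:      {lower_d:.2%} to {upper_d:.2%}")
```

Because the Annual Population Survey has a complex design, the simple random sample interval from this sketch should not be expected to match the published interval in Figure 1 exactly.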
5.2 Example of confidence intervals using this method
We can use the Crime Survey for England and Wales (CSEW) as an example dataset. The measure on Ethnicity facts and figures showing data on victims of crime has estimated proportions that are in general not extreme (not near 0% or 100%). In the year ending March 2019, 15% of people aged 16 and over said they had been the victim of a crime at least once in the last year.
The standard errors and symmetrical confidence intervals are given in the download file.
Someone deliberately using force or violence on the respondent is a much rarer event for all ethnic groups in the survey. The weighted and unweighted data for the year ending March 2018 can be seen in Table 1.
Estimates range from 0.6% of respondents in the Asian ethnic group to 2.0% in the black ethnic group. The unweighted sample sizes on which the data are based are also small for ethnic minority groups.
The 5 aggregate groups have been used as a demonstration because of the small sample size of people responding ‘yes’ to this question. Also the analyses here are not official statistics and have been developed for illustrative purposes only.
Table 1: Data on whether anyone has deliberately used force or violence on the adult respondent (England and Wales, year ending March 2018)
Ethnicity | % responding ‘yes’ (weighted estimate) | Number responding ‘yes’ (unweighted) | Total unweighted sample (excluding ‘refused’ and ‘don’t know’) |
---|---|---|---|
White | 1.6 | 440 | 30,995 |
Mixed | 1.3 | 4 | 375 |
Asian | 0.6 | 11 | 2,011 |
Black | 2.0 | 20 | 967 |
Other | 0.9 | 4 | 300 |
Source: Office for National Statistics (ONS) Crime Survey for England and Wales. The CSEW data used here and later in the report was obtained under an End User Licence from the UK Data Service.
As the sample design and estimation for the CSEW is complex, including stratification, multi-stage sampling and weighting, we cannot use the simpler calculation shown above. Instead we have used the ‘survey’ package in the statistical software R. This package provides facilities in R for analysing data from complex surveys.
Similar functionality is available in software like SAS, SPSS and Stata. The R code used is presented in Annex A. This produces the confidence intervals in Figure 2. Note that this type of analysis can be used on a sample survey like the CSEW but not for police recorded crime statistics taken from administrative datasets.
Figure 2: Data on whether anyone has deliberately used force or violence on the adult respondent, with 95% confidence intervals (England and Wales, year ending March 2018)
Source: ONS Crime Survey for England and Wales
In this instance, the lower confidence bounds of the intervals for the mixed and other groups are less than zero. It seems counter-intuitive to suggest a true value for a proportion could be outside the range of 0% to 100%. This shows that, while this method works for a number of scenarios, it has limitations when:
- sample sizes are small, because wide confidence intervals may extend beyond the bounds of 0% and 100%
- proportions are close to 0% or 100%, because confidence intervals may extend beyond these bounds
Because the simple method gives an interval that is symmetric about the estimate, it is particularly prone to extending beyond the bounds of 0% and 100%.
As well as these presentational problems, we would find that the original intention – that the interval contains the true value with a given likelihood (for example, 95%) – is no longer met, and some of the simplifying assumptions described earlier are no longer correct.
We might see these situations quite often with data for ethnic groups. The populations are smaller for some ethnic groups, so survey sample sizes might be more likely to be small for those groups. For distributions across ethnic groups, some proportions might be close to 0.
In these cases, we can use a different approach to calculating confidence intervals.
5.3 Wilson score intervals
One technique we can use to overcome these limitations is a Wilson Score interval. This way of deriving confidence intervals works well for variable data, and data that has extreme estimates towards 0% and 100% (either very rare, or very likely occurrences).
Using the Wilson Score interval calculation we obtain the confidence intervals in Figure 3. The example code is also available in Annex A. Users should note that the intervals are not symmetrical, and are bounded by 0% at the lower end and 100% at the upper end. In this instance, the Wilson Score interval is a more robust way of calculating confidence intervals than the standard method described above.
This method also meets the intention that the interval contains the true value with the given likelihood (for example, 95%).
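The contrast between the 2 methods can be sketched in Python using the textbook Wilson score formula. This is not the Annex A code: for simplicity it treats the weighted estimate for the ‘other’ group in Table 1 as if it came from a simple random sample of the unweighted size, which ignores the CSEW survey design, so the results will not match Figures 2 and 3.

```python
import math

def standard_ci(p, n, z=1.96):
    """Symmetric interval p +/- z * se, assuming a simple random sample."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(p, n, z=1.96):
    """Wilson score interval for a proportion p (0 to 1) from n responses."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 'Other' group in Table 1: 0.9% responding 'yes', 300 unweighted responses.
p, n = 0.009, 300
std_lower, std_upper = standard_ci(p, n)
wil_lower, wil_upper = wilson_ci(p, n)

print(f"Standard: {std_lower:.2%} to {std_upper:.2%}")
print(f"Wilson:   {wil_lower:.2%} to {wil_upper:.2%}")
```

Under these simplifying assumptions the standard interval has a lower bound below 0%, while the Wilson score interval stays above 0% and extends further above the estimate than below it.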
Figure 3: Percentage of respondents who had force or violence deliberately used on them, plus 95% confidence intervals (England and Wales, year ending March 2018)
Note: these confidence intervals, calculated using the Wilson score approach, are not symmetric about the point estimates
Source: ONS Crime Survey for England and Wales
6. Establishing significant differences
6.1 What ‘statistically significant’ means
We have used confidence intervals to show how confident we can be in our estimates. We also need to try to find out whether differences between estimates are meaningful in an analytical sense. This is what we call ‘statistically significant,’ or simply ‘significant’ when clearly used in a statistical context.
When a difference is statistically significant, we are confident it is the result of a real difference, or a real change over time, rather than being the result of chance.
Statistical significance is different from other types of significance, for example biological significance. Biological significance is the significance of differences between outcomes in a health situation. For example, the effect on health or survival.
The RDU’s approach for Ethnicity facts and figures is to provide commentary only on statistically significant differences or changes.
6.2 Significance vs importance
It is important to be clear about the distinction between statistical significance and importance. If a significant difference between groups is identified, this does not mean that this is of sufficient policy significance to require an intervention.
This is because analysing a dataset for significance provides a way of determining whether 2 figures are different due to something other than chance. It does not necessarily mean that the difference is important or notable in a practical sense. A difference between 2 ethnic groups of 0.5 percentage points might be statistically significant, but might not be important in practical, real-life terms.
Conversely, a larger (non-significant) difference between ethnic groups that persists in a dataset over several time periods might warrant further investigation as being important or notable.
The Methods and Quality Report on using relative likelihoods to compare ethnic disparities described a ‘four-fifths rule’ to identify notable differences in relative likelihoods, for example those that are greater than 1.25 or less than 0.80. A relative likelihood is a number that shows the extent to which 2 groups differ in their likelihood of experiencing an outcome. It’s calculated by the percentage (or proportion) of one group experiencing an outcome, divided by the percentage (or proportion) of another group experiencing an outcome.
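As a minimal sketch, the relative likelihood and the ‘four-fifths rule’ thresholds can be written as 2 short Python functions. The function names are our own, and the inputs are estimates quoted earlier in this report, used here only as worked examples.

```python
def relative_likelihood(pct_group, pct_comparison):
    """Percentage for one group divided by the percentage for another."""
    return pct_group / pct_comparison

def is_notable(rl, lower=0.80, upper=1.25):
    """True when the relative likelihood is below 0.80 or above 1.25."""
    return rl < lower or rl > upper

# Smoking: 'other' group 15.6% compared with white group 14.4%.
rl_smoking = relative_likelihood(15.6, 14.4)

# Table 1: Asian group 0.6% compared with white group 1.6%.
rl_violence = relative_likelihood(0.6, 1.6)

print(round(rl_smoking, 2), is_notable(rl_smoking))
print(round(rl_violence, 2), is_notable(rl_violence))
```

A relative likelihood outside the thresholds is notable in size, but this says nothing on its own about whether the difference is statistically significant.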
2 further general points about significance:
First, a large, but non-significant difference, could be the result of using a survey with too small a sample size.
This is addressed by the statistical concept of ‘power’. Statistical power is the ability to detect a difference if a difference really exists.
It depends on 2 things:
- sample size (number of subjects)
- effect size (for example, the difference in outcomes between 2 groups)
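The way power depends on these 2 things can be sketched with a normal approximation. This is an illustration under stated assumptions: 2 independent simple random samples, a two-sided test at the 5% level, and hypothetical true proportions of 14% and 16%. The function name `approximate_power` is our own.

```python
import math
from statistics import NormalDist

def approximate_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate chance of detecting a true difference between p1 and p2.

    Uses the normal approximation for a two-sided test and assumes
    2 independent simple random samples of sizes n1 and n2.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # about 1.96 when alpha is 0.05
    se_d = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p2 - p1) / se_d
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Same hypothetical effect size (2 percentage points), different sample sizes.
power_small = approximate_power(0.14, 0.16, 500, 500)
power_large = approximate_power(0.14, 0.16, 5000, 5000)

# Same sample size, larger hypothetical effect size (6 percentage points).
power_big_effect = approximate_power(0.14, 0.20, 500, 500)

print(f"{power_small:.0%}, {power_large:.0%}, {power_big_effect:.0%}")
```

In this sketch, increasing either the sample size or the effect size increases the chance of detecting the difference.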
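Second, if you make lots of comparisons between different groups, then one might come up as significant by chance alone. At the 95% level, around 1 in 20 comparisons would be expected to appear significant even when there are no real differences.
This means that a particular comparison is not necessarily of interest simply because it is significant.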
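The effect of making many comparisons can be quantified with a short calculation. This sketch assumes the comparisons are independent of each other and that there are no real differences between any of the groups, which is a simplification.

```python
def chance_of_false_positive(comparisons, alpha=0.05):
    """Probability that at least one of several independent comparisons
    appears significant when there are no real differences."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 10, 20):
    print(f"{k} comparisons: {chance_of_false_positive(k):.0%}")
```

With 10 independent comparisons at the 95% level, the chance of at least one apparently significant result is about 40%.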
6.3 Using a t-test
There are a number of methods for testing whether the difference between 2 numbers is statistically significant. These depend on the metric you are using for the estimate. One of the most common for testing the difference between 2 proportions is called a t-test.
For a difference (d) between 2 proportions p1 and p2 then: d = p2 - p1
The t statistic is t = d/se(d), where se(d) is the standard error of the difference d.
In most cases we can treat the 2 proportions as being independent, in which case se(d) = √(se(p1)² + se(p2)²).
In more complex sample designs the 2 measures may not be treated as independent, for example if the 2 proportions were measured from the same sample of people at 2 points in time.
The difference is statistically significant if the t statistic falls outside a critical range. For reasonably large sample sizes (usually 30 and above) and the usual 95% confidence level (a 5% significance level), the range used is [-1.96, 1.96].
This can be reworked to say that the difference is significantly different from zero if the confidence interval of the difference: [d-1.96 * se(d), d+1.96 * se(d)] does not contain zero.
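The steps above can be sketched in Python using the smoking estimates quoted in section 3. As an assumption for illustration, both estimates are treated as coming from independent simple random samples, so the standard errors ignore the survey design and the result is not an official test.

```python
import math

def se_proportion(p, n):
    """Standard error of a proportion p from a simple random sample of n."""
    return math.sqrt(p * (1 - p) / n)

def difference_test(p1, n1, p2, n2, z=1.96):
    """Return the t statistic, the interval for d = p2 - p1, and whether
    the difference is significant at the level implied by z."""
    d = p2 - p1
    se_d = math.sqrt(se_proportion(p1, n1) ** 2 + se_proportion(p2, n2) ** 2)
    t = d / se_d
    interval = (d - z * se_d, d + z * se_d)
    return t, interval, abs(t) > z

# White group: 14.4% of around 129,000; 'other' group: 15.6% of 2,155.
t, interval, significant = difference_test(0.144, 129_000, 0.156, 2_155)

print(f"t = {t:.2f}")
print(f"Interval for the difference: {interval[0]:.2%} to {interval[1]:.2%}")
print(f"Significant: {significant}")
```

Under these assumptions the t statistic is inside the range [-1.96, 1.96] and the interval for the difference contains zero, so both ways of expressing the test give the same answer.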
6.4 Differences based on the Wilson Score intervals
The discussion on testing significant differences above relates to proportions that are not close to 0% or 100% nor based on small samples. We showed earlier how to produce Wilson Score confidence intervals to address these circumstances. We can use these Wilson Score intervals to extend that to the difference of proportions.
The code in Annex A also calculates a difference between 2 ethnic groups. This code takes the Wilson intervals and calculates a 95% confidence interval around the difference between the Asian group and each of the other 4 aggregate groups.
Similar to the t-test, this is a robust way of calculating confidence in there being a significant difference between 2 estimates and provides an estimate of confidence in what the size of the difference is.
Figure 4 compares the difference between the Asian group and the other 4 ethnic groups. In this case the lower bound of the confidence interval around the difference is above zero for the comparison between the Asian and black groups and for the comparison between the Asian and white groups, so these intervals do not contain zero.
This means we can conclude that there is a significant difference between the estimates for those 2 comparisons.
Figure 4: Confidence interval around the percentage point difference between the Asian group and the other 4 aggregate ethnic groups, using Wilson’s Score intervals (England and Wales, 2017 to 2018)
Source: ONS Crime Survey for England and Wales
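One widely used way of building an interval for a difference from 2 Wilson score intervals is Newcombe’s method. We have not confirmed that the Annex A code uses exactly this construction, so the sketch below should be read as an illustration of the general idea. It also treats the Table 1 weighted estimates as if they came from simple random samples of the unweighted sizes, so the results will not match Figure 4.

```python
import math

def wilson_ci(p, n, z=1.96):
    """Wilson score interval for a proportion p (0 to 1) from n responses."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

def newcombe_difference_ci(p1, n1, p2, n2, z=1.96):
    """Interval for the difference p1 - p2, combining 2 Wilson intervals."""
    l1, u1 = wilson_ci(p1, n1, z)
    l2, u2 = wilson_ci(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

asian = (0.006, 2011)  # Table 1: estimate and unweighted sample size

# Difference between the white group and the Asian group.
white_lower, white_upper = newcombe_difference_ci(0.016, 30995, *asian)

# Difference between the 'other' group and the Asian group.
other_lower, other_upper = newcombe_difference_ci(0.009, 300, *asian)

print(f"White - Asian: {white_lower:.2%} to {white_upper:.2%}")
print(f"Other - Asian: {other_lower:.2%} to {other_upper:.2%}")
```

In this simplified version the interval for the white and Asian comparison lies wholly above zero, while the interval for the ‘other’ and Asian comparison contains zero.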
7. Overlapping confidence intervals method
Overlapping confidence intervals can also be used to find whether there is a significant difference between estimates, although this method is less robust.
If we compare the confidence intervals for the 2 sets of statistics and they do not have any values in common, then the difference is statistically significant.
If they do have values in common, this gives an indication that the difference in the 2 estimates is not significant. This means the findings cannot tell us if the difference found in the samples is also shown in the whole population.
We can conclude when using this method on Figure 1 that the percentage of white adult smokers is significantly different to the percentage for the ethnic minority groups, except the ‘other’ ethnic group.
8. Advantages and limitations
If confidence intervals exist for the data, using them to determine significant differences is easy and gives broadly reliable results. Data suppliers can provide confidence intervals and users can compare any of the groups they like, for any time periods.
However, this method is generally too conservative in measuring statistical significance. Fewer significant differences are likely to be picked up in the data than is actually the case.
When 95% confidence intervals do not overlap, there is a statistically significant difference between the estimates. However, the opposite is not necessarily true. The intervals may overlap, but there may be a statistically significant difference between the 2 estimates when significance testing is used, for example, the t-test.
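This can be seen in a small worked example. The estimates and standard errors below are hypothetical values chosen to show the effect, and the 2 estimates are assumed to be independent.

```python
import math

def confidence_interval(p, se, z=1.96):
    """Symmetric interval p +/- z * se."""
    return p - z * se, p + z * se

def intervals_overlap(a, b):
    """True when the 2 intervals have at least one value in common."""
    return a[0] <= b[1] and b[0] <= a[1]

def t_statistic(p1, se1, p2, se2):
    """t = d / se(d) for 2 independent estimates."""
    return (p2 - p1) / math.sqrt(se1**2 + se2**2)

# Hypothetical estimates of 50.0% and 53.5%, each with a standard
# error of 1 percentage point.
p1, se1 = 0.500, 0.010
p2, se2 = 0.535, 0.010

overlap = intervals_overlap(confidence_interval(p1, se1),
                            confidence_interval(p2, se2))
t = t_statistic(p1, se1, p2, se2)

print(f"Intervals overlap: {overlap}")
print(f"t = {t:.2f}, significant: {abs(t) > 1.96}")
```

The intervals overlap, but the t statistic is outside the range [-1.96, 1.96]. This happens because the intervals only stop overlapping when the difference exceeds 1.96 * (se(p1) + se(p2)), whereas the t-test only needs the difference to exceed 1.96 * √(se(p1)² + se(p2)²), which is never larger.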
If confidence intervals do not exist for the data, it might be relatively easy to calculate them. However, sample designs in large government surveys like the Annual Population Survey can be complex and using the formula for a simple random sample, for example, might not give accurate results.
T-tests, and comparisons of the difference using Wilson’s Score intervals will give more accurate results, although often the underlying raw data are required to do the calculations.
Whichever method users choose, RDU and Office for National Statistics (ONS) will support those using a method that places observed differences in the context of their uncertainty, rather than doing nothing at all.
8.1 Example of a difference between t-test and overlapping confidence intervals
RDU have analysed how the prevalence of single parent households has changed over time by ethnicity, with a focus on the black ethnic groups. For this analysis, single parent households were taken to be any households with children with only one adult resident.
The measure used was the proportion of all households with children that were ‘single parent’ households. The analysis used the Annual Population Survey (APS) data from 2006, 2010, 2014 and 2018.
The results suggested that between 2006 and 2018 there was a downward trend in the proportion of single parent households for all ethnicities, including the aggregated black ethnic group, and the black African and black Caribbean ethnic groups.
To test whether these changes were statistically significant, 95% confidence intervals were compared to assess any overlap. This method suggested that the difference for the black ethnic group was statistically significant, while the differences for the black African and black Caribbean ethnic groups were not statistically significant (Figure 5 and Figure 6).
Figure 5: Percentage of black Caribbean and black African households with children, which are ‘single parent’ households, by ethnicity
Source: Annual Population Survey
Figure 6: Percentage of black households with children which are ‘single parent’ households, by ethnicity (APS data)
Source: Annual Population Survey
Because there was only a narrow overlap in the 2006 and 2018 confidence intervals for the black African group, the significance of the differences was tested using a t-test. The results of the t-test confirmed the results for the black and black Caribbean groups found from comparing confidence intervals.
But for the black African ethnic group, the t-test offered a different result. We were able to conclude that the difference for this group was statistically significant. This is a different conclusion from the one drawn from simply looking at overlapping confidence intervals.
The 95% criterion often used is somewhat arbitrary, and results that fall narrowly on either side of it are actually very similar. The aim here is to get an accurate assessment of the difference.
9. Other ways of comparing data
The kinds of tests and confidence intervals used here concern the uncertainty around estimates of proportions. A range of other techniques exist for looking at different comparisons and for different types of estimates. For example, you might be interested in the strength of association between variables.
Both RDU and ONS would be happy to hear feedback on this report, and whether users would like to see further reports on other statistical techniques in the context of using ethnicity data.
10. Acknowledgements
This report has been jointly written by Darren Stillwell in the Race Disparity Unit and Charles Lound in the Office for National Statistics.
The Crime Survey for England and Wales and methodology teams in the Office for National Statistics also provided help and assistance.