Warwickshire County Council: Domestic EPC Estimates
This algorithmic tool is used to estimate the domestic energy performance certificates (EPC) of all households in Warwickshire that are missing one.
Tier 1 Information
Name
Domestic EPC Estimates
Description
This algorithmic tool is used to estimate domestic Energy Performance Certificates (EPC) where one doesn’t exist. It also helps with financial inclusion because the estimates can help us understand the cost of energy, which has an impact on the cost of living.
We use it to calculate potential domestic energy CO2 output. It is also used to support financial inclusion around the cost of living, and is shared with our partners to support domestic retrofit programmes. We use this data to support our actions under our Sustainable Future Strategy, and to address needs identified in the Director of Public Health Report (2022).
Website URL
N/A
Contact email
Requests for additional information should be addressed to: FAO: Data Science Professional Lead. Email: businessintelligence@warwickshire.gov.uk
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Warwickshire County Council
1.2 - Team
Data Science
1.3 - Senior responsible owner
Data Science Professional Lead
1.4 - External supplier involvement
Yes
1.4.1 - External supplier
The algorithmic tool was built by a member of staff in their personal time. The tool is now part of an external company (Databuilder Limited), of which that member of staff is the Director. The output of the model is given freely to Warwickshire County Council, on a non-commercial basis, to improve outcomes for citizens.
1.4.2 - Companies House Number
15906801
1.4.3 - External supplier role
The algorithmic tool was built as a personal project during the Covid lockdowns. It was demonstrated to have value in four areas of work in particular, with cost of living and climate change being two key areas.
1.4.4 - Procurement procedure type
N/A
1.4.5 - Data access terms
N/A
Tier 2 - Description and Rationale
2.1 - Detailed description
Each month, the EPC tool automatically ingests data from various sources; this data is then cleaned and loaded.
It is a 13-step tool. Every month it automatically acquires all of the datasets via either a web scrape or an API. The data is pre-processed, and an initial model is built.
From this model, the tool generates estimates for all domestic properties in Warwickshire, using unsupervised and supervised methods together with mathematical modelling and filtering. We then adjust costs to reflect energy pricing specific to each property's estimate.
Model evaluation takes place prior to bundling the data with the licensing of the original data and estimate data. The evaluation uses a separate supervised learning model.
The bundled synthetic data is reduced to only data needed for modelling and no address data or personally identifiable data is shared.
Reports are generated for how the model performs and its accuracy, which are bundled as part of the release to the internal and external partner for their specific area of remit.
The new data is then shared with a geospatial database, with both the database table and each column clearly marked as ‘estimated’, where an estimate exists.
When a new EPC exists, the synthetic data is removed from the property.
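The replacement rule above can be illustrated with a minimal sketch. It assumes each property has a single identifier and a single rating; the names `PropertyRecord`, `refresh_ratings`, `property_id` and `is_estimated` are hypothetical and are not taken from the tool itself.

```python
from dataclasses import dataclass


@dataclass
class PropertyRecord:
    property_id: str    # hypothetical identifier, not a real address key
    rating: str         # EPC band, A to G
    is_estimated: bool  # True only when the rating is synthetic


def refresh_ratings(estimates: dict, lodged: dict) -> list:
    """Prefer a lodged EPC over an estimate and flag which one was used."""
    records = []
    for property_id in sorted(set(estimates) | set(lodged)):
        if property_id in lodged:
            # A real EPC exists, so the synthetic value is dropped.
            records.append(PropertyRecord(property_id, lodged[property_id], False))
        else:
            records.append(PropertyRecord(property_id, estimates[property_id], True))
    return records


# Illustrative data only.
records = refresh_ratings({"p1": "D", "p2": "C"}, {"p2": "B", "p3": "E"})
```

In this sketch, property `p2` loses its estimate once a lodged certificate appears, and only `p1` remains marked as estimated.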
2.2 - Scope
The purpose of the tool is to help Warwickshire County Council (WCC) understand domestic properties' CO2 equivalent output within Warwickshire.
The tool aids the organisation to tackle climate change. Additionally, it helps support the following areas:
- Helps WCC to understand possible fuel bills.
- Helps WCC understand the likely retrofits required.
- Helps WCC understand potential fuel poverty.
- The dataset can help support fire risk.
- Supports the improvement of health outcomes (as detailed in our Public Health Report 2022).
2.3 - Benefit
The primary benefit of the tool is that it produces an estimated dataset that covers a broad range of use cases where WCC has little to no existing data.
EPC assessments are only carried out in three scenarios, and require an energy assessor to perform the assessments. Given that a single assessment may cost anywhere between £65 and £120, the tool presents a substantial value for money gain.
Moreover, since there is a gap in the EPC data of around 100,000 homes in Warwickshire, there is a pressing need to take action to help support individuals. The output of the EPC tool fills this gap in the EPC data and, in doing so, helps us improve outcomes for citizens, whilst covering a broad range of services for Warwickshire County Council.
2.4 - Previous process
There was no previous approach; EPC assessments are only carried out by an energy assessor.
Where partners have previously supported citizens in financial need, they have looked at the nearest house's available EPC to give an indication of energy costs. That property may not be similar in nature, depending on the type of house or when the energy assessment was carried out.
2.5 - Alternatives considered
One alternative is to conduct manual energy assessments. However, the complexity of doing manual energy assessments would require a large amount of coordination and a substantial amount of initial investment.
Tier 2 - Decision making Process
3.1 - Process integration
The output data from the model is used to understand the potential energy requirements of domestic properties. This data is presented within a PowerBI dashboard, but it is also available as part of a geospatial database. The data is clearly labelled as ‘estimates’.
For partners, it helps support understanding of costs around the cost of living to enable financial inclusion, when people come forward for help. The data is clearly marked as ‘estimated’. Metadata and model accuracy are made available to internal and external partners.
The data is provided to help inform decisions. Testing in 2024 will look at its integration into fire risk assessments.
Where this data is considered as part of decision-making, it may influence which strategic approaches to tackling climate change are used.
3.2 - Provided information
The output of the EPC tool provides estimated EPC assessment data, in addition to the likely costs of energy at current prices.
Data is clearly marked as ‘estimated’.
3.3 - Frequency and scale of usage
The tool is being used monthly, and is actively used by our Climate Change Team, the Business Intelligence Team, and our external partner(s) (other local authorities and their partners within Warwickshire, in order to deliver tasked duties).
The tool produces estimates for around 100,000 domestic households. Estimates for fewer than 1,000 of these homes are currently shared with partners, but this could rise to 100,000 a month if the partner is county-wide.
No citizens are directly interacting with the tool, or its output. However, it may be used to assist citizens with financial help by appropriate staff in an operational context of support.
3.4 - Human decisions and review
The model output is evaluated and checked by a human before the output data is shared more widely. This evaluation happens monthly.
The data is assessed on precision, recall and F1. The weighted average is typically considered across all 3 measures, but we tend to report on the F1.
A record of each month's automatic evaluation is logged.
The output of the data is clearly marked as ‘estimated’ for each column of data, and the dataset.
3.5 - Required training
The data doesn't require training to use, as metadata is provided. It is as straightforward to use as the original data, which has already been used operationally.
Internally, staff are advised that the data should be treated in a similar way to experimental statistics, so it should be communicated with uncertainty.
3.6 - Appeals and review
N/A
Tier 2 - Tool Specification
4.1.1 - System architecture
The data is automatically downloaded from the relevant websites using an API key, and is then loaded into an internal database.
e.g. https://epc.opendatacommunities.org/api/v1/domestic/search
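As an illustration of the download step, the sketch below builds (but does not send) a request to the endpoint quoted above. It assumes the service accepts HTTP Basic authentication built from an account email and API key, and an `Accept` header selecting CSV output; the query parameters `postcode` and `size` are likewise assumptions, as are the function name and the placeholder credentials. The service's own documentation should be checked before relying on any of these details.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://epc.opendatacommunities.org/api/v1/domestic/search"


def build_epc_request(email: str, api_key: str, **params: str) -> Request:
    """Prepare an authenticated search request without sending it."""
    # Assumed scheme: Basic auth over "email:api_key", Base64 encoded.
    token = base64.b64encode(f"{email}:{api_key}".encode("utf-8")).decode("ascii")
    query = urlencode(params)
    url = f"{SEARCH_URL}?{query}" if query else SEARCH_URL
    headers = {"Authorization": f"Basic {token}", "Accept": "text/csv"}
    return Request(url, headers=headers)


# Placeholder credentials and assumed parameter names, for illustration only.
request = build_epc_request(
    "analyst@example.org", "not-a-real-key", postcode="CV34", size="100"
)
```

Passing the prepared request to `urllib.request.urlopen` would perform the download; that call is left out here so the sketch runs without network access or real credentials.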
The data is then cleaned to address any issues in the original data. Once the data is clean, it is queried again from the internal database and the modelling is performed.
The output from the modelling is then written as a separate table to the internal database.
Finally, this data is queried once more, bundled, and moved to a data share that will be accessed by the partner.
4.1.2 - Phase
Production
4.1.3 - Maintenance
The tool has a monthly review schedule.
Logs are checked monthly, and any errors are addressed.
4.1.4 - Models
Unsupervised machine learning models are used to perform clustering.
Supervised machine learning methods are used for further clustering (K-Nearest Neighbours (KNN)).
Mathematical modelling is used in estimating values and in the general process. Geometric operations are also performed, for instance boundary detection and set theory.
Supervised machine learning (Random Forest) is used to perform the evaluation.
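For readers unfamiliar with K-Nearest Neighbours, the following is a generic sketch of the technique and not the tool's own model. It assumes each property is described by two already-scaled numeric features (hypothetically, floor area and construction year) and assigns the majority EPC band among the `k` closest properties that do have a certificate.

```python
from collections import Counter
from math import dist


def knn_estimate(target, known, k=3):
    """Return the most common band among the k nearest labelled properties.

    `known` is a list of (features, band) pairs; `target` is a feature tuple.
    """
    nearest = sorted(known, key=lambda item: dist(item[0], target))[:k]
    votes = Counter(band for _, band in nearest)
    return votes.most_common(1)[0][0]


# Illustrative, made-up feature values scaled to the range 0 to 1.
known = [
    ((0.20, 0.10), "E"),
    ((0.25, 0.15), "E"),
    ((0.30, 0.20), "D"),
    ((0.80, 0.90), "B"),
    ((0.85, 0.95), "B"),
]
older_home = knn_estimate((0.22, 0.12), known)
newer_home = knn_estimate((0.82, 0.92), known)
```

The choice of features, the scaling and the value of `k` are all assumptions made for this example.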
Tier 2 - Model Specification
4.2.1 - Model name
Domestic EPC Estimates
4.2.2 - Model version
V0.2
4.2.3 - Model task
Generation of synthetic data, with adjustments to cost of energy for predictions of EPCs per household.
4.2.4 - Model input
EPC data - publicly available
OS data - public sector geospatial agreement / paid
ONS data - publicly available
4.2.5 - Model output
Estimated (synthetic) EPC data - tabular
4.2.6 - Model architecture
The process goes through several stages.
- Automatically downloading the data
- Uploading and preprocessing the data
- Clustering the data
- Applying mathematical functions on this data
- Preprocessing the data further and clustering
- Evaluation of the model using a further supervised machine learning model (Random Forest)
- Write to database
- Write to datashare
4.2.7 - Model performance
Model performance is evaluated on precision, recall and F1. For internal and external reporting, we reference the F1 score, as it balances precision and recall.
Model performance is also evaluated on each of the potential classifications, as part of a standard classification report.
Model performance is updated monthly. Currently, the weighted average F1 score ranges from 97% to 98%.
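To make the reported measures concrete, the sketch below computes support-weighted precision, recall and F1 from lists of actual and predicted bands. It is a plain-Python illustration of the standard definitions, with made-up ratings; the function name `weighted_scores` is hypothetical and the tool's own evaluation code may differ.

```python
from collections import Counter


def weighted_scores(actual, predicted):
    """Support-weighted precision, recall and F1 across all actual classes."""
    support = Counter(actual)  # number of true examples per band
    pairs = list(zip(actual, predicted))
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, count in support.items():
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / count
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = count / len(actual)  # each class counts by its share of the data
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals


# Made-up ratings for illustration only.
actual = ["C", "C", "D", "D", "D", "E"]
predicted = ["C", "D", "D", "D", "D", "E"]
scores = weighted_scores(actual, predicted)
```

Weighting by support means that common bands dominate the headline figure, which is worth bearing in mind when rare bands are of particular interest.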
4.2.8 - Datasets
EPC data - publicly available
OS data - public sector geospatial agreement
ONS data - publicly available
4.2.9 - Dataset purposes
EPC (Energy Performance of Buildings Data: England and Wales) dataset and OS (Ordnance Survey AddressBase Premium) data are used in all elements of the development process. ONS (Office for National Statistics - Consumer Price Index) data is added towards the end of the process to calculate current energy costs.
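One common way to bring a historic cost up to current prices is to scale it by the ratio of two index values. The sketch below shows that calculation under the assumption that this is how the price index is applied; the source states only that ONS data is used to calculate current energy costs, and the function name and figures are illustrative.

```python
def uprate_cost(cost_at_assessment: float,
                index_at_assessment: float,
                index_now: float) -> float:
    """Scale a historic cost by the change in a price index."""
    if index_at_assessment <= 0:
        raise ValueError("index_at_assessment must be positive")
    return cost_at_assessment * index_now / index_at_assessment


# Illustrative figures only: an 800 pound annual cost, index rising from 100 to 150.
current_cost = uprate_cost(800.0, 100.0, 150.0)
```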
Tier 2 - Data Specification
4.3.1 - Source data name
Energy Performance of Buildings Data: England and Wales
Ordnance Survey AddressBase Premium
Office for National Statistics - Consumer Price Index
4.3.2 - Data modality
Tabular
4.3.3 - Data description
EPC data provides the energy performance of buildings, and we use the domestic EPC data for this work.
OS AddressBase Premium is data that includes Addresses, and other geospatial features of Britain.
Office for National Statistics data contains indices for inflation.
4.3.4 - Data quantities
The model uses around 200,000 rows of data with about 200 columns, although this is reduced to about 90 columns during the process.
4.3.5 - Sensitive attributes
There are no personal data or protected characteristics within the data.
4.3.6 - Data completeness and representativeness
The data does have some issues, particularly inconsistencies arising from the absence of data. For example, missing data is recorded inconsistently: a null value is not represented in the same way across the dataset. However, overall most of the data is complete to a standard that is acceptable for use.
Some data is not within the data release. For instance, homes that request an EPC under certain conditions are likely to be high performance, and this is one of the restrictions in the release.
Additionally, homes that may be inherited would be absent from the data, as they would not meet the conditions for requiring an EPC.
However, the data overall has enough coverage to be workable.
4.3.7 - Source data URL
https://epc.opendatacommunities.org/
https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l55o/mm23 (timeseries representation)
4.3.8 - Data collection
The data is collected for the specific purposes of understanding domestic energy output, which is in line with the use case of the original release of data.
4.3.9 - Data cleaning
Data is cleaned during the pre-processing stage. Within the data there are many gaps (or null values) in multiple columns. The cleaning process reduces all of these different types of missing value to the same type (null) for better consistency. Mixed types also exist within some columns (for instance numbers and text), and these are corrected (for instance, a numeric value may be presented as either 1900 or "before 1900").
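The two cleaning steps described above can be sketched as follows. The list of missing-value markers, the function names and the choice to keep a "before" qualifier alongside the year are all assumptions made for illustration; the tool's actual rules are not documented here.

```python
import re
from typing import Optional, Tuple

# Hypothetical markers that might stand in for a missing value.
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "unknown"}


def normalise_missing(value) -> Optional[str]:
    """Map every missing-value marker to None; otherwise return trimmed text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_MARKERS else text


def parse_build_year(value) -> Optional[Tuple[int, str]]:
    """Return (year, qualifier) from mixed inputs such as 1900 or 'before 1900'."""
    text = normalise_missing(value)
    if text is None:
        return None
    match = re.search(r"\d{4}", text)
    if match is None:
        return None
    qualifier = "before" if text.lower().startswith("before") else "exact"
    return int(match.group()), qualifier


cleaned = [parse_build_year(v) for v in (1900, "before 1900", " N/A ", None)]
```

Keeping the qualifier separate from the year leaves the column numeric while preserving the information that the original value was an upper bound.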
4.3.10 - Data sharing agreements
We are looking to implement a data sharing agreement as part of best practice, with our local authority partners.
4.3.11 - Data access and storage
Data is stored on our internal systems and is accessed via secure enterprise controls.
The data is available to users with access to Warwickshire County Council’s (WCC’s) geospatial database and is also available as part of an Azure Data Share, part of our data lake architecture. These are all tightly controlled with limited access to only analysts, or strictly relevant individuals.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
A DPIA has been completed for this work.
The summary of the DPIA is that the work will be used for four areas of work with a specific focus:
- Understanding domestic energy with relation to climate change.
- Helping to reduce fuel poverty and improve living conditions.
- Improving outcomes for individuals living in poor health-related living conditions, for the purpose of public health.
- Improve WCC’s approach to fire risk.
Once the data is acquired the data will be enhanced using modelling.
Data sharing is intended to take place for the purposes set out with our partner local authorities (excluding fire risk).
DPIA completed 01/10/23.
Warwickshire’s reference for this DPIA is “IMP-23-09_Acquire Energy Data Sets for Further Data Analysis”
5.2 - Risks and mitigations
The risks are around the inaccuracy of the data. Whilst the tool achieves high accuracy, it also produces errors. Overall, we would anticipate around 3 in every 100 homes to be inaccurately estimated.
Where misclassifications do occur, the rating is generally mislabelled to a lower or higher energy rating, but not to an extreme (for instance, an A predicted to be a G). The data suffers from bias (there are very few A, F and G ratings in the data). If a rating is incorrectly predicted, it is likely to fall in the band directly above or below on the ordinal scale (A to G).
This risk is mitigated by ensuring that the data is clearly labelled as ‘estimated’.
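The claim that errors tend to land in a neighbouring band can be checked by measuring how many bands apart each prediction is from the actual rating. The sketch below is one simple way to do this, using made-up ratings; the function name `band_error_profile` is hypothetical.

```python
from collections import Counter

BANDS = "ABCDEFG"  # ordinal EPC scale, best to worst


def band_error_profile(actual, predicted):
    """Count predictions by their distance, in bands, from the actual rating."""
    return Counter(
        abs(BANDS.index(a) - BANDS.index(p)) for a, p in zip(actual, predicted)
    )


# Made-up ratings for illustration only.
actual = ["C", "D", "D", "E", "B"]
predicted = ["C", "C", "D", "F", "B"]
profile = band_error_profile(actual, predicted)

errors = sum(n for distance, n in profile.items() if distance > 0)
adjacent_share = profile[1] / errors if errors else 0.0
```

A distance of 0 is a correct estimate, 1 is an adjacent band, and larger values indicate the more extreme errors that the text above says are uncommon.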