Natural England: Living England

Habitat mapping for the whole of England using satellite imagery, targeted field survey and machine learning.

From: Cabinet Office, Department for Science, Innovation and Technology and Government Digital Service
Published: 10 February 2025

Organisation: Department for Environment, Food and Rural Affairs
Organisation type: Agency or public body
Function: Environmental protection
Capability: Computer vision
Task: Image segmentation
Phase: Production
Region: England
Date published: 10 February 2025
ATRS version: 3.0

Tier 1 Information

1 - Name

Living England

2 - Description

The Living England project maps broad habitat extent using satellite imagery, targeted field survey and machine learning. These are used to accurately predict the likely habitat class at each location across England, by looking at the relationships between ground data labels, satellite imagery and environmental variables. This is a cost-effective process compared with undertaking national field surveys, and it is repeatable and updatable, generating predictions across the entirety of England. This information is then used to help inform policy and land management decisions.

3 - Website URL

https://www.data.gov.uk/dataset/e207e1b3-72e2-4b6a-8aec-0c7b8bb9998c/living-england-habitat-map-phase-4

4 - Contact email

livingenglandenquiries@naturalengland.org.uk

Tier 2 - Owner and Responsibility

1.1 - Organisation or department

Natural England

1.2 - Team

Living England, Natural Capital Ecosystem Assessment (NCEA), Evidence Directorate

1.3 - Senior responsible owner

NCEA Geospatial Principal Officer

1.4 - External supplier involvement

Tier 2 - Description and Rationale

2.1 - Detailed description

The project uses an object-based classification approach to predict habitat presence, built around a distributed random forest classification model that runs as a reproducible, automated process in R. The process was developed by our in-house team specifically for the Living England data and is deployed using the H2O.ai ‘h2o’ package v.3.44.0.2. It includes several measures: removing highly correlated predictor variables; splitting the data so that 20% is held out for evaluating model performance; and training 5-fold models with classwise weighting, rotating a 20% validation dataset, in order to find the best model parameters and model for our data. The best model is then used to predict across all our England data. It is evaluated against the independent held-out dataset, reporting F1, precision, recall, overall modelled accuracy, balanced accuracy and specificity. The modelled probabilities are then calibrated and binned to generate a categorical reliability score, ranging from Very Low to Very High.
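As an illustration only, the 20% hold-out split and classwise weighting described above can be sketched in a few lines of Python. This is not the project's R code: the function names, the per-class (stratified) hold-out, the inverse-frequency weighting formula and the toy labels are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_holdout(labels, holdout_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), holding out the same fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)
        n_test = round(len(idx) * holdout_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def class_weights(labels):
    """Inverse-frequency weights (an assumed formula): rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

# Toy, deliberately unbalanced ground labels (hypothetical class names).
labels = ["Grassland"] * 50 + ["Woodland"] * 30 + ["Bog"] * 20
train_idx, test_idx = stratified_holdout(labels)
weights = class_weights([labels[i] for i in train_idx])
```

With these toy labels the hold-out set keeps the 50:30:20 class proportions, and the rarest class receives the largest training weight.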

More information is available in our technical report for the 2022-23 dataset (Trippier et al. 2024). https://publications.naturalengland.org.uk/publication/5260859937652736

2.2 - Scope

The purpose of the tool is to predict the likely habitats present within each segment in the Living England spatial framework. Living England uses a hybrid approach to predictive habitat mapping, so this modelled data is combined with other data layers to provide users with a full-coverage national map of England showing the most likely primary habitat present at each location.

More information is available in our technical report for the 2022-23 dataset (Trippier et al. 2024). https://publications.naturalengland.org.uk/publication/5260859937652736

The dataset is designed to inform natural capital asset assessments, policy and land use decision making, and further environmental applications. Importantly, this is a predictive dataset and should not be used in isolation: the outputs provide a baseline habitat map prediction. The data should be used alongside further evidence and local land advice when applied to policy and land use decisions.

2.3 - Benefit

National-scale field data collection is expensive and resource intensive, is restricted by access, and is not readily updatable. This project and the Living England tool allow us to take advantage of openly available satellite imagery to predictively model habitat presence, using habitat-labelled data collected through our ground datasets and their relationships with environmental variables. Compared with traditional processes, this approach is cost-effective, reproducible and transparent, and can be readily updated while providing full coverage across England. With this approach we can monitor habitats in the landscape and how they are changing over time.

2.4 - Previous process

Before the use of satellite images and machine learning to generate information about land, mapping was largely based on direct surveys and/or broad modelling algorithms with limited accuracy and precision. Satellite imagery in particular provided the opportunity to bring far more data to bear on habitat modelling, if techniques could be developed to fully exploit the data. The random forest machine learning approach was adopted from the start of the project in 2016. Since then the modelled approach has been developed over several years to optimise the predictor variables, hyperparameters and model algorithm used, and to deploy machine learning best practice. These process decisions are noted in annual reports produced by the project, as well as in our published technical user guides.

2.5 - Alternatives considered

Natural England has trialled several other modelling approaches over a number of years. This included trialling different training data subsets and random stratified sampling of the training data; we found that Random Forest algorithms handle unbalanced data well, and addressed the remaining imbalance by including classwise weighting in our model. We have also trialled the number and combination of classes in the multi-class classification model, the best predictor variables for Natural England purposes, and the model algorithm itself, finding that the distributed random forest model performed best with our data compared with other trialled algorithms, e.g. Support Vector Machine, Generalised Linear Model, Generalised Additive Model and Quantile Regression Forests. We have trialled some deep learning approaches as part of this development; however, there is a trade-off with the computational expense and infrastructure needed to support them, particularly given that we map at a national extent with large Earth Observation (EO) datasets.

Tier 2 - Decision making Process

3.1 - Process integration

Living England uses a range of data to model habitat probability. Input data (earth observation / satellite data) are accessed from a variety of sources brought together in a central data repository for the purpose of analysis. Data are gathered over periods at different parts of the year and processed to reduce interference such as cloud cover. Training data include habitat records which have been identified, collated and processed for use in Living England through engagement with internal and external habitat and monitoring specialists. Datasets are selected to ensure they provide an accurate representation of the habitat type within the assigned segment. These data have undergone a quality checking and translation process to ensure classifications align to the UKBAP habitat framework. To support data collection and targeting, a bespoke Esri ArcGIS Collector data collection application was developed for internal use by NE surveyors using ArcGIS Pro and ArcGIS Online (AGOL). This allowed for easy and consistent recording. A map interface allows surveyors to accurately relate their ground observations to the segment boundaries used as a framework for creating the Living England map, assisted by high-resolution Esri World Imagery (Esri 2021). Users can then select from a pre-determined list of habitats to record Living England detailed habitat classes and an estimate of the percentage cover of that habitat type within a given segment.

The algorithmic tool predicts the likely primary habitat class for each segment in the Living England (LE) object framework. Where a segment hasn’t been flagged through the specific habitat mapping assignment (see the LE 2022-23 Technical User Guide for more information), it is assigned the primary habitat prediction from the modelled outputs. Further attributes are also populated from the modelled outputs: ‘Model_Habs’ lists all the habitats predicted within a segment above a probability threshold; ‘Model_probs’ lists the modelled probability scores for the Model_Habs results; ‘Mixed_Hab’ flags, based on the difference between the primary and secondary habitat predictions, whether a segment is likely to contain a mixture of dominant habitats; and ‘Reliability’ is a categorical score giving users an easy-to-interpret indication of the reliability of the primary habitat prediction.
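A minimal sketch of how the per-segment attributes described above could be derived from a set of class probabilities is given below. The threshold of 0.2, the mixed-habitat margin of 0.1, the ‘Primary_Hab’ key and the habitat names are hypothetical values chosen for the example; the actual thresholds are documented in the Technical User Guide, not here.

```python
def segment_attributes(probs, threshold=0.2, mixed_margin=0.1):
    """Derive illustrative attributes from one segment's class probabilities.

    probs: dict mapping habitat name -> modelled probability.
    threshold and mixed_margin are assumed values, not the project's.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    primary, p1 = ranked[0]
    p2 = ranked[1][1] if len(ranked) > 1 else 0.0
    listed = [(hab, p) for hab, p in ranked if p >= threshold]
    return {
        "Primary_Hab": primary,                     # hypothetical attribute name
        "Model_Habs": [hab for hab, _ in listed],   # habitats above the threshold
        "Model_probs": [p for _, p in listed],      # their modelled probabilities
        "Mixed_Hab": (p1 - p2) < mixed_margin,      # primary and secondary are close
    }

# Toy probabilities for a single segment.
attrs = segment_attributes(
    {"Grassland": 0.48, "Arable": 0.41, "Scrub": 0.08, "Water": 0.03}
)
```

In this toy segment the two leading habitats are within the margin of each other, so the segment is flagged as likely mixed.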

3.2 - Provided information

The Living England Habitat Probability map is available to view on Defra’s MAGIC platform, the Natural England Open Data portal and the Defra Data Services Platform, with download access available via data.gov.uk.

Classified segments from the vector-based classification and model-based classification workflows are merged to create maps with completed habitat classifications for each of the 14 areas (Biogeographic Zones (BGZs)) that England is split into for the purposes of processing. The predicted habitat classifications for each BGZ are then merged, and the final product is clipped to the Mean High Water Springs (MHWS) extent to produce the final Living England national habitat map.

3.3 - Frequency and scale of usage

The tool is intended to help inform environmental policy decision making and national habitat extent and connectivity assessments for targeting nature recovery. This can inform planning strategy. We do not hold data on the extent of use. The map is updated and deployed every two years to create a new Living England dataset for publication. This is applied on a whole England scale.

3.4 - Human decisions and review

The tool is checked each time it is run, every two years. The team perform a number of internal data quality checks on the outputs to ensure that they are performing as expected, that habitats and locations look accurate in comparison with the satellite imagery, and that the model performance is of sufficient quality. The final output dataset produced by the hybrid approach (including the tool) is then assessed by external stakeholders through a beta release, and reviewed by an external panel of specialists to ensure it is of a high standard.

3.5 - Required training

There is in-house training for deploying the tool. The team have good documentation and senior supporting staff to run through the workflow and analysis. Technical documentation is available for users that describes how the tool should be used and outputs interpreted. No external training is available.

3.6 - Appeals and review

Users can contact the Living England team at earth.observation.naturalengland.org.uk if they have a query about the map, including if they believe there is an issue or error. The maps are regularly updated to ensure continued accuracy, and this process includes quality assurance and review.

Tier 2 - Tool Specification

4.1.1 - System architecture

R package: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html

Living England predicts most habitat classes using a Random Forest (RF) modelling framework deployed in R. The model is trained on the field habitat data, satellite imagery, and predictor variables. Highly correlated variables were identified and removed prior to model training to reduce multicollinearity and redundancy. Following removal of the highly correlated variables, the RF model was trained on 80% of the field dataset while the remaining 20% was retained separately for validation and not used during the modelling process to ensure independent evaluation of model performance. Both training and validation subsets were balanced, so they each contained the same proportions of each habitat class.
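The removal of highly correlated predictors described above can be illustrated with a simple greedy filter. This is a sketch under stated assumptions rather than the project's method: the Pearson correlation measure, the 0.9 cut-off, the keep-first ordering and the predictor names are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(columns, threshold=0.9):
    """Keep a predictor only if |r| with every already-kept predictor is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy per-segment predictor values (hypothetical names and numbers).
columns = {
    "ndvi_summer_mean": [0.2, 0.4, 0.6, 0.8, 0.5],
    "ndvi_summer_max": [0.3, 0.5, 0.7, 0.9, 0.6],  # mean + 0.1, so almost perfectly correlated
    "slope_mean": [5.0, 1.0, 4.0, 2.0, 9.0],
}
kept = drop_correlated(columns)
```

Here the second NDVI column carries no extra information beyond the first and is dropped, while the weakly correlated slope column is retained.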

A Distributed Random Forest (DRF) model was trained and validated using the training and validation sets, respectively. Model hyperparameters and the number of trees were optimised using a grid search methodology. The DRF was implemented using the H2O.ai library because of its capacity to cope with segments with missing data, which allows predictions to be generated where data gaps exist due to cloud or cloud shadow. Missing values are interpreted as containing information, rather than being missing at random; decision trees are built to optimise performance where data is missing, treating the absence of data as a decision point in itself. This allows predictions to be made in the absence of data. Following training, the model was evaluated on the independent validation dataset and performance metrics were calculated, including the confusion matrix, accuracy, balanced accuracy and F1 score. Each of these is extracted at the class level. The model was then used to predict across all segments in England.
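To show what treating missing data as a decision point can mean in practice, the sketch below evaluates a single tree split twice, once sending missing values to the left branch and once to the right, and keeps whichever direction gives the lower Gini impurity. It is a simplified illustration of the general idea, not the library's implementation; the function names, the threshold and the toy rows are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, threshold, missing_goes_left):
    """Weighted impurity of a split on `value < threshold`, routing None as instructed."""
    left, right = [], []
    for value, label in rows:
        go_left = missing_goes_left if value is None else value < threshold
        (left if go_left else right).append(label)
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

def best_missing_direction(rows, threshold):
    """Choose the branch for missing values that minimises the split impurity."""
    left_score = split_impurity(rows, threshold, True)
    right_score = split_impurity(rows, threshold, False)
    return ("left", left_score) if left_score <= right_score else ("right", right_score)

# Toy segments: (predictor value or None where cloud-masked, habitat label).
rows = [(0.1, "Water"), (0.2, "Water"), (None, "Water"),
        (0.7, "Woodland"), (0.8, "Woodland"), (None, "Water")]
direction, score = best_missing_direction(rows, threshold=0.5)
```

In this toy data the segments with missing values are all Water, so routing them with the low-value branch yields a perfectly pure split; the absence of data is itself informative.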

The reliability of predictive models is seldom considered beyond the overall accuracy metrics generated by the model, and without calibration. To aid data users and policy makers in using the dataset, an additional metric has been derived to describe the level of reliability for each modelled habitat prediction. The reliability scores for the modelled habitats were calculated by first extracting the modelled probability for each segment (Model_probs) output by the RF model. However, the modelled probabilities are only relative to each other and do not indicate the probability of a segment being correctly predicted. To align the modelled probabilities to the likelihood of a segment being correctly identified, the model probabilities are calibrated against the event rate (how often the ground data is correctly predicted by the model). For example, if segments with a modelled probability score of 0.7 were correctly predicted only 40% of the time based on the ground data, the new calibrated score would be 0.4. This better represents the level of confidence that the prediction is correct. Model probabilities were calibrated separately for each broad habitat class. The calibrated model scores were then converted to a categorical reliability score (Very Low to Very High).
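The calibration step can be illustrated with a simple binned approach: group validation predictions by modelled probability, measure how often each group was actually correct, and use that event rate as the calibrated score. The sketch below reproduces the 0.7 to 0.4 example from the text. The number of bins, the equal-width cut points and the intermediate category labels (Low, Medium, High) are assumptions for illustration; only the Very Low and Very High end points come from this record.

```python
def calibrate(validation, n_bins=5):
    """Event rate per probability bin.

    validation: list of (modelled_probability, was_correct) pairs for one
    habitat class. Returns one event rate per bin (None for an empty bin).
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for prob, correct in validation:
        b = min(int(prob * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(correct)
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_bins)]

def calibrated_score(prob, event_rates):
    """Look up the event rate of the bin that a modelled probability falls into."""
    n_bins = len(event_rates)
    return event_rates[min(int(prob * n_bins), n_bins - 1)]

def reliability(score):
    """Convert a calibrated score to a category (assumed equal-width bands)."""
    labels = ["Very Low", "Low", "Medium", "High", "Very High"]
    return labels[min(int(score * 5), 4)]

# Toy validation results: predictions scored 0.7 were right 4 times in 10,
# predictions scored 0.95 were right 9 times in 10.
validation = ([(0.7, True)] * 4 + [(0.7, False)] * 6
              + [(0.95, True)] * 9 + [(0.95, False)] * 1)
event_rates = calibrate(validation)
score = calibrated_score(0.7, event_rates)
category = reliability(score)
```

Because calibration is applied separately for each broad habitat class, a function like `calibrate` would be run once per class on that class's validation predictions.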

4.1.2 - Phase

Production

4.1.3 - Maintenance

The model code is reviewed regularly and shared with external users. When issues are flagged with the scripts, these are reviewed by a senior analyst. The model is then further reviewed and applied at each data production cycle, every two years.

4.1.4 - Models

Distributed random forest classification model.

Tier 2 - Model Specification

4.2.1 - Model name

Distributed random forest classification model

4.2.2 - Model version

H2O v.3.44.0.2

4.2.3 - Model task

Object-based habitat classification - to predict the likely habitat present within each polygon of the Living England spatial framework.

4.2.4 - Model input

This uses habitat labels from the LE ground database and a number of model predictors including satellite imagery and indices from Sentinel-1 and Sentinel-2, Lidar derived topographic data, soils and geology data, and climatic datasets.

4.2.5 - Model output

Model habitat predictions and scores for each segment, model performance metrics, trained model, calibrated model probability scores.

4.2.6 - Model architecture

Distributed random forest classification model deployed through the R h2o library. The model script also includes functions for removing highly correlated predictor variables, splitting out 20% of the data for model evaluation, classwise weighting and tuning of model hyperparameters. It uses a k-fold approach in which 5 models are trained, each using a different subset of validation data (a further 20%), and model stability statistics are then calculated. The best trained k-fold model is taken forward as the national model to predict results across England. This model is then evaluated against the independent held-out evaluation subset of the ground database.
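The k-fold rotation described above can be sketched as follows. To keep the example self-contained, a toy nearest-centroid classifier stands in for the random forest, and the standard deviation of the fold scores stands in for the model stability statistics; both substitutions, along with all names and data, are assumptions made for illustration.

```python
import random
import statistics

def k_folds(indices, k=5, seed=1):
    """Shuffle the indices and deal them into k disjoint validation folds."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Toy stand-in for the forest: one mean predictor value per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: statistics.mean(vals) for label, vals in groups.items()}

def predict(model, x):
    """Assign the class whose centroid is nearest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k=5):
    """Train k models, each validated on a different fold; return the best one."""
    folds = k_folds(list(range(len(xs))), k)
    results = []
    for i, val_idx in enumerate(folds):
        val = set(val_idx)
        train_idx = [j for j in range(len(xs)) if j not in val]
        model = fit_centroids([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        score = accuracy(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx])
        results.append((score, i, model))
    # Highest validation score wins; ties go to the earliest fold.
    best_score, _, best_model = max(results, key=lambda r: (r[0], -r[1]))
    spread = statistics.pstdev([r[0] for r in results])  # simple stability measure
    return best_model, best_score, spread

# Toy training data: one predictor, two well-separated classes.
xs = [0.1 + 0.01 * i for i in range(10)] + [0.7 + 0.01 * i for i in range(10)]
ys = ["Water"] * 10 + ["Woodland"] * 10
best_model, best_score, spread = cross_validate(xs, ys, k=5)
```

The selected model would then be scored once more against the separate held-out evaluation set, which plays no part in this rotation.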

4.2.7 - Model performance

Classwise performance and confusion matrix analysis was undertaken for Living England 2022-23 and is available in the Technical User Guide; it will also be undertaken and published alongside the 2024-25 model. Living England 2022-23 reports an overall accuracy for modelled habitats of 87%; however, this varies between classes and regions across England. Classwise performance is the breakdown of the random forest model performance for each habitat class, derived from applying the model to an independent hold-out dataset of 20% of the Living England ground dataset not used during model training. This gives an indication of the accuracy of the predictions for each broad habitat class across England (sensitivity, specificity, F1 score, prevalence, detection rate, detection prevalence). The confusion matrix demonstrates the ability of the model to predict each habitat class, displaying the number of correct and incorrect predictions for each class.
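For readers unfamiliar with these metrics, the sketch below shows how each can be computed from a confusion matrix using the standard one-versus-rest definitions. The matrix values and class names are invented for the example and do not represent Living England results.

```python
def classwise_metrics(confusion, labels):
    """Overall accuracy plus one-vs-rest metrics for each class.

    confusion[i][j] = number of segments with true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    out = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(row[k] for row in confusion) - tp
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
        spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
        prec = tp / (tp + fp) if tp + fp else 0.0   # precision
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        out[label] = {
            "sensitivity": sens,
            "specificity": spec,
            "precision": prec,
            "f1": f1,
            "prevalence": (tp + fn) / total,
            "detection_rate": tp / total,
            "detection_prevalence": (tp + fp) / total,
            "balanced_accuracy": (sens + spec) / 2,
        }
    overall = sum(confusion[k][k] for k in range(len(labels))) / total
    return overall, out

# Invented hold-out results for three classes (rows = truth, columns = prediction).
labels = ["Grassland", "Woodland", "Water"]
confusion = [
    [40, 8, 2],
    [5, 30, 0],
    [1, 0, 14],
]
overall, metrics = classwise_metrics(confusion, labels)
```

A single overall accuracy figure can hide large differences between classes, which is why the per-class breakdown is reported alongside it.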

4.2.8 - Datasets

Living England ground database, Sentinel-2 seasonal mosaics for Spring, Summer and Autumn, Sentinel-1 backscatter and coherence mosaics for Spring, Summer and Autumn, Lidar derived DTM, Slope, Aspect, canopy height model, Sentinel-2 NDVI and NDWI seasonal composites (Spr, Sum, Aut), Lidar derived Topographic Wetness Index (TWI), annual and seasonal average rainfall for 2 and 20-year periods, annual and seasonal min, max and mean temperature for 2 and 20 year periods, proximity to coastal, urban and water Ordnance Survey features, soil parent material carbonate content, grain size, European Soil Bureau (ESB) group, soil texture, soil thickness, soilscapes group and wetness; and bedrock geology.

4.2.9 - Dataset purposes

The LE ground database was combined with zonal statistics of the predictor variables (all the other datasets listed) for each individual segment (min, mean, max, sd). These data were then subset into training and testing datasets, and then further subdivided into training and validation datasets to train the k-fold models.
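As a rough illustration of the zonal statistics step, the sketch below summarises pixel values by segment into the four statistics named above. The input layout (a flat list of segment and value pairs), the handling of masked pixels as None and the use of the sample standard deviation are assumptions; in practice this step would be run with geospatial tooling over raster and polygon data.

```python
import statistics

def zonal_stats(pixels):
    """Summarise pixel values per segment as min, mean, max and sd.

    pixels: iterable of (segment_id, value); value may be None where the
    pixel is masked (e.g. by cloud). Segments with no valid pixels get None.
    """
    by_segment = {}
    for seg, value in pixels:
        by_segment.setdefault(seg, [])
        if value is not None:
            by_segment[seg].append(value)
    out = {}
    for seg, vals in by_segment.items():
        if not vals:
            out[seg] = {"min": None, "mean": None, "max": None, "sd": None}
        else:
            out[seg] = {
                "min": min(vals),
                "mean": statistics.mean(vals),
                "max": max(vals),
                # Sample standard deviation; 0.0 when only one pixel is valid.
                "sd": statistics.stdev(vals) if len(vals) > 1 else 0.0,
            }
    return out

# Toy NDVI pixels: segment A is clear, segment B is entirely cloud-masked.
pixels = [("A", 0.2), ("A", 0.4), ("A", 0.6), ("B", None), ("B", None)]
stats = zonal_stats(pixels)
```

Segments left with no valid pixels keep empty statistics, which is the kind of gap the model's missing-value handling is then expected to deal with.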

Tier 2 - Data Specification

4.3.1 - Source data name

LE ground database, Sentinel-2 seasonal mosaics for Spring, Summer and Autumn, Sentinel-1 backscatter and coherence mosaics for Spring, Summer and Autumn, Lidar derived DTM, Slope, Aspect, canopy height model, Sentinel-2 NDVI and NDWI seasonal composites (Spr, Sum, Aut), Lidar derived Topographic Wetness Index (TWI), annual and seasonal average rainfall for 2 and 20-year periods, annual and seasonal min, max and mean temperature for 2 and 20 year periods, proximity to coastal, urban and water OS features, soil parent material carbonate content, grain size, European Soil Bureau (ESB) group, soil texture, soil thickness, soilscapes group and wetness; and bedrock geology.

4.3.2 - Data modality

Geospatial data

4.3.3 - Data description

Listed in the Living England 2022-23 Technical User Guide.

Sentinel-2 multispectral imagery is high-resolution, wide-swath data collected by the Sentinel-2 mission’s Multispectral Instrument (MSI). Sentinel-1 backscatter imagery is a measurement of how much microwave radiation is reflected back to a sensor from the surface of the ground (and can measure land use change and land deformation). The Environment Agency Integrated Height Model (IHM) models floodplains, river channels, and pipe networks. Other data relate to geological information, district vectormaps, saltmarsh extent and zonation, crop maps, and climatic variables.

4.3.4 - Data quantities

69069 labelled ground data points

4.3.5 - Sensitive attributes

N/A

4.3.6 - Data completeness and representativeness

Where there are missing attributes in the data (e.g. cloud cover over satellite imagery for segments), these are handled by the model using the ‘missing in attribute’ functionality of the model architecture, which allows a modelled prediction to be provided for the segment despite the gap. In the technical guide we clearly state the limitations of the method and the biases inherent in the spatial data, as well as a disclaimer for the final output dataset.

4.3.7 - Source data URL

Living England is available here https://www.data.gov.uk/dataset/e207e1b3-72e2-4b6a-8aec-0c7b8bb9998c/living-england-habitat-map-phase-4#licence-info

Source data for input datasets are available via: https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR, https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S1_GRD, https://scihub.copernicus.eu/, https://www.bgs.ac.uk/datasets/bgs-geology-50k-digmapgb/, https://osdatahub.os.uk/downloads/open/VectorMapDistrict, https://data.gov.uk/dataset/0e9982d3-1fef-47de-9af0-4b1398330d88/saltmarsh-extent-zonation, https://data.gov.uk/dataset/8c5b635f-9b23-4f32-b12a-c080e3f455d0/crop-map-of-england-crome-2019, https://worldclim.org/

4.3.8 - Data collection

Input data (earth observation / satellite data) are drawn from publicly accessible datastores including from the European Space Agency, Environment Agency, British Geological Survey, Ordnance Survey, and Rural Payments Agency. These are collected for a number of purposes according to the licensing body, including geospatial analysis and modelling (the current use).

Training data include habitat records which have been identified, collated and processed for use in Living England. Engagement with internal and external habitat and monitoring specialists has confirmed the data are suitable for use in Living England.

4.3.9 - Data cleaning

Data cleaning / preparation was undertaken prior to processing. This includes work to facilitate the acquisition of cloud-free image mosaics (masking using an automated cloud and cloud shadow masking algorithm (Batic, 2018)), filter to reduce radar speckle, and produce average coherence maps. The training dataset was validated to ensure the quality of the training points used to inform the Random Forest model, including data reduction and statistical validation. The Technical User Guide has full details.

All data used in the tool are used under specific data licences between Natural England and the data owner or under the Open Government Licence. The Technical User Guide has full details.

Attribution statement: © Natural England 2024. Contains: OS data © Crown copyright and database rights 2023 OS AC0000851168; Natural England Licence No. 2011/052 British Geological Survey © NERC. All rights reserved; © Environment Agency 2023. All rights reserved; © Rural Payments Agency 2022; NERC EDS Environmental Information Data Centre; National Plant Monitoring Scheme and survey data (2015-2023) organised and funded by the UKCEH, BSBI, Plantlife and JNCC, indebted to all volunteers who contribute data to the scheme; Modified Copernicus Sentinel data 2023; © Forestry Commission 2022; Soils Data © Cranfield University (NSRI) and for the Controller of HMSO 2005; © Carlos Bedson & Manchester Metropolitan University 2019; British Geological Survey materials © UKRI 2016; HadUK-Grid data © Met Office 2018; Modified Copernicus Climate Change Service information 2023; © Bluesky International Ltd 2024; Map services and data available from U.S. Geological Survey, National Geospatial Program; © Department for Energy Security and Net Zero; © OpenStreetMap 2024.

4.3.11 - Data access and storage

This is stored internally and accessible to the Living England team in Natural England. This is stored on an internal shared drive, Internal ArcGIS online Field maps application database and an AWS S3 bucket held and managed within Defra’s Cloud Centre of Excellence.

Tier 2 - Risks, Mitigations and Impact Assessments

5.1 - Impact assessment

N/A

5.2 - Risks and mitigations

RISK: Lack of accessibility DESCRIPTION: Issues with the digital provision of the tool may lead to the public not being able to access it, or using a previous version which is less accurate. MITIGATIONS: Include reviewing the digital service during the two-yearly model update, ensuring effective labelling of the model products and clear communications around the model, and effective management of the tool / data assets.

RISK: Incorrect use DESCRIPTION: Users are unsure about how to use the tool and/or use it for purposes it is not suited to. MITIGATIONS: Detailed Technical User Guide published alongside the tool which describes how the model was generated, the underlying data, and accuracy of the model.

RISK: Model inaccuracy DESCRIPTION: Modelling is inaccurate and incorrectly describes habitats, leading to poor decision-making. MITIGATIONS: Model is assessed according to accuracy metrics and these are reported so that users are aware. (Living England 2022-23 reports an overall accuracy for modelled habitats of 87%.) Outputs are reviewed by a team of experts.
