Cost-benefit awareness tool
Published 7 November 2024
Purpose of this tool
This toolkit is designed to support organisations considering adopting emerging privacy enhancing technologies (PETs). PETs can be adopted across different sectors and by organisations of different sizes. However, the potential of these technologies has not yet been fully realised, with adoption currently limited to a relatively small number of use cases.
This resource provides information about some of the costs and benefits associated with the adoption of these technologies. It is designed for use by individuals within organisations, such as data officers, data architects and data scientists, as well as business unit owners assessing the opportunities that adopting these technologies may bring. It explores key areas that organisations looking to adopt PETs may wish to consider when assessing technical options or making a business case for a project. It does not attempt to quantify costs and benefits, as they are highly context- and use-case-specific.
This resource has been created by the Responsible Technology Adoption Unit (RTA) in the UK government’s Department for Science, Innovation and Technology (DSIT), in partnership with the Information Commissioner’s Office (ICO). It is intended to assist organisations in making well-informed decisions about the use of emerging PETs but is not a statement of formal government policy or regulatory guidance. This document is intended to offer suggestions as to how organisations can make use of emerging PETs. This document is not legal advice. Should you require legal advice, you should seek this from independent legal advisors.
Introduction
What are privacy enhancing technologies?
A privacy enhancing technology (PET) is a technical method that protects the privacy or confidentiality of sensitive information. This term covers a broad range of technologies including more traditional PETs and more novel, emerging PETs.
Traditional PETs are more established privacy technologies. They include encryption schemes, which secure information during transmission and in storage; de-identification techniques such as tokenisation, which replaces sensitive data with unique identifiers; and generalisation, which removes specific details to reduce data sensitivity.
This toolkit focuses on emerging PETs which are comparatively novel solutions to privacy challenges in data-driven systems. Whilst there is no fixed definition of emerging PETs, this toolkit primarily considers the following technologies:
- homomorphic encryption (HE): a method of encryption that enables computation directly on encrypted data.
- trusted execution environments (TEEs): a secure area within a processor that runs alongside the main operating system, isolated from the main processing environment. Also known as secure enclaves.
- multi-party computation (MPC): cryptographic protocols that enable multiple parties to share or collaborate to process data without disclosing details of the information each party holds.
- synthetic data: artificial data generated to preserve the patterns and statistical properties of an original dataset on which it is based.
- differential privacy: a formal mathematical approach to ensuring data privacy, which works by adding noise to either input data, or to the output it produces.
- federated analytics: processing data in a decentralised manner to produce analysis or carry out machine learning, often used alongside combinations of the technologies listed above.
Background to this toolkit
PETs can be utilised to support a wide and increasing range of use cases across many sectors (See our Repository of PETs use cases).
This toolkit is structured around a high-level use case: using privacy-preserving federated learning to enable the training of machine learning models without sharing data directly.
This use case focuses on a subset of federated analytics, known as federated learning, layered with other PETs to increase both input privacy (protecting raw data during the processing stage in training a machine learning model) and output privacy (protecting data that is shared or released after processing). The combination of federated learning with other PETs is often referred to as privacy-preserving federated learning (PPFL).
We use this PPFL use case to structure this guide, as it involves a range of relevant PETs, and provides a concrete basis to frame potential costs and benefits against a clear baseline. This type of use case was the focus of the UK-US PETs Prize Challenges in 2022-23, and in our work designing those challenges we identified PPFL use cases as having potential to improve data collaboration between organisations and across borders, without compromising on privacy. However, the analysis in this document remains relevant to other deployments of the same emerging PETs in related contexts.
Alongside this tool, we have produced a checklist to support organisations considering PETs in ensuring they have addressed the impacts outlined in this document.
Navigating this toolkit
Section 1 examines the costs and benefits of federation, i.e. training a model while the data remains distributed across different locations or organisations, which is integral to our PPFL use case.
The following sections (Sections 2 and 3) discuss the costs and benefits incurred by layering other PETs at different points in this solution. They consider the deployment of additional PETs to two ends: improving input privacy (Section 2) and improving output privacy (Section 3). These terms are explained below.
Different sections of this document may be more useful and relevant than others to certain readers depending on their intended use case.
- readers interested in federated analytics or federated learning (without additional input and output privacy techniques) should read this introduction and Section 1.
- readers interested in PPFL should read this document in its entirety.
- readers interested in approaches to improve input privacy (or any of homomorphic encryption, trusted execution environments, and multi-party computation) should read Section 2.
- readers interested in approaches to improve output privacy (or either of differential privacy or synthetic data) should read Section 3.
The remainder of this section introduces federated analytics, federated learning and PPFL, technologies which enable the use case assessed throughout this document. This section then introduces a baseline solution, that uses more traditional methods, to provide a point of comparison to our PPFL solution throughout the rest of the document.
Input and output privacy
Input privacy focuses on protecting raw data throughout the processing stage. Effective input privacy ensures that no party can access or infer sensitive inputs at any point. This protection may involve:
- preventing unauthorised access: ensuring that all processing of data is conducted without any party being able to access or infer the original raw data. This involves a combination of access controls and protection against indirect inference attacks.
- offensive security considerations: anticipating and countering potential offensive security techniques that adversaries could employ to gain unauthorised access to a system. This includes defending against attacks that leverage observable systemic changes such as timing or power usage.
- proactive attack countermeasures: utilising robust defensive techniques and methodologies, including quality assurance cycles and rigorous red-teaming exercises (red-teaming, also used in the UK and US PETs Prize Challenge, 2022-2023, is a process in which participants known as ‘red teams’ deliberately simulate attacks that might occur in the real world to rigorously test the strength of solutions created by others), to proactively minimise attack vectors. These measures can help identify and mitigate potential vulnerabilities that could be exploited through the likes of side-channel attacks.
Input privacy may be improved by stacking a range of PETs and techniques across a federated solution. The PETs and techniques encompassed by such approaches can be hardware-based and/or cryptographic, and are often viewed as synonymous with security itself. For more information on input privacy see Section 2: Input Privacy Considerations.
Output privacy is concerned with improving the privacy of outputted data or models. Protecting processed data is important to prevent potential privacy breaches after data has been analysed or used to train models. Key considerations include:
- implementing output-based techniques: techniques which add random noise to the training process of models, such as differential privacy, can be particularly effective for ensuring that training data, or subsets thereof, cannot be extracted at a later stage. This approach can help to protect data even when a model is shared or deployed.
- balancing privacy with model performance: techniques like differential privacy can affect a model’s performance, including accuracy. The trade-offs between privacy and performance should be carefully examined, considering factors such as the size of the model and the significance of accuracy relative to the specific research question.
Output privacy may be improved by effectively implementing a range of PETs and techniques across a federated solution. For more information on output privacy see Section 3: Output Privacy Considerations.
Federated analytics and learning
Federated analytics is a technique for performing data analysis or computations across decentralised data sources. It enables organisations to use data that cannot be directly shared. Local data from multiple sources is used to inform a global model or perform complex analysis, using federated approaches without sharing the actual data itself. After data is processed locally, the results of this processing are aggregated (either at a global node or between local nodes).
In this toolkit we define federated learning as a subset of federated analytics. Federated learning involves training a machine learning model on datasets distributed across multiple nodes. This approach uses model updates from many local models to improve a central or global model. Nodes transfer updated model parameters based on training conducted on locally held data, rather than the actual data itself. This allows for the training of a model without the centralised collection of data.
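To make the mechanics concrete, the sketch below shows a minimal federated averaging round in Python. It is illustrative only: the linear model, single local gradient step and unweighted averaging are simplifying assumptions, not part of this toolkit, and real deployments would use a mature federated learning framework.

```python
# Minimal federated averaging sketch (illustrative; model and data shapes are assumptions).
# Each node trains locally and returns only a parameter update; the central node averages them.
import numpy as np

def local_update(global_weights, local_X, local_y, lr=0.1):
    """One round of local training: a single gradient step on a linear regression model."""
    preds = local_X @ global_weights
    grad = local_X.T @ (preds - local_y) / len(local_y)
    return global_weights - lr * grad          # updated local weights, not the raw data

def federated_round(global_weights, nodes):
    """Aggregate local updates by simple (unweighted) averaging at the central node."""
    updates = [local_update(global_weights, X, y) for X, y in nodes]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
nodes = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]  # 4 data holders
weights = np.zeros(3)
for _ in range(20):                             # repeated rounds of federated training
    weights = federated_round(weights, nodes)
print(weights)
```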
Example 1: Federated analytics for statistical analysis
A healthcare organisation looks to collaborate with universities and counterparts across countries to analyse trends in disease outbreaks. The organisation develops data pipelines to partners’ locally stored data. Through these pipelines, the organisation can send requests for data analysis.
The analysis is performed locally, without the healthcare organisation having access to the dataset. The output of this analysis is then returned to the organisation, which aggregates the results from all partners.
Example 2: Federated learning for training a model
A technology organisation wants to improve the accuracy of a voice recognition system without collecting their users’ voice data centrally. The organisation creates an initial model trained on a readily available data set, which is then shared to users’ devices. This model is updated locally based on a user’s voice data.
With user consent, the local models are uploaded to a central server periodically, without any of the users’ individual voice data ever leaving their device. The central model is continually iterated using the local models collected from repeated rounds of localised training on users’ devices. This updated central model is then shared to users’ devices and this training loop continues.
Privacy preserving federated learning (PPFL)
Layering additional PETs on top of a federated learning architecture is often referred to as privacy-preserving federated learning (PPFL). Use of additional PETs on top of federated learning can improve input and/or output privacy.
Figure 1: Example of a PPFL solution, illustrating a multi-step process where PETs are strategically implemented to enhance privacy across a federated network
The mechanisms behind many of these approaches and combinations will be discussed in more detail at a later stage. This comprehensive approach ensures that from data input to model deployment, every step provides a degree of privacy protection, safeguarding against unauthorised data exposure and enhancing trust in the federated learning process.
The accompanying explanations also serve as a reference point for technical options or considerations for how to deploy these technologies practically in a specific use case. This is intended to illustrate an indicative approach to how these technologies can be usefully deployed, not a definitive guide as to the only correct way of doing so, nor a specific endorsement of these techniques as better than other potential approaches.
[1] Database structures at local nodes
1.a) Trusted execution environment (TEE) and federated learning:
Implementation: TEEs can be used to create a secure local environment for each node participating in federated learning. This ensures that intermediate computations on local data are securely isolated within a server enclave. While federated learning inherently prevents other parties from accessing local raw training data by sharing only model weight updates, TEEs add an additional layer of security.
TEEs can be used to create a secure local environment that further protects the computations and model updates from potential tampering or unauthorised access, even within the local device. This can be particularly useful in scenarios where there is a heightened risk of local attacks or when additional hardware-based security is required.
Interaction: Local model training for federated learning can occur within TEEs. In such a process, only model updates (not raw data) are sent to the central/global node (node [2] connected to database structures). This provides an additional layer of security while benefiting from collective learning.
1.b) Homomorphic encryption (HE) and multi-party computation (MPC):
Implementation: HE enables computations to be performed directly on encrypted data, ensuring that sensitive data remains protected even during processing. This prevents any party from accessing the unencrypted data, thereby enhancing privacy.
MPC allows multiple parties to collaboratively compute a function over their inputs while keeping those inputs private from each other.
By leveraging TEEs, HE, or MPC, organisations can carry out secure computations without revealing sensitive data, providing an additional layer of privacy that complements the inherent protections of federated learning.
Interaction: HE ensures that data remains encrypted during transmission and computation, while MPC allows these encrypted results to be combined securely at the global node or between databases, enhancing both input and output privacy. This combination of techniques helps to protect against inference attacks, and is particularly applicable in scenarios requiring collaborative analytics, allowing for collective computations that are secure and private.
1.c) Synthetic data generation:
Implementation: Synthetic data generation involves creating artificial datasets that replicate the statistical properties of real datasets. This synthetic data can be used for initial model training and testing without exposing sensitive information, making it valuable for scenarios where data privacy is a specific concern due to the involvement of especially sensitive information. If the synthetic data is well-crafted and does not contain any identifiable information, it generally does not require additional privacy techniques. However, in cases where there is a concern that the synthetic data could be correlated with external data to infer sensitive information, techniques like differential privacy can be applied to add an extra layer of protection.
Interaction: Synthetic data can be utilised to safely conduct experiments, validate models, or train machine learning systems without risking the exposure of real, sensitive data. The interaction between the synthetic dataset and the machine learning models or analytical tools remains similar to that of real data, allowing for accurate testing and development.
In situations where sensitive information could be inferred, the use of differential privacy or other privacy-preserving techniques ensures that even if the synthetic data is accessed by unauthorised parties, the risk of re-identification remains minimal.
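As an illustration of the basic idea (not a recommended production approach), the sketch below generates synthetic records by fitting a simple multivariate Gaussian to real numeric data and sampling from it; the two-column dataset and its values are hypothetical. Real synthetic data generators are typically far more sophisticated, and may add differential privacy as described above.

```python
# Illustrative sketch of simple synthetic data generation: fit a multivariate Gaussian
# to the real data and sample new records with the same mean and covariance structure.
import numpy as np

def gaussian_synthetic(real_data, n_samples, seed=0):
    """Generate synthetic rows that preserve the mean and covariance of real_data."""
    rng = np.random.default_rng(seed)
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

rng = np.random.default_rng(1)
real = rng.normal(loc=[50, 120], scale=[10, 15], size=(1000, 2))   # e.g. two numeric fields
synthetic = gaussian_synthetic(real, n_samples=1000)
print(real.mean(axis=0), synthetic.mean(axis=0))    # statistical properties are similar
```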
[2] Central Global node (aggregator for federated learning)
2.a) Federated learning with differential privacy:
Implementation: Implementing differential privacy techniques at the aggregator, to add noise to the aggregated model updates, enhances the privacy of the model by making it harder to trace back to individual contributions.
Interaction: Combining federated learning with differential privacy ensures that even if the aggregated model is exposed, there is a lower risk of the privacy of individual data sources being compromised.
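A minimal sketch of this idea is shown below, assuming a NumPy-style aggregator: each node's update is clipped to bound its influence, the clipped updates are averaged, and Gaussian noise calibrated to that bound is added before the aggregated update is released. The clipping norm and noise multiplier are illustrative values, not recommendations.

```python
# Illustrative differentially private aggregation sketch (values are assumptions):
# clip each node's update, average, then add calibrated Gaussian noise so that no
# single contribution can easily be traced in the released model update.
import numpy as np

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    rng = np.random.default_rng(seed)
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))  # bound each update's influence
    mean = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(updates)             # noise scaled to per-update sensitivity
    return mean + rng.normal(0.0, sigma, size=mean.shape)

rng = np.random.default_rng(2)
node_updates = [rng.normal(size=5) for _ in range(10)]
print(dp_aggregate(node_updates))
```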
2.b) Federated learning with synthetic data:
Implementation: Synthetic data can be used to initially baseline and validate a machine learning model during the development phase. This approach allows for early testing and adjustment of model architecture using data that mimics real datasets without exposing sensitive information. Once the model is confirmed to be functioning as intended, it should then be trained further with real data to ensure accuracy and effectiveness before being deployed to local nodes for federated learning.
Interaction: Federated learning can utilise synthetic data for calibration and testing under various conditions, ensuring robustness before deploying the model with real user data. During the interaction phase, the model can be tested and refined using synthetic data, which helps to establish a solid foundation whilst reducing the risk of privacy breaches. However, it is essential to note that the model should not be pushed to local nodes for final training if it has only been trained on synthetic data. Instead, the model should undergo additional training with real data to ensure it performs accurately in real-world scenarios before deployment across the federated network.
[3] Connections between nodes
Federated learning and HE:
Implementation: HE might be used to encrypt the model updates as they are transmitted between nodes. These updates, while originally derived from the data, are no longer the raw data itself but rather parameter updates that represent learned patterns. Encrypting these updates ensures that even as they are aggregated and processed, the underlying data patterns remain protected from potential inference attacks.
Interaction: During the interaction phase, model updates are securely transmitted between nodes using HE. These updates are no longer the raw data but encrypted representations of the model’s learned parameters. This encryption ensures that while the updates are aggregated to refine the global model, they remain secure and inaccessible, protecting the privacy of the underlying data.
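The sketch below illustrates this pattern using the open-source python-paillier (`phe`) package, an additively homomorphic scheme. The choice of library, the reduced key size and the single-parameter "updates" are assumptions for illustration only; in practice the decryption key would be held by the data owners (or split between them), not by the aggregator.

```python
# Illustrative sketch using python-paillier, an additively homomorphic scheme.
# Nodes encrypt their model updates; the aggregator sums ciphertexts without ever
# seeing the plaintext updates; only the key holder can decrypt the aggregate.
from phe import paillier

# Small key size for speed in this illustration only; not secure for production use.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Each node's (toy, single-parameter) model update, encrypted locally before transmission.
local_updates = [0.12, -0.05, 0.08]
encrypted_updates = [public_key.encrypt(u) for u in local_updates]

# The aggregator adds ciphertexts and rescales - it never decrypts individual updates.
encrypted_sum = sum(encrypted_updates[1:], encrypted_updates[0])
encrypted_mean = encrypted_sum * (1.0 / len(encrypted_updates))

# Only the holder of the private key (e.g. the data owners) can recover the mean update.
print(private_key.decrypt(encrypted_mean))   # ~0.05
```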
[4] Federated learning network
Model consolidation: Implementation: This section represents the consolidated output of a federated learning process: a fully trained model that integrates insights derived from all participating nodes.
Interaction: The model, now optimised and refined through aggregated updates, embodies the collective intelligence of the decentralised network while maintaining the privacy of the underlying data.
[5] End-user devices (Client-side)
TEE and synthetic data on client devices:
Implementation: TEEs can be employed on client devices to securely process data and use synthetic data to simulate user interactions without risking exposure of real data. Synthetic data could be generated from real data or anonymised versions of a dataset.
Interaction: TEEs ensure that even if a device is compromised, the processing of sensitive data (real or synthetic) remains secure.
Baseline for comparison
When assessing the costs and benefits of adopting PETs, it is useful to compare them with those of alternative methods.
In this PPFL example, a useful baseline for comparison is the training of an equivalent model on centrally collated data. The data is assumed to be collected by the organisation, originating from different entities and containing personal or sensitive information.
Figure 2: Example of baseline centralised data processing model
Section 1: costs and benefits of federated learning
This section examines the costs and benefits associated with implementing a federated approach to machine learning, without additional privacy protections. It explores technical, operational, legal and longer-term considerations, weighed against the comparative costs and benefits of implementing the baseline scenario described above.
Although some of the specifics refer to machine learning, the considerations outlined in this section are also applicable to federated analytics more broadly.
Later sections explore how additional PETs can be layered around this solution to enhance privacy.
Technical considerations
Data storage considerations
The baseline scenario requires organisations to set up and maintain a larger central database. Bringing together a large volume of data in one domain requires strong security protections; in many contexts this aggregation may raise the level of security required, as the impact of data loss would be greater and the threat model may be enhanced. If the data is gathered from multiple data-owning organisations, the central database may need to comply with multiple sets of security requirements.
Costs include implementing appropriate data governance and security mechanisms to protect and secure sensitive data, as well as the ongoing operational costs of maintaining them. Such an aggregation often leads to a level of inflexibility, with any change in the central platform needing to be signed off by multiple data controllers or processors.
For federated learning, some of these costs may be lower. Most data will remain close to its source and will be processed locally, minimising the need for large central databases and reducing the risk from a single data breach or leak, and leaving control over individual data sets with the organisations that own them. It may also serve as a more efficient solution, in contrast with a centralised setup where copies of all the data from local sources will need to be made and processed centrally. Federated learning eliminates the need to duplicate data. However, other factors will need to be taken into consideration to assess this fully in an organisation’s specific context, for example, the readiness of existing data storage infrastructure.
| Baseline | PPFL |
| --- | --- |
| Implement once in one place. | Retain local control over data. Minimise risk from data breach by minimising aggregation. Avoid aggregation of security requirements from multiple organisations, and inflexibility due to complex multi-org governance regimes. |
Compute considerations in federated learning
In the baseline scenario, a substantial quantity of data will need to be processed centrally. This means that all the heavy lifting in terms of computation happens centrally, leading to high central computational costs.
By contrast, federated learning involves training models at local participant nodes before aggregating them centrally. This reduces the computational load on the central server and distributes the cost of data processing across the federated network. Therefore, the central computational costs are likely to be reduced significantly compared to the baseline scenario. In scenarios where the nodes are individual user devices, this may save costs centrally but could shift the burden and implicit cost to less capable devices, e.g. mobile phones. In multi-organisational setups, each participant bears part of the computational cost, which can lead to differing views on the cost-benefit ratio.
In addition to shifting the computational burden to less capable devices, federated learning can also introduce connectivity-related dependencies that may affect performance and reliability. Devices with intermittent or poor network connections may struggle to participate effectively in the federated learning process, potentially delaying model updates or causing inconsistencies in the global model. Furthermore, increased reliance on network connectivity can lead to higher latency and potential data synchronisation issues, which may degrade the overall efficiency of the learning process. These connectivity challenges must be carefully managed to ensure that all participating devices can contribute effectively without compromising the integrity of the federated model. The above impacts will be more significant for more computationally complex tasks. For example, the computational overhead of training a machine learning model is much greater than that of performing simpler analyses.
While federated learning likely reduces the computational overhead on central servers, it does not necessarily reduce the overall computational need. In fact, when summing the total compute across all nodes, federated learning can have higher total computational cost due to inefficiencies and the need to process on many nodes independently.
The performance and efficiency of federated learning depends heavily on the computational power of local nodes, which may vary. User devices, such as mobiles, have limited computational power and battery life compared to the likes of organisational servers, and local computational power can significantly impact performance.
By contrast, devices with more computational power can locally process more complex models without the same constraints, at the expense of increased operational costs. Depending on the use case, it may be important to carefully assess and manage computational tasks and battery usage for federated learning on the likes of mobile devices. This consideration is more significant for more computationally intensive processes.
Layering in additional privacy-preserving techniques can significantly impact computational overhead. For more detailed information on the computational impacts of using additional PETs, see Section 2 and Section 3.
| Baseline | PPFL |
| --- | --- |
| Invest in advanced hardware in one location and use it efficiently. | Spread compute load among participants. |
Technical complexity
Data science and machine learning skills are in high demand in the market generally, often leading to higher salary costs and recruitment and retention challenges in either approach. However, there are some additional challenges in federated scenarios.
The additional complexity of running a federated learning process across several nodes requires blending understanding of data science and machine learning (ML), with increased expertise in how analytics code interacts with the infrastructure (compute, networking etc). In the centralised baseline scenario, most of this is typically abstracted away from the data scientist via a range of mature frameworks and software products. In a federated approach, the available frameworks are currently less mature, and fewer people have experience in deploying them in complex real-world scenarios. This has the potential to increase costs and risks in the short term; however, this is improving as federated approaches become more commonplace.
A federated approach can also add challenges for developers and data scientists. Often data science and machine learning require iterative exploration of the data, experimenting with different approaches and seeking to understand the outcomes. A federated approach, where the data is not directly available to the data scientist, can make this more challenging. This can also make troubleshooting issues more challenging.
Federated learning can also introduce potential complexities related to representation within datasets. For instance, when training models on data distributed across various sites, there may be significant differences in dataset characteristics, such as varying proportions of ethnic groups in medical datasets. These biases might not be visible to the coordinating server, necessitating additional efforts at the local level to ensure the data is suitable for federated training. This could involve extensive data preprocessing or detailed documentation to inform the unsighted coordinating party of potential biases, adding to the overall time and resource costs involved in the process.
Though this does represent an additional challenge for federated approaches, it is important to highlight that many of the same constraints might apply to a sensitive data set held centrally, where allowing a data scientist direct access to the raw data is either not possible at all, or only possible in highly constrained circumstances (e.g. a dedicated physical environment). Strategies such as using dummy data and automated validation processes can be effective approaches to counter the above challenges. These methods can help simulate potential issues and validate functionality in the absence of direct data access.
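As an illustration of the dummy-data strategy (the schema and field names below are hypothetical), a data scientist might generate schema-conformant records locally and run the same validation checks against them before submitting analysis code to the federated network:

```python
# Minimal sketch of generating schema-conformant dummy records, so analysis code can be
# developed and validated without direct access to the real data held at local nodes.
import random

SCHEMA = {
    "age": lambda r: r.randint(18, 90),
    "region": lambda r: r.choice(["north", "south", "east", "west"]),
    "blood_pressure": lambda r: round(r.gauss(120, 15), 1),
}

def dummy_records(n, seed=0):
    r = random.Random(seed)
    return [{field: gen(r) for field, gen in SCHEMA.items()} for _ in range(n)]

# Run the same validation checks locally on dummy data before deploying code
# against the real, unseen datasets.
sample = dummy_records(5)
assert all(18 <= rec["age"] <= 90 for rec in sample)
print(sample[0])
```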
Privacy, data protection and legal considerations
Federated approaches are inherently more private than traditional centralised processes. The decentralised nature of federated learning prevents data from being shared, which minimises opportunities for data leakages or breaches. As the data is distributed across a federated network, the risk of the entire dataset being compromised is usually lowered. By contrast, in a centralised approach an entire dataset can be accessed if the server is successfully attacked.
Despite the privacy and security benefits that can be derived from using federated learning compared to the baseline, use of federated learning on its own may, depending on the circumstances, not be sufficient to meet the requirements of the security principle of the UK GDPR.
For example, without additional PETs, federated learning may pose the risk of indirectly exposing data that is intended to be kept private that is used for local model training. This exposure can occur through model inversion, observing identified patterns (gradients), or other attacks like membership inference. This risk is present if an attacker can observe model changes over time, specific model updates, or manipulate the model. For more information on the risks of using federated learning without additional PETs, see the ICO guidance.
This risk of data breach exposes organisations to the risk of legal action from the ICO and/or data subjects, and associated fines; however, it should be noted that these risks should still be lower relative to centralised solutions.
While it is possible to use federated learning without additional PETs, organisations will likely find it more difficult to do so and to demonstrate that adequate measures have been taken to protect personal data (compared to implementing federated learning with additional PETs). This may mean that organisations fail to fulfil the requirement for solutions to demonstrate data protection by design and default.
Use of federated learning without additional PETs may leave open the risk of re-identification of individuals from the model’s outputs. Unauthorised re-identification of individuals could result in regulatory action from the ICO (this is discussed in more detail in Section 3). Deciding whether to use PETs for output privacy will depend on the nature and the purposes of the processing. Organisations will need to consider whether anonymous outputs are required, the size of the datasets, and the accuracy and utility required for the results of the analysis.
To mitigate many of these risks, organisations may wish to create a PPFL solution by layering multiple PETs. For more information on legal considerations associated with PETs that may be layered around a PPFL approach, see the sections linked below:
Section 2: Input privacy - legal considerations
Section 3: Output privacy and legal considerations
Although using federated learning without additional PETs will not protect data against all risks, federated learning could still provide benefits and improve security when compared to the baseline solution. In the event of an infringement, when considering whether to impose a penalty, the ICO will consider the technical and organisational measures in place in respect to data protection by design. Using federated learning may help to demonstrate proactive measures to reduce harm which may influence potential penalties favourably.
Data protection impact assessments (DPIA)
Use of privacy-preserving techniques may streamline the DPIA process, therefore potentially reducing legal costs. For example, by minimising the amount of personal data processed, federated learning solutions should inherently mitigate some privacy risks that would otherwise need to be accounted for.
Designing federated learning systems with data protection by design and default is likely to result in lowering inherent privacy risks. This could lead to the reduction of legal overhead involved in these assessments. For example, auditing the system could be less costly as compliance considerations would be ‘baked into’ the system design.
Distributing legal costs
In scenarios where multiple organisations participate in a federated learning project, shared legal resources or joint legal teams may help to distribute legal costs (if it is appropriate to do so in the circumstances) among the participating parties. However, it should also be noted that the use of unfamiliar emerging or novel technologies like federated learning can lead to protracted discussions between legal teams, as it may be challenging to reach a consensus on what constitutes sufficient security and privacy safeguards. To mitigate these challenges, establishing clear, standardised guidelines and best practices at the outset of the project can help streamline these discussions and reduce the time and cost involved in reaching agreements.
Federated learning reduces the risk of large-scale data breaches by keeping the data localised on devices rather than centralising it. Decentralised data handling minimises the attack surface and limits the impact of potential data breaches to individual nodes rather than the entire dataset.
The privacy-preserving nature of federated learning, avoiding the transferring of raw data, aligns with security best practices. This alignment can lead to a perception of reduced risk among insurers, potentially lowering premiums. However, this potential benefit is more likely to be realised if there are established, repeatable guidelines and standards for implementing federated learning securely. Without such standards, the variability in implementation could lead to inconsistent security outcomes, making insurers cautious.
Longer term considerations
Some of the benefits of the use of federated learning may only be fully realised in the longer term. Organisations should consider costs and benefits across the whole life cycle of a product/system, and this should include considering the wider opportunities that adopting federated learning could lead to in the future.
Longer term benefits of using federated learning include improving long-term efficiency when adding new data sources into model training, enabling the use of previously inaccessible data assets, and wider network effects as more successful federated analytics solutions are deployed (all explored in further detail below).
Integrating new data sources
| Baseline | PPFL |
| --- | --- |
| Adding new data sources may result in a bespoke or semi-bespoke approach in each instance. | Standardises the approach to integrating new data sources, simplifying this process. |
Federated learning can improve long-term system efficiency as it establishes - through system design - a method and structure for integrating insights from different data providers.
In the long term, federated learning could simplify the process of integrating new data sources, thereby continuously enhancing the central model. While federated learning inherently facilitates the addition of data sources from diverse locations without centralising data, it is important to acknowledge that a well-designed centralised system could also be structured to accommodate future data integration effectively.
The key difference lies in the approach: federated learning naturally supports incremental data integration with minimal disruption, whereas centralised systems require careful foresight and design to achieve similar flexibility.
Getting value from data assets
| Baseline | PPFL |
| --- | --- |
| Some data assets are unmonetisable due to privacy/commercial/IP concerns. | Value can be derived from previously inaccessible data sources through privacy preserving approaches. |
Generally, federated approaches offer wider opportunities to benefit from the value of protected and sensitive data. Data owners might be able to unlock value from data assets that would not previously have been accessible; the value of the data to the data owner is protected by enabling use of the data without full access to it. Depending on the context, this might correspond to direct monetary value for the data, or broader social or economic benefits.
PPFL offers additional security, allowing greater control and management over the full data, which can enhance input and output privacy. Only information about model updates is shared, which ensures the value of the actual data is preserved and can be utilised for further opportunities. In contrast, with federated learning alone, the lack of additional layers of PETs limits the extent to which the value of the data can be protected.
Network effects
| Baseline | PPFL |
| --- | --- |
| Limited network effects due to centralised data and processes. | More opportunities to benefit from federated learning as it becomes a more widely adopted approach. |
Adopters of federated learning may see benefits compound over time due to network effects from wider use and adoption of federated approaches. While federated learning is relatively nascent, over time, the deployment of - and collaboration through - more successful federated analytics-based solutions could encourage greater uptake of the approach.
Wider adoption of federated learning will create further opportunities for use of federated approaches across organisations and sectors, which will result in further opportunities for collaboration and to unlock greater value from data.
Section 2: input privacy considerations
Introduction
Whilst federated approaches offer improvements in privacy compared to our baseline scenario, organisations may wish to layer in additional PETs to ensure that no processing party can access or infer sensitive inputs at any point.
Greater levels of input privacy can be achieved by stacking additional PETs such as trusted execution environment (TEE), homomorphic encryption (HE) or secure multiparty computation (SMPC) into a federated solution. See Figure 1 for examples on where these PETs fit in a PPFL architecture. These PETs can also be used for a wide range of other use cases, and organisations looking to deploy these technologies will encounter similar costs and benefits.
This section expands on the different types of PETs that can improve input privacy in our PPFL use case, the costs and benefits associated with these approaches, and other use cases these PETs can enable.
Enabling technologies
Homomorphic encryption
Homomorphic encryption (HE) enables computation directly on encrypted data. Traditional encryption methods enable data to be encrypted whilst in transit or at rest but require data to be decrypted to be processed. HE enables encryption of data at rest, in transit, and in process. At no point are the processing parties able to access the unencrypted data, or to decrypt the encrypted data.
There are 3 forms of HE, each of which permit different types of operations:
- partial homomorphic encryption (PHE): permits only a single type of operation (e.g. addition) on encrypted data.
- somewhat homomorphic encryption (SHE): permits some combinations of operations (e.g. some additions and multiplications) on encrypted data.
- fully homomorphic encryption (FHE): permits arbitrary operations on encrypted data.
Unless stated otherwise, this resource uses HE to describe all forms of homomorphic encryption.
Figure 3: Homomorphic encryption
Example
A technology company running a password manager wants to monitor whether their users’ passwords have been compromised and leaked online. The organisation collects homomorphically encrypted versions of their users’ passwords and compares these passwords to lists of leaked passwords. The organisation can run these comparisons without being able to decrypt the users’ passwords. The organisation can then alert users if their passwords have been compromised, without ever actually having access to the users’ credentials.
Trusted execution environments
A trusted execution environment (TEE) is a secure area within a processor that runs alongside the main operating system, isolated from the main processing environment. It provides additional safeguards that code and data loaded inside the TEE are protected with respect to confidentiality and integrity. TEEs provide execution spaces which ensure that sensitive data and code are stored, processed, and protected in a secure environment. In practice, this means that even if the main processor or operating system is compromised the TEE remains secure.
This isolation prevents unauthorised access through a series of hardware-enforced controls. Additionally, the secure design of TEEs helps safeguard the broader processor system by containing any potentially malicious code or data breaches within the TEE itself. This containment ensures that threats do not spread to other parts of the system, thereby enhancing the overall security architecture and reducing the risk of widespread system vulnerabilities.
While TEEs provide strong security guarantees by isolating and containing potential threats, their effectiveness is also dependent on the trustworthiness of the TEE provider. For instance, the integrity of the TEE’s security features hinges on the provider’s ability to implement and maintain these controls while addressing any vulnerabilities. It is important to acknowledge that if the TEE provider is compromised or if the TEE has undiscovered vulnerabilities, the security assurances offered by the TEE could be undermined. Some TEE providers aim to mitigate these risks and enhance transparency and trust in TEEs by leveraging open-source code and independent verification processes, though these solutions are relatively nascent and are not discussed in further detail here.
Figure 4: Trusted execution environments
Example
An organisation that develops a mobile messaging application wants to match users with contacts on their mobile who are also using the platform. The organisation does this by comparing a user’s contacts to their wider database of users. The organisation does not wish to access the user’s contact data directly. The user’s contacts are encrypted and uploaded to a TEE inside the company’s servers. Inside of this TEE, the user’s data is decrypted and compared against the company’s database of users. Information regarding matches between the user’s contacts and the company’s user base is then returned to the user. As the user’s contacts are only decrypted within the TEE, the company never has visibility of them, nor do they receive a copy of the unencrypted data.
Secure multiparty computation
A secure multiparty computation (SMPC) protocol allows multiple parties to collaborate on data whilst keeping inputs secret from others. Typically, this is done by fragmenting data over multiple networked nodes. Each node hosts an “unintelligible shard”, which is a portion of data that, in isolation, cannot be used to infer information about the original data. Functions are locally completed on the shards and the outcomes are then aggregated to produce a result.
Figure 5: Very simple example of a secure multiparty computation implementation
Example
A group of employees want to understand average salaries without revealing their own salaries. SMPC uses basic mathematical properties of addition to split the computation between parties whilst keeping their actual salaries confidential. Once the results of local computations are recombined, the correct result can be obtained.
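The sketch below illustrates one simple building block behind this example, additive secret sharing, in Python. It is a conceptual illustration under simplifying assumptions (honest parties and a toy share-exchange loop), not a production SMPC protocol.

```python
# Illustrative sketch of additive secret sharing. Each employee splits their salary into
# random shares that sum to the true value modulo a large prime; no single share reveals
# anything, but the published partial sums together reveal only the total.
import random

PRIME = 2**61 - 1   # large modulus so individual shares look uniformly random

def share(secret, n_parties, rng):
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)   # final share makes them sum to the secret
    return shares

rng = random.Random(0)
salaries = [31000, 45000, 52000, 38000]
n = len(salaries)

# Each party sends one share of its salary to every other party; party i ends up holding
# the i-th share of every salary and publishes only the sum of the shares it holds.
all_shares = [share(s, n, rng) for s in salaries]
partial_sums = [sum(all_shares[owner][i] for owner in range(n)) % PRIME for i in range(n)]

total = sum(partial_sums) % PRIME
print(total / n)   # average salary, computed without any individual salary being revealed
```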
Case studies
- BWWC (July 2021) Product - SMPC
- Eurostat (June 2023) Proof of concept - TEE
- Microsoft (July 2021) Digital product - TEE
- Indonesia Ministry of Tourism (June 2023) Digital product - TEE
- Secretarium/Danie (Finance) (June 2023) Product - TEE
- Enviel (July 2021) Proof of concept - HE
Costs and benefits of deploying additional PETs for improved input privacy
Technical considerations
Use of SMPC, TEEs or HE can result in higher levels of privacy for federated solutions, but at the expense of data processor visibility. This lack of visibility can create issues in testing and troubleshooting when running code on the data. In TEE and HE solutions, data processors have no visibility of the data, ensuring confidentiality. SMPC involves the sharding of data, distributing unintelligible segments among various parties, ensuring that no single shard reveals any information about the original data. While there are options to mitigate testing difficulties, they may create additional costs.
PHE and SHE also create a privacy-utility trade-off, as additional levels of privacy can limit the type and number of operations that can be performed on data. This is because, whilst these systems are typically more performant than FHE, they support only a limited number of predetermined operations. Organisations will have to assess the specific needs/requirements of their use case to choose the most suitable approach. FHE and TEEs do not face this same privacy-utility trade-off. Whilst SMPC also does not have the same privacy-utility trade-off, protocols must be set up in such a way that no shards of data contain a proportion of the dataset that would allow for inference of the original data.
By contrast, the traditional baseline example carries a lower technical cost than TEE, HE or SMPC, as all data is directly available and visible. In a PPFL solution, data access is also lower than in the baseline example; however, the deployment of these additional PETs adds a layer of technical complexity in exchange for increased data privacy.
Privacy preserving infrastructure in input privacy approaches
| Baseline | PETs |
| --- | --- |
| High level of utility due to data being decrypted and fully visible to all users. Data is visible to all parties and unprotected after decryption. | HE solutions enable a greater degree of privacy than the baseline example because no external parties can view decrypted data at rest or in transit. TEE and SMPC solutions enable a greater degree of privacy than the baseline example but rely on a greater level of trust between parties than HE. |
Both HE and TEEs offer ways to add enhanced privacy to pipelines once data is shared to a network. Using HE, data remains encrypted whilst outside of the control of the data controller. A data processor can never access, or decrypt, the underlying data. With TEEs, data is only decrypted inside the TEE, which is a distinct processing environment inaccessible to the data processor. Data is encrypted before it is moved out of the TEE and can only be decrypted again by the data controller.
TEEs rely on a greater level of trust between parties than HE, as there must be trust that environments have been correctly set up. TEEs also carry some security risks, such as via side channel attacks, where information about the computation within the TEEs is inferred from signals produced by the computing system. As TEEs are hardware-enabled, they can be challenging to patch if a vulnerability is discovered (see below - Testing and Troubleshooting).
SMPC has a unique set of privacy requirements relating to the trust that can be placed in parties at local nodes. Dishonest or colluding parties could leak information or disrupt SMPC protocols. Organisations may therefore wish to ensure additional resiliency against these attack vectors when designing protocols.
In comparison to federated learning without the use of SMPC, TEEs or HE, these solutions are considerably more secure. Unencrypted or ‘intelligible’ federated pipelines are vulnerable to a range of attacks including man-in-the-middle attacks, which can be used to steal data from local nodes.
Compute considerations in input privacy approaches
| Baseline | PETs |
| --- | --- |
| The baseline example has variable latency and compute costs dependent on the scale of the data being processed. | HE solutions have higher latency compared to other technologies because processing occurs directly on encrypted data. This may also result in higher compute overheads. TEEs have lower latency compared to HE since data is processed without encryption. SMPC results in higher computational overheads than the baseline; however, these are trivial in comparison to the overheads of federated learning itself. |
SMPC results in a higher computational overhead compared to the baseline central dataset approach as more complex protocols are required for internode communication. However, if your organisation is deploying SMPC as part of a federated solution these overheads are likely to be trivial. Federated systems typically handle significant transmission of model updates and computation loads distributed across nodes, making the additional burden from SMPC less significant.
Both PHE and SHE offer a limited number of operations compared to SMPC, FHE and the baseline, thus constraining the type of processing activity that can be undertaken. Limitations to operations in PHE and SHE might, for example, prevent federated learning from being performed. Therefore, the nature of operations required by a system must be known before selecting a PHE or SHE schema.
FHE allows for arbitrary operations and does not have the same limitations in functionality as PHE and SHE. Organisations using FHE will be able to change the operations they are performing. However, FHE carries significant computational overheads, making data processing expensive, and affecting latency. Many complex operations are simply impractical. This is an active area of research where improvements are likely to occur in the future, and computational overheads are likely to reduce.
By contrast, data within a TEE, as with the baseline example, can be processed without encryption, making it more computationally efficient, thus cheaper and faster than HE. These savings may be significant when processing large quantities of data. Both TEEs and the baseline solution allow for arbitrary operations to be performed.
Testing and troubleshooting considerations
| Baseline | PETs |
| --- | --- |
| The baseline uses traditional approaches to testing and patching. | In HE and TEE solutions, data is not visible to the data processor, so testing and troubleshooting may be more difficult and require mitigations at additional cost. Some data is visible to the data processor in SMPC. TEEs are hardware-enabled, which may necessitate additional mitigations and resources for testing processes and iterative bug fixes. |
Testing considerations
In both TEEs and HE, data is not visible to the data processor. This can make it harder to identify and resolve problems when they arise. By contrast, in our baseline solution data remains visible to the data processor. To mitigate potential errors, organisations using TEEs and HE may create additional verification pathways, dummy data for testing, and clear data schemas, at additional cost, to ensure that processes run as expected with HE and TEE solutions.
Both our baseline approach and TEE-based solutions can involve processing real data in environments that are not fully isolated or secure. This may lead to accidental retention of sensitive data in log files, memory dumps, or intermediate computational results, which poses significant security and compliance risks.
Whilst SMPC reduces visibility into the overall data by sharding it, this characteristic can complicate testing and debugging compared to environments like TEEs and HE. In TEEs and HE, although data remains encrypted or isolated, the complete dataset is still intact and can be manipulated or checked in its entirety within the secure or encrypted environment.
This complete view facilitates easier identification and resolution of bugs or integration issues. In contrast, with SMPC, the fragmented nature of data handling means that errors related to data integration or interpretation across different parties can be challenging to detect and rectify. Debugging in SMPC often requires sophisticated simulation and synthetic data approaches to effectively mimic the distributed computation process and ensure that the collective operations yield the correct results.
Patching considerations
Whilst SMPC, TEEs and HE are typically more secure than the baseline example, they are not infallible, and vulnerabilities in systems that make use of them can still arise.
As TEEs are hardware-based, patching for vulnerabilities can be more challenging than patching vulnerabilities in software-based solutions. When issues in TEEs arise, physical hardware may need to be replaced to mitigate threats or issues, resulting in replacement costs as well as costs associated with the downtime required to fix an issue.
Fixing issues or vulnerabilities in SMPC or HE solutions, in a standard federated learning scenario, or in our baseline scenario will not face the same challenges, as these are software-based.
Legal considerations for input privacy approaches
| Baseline | PETs |
| --- | --- |
| Complies with standard data protection laws but may require additional safeguards to prevent unintentional data leakage. | HE enhances compliance with data protection laws by ensuring that data remains encrypted during processing, reducing the risk of data breaches. TEEs provide strong protection while data is processed but require robust procedures to ensure the secure setup and maintenance of environments. SMPC can support adherence to data privacy principles if the proper mitigations against actions by colluding parties are taken. |
Organisations designing systems that use HE or TEEs to process personal or sensitive data must engage legal and regulatory teams to ensure compliance with data protection laws and sector-specific regulations. The use of either TEEs or HE can help organisations to comply with data protection legislation, reduce compliance costs and minimise the burden of legal duties on organisations. These technologies can also provide easier and more cost-effective routes to compliance than our baseline solution.
SMPC ensures that only necessary information is shared without compromising data utility or accuracy. Use of SMPC can demonstrate compliance with the security principle by keeping other parties’ inputs private, preventing attackers from easily altering the protocol output. Furthermore, use of SMPC can aid with adherence to the data minimisation principle, as parties learn only their output, avoiding unnecessary information exposure. Additionally, SMPC helps mitigate personal data breach risks by processing shared information separately, even within the same organisation.
Data disclosure
Use of either TEEs or HE protects data that is being processed from disclosure to the processing party. By contrast, in the baseline scenario this data remains open and visible to the data processor.
TEEs process data in an isolated environment. This ensures that data being processed is protected from disclosure, and provides a level of assurance of data integrity, data confidentiality, and code integrity.
HE protects data from being disclosed as it ensures that only parties with the decryption key can access the information. It can provide a level of guarantee to an organisation when outsourcing a computation in an untrusted setting. The processing party never learns about the “original” unencrypted data, the computation, or result of the computation.
SMPC also can meet the requirements of protecting data from disclosure, providing that risks around collusion between dishonest parties are sufficiently mitigated.
This protection from disclosure can help organisations comply with both the security principle and the requirements of data protection by design under UK GDPR, in ways that our baseline scenario does not.
Data breaches
Use of TEEs, HE and SMPC can also help mitigate the risks of data breaches, and associated penalties and reputational damage, compared to our baseline scenario.
TEEs reduce the risk of data breaches by providing a secure environment for data processing and strong supply chain security. This is because TEE implementations embed devices with unique identities via roots of trust (i.e. a source that can always be trusted within a cryptographic system).
HE also protects against risks of data breaches. As data is encrypted whilst at rest, in transit, and in process, any data leaked should remain unintelligible to an attacker.
Similarly, SMPC also reduces the risk of data breaches as fragmented data that is distributed over multiple networked nodes cannot, in isolation, be used to reveal or access the original data. In all cases, the additional protections from data breaches provided by TEEs, SMPC and HE can potentially lower legal liabilities and costs faced by organisations. None of these additional protections are provided in our baseline scenario.
Data governance
Correct use of TEEs can assist organisations with data governance in practice, which can help to reduce costs associated with auditing and other compliance requirements. Compared to the baseline scenario and to HE and SMPC, TEEs can introduce efficiencies by streamlining and automating aspects of governance and transparency.
For example, TEEs can be configured to provide reliable and tamper-proof logs of data processing activities for auditing. This logging can enable an organisation to trace each operation within a TEE back to an authenticated entity (e.g. user, process, or system), and therefore help an organisation comply with the accountability principle in the UK GDPR.
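The sketch below is a minimal, illustrative example of the kind of tamper-evident, attributable logging a TEE can be configured to produce. The hash-chaining approach, class and method names are assumptions for illustration only, not a description of any particular TEE product.

```python
import hashlib
import json
import time

class AuditLog:
    """Illustrative append-only, hash-chained log of processing events.

    Each entry records which authenticated entity performed which operation,
    and chains a hash of the previous entry so that tampering with any
    earlier record invalidates everything that follows.
    """

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, entity_id: str, operation: str) -> dict:
        entry = {
            "timestamp": time.time(),
            "entity": entity_id,     # authenticated user, process or system
            "operation": operation,  # e.g. "read", "aggregate", "export"
            "prev_hash": self._last_hash,
        }
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self._last_hash = entry["hash"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain to confirm no entry has been altered."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```

In practice, a TEE would also provide attestation that this logging code, and not a modified version, is what actually ran; the sketch shows only the tamper-evidence and attribution aspects.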
Section 3: output privacy considerations
Introduction
This section will consider approaches for ensuring data remains secure once it has been processed (output privacy).
Organisations can employ a range of traditional methods to achieve this, such as anonymising data before publishing it or sharing it for further processing, for example anonymising datasets before sending them to untrusted nodes, or preventing access to data during transit to mitigate immediate risks.
In a PPFL use case, output privacy can be improved by layering additional PETs like synthetic data or differential privacy (DP) into the federated solution. See Figure 1 for examples of where these PETs can fit in a PPFL architecture. This section will explore the costs and benefits associated with these approaches in more detail. These PETs are applicable to a wide range of other use cases, and organisations considering these methods will encounter similar costs and benefits, particularly when these methods are viewed as alternatives to traditional anonymisation or pseudonymisation techniques.
Enabling approaches
Differential privacy
Differential privacy (DP) is a formal mathematical framework designed to ensure data privacy. It achieves this by adding noise to data, which involves the insertion of random changes that introduce distortions, either to the input data or to the outputs it generates. By injecting random noise into a dataset, DP increases the difficulty of determining whether data associated with a specific individual is present in the dataset.
In DP, noise is added using randomised mechanisms that reduce a network’s ability to memorise explicit training samples. The more noise added, the more inaccurate the final dataset is likely to be, leading to a privacy-utility trade-off. This trade-off is quantified by the concept of Ɛ-differential privacy, where the Ɛ parameter represents the worst-case amount of information that can be inferred about any individual from the result of a computation on the dataset. The privacy budget, defined by Ɛ, sets a limit on the total privacy loss permitted: the lower the budget, the more noise must be added and the less accurate the results become. This budget varies based on context and application. It is worth noting that there is no standardised way of determining how Ɛ should be calculated.
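As an illustration of this trade-off, the sketch below applies the Laplace mechanism, one common way of achieving Ɛ-differential privacy: noise drawn from a Laplace distribution, with scale equal to the query’s sensitivity (how much one individual’s data can change the result) divided by Ɛ, is added to a query result, so a smaller Ɛ means more noise and less accuracy. The dataset and query are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private estimate of true_value.

    The Laplace mechanism adds noise with scale = sensitivity / epsilon,
    so a smaller epsilon (a stricter privacy budget) means more noise.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Hypothetical example: a private count of records matching some criterion.
ages = np.array([34, 29, 41, 52, 38, 27, 45])
true_count = float(np.sum(ages > 30))  # counting queries have sensitivity 1

for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon)
    print(f"epsilon={epsilon}: true={true_count}, noisy={noisy:.2f}")
```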
Applications of DP can be widely categorised into interactive and non-interactive approaches. Each method has implications for data utility and the privacy budget at the implementor’s disposal. Understanding these distinctions is important for determining the most appropriate method for organisational needs and specific use cases.
Interactive DP is implemented through mechanisms that allow users to query a database and receive noisy answers. This approach typically involves a trusted curator or algorithm adding noise to query results in real time. The system tracks the privacy budget and ensures that the cumulative privacy loss across queries does not exceed it.
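A minimal sketch of how an interactive system might track a privacy budget is shown below: each query spends part of the overall budget, and the curator refuses further queries once it is exhausted. The class and method names, and the use of the Laplace mechanism for counting queries, are illustrative assumptions.

```python
import numpy as np

class PrivateQueryCurator:
    """Illustrative trusted curator that answers counting queries with noise
    and tracks cumulative privacy loss against a total budget."""

    def __init__(self, data: np.ndarray, total_epsilon: float):
        self.data = data
        self.remaining_epsilon = total_epsilon

    def count(self, predicate, epsilon: float) -> float:
        if epsilon > self.remaining_epsilon:
            raise RuntimeError("Privacy budget exhausted; query refused.")
        self.remaining_epsilon -= epsilon  # spend part of the budget
        true_answer = float(np.sum(predicate(self.data)))
        return true_answer + np.random.laplace(scale=1.0 / epsilon)  # sensitivity 1

# Hypothetical usage: two queries against a total budget of 1.0.
curator = PrivateQueryCurator(np.array([34, 29, 41, 52, 38, 27, 45]), total_epsilon=1.0)
print(curator.count(lambda d: d > 30, epsilon=0.5))
print(curator.count(lambda d: d > 40, epsilon=0.5))
# A third query would now be refused because the budget is spent.
```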
Non-interactive DP involves sanitising datasets or generating and releasing synthetic datasets that preserve the statistical properties of the original data while ensuring privacy. In this approach, the data publisher creates a differentially private dataset pre-processed with noise, ensuring that subsequent interactions with the data do not involve the original dataset.
Non-interactive DP can also enhance input privacy. For instance, an organisation might generate a sanitised dataset using non-interactive DP, which is then used in an interactive DP system. This combination ensures further queries to the database are doubly secured: initially through the sanitised data and subsequently via real-time noise adjustments based on each query.
Figure 6: Differential privacy
Example
A social media company wants to release data about users’ on-platform interest in television shows to allow researchers to assess trends. To ensure users cannot be identified, noise is injected into the dataset. In this case the television shows are categorised into genres, and the noise added changes the actual shows into shows of the same genre. This ensures users’ actual preferences in television shows are differentially private while maintaining the overarching trends in the data.
Synthetic data
Synthetic data can also be used to enhance output privacy, protecting sensitive information at the point it is shared or released, particularly when it is generated using differential privacy. Synthetic data is artificial information that is generated to preserve the patterns and statistical properties of an original dataset on which it is based. As this artificial data will not be identical in detail to the original data, it offers a way to protect individuals’ data privacy, enable further analysis to be carried out on the data, and make the trends and characteristics in the data more widely accessible. A simple, illustrative generation sketch follows the list of types below.
Different types of synthetic data include:
-
partially synthetic data - contains some real data but removes sensitive information.
-
fully synthetic data - includes no original data. In this case, an entirely new dataset is generated to mimic properties of the original data.
-
hybrid synthetic data - this approach includes both real and fully synthetic data.
Synthetic data may be static (generated once and fixed) or dynamic (generated and updated multiple times).
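To illustrate the basic idea of fully synthetic data, the sketch below fits a very simple statistical model (per-column means and a covariance matrix) to a hypothetical numeric dataset and samples an entirely new dataset from it. Real synthetic data generators are considerably more sophisticated (see the compute considerations later in this section); the column names and model choice here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" numeric dataset: columns are age, income, visits.
real = np.column_stack([
    rng.normal(45, 12, size=500),       # age
    rng.normal(32000, 8000, size=500),  # income
    rng.poisson(3, size=500),           # visits
]).astype(float)

# Fit a simple model of the data: column means and covariance.
means = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample a fully synthetic dataset of the same size from that model.
# No row in `synthetic` is copied from `real`; only aggregate patterns carry over.
synthetic = rng.multivariate_normal(means, cov, size=500)

print("real means:     ", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```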
Figure 7: Synthetic data generation example
Example
A policy research organisation wants to publish data on outcomes from a study to enable further research. The dataset includes sensitive information, but the organisation wishes to share it in as much detail as possible to allow for further analysis. They upload the original data to a synthetic data generator. The generator identifies statistical patterns in the dataset which are replicated to create an entirely new synthetic dataset containing none of the original data.
Case studies
-
Statice - predictive analytics for insurance (June 2023) Synthetic Data
-
Replica Analytics - healthcare data for research (June 2023) Synthetic Data
Costs and benefits of layering additional PETs into PPFL for improved output privacy
Technical considerations
Use of DP or synthetic data can enhance output privacy in the context of a PPFL use case. Both approaches carry additional computational costs. Ensuring that data is sufficiently private to prevent it being recoverable is an important consideration when using either synthetic data or DP. Both approaches involve a privacy-utility trade-off, where an increase of one feature comes at the expense of the other. Organisations will need to consider the specific context of their use case, to establish how to balance this trade-off.
Privacy preserving infrastructure in output privacy approaches
Baseline | PETs
--- | ---
Potentially higher utility and lower levels of privacy. For legal and governance reasons some sensitive data may not be shared. | The use of synthetic data and DP techniques can reduce the utility of the data, as adding privacy changes the distribution of trends in datasets. Organisations seeking to deploy these technologies should consider the trade-off between the appropriate level of privacy when sharing data and the level of utility required to use the data effectively. Synthetic data and DP can enable sensitive data to be shared in an acceptable form where privacy concerns would otherwise have prevented sharing.
Both DP and synthetic data offer a higher level of privacy than anonymising data through more traditional methods; however, this comes at the cost of utility.
In DP, if the privacy budget is set too low, the amount of noise required can cause the data to lose its utility, whereas if it is set too high, the original data can be recovered with a greater degree of certainty. Injecting greater amounts of noise into a dataset may cause trends in the data to be lost, or inaccuracies to be introduced. These effects are more pronounced for smaller sub-populations represented in the data. This is highly context dependent, and with no existing standards on privacy budgets, organisations will need to decide how much utility they are willing to trade off to make the dataset sufficiently differentially private. In some contexts or use cases, organisations will also need to consider how to communicate this to users.
Similarly, in synthetic data, datasets created from trends in real data are not guaranteed to be reflective of the original, because of factors including the scale of the original dataset compared to the synthetic dataset and the type of synthetic data deployed (full, partial, etc.). For example, if a dataset has few examples of a particular characteristic, these patterns may be lost when new data is generated or scaled up, because the generation tool amplifies existing trends and biases in the data. This can result in the synthetic dataset being unrepresentative of real data. In this way privacy is increased by the PET, but at the cost of losing the trends in the data, reducing its utility. Organisations may wish to test the utility of their synthetic data by comparing results found in the synthetic dataset to those of the real dataset to ensure they are comparable.
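A simple, illustrative way to make this comparison is to check that summary statistics and pairwise correlations in the synthetic data stay close to those of the real data. The tolerance values below are arbitrary assumptions and would need to be set per use case; dedicated synthetic data evaluation tooling goes much further than this sketch.

```python
import numpy as np

def utility_check(real: np.ndarray, synthetic: np.ndarray,
                  mean_tol: float = 0.05, corr_tol: float = 0.1) -> bool:
    """Crude utility check: flag whether column means (relative difference)
    and pairwise correlations in the synthetic data stay close to the real data."""
    rel_mean_diff = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)) / (np.abs(real.mean(axis=0)) + 1e-9)
    corr_diff = np.abs(np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False))
    return bool(np.all(rel_mean_diff < mean_tol) and np.all(corr_diff < corr_tol))

# Hypothetical usage with two numeric datasets of shape (rows, columns):
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
synthetic = rng.normal(size=(500, 3))
print(utility_check(real, synthetic))
```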
Conversely, in a federated pipeline that does not deploy DP, the utility of the data is maximised because none of its characteristics are changed. This, however, offers considerably lower protection against data recovery: without proper protections, raw training data may be recoverable. This lack of privacy may even prevent anonymised data being used for certain purposes if the risk of anonymised sensitive data being re-identified is deemed too high (see the legal considerations below).
Database considerations
Baseline | PETs
--- | ---
Traditional database storage offering a lower level of privacy. | Greater privacy at rest using differential privacy or fully synthetic data. Partial or hybrid synthetic data may require additional PETs to be stacked to be secure in storage.
Whilst synthetic data can enhance output privacy, data must still be protected in storage. Threat models exist that may allow for the reconstruction of data, including membership inference (to determine if an individual was in the original data), attribute inference (recovery of missing attributes) and reconstruction attacks (recovery of original records). Fully synthetic data may be less at risk of these threats as, even if an attack is successful, the dataset contains no real data.
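One simple, illustrative check before storage or release is to confirm that no synthetic record exactly reproduces, or sits unusually close to, a real record. Dedicated membership and attribute inference testing goes much further than this sketch, and the distance threshold used here is an arbitrary assumption.

```python
import numpy as np

def nearest_real_distances(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """For each synthetic record, return the distance to its nearest real record.
    Very small distances suggest the generator may have replicated real records."""
    # Pairwise Euclidean distances (fine for small, numeric datasets).
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Hypothetical numeric datasets of shape (rows, columns).
rng = np.random.default_rng(1)
real = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))

distances = nearest_real_distances(real, synthetic)
print("records suspiciously close to a real record:", int(np.sum(distances < 0.05)))
```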
Organisations seeking to deploy synthetic data should also consider how their data will be used by researchers, which should inform the level of detail required and the type of release model. If it is deemed that sensitive data could be identified, organisations may wish to stack extra PETs on to their synthetic data such as deploying differential privacy on the synthetic dataset. Whilst this produces extra costs for an organisation, it offers protections for the data in storage.
Compute considerations in output privacy approaches
Baseline | PETs
--- | ---
Compute costs for more traditional approaches to privacy are relatively trivial. | Synthetic data and DP datasets have higher compute costs in generation. These costs scale with the complexity of the original/real datasets. Dynamic synthetic data and global DP have continuous compute overheads.
Use of synthetic data or DP will incur greater computational overheads during setup than traditional methods such as anonymisation. Synthetic data can be generated using generative adversarial networks, variational autoencoders or other machine learning methods, which carry additional costs compared to traditional anonymisation methods. Moreover, should an organisation seek to create dynamic synthetic data, they will face continuing costs each time a new dataset is generated or updated.
DP is more variable in its compute costs, with requirements dependent on the scale and complexity of the dataset alongside the privacy budget. Privacy budgets are set depending on whether the differential privacy is global or local. This toolkit specifically focuses on Global Differential Privacy (Global DP) within the context of a PPFL pipeline. Global DP refers to the application of DP techniques at a large scale across an entire database, or dataset collection, rather than on smaller or individual segments. This method involves injecting noise into the entire dataset or into the outputs globally, ensuring that individual data contributions are obscured, thereby protecting individual privacy across the dataset. Global DP incurs additional costs because noise is injected each time a new query is sent, rather than as a one-off cost. This is particularly relevant within a federated pipeline, where DP is likely to be applied whenever data or a model is called upon.
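As a rough illustration of this recurring cost, the sketch below adds Gaussian noise to the aggregated model update in each round of a hypothetical federated pipeline. The noise scale, clipping bound and number of rounds are arbitrary assumptions, and production systems would typically use a dedicated DP accounting library to track the cumulative privacy budget across rounds.

```python
import numpy as np

rng = np.random.default_rng(7)

def dp_aggregate(client_updates: list, clip_norm: float, noise_std: float) -> np.ndarray:
    """Clip each client's update, average them, and add Gaussian noise.
    In a federated pipeline this happens every round, so the cost recurs."""
    clipped = []
    for u in client_updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / (norm + 1e-12)))
    aggregate = np.mean(clipped, axis=0)
    return aggregate + rng.normal(scale=noise_std, size=aggregate.shape)

# Hypothetical: 5 clients, 10-dimensional model updates, 3 training rounds.
for round_number in range(3):
    updates = [rng.normal(size=10) for _ in range(5)]
    noisy_update = dp_aggregate(updates, clip_norm=1.0, noise_std=0.1)
    print(f"round {round_number}: noisy aggregate norm = {np.linalg.norm(noisy_update):.3f}")
```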
These computational considerations are critical when planning the deployment of PETs such as synthetic data and DP, as they significantly affect both the initial and ongoing resources required for effective data privacy management. By understanding these costs, organisations can better plan the integration of these technologies into their data processes, ensuring that privacy enhancements do not compromise operational efficiency.
Legal considerations for output privacy approaches
Baseline | PETs
--- | ---
The use of legacy anonymisation techniques may limit the use cases of the data to restricted environments. Legacy anonymisation techniques may require more frequent review as re-identification techniques and available data change over time, requiring legal input. | DP data or DP synthetic data can be effectively anonymised, providing the risk of re-identification is reduced to a remote level by adding an appropriate level of noise. Synthetic data can also be effectively anonymised, providing that risks of re-identification, e.g. model inversion attacks, membership inference attacks and attribute disclosure, are appropriately mitigated. This can be achieved by using DP or other approaches such as removal of outliers. Effectively anonymised data is not subject to data protection law; therefore, compliance costs associated with handling personal data are removed.
In certain cases, the use of DP and synthetic data can effectively anonymise personal data. There are still costs organisations will need to consider, including potentially seeking legal advice to help with determining whether data is truly anonymous, but organisations that seek to make data available may be able to reduce legal and compliance costs, as well as duties associated with sharing and processing data by using DP or synthetic data.
Data protection law does not apply to synthetic or differentially private data which has been assessed as anonymous; therefore, legal costs associated with ensuring compliance with data protection law are minimised. See the ICO guidance for more information on effectively anonymised data. Additionally, the use of synthetic data or DP can allow organisations to make use of data in circumstances where using identifiable data would have been challenging.
Differential privacy
Both models of differential privacy can make outputs anonymised, if a sufficient level of noise is added.
If non-interactive DP is used, the level of identifiable information is a property of the information itself, which is set for a given privacy budget. Once the data is released, no further queries can adjust the level of identifiability in the data. This can reduce legal costs, as the data does not require continuous oversight of interactions needed for the interactive form of differential privacy.
If an interactive query-based model is used, legal input may be required to put contractual controls in place to mitigate the risk of collusion between various parties querying the data (to pool the results of their queries and increase their collective knowledge of the dataset).
Where DP is configured and managed appropriately, the risk of re-identification of personal data can be reduced to a minimal level, significantly reducing the risk of incurring fines and enforcement action due to a failure to anonymise personal data.
Synthetic data
As above, data protection law does not apply to synthetic data which has been assessed as anonymous; therefore, legal costs associated with ensuring compliance with data protection law are minimised, e.g. there is no requirement to fulfil data subject requests. Even where synthetic data is considered anonymised, its identifiability should be re-tested over time, as new attacks and newly published data sources may increase the risk of re-identification.
Anonymous synthetic data can act as a viable replacement for real data in prototyping and system testing, saving time and the legal costs associated with obtaining approval to use real data.
Similar considerations may apply when seeking to share synthetic data. Legal input may be required when determining whether synthetic data contains, or could inadvertently reveal, personal data, e.g. through inference attacks. The use of a differentially private synthetic data generation algorithm can help to reduce the risk of re-identification to a sufficiently low level.
Depending on the release mechanism for the data, organisations may incur legal costs through data sharing agreements and contractual controls that are needed in order to clearly define what users can and cannot do with the data. For example, should data be released to a closed environment which only approved researchers can access, contractual controls should be put in place that state that synthetic data should not be used for making decisions about people without first assessing and mitigating any bias and inaccuracies in the data. Alternatively, if the data is openly published with no access controls, then no contractual controls or restrictions will apply.
Conclusion
The use of PETs is not a silver bullet to ‘solve’ all privacy concerns your organisation may face, however, adopting these technologies correctly, where appropriate to the organisation’s context and intended use case, has the potential to introduce and/or unlock a range of benefits. Before adopting any PETs, it is important to weigh up the costs of your solution against these benefits to determine whether or not the technology is right for your organisation.
For more information on legal compliance please see the ICO’s PETs guidance.