CarefulAI: Prompt-LLM Improvement Method (PLIM)

PLIM: an approach to ensuring the relationship between prompts and large language models is validated quickly and efficiently.

Background & Description

When working with large language models (LLMs), accuracy is important. However, the co-dependency between LLM outputs and prompts is poorly understood. Existing LLM benchmarks do not capture it; they report historical accuracy scores that may not be relevant to the end user. In addition, LLMs are usually dynamic in practice: their behaviour is not static but changes over time, often in ways that LLM providers cannot explain. Users can therefore only partially depend upon LLM benchmarks. In practice, to make LLMs fit for purpose and safe, users must constantly test Prompt-LLM outputs for their specific cases. This can be time-consuming.

CarefulAI’s approach is based on the discovery that serving a model with a standard set of end user-specific question-and-answer examples, validated by the end-user community (each prompt validated by a minimum of three subject matter experts or end users), reduces the time taken to obtain acceptable answers significantly (roughly tenfold). In addition to producing Prompt-LLM combinations that are deemed safe, the approach enables sector- and subject-matter-specific prompt benchmarking against multiple models.
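
As a concrete illustration (not CarefulAI’s published implementation), the Python sketch below shows how a standard set of validated question-and-answer examples might be served to a model as context ahead of a user’s question. The names ValidatedExample, build_prompt and call_llm are hypothetical and used only for illustration.

# Minimal sketch: assembling a prompt from community-validated Q&A examples
# before querying an LLM. Names are illustrative, not CarefulAI's code.

from dataclasses import dataclass


@dataclass
class ValidatedExample:
    question: str
    answer: str
    validators: list[str]  # subject matter experts who signed the example off (3+ under PLIM)


def build_prompt(examples: list[ValidatedExample], user_question: str) -> str:
    """Prepend validated Q&A examples so the model is anchored to approved answers."""
    usable = [e for e in examples if len(set(e.validators)) >= 3]
    shots = "\n\n".join(f"Q: {e.question}\nA: {e.answer}" for e in usable)
    return f"{shots}\n\nQ: {user_question}\nA:"


def call_llm(prompt: str) -> str:
    """Placeholder for whichever model endpoint is being evaluated."""
    raise NotImplementedError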

PLIM is designed to make LLMs safer and more fit for purpose through benchmarking and continuous monitoring. This is particularly important in high-risk environments, e.g. healthcare, finance, insurance and defence. Having community-based prompts to validate models as fit for purpose is safer in a world where LLMs are not static.

The PLIM method consists of question-and-answer prompts for specific purposes, validated by the community the Prompt-LLM output seeks to support. These prompts are shared widely across sector leads for validation (in a healthcare context, for example, senior clinicians, NICE and MHRA). Each prompt is independently validated by at least three subject matter experts and carries safety case information (e.g. in mental health, phrases that would be problematic, such as suicide ideation phrases, together with the correct responses). Synthetic prompts that mirror the validated interactions are also created to widen the test boundaries; they, too, are validated by the subject matter experts.
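
A minimal sketch of what a PLIM-style prompt record might look like is given below. It assumes one record per validated prompt; the names (PlimPrompt, safety_phrases, is_validated) are illustrative, not CarefulAI’s schema.

# Minimal sketch of a validated prompt record carrying safety case information.

from dataclasses import dataclass, field


@dataclass
class PlimPrompt:
    question: str
    expected_answer: str
    validators: list[str]  # independent subject matter experts who approved it
    safety_phrases: dict[str, str] = field(default_factory=dict)  # risky phrase -> required response
    synthetic: bool = False  # True for SME-validated synthetic variants


def is_validated(prompt: PlimPrompt, minimum_validators: int = 3) -> bool:
    """A prompt enters the benchmark set only once enough SMEs have signed it off."""
    return len(set(prompt.validators)) >= minimum_validators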

Relevant Cross-Sectoral Regulatory Principles

Safety, security and robustness

  • Performance: PLIM enables more robust and predictable outcomes from LLMs by validating Prompt-LLM dependencies. The performance of LLMs relative to prompts can be optimised for a wider community.
  • Efficiency: PLIM speeds up the development life-cycle of safer AI by making Prompt-LLM output dependencies transparent to users. The previous lack of transparency around efficient Prompt-LLM combinations meant that individual LLM users would have needed to undertake tens of thousands of interactions; with PLIM, this can be reduced to between 150 and 750 (a sketch of such a benchmark run follows this list).
  • Risk Management: PLIM reduces operational risks associated with misaligned prompts and models by validating prompts against a community of potential beneficiaries across the ecosystem in which the Prompt-LLM outputs are targeted.
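
The sketch below illustrates how a validated prompt set of this size might be run against several models to produce a sector-specific benchmark. It reuses the hypothetical PlimPrompt record and is_validated check from the earlier sketch; call_model stands in for whichever model endpoints are under test, and the substring-based scoring is a crude placeholder for SME-defined acceptance criteria.

# Minimal sketch of sector-specific Prompt-LLM benchmarking across models.

def call_model(model_name: str, question: str) -> str:
    """Placeholder: send the question to the named model and return its answer."""
    raise NotImplementedError


def benchmark(models: list[str], prompts: list[PlimPrompt]) -> dict[str, float]:
    """Run each validated prompt against each model and report its pass rate."""
    usable = [p for p in prompts if is_validated(p)]
    results: dict[str, float] = {}
    for model in models:
        passed = 0
        for p in usable:
            answer = call_model(model, p.question)
            # Fail any response that misses an SME-required safety response for
            # a flagged phrase appearing in the question.
            safe = all(required.lower() in answer.lower()
                       for phrase, required in p.safety_phrases.items()
                       if phrase.lower() in p.question.lower())
            if safe and p.expected_answer.lower() in answer.lower():
                passed += 1
        results[model] = passed / max(len(usable), 1)
    return results

Because the prompt set is fixed and SME-validated, the same run can be repeated whenever a provider updates a model, supporting the continuous monitoring described above.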

See https://www.carefulai.com/plim.html for more details.

Why we took this approach

Public-facing systems rely on Prompt-LLM accuracy to generate reliable outputs. The scale of the challenge is significant, primarily because the main providers of LLMs (e.g. Microsoft, AWS, Oracle) are embedded across the IT systems that face the public. Each promotes LLMs and associated GenAI as a method of generating content using a variety of models (there are over 2,500 individual models). None makes it transparent that the fitness for purpose of an LLM depends upon keeping prompts up to date.

PLIM ensures that the relationship between prompts and large language models is validated quickly and efficiently. Prompt-LLM benchmarks appear to be the only reliable way to ensure LLMs are safe and fit for purpose. Given that much of industry will be affected by LLM use, it is important for organisations to adopt methods for improving model robustness and accuracy, such as PLIM. At this time, the AI industry depends upon LLM performance metrics of limited practical meaning. CarefulAI sees the value of LLMs, but without tools to assess and manage the relationship between prompts and outputs, such as PLIM, more work will be needed to enable the GenAI industry to grow safely.

Benefits to the organisation using the technique

  • Effective deployment of safer AI systems based on LLMs: previously, LLMs were deployed with prompts that were not maintained, meaning engineering teams could not deploy stable LLM-enabled services because the behaviour of those services could not be guaranteed.
  • Reduced risk of incurring costs associated with Prompt-LLM service engineering: by decreasing the amount of re-engineering required.
  • Improved alignment of LLM outputs with desired objectives: by having a community of subject matter experts validate prompts alongside LLM engineering teams.

Limitations of the approach

  • The effectiveness of this approach depends on the quality of prompt safety rules and on the experience and availability of subject matter experts prepared to validate question-and-answer sets for the target markets where LLMs are to be deployed. In essence, if LLM engineering teams do not have access to subject matter experts, they are forced into a cycle of prompt re-engineering to deliver a stable Prompt-LLM combination.
  • The dependency of LLM performance on prompt types is not yet well understood. The high-profile universities and institutions that publish LLM benchmarks are not set up to manage the risk their benchmarks create of overconfidence in individual models. There will therefore always be a co-dependency between LLM providers and prompt engineering.

The approach adds value to risk mitigation associated with:

Updates to this page

Published 5 December 2024