CarefulAI: Prompt-LLM Improvement Method (PLIM)

PLIM: an approach to ensuring the relationship between prompts and large language models is validated quickly and efficiently.

Background & Description

When working with large language models (LLMs), accuracy is important. However, the co-dependency between LLM outputs and prompts is poorly understood. Existing LLM benchmarks do not capture it; they report historical accuracy scores that may not be relevant to the end user. In addition, LLMs are usually dynamic in practice: their behaviour is not static but changes over time, often in ways that LLM providers cannot explain. Users can therefore only partially depend upon LLM benchmarks. In practice, to make LLMs fit for purpose and safe, users must constantly test Prompt-LLM outputs for their specific cases. This can be time-consuming.

CarefulAI’s approach is based on the discovery that serving a model with a standard set of end user-specific question-and-answer examples, validated by the end-user community (each prompt validated by a minimum of three subject matter experts or end users), reduces the time taken to obtain acceptable answers significantly (roughly tenfold). In addition to producing Prompt-LLM combinations that are deemed safe, the approach enables sector- and subject-matter-specific prompt benchmarking against multiple models.
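
As a concrete illustration (not CarefulAI’s published implementation), the Python sketch below shows how a standard set of validated question-and-answer examples might be served to a model as context ahead of a user’s question. The names ValidatedExample, build_prompt and call_llm are hypothetical and used only for illustration.

# Minimal sketch: assembling a prompt from community-validated Q&A examples
# before querying an LLM. Names are illustrative, not CarefulAI's code.

from dataclasses import dataclass


@dataclass
class ValidatedExample:
    question: str
    answer: str
    validators: list[str]  # subject matter experts who signed the example off (3+ under PLIM)


def build_prompt(examples: list[ValidatedExample], user_question: str) -> str:
    """Prepend validated Q&A examples so the model is anchored to approved answers."""
    usable = [e for e in examples if len(set(e.validators)) >= 3]
    shots = "\n\n".join(f"Q: {e.question}\nA: {e.answer}" for e in usable)
    return f"{shots}\n\nQ: {user_question}\nA:"


def call_llm(prompt: str) -> str:
    """Placeholder for whichever model endpoint is being evaluated."""
    raise NotImplementedError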

PLIM is designed to make LLMs safer and more fit for purpose through benchmarking and continuous monitoring. This is particularly important in high-risk environments, e.g. healthcare, finance, insurance and defence. Having community-based prompts to validate models as fit for purpose is safer in a world where LLMs are not static.

The PLIM method consists of question-and-answer prompts for specific purposes, validated by the community the Prompt-LLM output seeks to support. These prompts are shared widely across sector leads for validation (in a healthcare context, for example, senior clinicians, NICE and MHRA). Each prompt is independently validated by at least three subject matter experts and carries safety case information (e.g. in mental health, phrases that would be problematic, such as suicide ideation phrases, together with the correct responses). Synthetic prompts that mirror the validated interactions are also created to widen the test boundaries; they, too, are validated by the subject matter experts.
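
A minimal sketch of what a PLIM-style prompt record might look like is given below. It assumes one record per validated prompt; the names (PlimPrompt, safety_phrases, is_validated) are illustrative, not CarefulAI’s schema.

# Minimal sketch of a validated prompt record carrying safety case information.

from dataclasses import dataclass, field


@dataclass
class PlimPrompt:
    question: str
    expected_answer: str
    validators: list[str]  # independent subject matter experts who approved it
    safety_phrases: dict[str, str] = field(default_factory=dict)  # risky phrase -> required response
    synthetic: bool = False  # True for SME-validated synthetic variants


def is_validated(prompt: PlimPrompt, minimum_validators: int = 3) -> bool:
    """A prompt enters the benchmark set only once enough SMEs have signed it off."""
    return len(set(prompt.validators)) >= minimum_validators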

Relevant Cross-Sectoral Regulatory Principles

Safety, security and robustness

  • Performance: PLIM enables more robust and predictable outcomes from LLMs by validating Prompt-LLM dependencies. The performance of LLMs relative to prompts can be optimised for a wider community.
  • Efficiency: PLIM speeds up the development life-cycle of safer AI by making Prompt-LLM output dependencies transparent to users. The previous lack of transparency around efficient Prompt-LLM combinations meant that individual LLM users would have needed to undertake tens of thousands of interactions; with PLIM, this can be reduced to between 150 and 750 (a sketch of such a benchmark run follows this list).
  • Risk Management: PLIM reduces operational risks associated with misaligned prompts and models by validating prompts against a community of potential beneficiaries across the ecosystem in which the Prompt-LLM outputs are targeted.
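
The sketch below illustrates how a validated prompt set of this size might be run against several models to produce a sector-specific benchmark. It reuses the hypothetical PlimPrompt record and is_validated check from the earlier sketch; call_model stands in for whichever model endpoints are under test, and the substring-based scoring is a crude placeholder for SME-defined acceptance criteria.

# Minimal sketch of sector-specific Prompt-LLM benchmarking across models.

def call_model(model_name: str, question: str) -> str:
    """Placeholder: send the question to the named model and return its answer."""
    raise NotImplementedError


def benchmark(models: list[str], prompts: list[PlimPrompt]) -> dict[str, float]:
    """Run each validated prompt against each model and report its pass rate."""
    usable = [p for p in prompts if is_validated(p)]
    results: dict[str, float] = {}
    for model in models:
        passed = 0
        for p in usable:
            answer = call_model(model, p.question)
            # Fail any response that misses an SME-required safety response for
            # a flagged phrase appearing in the question.
            safe = all(required.lower() in answer.lower()
                       for phrase, required in p.safety_phrases.items()
                       if phrase.lower() in p.question.lower())
            if safe and p.expected_answer.lower() in answer.lower():
                passed += 1
        results[model] = passed / max(len(usable), 1)
    return results

Because the prompt set is fixed and SME-validated, the same run can be repeated whenever a provider updates a model, supporting the continuous monitoring described above.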

See https://www.carefulai.com/plim.html for more details.

Why we took this approach

Public-facing systems rely on Prompt-LLM accuracy to generate reliable outputs. The scale of the challenge is significant, primarily because the main providers of LLMs (e.g. Microsoft, AWS, Oracle) are embedded across the IT systems that face the public. Each promotes LLMs and associated GenAI as a method of generating content using a variety of models (there are over 2,500 individual models). None makes it transparent that the fitness for purpose of an LLM depends upon keeping prompts up to date.

PLIM ensures that the relationship between prompts and large language models is validated quickly and efficiently. Prompt-LLM benchmarks appear to be the only reliable way to ensure LLMs are safe and fit for purpose. Given that much of industry will be affected by LLM use, it is important for organisations to adopt methods for improving model robustness and accuracy, such as PLIM. At this time, the AI industry depends upon LLM performance metrics of limited practical meaning. CarefulAI sees the value of LLMs, but without tools to assess and manage the relationship between prompts and outputs, such as PLIM, more work will be needed to enable the GenAI industry to grow safely.

Benefits to the organisation using the technique

  • Effective deployment of safer AI systems based on LLMs: previously, LLMs were deployed with prompts that were not maintained, meaning engineering teams could not deploy stable LLM-enabled services because the behaviour of those services could not be guaranteed.
  • Reduced risk of incurring costs associated with Prompt-LLM service engineering: by decreasing the amount of re-engineering required.
  • Improved alignment of LLM outputs with desired objectives: by having a community of subject matter experts validate prompts alongside LLM engineering teams.

Limitations of the approach

  • The effectiveness of this approach depends on the quality of prompt safety rules and on the experience and availability of subject matter experts prepared to validate question-and-answer sets for the target markets where LLMs are to be deployed. In essence, if LLM engineering teams do not have access to subject matter experts, they are forced into a cycle of prompt re-engineering to deliver a stable Prompt-LLM combination.
  • The dependency of LLM performance on prompt types is not yet well understood. The high-profile universities and institutions that publish LLM benchmarks are not set up to manage the risk their benchmarks create of overconfidence in individual models. There will therefore always be a co-dependency between LLM providers and prompt engineering.

The approach adds value to risk mitigation associated with:

Updates to this page

Published 5 December 2024