GCS Co-Pilot alpha assessment report

Service Standard assessment report

GCS Co-Pilot

Assessment date: 20/03/2024
Stage: Alpha
Result: Amber
Service provider: Cabinet Office

Service description

The service aims to help government communications colleagues initiate campaign ideas and first drafts, using generative AI and best practice. It is being developed rapidly.

The assessment was run in the original spirit of service assessments: a 'show the thing' format with little or no presentation (more show than tell), followed by concise questioning focused on whether and how the service meets the Standard at alpha and on the areas of greatest risk. The assessment was kept short, at half an hour per section, making two and a half hours in total for the service demo, user research questions, design questions, technology questions, and the lead assessor's questions on the team, agile ways of working and performance analytics.

AI is an emerging technology that deserves appropriate scrutiny, and the assessment balanced that scrutiny with efficiency given the speed of development. The service team did not present, which saved them preparation time and, crucially, gave the panel time to ask the questions needed to judge whether the service meets the Standard and to concentrate on the high-level risks.

Service users

This service is for staff in Government Communications Service (GCS) roles across all departments, totalling around 7,000 people.

There are 25 ministerial and 20 non-ministerial departments, and some 400 public bodies and agencies.

Things the service team have done well:

  • conducted research with a range of users, including capturing relevant data such as variance in AI use and civil service experience
  • demonstrated a good understanding of the different use cases that users would have for the service
  • considered the different challenges and pain points which the service could address
  • considered the risks associated with using AI in government, including reputational risk and misuse, and have strategies in place to mitigate these risks
  • had a full multidisciplinary team for the alpha, worked closely both in person and remotely, and applied all the agile ceremonies and artefacts to their work
  • considered in detail the areas where performance analytics can drive improvements in the service, and begun to develop a dashboard to give the team and stakeholders insights
  • considered the data security implications of using AI, and supplemented this with a training course on usage and expectations to prevent data such as PII being used; the team were committed to implementing additional automated controls through beta, such as LLM Guard (a minimal illustration of this kind of control is sketched after this list), and to moving the models to UK-based services before public beta
  • have a plan for open sourcing the work to be done in private beta and beyond
  • engaged with a great range and depth of stakeholders on the use of AI and LLMs across government
  • worked at pace, using a variety of methods to prototype an innovative technology and navigate a production environment where there is little precedent to learn from
  • took an in-depth approach to evaluating LLM providers to determine the most effective market offer
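For readers unfamiliar with automated controls of the kind mentioned above, the sketch below illustrates, in Python, the general shape of a pre-submission check that redacts likely PII from a prompt before it is sent to an externally hosted model. It is a minimal hypothetical illustration only: the patterns, names and behaviour are assumptions made for the example, not the team's implementation and not LLM Guard's API.

```python
import re

# Hypothetical illustration only: a crude pre-submission screen that redacts
# obvious PII patterns from a prompt before it is sent to a hosted model.
# A production control (such as LLM Guard) would use proper detectors and
# policies rather than a handful of regular expressions.

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "uk_phone": re.compile(r"(?<!\d)(?:\+44|0)(?:[\s-]?\d){9,10}(?!\d)"),
    "national_insurance": re.compile(
        r"\b[A-CEGHJ-PR-TW-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b", re.I
    ),
}


def screen_prompt(prompt: str) -> tuple[str, list[str]]:
    """Redact likely PII and report which categories were found."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[REDACTED {label.upper()}]", prompt)
    return prompt, findings


if __name__ == "__main__":
    draft = "Quote the press officer jane.doe@example.gov.uk on 020 7946 0000."
    safe_prompt, found = screen_prompt(draft)
    print(found)        # ['email', 'uk_phone']
    print(safe_prompt)  # placeholders replace the PII before submission
```

The point of the illustration is simply that the check happens automatically before submission, rather than relying on each user to spot sensitive data; this is the 'fails safe' property referred to under point 3 below.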

1. Understand users and their needs

Decision

The service was rated amber for point 1 of the Standard.

During the assessment, we didn’t see evidence of:

  • how the design of the service is connected to user needs, for example through user need statements that demonstrate the team’s awareness and understanding of their users (including how needs can differ by factors such as role, discipline and digital confidence) and specific user needs such as ‘I need to summarise and assess evidence’.
  • structuring and prioritising user needs to support their use for design, e.g. into primary, secondary and tertiary levels.
  • differentiated personas or user profiles which demonstrate the varied goals, challenges and needs of different users.
  • use of personas and selection of participants that went beyond verifying the initial project assumptions and could provide a wider range of insight to challenge those assumptions.
  • development of wider underlying behavioural personas or mindsets to support design decisions that could be applied to the wider cross-government user base, for example to understand how the design can support positive user behaviours and mitigate negative ones.
  • a research plan going forward to understand how users actually use the product’s output within the context of their job, and how this insight can support product iteration.

2. Solve a whole problem for users

Decision

The service was rated amber for point 2 of the Standard.

During the assessment, we didn’t see evidence of:

  • a plan to understand the wider context of the product’s use across other government departments (OGDs) as a decision support tool, and the wider context and process within a specific piece of work, moving beyond producing a ‘first draft’.
  • how to make it easy for users to focus on the next step in a work journey, and how the product could best integrate with the preceding and following work stages of that journey.
  • specific plans for how to approach the design of the integration into GCS.

3. Provide a joined-up experience across all channels

Decision

The service was rated amber for point 3 of the Standard.

During the assessment, we didn’t see evidence of:

  • plans for monitoring human error in data input that do not rely on limited human resource.
  • plans for understanding and reducing the risk of human error leading to PII data being transferred abroad, and design approaches to ensure this ‘fails safe’.
  • plans to consider non-digital parts of the journey, and to identify a model for user support mapped onto ‘failure points’ in the product.
  • plans to ensure that scaling up succeeds and to identify wider implementation needs, for example training needs and variation, wider context of use and the QA process.
  • plans to enable OGDs to learn from use of the tool, and feed this insight back to product iteration.

4. Make the service simple to use

Decision

The service was rated amber for point 4 of the Standard.

During the assessment, we didn’t see evidence of:

  • plans for user testing to ensure that the data structure and data hierarchy guide the design of the user interface to support task completion and reduce cognitive load.
  • specific plans for identifying potential failure points, supporting users to complete tasks, and including assisted digital or non-digital routes when a failure point occurs.
  • plans to develop an understanding of support needs. GCS Connect has a support contract, and there is potential for the product to be added to it, but what are the user and product support needs, and how far would the GCS Connect support contract meet them?

5. Make sure everyone can use the service

Decision

The service was rated amber for point 5 of the Standard.

During the assessment, we didn’t see evidence of:

  • testing the service with users with access needs or users of assistive software.
  • a plan of how to understand and meet the needs of ‘non-standard’ users and the use of assistive technologies, especially screen readers.
  • how the team will ensure the private beta is conducted with an inclusive and varied range of users and how they will avoid sampling bias (i.e. users who have volunteered to participate in the private beta because of their experience or interest in AI technologies).
  • plans to understand the wider context of the users’ environment when using the service, to ensure that potential drivers of unintended consequences and risks are understood and mitigated.
  • incorporating understanding of public perception of risk and trust into design and implementation decisions to identify and mitigate potential risks, e.g. reputational or unintended consequences.
  • a wider understanding of the drivers of users’ resistance to using AI and the perceived risks, and the implications for design decisions, e.g. how can you design for user trust of the product?
  • plans for review of content to meet user need, reduce cognitive load and support task completion.

6. Have a multidisciplinary team

Decision

The service was rated green for point 6 of the Standard.

7. Use agile ways of working

Decision

The service was rated green for point 7 of the Standard.

8. Iterate and improve frequently

Decision

The service was rated amber for point 8 of the Standard.

During the assessment, we didn’t see evidence of:

  • specific plans for linking user needs with qualitative and quantitative data to test hypotheses that drive product iteration, supported by a performance framework to surface trade-offs, e.g. between process efficiency and quality of output.
  • a plan for collecting data on the prompts used to achieve the desired outputs, and how to use this insight to iterate and improve prompt content and interface design.
  • a plan to understand the range of users’ experience and working practices, and the implications for the design and implementation of the product, for example how wider job design and culture influence users’ behaviour with the product.
  • plans for standardising the user peer review process for product outputs, so that scaling and implementation support and drive QA, and for using this peer review insight to support product iteration.
  • understanding of the risky assumptions behind potential unintended consequences of the product’s use, and how this links to the research approach and associated risk mitigation activities.

9. Create a secure service which protects users’ privacy

Decision

The service was rated green for point 9 of the Standard.

10. Define what success looks like and publish performance data

Decision

The service was rated green for point 10 of the Standard.

11. Choose the right tools and technology

Decision

The service was rated green for point 11 of the Standard.

12. Make new source code open

Decision

The service was rated green for point 12 of the Standard.

13. Use and contribute to open standards, common components and patterns

Decision

The service was rated green for point 13 of the Standard.

14. Operate a reliable service

Decision

The service was rated green for point 14 of the Standard.

Next steps

This service can now move into a private beta phase, subject to addressing the amber points within three months’ time and to CDDO spend approval.

To get the service ready to launch on GOV.UK the team needs to:

  • get a GOV.UK service domain name
  • work with the GOV.UK content team on any changes required to GOV.UK content
  • address the points of the Standard that are rated amber within three months’ time, subject to CDDO spend approval

Updates to this page

Published 3 December 2024