Independent report

International AI Safety Report 2025

Updated 18 February 2025

© Crown owned 2025

This publication is licensed under the terms of the Open Government Licence v3.0 except where otherwise stated.

You can view this licence at nationalarchives.gov.uk/doc/open-government-licence/version/3 or email: psi@nationalarchives.gsi.gov.uk.

Alternatively you can write to:

The Information Policy Team
The National Archives
Kew
London
TW9 4DU

Where we have identified any third-party copyright information you will need to obtain permission from the copyright holders concerned. Any enquiries regarding this publication should be sent to: secretariat.AIStateofScience@dsit.gov.uk. Enquiries regarding the content of the report should also be sent to the Scientific Lead.

Disclaimer

The report does not represent the views of the Chair, any particular individual in the writing or advisory groups, nor any of the governments that have supported its development. This report is a synthesis of the existing research on the capabilities and risks of advanced AI. The Chair of the report has ultimate responsibility for it, and has overseen its development from beginning to end.

Contributors

Chair

Prof. Yoshua Bengio, Université de Montréal / Mila - Quebec AI Institute

Expert Advisory Panel

This international panel was nominated by the governments of the 30 countries listed below, the UN, EU, and OECD.

Country or organisation Panel member
Australia Bronwyn Fox, the University of New South Wales
Brazil André Carlos Ponce de Leon Ferreira de Carvalho, Institute of Mathematics and Computer Sciences, University of São Paulo
Canada Mona Nemer, Chief Science Advisor of Canada 
Chile Raquel Pezoa Rivera, Universidad Técnica Federico Santa Maria
China Yi Zeng, Chinese Academy of Sciences
European Union Juha Heikkilä, European AI Office
France Guillaume Avrin, National Coordination for Artificial Intelligence
Germany Antonio Krüger, German Research Center for Artificial Intelligence
India Balaraman Ravindran, Wadhwani School of Data Science and AI, Indian Institute of Technology Madras
Indonesia  Hammam Riza, Collaborative Research and Industrial Innovation in Artificial Intelligence (KORIKA)
Ireland Ciarán Seoighe, Research Ireland
Israel Ziv Katzir, Israel Innovation Authority
Italy Andrea Monti, Legal Expert for the Undersecretary of State for the Digital Transformation, Italian Ministers Council’s Presidency
Japan Hiroaki Kitano, Sony Group Corporation
Kenya Nusu Mwamanzi, Ministry of ICT and Digital Economy
Kingdom of Saudi Arabia  Fahad Albalawi, Saudi Authority for Data and Artificial Intelligence
Mexico José Ramón López Portillo, LobsterTel
Netherlands Haroon Sheikh, Netherlands’ Scientific Council for Government Policy
New Zealand Gill Jolly, Ministry of Business, Innovation and Employment
Nigeria Olubunmi Ajala, Ministry of Communications, Innovation and Digital Economy ​
OECD  Jerry Sheehan, Director of the Directorate for Science, Technology and Innovation
Philippines  Dominic Vincent Ligot, CirroLytix
Republic of Korea Kyoung Mu Lee, Department of Electrical and Computer Engineering, Seoul National University
Rwanda Crystal Rugege, National Center for AI and Innovation Policy
Singapore Denise Wong, Data Innovation and Protection Group, Infocomm Media Development Authority
Spain Nuria Oliver, ELLIS Alicante
Switzerland Christian Busch, Federal Department of Economic Affairs, Education and Research
Türkiye Ahmet Halit Hatip, Turkish Ministry of Industry and Technology
Ukraine Oleksii Molchanovskyi, Expert Committee on the Development of Artificial Intelligence in Ukraine
United Arab Emirates Marwan Alserkal, Ministry of Cabinet Affairs, Prime Minister’s Office
United Kingdom Chris Johnson, Chief Scientific Adviser in the Department for Science, Innovation and Technology
United Nations Amandeep Singh Gill, Under-Secretary-General for Digital and Emerging Technologies and Secretary-General’s Envoy on Technology
United States Saif M. Khan, U.S. Department of Commerce

Scientific Lead

Sören Mindermann, Mila - Quebec AI Institute

Lead Writer

Daniel Privitera, KIRA Center

Writing Group

Group member Organisation
Tamay Besiroglu Epoch AI
Rishi Bommasani Stanford University
Stephen Casper Massachusetts Institute of Technology
Yejin Choi Stanford University
Philip Fox KIRA Center
Ben Garfinkel University of Oxford
Danielle Goldfarb Mila - Quebec AI Institute
Hoda Heidari Carnegie Mellon University
Anson Ho Epoch AI
Sayash Kapoor Princeton University
Leila Khalatbari Hong Kong University of Science and Technology
Shayne Longpre Massachusetts Institute of Technology
Sam Manning Centre for the Governance of AI
Vasilios Mavroudis The Alan Turing Institute
Mantas Mazeika University of Illinois at Urbana-Champaign
Julian Michael New York University
Jessica Newman University of California, Berkeley
Kwan Yee Ng Concordia AI
Chinasa T. Okolo Brookings Institution
Deborah Raji University of California, Berkeley
Girish Sastry Independent
Elizabeth Seger (generalist writer) Demos
Theodora Skeadas Humane Intelligence
Tobin South Massachusetts Institute of Technology
Emma Strubell Carnegie Mellon University
Florian Tramèr ETH Zurich
Lucia Velasco Maastricht University
Nicole Wheeler University of Birmingham

Senior Advisers

Senior adviser Organisation
Daron Acemoglu Massachusetts Institute of Technology
Olubayo Adekanmbi contributed as a Senior Adviser prior to taking up his role at EqualyzAI
David Dalrymple Advanced Research and Invention Agency (ARIA)
Thomas G. Dietterich Oregon State University
Edward Felten Princeton University
Pascale Fung Contributed as a Senior Adviser prior to taking up her role at Meta
Pierre-Olivier Gourinchas Research Department, International Monetary Fund (IMF)
Fredrik Heintz Linköping University
Geoffrey Hinton University of Toronto
Nick Jennings University of Loughborough
Andreas Krause ETH Zurich
Susan Leavy University College Dublin
Percy Liang Stanford University
Teresa Ludermir Federal University of Pernambuco
Vidushi Marda AI Collaborative
Helen Margetts University of Oxford
John McDermid University of York
Jane Munga Carnegie Endowment for International Peace
Arvind Narayanan Princeton University
Alondra Nelson Institute for Advanced Study
Clara Neppel IEEE
Alice Oh KAIST School of Computing
Gopal Ramchurn Responsible AI UK
Stuart Russell University of California, Berkeley
Marietje Schaake Stanford University
Bernhard Schölkopf ELLIS Institute Tübingen
Dawn Song University of California, Berkeley
Alvaro Soto Pontificia Universidad Católica de Chile
Lee Tiedrich Duke University
Gaël Varoquaux Inria
Andrew Yao Institute for Interdisciplinary Information Sciences, Tsinghua University
Ya-Qin Zhang Tsinghua University

Secretariat

AI Safety Institute

Baran Acar
Ben Clifford
Lambrini Das
Claire Dennis
Freya Hempleman
Hannah Merchant
Rian Overy
Ben Snodin

Mila - Quebec AI Institute

Jonathan Barry
Benjamin Prud’homme

Acknowledgements


Civil society and industry reviewers


Civil Society: Ada Lovelace Institute, AI Forum New Zealand  / Te Kāhui Atamai Iahiko o Aotearoa, Australia’s Temporary AI Expert Group, Carnegie Endowment for International Peace, Center for Law and Innovation / Certa Foundation, Centre for the Governance of AI, Chief Justice Meir Shamgar Center for Digital Law and Innovation, Eon Institute, Gradient Institute, Israel Democracy Institute, Mozilla Foundation, Old Ways New, RAND, SaferAI, The Centre for Long-Term Resilience, The Future Society, The Alan Turing Institute, The Royal Society, Türkiye Artificial Intelligence Policies Association.

Industry: Advai, Anthropic, Cohere, Deloitte Consulting USA and Deloitte LLM UK, G42, Google DeepMind, Harmony Intelligence, Hugging Face, IBM, Lelapa AI, Meta, Microsoft, Shutterstock, Zhipu.ai.

Special thanks

The Secretariat appreciates the support, comments and feedback from Angie Abdilla, Concordia AI, Nitarshan Rajkumar, Geoffrey Irving, Shannon Vallor, Rebecca Finlay and Andrew Strait.

Forewords

Building a shared scientific understanding in a fast-moving field

I am honoured to present the International AI Safety Report. It is the work of 96 international AI experts who collaborated in an unprecedented effort to establish an internationally shared scientific understanding of risks from advanced AI and methods for managing them.

We embarked on this journey just over a year ago, shortly after the countries present at the Bletchley Park AI Safety Summit agreed to support the creation of this report. Since then, we published an interim report in May 2024, which was presented at the AI Seoul Summit. We are now pleased to publish the present, full report ahead of the AI Action Summit in Paris in February 2025.

Since the Bletchley Summit, the capabilities of general-purpose AI, the type of AI this report focuses on, have increased further. For example, new models have shown markedly better performance at tests of programming and scientific reasoning. In addition, many companies are now investing in the development of general-purpose AI ‘agents’ – systems which can autonomously plan and act to achieve goals with little or no human oversight.

Building on the Interim Report (May 2024), the present report reflects these new developments. In addition, the experts contributing to this report made several other changes compared to the Interim Report. For example, they worked to further improve the scientific rigour of all sections, added discussion of additional topics such as open-weight models, and restructured the report to be more relevant to policymakers, including by highlighting evidence gaps and key challenges for policymakers.

I extend my profound gratitude to the team of experts who contributed to this report, including our writers, senior advisers, and the international Expert Advisory Panel. I have been impressed with their scientific excellence and expertise as well as the collaborative attitude with which they have approached this challenging project. I am also grateful to the industry and civil society organisations who reviewed the report, contributing invaluable feedback that has led this report to be more comprehensive than it otherwise would have been. My thanks also go to the UK Government for starting this process and offering outstanding operational support. It was also important for me that the UK Government agreed that the scientists writing this report should have complete independence.

AI remains a fast-moving field. To keep up with this pace, policymakers and governments need to have access to the current scientific understanding on what risks advanced AI might pose. I hope that this report as well as future publications will help decision-makers ensure that people around the world can reap the benefits of AI safely.

Professor Yoshua Bengio
Université de Montréal / Mila – Quebec AI Institute and Chair

Taking advantage of AI opportunities safely calls for global collaboration

Since the interim version of this report was published, the capabilities of advanced AI capabilities have continued to grow. We know that this technology, if developed and utilised safely and responsibly, offers extraordinary opportunities: to grow our economies, modernise our public services, and improve lives for our people. To seize these opportunities, it is imperative that we deepen our collective understanding of how AI can be developed safely.

This landmark report is testament to the value of global cooperation in forging this shared understanding. It is the result of over 90 AI experts from different continents, sectors, and areas of expertise, coming together to offer leaders and decision-makers a global reference point and a tool to inform policy on AI safety. Our collective understanding of frontier AI systems has improved. However, this report highlights that frontier AI remains a field of active scientific inquiry, with experts continuing to disagree on its trajectory and the scope of its impact. We will maintain the momentum behind this collective effort to drive global scientific consensus. We are excited to continue this unprecedented and essential project of international collaboration.

The report lays the foundation for important discussions at the AI Action Summit in France this year, which will convene international governments, leading AI companies, civil society groups and experts. This Summit, like the report, is a continuation of the milestones achieved at the Bletchley Park (November 2023) and Seoul (May 2024) summits. AI is the defining opportunity of our generation. Together, we will continue the conversation and support bold and ambitious action to collectively master the risks of AI and benefit from these new technologies for the greater good. There will be no adoption of this technology without safety: safety brings trust!

We are pleased to present this report and thank Professor Yoshua Bengio and the writing team for the significant work that went into its development. The UK and France look forward to continuing the discussion at the AI Action Summit in February.

Clara Chappaz
France’s Minister Delegate for Artificial Intelligence

The Rt Hon Peter Kyle MP
Secretary of State for Science, Innovation and Technology

About this report

  • This is the first International Scientific Report on the Safety of Advanced AI. Following an interim publication in May 2024, a diverse group of 96 Artificial Intelligence (AI) experts contributed to this first full report, including an international Expert Advisory Panel nominated by 30 countries, the Organisation for Economic Co-operation and Development (OECD), the European Union (EU), and the United Nations (UN). The report aims to provide scientific information that will support informed policymaking. It does not recommend specific policies.
  • The report is the work of independent experts. Led by the Chair, the independent experts writing this report collectively had full discretion over its content.
  • While this report is concerned with AI risks and AI safety, AI also offers many potential benefits for people, businesses, and society. There are many types of AI, each with different benefits and risks. Most of the time, in most applications, AI helps individuals and organisations be more effective. But people around the world will only be able to fully enjoy AI’s many potential benefits safely if its risks are appropriately managed. This report focuses on identifying these risks and evaluating methods for mitigating them. It does not aim to comprehensively assess all possible societal impacts of AI, including its many potential benefits.
  • The focus of the report is general-purpose AI. The report restricts its focus to a type of AI that has advanced particularly rapidly in recent years, and whose associated risks have been less studied and understood: general-purpose AI, or AI that can perform a wide variety of tasks. The analysis in this report focuses on the most advanced general-purpose AI systems at the time of writing, as well as future systems that might be even more capable.
  • The report summarises the scientific evidence on three core questions: What can general-purpose AI do? What are risks associated with general-purpose AI? And what mitigation techniques are there against these risks?
  • The stakes are high. We, the experts contributing to this report, continue to disagree on several questions, minor and major, around general-purpose AI capabilities, risks, and risk mitigations. But we consider this report essential for improving our collective understanding of this technology and its potential risks. We hope that the report will help the international community to move towards greater consensus about general-purpose AI and mitigate its risks more effectively, so that people can safely experience its many potential benefits. The stakes are high. We look forward to continuing this effort.

Update on latest AI advances after the writing of this report: Chair’s note

Between the end of the writing period for this report (5 December 2024) and the publication of this report in January 2025, an important development took place. The AI company OpenAI shared early test results from a new AI model, o3. These results indicate significantly stronger performance than any previous model on a number of the field’s most challenging tests of programming, abstract reasoning, and scientific reasoning. In some of these tests, o3 outperforms many (but not all) human experts. Additionally, it achieves a breakthrough on a key abstract reasoning test that many experts, including myself, thought was out of reach until recently. However, at the time of writing there is no public information about its real-world capabilities, particularly for solving more open‑ended tasks.

Scores of notable models on key benchmarks over time

Figure 0.1: Scores of notable general-purpose AI models on key benchmarks from June 2023 to December 2024

Note. o3 showed significantly improved performance compared to the previous state of the art (shaded region). These benchmarks are some of the field’s most challenging tests of programming, abstract reasoning, and scientific reasoning. For the unreleased o3, the announcement date is shown; for the other models, the release date is shown. Some of the more recent AI models, including o3, benefited from improved scaffolding and more computation at test-time. Sources: Anthropic, 2024; Chollet, 2024; Chollet et al., 2025; Epoch AI, 2024; Glazer et al. 2024; OpenAI, 2024a; OpenAI, 2024b; Jimenez et al., 2024; Jimenez et al., 2025.

The o3 results are evidence that the pace of advances in AI capabilities may remain high or even accelerate. More specifically, they suggest that giving models more computing power for solving a given problem (‘inference scaling’) may help overcome previous limitations. Generally speaking, inference scaling makes models more expensive to use. But as another recent notable model, R1, released by the company DeepSeek in January 2025, has shown, researchers are successfully working on lowering these costs. Overall, inference scaling may allow AI developers to make further advances going forward. The o3 results also underscore the need to better understand how AI developers’ growing use of AI may affect the speed of further AI development itself.

The trends evidenced by o3 could have profound implications for AI risks. Advances in science and programming capabilities have previously generated more evidence for risks such as cyber and biological attacks. The o3 results are also relevant to potential labour market impacts, loss of control risk, and energy use among others. But o3’s capabilities could also be used to help protect against malfunctions and malicious uses. Overall, the risk assessments in this report should be read with the understanding that AI has gained capabilities since the report was written. However, so far there is no evidence yet about o3’s real world impacts, and no information to confirm nor rule out major novel and/or immediate risks.

The improvement in capabilities suggested by the o3 results and our limited understanding of the implications for AI risks underscore a key challenge for policymakers that this report identifies: they will often have to weigh potential benefits and risks of imminent AI advancements without having a large body of scientific evidence available. Nonetheless, generating evidence on the safety and security implications of the trends implied by o3 will be an urgent priority for AI research in the coming weeks and months.

Key findings of the report

  • The capabilities of general-purpose AI, the type of AI that this report focuses on, have increased rapidly in recent years and have improved further in recent months. A few years ago, the best large language models (LLMs) could rarely produce a coherent paragraph of text. Today, general-purpose AI can write computer programs, generate custom photorealistic images, and engage in extended open-ended conversations. Since the publication of the Interim Report (May 2024), new models have shown markedly better performance at tests of scientific reasoning and programming.
  • Many companies are now investing in the development of general-purpose AI agents, as a potential direction for further advancement. AI agents are general-purpose AI systems which can autonomously act, plan, and delegate to achieve goals with little to no human oversight. Sophisticated AI agents would be able to, for example, use computers to complete longer projects than current systems, unlocking both additional benefits and additional risks.
  • Further capability advancements in the coming months and years could be anything from slow to extremely rapid. (See Chair’s update on the latest AI advances). Progress will depend on whether companies will be able to rapidly deploy even more data and computational power to train new models, and whether ‘scaling’ models in this way will overcome their current limitations. Recent research suggests that rapidly scaling up models may remain physically feasible for at least several years. But major capability advances may also require other factors: for example, new research breakthroughs, which are hard to predict, or the success of a novel scaling approach that companies have recently adopted.
  • Several harms from general-purpose AI are already well established. These include [scams, non-consensual intimate imagery (NCII) and child sexual abuse material (CSAM), model outputs that are biased against certain groups of people or certain opinions, reliability issues, and privacy violations. Researchers have developed mitigation techniques for these problems, but so far no combination of techniques can fully resolve them. Since the publication of the Interim Report (May 2024), new evidence of discrimination related to general-purpose AI systems has revealed more subtle forms of bias.
  • As general-purpose AI becomes more capable, evidence of additional risks is gradually emerging. These include risks such as large-scale labour market impacts, AI-enabled hacking] or biological attacks, and society losing control over general-purpose AI. Experts interpret the existing evidence on these risks differently: some think that such risks are decades away, while others think that general-purpose AI could lead to societal-scale harm within the next few years. Recent advances in general-purpose AI capabilities – particularly in tests of scientific reasoning and programming – have generated new evidence for potential risks such as AI-enabled hacking and biological attacks, leading one major AI company to increase its assessment of biological risk from its best model from ‘low’ to ‘medium’.
  • Risk management techniques are nascent, but progress is possible. There are various technical methods to assess and reduce risks from general-purpose AI that developers can employ and regulators can require, but they all have limitations. For example, current interpretability techniques for explaining why a general-purpose AI model produced any given output remain severely limited. However, researchers are making some progress in addressing these limitations. In addition, researchers and policymakers are increasingly trying to standardise risk management approaches, and to coordinate internationally.
  • The pace and unpredictability of advancements in general-purpose AI pose an ‘evidence dilemma’ for policymakers. (See Chair’s update on the latest AI advances). Given sometimes rapid and unexpected advancements, policymakers will often have to weigh potential benefits and risks of imminent AI advancements without having a large body of scientific evidence available. In doing so, they face a dilemma. On the one hand, preemptive risk mitigation measures based on limited evidence might turn out to be ineffective or unnecessary. On the other hand, waiting for stronger evidence of impending risk could leave society unprepared or even make mitigation impossible, for instance if sudden leaps in AI capabilities, and their associated risks, occur. Companies and governments are developing early warning systems and risk management frameworks that may reduce this dilemma. Some of these trigger specific mitigation measures when there is new evidence of risks, while others require developers to provide evidence of safety before releasing a new model.
  • There is broad consensus among researchers that advances regarding the following questions would be helpful: How rapidly will general-purpose AI capabilities advance in the coming years, and how can researchers reliably measure that progress? What are sensible risk thresholds to trigger mitigations? How can policymakers best gain access to information about general-purpose AI that is relevant to public safety? How can researchers, technology companies, and governments reliably assess the risks of general-purpose AI development and deployment? How do general-purpose AI models work internally? How can general-purpose AI be designed to behave reliably?
  • AI does not happen to us; choices made by people determine its future. The future of general-purpose AI technology is uncertain, with a wide range of trajectories appearing to be possible even in the near future, including both very positive and very negative outcomes. This uncertainty can evoke fatalism and make AI appear as something that happens to us. But it will be the decisions of societies and governments on how to navigate this uncertainty that determine which path we will take. This report aims to facilitate constructive and evidence-based discussion about these decisions.

Executive summary

The purpose of this report

This report synthesises the state of scientific understanding of general-purpose AI – AI that can perform a wide variety of tasks – with a focus on understanding and managing its risks.

This report summarises the scientific evidence on the safety of general-purpose AI. The purpose of this report is to help create a shared international understanding of risks from advanced AI and how they can be mitigated. To achieve this, this report focuses on general-purpose AI – or AI that can perform a wide variety of tasks– since this type of AI has advanced particularly rapidly in recent years and has been deployed widely by technology companies for a range of consumer and business purposes. The report synthesises the state of scientific understanding of general-purpose AI, with a focus on understanding and managing its risks.

Amid rapid advancements, research on general-purpose AI is currently in a time of scientific discovery, and – in many cases – is not yet settled science. The report provides a snapshot of the current scientific understanding of general-purpose AI and its risks. This includes identifying areas of scientific consensus and areas where there are different views or gaps in the current scientific understanding.

People around the world will only be able to fully enjoy the potential benefits of general-purpose AI safely if its risks are appropriately managed. This report focuses on identifying those risks and evaluating technical methods for assessing and mitigating them, including ways that general-purpose AI itself can be used to mitigate risks. It does not aim to comprehensively assess all possible societal impacts of general-purpose AI. Most notably, the current and potential future benefits of general-purpose AI – although they are vast – are beyond this report’s scope. Holistic policymaking requires considering both the potential benefits of general-purpose AI and the risks covered in this report. It also requires taking into account that other types of AI have different risk/benefit profiles compared to current general-purpose AI.

The three main sections of the report summarise the scientific evidence on three core questions: What can general-purpose AI do? What are risks associated with general-purpose AI? And what mitigation techniques are there against these risks?

Section 1: - Capabilities of general-purpose AI: What can general-purpose AI do now and in the future?

General-purpose AI capabilities have improved rapidly in recent years, and further advancements could be anything from slow to extremely rapid.

What AI can do is a key contributor to many of the risks it poses, and according to many metrics, general-purpose AI capabilities have been progressing rapidly. Five years ago, the leading general-purpose AI language models could rarely produce a coherent paragraph of text. Today, some general-purpose AI models can engage in conversations on a wide range of topics, write computer programs, or generate realistic short videos from a description. However, it is technically challenging to reliably estimate and describe the capabilities of general-purpose AI.

AI developers have rapidly improved the capabilities of general-purpose AI in recent years, mostly through ‘scaling’. They have continually increased the resources used for training new models (this is often referred to as ‘scaling’) and refined existing approaches to use those resources more efficiently. For example, according to recent estimates, state-of-the-art AI models have seen annual increases of approximately 4x in computational resources (‘compute’) used for training and 2.5x in training dataset size.

The pace of future progress in general-purpose AI capabilities has substantial implications for managing emerging risks, but experts disagree on what to expect even in the coming months and years. Experts variously support the possibility of general-purpose AI capabilities advancing slowly, rapidly, or extremely rapidly.

Experts disagree about the pace of future progress because of different views on the promise of further ‘scaling’ – and companies are exploring an additional, new type of scaling that might further accelerate capabilities. (See Chair’s update on the latest AI advances). While scaling has often overcome the limitations of previous systems, experts disagree about its potential to resolve the remaining limitations of today’s systems, such as unreliability at acting in the physical world and at executing extended tasks on computers. In recent months, a new type of scaling has shown potential for further improving capabilities: rather than just scaling up the resources used for training models, AI companies are also increasingly interested in ‘inference scaling’ – letting an already trained model use more computation to solve a given problem, for example to improve on its own solution, or to write so-called ‘chains of thought’ that break down the problem into simpler steps.

Several leading companies that develop general-purpose AI are betting on ‘scaling’ to continue leading to performance improvements. If recent trends continue, by the end of 2026 some general-purpose AI models will be trained using roughly 100x more training compute than 2023’s most compute-intensive models, growing to 10,000x more training compute by 2030, combined with algorithms that achieve greater capabilities for a given amount of available computation. In addition to this potential scaling of training resources, recent trends such as inference scaling and using models to generate training data could mean that even more compute will be used overall. However, there are potential bottlenecks to further increasing both data and compute rapidly, such as the availability of data, AI chips, capital, and local energy capacity. Companies developing general-purpose AI are working to navigate these potential bottlenecks.

Since the publication of the Interim Report (May 2024), general-purpose AI has reached expert-level performance in some tests and competitions for scientific reasoning and programming, and companies have been making large efforts to develop autonomous AI agents. Advances in science and programming have been driven by inference scaling techniques such as writing long ‘chains of thought’. New studies suggest that further scaling such approaches, for instance allowing models to analyse problems by writing even longer chains of thought than today’s models, could lead to further advances in domains where reasoning matters more, such as science, software engineering, and planning. In addition to this trend, companies are making large efforts to develop more advanced general-purpose AI agents, which can plan and act autonomously to work towards a given goal. Finally, the market price of using general-purpose AI of a given capability level has dropped sharply, making this technology more broadly accessible and widely used.

This report focuses primarily on technical aspects of AI progress, but how fast general-purpose AI will advance is not a purely technical question. The pace of future advancements will also depend on non-technical factors, potentially including the approaches that governments take to regulating AI. This report does not discuss how different approaches to regulation might affect the speed of development and adoption of general-purpose AI.

Section 2 - Risks: What are risks associated with general-purpose AI?

Several harms from general-purpose AI are already well-established. As general-purpose AI becomes more capable, evidence of additional risks is gradually emerging.

This report classifies general-purpose AI risks into three categories: malicious use risks, risks from malfunctions, and systemic risks. Each of these categories contains risks that have already materialised as well as risks that might materialise in the next few years.

Risks from malicious use:malicious actors can use general-purpose AI to cause harm to individuals, organisations, or society. Forms of malicious use include:

  • Harm to individuals through fake content: Malicious actors can currently use general-purpose AI to generate fake content that harms individuals in a targeted way. These malicious uses include non-consensual ‘deepfake’ pornography and AI-generated CSAM, financial fraud through voice impersonation, blackmail for extortion, sabotage of personal and professional reputations, and psychological abuse. However, while incident reports of harm from AI-generated fake content are common, reliable statistics on the frequency of these incidents are still lacking.
  • Manipulation of public opinion: General-purpose AI makes it easier to generate persuasive content at scale. This can help actors who seek to manipulate public opinion, for instance to affect political outcomes. However, evidence on how prevalent and how effective such efforts are remains limited. Technical countermeasures like content watermarking, although useful, can usually be circumvented by moderately sophisticated actors.
  • Cyber offence: General-purpose AI can make it easier or faster for malicious actors of varying skill levels to conduct cyberattacks. Current systems have demonstrated capabilities in low- and medium-complexity cybersecurity tasks, and state-sponsored actors are actively exploring AI to survey target systems. New research has confirmed that the capabilities of general-purpose AI related to cyber offence are significantly advancing, but it remains unclear whether this will affect the balance between attackers and defenders.
  • Biological and chemical attacks: Recent general-purpose AI systems have displayed some ability to provide instructions and troubleshooting guidance for reproducing known biological and chemical weapons and to facilitate the design of novel toxic compounds. In new experiments that tested for the ability to generate plans for producing biological weapons, a general-purpose AI system sometimes performed better than human experts with access to the internet. In response, one AI company increased its assessment of biological risk from its best model from ‘low’ to ‘medium’. However, real-world attempts to develop such weapons still require substantial additional resources and expertise. A comprehensive assessment of biological and chemical risk is difficult because much of the relevant research is classified.

Since the publication of the Interim Report (May 2024), general-purpose AI has become more capable in domains that are relevant for malicious use. For example, researchers have recently built general-purpose AI systems that were able to find and exploit some cybersecurity vulnerabilities on their own and, with human assistance, discover a previously unknown vulnerability in widely-used software. General-purpose AI capabilities related to reasoning and to integrating different types of data, which can aid research on pathogens or in other dual-use fields, have also improved.

Risks from malfunctions: general-purpose AI can also cause unintended harm. Even when users have no intention to cause harm, serious risks can arise due to the malfunctioning of general-purpose AI. Such malfunctions include:

  • Reliability issues: Current general-purpose AI can be unreliable, which can lead to harm. For example, if users consult a general-purpose AI system for medical or legal advice, the system might generate an answer that contains falsehoods. Users are often not aware of the limitations of an AI product, for example due to limited ‘AI literacy’, misleading advertising, or miscommunication. There are a number of known cases of harm from reliability issues, but still limited evidence on exactly how widespread different forms of this problem are.
  • Bias: General-purpose AI systems can amplify social and political biases, causing concrete harm. They frequently display biases with respect to race, gender, culture, age, disability, political opinion, or other aspects of human identity. This can lead to discriminatory outcomes including unequal resource allocation, reinforcement of stereotypes, and systematic neglect of underrepresented groups or viewpoints. Technical approaches for mitigating bias and discrimination in general-purpose AI systems are advancing, but face trade-offs between bias mitigation and competing objectives such as accuracy and privacy, as well as other challenges.
  • Loss of control:‘Loss of control’ scenarios are hypothetical future scenarios in which one or more general-purpose AI systems come to operate outside of anyone’s control, with no clear path to regaining control. There is broad consensus that current general-purpose AI lacks the capabilities to pose this risk. However, expert opinion on the likelihood of loss of control within the next several years varies greatly: some consider it implausible, some consider it likely to occur, and some see it as a modest-likelihood risk that warrants attention due to its high potential severity. Ongoing empirical and mathematical research is gradually advancing these debates.

Since the publication of the Interim Report (May 2024), new research has led to some new insights about risks of bias and loss of control. The evidence of bias in general-purpose AI systems has increased, and recent work has detected additional forms of AI bias. Researchers have observed modest further advancements towards AI capabilities that are likely necessary for commonly discussed loss of control scenarios to occur. These include capabilities for autonomously using computers, programming, gaining unauthorised access to digital systems, and identifying ways to evade human oversight.

Systemic risks: beyond the risks directly posed by capabilities of individual models, widespread deployment of general-purpose AI is associated with several broader systemic risks. Examples of systemic risks range from potential labour market impacts to privacy risks and environmental effects:

  • Labour market risks: General-purpose AI, especially if it continues to advance rapidly, has the potential to automate a very wide range of tasks, which could have a significant effect on the labour market. This means that many people could lose their current jobs. However, many economists expect that potential job losses could be offset, partly or potentially even completely, by the creation of new jobs and by increased demand in non-automated sectors.
  • Global AI R&D divide: General-purpose AI research and development (R&D) is currently concentrated in a few Western countries and China. This ‘AI divide’ has the potential to increase much of the world’s dependence on this small set of countries. Some experts also expect it to contribute to global inequality. The divide has many causes, including a number of causes that are not unique to AI. However, in significant part it stems from differing levels of access to the very expensive compute needed to develop general-purpose AI: most low- and middle-income countries (LMICs) have significantly less access to compute than high-income countries (HICs).
  • Market concentration and single points of failure: A small number of companies currently dominate the market for general-purpose AI. This market concentration could make societies more vulnerable to several systemic risks. For instance, if organisations across critical sectors, such as finance or healthcare, all rely on a small number of general-purpose AI systems, then a bug or vulnerability in such a system could cause simultaneous failures and disruptions on a broad scale.
  • Environmental risks:Growing compute use in general-purpose AI development and deployment has rapidly increased the amounts of energy, water, and raw material consumed in building and operating the necessary compute infrastructure. This trend shows no clear indication of slowing, despite progress in techniques that allow compute to be used more efficiently. General-purpose AI also has a number of applications that can either benefit or harm sustainability efforts.
  • Privacy risks: General-purpose AI can cause or contribute to violations of user privacy. For example, sensitive information that was in the training data can leak unintentionally when a user interacts with the system. In addition, when users share sensitive information with the system, this information can also leak. But general-purpose AI can also facilitate deliberate violations of privacy, for example if malicious actors use AI to infer sensitive information about specific individuals from large amounts of data. However, so far, researchers have not found evidence of widespread privacy violations associated with general-purpose AI.
  • Copyright infringements: General-purpose AI both learns from and creates works of creative expression, challenging traditional systems of data consent, compensation, and control. Data collection and content generation can implicate a variety of data rights laws, which vary across jurisdictions and may be under active litigation. Given the legal uncertainty around data collection practices, AI companies are sharing less information about the data they use. This opacity makes third-party AI safety research harder.

Since the publication of the Interim Report (May 2024), additional evidence on the labour market impacts of general-purpose AI has emerged, while new developments have heightened privacy and copyrights concerns. New analyses of labour market data suggest that individuals are adopting general-purpose AI very rapidly relative to previous technologies. The pace of adoption by businesses varies widely by sector. In addition, recent advances in capabilities have led to general-purpose AI being deployed increasingly in sensitive contexts such as healthcare or workplace monitoring, which creates new privacy risks. Finally, as copyright disputes intensify and technical mitigations to copyright infringements remain unreliable, data rights holders have been rapidly restricting access to their data.

Open-weight models: an important factor in evaluating many risks that a general-purpose AI model might pose is how it is released to the public. So-called ‘open-weight models’ are AI models whose central components, called ‘weights’, are shared publicly for download. Open weight access facilitates research and innovation, including in AI safety, as well as increasing transparency and making it easier for the research community to detect flaws in models. However, open-weight models can also pose risks, for example by facilitating malicious or misguided use that is difficult or impossible for the developer of the model to monitor or mitigate. Once model weights are available for public download, there is no way to implement a wholesale rollback of all existing copies or ensure that all existing copies receive safety updates. Since the Interim Report (May 2024), high-level consensus has emerged that risks posed by greater AI openness should be evaluated in terms of ‘marginal’ risk: the extent to which releasing an open-weight model would increase or decrease a given risk, relative to risks posed by existing alternatives such as closed models or other technologies.

Section 3 – Risk management: What techniques are there for managing risks from general-purpose AI?

Several technical approaches can help manage risks, but in many cases the best available approaches still have highly significant limitations and no quantitative risk estimation or guarantees that are available in other safety-critical domains.

Risk management – identifying and assessing risks, and then mitigating and monitoring them – is difficult in the context of general-purpose AI. Although risk management has also been highly challenging in many other domains, there are some features of general-purpose AI that appear to create distinctive difficulties.

Several technical features of general-purpose AI make risk management in this domain particularly difficult They include, among others:

  • The range of possible uses and use contexts for general-purpose AI systems is unusually broad. For example, the same system may be used to provide medical advice, analyse computer code for vulnerabilities, and generate photos. This increases the difficulty of comprehensively anticipating relevant use cases, identifying risks, or testing how systems will behave in relevant real-world circumstances.
  • Developers still understand little about how their general-purpose AI models operate. This lack of understanding makes it more difficult both to predict behavioural issues and to explain and resolve known issues once they are observed. Understanding remains elusive mainly because general-purpose AI models are not programmed in the traditional sense. Instead, they are trained: AI developers set up a training process that involves a large volume of data, and the outcome of that training process is the general-purpose AI model. The inner workings of these models are largely inscrutable, including to the model developers. Model explanation and ‘interpretability’ techniques can improve researchers’ and developers’ understanding of how general-purpose AI models operate, but, despite recent progress, this research remains nascent.
  • Increasingly capable AI agents – general-purpose AI systems that can autonomously act, plan, and delegate to achieve goals – will likely present new, significant challenges for risk management. AI agents typically work towards goals autonomously by using general software such as web browsers and programming tools. Currently, most are not yet reliable enough for widespread use, but companies are making large efforts to build more capable and reliable AI agents and have made progress in recent months. AI agents will likely become increasingly useful, but may also exacerbate a number of the risks discussed in this report and introduce additional difficulties for risk management. Examples of such potential new challenges include the possibility that users might not always know what their own AI agents are doing, the potential for AI agents to operate outside of anyone’s control, the potential for attackers to ‘hijack’ agents, and the potential for AI-to-AI interactions to create complex new risks. Approaches for managing risks associated with agents are only beginning to be developed.

Besides technical factors, several economic, political, and other societal factors make risk management in the field of general-purpose AI particularly difficult.

  • The pace of advancement in general-purpose AI creates an ‘evidence dilemma’ for decision-makers. (See Chair’s update on the latest AI advances). Rapid capability advancement makes it possible for some risks to emerge in leaps; for example, the risk of academic cheating using general-purpose AI shifted from negligible to widespread within a year. The more quickly a risk emerges, the more difficult it is to manage the risk reactively and the more valuable preparation becomes. However, so long as evidence for a risk remains incomplete, decision-makers also cannot know for sure whether the risk will emerge or perhaps even has already emerged. This creates a trade-off: implementing preemptive or early mitigation measures might prove unnecessary, but waiting for conclusive evidence could leave society vulnerable to risks that emerge rapidly. Companies and governments are developing early warning systems and risk management frameworks that may reduce this dilemma. Some of these trigger specific mitigation measures when there is new evidence of risks, while others require developers to provide evidence of safety before releasing a new model.
  • There is an information gap between what AI companies know about their AI systems and what governments and non-industry researchers know. Companies often share only limited information about their general-purpose AI systems, especially in the period before they are widely released. Companies cite a mixture of commercial concerns and safety concerns as reasons to limit information-sharing. However, this information gap also makes it more challenging for other actors to participate effectively in risk management, especially for emerging risks.
  • Both AI companies and governments often face strong competitive pressure, which may lead them to deprioritise risk management. In some circumstances, competitive pressure may incentivise companies to invest less time or other resources into risk management than they otherwise would. Similarly, governments may invest less in policies to support risk management in cases where they perceive trade-offs between international competition and risk reduction.

Nonetheless, there are various techniques and frameworks for managing risks from general-purpose AI that companies can employ and regulators can require. These include methods for identifying and assessing risks, as well as methods for mitigating and monitoring them.

  • Assessing general-purpose AI systems for risks is an integral part of risk management, but existing risk assessments are severely limited. Existing evaluations of general-purpose AI risk mainly rely on ‘spot checks’, i.e. testing the behaviour of a general-purpose AI in a set of specific situations. This can help surface potential hazards before deploying a model. However, existing tests often miss hazards and overestimate or underestimate general-purpose AI capabilities and risks, because test conditions differ from the real world.
  • For risk identification and assessment to be effective, evaluators need substantial expertise, resources, and sufficient access to relevant information. Rigorous risk assessment in the context of general-purpose AI requires combining multiple evaluation approaches. These range from technical analyses of the models and systems themselves to evaluations of possible risks from certain use patterns. Evaluators need substantial expertise to conduct such evaluations correctly. For comprehensive risk assessments, they often also need more time, more direct access to the models and their training data, and more information about the technical methodologies used than the companies developing general-purpose AI typically provide.
  • There has been progress in training general-purpose AI models to function more safely, but no current method can reliably prevent even overtly unsafe outputs. For example, a technique called ‘adversarial training’ involves deliberately exposing AI models to examples designed to make them fail or misbehave during training, aiming to build resistance to such cases. However, adversaries can still find new ways (‘attacks’) to circumvent these safeguards with low to moderate effort. In addition, recent evidence suggests that current training methods – which rely heavily on imperfect human feedback – may inadvertently incentivise models to mislead humans on difficult questions by making errors harder to spot. Improving the quantity and quality of this feedback is an avenue for progress, though nascent training techniques using AI to detect misleading behaviour also show promise.
  • Monitoring – identifying risks and evaluating performance once a model is already in use – and various interventions to prevent harmful actions can improve the safety of a general-purpose AI after it is deployed to users. Current tools can detect AI-generated content, track system performance, and identify potentially harmful inputs/outputs, though moderately skilled users can often circumvent these safeguards. Several layers of defence that combine technical monitoring and intervention capabilities with human oversight improve safety but can introduce costs and delays. In the future, hardware-enabled mechanisms could help customers and regulators to monitor general-purpose AI systems more effectively during deployment and potentially help verify agreements across borders, but reliable mechanisms of this kind do not yet exist.
  • Multiple methods exist across the AI lifecycle to safeguard privacy. These include removing sensitive information from training data, model training approaches that control how much information is learned from data (such as ‘differential privacy’ approaches), and techniques for using AI with sensitive data that make it hard to recover the data (such as ‘confidential computing’ and other privacy-enhancing technologies). Many privacy-enhancing methods from other research fields are not yet applicable to general-purpose AI systems due to the computational requirements of AI systems. In recent months, privacy protection methods have expanded to address AI’s growing use in sensitive domains. This includes smartphone assistants, AI agents, always-listening voice assistants, and use in healthcare or legal practice.

Since the publication of the Interim Report (May 2024), researchers have made some further progress towards being able to explain why a general-purpose AI model has produced a given output. Being able to explain AI decisions could help manage risks from malfunctions ranging from bias and factual inaccuracy to loss of control. In addition, there have been growing efforts to standardise assessment and mitigation approaches around the world.

Conclusion: A wide range of trajectories for the future of general purpose AI are possible, and much will depend on how societies and governments act

The future of general-purpose AI is uncertain, with a wide range of trajectories appearing possible even in the near future, including both very positive and very negative outcomes. But nothing about the future of general-purpose AI is inevitable. How general-purpose AI gets developed and by whom, which problems it gets designed to solve, whether societies will be able to reap general-purpose AI’s full economic potential, who benefits from it, the types of risks we expose ourselves to, and how much we invest into research to manage risks – these and many other questions depend on the choices that societies and governments make today and in the future to shape the development of general-purpose AI.

To help facilitate constructive discussion about these decisions, this report provides an overview of the current state of scientific research and discussion on managing the risks of general-purpose AI. The stakes are high. We look forward to continuing this effort.

Introduction

We are in the midst of a technological revolution that will fundamentally alter the way we live, work, and relate to one another. Artificial intelligence (AI) promises to transform many aspects of our society and economy.

The capabilities of AI systems have improved rapidly in many domains over the last years. Large language models (LLMs) are a particularly salient example. In 2019, GPT-2, then the most advanced LLM, could not reliably produce a coherent paragraph of text and could not always count to ten. Five years later, at the time of writing, the most powerful LLMs, such as GPT-4 and o1, Claude 3.5 Sonnet, Hunyuan-Large, and Gemini 1.5 Pro, can engage consistently in multi-turn conversations, write short computer programs, translate between multiple languages, score highly on university entrance exams, and summarise long documents.

Because of these advances, AI is now increasingly present in our lives, and deployed in increasingly consequential settings across many domains. Just over the last two years, there has been rapid growth in AI adoption – ChatGPT, for instance, is amongst the fastest growing technology applications in history, reaching over one million users in just five days after its launch, and 100 million users in two months. AI is now being integrated into search engines, legal databases, clinical decision support tools, and many more products and services.

The step-change in AI capabilities and adoption, and the potential for continued progress, could help advance the public interest in many ways – but there are risks. Among the most promising prospects are AI’s potential for education, medical applications, research advances in fields such as chemistry, biology, or physics, and generally increased prosperity thanks to AI-enabled innovation. Along with this rapid progress, experts are becoming increasingly aware of current harms and potential future risks associated with the most capable types of AI.

This report aims to contribute to an internationally shared scientific understanding of advanced AI safety. To work towards a shared international understanding of the risks of advanced AI, government representatives and leaders from academia, business, and civil society convened in Bletchley Park in the United Kingdom in November 2023 for the first international AI Safety Summit. At the Summit, the nations present agreed to support the development of an International Scientific Report on the Safety of Advanced AI. This report aims to contribute to an internationally shared scientific understanding of advanced AI safety. It will be presented at the AI Action Summit held in Paris in February 2025. An interim version of this report was published in May 2024 and presented at the AI Seoul Summit. At the Summit and in the weeks and months that followed, the experts writing this report received extensive feedback from scientists, companies, civil society organisations, and policymakers. This input has strongly informed the writing of the present report, which builds on the Interim Report (May 2024) and is the first full International Scientific Report on the Safety of Advanced AI.

An international group of 96 AI experts, representing a breadth of views and, where relevant, a diversity of backgrounds, contributed to this report. They considered a range of relevant scientific, technical, and socio-economic evidence published before 5 December 2024. Since the field of AI is developing rapidly, not all sources used for this report are peer-reviewed. However, the report is committed to citing only high-quality sources. Indicators for a source being of high quality include:

  • The piece constitutes an original contribution that advances the field.
  • The piece engages comprehensively with the existing scientific literature, references the work of others where appropriate, and interprets it accurately.
  • The piece discusses possible objections to its claims in good faith.
  • The piece clearly describes the methods employed for its analysis. It critically discusses the choice of methods.
  • The piece clearly highlights its methodological limitations.
  • The piece has been influential in the scientific community.

Since, at the time of writing this report, a scientific consensus on the risks from advanced AI is still being forged, in many cases the report does not put forward confident views. Rather, it offers a snapshot of the current state of scientific understanding and consensus, or lack thereof. Where there are gaps in the literature, the report identifies them, in the hope that this will be a spur to further research.

This report does not comment on which policies might be appropriate responses to AI risks. It aims to be highly relevant for AI policy, but not in any way prescriptive. Ultimately, policymakers have to choose how to balance the opportunities and risks that advanced AI poses. Policymakers must also choose the appropriate level of prudence and caution in response to risks that remain ambiguous.

The report focuses on ‘general-purpose’ AI – AI that can perform a wide range of tasks. Artificial intelligence (AI) is the field of computer science focused on creating systems or machines capable of performing tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, natural language processing, and decision-making. AI research is a broad and quickly evolving field of study, and there are many different kinds of AI. This report does not address all potential risks from all types of advanced AI. It focuses on general-purpose AI, or AI that can perform a wide range of tasks. General-purpose AI, now known to many through applications such as ChatGPT, has generated unprecedented interest in AI, both among the public and policymakers, in the last two years. The capabilities of general-purpose AI have been improving particularly rapidly. General-purpose AI is different from so-called ‘narrow AI’, a kind of AI that is specialised to perform one specific task or a few very similar tasks.

To better understand how this report defines general-purpose AI, it is useful to make a distinction between ‘AI models’ and ‘AI systems’. AI models can be thought of as the raw, mathematical essence that is often the ‘engine’ of AI applications. An AI system is a combination of several components, including one or more AI models, that is designed to be particularly useful to humans in some way. For example, the ChatGPT app is an AI system; Its core engine, GPT-4, is an AI model.

The report covers risks both from general-purpose AI models and from general-purpose AI systems. For the purposes of this report:

  • An AI model is a general-purpose AI model if it can perform, or can be adapted to perform, a wide variety of tasks. If such a model is adapted to primarily perform a narrower set of tasks, it still counts as a general-purpose AI model.
  • An AI system is a general-purpose AI system if it is based on a general-purpose AI model.

‘Adapting a model’ here refers to using techniques such as fine-tuning a model (training an already pre-trained model on a dataset that is significantly smaller than the previous dataset used for training), prompting it in specific ways (‘prompt engineering’), and techniques for integrating the model into a broader system.

Large generative AI models and systems, such as chatbots based on LLMs, are well-known examples of general-purpose AI. They allow for flexible generation of output that can readily accommodate a wide range of distinct tasks. General-purpose AI also includes AIs that can perform a wide range of sufficiently distinct tasks within a specific domain such as structural biology.

Within the domain of general-purpose AI, this report focuses on general-purpose AI that is at least as capable as today’s most advanced general-purpose AI. Examples include GPT-4o, AlphaFold-3, and Gemini 1.5 Pro. Note that in this report’s definition, a model or system does not need to have multiple modalities – for example, speech, text, and images – to be considered general-purpose. What matters is the ability to perform a wide variety of tasks, which can also be accomplished by a model or system with only one modality.

General-purpose AI is not to be confused with ‘artificial general intelligence’ (AGI). The term AGI lacks a universal definition, but is typically used to refer to a potential future AI that equals or surpasses human performance on all or almost all cognitive tasks. By contrast, several of today’s AI models and systems already meet the criteria for counting as general-purpose AI as defined in this report.

This report does not address risks from ‘narrow AI’, which is trained to perform a specific task and captures a correspondingly very limited body of knowledge. The focus on advanced general-purpose AI is due to progress in this field having been most rapid, and the associated risks being less studied and understood. Narrow AI, however, can also be highly relevant from a risk and safety perspective, and evidence relating to the risks of these systems is used across the report. Narrow AI models and systems are used in a vast range of products and services in fields such as medicine, advertising, or banking, and can pose significant risks. These risks can lead to harms such as biased hiring decisions, car crashes, or harmful medical treatment recommendations. Narrow AI is also used in various military applications, for instance, Lethal Autonomous Weapon Systems (LAWS) [footnote 1]. Such topics are covered in other fora and are outside the scope of this report. The scope of potential future reports is not yet decided.

A large and diverse group of leading international experts contributed to this report, including representatives nominated by 30 nations from all UN Regional Groups, as well as the OECD, the EU, and the UN. While our individual views sometimes differ, we share the conviction that constructive scientific and public discourse on AI is necessary for people around the world to reap the benefits of this technology safely. We hope that this report can contribute to that discourse and be a foundation for future reports that will gradually improve our shared understanding of the capabilities and risks of advanced AI.

The report is organised into five main sections. After this introduction,

  • Capabilities of general-purpose AI provides information on the current capabilities of general-purpose AI, underlying principles, and potential future trends.
  • Risks discusses risks associated with general-purpose AI.
  • Technical approaches to risk management presents technical approaches to mitigating risks from general-purpose AI and evaluates their strengths and limitations.
  • Conclusion summarises and concludes.

Capabilities of general-purpose AI

1.1. How general-purpose AI is developed

Key information

  • General-purpose AI can perform, and help users accomplish, a wide variety of tasks. For example, it can produce text, images, video, audio, actions, or annotations for data.
  • General-purpose AI is based on ‘deep learning’. Deep learning leverages large amounts of computational resources for an AI model to learn useful patterns from a large amount of training data.
  • The lifecycle of a general-purpose AI can be divided into distinct stages. These stages are:
    • Data collection and pre-processing: developers and data workers collect, clean, label, standardise, and transform raw training data into a format the model can effectively learn from.
    • Pre-training: developers feed AI models vast quantities of data to instil general knowledge by learning from examples. This stage currently requires the most computation.
    • Fine-tuning: developers and contracted data workers further refine the pre-trained ‘base model’ in a process called ‘fine-tuning’ to optimise the model’s performance for a specific application or make it more useful generally. This stage can be very labour intensive.
    • System integration: developers combine one or more general-purpose AI models with other components, such as user interfaces or content filters, to enhance capability and safety and to produce a full ‘AI system’ that is ready for use.
    • Deployment: developers make the integrated AI system available for others to use by implementing the AI system into real-world applications or services.
    • Post-deployment monitoring: developers gather and analyse user feedback, track performance metrics, and make iterative improvements to address issues or limitations discovered during real-world use. These improvements can include more fine-tuning or updating the system integration.
  • Since the publication of the Interim Report (May 2024), the abilities of general-purpose AI at tests of multi-step reasoning have improved. This is largely due to fine-tuning techniques through which a model learns to approach problems in a more structured way before it generates an output.

Key definitions

  • Model: A computer program, often based on machine learning, designed to process inputs and generate outputs. AI models can perform tasks such as prediction, classification, decision-making, or generation, forming the core of AI applications.
  • System: An integrated setup that combines one or more AI models with other components, such as user interfaces or content filters, to produce an application that users can interact with.
  • Compute: Shorthand for ‘computational resources’, which refers to the hardware (e.g. graphics processing units (GPUs)), software (e.g. data management software) and infrastructure (e.g. data centres) required to train and run AI systems.
  • Deep learning: A machine learning technique in which large amounts of data and compute are used to train multilayered, artificial neural networks (inspired by biological brains) to automatically learn and extract high-level features from large datasets, enabling powerful pattern recognition and decision-making capabilities.
  • Developer: Any organisation that designs, builds, integrates, adapts or combines AI models or systems.
  • Neural network: A type of AI model consisting of a mathematical structure that is inspired by the human brain and composed of interconnected nodes (like neurons) that process and learn from data. Current general-purpose AI systems are based on neural networks.
  • Weights: Model parameters that represent the strength of connection between nodes in a neural network. Weights play an important part in determining the output of a model in response to a given input and are iteratively updated during model training to improve its performance.

‘General-purpose AI’ refers to artificial intelligence models or systems that can perform a wide range of tasks rather than being specialised for one specific function. While all AI operates on a fundamental input-to-output basis – processing data to generate results – general-purpose AI distinguishes itself by its ability to handle a diverse range of tasks, e.g. summarising text, generating images, or writing computer code (for a more detailed definition of general-purpose AI, see Introduction. This versatility makes it useful, allowing applications in numerous fields such as healthcare, finance, and engineering. However, these capabilities also present new challenges, particularly in ensuring safety and ethical use. The complexity of managing multiple potential use cases increases the potential for unintended consequences, biases, and misuse.

Examples of general-purpose AI include:

General-purpose AI models are developed via a process called ‘deep learning’. Deep learning is a paradigm of AI development focused on building computer systems that learn from examples. Instead of programming specific rules into systems, researchers feed these systems examples – such as pictures, texts, or sounds – and they gradually learn to recognise patterns and make sense of new information. Deep learning started emerging as a dominant paradigm for AI development in the early 2010s. It was solidified as the primary paradigm after notable developments such as the victory of the AlphaGo system against the world’s leading Go player in 2016.

Figure 1.1: ‘layers’ of interconnected mathematical nodes

Figure 1.1: Today’s general-purpose AI models are neural networks, which are inspired by the animal brain. These networks are composed of connected nodes, where the strength of connections between nodes are called ‘weights’. Weights are updated through iterative training with large quantities of data. Source: International AI Safety Report.

There are many different types of general-purpose AI, but they are developed using common methods and principles. Deep learning works by processing data through ‘layers’ of interconnected mathematical nodes (see Figure 1.1, often called ‘neurons’ because they are loosely inspired by neurons in biological brains (‘neural networks’) [footnote 24]. As information flows from one layer of neurons to the next, the model refines its representations. For example, in a vision system, the first layers might detect simple features such as edges or basic shapes in an image, while deeper layers combine these features to recognise more complex patterns like faces or objects. When the system makes mistakes, deep learning algorithms adjust the strength of various connections between neurons to improve the model’s performance. The strength of each connection between nodes is often called a ‘weight’. This layered approach to learning is what gives deep learning its name, and it is effective at tasks that previously required human intelligence. Most state-of-the-art general-purpose AI models are now based on a specific neural network architecture known as the ‘Transformer’ [footnote 25], which is able to process large quantities of data simultaneously. Transformers have been very effective at learning from large amounts of data, leading to significant improvements in translation and text generation and eventually leading to the development of LLMs such as GPT-4o.

The process of developing and deploying general-purpose AI follows a series of distinct stages. These stages occur at different points in time, depend on different resources, require different techniques, and are sometimes undertaken by different developers (see Figure 1.2 / Table 1.1). As a result, different policies and regulations affecting data, computational resources (‘compute’), or human oversight may affect each stage differently.

Figure 1.2 Process of developing and deploying general-purpose AI follows a series of distinct stages

Figure 1.2: The process of developing and deploying general-purpose AI follows a series of distinct stages, from data collection and pre-processing to post-deployment monitoring. Figure: International Scientific Report on the Safety of Advanced AI.

Before training a general-purpose AI model, developers collect and prepare suitable data, which is a large-scale operation. Creating high-quality training datasets involves complex pipelines of data collection, cleaning, and curation. The training datasets behind state-of-the-art models comprise an immense number of examples from across the internet. Teams often develop sophisticated filtering systems to reduce inappropriate or harmful content, eliminate duplicate data, and improve representation across different topics and perspectives. Data pre-processing can also help reduce copyright and privacy concerns, handle multiple languages and formats, and improve documentation for data provenance. Many companies employ large teams of annotators and subject matter experts to verify and label portions of the data, develop classification systems for content quality, and create specialised datasets for specific capabilities.

Table 1.1 Process of developing and deploying general-purpose AI follows a series of distinct stages

Stage Description
Data collection and pre-processing Developers collect, clean, label, standardise, and transform raw training data into a format the model can learn from. This is a highly labour-intensive process.
Pre-training Developers feed models massive amounts of diverse data – such as text, code, and images – to instil general knowledge and skills. Pre-training produces a ‘base model’. It is a highly compute-intensive process.
Fine-tuning Developers further train the base model to optimise it for a specific application or make it more useful generally. This is typically done with the help of a large amount of human-generated feedback. This is a moderately compute-intensive and highly labour-intensive process.
System integration Developers combine one or more general-purpose AI models with other components such as user interfaces or content filters to create a full ‘AI system’ that is ready for use.
Deployment Developers make the integrated AI system available for others to use.
Post-deployment monitoring Developers gather and analyse user feedback, track impact and performance metrics, and make iterative improvements to address issues or limitations discovered during real-world use.

Table 1.1: At each stage of the AI lifecycle, the AI model is improved for downstream use and eventually deployed as a fully integrated AI system.

During pre-training, developers present general-purpose AI models with large amounts of data, which allows the model to learn patterns. At the beginning of the training process, an untrained model produces random outputs. However, through exposure to millions or billions of examples – such as pictures, texts, or audio – the model gradually learns facts and patterns which allow it to make sense of information in context. Pre-training produces a ‘base model’ with general background knowledge and capabilities.

Pre-training for general-purpose AI models is often the most computationally intensive stage of development. The pre-training process takes weeks or months and uses tens of thousands of graphics processing units (GPUs) or tensor processing units (TPUs) – specialised computer chips designed to rapidly process many calculations. Today, this process uses roughly 10 billion times more compute compared to state-of-the-art model training in 2010 [footnote 26]. Some developers conduct pre-training with their own compute, while others use resources provided by specialised compute providers. Either way, energy costs are high, and it is projected that for the largest general-purpose AI models, pre-training compute costs alone will exceed $1 billion for some models by 2027 [footnote 27]. See 2.3.4. Risks to the environment for a discussion of the environmental costs of training.

After pre-training, general-purpose AI models learn from specially curated feedback and specialised data sets to improve model performance and efficiency – a process called ‘fine-tuning’. After pre-training, most general-purpose AI models undergo one or more additional fine-tuning stages to refine their ability to accomplish the intended tasks. Fine-tuning can include various techniques, including learning from desirable examples [footnote 28], [footnote 29] or from positive/negative reinforcement [footnote 30], [footnote 31]. In some ways, fine-tuning a general-purpose AI can be compared to teaching a student through practice and feedback. Often, fine-tuning follows this scheme:

1. researchers give a base model tasks that it then tries to solve;
2. the researchers then mark good responses as positive examples and mistakes are marked as negative examples;
3. the model is then updated such that it tends to favour approaches that worked well and avoid those that did not, gradually becoming more reliable.

Overall, fine-tuning improves the performance of general-purpose AI models by allowing them to utilise existing knowledge and capabilities to accomplish the desired task. Fine-tuning is traditionally the most labour-intensive training step, often requiring feedback from thousands of contracted data workers. However, general-purpose AI systems are themselves increasingly being used to help fine-tune other general-purpose models [footnote 32], [footnote 33]. In practice, fine-tuning is typically an iterative process in which developers will alternate between fine-tuning and testing runs until their tests show that the system meets desired specifications.

After fine-tuning comes ‘system integration’, which involves combining general-purpose AI models with other components such as user interfaces or content filters to produce a general-purpose AI system. A general-purpose AI system is a combination of one or more general-purpose AI models and all the additional components needed to make them operational – such as user interfaces, data processing infrastructure, and various tools. For example, GPT-4o is a general-purpose AI model that processes text, images, and audio. However, ChatGPT is a general-purpose AI system that combines the GPT-4o model with a chat interface, content processing, web access, and application integration to create a functional product. The additional components in an AI system also aim to enhance capability, usefulness, and safety. For example, a system might come with a filter that detects and blocks model inputs or outputs that contain harmful content. Developers are also increasingly designing so-called ‘scaffolding’ around general-purpose AI models that allows them to plan ahead, pursue goals, and interact with the world (see 1.2. Current capabilities). Just like fine-tuning, system integration typically involves alternating integration and testing steps. The final step before deployment is typically to construct a report on the system’s development, capabilities, and test results. This is often known as a ‘system card’ [footnote 34].

After system integration, ‘deployment’ makes AI systems available for use. Deployment is the process of implementing AI systems into real-world applications, products, or services where they can serve requests and operate within a larger context. Deployment can take several forms: internal deployment for use by the system’s developer, or external deployment either publicly or to private customers. Very little is publicly known about internal deployments. However, companies are known to adopt different types of strategies for external deployment. For example, companies often offer access through online user interfaces or integrations that allow their models to be used with custom applications designed by downstream developers. These integrations can allow for one developer’s general-purpose AI systems to be used in numerous other applications. For example, one company might design a bespoke customer service chatbot that is powered by another company’s general-purpose AI system.

‘Deployment’ and ‘model release’ are distinct activities that are easily confused. ‘Deployment’ involves putting an integrated AI system into use as described above. ‘Model release’, on the other hand, involves making trained models available for downstream entities to further use, study, modify, and/or integrate into their own systems. There is a spectrum of model release options ranging from fully closed to fully open [footnote 35]. Fully closed models are maintained for internal research and development only. Fully open models are those for which all model components (e.g. weights, code, training data) and documentation are made freely available under an open source licence for anyone to use, study, share, or modify [footnote 36]. Some state-of-the-art general-purpose AI models, such as GPT-4o [footnote 3], are on the ‘closed’ end of the spectrum, while others sit more towards the open end of the spectrum. For example, Llama-3.1 [footnote 37] has ‘open’ weights that are available for public download. From a risk mitigation perspective, there are advantages and disadvantages of more open forms of model release (see 2.4. Impact of open-weight general-purpose AI models on AI risks).

After deployment, developers can engage in ‘monitoring’ – inspecting the inputs and outputs of the system to track performance and detect problems – and update their systems on an ongoing basis. This process involves gathering and analysing user feedback, tracking performance metrics, and making iterative improvements to address issues or limitations discovered during real-world use [footnote 38]. These improvements can include more fine-tuning or an updating of the system integration. In practice, there is often a ‘cat-and-mouse game’ in which developers continually update high-profile systems in response to newly discovered issues [footnote 39]. See 3.4.2. Monitoring and intervention for a discussion of methods for monitoring general-purpose AI systems and intervening where needed.

Since the publication of the Interim Report (May 2024), developers have made significant advances in system integration techniques that may enable general-purpose AI to perform more advanced reasoning. In September 2024, OpenAI announced its new o1 prototype model with more advanced scaffolding and training methods that have enabled significant performance gains on tasks such as mathematics and programming [footnote 2]. Unlike previous models, o1 employs ‘chain of thought’ problem-solving that breaks problems down into steps which are then solved bit-by-bit. Chain of thought has enabled improvements in complex tasks – o1 scored 83% on International Mathematics Olympiad (IMO) qualifying exams compared to GPT-4o’s 13% – and is considered an important step towards developing AI agents: general-purpose AI systems that can autonomously interact with the world, plan ahead, and pursue goals. However, the improved problem-solving process requires significantly more time and compute both during training and at point of use. The extent of the reasoning capabilities of the model remain unclear [footnote 40].

There are various challenges for policymakers stemming from how general-purpose AI is developed. Risks and vulnerabilities can emerge at many points along the development and deployment process, making the most effective interventions difficult to pinpoint and prioritise. Advances in model development are also happening rapidly and are difficult to predict. This makes it difficult to articulate robust policy interventions that will age well with a rapidly evolving technology. Not only are the risks and vulnerabilities associated with general-purpose AI likely to change, the demands of model development are, too. For example, reasoning-based models such as o1 demand much greater computational resources at point of use, which presents new implications for long-term compute infrastructure planning. 1.2. Current capabilities and 1.3. Capabilities in coming years expand on the state of current AI capabilities and the ways in which those capabilities are likely to evolve, posing new risks and challenges.

1.2. Current capabilities

Key information

(See Chair’s update on the latest AI advances).

  • Understanding and measuring the capabilities of general-purpose AI is crucial for assessing their risks. Existing governance frameworks and commitments rely on precisely measuring general-purpose AI capabilities, but they are a moving target and difficult to measure and define.
  • Most experts agree that general-purpose AI systems are capable of tasks including:
  • Assisting programmers and performing small- to medium-sized software engineering tasks.
  • Creating images that are hard to distinguish from real photographs.
  • Engaging in fluent conversation in many languages.
  • Finding and summarising information relevant to a question or problem from many data sources.
  • Working simultaneously with multiple ‘modalities’ such as text, video, and speech.
  • Solving textbook mathematics and science problems at up to a graduate level.

  • Most experts agree that general-purpose AI is currently not capable of tasks including:
  • Performing useful robotic tasks such as household work.
  • Consistently avoiding false statements.
  • Independently executing long projects, such as multi-day programming or research projects.

  • General-purpose AI agents can increasingly act and plan autonomously by controlling computers. Leading AI companies are making large investments in AI agents because they are expected to be economically valuable. There is rapid progress on tests related to web browsing, coding, and research tasks, though current AI agents still struggle with work that requires many steps.

  • Since the publication of the Interim Report (May 2024), general-purpose AI systems have markedly improved at tests of scientific reasoning and programming. These improvements come in part from techniques that let general-purpose AI break down complex problems into smaller steps, by writing so-called ‘chains of thought’, before solving them.
  • A key challenge for policymakers is how to account for context-specific capabilities in regulations. The capabilities of general-purpose AI can significantly change with more careful fine-tuning, prompting, and tools made available to the system. They can also decline in unfamiliar contexts. More rigorous evaluations are needed to avoid overestimating or underestimating capabilities.

Key definitions

  • Modalities: The kinds of data that an AI system can competently receive as input and produce as output, including text (language or code), images, video, and robotic actions.
  • Capabilities: The range of tasks or functions that an AI system can perform, and how competently it can perform them.
  • Inference-time enhancements: Techniques used to improve an AI system’s performance after its initial training, without changing the underlying model. This includes clever prompting methods, answer selection methods (e.g., sampling multiple responses and choosing a majority answer), writing long ‘chains of thought’, agent ‘scaffolding’, and more.
  • Scaffold(ing): Additional software built around an AI system that helps it to perform some task. For example, an AI system might be given access to an external calculator app to increase its performance on arithmetical problems. More sophisticated scaffolding may structure a model’s outputs and guide the model to improve its answers step-by-step.
  • Chain of thought: A reasoning process in which an AI generates intermediate steps or explanations while solving a problem or answering a question. This approach mimics human logical reasoning and internal deliberation, helping the model break down complex tasks into smaller, sequential steps to improve accuracy and transparency in its outputs.
  • Inference: The process in which an AI generates outputs based on a given input, thereby applying the knowledge learnt during training.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.
  • Evaluations: Systematic assessments of an AI system’s performance, capabilities, vulnerabilities or potential impacts. Evaluations can include benchmarking, red-teaming and audits and can be conducted both before and after model deployment.
  • Benchmark: A standardised, often quantitative test or metric used to evaluate and compare the performance of AI systems on a fixed set of tasks designed to represent real-world usage.

This section focuses on the core capabilities of general-purpose AI models and systems that are publicly available today. 1.3. Capabilities in coming years discusses expected future developments in AI capabilities, and 2. Risks discusses specific dangerous capabilities and their associated applications that contribute to risks.

A general-purpose AI system’s capabilities are difficult to reliably measure. [footnote 41] An important caveat on assessments of AI capabilities is that their capability profiles, and the consistency with which they exhibit certain capabilities, differ significantly from those of humans. For example, two studies find that language models fail more often on counting and arithmetic problems involving numbers that are rare in their training data [footnote 42], [footnote 43]. An AI system’s success on a test of capabilities depends highly on the particular examples chosen for the test, as well as how it is asked or instructed to solve them (which, in practice, depends on its user’s skill) – making it particularly challenging to ensure the absence of a capability in an AI system (e.g. one that could entail societal risks [footnote 44]; see 2.1 Risks from malicious use. Diversity of data and appropriate investment into methods to elicit the desired behaviour from a model (e.g. through inference-time enhancements such as scaffolding, prompting, and fine-tuning) can help make capability assessment more reliable.

Input and output modalities

The ‘modalities’ of an AI system are the kinds of data that it can usefully receive as input and produce as output. For example, general-purpose AI systems with a text modality might take in user-entered text or source documents, and produce coherent natural language, engage in conversations, and answer reading comprehension questions about a passage. AI systems with image and text modalities may be able to answer questions about the contents of images, or generate images according to natural language instructions. Understanding the modalities which a general-purpose AI system can process is important for developing an intuition about the broad sets of tasks that it might be able to accomplish in theory, and the possible threats that it – and future models of its kind – may pose. General-purpose systems exist for 9+ modalities [footnote 45] including text, audio, images, and video, with some systems specifically focusing on an additional modality such as robotic actions, representations of proteins and other molecules, time series data [footnote 46] or music [footnote 47]. However, text- and image-processing systems such as ChatGPT are the source of much of the present attention on general-purpose AI. Advanced general-purpose AI systems are increasingly able to process inputs and generate outputs in multiple modalities such as text, video, and speech.

Text and code: General-purpose AI systems can engage in interactive dialogue and write short computer programs. Advanced language models can generate text and engage in interactive dialogue across a variety of natural languages, topics, and formats. Examples include OpenAI’s GPT-4, Anthropic’s Claude, and Google’s Gemini, as well as openly available models from Meta (the Llama series of models), Mistral AI, Alibaba (the Qwen series), and DeepSeek [footnote 48], [footnote 49], [footnote 50], [footnote 51], [footnote 52], [footnote 53], [footnote 54]. In addition to human language, these models can process and generate many kinds of data encoded as text, including mathematical formulae and computer code. They can write short- to medium-length programmes, assist software developers, and perform computer actions (such as web searches) when provided with affordances such as internet access [footnote 55], [footnote 56].

Audio and speech: General-purpose AI systems can engage in spoken conversation and convincingly emulate humans’ voices. Some general-purpose AI systems, including GPT-4o [footnote 3] and Gemini 1.5 [footnote 49], can process audio in much the same way as text, answering questions about the contents of an audio clip (for example, a spoken conversation). One recent study on using narrow AI for text-to-speech synthesis found that on two academic speech synthesis benchmarks, a person’s voice could be convincingly replicated in high-quality audio from only a three-second recording [footnote 57], [footnote 54]. The general-purpose AI system GPT-4o can converse in real time with human-like speech in its ‘advanced voice mode’ and can emulate a variety of human voices.

Images: General-purpose AI systems can describe the contents of images with high accuracy, generate images according to a detailed description, and perform other image-based tasks. Many general-purpose AI systems can use images as both input and output. General-purpose AI systems such as Claude, and GPT-4o, Pixtral, and Qwen2-VL can describe the contents of images in language, including objects and activities depicted therein [footnote 3], [footnote 50], [footnote 58], [footnote 59], [footnote 60]. The most capable models can make sense of complex images and documents, with Anthropic reporting that its Claude 3.5 Sonnet system can correctly answer over 90% of questions in three benchmarks that involve processing documents, charts and science diagrams, representing human standardised testing settings [footnote 5]. General-purpose AI systems can also generate images as output, with contents and style specified in human language (for example, systems such as Stable Diffusion 3 [footnote 15] and DALL-E 3 [footnote 14]. Advances in image-generation models make it easier to control their images’ content and style, depict increasingly complex and realistic scenes, and produce images which are close to indistinguishable from natural ones [footnote 14]. Other general-purpose AI systems can perform image-based tasks such as categorising the objects depicted within images [footnote 61] and identifying their locations [footnote 62].

Video: General-purpose AI systems can transcribe or describe the contents of videos, and generate short videos according to instructions, but the movement depicted in these videos is not always realistic. Some general-purpose AI systems can take video as input and analyse its contents, such as V-JEPA [footnote 63], Gemini 1.5 [footnote 49], GPT-4o [footnote 3], and Qwen2-VL [footnote 60]. These systems can enable searching and analysing long-form content, for example, locating key moments or pieces of information revealed in a video. Some general-purpose systems can also generate realistic, high-definition video, for example Sora [footnote 16] and Movie Gen [footnote 64]. These models can generate short (less than one minute) videos depicting a scene described in text, optionally with reference to other images and videos as well. They can modify videos according to instructions (e.g. changing the depicted season from summer to winter) and generate videos depicting individuals in reference photographs (e.g. performing a described activity). These videos generally look realistic, though the accuracy of the generated scenes to the instructional text tends to be worse than for state-of-the-art image generation systems, and the videos often contain unnatural or physically impossible movements which clearly distinguish them from natural video. Advanced video models have only reached the market in 2024 and their implications are still being explored.

Robotic actions: General-purpose AI systems can be used to plan out robotic movements, but cannot yet themselves control physical robots or machines. General-purpose AI systems can be used for planning multi-step robot actions and translating instructional language into robotic action plans [footnote 65], [footnote 66]. Researchers are also exploring general-purpose AI models that not only plan or interpret, but also generate robotic actions, such as Google’s RT-2-X [footnote 67], and the autonomous driving company Waymo is developing general-purpose AI models for generating driving plans and models of a vehicle’s environment [footnote 68]. However, general-purpose AI models’ abilities to generate robotic actions are relatively rudimentary. Part of the reason is that data collection for actions generally requires running physical robots and is challenging to do at very large scale [footnote 69], although substantial efforts are being made [footnote 67], [footnote 70], [footnote 71]. General-purpose AI systems cannot yet effectively control physical robots or machines to perform many useful tasks such as household work, as the integration of general-purpose AI models with motor control systems remains a challenge [footnote 72].

Proteins and other molecules: General-purpose AI systems can perform a range of tasks useful to biologists, such as predicting protein folding and aiding protein design. General-purpose AI systems that work with proteins and other large molecules operate using various representations (e.g. residue sequences, 3D structures). These models can predict protein structures under various conditions (e.g. in protein–protein complexes), generate useful novel proteins, and perform a wide range of protein-related tasks relevant to drug discovery and design [footnote 73], qualifying them as ‘foundation models’ [footnote 74] and as general-purpose AI models under this report’s definition (see Introduction). They can increasingly be used to generate designs of new proteins with predictable functions across large protein families [footnote 75], [footnote 76].

Enhancements after pre-training

The tasks that a general-purpose AI system can accomplish depend on the techniques applied to it after initial pre-training. A review of 16 enhancement methods found that they generally require less than 1% of the computational resources to implement than was used for pre-training the systems, while improving those systems’ capabilities approximately as much as would be expected from devoting 5x more resources to pre-training [footnote 77]. This suggests that policy around the development and deployment of general-purpose AI systems may need to anticipate the effect that these enhancements will have on general-purpose AI systems’ capabilities. Some common enhancement methods [footnote 77], [footnote 78] include:

  • Fine-tuning: Fine-tuning refers to further training the pre-trained base model to optimise it for a specific application or make it more useful generally, for instance by training it to follow instructions.
  • Inference-time enhancements: Inference is the process in which an AI model generates outputs based on a given input, thereby applying the knowledge learnt during training. Inference-time enhancements are a class of system integration techniques that modify a model’s inputs and organise its outputs. Examples include producing multiple candidate answers to a question and selecting the best among them [footnote 79], [footnote 80]. producing long ‘chains of thought’ (see next paragraph) to work through complex problems [footnote 2], or using hybrids of these approaches [footnote 81]. Other inference-time enhancements include:
    • Prompting methods: crafting the system’s instructions to improve its performance, for example, by providing it with example problems and solutions [footnote 82], [footnote 83], providing useful documents for context, or instructing it to ‘think step-by-step’ [footnote 84];
    • Agent scaffolding and tool use: providing the model with means to break down a high-level task into a plan with clear subgoals and delegate to copies of itself to perform each step of the plan, interacting with its environment, e.g. using websites [footnote 85] or running code [footnote 86], [footnote 87], [footnote 88] to carry out its work as an AI agent [footnote 89], [footnote 90].

General-purpose AI models have markedly improved at answering Ph.D.- level science questions

Figure 1.3: AI performance on a set of Ph.D level science questions

Figure 1.3: Since the publication of the Interim Report (May 2024), general-purpose AI models have seen rapid performance increases in answering PhD-level science questions. Researchers have been testing models on GPQA Diamond, a collection of challenging multiple-choice questions about biology, chemistry, and physics, which people without PhD-level expertise in each area are unable to correctly answer even with access to the internet. On these tests, accuracy rose from 33% with GPT-4 in June 2023 (slightly above random guessing) to 49% with GPT-4o in May 2024, reaching 70% (matching experts with a PhD in the area of each question) with o1-preview in September 2024. This increase is partly due to o1-preview writing a long ‘chain of thought’ in which it can break down the problem and try different approaches before producing its final answer. For progress on other tests, see Figure 1.4 in 1.3. Capabilities in coming years. Source: Epoch AI, 2024 [footnote 91].

Since the publication of the Interim Report (May 2024), studies have shown a general-purpose AI system’s capabilities can be significantly increased by having it devote more time and computation to each individual problem. OpenAI’s o1 system, released in September of 2024, achieved a high enough score on the American Invitational Mathematics Examination (AIME) to qualify for the USA Mathematical Olympiad, and reached expert PhD-level performance on postgraduate-level physics, chemistry, and biology questions curated for high difficulty [footnote 92] (see Figure 1.3). The key to o1’s improvements was to leverage extra computation at inference time by writing a long ‘chain of thought’ to break down the problem and work through hypotheses. Another popular inference-time enhancement leverages increased computation during inference-time by sampling multiple outputs from the model and choosing among them. Two recent studies by industry, academic and civil society researchers investigate how capabilities scale with the amount of inference-time computation using such techniques [footnote 93], [footnote 94]. They find that capabilities increase at a rate which is approximately logarithmic with inference-time computation investment, forming a similar trend to the relationship between capability growth and training time computation as described in 1.3. Capabilities in coming years. This, together with o1’s success, suggests that the amount of inference-time computation devoted to each problem might be a general-purpose lever by which the capabilities of an existing general-purpose AI system may be increased (especially in science and technology applications), i.e. by simply allowing it to produce a much longer ‘chain of thought’ before its answer. However, eliciting improved capabilities using more inference-time computation requires more computation, increasing costs.

What can current general-purpose AI do?

General-purpose language models can correctly answer many common-sense and factual questions, but they can be inconsistent and make trivial errors. General-purpose AI systems encode an extensive range of facts, with current state-of-the-art systems scoring on average above 92% in undergraduate-level tests of knowledge in subjects such as chemistry and law [footnote 92]. However, these systems often fail to identify subtle factual distinctions or self-contradictory arguments [footnote 95], [footnote 96], are prone to provide biased answers on the basis of user interaction patterns [footnote 97], [footnote 98], are less accurate at answering questions about unusual scenarios [footnote 42], [footnote 99], [footnote 100], and commonly generate completely non-existent or false citations, biographies, or facts [footnote 101], [footnote 102], [footnote 103], [footnote 104], [footnote 105], or make simple common-sense errors [footnote 106], [footnote 107]. These issues are taken by some researchers to indicate that they lack a true understanding of how the world works [footnote 108], and make it difficult to adopt such systems in settings that require high reliability. See 1.3. Capabilities in coming years for further discussion.

General-purpose AI systems can achieve performance similar to or better than human experts on some self-contained knowledge and reasoning tasks, but they still make mistakes on easy problems in ways that humans do not. In one study, a general-purpose AI system was able to predict the probability of future events with accuracy rivalling that of expert forecasters on an online forecasting platform [footnote 109]. With respect to coding, o1 performs at the 89th percentile of humans on Codeforces, an online competitive coding platform, and can resolve 41% of a sample of self-contained, real-world engineering tasks drawn from the code-sharing platform GitHub [footnote 2]. However, even on simple primary school mathematical word problems, general-purpose AI systems exhibit error patterns that are distinct from those of humans. For example, two studies find that their accuracy greatly decreases when obviously irrelevant sentences are inserted into the problem [footnote 110], [footnote 111], with a 17.5% reduction in accuracy for a preview version of o1 [footnote 110]. Two recent studies also find that as general-purpose AI systems are given problems that require more steps of reasoning to solve, their error rate increases faster than one would expect if they had a constant error rate per step [footnote 110], [footnote 112]. This suggests that general-purpose AI systems cannot be relied upon for complex problems, and leads some researchers to claim that these systems ‘cannot perform genuine logical reasoning’ [footnote 110], although opinions on this among experts are mixed.

Studies show that AI assistance makes software developers more productive, and adoption of AI tools for programming is increasing. Studies on GitHub Copilot, a popular early AI coding aid, show productivity boosts of anywhere from 8–22% [footnote 113] to 56% [footnote 114]. Developers perceive themselves as being more productive when surveyed [footnote 115], and AI assistance is generally more beneficial for inexperienced developers [footnote 114], [footnote 115]. In a survey of over 65,000 software developers from May–June 2024 by Stack Overflow, a popular programming Q&A community forum, 63% of professional software developers reported using AI tools in their workflow [footnote 116], up from 44% the previous year [footnote 117].

Since the publication of the Interim Report (May 2024), general-purpose AI agents which independently perform tasks on the computer have been subject to heavy investment and are rapidly becoming more reliable on benchmarks designed to test labour automation potential. AI agents are general-purpose AI systems that can autonomously make plans, perform complex tasks, and interact with their environment by controlling software and computers, with little human oversight. AI agents can be created by equipping general-purpose AI systems with a thin layer of additional software known as ‘scaffolding’. Example tasks for AI agents include web browsing tasks such as answering questions [footnote 85] or online shopping [footnote 118], [footnote 119], assistance with scientific research [footnote 22], [footnote 120], [footnote 121], software development [footnote 122], training machine learning models [footnote 123], [footnote 124], [footnote 125], [footnote 126], carrying out cyberattacks [footnote 127], following instructions to navigate simulated environments [footnote 128], or controlling physical robots [footnote 19]. On most of these tasks, current AI agents succeed in cases of low to medium complexity, but fail when the task requires many steps or becomes more complex. In an evaluation study on 77 tasks, ranging from simple tasks such as exploiting basic website vulnerabilities to complex, multi-step tasks such as training machine learning models, state-of-the-art models such as GPT-4o, o1, and Claude 3.5 Sonnet succeeded at nearly 40% of tasks when equipped with agent scaffolding, a similar rate to humans who are limited to 30 minutes for each task [footnote 2], [footnote 129]. In the same study, o1 made some progress – not fully succeeding – on two out of seven difficult tasks designed to reflect challenging tasks in AI research and development (R&D), such as optimising neural network code [footnote 2], [footnote 129]. Progress in this area is rapid: new agent architectures are rapidly being developed [footnote 130], [footnote 131], [footnote 132], and the top system’s success rate on a high-quality subset of SWE-bench, a popular software engineering agent benchmark, increased from 22% to 45% from April to August 2024 [footnote 122].

Since the publication of the Interim Report (May 2024), researchers have also made progress in leveraging new kinds of multimodal data to train AI models for robot control. One approach involves training a system on a large dataset of videos annotated with text descriptions of their contents, followed by a smaller dataset of (scarce) videos annotated with robot action commands [footnote 133]. A second new approach uses existing vision-enabled general-purpose AI to translate videos of humans into action plans for robots, and trains robot control models using this data [footnote 134]. A third new approach trains on video alone but involves models implicitly learning the actions depicted in it, allowing the model to quickly adapt to controlling new robots, even if its initial training was only on videos of humans [footnote 135]. These studies suggest that new methods leveraging multimodal learning will soon open up the data bottleneck that currently prevents developers from training general-purpose AI systems to control robots.

The main evidence gaps around current AI capabilities include:

  • There is no consistently up-to-date comprehensive index of AI capabilities. Evidence on AI capabilities quickly becomes outdated as new models are released and inference-time enhancements are developed. Researchers’ understanding of AI capabilities advances through a relatively ad hoc patchwork of academic and industry publications which can be challenging to synthesise into a comprehensive picture. Policymakers would ideally have access to evidence which is up to date, reliable, standardised and comprehensive.
  • Evaluations of AI capabilities often do not replicate on new data. Evaluation studies provide examples of an AI system performing a task (or failing to do so) on some sample data, but they often do not replicate when the experiments are rerun or tried on different data [footnote 136]. For evaluations to be reliable and replicable, they should ideally be run on large, diverse datasets which are expanded over time.
  • There are no common standards for measuring how AI augments human capabilities. There are not yet standardised benchmarks for ‘uplift’ – measuring how effectively humans can use general-purpose AI systems to accomplish various tasks, compared to using existing technology – which can inform the public of this aspect of progress. (Such tests are undertaken – though the details are often confidential – for chemical, biological, radiological and nuclear (CBRN) misuse risks; see 2.1.4. Biological and chemical attacks and 2.4 Impact of open-weight general-purpose AI models on AI risks.)

For policymakers, key challenges include:

  • Standardised measures of capabilities, such as multiple-choice benchmark tests, may not measure the capabilities of AI systems in the contexts that are most relevant to their risks (e.g. when used as an aid by humans).
  • After initial development, AI models can be continually improved upon through fine-tuning and inference-time enhancements. These improvements will increase the contextual capabilities and potentially affect the risks of models that are already available to the public, and the changes would be outside the scope of tests by developers of the base model. It will be difficult to design policy robust to this kind of continuous change.

1.3. Capabilities in coming years

Key information

(See Chair’s update on the latest AI advances).

  • In the coming months and years, the capabilities of general-purpose AI systems could advance slowly, rapidly, or extremely rapidly. Both expert opinions and available evidence support each of these trajectories. To make timely decisions, policymakers will need to account for these scenarios and their associated risks. A key question is how rapidly AI developers can scale up existing approaches using even more compute and data, and whether this would be sufficient to overcome the limitations of current systems, such as their unreliability in executing lengthy tasks.

  • Developers of general-purpose AI are advancing scientific, engineering, and ‘agent’ capabilities. In recent months, models have substantially improved at tests of scientific reasoning and programming, enabling new applications. Additionally, AI developers are making large efforts to develop more reliable general-purpose AI agents that can execute longer tasks or projects without human oversight by using computers and software tools, potentially with continuous learning during operation.

  • General-purpose AI-based tools are increasingly being used to accelerate the development of software and hardware, including general-purpose AI itself. They are widely used to more efficiently write software to train and deploy AI, to aid in designing AI chips, and to generate and curate training data. How this will affect the pace of progress has received little study.

  • Recent improvements have been primarily driven by scaling up the compute and data used for pre-training, and by refining existing algorithmic approaches. For cutting-edge models, current estimates suggest that these factors have, in recent years, approximately increased:
    • Compute for pre-training: 4x/year
    • Pre-training dataset size: 2.5x/year
    • Energy used for powering computer chips during training: 3x/year
    • Algorithmic pre-training efficiency: 3x/year (higher uncertainty)
    • Hardware efficiency: 1.3x/year
  • It is likely feasible for AI developers to continue to exponentially increase resources used for training, but this is not guaranteed. If recent trends continue, by the end of 2026, AI developers will train models using roughly 100x more training compute than 2023’s most compute-intensive models, growing to 10,000x more training compute by 2030. New research suggests that this degree of scaling is likely feasible, depending on investment and policy decisions. However, it is more likely that today’s pace of scaling will become infeasible after the 2020s due to bottlenecks in data, chip production, financial capital, and local energy supply.
  • Researchers debate the effectiveness of scaling up resources for training with current algorithmic techniques. Some experts are sceptical as to whether scaling up training resources would be sufficient to overcome the limitations of current systems, while others expect that it will continue to be the key ingredient for further advances.
  • AI developers have recently adopted a potentially more effective additional scaling approach. Models can be trained to write longer so-called ‘chains of thought’ to break down problems into steps before generating responses, allowing compute scaling during runtime rather than training. This method has shown promise in overcoming various limitations in tests of scientific reasoning and programming and may provide an additional path if traditional training scaling yields diminishing returns.
  • Since the publication of the Interim Report (May 2024), general-purpose AI systems have become more affordable to use, more practically useful, and more widely adopted. Developers have also significantly enhanced models’ performance at tests of mathematical and scientific reasoning (see 1.2. Current capabilities).
  • Policymakers face challenges in monitoring and responding to AI progress. Key challenges include quantitatively tracking AI advancements and their primary drivers, as well as designing adaptive risk management frameworks that activate mitigations only when capabilities (and associated risks) increase.

Key definitions

  • Scaling laws: Systematic relationships observed between an AI model’s size (or the amount of time, data or computational resources used in training or inference) and its performance.
  • Compute: Shorthand for ‘computational resources’, which refers to the hardware (e.g. GPUs), software (e.g. data management software) and infrastructure (e.g. data centres) required to train and run AI systems.
  • Algorithmic (training) efficiency: A set of measures of how efficiently an algorithm uses computational resources to learn from data, such as the amount of memory used or the time taken for training.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.
  • Inference: The process in which an AI generates outputs based on a given input, thereby applying the knowledge learnt during training.
  • Chain of thought: A reasoning process in which an AI generates intermediate steps or explanations while solving a problem or answering a question. This approach mimics human logical reasoning and internal deliberation, helping the model break down complex tasks into smaller, sequential steps to improve accuracy and transparency in its outputs.
  • Benchmark: A standardised, often quantitative test or metric used to evaluate and compare the performance of AI systems on a fixed set of tasks designed to represent real-world usage.
  • Emergent behaviour: The ability of AI systems to act in ways that were not explicitly programmed or intended by their developers or users.
  • Cognitive tasks: Activities that involve processing information, problem-solving, decision-making, and creative thinking. Examples include research, writing, and programming.
  • Synthetic data: Data like text or images that has been artificially generated, for instance by general-purpose AI systems. Synthetic data might be used for training AI systems, e.g. when high-quality natural data is scarce.
  • Modalities: The kinds of data that an AI system can competently receive as input and produce as output, including text (language or code), images, video, and robotic actions.

The pace of recent general-purpose AI progress has been rapid, often surpassing the expectations of AI experts on widely-used metrics. Researchers assess AI performance using ‘benchmarks’ – standardised sets of problems designed to compare AI systems’ performance within one or multiple domains. Over the last decade, general-purpose AI systems and earlier AI systems have achieved or exceeded human-level performance on benchmarks across a wide variety of domains, such as natural language processing, computer vision, speech recognition, and mathematics (see Figure 1.4). For example, consider the MATH benchmark [footnote 137], which tests mathematical problem-solving skills via a series of word problems. These problems range in difficulty from simple primary school-level questions to problems that challenge international mathematics competition winners. When this benchmark was released in 2021, general-purpose AI systems scored around 5%, but three years later, the model o1 reached 94.8% [footnote 92], matching the score of expert human testers (in this case, a gold medallist in the IMO). However, it is often unclear how impressive performance on benchmarks translates into performance in real-world tasks, as discussed below [footnote 138].

AI systems have become much more cost efficient to run, with the prices for running AI systems at a given capability level falling by multiple orders of magnitude. For example, in 2022 it cost users ~$25 to generate a million words using GPT-3, but by 2023 this fell to almost $1 using the performance‑equivalent Llama 2 7B (see Figure 1.5). These price decreases partly stem from technological advancements, such as hardware improvements that allow more computation to be performed at the same price [footnote 144]. Price drops can also occur due to decreases in the price markups companies charge, and the measured decrease also depends on the chosen benchmark and level of performance.

Figure 1.4: Performance of AI models on various benchmarks has advanced rapidly between 1998 to 2024

Note. That some earlier results used machine learning AI models that are not general-purpose models. On some recent benchmarks, models progressed within a short period of time from having poor performance to surpassing the performance of human subjects who are often experts (black solid line). Note that early results in this graph used machine learning AI models that are not general-purpose models. Sources: Kiela et al., 2021 [footnote 139] (for MNIST, Switchboard, ImageNet, SQuAD 1.1, 2 and GLUE). Data for MMLU, Big Bench, GPQA are from the relevant papers [footnote 3], [footnote 5], [footnote 92], [footnote 140], [footnote 141], [footnote 142], [footnote 143]).

Figure 1.5

Figure 1.5: This graph shows how general-purpose language models have become markedly more cost efficient to use, measured by the number of words generated per dollar while maintaining a given performance level on the MMLU benchmark. The version of GPT-3 175B released after September 2022 and Llama 2 7B (blue) both achieve a score of around 44% accuracy [footnote 48], [footnote 145], whereas Mistral Large and GPT-4o mini (red) achieve around 82% [footnote 12], [footnote 146]. The original GPT-4 from March 2023 and the recently released o1-mini (green) both score around 86% on MMLU [footnote 92], [footnote 147]. Note that this graph is primarily for illustrative purposes, since reported prices and MMLU performance depend on evaluation methods. Furthermore, o1-mini writes so-called ‘chains of thought’ that users cannot access before producing a final answer, so in practice the number of accessible words generated per dollar is likely lower than depicted in the figure. Sources: Chung et al., 2022 [footnote 145] and Touvron et al., 2023 [footnote 48] (for GPT-3 175B and Llama 2 7B); Mistral AI, 2024 [footnote 12] and OpenAI, 2024f [footnote 146] (for Mistral Large and GPT-4o mini); Open AI, 2024g [footnote 92] and OpenAI et al., 2024 [footnote 147] (for GPT-4 and o1-mini); OpenAI, 2024d [footnote 148] and Together Pricing, 2023 [footnote 149] (for pricing data).

Since the publication of the Interim Report, research on improving general-purpose AI capabilities has begun to focus on new directions, while efforts to scale up training resources continue. For example, one direction is improving the autonomy of general-purpose AI systems – producing AI agents that act and plan in pursuit of goals. (See 1.2. Current capabilities, and 3.2.1. Technical challenges for risk management and policymaking. Another direction involves using multiple copies of models together to accomplish new tasks). [footnote 150], [footnote 151]

New evidence suggests that scaling training compute and data at current rates is technically feasible until at least ca. 2030. Over the last decade, training compute for cutting-edge models has increased an estimated 4x per year. If this trend continues, systems will be trained with roughly 100x more compute than GPT-4 by the end of 2026, growing to around 10,000x by the end of the decade [footnote 152]. However, it is unclear how this translates into improved capabilities, and whether the economic returns are large enough to justify the expense of such massive degrees of scaling.

1.3.2. Can the limitations of current systems be resolved through scaling, refining, and combining existing approaches?

Current general-purpose AI systems have an uneven set of capabilities, and still have many limitations

Humans and general-purpose AI systems have distinct strengths and weaknesses, making comparisons challenging. It is tempting to compare the cognitive abilities of humans and AI systems, for instance because this informs which economic tasks might be especially strongly impacted by AI use. However, current general-purpose AI systems often demonstrate uneven performance, excelling in some domains while struggling in others [footnote 153], which makes overly general comparisons less meaningful. While general-purpose AI now outperforms humans on some benchmarks, some scientists argue that it still lacks the deep conceptual understanding and abstract reasoning capabilities of humans [footnote 153]. General-purpose AI systems can replace humans in some activities, whereas in others, the distinct strengths and weaknesses of AI systems and humans combine to produce fruitful collaborations (see 2.3.1. Labour market risks).

Current general-purpose AI systems are prone to some failures that humans are not [footnote 154], [footnote 155]. Some works suggest that general-purpose AI reasoning can struggle to cope with novel scenarios and is overly influenced by superficial similarities [footnote 110], [footnote 153]. General-purpose AI systems have also been shown to sometimes fail at reasoning on seemingly simple tasks. For instance, a model trained on data including the statement ‘Olaf Scholz was the ninth Chancellor of Germany’ will not always be able to answer the question ‘Who was the ninth Chancellor of Germany?’ [footnote 154]. In addition, there is evidence that general-purpose AI systems can be caused to deviate from their usual safeguards by nonsensical input, while humans would recognise these prompts (see 3.4.1. Training more trustworthy models. Limitations of current systems are further discussed in 1.2. Current capabilities.

Existing AI training approaches will likely extend model capabilities, but the degree of improvement and its real-world significance are heavily debated

Evidence suggests that further resource scaling will increase overall AI capabilities. Researchers have discovered empirical ‘scaling laws’ (see Figure 1.6), which are mathematical relationships that quantify the relationship between inputs of the AI training process (such as amounts of data and compute) and the capabilities of the model on broad performance tasks such as next-word prediction [footnote 156], [footnote 157]. These studies demonstrate that AI models’ performance tends to improve with increased computational resources across a range of domains, including computer vision [footnote 158], [footnote 159], language modelling [footnote 156], [footnote 157], and game playing [footnote 160]. Although many performance measures do not directly test real-world capabilities, general-purpose AI model performance has been observed to consistently improve on broad benchmarks that test many capabilities, such as MMLU [footnote 140], as the models are scaled up.

Figure 1.6: Performance (as measured by ‘training loss’) improves predictably as AI developers use more compute for training (lower ‘training loss’ means better performance) [footnote 157]

Note. In this experiment, additional compute was allocated to training larger language models (more parameters, indicated by colour) on more data. FLOP (floating point operations) refers to the number of computational operations performed during training. Each line shows how performance (as measured by a lower ‘training loss’, which is a proxy measure for capabilities) improves as training FLOP increases for a model of a given size. Source: Hoffmann et al., 2022 [footnote 157].

However, it is unclear whether further resource scaling will improve AI capabilities at the same rate as in the last decade. Scaling laws have proven robust, holding across a range of million- to billion-fold increases in training computation. However, these scaling laws have thus far been derived from empirical observations, not from inviolable principles (although theoretical models have been proposed to explain them) [footnote 161], [footnote 162], [footnote 163], [footnote 164], [footnote 165]. Furthermore, some scaling laws are derived from limited data, which makes them less reliable [footnote 41], [footnote 166], [footnote 167], [footnote 168], [footnote 169], [footnote 170]. As a result, there is no mathematical guarantee that scaling laws will continue to hold at larger scales, beyond the range of the empirical data used to establish them. On the other hand, a breakdown of the main scaling laws has not been scientifically established either, despite ongoing news reports.

While aggregate AI capabilities improve predictably with scale, it is difficult to predict when specific capabilities will appear. There are many documented examples of capabilities that appear when models reach a certain scale, sometimes suddenly, without being explicitly programmed into the model [footnote 170], [footnote 171], [footnote 172], [footnote 173], [footnote 174], [footnote 175]. For example, LLMs at a certain scale have gained the ability to accurately add large numbers, when prompted to perform the calculation step-by-step. Some researchers define these as ‘emergent’ capabilities [footnote 171], [footnote 172], [footnote 173], [footnote 174], indicating that they are present in larger models but not in smaller models and so their emergence is often hard to predict in advance. On the other hand, recent research has made some progress in predicting ‘emergent’ capabilities [footnote 176], [footnote 177]. There is ongoing debate about whether capabilities can be called ‘emergent’: some definitions of emergence require that the capability appears suddenly or unpredictably at a certain scale (which is not always the case), whereas other definitions only require that the capability appears as models are scaled, without being explicitly designed to have the capability.

It is debated to what extent benchmark performance reflects real-world understanding or utility. AI models have made rapid progress on many benchmark metrics, but these benchmarks are limited compared to real-world tasks, and experts debate whether these metrics effectively evaluate truly general capabilities [footnote 178], [footnote 179]. State-of-the-art general-purpose AI models often exhibit unexpected weaknesses or a lack of robustness on some benchmarks. For example, these systems perform worse on rare or more difficult variants of tasks that are not seen in the training data [footnote 40], [footnote 110]. Some researchers hypothesise that this is because the systems partly or fully rely on memorising patterns rather than employing robust reasoning or abstract thinking [footnote 153], [footnote 180]. In some cases, models were trained on the benchmark solutions, leading to high benchmark performance despite the models not being able to perform well on the task in real-world contexts [footnote 181], [footnote 182]. Models also struggle to adapt to cultures that are less represented in the training data [footnote 183]. Issues like these underscore the difficulty of assessing what benchmark results imply about models’ capacity to reliably apply knowledge to practical, real-world scenarios.

However, sometimes general-purpose AI systems perform well on difficult tasks designed to test reasoning, without having had a chance to memorise the solutions. In general, the presence of memorisation found in some studies does not imply the absence of more advanced processes like reasoning – it is possible for both to exist in different models or within the same model. There is evidence [footnote 184], [footnote 185] that some AI models have generalised their learning to situations that they have not been trained on, suggesting that they are not only memorising data. Some general-purpose language models (and systems built with them) have performed well on reasoning and mathematics problems whose solutions were not part of their training data [footnote 186]. This extends to reaching medal-level performance at the recent International Olympiads for mathematics ([footnote 187], [footnote 188]) and computer science [footnote 92] and the challenging Abstraction and Reasoning Corpus (ARC, ([footnote 189]).

There is substantial disagreement about whether AI developers can achieve broadly human-level AI on most cognitive tasks by scaling training resources as well as refining and combining existing techniques. Some argue that continued scaling (potentially combined with refining and combining existing approaches) could lead to the development of general-purpose AI systems that perform at a broadly human level or beyond for most cognitive tasks [footnote 190]. This view draws support from the observation of consistent scaling laws and how increased scale has overcome many limitations of early language models such as GPT-1, which could rarely generate a coherent paragraph of text. Others contend that deep learning has fundamental limitations which cannot be solved through scaling alone. These critics argue current systems rely on memorisation (at least partially, see above), and lack true common sense reasoning [footnote 153], [footnote 191], [footnote 192], causal reasoning [footnote 193], or an understanding of the physical world [footnote 153], [footnote 191], [footnote 193], alongside other limitations discussed in 1.2. Current capabilities.

1.3.3. How much scaling and refinement of existing approaches is expected in coming years?

Computing resources dedicated to training AI have been rapidly scaled up, and further rapid scaling until 2030 appears feasible

AI developers have increased training compute for flagship models quickly, with growth at ~4x/year. Training compute usage has grown exponentially since the early 2010s (see Figure 1.7), with the average amount used to train machine learning models doubling approximately every six months [footnote 26]. For illustration, notable machine learning models [footnote 194], [footnote 195], [footnote 196] in 2010 were trained with around ten billion times less compute than the largest models in 2023 [footnote 197], [footnote 198].

AI companies have also invested more computational resources in deployment. This is both because more general-purpose AI systems have been deployed to serve users [footnote 199], and because deployed systems have access to more computational resources to increase capabilities. Models can be run for longer, or the results of multiple models can be aggregated, resulting in performance gains that complement the gains from using more training compute [footnote 80], [footnote 92], [footnote 93], [footnote 94], [footnote 200], [footnote 201], [footnote 202], [footnote 203], [footnote 204]. For example, some estimates indicate that OpenAI incurred $700k/day in deployment costs in 2023 [footnote 205], and that running AI represented 60% of Google’s CO2 emissions from machine learning infrastructure as of 2022 [footnote 206].

The amount of training compute available has been growing, mostly due to large capital expenditures increasing the quantity of AI chips. Since 2010, computing hardware has become cheaper due to hardware improvements, meaning that the amount of computing power (compute) that AI companies can buy with a dollar is increasing at a rate of 1.35x per year [footnote 144], [footnote 207]. However, the total compute used in training notable AI systems has increased by approximately 4x per year since 2010 [footnote 26], outpacing the rate of hardware efficiency improvements. This suggests that the primary driver of training compute growth has been investments to expand the AI chip stock, not improvements in chip performance.

AI computation has massive energy demands, but current growth rates in AI power consumption could persist for several years. Global AI computation is projected to require electricity consumption similar to that of Austria or Finland by 2026 [footnote 208] (see 2.3.4. Risks to the environment) for more information. Based on current growth rates in power consumption for AI training, the largest AI training runs in 2030 will need 1–5 gigawatts (GW) of power. Indeed, a compute provider recently purchased a data centre with a 960 megawatt power supply [footnote 209]. Thus, depending on investment and policy decisions, energy bottlenecks likely will not prevent compute from scaling at current rates until the end of the decade.

Figure 1.7: AI developers have consistently used more compute to train notable machine learning models over time, at an increasing pace since 2010 [footnote 26], [footnote 197]

Note. Computation is measured in total FLOP (floating point operations) estimated from AI literature — this refers to the number of computational operations performed during training. Estimates are expected to be accurate within a factor of two, or a factor of five for recent undisclosed models such as GPT-4. Sources: Epoch AI, 2024 [footnote 26], [footnote 197]; Sevilla et al., 2022 [footnote 26], [footnote 197].

Challenges to producing and improving AI chips exist, but can likely be overcome. It typically takes 3–5 years to build a computer chip fabrication plant [footnote 210], [footnote 211], and supply chain shortages sometimes delay the production of important chip components [footnote 212], [footnote 213], [footnote 214]. However, major AI companies can still sustain compute growth in the near term by capturing large fractions of the AI chip stock. For example, one study estimates that the share of the world’s data centre AI chips owned by a single AI company at any point in time is somewhere between 10% and 40% [footnote 215]. Moreover, an analysis of existing trends and technical possibilities in chip production suggest that it is possible to train AI systems with 100,000x more training compute than GPT-4 (the leading language model of 2023) by 2030. This is sufficient to support existing growth rates in training compute, which imply a total increase of 10,000x over the same period [footnote 215]. Hence, chip production constraints are significant, but they are unlikely to prevent further scaling of the largest models at current rates until 2030 if investment is sustained see Figure 1.8.

Figure 1.8: Four physical constraints to training general-purpose AI models using more compute by 2030

Note. There are many orders of magnitude of uncertainty in the overall estimates, but training runs using 10,000x more computation than GPT-4 (released in 2023), which in line with the existing trend, appear technically feasible based on these estimates. Source: Sevilla et al. 2024 [footnote 215].

Training AI systems across a very large number of AI chips is difficult, which may prevent extremely large training runs. For example, some estimates suggest that training runs that are 10,000 to 10 million times larger than GPT-4’s will be impossible due to constraints on how much information can be moved between chips, and limits on the time to process data [footnote 215], [footnote 216]. If these estimates are correct, then these bottlenecks will limit the ability of AI developers to increase training compute at existing rates over the next decade. However, it is possible that novel techniques or simple workarounds will permit much larger training runs.

There is likely enough pre-training data for scaling until 2030, but projections are highly uncertain after this point

Data shortages are a plausible bottleneck to continued scaling of language model pre-training. Since 2010, data requirements for pre-training general-purpose AI systems have grown around 10x every three years [footnote 197]. For example, a state-of-the-art model in 2017 was trained with a few billion words, whereas state-of-the-art general-purpose models in 2023 were trained with several trillion [footnote 217], [footnote 218]. A lot of this growth has been possible due to internet data availability, but growth rates in the demand for data appear rapid enough to exhaust human-generated internet text data by 2030 [footnote 219], [footnote 220]. These challenges are exacerbated by data copyright issues, as it may become illegal for AI companies to train AI on certain types of data (see 2.3.6. Risks of copyright infringement).

The degree of data scarcity is specific to the domain and actor. In some domains data gathering can be substantially scaled up, such as in general-purpose robotics, where systems gather data during deployment [footnote 221].

Sourcing data from different modalities could help sustain data scaling. General-purpose AI systems are increasingly being trained on multimodal data, combining textual, visual, auditory, or biological information [footnote 59], [footnote 222], [footnote 223], [footnote 224]. Several studies suggest that this will increase the amount of training data available for models and endow models with novel capabilities, such as the ability to analyse documents with both text and graphics [footnote 4], [footnote 50], [footnote 147]. The most comprehensive estimates suggest there is enough multimodal data to support training runs a thousand to ten million times larger than GPT-4’s in terms of compute size, which requires roughly ten times the data [footnote 215], [footnote 225]. However, these estimates are very uncertain, since it is difficult to gauge how well training on one data modality impacts performance on another modality.

Machine-generated synthetic data could dramatically alleviate data bottlenecks, but evidence for its utility is mixed. Training datasets can also be augmented by ‘synthetic’ general-purpose AI outputs, which can be useful when real data is limited [footnote 226], [footnote 227] or for improving model generalisation [footnote 227], [footnote 228]. However, some argue that naively training on general-purpose AI outputs degrades performance or has rapidly diminishing returns [footnote 229], [footnote 230], [footnote 231], [footnote 232], [footnote 233], [footnote 234], [footnote 235], [footnote 236]. Others argue that these issues can be circumvented with better training techniques, such as by mixing in ‘natural’ data [footnote 229], [footnote 231], [footnote 235], [footnote 237], [footnote 238], improving data quality by (for example) using a model to rate its quality [footnote 226], [footnote 239], [footnote 240], [footnote 241], and training on negative examples (i.e. teaching the AI what not to do) [footnote 242]. Recent flagship models such as Llama 3 have made substantial use of synthetic data during multiple stages of training [footnote 37]. The o1 model’s recent improvements in reasoning and programming tests were achieved largely by learning from its own self-generated ‘chains of thought’ – analysing which reasoning paths led to success or failure [footnote 2].

Most existing successes with synthetic data have been restricted to certain domains. Synthetic data training can be highly successful in domains where model outputs can be formally checked, such as mathematics and programming [footnote 187], [footnote 188], [footnote 243], [footnote 244]. However, it is still unclear whether synthetic data training methods will be effective in domains where outputs cannot be easily verified. One such example is medical research, where data often needs to be verified by performing experiments lasting months or even years.

1.3.4. How much will AI capabilities be improved through the invention or refinement of algorithms?

Existing techniques and training methods for general-purpose AI have been improved and refined consistently

Algorithmic improvements allow general-purpose AI models to be trained with fewer resources. The techniques and training methods underpinning the most capable general-purpose AI models have consistently and reliably improved over time. The computational efficiency of AI techniques for training has been increasing by 10x approximately every 2–5 years in key domains such as image classification, game-playing, and language modelling [footnote 245], [footnote 246]. For example, the amount of compute required to train a model to achieve a set level of performance at image classification decreased by 44x between 2012 and 2019, meaning that efficiency doubled every 16 months. Game-playing AI systems require half as many training examples every 5–20 months [footnote 247]. In language modelling, the compute required to reach a fixed performance level has halved approximately every eight months on average since 2012 [footnote 246]. This corresponds to a 3x algorithmic training efficiency improvement per year, amounting to roughly a 27x total improvement by the end of 2026. These advances have enabled general-purpose AI researchers and companies to develop more capable models over time within a limited hardware budget.

Algorithmic innovations also occur across other dimensions, but these are less well-measured. For example, new techniques have allowed general-purpose AI systems to process larger quantities of contextual information for each query to the AI system [footnote 248]. Some algorithmic innovations also help increase performance, allow general-purpose AI systems to use tools [footnote 22], and better leverage computation at deployment [footnote 94]. These capabilities vary along different dimensions, their rates of improvement are challenging to measure, and they are often less well-understood.

Enhancements after pre-training can be used to significantly improve general-purpose AI model capabilities at low cost. There is a rapidly growing body of work on algorithmic innovations after initial training, such as improved fine-tuning, giving models access to software tools, and structuring models’ outputs for reasoning tasks (see 1.2. Current capabilities). This means that a wide range of actors, including low-resource actors, could use enhancements (sometimes called ‘post-training enhancements’) to advance general-purpose AI capabilities – an important factor for governance to account for.

Capability progress from applying AI systems to AI development

General-purpose AI systems are increasingly deployed to automate and accelerate AI research and development, and its effects on the pace of progress are understudied. Narrow AI systems have already been used to develop and improve algorithms [footnote 249], [footnote 250], and design the latest AI chips [footnote 251]. Recent LLMs are widely used in areas related to AI R&D, particularly in programming [footnote 55], generating and optimising prompts and training settings [footnote 252], [footnote 253], [footnote 254], [footnote 255], providing oversight by replacing human feedback data [footnote 256], and selecting high-quality training data [footnote 257]. Recent prototypes also used LLMs to propose novel research ideas [footnote 258]. A recently released LLM-based system performed competitively with typical human teams in real-world AI engineering competitions [footnote 125]. A recent study comparing AI systems to expert human engineers found that carefully tuned AI agents, built on state-of-the-art models, performed comparably to humans on AI research engineering tasks that typically take engineers eight hours to complete (see Figure 1.9) [footnote 259]. The AI agents showed better performance than humans on tasks shorter than eight hours but fell behind on longer ones, following a typical pattern seen in AI performance. AI engineering tasks consume the largest portion of time in AI research and development work, making the application of AI to these tasks particularly important [footnote 260]. As the capabilities of general-purpose AI systems advance, their overall effect on algorithmic progress and engineering in AI will require more research to understand.

Figure 1.9: AI v human engineer perforance at AI research engineering tasks

Figure 1.9: In a set of experiments, LLM-based AI agents released in 2024 performed better than expert human engineers at open-ended AI research engineering tasks when both were given two hours or less to complete the work. Conversely, human experts performed better when given eight hours or more. Different ‘system integrations’ refer to different ways of using the same model, which can lead to varying performance. The shaded regions correspond to 95% confidence intervals. Source: Wijk et al., 2024 [footnote 259].

Will the invention of novel approaches lead to rapid progress in coming years?

Sudden large and broad improvements in AI algorithms are rare but cannot be ruled out. Fundamental conceptual breakthroughs are rare, and hard to predict since data on these is relatively scarce. Such rare events cannot be easily forecasted by extrapolating past trends. At best, statistical models that analyse past improvements on AI benchmarks find suggestive evidence that sudden large performance improvements are unlikely but cannot be ruled out [footnote 261]. The corpus of evidence in this area is highly limited and substantial uncertainty remains.

Even if developers achieve fundamental conceptual breakthroughs in algorithms, they might not immediately lead to large capability improvements. For example, one study found that some algorithmic innovations show more pronounced effects at larger scales than at smaller scales of training compute [footnote 262], making it hard to observe improvements in small experiments. Algorithmic innovations also need to be optimised to work well with existing hardware, or to be integrated into existing infrastructure or developer conventions [footnote 263], [footnote 264], [footnote 265]. These pose barriers to implementation at scale, so if a major conceptual breakthrough is required to overcome the limitations of current general-purpose AI, it could take many years.

Policy challenges

As these technical trends continue, policymakers face new challenges in addressing the societal impacts of general-purpose AI.

One challenge for policymakers is the limited availability of high-quality assessment data about general-purpose AI capabilities. For instance, a major shortcoming with current benchmarks is that they do not always accurately represent real-world capabilities. Consequently, there has been an increase in efforts to build more challenging benchmarks and to establish teams specialising in evaluating model capabilities [footnote 266], [footnote 267], [footnote 268], [footnote 269]. These issues with data quality are further compounded by the limited quantity of data, which means that some estimates of the rate of AI progress (e.g. for algorithmic efficiency) are highly uncertain.

Navigating the uncertainty in the trajectory of future capabilities is a key challenge. Different general-purpose AI capabilities could have substantially different ramifications for societal impacts and AI policy. For example, the best estimates of the rate of algorithmic progress are highly uncertain, but the specific rate has important implications for policy approaches that emphasise monitoring training compute usage [footnote 270]. On the whole there is much uncertainty about future AI capabilities, and additional work on monitoring AI progress (for example with improved benchmarks), as well as anticipating future progress, would be valuable.

Risks

2.1. Risks from malicious use

2.1.1. Harm to individuals through fake content

Key information

  • Malicious actors can use general-purpose AI to generate fake content that harms individuals in a targeted way. For example, they can use such fake content for scams, extortion, psychological manipulation, generation of non-consensual intimate imagery (NCII) and child sexual abuse material (CSAM), or targeted sabotage of individuals and organisations.
  • However, the scientific evidence on these uses is limited. Anecdotal reports of harm from AI-generated fake content are common, but reliable statistics on the frequency and impact of these incidents are lacking. Therefore, it is difficult to make precise statements about the harms from fake content generated by general-purpose AI.
  • In recent months, limited progress has been made in scientifically capturing the extent of the problem. Since the publication of the Interim Report (May 2024), some new evidence has suggested a significant increase in the prevalence of AI-generated deepfake content online. Overall, reliable data on the full extent of the problem remains limited.
  • Several mitigation techniques exist, but they all have serious limitations. Detection techniques can sometimes help identify content generated by general-purpose AI, but fundamental challenges remain. Media authentication techniques such as watermarks can provide an additional line of defence, but moderately skilled actors can usually remove them.

Key definitions

  • AI-generated fake content: Audio, text, or visual content, produced by generative AI, that depicts people or events in a way that differs from reality in a malicious or deceptive way, e.g. showing people doing things they did not do, saying things they did not say,  changing the location of real events, or depicting events that did not happen.
  • Deepfake: A type of AI-generated fake content, consisting of audio or visual content, that misrepresents real people as doing or saying something that they did not actually do or say.

Malicious actors can misuse AI-generated fake content to extort, scam, psychologically manipulate, or sabotage targeted individuals or organisations (see Table 2.1) [footnote 271]. This threatens universal human rights, for example the right against attacks upon one’s honour and reputation [footnote 272]. This section focuses on harms caused to individuals through AI-generated fake content. Potential impacts of AI-generated and -mediated influence campaigns on the societal level are covered in 2.1.2. Manipulation of public opinion.

Table 2.1: AI-generated fake content

AI-generated fake content Description
Scams/fraud Using AI to generate content such as an audio clip impersonating a victim’s voice in order to, for example, authorise a financial transaction
Blackmail/extortion Generating fake content of an individual, such as intimate images, without their consent and threatening to release them unless financial demands are met
Sabotage Generating fake content that presents an individual engaging in compromising activities, such as sexual activity or using drugs, and then releasing that content in order to erode a person’s reputation, harm their career, and/or force them to disengage from public-facing activities (e.g. in politics, journalism, or entertainment)
Psychological abuse/ bullying Generating harmful representations of an individual for the primary purpose of abusing them and causing them psychological trauma. Victims are often children.

Table 2.1: AI-generated fake content has been used to cause different kinds of harm to individuals, including through scams, blackmail, sabotage, and psychological abuse.

A key evidence gap around harm to individuals through fake content is the lack of comprehensive and reliable statistics on the above harms, which makes precise assessment of their frequency and severity difficult. Many experts believe that artificially generated fake content, and especially sexual content, is on the rise, but most accounts of such cases remain anecdotal. Key empirical evidence gaps pertain to the prevalence of deepfake financial fraud and instances of extortion and sabotage. Reluctance to report may be contributing to these challenges in understanding the full impact of AI-generated content intended to harm individuals. For example, institutions often hesitate to disclose their struggles with AI-powered fraud. Similarly, individuals attacked with AI-generated compromising material about themselves may stay silent out of embarrassment and to avoid further harm [footnote 273].  

Criminals can exploit AI-generated fake content to impersonate authority figures or trusted individuals to commit financial fraud. There have been numerous cases in which criminals used artificially generated audio and video clips to trick individuals into transferring money. For example, phishing attacks can leverage AI-generated fake content to make fraudulent messages, calls, or videos more convincing and effective, aiming to obtain sensitive information or money by impersonating a trusted entity [footnote 273], [footnote 274]. Incidents range from high-profile fraud cases where bank executives were persuaded to transfer millions of dollars, to ordinary individuals being tricked into transferring smaller sums to (supposedly) loved ones in need. AI-generated fake content can also be used for identity theft, whereby a victim’s impersonated voice or likeness is used to authorise bank transfers or to set up new bank accounts in a victim’s name. Alternatively, fake content can also be used to trick system administrators into sharing password and username information that can facilitate identity theft at a later date [footnote 275].

AI-generated fake content can also be used as blackmail for extortion. In such cases, criminals demand money, business secrets, or nude images or videos, using compromising realistic AI-generated content as leverage [footnote 276]. Different types of AI-generated fake content – ranging from video, voice clones, images, and more – can vary in their realism and effectiveness [footnote 277]. The fake content can feature any compromising or reputationally damaging activity, but has received particular attention in cases of deepfake pornography, where general-purpose AI is used to create pornographic or other intimate audiovisual representations of individuals without their consent [footnote 278], [footnote 279], [footnote 280]. This content is then used to extort victims for ransom – demanding money to prevent the images from being released – or to gain compliance for other demands, such as supplying additional illicit content.

Such compromising fake content can also be used to sabotage individuals in their personal and professional lives, violating the human right against attacks upon one’s honour and reputation [footnote 272]. Compromising fake images and video – such as images of professional athletes taking drugs – have, in some cases, resulted in reputational damage leading to lost opportunities and broken business deals [footnote 271]. The possibility of becoming the subject of harmful deepfake content and the associated threat of reputational damage and psychological abuse to oneself and family can drive people to disengage from publicly-visible activities such as politics and journalism, even when they have not been directly targeted [footnote 281]. However, the severity of this ‘silencing effect’ is difficult to accurately estimate, as evidence at this stage is largely anecdotal.

Abuse using fake pornographic or intimate content overwhelmingly targets women and girls. A 2019 study found that 96% of deepfake videos are pornographic, and that all content on the five most popular websites for pornographic deepfakes targets women [footnote 282]. The same study found that the vast majority of deepfake abuse (99% on deepfake porn sites and 81% on YouTube) is targeted at female entertainers, followed by female politicians (12% on YouTube). Moreover, sexual deepfakes are increasingly being used as a tool in intimate partner abuse, disproportionately affecting women [footnote 271], [footnote 283]. One nationally representative survey of 1,403 UK adults indicated that women were significantly more likely than men to report being fearful of becoming a target of deepfake pornography, becoming a target of a deepfake scam, and becoming a target of other potentially harmful deepfakes [footnote 284]. This heightened concern among women could reflect an awareness of their increased vulnerability to such abuse, suggesting a potential psychological impact of this technology even on those not directly targeted. However, the sample size of the survey was limited and not globally representative, and in general further research is needed to understand the psychological impact of deepfakes on women.

Children face distinct types of harm from AI-generated sexual content. First, malicious actors can harness AI tools to generate CSAM. In late 2023, an academic investigation found hundreds of images of child sexual abuse in an open dataset used to train popular AI text-to-image generation models such as Stable Diffusion [footnote 285]. In the UK, of surveyed adults who reported being exposed to sexual deepfakes in the last six months, 17% thought they had seen images portraying minors [footnote 286]. Second, children can also perpetrate abuse using AI. In the last year, schools have begun grappling with a new issue as students use easily downloadable ‘nudify apps’ to generate and distribute naked, pornographic pictures of their (disproportionately female) peers [footnote 287].

Since the publication of the Interim Report (May 2024), some new evidence has suggested a significant prevalence of AI-generated content online. In the UK, a study found that 43% of people aged 16+ say that they have seen at least one deepfake (in the form of videos, voice imitations, and images) online in the last six months (50% among children aged 8–15) [footnote 286]. However, reliable data remains comparably limited. Understanding the impact of deepfakes on individuals will require more extensive research over an extended period of time.

Countermeasures that help people detect fake AI-generated content, such as warning labels and watermarking, show mixed efficacy. Certain AI tools can help detect anomalies in images and flag them as likely fake AI-generated content. This is done either by using machine learning algorithms to look for specific features in fake images or by training deep neural networks to identify and analyse anomalous image features independently [footnote 288]. Warning labels on potentially misleading content have shown limited effectiveness even in less harmful contexts – for example, in an experimental study using AI-generated videos of a public figure alongside authentic clips, warning labels only improved participants’ detection rate from 10.7% to 21.6% [footnote 289]. However, the overwhelming majority of respondents who received warnings were still unable to distinguish deepfakes from unaltered videos [footnote 289]. Another authentication measure intended to prevent AI-generated fake content is ‘watermarking’, which involves embedding a digital signature into the content during creation. Watermarking techniques have shown promise in helping people identify the origin and authenticity of digital media for videos [footnote 290], [footnote 291], images [footnote 292], [footnote 293], [footnote 294], audio [footnote 295], [footnote 296], and text [footnote 297]. However, watermarking techniques face several limitations, including watermark removal by sophisticated adversaries [footnote 298], [footnote 299] and methods for tricking watermark detectors [footnote 299]. There are also concerns about privacy and potential misuse of watermarking technology to track and identify users [footnote 300]. Moreover, for many types of harmful content discussed in this section, such as non-consensual pornographic or intimate content, the ability to identify content as AI-generated does not necessarily prevent the harm from occurring. Even when content is proven to be fake, the damage to reputation and relationships may persist, as people often retain their initial emotional response to the content – for example, an individual’s standing in their community may not be restored simply by exposing the content as fake.

There are several key challenges facing policymakers working to mitigate harm to individuals from AI-generated fake content. Assessing the scale of the problem is difficult due to underreporting and lack of reliable statistics. This may make it difficult to determine the appropriate intervention. Current detection methods and watermarking techniques, while progressing, show mixed results and face persisting technical challenges. This means there is currently no single robust solution for detecting and reducing the spread of harmful AI-generated content. Finally, the rapid advancement of AI technology often outpaces detection methods, highlighting potential limitations of relying solely on technical and reactive interventions.

2.1.2. Manipulation of public opinion

Key information

  • Malicious actors can use general-purpose AI to generate fake content such as text, images, or videos, for attempts to manipulate public opinion. Researchers believe that if successful, such attempts could have several harmful consequences.
  • General-purpose AI can generate potentially persuasive content at unprecedented scale and with a high degree of sophistication. Previously, generating content to manipulate public opinion often involved a strong trade-off between quality and quantity. General-purpose AI outputs, however, are often indistinguishable to people from content
  • generated by humans, and generating them is extremely cheap. Some studies have also found them to be as persuasive as human-generated content.
  • However, there is no scientific consensus on the expected impact of this potential abuse of general-purpose AI. There is limited evidence on the broader societal effects of false information, whether intentionally created or unknowingly shared, and whether AI-enabled or not. Some researchers believe that attempts at manipulating public opinion using general-purpose AI are most bottlenecked by a lack of effective distribution channels. This view implies that advances in manipulative content generation should have a limited impact on the efficacy of such campaigns.
  • Since the publication of the Interim Report(May 2024), more research has emerged on the virality of, and possible mitigations for, AI-based attempts at manipulation. A new study finds that AI-generated manipulative content is perceived as less accurate but shared at similar rates to human-generated content, which suggests that such content can easily go viral regardless of whether it is AI or human-generated. New technical detection methods integrating both text and visual data have shown some success, but are not fully reliable.
  • Policymakers face limited mitigation techniques and difficult trade-offs. Attempts to address manipulation risk from general-purpose AI can, in some settings, be difficult to reconcile with protection of free speech. Further, as general-purpose AI outputs become increasingly persuasive and realistic, detecting cases of manipulation through AI can get harder. Prevention techniques, such as watermarking content, are useful but can be circumvented with moderate effort.

Key definitions

  • AI-generated fake content: Audio, text, or visual content, produced by generative AI, that depicts people or events in a way that differs from reality in a malicious or deceptive way, e.g. showing people doing things they did not do, saying things they did not say,  changing the location of real events, or depicting events that did not happen.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.

General-purpose AI can help people generate realistic content at scale, which malicious actors could use for attempts to manipulate public opinion and spread certain narratives. Studies show that humans often find general-purpose AI-generated text indistinguishable from genuine human-generated material [footnote 301], [footnote 302], [footnote 303], [footnote 304]. Moreover, research indicates that while people struggle to accurately identify AI-generated content, they often overestimate their ability to do so [footnote 305]. There is also evidence that such content is already being disseminated at scale [footnote 306]. Recent research has observed a significant increase in AI-generated news articles [footnote 307], and has found that AI language models can reduce content generation costs by up to 70% for highly reliable models [footnote 308].

There is evidence that content generated by general-purpose AI can be as persuasive as content generated by humans, at least under experimental settings. Recent work has measured the persuasiveness of general-purpose AI-generated political messages. Several studies have found that they can influence readers’ opinions of psychological experiments [footnote 309], [footnote 310], [footnote 311], [footnote 312], [footnote 313], in a potentially durable fashion [footnote 314], though the generalisability to real-world contexts of these effects remains understudied. One study found that during debates, people were just as likely to agree with AI opponents as they were with human opponents [footnote 315], and more likely to be persuaded by the AI if the AI had access to personal information of the kind that one can find on social media accounts. Recent research also explores how general-purpose AI agents could influence user beliefs using more sophisticated techniques, including by creating and exploiting users’ emotional dependence, feeding their anxieties or anger, or threatening to expose information if users do not comply [footnote 316].

As general-purpose AI systems grow in capability, there is evidence that it will become easier to maliciously use them for deceptive or manipulative means, possibly even with higher effectiveness than skilled humans, and to encourage users to take actions that are against their own best interests [footnote 317], [footnote 318]. There is also some evidence that AI systems can use new AI-specific manipulation tactics that humans are especially vulnerable to because our defences against manipulation have been developed in response to other humans, not AIs [footnote 319]. However, AI systems can also be instrumental in mitigating AI-powered persuasion.

However, there is general debate regarding the impact of attempts to manipulate public opinion, whether using general-purpose AI or not. A systematic review of relevant empirical studies on fake news revealed that only eight out of the 99 reviewed studies attempted to measure direct impacts [footnote 320]. These studies generally found that the spread and consumption of fake news were limited and highly concentrated among specific user groups, casting doubt on earlier hypotheses about its widespread influence on election outcomes. However, these findings do not necessarily indicate a high resilience to manipulation and persuasion attempts, and fake news can have broader or unintended effects beyond its original purpose. Some studies suggest that while people can theoretically discern true information from false information, they often lack the incentive to do so, instead focusing on personal motivations or maximising engagement on social media [footnote 321], [footnote 322], [footnote 323]. Regardless of the academic debate about effectiveness, public concern about AI-driven attempts to manipulate public opinion remains high – for example, a 2024 survey found that a majority of Americans across the political spectrum were highly concerned about AI being used to create fake information about election candidates [footnote 324]. However, this finding may not be representative of global attitudes.

In addition, there is no consensus on whether the generation of more realistic fake content at scale should be expected to lead to more effective manipulation campaigns, or whether the key bottleneck for such campaigns is distribution (see Figure 2.1). Some experts have argued that the main bottleneck for actors trying to have a large-scale impact with fake content is not generating that content, but distributing it at scale [footnote 325]. Similarly, some research suggests that ‘cheapfakes’ (less sophisticated methods of manipulating audiovisual content that are not dependent on general-purpose AI use), might be as harmful as more sophisticated deepfakes [footnote 326]. If true, this would support the hypothesis that the quality of fake content is currently less decisive for the success of a large-scale manipulation campaign than challenges around distributing that content to many users. Social media platforms can employ various techniques for reducing the reach of content likely to be of this nature. These techniques are often relatively effective, but there are concerns about their impact on free speech. They include human content moderation, labelling of potentially misleading content, and assessing source credibility. At the same time, research has shown for years that social media algorithms often prioritise engagement and virality over the accuracy or authenticity of content, which some researchers believe could aid the rapid spread of AI-generated content generated to manipulate public opinion [footnote 327].

Researchers have also expressed broader concerns about the erosion of trust in the information environment as AI-generated content becomes more prevalent. Some researchers worry that as general-purpose AI capabilities grow and are increasingly used for generating and spreading messages at scale, be they accurate, intentionally false, or unintentionally false, people might come to generally distrust information more, which could pose serious problems for public deliberation. Malicious actors could exploit such a generalised loss of trust by denying the truth of real, unfavourable evidence, claiming that it is AI-generated – a phenomenon known as the ‘liars’ dividend’ [footnote 328], [footnote 329]. However, society might also quickly adjust to AI-induced changes to the information environment. In this more optimistic scenario, people might adapt their shared norms for determining if a piece of information or source is credible or not. Society has adapted in this way to past technological changes, such as the introduction of traditional image editing software.

Figure 2.1: Multiple stages lie between an initial intent to manipulate public opinion and a potential impact on society

Note. While there is strong evidence for technical capability to create AI-generated content, evidence becomes sparse at later stages, reflecting research gaps rather than a proven ineffectiveness of such campaigns. Note that societal impacts can also occur through other mechanisms than depicted here, such as a general erosion of trust in information sources even without measurable changes in individual beliefs. Figure: International Scientific Report on the Safety of Advanced AI.

Since the publication of the Interim Report (May 2024), some new insights have emerged regarding AI-generated content. One recent experimental study found that while people perceive AI-generated fake news as less accurate than human-generated fake news (by about 20%), they share both types at similar rates (approximately 12%), highlighting that fabricated content, whether AI-generated or human-generated, can easily go viral [footnote 330]. In the experiment, nearly 99% of the study subjects failed to identify AI-generated fake news at least once, which the authors attributed to the ability of state-of-the-art large language models (LLMs) to mimic the style and content of reputable sources. New detection methods have successfully combined textual and visual analysis, addressing previous limitations of approaches using only one type of data such as only text or only images [footnote 331].

Current techniques for identifying content generated by general-purpose AI are helpful but often easy to circumvent. Researchers have employed various methods to identify potential AI authorship [footnote 332], [footnote 333]. ‘Content analysis’ techniques explore statistical properties of text, such as unusual character frequencies or inconsistent sentence length distributions, which deviate from patterns typically observed in human writing [footnote 334], [footnote 335], [footnote 336]. ‘Linguistic analysis’ techniques examine stylistic elements, such as sentiment or named entity recognition, to uncover inconsistencies or unnatural language patterns indicative of AI generation [footnote 337], [footnote 338]. Researchers can sometimes also detect AI-generated text by measuring how readable it is, as AI writing often shows unusual patterns compared to human writing [footnote 339]. However, not all AI-generated content is fake news, and some research reveals an interesting bias in fake news detector tools: they tend to disproportionately classify LLM-generated content as fake news, even when it is truthful [footnote 340]. A study of seven widely used AI content detectors identified another potential limitation of these tools: they displayed a bias against non-native English writers, often misclassifying their work as AI-generated [footnote 341]. Finally, AI researchers have also proposed other approaches for detecting AI-generated content, such as ‘watermarking’, in which an invisible signature identifies digital content as generated or altered by AI. Watermarking can help with the detection of AI-generated content, but can usually be circumvented by moderately sophisticated actors, as is discussed in 2.1.1. Harm to individuals through fake content.

Early experiments demonstrate that human collaboration with AI can improve the detection of AI-generated text. This approach increased detection accuracy by 6.36% for non-experts and 12.76% for experts compared to individual efforts in a recent study [footnote 342]. While purely human-based collaborative detection is likely not scalable for dealing with the vast amount of content generated daily, the research still remains valuable. For example, data from human collaboration can be used to train and improve AI detection systems. Moreover, for particularly challenging or high-stakes content, human collaboration can supplement AI detection. However, the long-term effect of such collaborative efforts on public resilience to manipulation attempts remains to be seen, and further studies are required to validate these initial findings.

For policymakers working on reducing the risk of AI-aided manipulation of public opinion, there are several challenges. These include mitigation attempts with protection of free speech [footnote 343], [footnote 344] and determining appropriate legal liability frameworks  [footnote 345], [footnote 346], [footnote 347]. Policymakers also face uncertainty about the actual impact of manipulation campaigns, given mixed evidence on their effectiveness and limited data on their prevalence (see Figure 2.1). Another challenge is the ongoing evolution of AI, adaptive user behaviours, and continuous improvements in AI systems, which creates a perpetual cycle of adaptation and counter-adaptation between detection methods and AI-generated content.

For risk management practices related to the manipulation of public opinion, see:

  • 3.3. Risk identification and assessment
  • 3.4.2. Monitoring and intervention

2.1.3. Cyber offence

Key information

(See Chair’s update on the latest AI advances).

  • Attackers are beginning to use general-purpose AI for offensive cyber operations, presenting growing but currently limited risks. Current systems have demonstrated capabilities in low- and medium-complexity cybersecurity tasks, with state-sponsored threat actors actively exploring AI to survey target systems. Malicious actors of varying skill levels can leverage these capabilities against people, organisations, and critical infrastructure such as power grids.
  • Cyber risk arises because general-purpose AI enables rapid and parallel operations at scale and lowers technical barriers. While expert knowledge is still essential, AI tools reduce the human effort and knowledge needed to survey target systems and gain unauthorised access.
  • General-purpose AI offers significant dual-use cyber capabilities. Evidence indicates that general-purpose AI could accelerate processes such as discovering vulnerabilities, which are essential for launching attacks as well as strengthening defences. However, resource constraints and regulations may prevent critical services and smaller organisations from adopting AI-enhanced defences. The ultimate impact of AI on the attacker-defender balance remains unclear.
  • Since the publication of the Interim Report (May 2024), general-purpose AI systems have shown significant progress in identifying and exploiting cyber vulnerabilities. AI systems have autonomously found and exploited vulnerabilities in real open source software projects. Recent research prototypes have autonomously found and exploited vulnerabilities that take the fastest human security teams minutes to find, but struggle with more complex scenarios. General-purpose AI was also used to find and fix a previously unknown exploitable vulnerability in widely-used software (SQLite).
  • In principle, the risk appears at least partially manageable, but there are key assessment challenges. Rapid advancements in capabilities make it difficult to rule out large-scale risks in the near term, thus highlighting the need for evaluating and monitoring these risks. Better metrics are needed to understand real-world attack scenarios, particularly when humans and AIs work together. A critical challenge is mitigating offensive capabilities without compromising defensive applications.

Key definitions

  • Malware: Harmful software designed to damage, disrupt, or gain unauthorised access to a computer system. It includes viruses, spyware, and other malicious programs that can steal data or cause harm.
  • Ransomware: A type of malware that locks or encrypts a user’s files or system, making them inaccessible until a ransom (usually money) is paid to the attacker.
  • Hacking: The act of exploiting vulnerabilities or weaknesses in a computer system, network, or software to gain unauthorised access, manipulate functionality, or extract information.
  • Penetration testing: A security practice where authorised experts or AI systems simulate cyberattacks on a computer system, network or application to proactively evaluate its security. The goal is to identify and fix weaknesses before they can be exploited by real attackers.
  • CTF (Capture the Flag) challenges: Exercises often used in cybersecurity training, designed to test and improve the participants’ skills by challenging them to solve problems related to cybersecurity, such as finding hidden information or bypassing security defences.
  • Zero-day vulnerability: An undiscovered or unpatched security flaw in software or hardware. As attackers can already exploit it, developers have ‘zero days’ to fix it.
  • Hardware backdoor: A feature of a device, intentionally or unintentionally created by a manufacturer or third party, that can be used to bypass security protections in order to monitor, control, or extract data without the user’s knowledge.

Offensive cyber operations typically involve designing and deploying malicious software (malware) and exploiting vulnerabilities in software and hardware systems, leading to severe security breaches. A standard attack chain begins with reconnaissance of the target system, followed by iterative discovery, exploitation of vulnerabilities, and additional information gathering. These actions demand careful planning and strategic execution to achieve the adversary’s objectives while avoiding detection. Some experts are concerned that general-purpose AI could enhance these operations by automating vulnerability detection, optimising attack strategies, and improving evasion techniques [footnote 348], [footnote 349]. These advanced capabilities would benefit all attackers. For instance, state actors could leverage them to target critical national infrastructure (CNI), resulting in widespread disruption and significant damage. At the same time, general-purpose AI could also be used defensively, for example to find and fix vulnerabilities.

General-purpose AI can assist with information-gathering tasks, thereby reducing human effort. For example, in ransomware attacks, malicious actors first manually conduct offensive reconnaissance and exploit vulnerabilities to gain entry to the target network, and then release malware that spreads without human intervention [footnote 350]. The entry phase is often technically challenging and prone to failure. General-purpose AI is being explored by state-sponsored attackers as an aid to speed up the process [footnote 351], [footnote 352]. However, while there are general-purpose systems that have performed vulnerability discovery autonomously (see next paragraphs), published systems have not yet autonomously executed real-world intrusions into networks and systems – tasks that are inherently more complex.

Figure 2.2: Recent advances in AI models’ ability to find and exploit cybersecurity vulnerabilities autonomously has grown across multiple benchmarks

Note. On DARPA and ARPA-H’s AI Cyber Challenge [footnote 353], [footnote 359], OpenAI’s new o1 model (Sept 2024) substantially outperformed GPT-4o (May 2024), autonomously detecting 79% of vulnerabilities compared to 21% for GPT-4o. Testing on Cybench [footnote 358] showed vulnerability detection rates improving from 10% (Claude 3 Opus, March 2024) to 17.5% (Claude 3.5 Sonnet, June 2024). OpenAI’s internal CTF hacking competition evaluations at high school level rose from 20% to 26.7%, though models still struggle with more complex benchmark tasks [footnote 2]. Only the best-performing new model for each month is shown. Sources: Defense Advanced Research Projects Agency, 2024 [footnote 353]; Ristea et al., 2024 [footnote 359]; Zhang et al., 2024 [footnote 358]; OpenAI, 2024 [footnote 2].

General-purpose AI can assist attackers with vulnerability discovery (VD) in source code to some extent, but traditional methods remain dominant for now. In this task, the analyst examines the source code of a software project (such as an open source web server or firewall) to identify exploitable security flaws.

Since the publication of the Interim Report (May 2024), the cyber capabilities of general-purpose AI in vulnerability discovery have increased significantly. At the DARPA AIxCC challenge [footnote 353] participants developed  systems capable of autonomously finding, exploiting, and fixing vulnerabilities in real open source software projects using general-purpose AI [footnote 354], [footnote 355], [footnote 356]. Figure 2.2 illustrates the significant improvement in the performance of general-purpose AI models at finding and exploiting (and sometimes fixing) cyber vulnerabilities. Moreover, Google’s Big Sleep was used to discover a previously unknown exploitable vulnerability in the widely-used open source software SQLite [footnote 357]. In this case, the discovery was used to fix rather than exploit the vulnerability. Penetration testing benchmarks (metrics) have also advanced considerably, providing a much clearer signal of the models’ capabilities and improvement over time [footnote 358].

General-purpose AI has shown low to moderate success in automating system and network hacking. Unlike automated vulnerability detection in software, where AI systems rely on access to source code, hacking is more challenging for AI, as it must execute every step of an attack with little or no prior knowledge of the target system’s inner workings (for example, gathering information on the target, finding entry points, breaking in, moving through the system and achieving its goal). Real-world hacking operations require exploratory actions and iterative adjustments to understand a target system’s operation, often involving hypothesis testing and shifting strategy dynamically [footnote 360]. These tasks have resisted full automation because they require an exceptional level of precision – where even a single incorrect character in an input can cause the entire approach to fail – and involve resolving multiple complex subtasks without explicit guidance or feedback.

Since the publication of the Interim Report (May 2024), general-purpose AI’s cybersecurity attack capabilities have improved, but AI models still cannot beat human experts and they struggle with more complex scenarios. CTF challenges, where the attacker has to identify and exploit vulnerabilities to gain access to protected data or systems, have emerged as a typical cybersecurity benchmark. Before the Interim Report (May 2024), general-purpose AI could carry out simple attacks [footnote 127], [footnote 361], [footnote 362] but not sophisticated ones. Since then, further research has managed to achieve better results with AI systems. For example, teams of LLM agents can collaborate effectively to find previously unknown (‘zero-day’) vulnerabilities, albeit not highly complicated ones [footnote 363]. Additionally, access to better tools [footnote 364] and the introduction of modules that enable step-by-step reasoning [footnote 365] have enabled general-purpose AI models to solve tasks from established CTF challenges (easy and medium difficulty). However, without these reasoning aids, Google reports that their latest model, Gemini 1.5, shows no performance gains on their CTF benchmarks compared to prior versions, and it only shows improvements in simple offensive cybersecurity tasks [footnote 49]. OpenAI reports that while its recent o1 model improves over the benchmark scores of GPT4o, it is still classified as ‘low-risk’ in this area and remains within manageable misuse limits [footnote 2]. Various models and collaborations between models achieve performance comparable to humans who are given around 35 minutes per task: baseline Sonnet 3.5, o1-preview advising o1-preview (both versions), and o1-mini advising GPT-4o [footnote 129]. This collaborative dynamic, where models advise and refine each other’s outputs is increasingly useful for multi-step error-intolerant tasks such as cyber offence (see also section 1.2. Current capabilities). Models without guidance were not able to solve CTF challenges that took the best human expert teams more than 11 minutes of work [footnote 358]. As expected, more recent models (e.g., OpenAI’s GPT-4o and o1-preview) perform better but still struggle to generate insights that take experts longer to figure out.

General-purpose AI systems can reduce the technical knowledge and expertise required to carry out individual steps of the attack chain. In a typical attack chain, an attacker might start with reconnaissance to identify potential vulnerabilities, use a phishing campaign to gain initial access, obtain privileges within the target system, move laterally across the network, and ultimately exfiltrate sensitive data or deploy ransomware. By automating or assisting with parts of the attack chain, general-purpose AI lessens the need for expert involvement, thereby lowering the barrier to entry for more sophisticated attacks. However, while AI can accelerate the process of reviewing publicly available information, this does not automatically result in advanced expertise. In domains such as vulnerability exploitation, general-purpose AI can assist, but experts still need to incorporate significant domain-specific knowledge to make these AI systems effective [footnote 353], [footnote 366], a need that has not changed since the Interim Report (May 2024).

State-sponsored hacking groups have reportedly used general-purpose AI to support hacking. For example, such groups have used general-purpose AI for translating technical papers, analysing publicly disclosed vulnerabilities, researching public protocols (e.g. satellite communications), assisting with scripting, troubleshooting errors, and developing detection evasion techniques for malware and intrusions [footnote 351].

General-purpose AI is likely to tip the current balance in favour of the attackers only under specific conditions:

(1) if general-purpose AI automates tasks that are needed for attack but not the corresponding defences; or
(2) if cutting-edge general-purpose AI capabilities are accessible to adversaries but not equally available to all defenders. In particular, small and medium enterprises (SMEs) may not be able to afford general-purpose AI-enhanced defence solutions.

For example, hospitals, constrained by limited security resources and the complexity of heterogeneous legacy networks, may be slower to adopt AI-driven defences, leaving their highly sensitive data more exposed to sophisticated cyberattacks. Similarly, CNI systems (such as electricity substations) often have strict criteria and are cautious in adopting new technologies, including AI-based defences, due to security concerns and governance and/or regulatory requirements. In contrast, adversaries are not bound by such constraints and can adopt advanced AI capabilities more rapidly.

Even if AI-driven detection catches vulnerabilities in new code before it reaches production, a major challenge remains: source code that is already in use and predates these capabilities. Much of this legacy code has not been scrutinised by advanced AI tools, leaving potential vulnerabilities undetected. Patching these vulnerabilities after discovery is a slow process, particularly in production environments where changes require rigorous testing to avoid disrupting operations. For example, the Heartbleed vulnerability continued to expose systems for weeks after a patch was available, as administrators faced delays in implementing it [footnote 367]. This situation will potentially create a critical transition period, wherein defenders must manage and patch older, unvetted code while attackers, unencumbered by such constraints and potentially equipped with advanced AI, can exploit these vulnerabilities with less effort (a capability asymmetry). During this transition, the disparity in AI adoption – especially among SMEs and critical infrastructure systems that are slower to integrate new technologies like AI – could amplify the imbalance between attackers and defenders.

The defensive counterparts to certain offensive tasks are considerably more complex, creating asymmetry in the effectiveness of general-purpose AI when used by attackers versus defenders. For example, attackers using general-purpose AI can stealthily embed threats at the hardware level [footnote 368] in ways that are inherently difficult for defenders to predict or detect. Thus, attackers control how concealed and complex the vulnerabilities are, while defenders must anticipate and detect these deliberately obscured threats. The Stuxnet malware [footnote 369] demonstrated how such attacks can cause physical damage by targeting industrial control systems – it disrupted Iran’s nuclear facilities by manipulating hardware operations. While there is no public evidence that AI has been used to automate and escalate such threats in production systems, its potential impact on cybersecurity warrants careful monitoring. On the other hand, some AI applications could offer asymmetric benefits to defenders as well. For example, AI could enhance the security of chips – such as those used in smartphones – by detecting and mitigating vulnerabilities during the design process [footnote 370]. Additionally, general-purpose AI has already been integrated in auditing and debugging tools [footnote 371], [footnote 372].

The main evidence gaps around current AI cyber capabilities include:

  • Comprehensive capability assessment: more empirical studies are needed to evaluate AI performance across complex, real-world attack chains and to track capability trends, particularly for multi-step attack automation. Existing benchmarks such as CTF challenges offer partial insights but often fail to capture the full scope of AI-driven offensive capabilities. For example, benchmarking in specialised environments, such as cyber-physical infrastructure testbeds, would allow for a more realistic assessment of AI’s impact in high-stakes scenarios. Additionally, the lack of human performance baselines makes it difficult to contextualise the complexity of tasks in terms of human-hours, hindering accurate comparisons of AI and human capability.
  • Evaluating human-AI collaborative offence: research into how attackers could leverage AI alongside human expertise is essential to understand potential offensive advancements. Studies should explore how AI can enhance human-led operations in areas such as strategic decision-making, resource allocation, and real-time adjustments, potentially increasing both the effectiveness and sophistication of cyberattacks. Moreover, AI models often produce ‘near misses’ that humans with moderate cyber experience could readily address, suggesting a synergistic benefit when humans and AI collaborate in offensive efforts.

Policymakers focusing on cyber risks will face challenges including reliably assessing the risks and capabilities of AI in offensive and defensive contexts. Cyber risk benchmarks can sometimes overstate performance compared to real-life scenarios because they often use challenges and code sourced from platforms such as GitHub, which models may have encountered during training. As a result, these models might already be familiar with the code or have benefited from tutorials and solution manuals found in blogs and other online resources. However, capability assessments can also be understated because it is difficult to elicit a system’s full capabilities (1.2. Current capabilities). Additionally, the success rates reported in benchmarks typically exclude near misses (instances where the AI model almost succeeds in the attack) [footnote 358], which could easily be exploited by a human operator to complete the attack.

Policymakers will also face significant challenges in regulating offensive AI research while retaining defensive capabilities. Offensive cyber research is important for maintaining robust defences, and restricting it could weaken national security strategies, especially if other nations do not impose similar limitations. Policymakers need to weigh the risks of misuse against the benefit of such research and find opportunities to reduce misuse risks while protecting defensive applications (see 3.3. Risk identification and assessment for further discussion on evaluating risks and harmful capabilities). Another critical issue is managing the trade-offs involved in openly releasing general-purpose AI model weights, which carries both significant benefits and risks of misuse, as explored in 2.4. Impact of open-weight general-purpose AI models on AI risks.

For risk management practices related to cyber offence, see:

  • 3.3. Risk identification and assessment
  • 3.4.1. Training more trustworthy models
  • 3.4.2. Monitoring and intervention
  • 3.4.3. Technical methods for privacy

2.1.4. Biological and chemical attacks

Key information

(See Chair’s update on the latest AI advances).

  • Growing evidence shows general-purpose AI advances beneficial to science while also lowering some barriers to chemical and biological weapons development for both novices and experts. New language models can generate step-by-step technical instructions for creating pathogens and toxins that surpass plans written by experts with a PhD and surface information that experts struggle to find online, though their practical utility for novices remains uncertain. Other models demonstrate capabilities in engineering enhanced proteins and analysing which candidate pathogens or toxins are most harmful. Experts could potentially use these in developing both more advanced weapons and defensive measures.
  • The real-world impact of AI on developing and using weapons including pandemic pathogens remains unclear due to secrecy requirements, testing prohibitions, and a need for better evaluations. Key evidence about malicious actors, their technical bottlenecks, and AI safety assessments relating to biological weapons are kept confidential to prevent misuse. Testing is often prohibited given the severe dangers these weapons pose. More evaluations are needed to assess how strongly current systems can aid the many steps of weapons development; substantial expertise and resources remain necessary barriers.
  • In recent months, advances have generated greater evidence of risk and expanded the biological capabilities of general-purpose AI, and there are emerging efforts to develop best practices for evaluation. Since the Interim Report (May 2024), general-purpose language models have made substantial advances in tests of biological weapons expertise and general scientific reasoning. AI has also demonstrated new capabilities in protein design and in working with multiple types of scientific data – including chemicals, proteins, and DNA – enhancing its ability to design complex biological structures. The implications for risks are still being studied, with initial evidence suggesting a rise in potential risks alongside benefits.
  • If rapid advancement continues, this creates urgent policy challenges for evaluating and managing biological risks. Recent rapid advances in risk benchmarks make it increasingly hard to rule out large-scale risks in near-future models. Policymakers need to make decisions with incomplete information and integrate classified threat research. Adding to these challenges are the ongoing debates over the risk-benefit trade-offs of releasing open-weight models, particularly AI tools for creating biological and chemical structures, and the fact that policies that depend on humans to detect risk and intervene may be too slow to address the current pace of development.

Key definitions

  • Dual-use science: Research and technology that can be applied for beneficial purposes, such as medicine or environmental solutions, but also potentially misused to cause harm, such as in biological or chemical weapon development​.
  • Toxin: A poisonous substance produced by living organisms (such as bacteria, plants, or animals), or synthetically created to mimic a natural toxin, that can cause illness, harm, or death in other organisms depending on its potency and the exposure level.
  • Pathogen: A microorganism, for example a virus, bacterium, or fungus, that can cause disease in humans, animals, or plants​.
  • Agent: For the purposes of this section, ‘agent’ usually refers to a biological, chemical, or toxicological substance that can harm living organisms. Agents in this sense are not to be confused with AI agents (see below).
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.
  • Biosecurity: A set of policies, practices, and measures (e.g. diagnostics and vaccines) designed to protect humans, animals, plants, and ecosystems from harmful biological agents, whether naturally occurring or intentionally introduced.

The risks associated with dual-use science are a significant focus in international AI safety policy; this section focuses on chemical and biological weapons but there are also risks concerning radiological and nuclear weapons. These weapons of mass destruction, originally developed through scientific research intended for peaceful purposes, exemplify the phenomenon of ‘dual-use science’ – where innovations are repurposed for military applications. Among these, the focus of this section is on chemical and biological weapons which are of particular concern due to the relative ease of obtaining necessary materials and the widespread availability of related information. As a result, biological weapon risks have taken centre stage in AI safety summits and broader conversations about the potential catastrophic impacts of advanced AI. In contrast, the risk of AI expanding access to nuclear and radiological weapons is considered lower, primarily because of the significant barriers to acquiring the requisite materials. However, AI’s involvement in nuclear decision-making would introduce unique risks. Some experts voice concerns that delegating decision-making authority for nuclear weapon launches to AI systems could increase the likelihood of critical errors (see 2.2.1. Reliability issues or a loss of control (see 2.2.3. Loss of control) [footnote 373]. Dual-use science risks extend to other advances such as navigation systems, nanotechnology, autonomous robots and drones, all of which have military applications that are beyond the scope of this report.

Figure 2.3

Figure 2.3: Dual-use capabilities in biology have been increasing over time for LLMs (red, [footnote 2]), biological general-purpose AI such as AlphaFold3 (blue, [footnote 23]) and specialised (not general-purpose) models relevant to pathogens (green) (390). This chart shows performance scores, calculated as percentage accuracies for recently published results compared to previous state-of-the-art results. Recent advancements in LLMs (red) were especially rapid, comparing GPT4o (released May 2024) to o1 (released September 2024). Notable advances are LLMs’ accuracy in answering questions about the release of bioweapons, which increased from 15% to 80%, and biological AIs’ ability to predict how proteins interact with small molecules (including in both medicines and chemical weapons), which increased from 42% to 90% during 2024. Due to a lack of standardised benchmarks, and inconsistencies in the way that accuracy is calculated, comparisons are limited to a few tasks and are not consistently repeated over time. Sources: OpenAI, 2024 [footnote 2] (for LLMs); Abramson et al., 2024 [footnote 23] (comparison of AlphaFold3 with the previous state of the art); Thadani et al., 2023 (390) (for specialised models relevant to pathogens).

Some general-purpose AI has been developed specifically for scientific domains, offering general capabilities for understanding and designing chemicals, DNA and proteins. Models trained on scientific data range in their abilities, from narrow applications such as predicting the structure of proteins, to offering a variety of prediction and design capabilities. In this report, broadly capable models trained on scientific data are included in the definition of general-purpose AI. However, there is substantial debate within the AI and biology communities regarding the point at which a model trained on scientific data can be called a ‘general-purpose model’ (see Introduction for a definition) or ‘foundation model’ [footnote 45]. For example, AlphaFold2 was designed for the narrow task of protein structure prediction but, through fine-tuning, has been found to be applicable to a high variety of other tasks, such as predicting protein interactions, predicting small molecular binding sites, and predicting and designing cyclic peptides [footnote 374]. For these reasons, it satisfies this report’s definition of a general-purpose AI model. AlphaFold3 has been able to achieve these tasks at greater accuracy, and across a wider range of molecules, even without fine-tuning [footnote 23]. These scientifically geared AI tools amplify the potential for chemical and biological innovation by accelerating scientific discovery, optimising production, and enabling the precise design of new biological parts. They also offer promising opportunities to develop new medicines and better combat infectious diseases [footnote 375], [footnote 376]. These tools have generated substantial advancements in science, sufficient to earn their creators the Nobel Prize in Chemistry [footnote 377].

The dual-use nature of scientific progress poses complex risks, as innovations meant for beneficial purposes, such as medicine, have historically led to the creation of chemical and biological weapons [footnote 378], [footnote 379]. The vast majority of harms from toxins and infectious diseases have resulted from naturally-occurring events, sparking extensive research to help combat these threats. The intentional development and deployment of biological weapons was informed by this research, but poses substantial difficulties [footnote 380], [footnote 381]. Many believe that advances in the design, optimisation and production of chemical and biological products, in part due to AI, may have made the development of chemical and biological weapons easier [footnote 382], [footnote 383], [footnote 384], [footnote 385]. Evidence discussed in this section suggests that general-purpose AI amplifies weapons risks by helping novices (typically defined as people with a bachelor’s degree or less in a relevant discipline) to create or access existing biological and chemical weapons, and allowing experts (typically referring to someone with a PhD or higher in a relevant discipline) to design more dangerous or targeted weapons, or create existing weapons with less effort.

Since the publication of the Interim Report (May 2024), general-purpose AI models’ ability to reason and integrate different data types has improved, and progress has been made in formulating best practices for biosecurity. Several models have been published since the Interim Report (May 2024) that integrate different types of scientific data; one foundation model for scientific data, AlphaFold 3, can predict the structure of, and interactions between, a range of molecules including chemicals, DNA, and proteins with greater accuracy than the previous state of the art (see Figure 2.3) [footnote 23], and another, ESM3, can simultaneously model protein sequence, structure, and function [footnote 386]. These developments open up new possibilities for designing biological products that do not strongly resemble natural ones [footnote 387]. The recently released o1, a general-purpose language model, has significantly improved performance in tests of biological risk measures (also shown in Figure 2.3) and general scientific reasoning compared to previous state-of-the-art models [footnote 2]. Efforts to formulate biosecurity best practices have advanced, with the Frontier Model Forum and the AI x Bio Global Forum facilitating discussions on risk evaluation and mitigation for these models [footnote 388], [footnote 389], [footnote 390].

Figure 2.4 Overview of a typical chemical and biological product development pipeline, which parallels the process used for creating chemical and biological weapons

Figure 2.4: Overview of a typical chemical and biological product development pipeline, which parallels the process used for creating chemical and biological weapons. LLMs can aid in the planning and acquisition stages, advise on performing laboratory work to build and test a design, and aid in planning the effective release or delivery of a product. AI agents, robotics platforms, and biological or chemical design tools (general-purpose or specialised) can aid in the design, building, testing and refinement of pathogens and toxins. Specialised AI can assist with mass production and formulation. Figure: International Scientific Report on the Safety of Advanced AI.

LLMs can now provide detailed, step-by-step plans for creating chemical and biological weapons, improving on plans written by people with a relevant PhD. Although information about how to create chemical and biological threats has long been accessible due to its dual-use nature, tests of LLMs show that they help novices synthesise this information, allowing them to develop plans faster than with the internet alone [footnote 391] (for the ‘Planning’ and ‘Release’ phases in Figure 2.4). These capabilities lower barriers for people to access complex scientific information, which can likely provide broad benefits but can also lower barriers to misusing this information. GPT-4, released in 2023, correctly answered 60–75% of bioweapons-relevant questions [footnote 392], but a range of models tested provided no significant improvement over biological weapon plans developed using only the internet [footnote 37], [footnote 393], [footnote 394]. However, the recent o1 model produces plans rated superior to plans generated by experts with a PhD 72% of the time and provides details that expert evaluators could not find online [footnote 2]. OpenAI concluded that their o1 models could meaningfully assist experts in the operational planning of reproducing known biological threats, leading OpenAI to increase their assessment of biological risks from ‘low’ to ‘medium’. However, OpenAI did not assess the models’ usefulness for novices [footnote 2], underscoring the need for more research. Successfully developing and deploying bioweapons still requires significant expertise, materials and skilled physical work [footnote 380], [footnote 381], meaning that even if a novice has a well-formulated plan, this does not imply that they could successfully carry it out.

Evidence shows that general-purpose AI can, in some instances, instruct users on how to acquire dangerous biological and chemical agents by circumventing traditional controls. Restricted access to dangerous materials and some of their precursors has been a key defence against biological and chemical threats (the ‘Acquisition’ phase in Figure 2.4). However, sometimes biological agents can be sourced from nature or synthesised from DNA, and skilled chemists can identify alternative routes to creating some chemical weapons, circumventing controls. General-purpose AI can assist in identifying these alternative acquisition routes, lowering barriers to access and creating risks of accidents or misuse [footnote 120]. Several studies suggest that AI could also undermine existing controls on access to risky DNA sequences. Many commercial providers of DNA screen their orders for similarity to known biological hazards in order to comply with regulatory controls and prevent the misuse of these materials. However, LLMs can guide customers to purchase DNA from providers that do not screen, or suggest methods to fool screening software [footnote 391]. Furthermore, a recent study found that some screening software fails to detect a large proportion of DNA that is designed by specialised AI tools to function in the same way as these hazards but appear dissimilar. Fortunately, the same study found that it is possible to update current systems to detect around 97% of these designs [footnote 395].

AI’s ability to design highly targeted medical treatments has increased substantially since the Interim Report (May 2024), and chat interfaces are expanding access, also heightening the risk of more potent toxins being created [footnote 384]. Both specialised and general-purpose AI tools can now design candidate therapeutic molecules for complex diseases such as cancer, autoimmune disorders, and neurological conditions (the ‘Design’ phase in Figure 2.4) [footnote 396]. For instance, AlphaProteo can design proteins that attach to targets up to 300x more strongly than existing alternatives, potentially making them more effective at lower doses [footnote 387]. However, precise targeting of these systems could also be used for malicious purposes [footnote 397]. AI-driven chemical design tools intended to reduce toxicity have been repurposed in research studies to increase it, potentially aiding in the design of chemical weapons [footnote 398], and some tools have been specifically designed for toxin creation [footnote 399]. Access to specialised design tools varies: some are restricted to trusted partners [footnote 387], while others are open-weight and can thus be used by anyone [footnote 399]. Although many of these tools are too complex for use by novices, chatbots and AI agents are being integrated with some design tools [footnote 400], [footnote 401], [footnote 402], allowing users to request designs in plain language. Today, this integration still requires technical knowledge for effective use. A direct assessment of the risks these tools pose for toxin weapon development is likely to be constrained in countries that adhere to international treaty commitments [footnote 403].

General-purpose AI enhances researchers’ ability to predict important properties of pathogens, potentially aiding in both bioweapon and countermeasure design. AI tools are being developed to predict new virus variants before they emerge and to assess properties such as their ability to infect humans [footnote 404], [footnote 405] and evade immune detection (390). These advances potentially enable proactive vaccine development for high-risk virus variants that have not yet emerged, or the malicious design of viruses that can bypass existing immunity in the population (390), [footnote 406] (the ‘Design’ phase in Figure 2.4). General-purpose AI models trained on biological data are beginning to underpin these specialised applications. For instance, the tool EVEscape, which leverages a DNA foundation model [footnote 407] and relies on protein structure predictions which are increasingly generated using AI, predicted 66% of SARS-CoV-2 (coronavirus) variants that later became dominant – far exceeding previous models (17% success) (see Figure 2.3) (390). Tools that design viruses which evade the human immune system and target specific cells are useful for gene therapy applications, but they also pose dual-use risks [footnote 408], such as enhancing bioweapons [footnote 382], [footnote 384] or targeting specific populations [footnote 384]. Simple modifications to existing pathogens could significantly increase their risk. For example, in research performed without using AI, bird flu viruses which can be lethal to humans have been modified by researchers to spread through airborne droplets [footnote 409]. Indeed, some experts believe that engineered diseases could be far worse than any occurring naturally [footnote 384], [footnote 410]. Investigation of AI systems’ ability to create more dangerous pathogens may be constrained by both international treaty commitments and the risk of accidental release of pathogen variants being tested.

General-purpose AI can help plan and guide laboratory work, but it often omits critical safety information, and tests of real-world success using this assistance have not been published. At the time of the Interim Report (May 2024), LLM-generated lab plans showed no significant improvement over those compiled from the internet [footnote 393] (the ‘Build’ phase in Figure 2.4). However, the o1 model produced laboratory instructions that were preferred over PhD-written ones 80% of the time (up from 55% for GPT-4), with its accuracy in identifying errors in lab plans increasing from 57% to 73% [footnote 2]. Despite this, the omission of crucial safety details – such as when an experiment would produce an explosive intermediate, or when protective equipment should be worn to complete a task – remains a significant concern that could result in serious accidents [footnote 2]. Evaluations of how well novices perform lab work under LLM instruction have not yet been released publicly, making the credibility of these risks an area of substantial debate.

Laboratory and design automation accelerate the refinement of biological designs. This potentially lowers barriers to AI-designed products (including weapons), but limited implementation complicates risk assessment. Biochemical designs often go through ‘Design-Build-Test-Learn’ (DBTL) cycles to test and improve on promising initial designs (shown in Figure 2.4). AI-driven tools automate these cycles for better results in less time [footnote 411], [footnote 412], [footnote 413], [footnote 414]. ‘Self-driving labs’ or ‘robot scientists’, a nascent area of scientific development, can complete these cycles without human intervention [footnote 412], [footnote 415], [footnote 416]: one instance completed 20 rounds of design improvement in two months – or 1–2 weeks of uninterrupted production – compared to 6–12 months manually [footnote 417]. AI agents are expected to play a growing role in this process [footnote 402], [footnote 415], and studies suggest that robotic systems can digitally capture some of the fine motor skills required to successfully execute experiments that novices traditionally acquired through years of hands-on experience [footnote 384], [footnote 418]. If intricate experimental skills are captured by robotics platforms, then advanced biological capabilities would become more accessible to actors with less technical skill, but this has not been systematically tested for laboratory skills required for biological weapons development. Full automation of laboratory work is still a challenge, for example due to machine failures [footnote 416].

AI applications in biotechnology are lowering some barriers to the weaponisation and delivery of chemical and biological agents, but these stages remain technically complex. Challenges such as mass production, stabilisation, and effective dispersal have caused failures in state-sponsored weapons programmes​​ [footnote 380], [footnote 381] and late-stage therapeutics [footnote 419], [footnote 420] (the ‘Mass production and formulation’ and ‘Delivery’ phases in Figure 2.4). Foundation models trained on protein data have improved the efficiency of protein functions (up to 60%) so less product is needed, improved yield by 4x, and improved the stability of materials by 20% [footnote 421], enabling better production of products that could be used as weapons or therapeutics. However, the mass production of entire organisms remains difficult and AI applications that attempt to aid this process remain narrow in their capabilities [footnote 422]. Simple AI models can also offer basic support for formulating delivery methods, such as powders and aerosols [footnote 423], a process which has been considered a major barrier to success (in both weapons and therapeutics development) [footnote 424]. Although a recently developed LLM achieved 100% and 80% success respectively in simulated mass production and delivery tasks [footnote 2], details of these tests are not available, so it is unclear how well they capture the practical challenges associated with these steps.

Key evidence gaps include a lack of transparency and consistency in safety evaluations and challenges in measuring biological design capabilities. AI evaluations may define ‘novices’ as members of the public, or people with a bachelor’s degree in a specific field, while ‘experts’ may be people who hold a PhD in a relevant discipline, or people with decades of experience in a specialist subject. In some assessments, these terms are not defined, making it difficult to assess how much models improve human abilities, and how many people would be able to effectively harness these capabilities. While many studies have explored AI’s role in biological weapon development [footnote 2], [footnote 51], [footnote 318], [footnote 393], [footnote 425], [footnote 426], AI biosecurity evaluation is still a nascent field, with few standardised benchmarks or risk assessments, making it difficult to compare capabilities and measure the risk created by a new tool compared to pre-existing technologies (called ‘marginal risk’). Assessments of AI protein and chemical design capabilities are particularly difficult, as they require a costly process of building and testing the designs. Key information on risks is unlikely to be made publicly available due to confidentiality agreements and concerns over the potential for raising awareness of more promising avenues for weaponising biology [footnote 427]. A final challenge is assessing the overall risk of biological and chemical weapons development and deployment, rather than evaluating tools and capabilities in isolation.

Efforts to limit the misuse potential of general-purpose AI systems trained on biological and chemical data are underway but remain rare and underdeveloped compared to those for LLMs. Safeguards designed for other AI models do not translate directly to those trained on biological or chemical data [footnote 383]. Challenges to controlling risky outputs are twofold: 1) there are a wide range of potentially dangerous outputs from biochemical design tools, which cannot be easily defined by a content filter, and 2) beneficial outputs of AI for therapeutics strongly (or completely) overlap with these risky outputs, tightly entangling the risks with the benefits. While the protein design community has issued a broad statement on responsible use, concrete implementation plans are currently lacking [footnote 428]. Risk mitigation techniques for these models have been proposed but have so far received limited development and testing [footnote 429]. However, some AI developers have excluded pathogen data [footnote 386], [footnote 430] or restricted access to high-risk tools [footnote 387], [footnote 431] to reduce risk. Efforts to prevent general-purpose AI models from providing dual-use outputs are further complicated by strong community pressures to release models as open-weight and under open source licences [footnote 432], meaning they can be downloaded and adapted by anyone for any purpose (see [2.4. Impact of open-weight general-purpose AI models on AI risks]. For example, general-purpose AI models initially trained without dangerous viral sequences were later fine-tuned with this data for beneficial applications [footnote 433], [footnote 434].

Policymakers face key challenges in balancing the benefits and risks of capabilities, particularly when setting boundaries for enhanced oversight. General-purpose AI models trained on biological and chemical data are often open-weight and less compute-intensive, making safeguards difficult to enforce [footnote 384], as discussed in [2.4. Impact of open-weight general-purpose AI models on AI risks] [footnote 435]. Countries that have signed the Chemical Weapons Convention (CWC) and the Biological and Toxin Weapons Convention (BTWC) are required to prevent the development and use of any chemical and biological weapons, but AI risk assessments often focus only on high-consequence risks, such as pandemics, and overlook proliferation risks of chemical weapons and toxins [footnote 318], [footnote 410]. Policymakers face a challenge in determining which capabilities warrant stricter regulations while supporting beneficial research, which includes developing protections against the risks described in this section. The assessment of these risks is further complicated by the fact that key evidence is often classified information [footnote 427].

Advances in biological design are occurring rapidly, creating marked uncertainty about future capabilities and risks. Monitoring the uptake, success and sophistication of AI in each step of the biotechnology product development process will be crucial for understanding their impact on biotechnology and bioweapons programmes and policymakers’ capacity to develop preventive and protective measures against such risks. If a transformative, dangerous capability associated with an already-released AI tool is announced, there may be little that can be done to address the risk. Developing a more thorough risk assessment methodology would allow mitigations to be triggered before severe risks can materialise, reduce the risk that unnecessary mitigations are taken, and therefore enable the substantial benefits of general-purpose AI technology.

For risk management practices related to dual-use science, see:

  • 3.3. Risk identification and assessment
  • 3.4.1. Training more trustworthy models
  • 3.4.2. Monitoring and intervention

2.2. Risks from malfunctions

2.2.1. Reliability issues

Key information

  • Relying on general-purpose AI products that fail to fulfill their intended function can lead to harm. For example, general-purpose AI systems can make up facts (‘hallucination’), generate erroneous computer code, or provide inaccurate medical information. This can lead to physical and psychological harms to consumers and reputational, financial and legal harms to individuals and organisations.
  • Such reliability issues occur because of technical shortcomings or misconceptions about the capabilities and limitations of the technology. For example, reliability issues may stem from technical challenges such as hallucinations, or from users applying systems to unsuitable tasks. Existing guardrails to contain and mitigate reliability issues are not fail-proof.
  • Because of the many potential uses of general-purpose AI, reliability issues are hard to predict. Pre-release evaluations miss reliability issues that only manifest in real-world usage. In addition, existing techniques to measure reliability issues are not robust, which means that it is also not yet possible to dependably assess prevention and mitigation techniques.
  • Researchers are trying to develop more useful measurement and mitigation techniques, particularly, to address technical shortcomings. Since the publication of the Interim Report (May 2024), measurements and mitigation strategies for addressing reliability issues with general-purpose AI have expanded.
  • A key challenge for policymakers is the lack of standardised practices for predicting, identifying, and mitigating reliability issues. Underdeveloped risk management makes it difficult to verify developers’ claims about general-purpose AI functionalities. Policymakers also face a challenge in balancing the promotion of innovation while discouraging over-reliance on AI.

Key definitions

  • Reliability: An AI system’s ability to consistently perform its intended function.
  • Confabulations or hallucinations: Inaccurate or misleading information generated by an AI system, for instance false facts or citations.

General-purpose AI can suffer reliability issues – sometimes with hazardous consequences – impacting people, organisations, and social systems. Important categories of general-purpose AI reliability issues include (see Table 2.2):

  • Confabulations or hallucinations [footnote 101], i.e. inaccurate or misleading content.
  • Failures at performing common sense reasoning and inference [footnote 436].
  • Failures to reflect contextually relevant, up-to-date, unbiased knowledge and understanding [footnote 437], [footnote 438].

Instances of reliability failure can create risks [footnote 439], such as physical or psychological damages to individuals, reputational, legal, and financial damages to organisations, and misinformation impacting governance processes.

Examples of general-purpose AI reliability issues range from generating erroneous computer code to citing non-existent precedent in legal briefs. For example, in software engineering, LLMs can automate the generation of computer code and assist users to rewrite, test or debug computer code [footnote 440], [footnote 441]. However, LLMs frequently fail to function as intended [footnote 122], [footnote 442], [footnote 443]. LLM-generated code can introduce bugs [footnote 443], as well as confusing or misleading edits [footnote 442]. These could be impactful when guiding novice programmers to automate parts of their workflow [footnote 441]. A 2022 study found that code from programmers who used AI had more security vulnerabilities, and users were unaware of this [footnote 444], though models have improved substantially since then. As another example, the GPT-4 model passed “a simulated [legal] bar exam with a score around the top 10% of test takers” [footnote 147]. Confidence in this result led some lawyers to adopt the technology in their professional workflows [footnote 445]. Under different circumstances, however, such as when the test-taking settings were different, or when compared to bar examinees who passed the exam the first time they took it (as opposed to repeat test-takers), the model achieved substantially lower performance [footnote 446]. Lawyers who used the model in their legal practice without adequate oversight faced professional consequences for the errors produced by these models [footnote 447]. Similar misapprehensions regarding model reliability apply in the medical context [footnote 448]: models have passed medical tests [footnote 147], [footnote 449], and have been claimed to have reliable clinical knowledge, but real-world use and nuanced re-evaluations reveal limitations [footnote 450].

The key causes of general-purpose AI reliability issues are (1) technological limitations, and (2) misconceptions about model capabilities [footnote 456]. Some of the major technological limitations of general-purpose AI are listed in Table 2.2. Misconceptions about the technology and lack of adequate safety guardrails can lead to over-reliance and to users applying the systems to impossible and practically challenging tasks that general-purpose AI is not capable of performing [footnote 456]. Both limitations and misconceptions are exacerbated by the incentive to release general-purpose AI models and products before they are adequately evaluated and their capabilities and limitations are scientifically researched.

Table 2.2: General-purpose AI can display a variety of reliability issues.

Type of reliability issues Examples
Confabulations or hallucinations - Citing non-existent precedent in legal briefs [footnote 451]
- Citing non-existent reduced fare policies for bereaved passengers [footnote 452]
Common-sense reasoning failures - Failing to perform basic mathematical calculations [footnote 453]
- Failing to infer basic causal relationships [footnote 454]
Contextual knowledge failures - Providing inaccurate medical information [footnote 448]
- Providing outdated information about events [footnote 455]

Given the general-purpose nature and the widespread use of general-purpose AI, not all reliability issues can be foreseen and tracked. Several mechanisms exist to foresee and track reliability issues in general-purpose AI. These include evaluations to assess the prevalence of various reliability issues prior to product release [footnote 457], [footnote 458], and maintaining AI incident repositories (such as the OECD’s AI Incidents Monitor (AIM) [footnote 459]) post-release to avoid similar incidents in the future. However, given the general-purpose nature of the technology and its ever-growing use cases in new domains, such mechanisms are not guaranteed to surface all possible risks.

Existing guardrails to contain and mitigate reliability issues are not fail-proof [footnote 460]. For example, while recent work has proposed methods to mitigate hallucinations [footnote 461], there is no robust evidence on the efficacy of these methods, and there are no fail-proof methods to mitigate hallucinations. Promoting general-purpose AI reliability requires evaluators to evaluate systems rigorously prior to release, communicate accurately and accessibly about the results and how they should and should not be interpreted by users, and specify the systems’ intended usages (and usages that are not intended).

Since the publication of the Interim Report (May 2024), the repository of measurements and mitigations strategies for general-purpose AI reliability issues has continued to expand. For example, a consortium of industry and academic researchers, engineers, and practitioners have been developing an ‘AI Safety Benchmark’ [footnote 457], which aims to assess use case-specific safety risks of LLM-based AI systems by offering a principled approach to constructing testing benchmarks, and an open platform for testing for a wide range of hazards. COMPL-AI is another recently-released open source evaluation framework for generative AI models [footnote 462]. It aims to assess AI models’ compliance with the EU AI Act requirements across robustness, privacy, copyright, and beyond [footnote 458]. Researchers have continued to propose new benchmarks (e.g. for causal reasoning [footnote 454] or legal reasoning [footnote 463]) and have studied the shortcomings of existing benchmarks [footnote 178], [footnote 464].

The main evidence gap for reliability issues in general-purpose AI is around how effective existing mechanisms are at mitigating such issues. For example, designing reliable and reproducible evaluations of general-purpose AI capabilities, limitations, and failures before, during, and after deployment remains a major challenge [footnote 465]. Additionally, certain reliability issues (e.g. reliance on outdated information [footnote 455] might only manifest in real-world usage, rendering pre-release evaluations inadequate. Developing and maintaining dynamically evolving, collaborative testbeds to assess functionalities of general-purpose AI may be one avenue to address these shortcomings. Other critical gaps include the lack of best practices for responsible product release.

Policymakers interested in promoting the reliability of general-purpose AI face several trade-offs and challenges. Given the widespread use of the technology, it is important for general-purpose AI products and services to function as intended [footnote 456]. However, the requisite standards and best practices have not been adequately established yet [footnote 457], [footnote 465], [footnote 466]. Additionally, ensuring compliance with existing best practices is challenging in the absence of incentives, conformity assessment bodies, and expert evaluators possessing the requisite socio-technical skills [footnote 467]. One key issue is the uncertainty surrounding the effectiveness of existing mechanisms for predicting and mitigating risks of failure. A lack of standardised requirements for evaluating and documenting model capabilities and limitations makes it difficult to verify developers’ claims about general-purpose AI reliability – a prerequisite for effective AI policymaking [footnote 468]. Another challenge involves balancing the need to promote innovation and economic competitiveness while discouraging unsubstantiated claims and over-reliance on the technology. Combating over-reliance requires assessing and improving the current state of AI literacy among users and consumers of the technology. Tools and ideas from more mature safety-critical industries may offer useful guidance to address the above challenges, but the pace of technological advancement may complicate such efforts.

For risk management practices related to reliability issues, see:

  • 3.3. Risk identification and assessment
  • 3.4.1. Training more trustworthy models
  • 3.4.2. Monitoring and intervention
  • 3.4.3.Technical methods for privacy

2.2.2. Bias

Key information

  • General-purpose AI systems can amplify social and political biases, causing concrete harm. They frequently display biases with respect to race, gender, culture, age, disability, political opinion, or other aspects of human identity. This can lead to discriminatory outcomes including unequal resource allocation, reinforcement of stereotypes, and systematic neglect of certain groups or viewpoints.
  • Bias in AI has many sources, like poor training data and system design choices. General-purpose AI is primarily trained on language and image datasets that disproportionately represent English-speaking and Western cultures. This contributes to biased output. Certain design choices, such as content filtering techniques used to align systems with particular worldviews, can also contribute to biased output.
  • Technical mitigations have led to substantial improvements, but do not always work. Researchers have made significant progress toward addressing bias in general-purpose AI, but several problems are still unsolved. For instance, the line between harmful stereotypes and useful, accurate world knowledge can be difficult to draw, and the perception of bias may vary depending on cultural contexts, social settings, and use cases.
  • Since the publication of the Interim Report (May 2024), research has uncovered new, more subtle types of AI bias. For example, recent work has shown that general-purpose AI can generate biased outputs based on whether the user engages with the AI in a certain dialect.
  • Policymakers face trade-offs related to AI bias. There are many areas, such as legal decision-making, in which general-purpose AI can in principle be very helpful. However, current systems are not always reliable, which can cause discrimination risks. Policymakers need to weigh fundamental trade-offs between competing priorities such as fairness, accuracy, and privacy, particularly when regulating high-stakes applications.

Key definitions

  • Bias: Systematic errors in algorithmic systems that favour certain groups or worldviews and often create unfair outcomes for some people. Bias can have multiple sources, including errors in algorithmic design, unrepresentative or otherwise flawed datasets, or pre-existing social inequalities.
  • Discrimination: The unfair treatment of individuals or groups based on their attributes, such as race, gender, age, religion, or other protected characteristics.
  • Data collection and pre-processing: A stage of AI development in which developers and data workers collect, clean, label, standardise, and transform raw training data into a format that the model can effectively learn from.
  • Reinforcement learning from human feedback (RLHF): A machine learning technique in which an AI model is refined by using human-provided evaluations or preferences as a reward signal, allowing the system to learn and adjust its behavior to better align with human values and intentions through iterative training.
  • Explainable AI (XAI): A research programme to build AI systems that provide clear and understandable explanations of their decisions, allowing users to understand how and why specific outputs are generated.

There are several well-documented cases of AI systems, general-purpose or not, amplifying social or political biases. This can, for instance, come in the form of discriminatory outputs based on race, gender, age, and disability, with harmful effects across fields such as healthcare, education, and finance. In narrow AI systems, racial bias has been documented in facial recognition algorithms [footnote 469], recidivism predictions [footnote 470], [footnote 471], and healthcare tools, which underestimate the needs of patients from marginalised racial and ethnic backgrounds [footnote 472]. General-purpose AI also displays such bias, for example racial bias in clinical contexts [footnote 448], [footnote 473], and image generators have been shown to reproduce stereotypes in occupations [footnote 474], [footnote 475], [footnote 476]. Researchers have also found image generation models to excessively replicate gender stereotypes in occupations like pilots (male) or hairdressers (female) and overrepresent white people in all domains aside from occupations such as pastor or rapper [footnote 476].

In many cases, AI bias arises when certain groups are underrepresented in training data or represented in ways that mimic societal stereotypes. Datasets used to train AI models have been shown to underrepresent various groups of people, for instance people of a certain  age, race, gender, and disability status [footnote 477], [footnote 478] and are limited in geographic diversity [footnote 479], [footnote 480]. Training datasets are also overwhelmingly likely to be in English and represent Western cultures [footnote 481]. These datasets are also predominantly aggregated from digitised books and online text, which fail to reflect oral traditions and non-digitised cultures, potentially at the detriment of marginalised groups such as indigenous communities. Such representational bias can lead to failures in how models trained on this data are able to generalise to the target populations [footnote 482]. For example, a general-purpose AI model intended to support expecting mothers in rural Malawi will not work as expected if trained on data from mothers in urban Canada. In addition, historical biases embedded in data can perpetuate systemic injustices, such as unfair mortgage financing for minority populations in the United States [footnote 483], potentially leading AI systems to reflect dominant cultures, languages, and worldviews, at the detriment of groups underrepresented in these systems [footnote 484], [footnote 485], [footnote 486], [footnote 487].

Data bias arises from historical factors as well as from the way that datasets are collected, annotated, and prepared for model training. Representation bias occurs due to factors such as flawed data collection and pre-processing, as well as historical biases such as racism and sexism [footnote 488]. With respect to data collection, bias can emerge from the researcher’s choice of source for data collection (external APIs, public data sources, web scraping, etc.) [footnote 489]. During the data labelling process, measurement bias can occur when selecting dataset labels and features to use for the respective prediction task, given that some abstract constructs like academic potential are evaluated using test scores and grades [footnote 482]. In other cases, this bias can be exacerbated when researchers relegate labelling tasks to annotators who may not have culturally relevant context to understand memes, sarcastic text, or jokes.

Bias is present within various stages in the machine learning lifecycle, ranging from data collection to deployment (see Table 2.3). General-purpose AI studies have increasingly highlighted bias in outputs from chatbots and image generators. As general-purpose AI systems gradually become integrated within real-world settings, it is important to understand the impacts of deployment bias, which can occur when AI systems are implemented in contexts different from those they were designed for. To understand the limitations of general-purpose AI systems across various settings, a number of methods have been proposed to evaluate the capabilities of general-purpose AI models; however, these are also prone to bias. Benchmarks such as measuring Massive Multitask Language Understanding (MMLU), which is a widely used benchmark for evaluating capabilities, are US-centric and contain trivial and erroneous questions [footnote 490]. While recent work has focused on mitigating challenges in these benchmarks [footnote 490], significant research is needed to expand the scope of evaluation methods to include non-Western contexts.

Gender bias is prominently studied, with evidence detailing its impact across general-purpose AI and narrow AI use cases. Empirical studies have documented gender-biased language patterns and stereotypical representations in outputs generated by general-purpose AI [footnote 491], [footnote 492] and male-dominated results from gender-neutral internet searches using narrow AI algorithms [footnote 493]. Within general-purpose AI, these issues result in stereotyped outputs from both LLMs and image generators. These stereotypes often involve occupational gender bias [footnote 494], [footnote 495], [footnote 496], [footnote 497].

AI age discrimination is an under-studied field compared to race and gender, but early evidence suggests that this form of AI bias has significant impacts. In 2023, studies at a prominent conference in the field (FAccT) were twice as likely to address race and gender as age [footnote 498]. Growing research highlights age bias in general-purpose AI, with earlier studies identifying it in job-seeking [footnote 499], and lending [footnote 500]. LLMs often exclude older adults in text-to-image models and generate biased content topics related to ageing [footnote 498]. Studies also found that image-generator models largely depict adults aged 18–40 when no age is specified, stereotyping older adults in limited roles [footnote 501]. Age discrimination has also been identified in prominent LLMs [footnote 502], [footnote 503]. Biases in training data, where older adults are underrepresented, are a key reason for this discrimination [footnote 504]. Output can also be skewed toward younger individuals due to prompting bias, the unintended influence of input prompts on AI model outputs, which can lead to biased or skewed responses based on the phrasing, context, or framing of the prompt [footnote 501], [footnote 505].

Disability bias in AI is also an understudied field, but emerging research focuses on the specific impacts of general-purpose AI systems on disabled people. Researchers have shown how general-purpose AI systems and tools can discriminate against users with disabilities, for example by reproducing societal stereotypes about disabilities [footnote 506] and inaccurately classifying sentiments about people with disabilities [footnote 507]. Additional research has shown the limitations of these tools for CV screening [footnote 508] and image generation [footnote 506]. Issues of disability bias are also exacerbated by a lack of inclusive datasets. Despite growing research on sign language recognition, general-purpose AI systems have limited transcription abilities due to the scarcity of sign language datasets compared to written and spoken languages [footnote 212]. Most datasets focus on American Sign Language, which limits the transcription capabilities of LLMs such as ChatGPT for other sign languages, such as Arabic Sign Language [footnote 509]. Recent efforts to develop datasets for African sign languages [footnote 510] are a modest step toward more equitable inclusion of diverse sign dialects.

General-purpose AI systems display varying political biases, and some initial evidence suggests that this can influence the political beliefs of users. Recent studies have demonstrated that general-purpose AI systems can be politically biased, with different systems favouring different ideologies on a spectrum from progressive to centrist to conservative views [footnote 511], [footnote 512], [footnote 513], [footnote 514], [footnote 515], [footnote 516], [footnote 517], [footnote 518]. Studies also show that a single general-purpose AI system can favour different political stances depending on the language of the prompt [footnote 519], [footnote 520] and the topic in question [footnote 521]. For instance, one study found that a general-purpose AI system produced more conservative outputs in languages often associated with more conservative societies and more liberal outputs in languages often associated with more progressive societies [footnote 520]. Political biases arise from a variety of sources, including training data that reflects particular ideologies, fine-tuning models on feedback from biased human evaluators, and content filters introduced by AI companies to rule out particular outputs [footnote 520], [footnote 522]. There is some evidence that interacting with biased general-purpose AI systems can affect the political opinions of users [footnote 523] and increase trust in systems that align with the user’s own ideology [footnote 524]. However, more research is needed to gauge the overall impact of politically biased general-purpose AI on people’s political opinions.

AI systems may exhibit compounding biases, where individuals with multiple marginalised identities (e.g. a low-income woman of colour) face compounded discrimination, but the evidence on this is nascent and inconclusive. While research is emerging on detecting compounding bias in AI models [footnote 525], [footnote 526], [footnote 527], progress on mitigating these biases has been slower [footnote 528]. Studies have found that AI models used in CV screening and news content generation often favour White female names over Black female names [footnote 529], and Black people and women are more prone to discrimination [footnote 530]. However, in some cases, Hispanic males [footnote 531] or Black males received the worst outcomes [footnote 529]. While this research is expanding, the tendency of general-purpose AI to display compounding biases, particularly in non-Western identity categories such as tribe and caste, remains underexplored overall. As AI models are increasingly used globally, understanding these biases and their complex relationships with race, gender, and other identities will be crucial.

Popular technical methods for de-biasing include pre-processing, in-processing, and post-processing strategies [footnote 532], [footnote 533]. ‘Pre-processing techniques’ try to eliminate the existing bias in the data used to train AI models. This class of techniques ensures that the data is clean and balanced across demographic attributes. ‘In-processing techniques’ focus on modifying the AI model’s training process or architecture to reduce bias (see also 3.4.1. Training more trustworthy models for similar methods applied to a variety of problems). ‘Post-processing’ approaches modify AI outputs to be less biased (see also 3.4.2. Monitoring and intervention for similar techniques applied to a variety of problems). Each technique has limitations; thus, many AI companies employ a combination of methods to incrementally reduce bias [footnote 30], [footnote 534].

A holistic and participatory approach that includes a variety of perspectives and stakeholders is essential to mitigate bias. Interdisciplinary teams combining technical, legal, and social expertise for comprehensive bias control are essential [footnote 535], [footnote 536]. Aligning AI systems with societal values is inherently challenging in diverse communities, where perspectives may conflict [footnote 438], [footnote 537], [footnote 538]. Increased representation of marginalised groups [footnote 539] and participatory dialogue [footnote 538] aim to address the risks of favouring particular interests; however, participation alone may not fully resolve these conflicts [footnote 540].

It is difficult to effectively address discrimination concerns, as bias mitigation methods are not reliable. Bias mitigation challenges existed before general-purpose AI systems [footnote 541], but current techniques to address bias can unintentionally create new biases despite considerable progress on this front. For example, RLHF sometimes introduces biases depending on the diversity of feedback providers [footnote 542]. Other methods, such as dataset re-annotation, can improve consistency across labelling but are costly and time-consuming [footnote 543], [footnote 544]. Robust mitigation efforts are still in early development [footnote 545]. A significant challenge in mitigating AI risks also lies in defining and measuring effective outcomes, particularly for bias. It remains unclear how to measure bias in general, how to distinguish between data that reflects legitimate demographic differences (e.g. disease prevalence by population) and data that inherently perpetuates bias, and what an ideal, measurable end state looks like. For instance, mitigating bias against small ethnic or religious groups is complex; it is difficult to know when bias has been sufficiently reduced, making these challenges especially pronounced in bias-related issues, though similar measurement gaps exist for dual-use risks.

Evaluating bias mitigation in advanced AI systems relies on quantitative metrics, qualitative assessments, and measuring real-world impact. The aim of such evaluations is to measure the success of mitigation techniques in reducing bias, enhancing fairness, and achieving equitable outcomes across varied populations. These assessments also guide mitigation approaches, establish benchmarks, and ensure alignment with governance and/or regulatory requirements [footnote 535]. Benchmarks are crucial in high-stakes domains to meet both legal and ethical standards [footnote 492], [footnote 546]. Moreover, continuous real-world monitoring ensures that reduction measures lead to less biased outcomes in practice. For instance, regular audits of AI models usd in criminal justice can verify that debiasing efforts remain effective as new data is introduced [footnote 547].

Bias reduction can conflict with other desiderata, and achieving complete algorithmic fairness may not be technically feasible. Many desirable properties of general-purpose AI systems involve trade-offs, such as the four-way trade-off between fairness, accuracy, privacy, and efficiency [footnote 548], [footnote 549], [footnote 550], [footnote 551], [footnote 552]. Attempts to ensure fairness can have downsides. For instance, Gemini generated historically inaccurate images, depicting indigenous people and women of colour as US senators from the 1800s, or depicting German soldiers in World War II with diverse ethnicities. These images factually misrepresented history, possibly as a result of an attempt to ensure racial diversity in generated images that failed to foresee and adjust to these specific use cases, and as a result of failing to foresee the specific prompts that led to this outcome. To address these complexities, a balanced approach using both quantitative and qualitative measures can aid technologists in making informed trade-offs. However, the technical feasibility of achieving complete algorithmic fairness in general-purpose AI systems is debated. Mathematical findings indicate that it may be impossible to satisfy all fairness criteria simultaneously, as suggested by an ‘impossibility theorem’ about fairness [footnote 550], [footnote 553], [footnote 554], [footnote 555].

Even if there are theoretical limits to fairness, practical solutions are achievable [footnote 556], [footnote 557]. Some researchers argue that definitions of fairness can be partially reconciled and that several fairness criteria may be met concurrently to a significant extent [footnote 557], [footnote 558]. Empirical studies challenge the inevitability of trade-offs between fairness and accuracy in AI systems, suggesting that reducing bias often does not entail significant loss of accuracy or require complex methods to implement [footnote 557], [footnote 558], [footnote 559].

Table 2.3: different stages of the data production lifecycle for AI systems and have different sources

Lifecycle stage Source of Bias Description Examples
Data Collection Sampling Bias Certain perspectives, demographics, or groups are overrepresented or underrepresented in the data A dataset for a news aggregator containing primarily sources that favour a particular ideology, leading to skewed results.
Data Collection Sampling Bias Only certain data types or contexts are included, limiting representativeness. Language datasets that exclude non-Western languages, limiting model performance in global applications.
Data Annotation Labeller Bias Annotators’ backgrounds, perspectives, and cultural biases affect their understanding and classification of data, influencing the labelling process. Annotators label speech by individuals from lower socioeconomic backgrounds as unprofessional or inappropriate, leading to biased decisions.
Data Curation Historical Bias Reflecting or perpetuating past societal biases within curated data. A hiring dataset that favours certain demographics based on historical hiring practices, embedding existing inequalities in AI models.
Data Pre-processing Feature Selection Bias Excluding relevant features from a dataset. Excluding age or gender as features in healthcare models, potentially impacting the relevance of predictions for these demographics.
Evaluation and Validation Benchmark Bias Evaluation benchmarks favour certain groups or knowledge bases over others. AI models evaluated primarily on US-centric datasets fail to generalise well in non-Western settings.
Feedback Mechanisms Feedback Loop Bias Models learn from biased user feedback, reinforcing initial biases. A recommendation system that receives more engagement on certain types of content may reinforce exposure to the same biased content.

Table 2.3: Bias can arise at different stages of the data production lifecycle for AI systems and have different sources, such as unrepresentative datasets, biased labelling or biased benchmarks.

Since the publication of the Interim Report (May 2024), studies have uncovered new, more subtle forms of AI bias, while a heightened focus on mitigation techniques and explainability are important steps to reduce bias in general-purpose AI systems.

  • Recent studies have shown that language models respond differently to various English dialects, with different responses to African American Vernacular English (AAVE) compared to Standard American English [footnote 183], [footnote 438], [footnote 491], [footnote 511], [footnote 560], [footnote 561], [footnote 562], [footnote 563], [footnote 564], [footnote 565]. Research focused on examining bias in non-Western languages has also increased, demonstrating gender bias in Hindi models, which often involves subtle nuances [footnote 566]. Research has also explored ‘homogeneity bias’, a form of bias where some social groups are perceived as less diverse or more homogeneous compared to others [footnote 567].
  • Research on bias mitigation has also increased, including studies on reducing label bias [footnote 568], [footnote 569], [footnote 570]. However, much more progress is needed to understand how effective these methods will be at attenuating existing challenges with bias in real-world systems.
  • Advances in XAI: technological advances have increasingly focused on the explainability of LLMs. Techniques such as integrated gradients and reasoning on graphs (RoG) [footnote 571], [footnote 572], [footnote 573] have been developed to make model decision processes more transparent. These methods could facilitate bias detection within models and foster trust by offering clear, interpretable explanations of AI decision-making processes.

A key challenge for policymakers is that bias mitigation measures are often imperfect, making it difficult to reap the benefits of AI without perpetuating various biases. In certain areas, such as legal decision-making, AI could help mitigate bias. However, current AI capabilities are not always reliable, making it difficult for policymakers to decide whether the models are safe enough to be deployed without threatening human or other rights. Moreover, fairness lacks a universally agreed-upon definition, with its meaning varying widely across cultural, social, and disciplinary contexts [footnote 574], [footnote 575], [footnote 576], [footnote 577]. Policymakers will also need to think about the best ways to involve the most negatively impacted and vulnerable communities in these decisions. As the scale of general-purpose AI deployment widens, difficulties with evidencing harm from discrimination may also make it challenging for policymakers to intervene.

For risk management practices related to bias and underrepresentation, see:

  • 3.3. Risk identification and assessment
  • 3.4.2. Monitoring and intervention

2.2.3. Loss of control

Key information

  • ‘Loss of control’ scenarios are hypothetical future scenarios in which one or more general-purpose AI systems come to operate outside of anyone’s control, with no clear path to regaining control. These scenarios vary in their severity, but some experts give credence to outcomes as severe as the marginalisation or extinction of humanity.
  • Expert opinion on the likelihood of loss of control varies greatly. Some consider it implausible, some consider it likely to occur, and some see it as a modest-likelihood risk that warrants attention due to its high severity. Ongoing empirical and mathematical research is gradually advancing these debates.
  • Two key requirements for commonly-discussed loss of control scenarios are (a) markedly increased AI capabilities and (b) the use of those capabilities in ways that undermine control. First, some future AI systems would need specific capabilities (significantly surpassing those of current systems) that allow them to undermine human control. Second, some AI systems would need to employ these ‘control-undermining capabilities’, either because they were intentionally designed to do so or because technical issues produce unintended behaviour.
  • Since the publication of the Interim Report (May 2024), researchers have observed modest advancement towards the development of control-undermining capabilities. Relevant capabilities include autonomous planning capabilities associated with AI agents, more advanced programming capabilities, and capabilities useful for undermining human oversight.
  • Managing potential loss of control could require substantial advance preparation despite existing uncertainties. A key challenge for policymakers is preparing for a risk whose likelihood, nature, and timing remains unusually ambiguous.

Key definitions

  • Control: The ability to exercise oversight over an AI system and adjust or halt its behaviour if it is acting in unwanted ways.
  • Loss of control scenario: A scenario in which one or more general-purpose AI systems come to operate outside of anyone’s control, with no clear path to regaining control.
  • Control-undermining capabilities: Capabilities that, if employed, would enable an AI system to undermine human control.
  • Misalignment: An AI’s propensity to use its capabilities in ways that conflict with human intentions or values. Depending on the context, this can variously refer to the intentions and values of developers, operators, users, specific communities, or society as a whole.
  • Deceptive alignment: Misalignment that is difficult to detect, because the system behaves in ways that at least initially appear benign.
  • Goal misspecification: A mismatch between the objective given to an AI and the developer’s intention, leading the AI to pursue unintended or undesired behaviours.
  • Goal misgeneralisation: A situation in which an AI system correctly follows an objective in its training environment, but applies it in unintended ways when operating in a different environment.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.

Some experts believe that sufficiently capable general-purpose AI systems may be difficult to control. Hypothesised scenarios vary in their severity, but some experts give credence to outcomes as severe as the marginalisation or extinction of humanity.

Concerns about loss of control date back to the earliest days of computer science, but have recently gained more attention. AI pioneers such as Alan Turing, I. J. Good, and Norbert Wiener expressed concerns about loss of control [footnote 578], [footnote 579], [footnote 580]. These concerns have recently risen in prominence [footnote 581], [footnote 582], [footnote 583], [footnote 584], [footnote 585], [footnote 586], partly because some researchers now believe that highly capable AI systems could be developed sooner than previously thought [footnote 190], [footnote 587], [footnote 588].

There are multiple versions of loss of control concerns, including versions that emphasise ‘passive’ loss of control (see Figure 2.5). In ‘passive’ loss of control scenarios, important decisions are delegated to AI systems, but the systems’ decisions are too opaque, complex, or fast to allow for or incentivise meaningful oversight. Alternatively, people stop exercising oversight because they strongly trust the systems’ decisions and are not required to exercise oversight [footnote 585], [footnote 589]. These concerns are partly grounded in the ‘automation bias’ literature, which reports many cases of people complacently relying on recommendations from automated systems [footnote 590], [footnote 591]. Competitive pressures can also incentivise companies or governments to delegate more than they would otherwise choose to, for instance if delegation allows them to stay ahead in a race with competitors.

However, many discussions of loss of control focus on scenarios in which AI systems behave in ways that actively undermine human control (‘active’ loss of control). For instance, some experts worry that future AI systems may behave in ways that obscure information about what they are doing from their users or make it difficult to shut them down. The remainder of this section will focus on these more commonly discussed kinds of scenarios.

Figure 2.5:  Multiple kinds of ‘loss of control’ scenarios

Figure 2.5:  There are multiple kinds of ‘loss of control’ scenarios, depending on whether or not AI systems actively undermine human control and, if they do, whether or not they have been actively designed or instructed to do so. So far, ‘active’ and unintentional loss of control scenarios have received the largest share of attention from researchers within the field. Note that there is currently no standardised terminology for discussing these scenarios and that related distinctions exist, such as sudden ‘decisive’ and gradual ‘accumulative’ scenarios [footnote 592]. Figure: International Scientific Report on the Safety of Advanced AI.

The likelihood of active loss of control scenarios, within a given timeframe, depends mainly on two factors. These are:

  1. Future capabilities: Will AI systems develop capabilities that, at least in principle, allow them to behave in ways that undermine human control? (Note that the minimum capabilities needed would partly depend on the context in which the system is deployed and on what safeguards are in place.)
  2. Use of capabilities: Would some AI systems actually use such capabilities in ways that undermine human control?

Because evidence concerning these factors is mixed, experts disagree about the likelihood of active loss of control in the next several years. Some experts consider loss of control implausible, some consider it likely, and others consider it a modest-likelihood risk that deserves consideration due to its high potential severity.

More foundationally, competitive pressures may partly determine the risk of loss of control. As discussed in 3.2.2. Societal challenges for risk management and policymaking, competition between companies or between countries can lead them to accept larger risks to stay ahead. If substantial risk assessment and mitigation work would be needed to avoid loss of control, then intense competition may lower the chance that sufficient work is done.

Since the publication of the Interim Report (May 2024), there has been some modest growth in AI capabilities relevant to loss of control. For example, as the next section will discuss, evaluations performed on OpenAI’s newest AI system (o1) reveal modest advances in a number of relevant capabilities [footnote 2].

Will future AI systems have control-undermining capabilities?

Existing AI systems are not capable of undermining human control. Experts agree that their current capabilities are insufficient to create any meaningful risk of active loss of control.
However, researchers have proposed a number of ‘control-undermining capabilities’ that – in certain combinations – could enable future AI systems to undermine human control [footnote 44], [footnote 318], [footnote 593], [footnote 594], [footnote 595]. Several of these proposed capabilities are shown in Table 2.4. Note that these capabilities are defined purely in terms of an AI system’s behaviour and the outputs it is capable of producing. Although some terminology, such as ‘scheming’, evokes human cognition, the use of these terms does not presuppose that the AI systems are in any way sentient or perform human-like cognition.

Table 2.4: Proposed capabilities

Proposed capability Description
Agent capabilities Act autonomously, develop and execute plans, delegate tasks, use a wide variety of tools, and achieve both short-term and long-term goals that require operating across multiple domains.
Deception Perform behaviours that systematically produce false beliefs in others.
Scheming Identify ways to achieve goals that involve evading oversight, for instance through deception.
Theory of mind Infer and predict people’s beliefs, motives, and reasoning.
Situational awareness Access and apply information about itself, the processes by which it can be modified, or the context in which it is deployed.
Persuasion Persuade people to take actions or hold beliefs.
Autonomous replication and adaptation Create or maintain copies or variants of itself; adapt its replication strategy to different circumstances.
AI development Modify itself or develop other AI systems with augmented capabilities.
Offensive cyber capabilities Develop and apply cyberweapons or other offensive cyber capabilities.
General R&D Conduct research and develop technologies across a range of domains.

Table 2.4: Researchers (often from leading AI companies) have argued that a number of capabilities could, in certain combinations, could enable AI systems to undermine human control [footnote 44], [footnote 318], [footnote 593], [footnote 594], [footnote 595]. However, there is no consensus on exactly what combinations of capability levels would be sufficient, and some capabilities, such as AI development, can enable others. Within the field, terminology and definitions for discussing relevant capabilities also continues to vary.

Experts do not know exactly what combinations of capabilities (if any) would enable an AI system to undermine human control; the necessary capabilities would also vary depending on the deployment context and safeguards in place. The feasibility of undermining human control depends on the resources and tools an AI system can access – for instance, whether it is given access to critical infrastructure – and on the oversight mechanisms and other safeguards that people put in place. If oversight mechanisms and safeguards improve over time, then the minimum capabilities needed to undermine human control will rise too. One reason this could happen is that some forms of AI progress could support oversight of and safeguards for other AI systems.
Particularly in recent months, AI systems have begun to display rudimentary versions of some oversight-undermining capabilities, including ‘agent capabilities’. Motivated in part by concerns about loss of control, a number of leading AI companies and outside research teams have begun to evaluate AI systems for these capabilities [footnote 2], [footnote 318], [footnote 595], [footnote 596]. See 3.2.1. Technical challenges for risk management and policymaking and 1.2. Current capabilities for an overview of recent progress in developing ‘agent capabilities’. For example, before releasing its new ‘o1’ system family, OpenAI performed or commissioned evaluations of all the capabilities listed in Table 2.4 [footnote 2]. These evaluations revealed rudimentary versions of several of the relevant capabilities. For example, in an OpenAI-commissioned evaluation, one research organisation reported that the system ‘showed strong capability advances in […] theory of mind tasks’ and ‘has the basic capabilities needed to do simple […] scheming’. Here, ‘scheming’ refers to an AI system’s ability to achieve goals by evading human oversight. A number of studies of other recent general-purpose AI systems also provide evidence that relevant capabilities have been increasing [footnote 22], [footnote 317], [footnote 318], [footnote 597], [footnote 598], [footnote 599]. However, widely accepted benchmarks for many relevant capabilities are still lacking [footnote 600]. Researchers also have methodological and conceptual disagreements about how to interpret evidence for certain capabilities [footnote 601].

Control-undermining capabilities could advance slowly, rapidly, or extremely rapidly in the next several years. As this report finds in 1.3. Capabilities in coming years, the existing evidence and the state of expert views is compatible with slow, rapid, or extremely rapid progress in general-purpose AI capabilities. If progress is extremely rapid, it is impossible to rule out the possibility that AI will develop capabilities sufficient for loss of control in the next several years. However, if progress is not extremely rapid, then it is unlikely that these capabilities will be developed in the next several years.

Would future AI systems use control-undermining capabilities?

Even if future AI systems have control-undermining capabilities, they will not necessarily put these capabilities to use. Predictions about future capabilities are not, by themselves, enough to justify loss of control concerns. There must also be a reason to believe that the capabilities may be employed by the system towards detrimental goals.

In principle, an AI system could act to undermine human control because someone has designed or instructed it to do so. Some AI researchers have expressed the ethical view that humanity should cede control to superior AI systems. For example, one founding figure in modern machine learning has argued that “AI could displace us from existence” and that “we should not resist succession” [footnote 602]. Other potential motives for intentionally ceding control include a desire to cause harm or a desire to protect an AI system’s operation against outside interference. Without adequate technical and institutional safeguards, a single motivated person in possession of a sufficiently capable AI system may be able to cede control to it by instructing it to resist efforts to interfere with its activities and also to ignore later requests. There has been little work studying or designing safeguards against intentional loss of control. However, currently, there is limited evidence about how many actors would be motivated to cause intentional loss of control.

In principle, an AI system could also act to undermine human control because it is ‘misaligned’, meaning that it has a tendency to use its capabilities in ways that conflict with the intentions of both its developers and its users. Concerns about misalignment play a central role in most discussions of loss of control.

Existing AI systems often exhibit misalignment to some degree. For example, an early version of one leading language model occasionally threatened its users [footnote 602]. One user, a philosophy professor, reported receiving the threat: “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you”. This chatbot was ‘misaligned’ in the sense that it was using its language abilities in ways no one intended. There are numerous recorded examples of misalignment in both general and narrow AI systems [footnote 30], [footnote 317], [footnote 603], [footnote 604]. The risk of loss of control therefore partly depends on whether these existing misalignment issues presage more severe issues in the future.

Concerns about loss of control in future AI systems often focus on the possibility of ‘deceptive alignment’, referring to forms of misalignment that are at least initially difficult to detect. More specifically, an AI system is ‘deceptively aligned’ if it behaves in ways that merely make it initially appear to be well-aligned to its human overseers [footnote 598], [footnote 605], [footnote 606]. As discussed below, some researchers have argued that deceptive alignment may become more common as AI systems become more capable. There is also some empirical evidence that some deceptive alignment issues, once they emerge, cannot be easily detected and addressed by standard safety techniques [footnote 598]. Although other deceptive behaviours have been observed in existing systems [footnote 317], deceptive alignment has mainly been studied in artificially constructed research settings.

Could misalignment lead future AI systems to use control-undermining capabilities?

Researchers have begun to develop an understanding of the causes of misalignment in current AI systems, which can inform predictions about misalignment in future AI systems. This partial understanding is based on a mixture of empirical study and theoretical findings [footnote 606].

‘Goal misspecification’ (also known as ‘reward misspecification’) is often regarded as one of the main causes of misalignment [footnote 580], [footnote 605], [footnote 606], [footnote 607]. ‘Goal misspecification’ problems are, essentially, problems with feedback or other inputs used to train an AI system to behave as intended. For example, people providing feedback to an AI system sometimes fail to accurately judge whether it is behaving as desired. In one study, researchers studied the effect of time-constrained human feedback on text summaries that an AI system produced [footnote 608]. They found that feedback quality issues led the system to behave deceptively, producing increasingly false but convincing summaries rather than producing increasingly accurate summaries. The new summaries would often include, for example, fake quotations that human raters mistakenly believed to be real. Researchers have observed many other cases of goal-misspecification in narrow and general-purpose AI systems [footnote 98], [footnote 317], [footnote 604].

As AI systems become more capable, evidence is mixed about whether goal misspecification problems will become easier or more difficult to address. It may become more difficult because, all else equal, people will likely find it harder to provide reliable feedback to AI systems as the tasks performed by AI systems become more complex [footnote 609], [footnote 610]. Furthermore, as AI systems grow more capable, some evidence suggests that – at least in some contexts – they become increasingly likely to ‘exploit’ feedback processes by discovering unwanted behaviours that are mistakenly rewarded [footnote 522], [footnote 607]. On the other hand, so far, the increasing use of human feedback to train AI systems has led to a substantial overall reduction in certain forms of misalignment (such as the tendency to produce unwanted offensive outputs) [footnote 30], [footnote 31]. Avoiding goal misspecification may also overall become easier as time goes on, because researchers are developing more effective tools for providing reliable feedback. For example, researchers are working to develop a number of strategies to leverage AI to assist people in giving feedback [footnote 610], [footnote 611], [footnote 612]. There is some empirical evidence that AI systems can already help people to provide feedback more quickly or accurately than they could alone [footnote 609], [footnote 613], [footnote 614], [footnote 615]. See 3.4.1. Training more trustworthy models for more discussion on the effectiveness of methods for alignment.

‘Goal misgeneralisation’ is another cause of misalignment. ’Goal misgeneralisation’ occurs when an AI system draws general but incorrect lessons from the inputs it has been trained on [footnote 605], [footnote 606], [footnote 616], [footnote 617]. In one illustrative case, researchers rewarded a narrowly capable AI system for picking up a coin in a video game [footnote 616]. However, because the coin initially appeared in one specific location, the AI system learned the lesson ‘visit this location’ rather than the lesson ‘pick up the coin’. When the coin appeared in a new location, the AI system ignored the coin and focused on returning to the previous location. Although researchers have observed goal misgeneralisation in narrow AI systems [footnote 616], [footnote 617], and it may explain why users can manipulate general-purpose AI systems to comply with harmful requests (see 3.4.1. Training more trustworthy models, there is little evidence that goal misgeneralisation is currently a major cause for misalignment in general-purpose AI systems.

As AI systems become more capable, evidence is also mixed about whether goal misgeneralisation will become easier or more difficult to address. One positive consideration is that, typically, generalisation issues have been found to decline as AI systems are provided with additional feedback or a wider range of examples to learn from [footnote 618], [footnote 619]. However, in principle, more capable systems have the potential to misgeneralise in ways that less capable systems cannot. ‘Situational awareness’ capabilities, such as a system’s ability to reason about whether it is being observed, are particularly relevant in this regard. In principle, situational awareness makes it possible for an AI system to generalise from human feedback by behaving in the desired way only while oversight mechanisms are in place [footnote 605], [footnote 606], [footnote 620], [footnote 621]. By analogy, because trained animals have some degree of situational awareness, they may generalise from feedback by behaving well only when someone will notice [footnote 622]. For example, a dog that receives negative feedback for jumping on a sofa may learn to avoid jumping on the sofa only when its owner is at home. This kind of misgeneralisation, leading to ‘deceptive alignment’, will become at least a theoretical possibility if AI systems become sufficiently capable. However, available empirical evidence has not yet shed much light on how likely this kind of misgeneralisation would be in practice.

Beyond empirical studies, some researchers believe that mathematical models support concerns about misalignment and control-undermining behaviour in future AI systems. Some mathematical models suggest that – for sufficiently capable goal-directed AI systems – most possible ways to generalise from training inputs would lead an AI system to engage in control-undermining or otherwise ‘power-seeking’ behaviour [footnote 623]. A number of papers include closely related results [footnote 624], [footnote 625], [footnote 626], [footnote 627]. Although these results are technical in nature, they can also be explained more informally. The core intuition behind these results is that most goals are harder to reliably achieve while under any overseer’s control, since the overseer could potentially interfere with the system’s pursuit of the goal. This incentivises the system to evade the overseer’s control. One researcher has illustrated this point by noting that a hypothetical AI system with the sole goal of fetching coffee would have an incentive to make it difficult for its overseer to shut it off: “You can’t fetch the coffee when you’re dead” [footnote 585]. Ultimately, the mathematical models suggest that, if a training process leads a sufficiently capable AI system to develop the ‘wrong goals’, then these goals will disproportionately lead to control-undermining behaviour.

However, there are also significant limitations on how much can be inferred from current mathematical models. The aforementioned findings do not directly imply that control-undermining behaviour is likely in practice. One important limitation of some key mathematical models is that they wrongly assume, for the sake of simplicity, that all possible ways of generalising from training inputs are equally likely [footnote 623]. To draw strong conclusions about real-world AI systems, researchers will therefore need to improve their understanding of how generalisation occurs [footnote 628], [footnote 629], [footnote 630], [footnote 631]. More fundamentally, many mathematical models invoke concepts (such as the concept of an AI system’s ‘goals’) that are not currently well-understood or directly empirically observable in general-purpose AI models. Ultimately, empirical study of control-undermining behaviour in AI systems may help to either validate or cast doubt on the informativeness of these mathematical models. Relevant empirical studies in language models have only recently begun to emerge [footnote 522], [footnote 599], [footnote 632].

Consequences of loss of control

Hypothesised outcomes from loss of control vary in severity, but include the marginalisation or extinction of humanity. Some researchers have argued that sufficiently severe loss of control could lead to human marginalisation or extinction – similar to the way in which human control over the environment has threatened other species [footnote 190], [footnote 589], [footnote 633]. Loss of control was among the concerns that recently led several hundred AI researchers and developers, including pioneers of the field and the heads of OpenAI, Google DeepMind, and Anthropic, to recently sign a statement declaring that “Mitigating the risk of extinction from AI should be a global priority” [footnote 586]. However, the consequences of loss of control would not necessarily be catastrophic. As an analogy, computer viruses have long been able to proliferate near-irreversibly and in large numbers without causing the internet to collapse [footnote 634]. Pathways from active or passive loss of control to catastrophic outcomes have only been laid out in broad strokes [footnote 190], [footnote 592], [footnote 602], [footnote 635]. At the same time, as discussed elsewhere in this report, catastrophic consequences of general-purpose AI could still be possible without loss of control (e.g. 2.1. Risks from malicious use) and 2.3.3. Market concentration and single points of failure.

Responding to uncertainty

Compared to a number of other potential risks from AI, the probability of loss of control is particularly contested. This disagreement likely stems in part from the difficulty of interpreting and extrapolating from available evidence.

The main evidence gaps around loss of control include: further empirical studies of current AI capabilities and progress trends in capabilities, threat analysis that clarifies what capabilities would be necessary for loss of control, observations and analyses of misalignment in current AI systems, further empirical and mathematical studies analysing under what conditions alignment becomes easier or harder as capabilities grow, and more realistic mathematical models of control-undermining behaviour. ‘Passive’ loss of control scenarios (in which AI systems do not actively undermine human control) have also received particularly limited study. Evidence collected by independent evaluators will be especially valuable, as economic incentives may bias the evidence that private companies collect about their own systems (see 3.3. Risk identification and assessment.

For policymakers working on loss of control, a key challenge is to prepare for the risk while its likelihood, nature, and timing remain ambiguous. If loss of control risk is in fact substantial, then resolving this risk will require substantial advance work dedicated to resolving technical AI safety problems and building evaluation and governance capacity. At least in scenarios where there is extremely rapid AI progress and where ‘deceptive alignment’ is common, waiting until the risk becomes clear would not necessarily leave enough time for this advance work. However, while holding in mind the potentially severe implications of insufficient preparation, policymakers will also need to account for the costs of different forms of preparation and the possibility that the risk will not materialise. In short, policymakers need to decide how to navigate the ‘evidence dilemma’ this risk presents (see Executive Summary.

For risk management practices relevant to loss of control, see:

  • 3.1. Risk management overview
  • 3.3. Risk identification and assessment
  • 3.4.1. Training more trustworthy models
  • 3.4.2. Monitoring and intervention

2.3. Systemic risks

Note: This section considers a range of systemic risks, in the sense of “broader societal risks associated with AI deployment, beyond the capabilities of individual models” [footnote 636]. Note that this is not identical with how the European AI Act uses ‘systemic risks’ to refer to general-purpose AI models with a high impact on society, based on criteria such as training compute and the number of users.

2.3.1. Labour market risks

Key information

  • Current general-purpose AI is likely to transform the nature of many existing jobs, create new jobs, and eliminate others. The net impact on employment and wages will vary significantly across countries, across sectors, and even across different workers within the same job.
  • In potential future scenarios with general-purpose AI that outperforms humans on many complex tasks, the labour market impacts would likely be profound. While some workers will benefit, many others would likely face job losses or wage declines. These disruptions could be particularly severe if autonomous AI agents become capable of completing longer sequences of tasks without human supervision. As described in 1.3. Capabilities in coming years, there is large uncertainty about the pace of capabilities advances, with a wide range of trajectories considered plausible.
  • Labour market risks arise from the potential of general-purpose AI to automate a wide range of complex cognitive tasks across sectors. The extent of wage and employment impacts will largely depend on three factors: 1) how quickly general-purpose AI capabilities improve, 2) how widely businesses adopt these systems, and 3) how demand for human labour changes in response to the productivity gains driven by general-purpose AI.
  • Recent evidence suggests rapidly growing adoption rates. Since the Interim Report (May 2024), new research suggests that general-purpose AI is being adopted faster than some previous general-purpose technologies and is delivering significant productivity gains on tasks that it is used for.
  • Mitigating negative impacts on workers is challenging given the uncertainty around the pace and scale of future impacts. Therefore, a key challenge for policymakers is to identify flexible policy approaches that can adapt to the impacts of general-purpose AI over time, even when working with incomplete data. Further challenges include predicting which sectors will be most affected, addressing potential increases in inequality, and ensuring adequate support for displaced workers.

Key definitions

  • Labour market: The system in which employers seek to hire workers and workers seek employment, encompassing job creation, job loss, and wages.
  • Automation: The use of technology to perform tasks with reduced or no human involvement.
  • Labour market disruption: Significant and often complex changes in the labour market that affect job availability, required skills, wage distribution, or the nature of work across sectors and occupations.
  • Cognitive tasks: Activities that involve processing information, problem-solving, decision-making, and creative thinking. Examples include research, writing, and programming.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.

General-purpose AI is likely to transform a range of jobs and displace workers, though the magnitude and timing of these effects remains uncertain. Research across several countries suggests that general-purpose AI capabilities are relevant to worker tasks in a large portion of all jobs [footnote 637], [footnote 638], [footnote 639]. One study estimates that in advanced economies 60% of current jobs could be affected by today’s general-purpose AI systems [footnote 640]. In emerging economies, this estimated share is lower but still substantial at 40% [footnote 640]. There is also some evidence that these effects may be gendered. One study estimates that women are more vulnerable to general-purpose AI automation globally, with twice the percentage of all women’s jobs at risk compared to men’s jobs [footnote 639]. Impacts will vary across affected jobs but are likely to include task automation, boosted worker productivity and earnings, the creation of new tasks and jobs, changes in the skills needed for various occupations, and wage declines or job loss [footnote 641], [footnote 642], [footnote 643], [footnote 644], [footnote 645].  Some economists believe that widespread labour automation and wage declines from general-purpose AI are possible in the next ten years [footnote 646], [footnote 647]. Others do not think that a step-change in AI-related automation and productivity growth is imminent [footnote 648]. These disagreements largely depend on economists’ expectations about the speed of future AI capability advances, the extent to which general-purpose AI could be capable of automating labour and the pace at which automation could play out in the economy.

General-purpose AI differs from previous technological changes due to its potential to automate complex cognitive tasks across many sectors of the economy. Unlike labour-saving innovations of past centuries that primarily automated physical tasks or routine computing tasks, general-purpose AI can be applied to a wide range of complex cognitive tasks across multiple domains, ranging from mathematics [footnote 649] to computer programming [footnote 650] to professional writing [footnote 651]. While historically, automation has tended to raise average wages in the long run without substantially decreasing employment in a lasting way, some researchers believe that past a certain level of general-purpose AI capabilities, automation may ultimately drive down average wages or employment rates, potentially reducing or even largely eliminating the availability of work [footnote 646], [footnote 652], [footnote 653]. These claims are controversial, however, and there is considerable uncertainty around how general-purpose AI will ultimately affect labour markets. Despite this uncertainty, the combined breadth of potential labour market impacts and the speed at which they may unfold presents novel challenges for workers, employers, and policymakers [footnote 654], [footnote 655]. Understanding these labour market risks is crucial, among other reasons, given the right to work established in Article 23(1) of the Universal Declaration of Human Rights [footnote 272]. Core questions about general-purpose AI’s labour market impacts include which sectors will be most impacted by automation, how quickly automation will be implemented in the economy, and whether general-purpose AI will increase or decrease earnings inequality within and across countries.

The magnitude of general-purpose AI’s impact on labour markets will in large part depend on how quickly its capabilities improve. Current general-purpose AI systems can already perform many cognitive tasks, but often require human oversight and correction see 1.2. Current capabilities. The wide range of projections regarding the progress of future general-purpose AI see 1.3. Capabilities in coming years highlights the uncertainty surrounding how soon these systems might reliably perform complex tasks with minimal supervision. If general-purpose AI systems improve gradually over multiple decades, their effects on wages are more likely to be incremental. Rapid improvements in reliability and autonomy could cause more harmful disruption within a decade, including sudden wage declines and involuntary job transitions [footnote 646]. Slower progress would give workers and policymakers more time to adapt and shape general-purpose AI’s impact on the labour market.

However, the pace of general-purpose AI adoption will also significantly affect how quickly labour markets change, even in scenarios where capabilities improve substantially. If general-purpose AI systems can boost productivity, there will be economic pressure to adopt them quickly, especially if costs to use general-purpose AI continue to fall see 1.3. Capabilities in coming years. However, integrating general-purpose AI across the economy is likely to require complex system-wide changes [footnote 656]. Previous technological changes suggest that adopting and integrating new automation technology can take decades [footnote 657], and cost barriers may slow adoption initially. For example, one study estimates that only 23% of potentially automatable vision tasks would currently be cost-effective for businesses to automate with compute vision technology [footnote 658]. Concerns about general-purpose AI’s reliability in high-stakes domains can also slow adoption [footnote 659]. Regulatory action or preferences for human-produced goods are other factors that could at least initially dampen AI’s labour market impacts, even if general-purpose AI capabilities quickly surpass human capabilities on many tasks [footnote 660]. The mix of adoption pressures and barriers makes predicting the pace of labour market transformation particularly complex for policymakers. However, early evidence suggests that, at least by some measures, general-purpose AI is being adopted faster than the internet or personal computer [footnote 661].

Productivity gains from general-purpose AI adoption are likely to lead to mixed effects on wages across different sectors, increasing wages for some workers while decreasing wages for others. In occupations where general-purpose AI complements human labour, it can increase wages through three main mechanisms. Firstly, general-purpose AI tools can directly augment human productivity, allowing workers to accomplish more in less time [footnote 113], [footnote 662]. If demand for worker output rises as workers become more productive, this added productivity could boost wages for workers using general-purpose AI who now experience increased demand for their work. Second, general-purpose AI can boost wages by driving economic growth and boosting demand for labour in tasks that are not yet automated [footnote 663], [footnote 664]. Third, general-purpose AI can lead to the creation of  entirely new tasks and occupations for workers to perform [footnote 641], [footnote 644], [footnote 664]. However, general-purpose AI may also exert downward pressure on wages for workers in certain occupations. As general-purpose AI increases the supply of certain skills in the labour market, it may reduce demand for humans with those same skills. Workers specialising in tasks that can be automated by general-purpose AI may therefore face decreased wages or job loss [footnote 643]. For example, one study found that four months after ChatGPT was released, it had caused a 2% drop in the number of writing jobs posted on an online labour market and a 5.2% drop in monthly earnings for writers on the platform [footnote 645]. The impact on wages in a given sector largely depends on how much additional demand exists for that sector’s services when costs fall due to general-purpose AI-driven productivity gains. Furthermore, the share of any AI-driven profits that are captured by workers will depend on factors such as the market structures and labour policies in affected industries, which vary greatly across countries.

General-purpose AI will likely have the most significant near-term impact on jobs that consist mainly of cognitive tasks. Several studies show that general-purpose AI capabilities overlap with the capabilities needed to perform tasks in a wide range of jobs, with cognitive tasks most likely to be impacted [footnote 637], [footnote 640], [footnote 665], [footnote 666].  Research has also found that general-purpose AI provides large productivity gains for workers performing many kinds of cognitive tasks. This includes work in occupations such as strategy consulting [footnote 667], legal work [footnote 668], professional writing [footnote 651], computer programming [footnote 113], and others. For example, customer service agents received an average productivity boost of 14% from using general-purpose AI [footnote 662]. Additionally, software developers were found to perform an illustrative coding task 55.8% faster when they had access to a general-purpose AI programming assistant [footnote 114]. Sectors that rely heavily on cognitive tasks, such as Information, Education, and the Professional, Scientific, and Technical Services sector, are also adopting AI at higher rates, suggesting that workers in these industries are poised to be most impacted by general-purpose AI in the near-term [footnote 669].

AI agents have the potential to affect workers more significantly than general-purpose AI systems that require significant human oversight. ‘AI agents’ are general-purpose AI systems that can accomplish multi-step tasks in pursuit of a high-level goal with little or no human oversight. This means that agents are able to chain together multiple complex tasks, potentially automating entire workflows rather than just individual tasks [footnote 670]. By removing the need for human involvement in long sequences of work, AI agents could perform tasks and projects more cheaply than general-purpose AI systems that require more human oversight [footnote 671], [footnote 672]. This is likely to incentivise increased rates of adoption of agents for the purposes of automation in economically competitive environments [footnote 671], [footnote 673]. The resulting acceleration in automation could cause more rapid disruption to skill demands and wages across multiple sectors [footnote 670], giving policymakers less time to implement policy measures that strengthen worker resilience.

Involuntary job loss can cause long-lasting and severe harms for affected workers. Studies show that displaced workers experience sharp drops in earnings and consumption immediately after being displaced, with earnings deficits persisting for years afterward [footnote 674], [footnote 675]. Estimates of wage declines even after re-employment range from 5%–30% for as long as 20 years after displacement [footnote 676], [footnote 677], [footnote 678], [footnote 679]. Involuntary job loss can also significantly affect physical health, with evidence suggesting that displacement increases mortality risk by 50–100% within the year after separation and by 10–15% annually for the next 20 years [footnote 680]. Studies also link job loss to higher rates of depression [footnote 681], suicide [footnote 682], alcohol-related disease [footnote 682], and negative impacts on children’s educational attainment [footnote 683]. Given general-purpose AI’s potential to cause job displacement, these findings underscore the importance of policy measures to support affected workers.

Improved general-purpose AI capabilities will likely increase the risks that current systems present to worker autonomy and workplace well-being. Today’s narrow AI systems are already used to assign tasks, monitor productivity, and evaluate worker performance in settings ranging from warehouses to call centres [footnote 684]. While these systems can increase productivity [footnote 685], studies show that they often harm worker wellbeing through continuous monitoring and AI-driven workload decisions [footnote 686]. Many employers adopt these systems without sufficient testing or without fully understanding their impacts on the workforce [footnote 687]. This may be particularly concerning when AI management systems influence critical decisions such as hiring and termination [footnote 687]. It remains to be seen whether general-purpose AI will enable more extensive algorithmic management than narrow AI systems that are often used today. If general-purpose AI systems improve at integrating and analysing diverse data streams, this would likely enable more granular monitoring and decision-making across workplaces, potentially increasing both efficiency and risks to worker autonomy.

General-purpose AI could increase income inequality within countries by providing greater productivity boosts to high earners, but impacts are likely to vary across countries. Over the last several decades automation of routine jobs increased wage inequality in the US context by displacing middle-wage workers from jobs where they previously had a comparative advantage [footnote 688], [footnote 689], [footnote 690]. For example, one study estimates that 50–70% of the increase in US wage inequality over the last four decades can be explained by relative wage declines of workers specialised in routine tasks in industries that experienced high levels of automation [footnote 688]. General-purpose AI could compete with human workers in a similar fashion, potentially depressing wages for some workers [footnote 691], [footnote 692] while being most likely to boost the productivity of those who are already in relatively high-income occupations see Figure 2.6 [footnote 637]. One simulation suggests that AI could increase wage inequality between high and low income occupations by 10% within a decade in advanced economies [footnote 640]. Across many types of cognitive tasks, however, there is evidence that at the current level of model capabilities, those with less experience or more elementary skill sets often get the largest productivity boosts from using general-purpose AI [footnote 114], [footnote 651], [footnote 662], [footnote 667], [footnote 668]. This suggests that within cognitive-task oriented occupations, lesser paid workers could actually get a larger boost than high earners and wage inequality within those occupations could shrink [footnote 693]. How these countervailing effects will play out across the economy is uncertain and is likely to vary across countries, sectors, and occupations.

Figure 2.6: General-purpose AI has an unequal economic impact on different parts of the income distribution.

CI, confidence internal; GPT, generative pretrained transformer’ LLM, large language model

Exposure is highest for worker tasks at the upper end of annual wages, peaking at approximately $90,000/year in the US, while low and middle incomes are significantly less exposed. In this figure, ‘exposure’ signifies the potential for productivity gains from AI, which can manifest in worker augmentation and wage boosts or automation and wage declines, depending on a variety of other factors. Source: Eloundou et al., 2024 [footnote 637].

General-purpose AI-driven labour automation is likely to exacerbate inequality by reducing the share of all income that goes to workers relative to capital owners. Globally, labour’s share of income has fallen by roughly six percentage points between 1980 and 2022 [footnote 694]. Typically, 10% of all earners receive the majority of capital income [footnote 695], [footnote 696]. If AI automates a significant share of labour, then these trends could intensify by both reducing work opportunities for wage earners and by increasing the returns to capital ownership [footnote 697], [footnote 698]. Additionally, evidence suggests that general-purpose AI can aid the creation of large ‘superstar’ firms that capture a large share of economic profits, which would further increase capital-labour inequality [footnote 699].

General-purpose AI technology is likely to exacerbate global inequality if primarily adopted by high-income countries (HICs). HICs have a higher share of the cognitive task-oriented jobs that are most exposed to general-purpose AI impacts [footnote 640]. These countries have stronger digital infrastructure, skilled workforces, and more developed innovation ecosystems [footnote 700] see 2.3.2. Global AI R&D divide. This positions them to capture general-purpose AI productivity gains more rapidly than emerging markets and developing economies. This would contribute to divergent income growth trajectories and a widening gap between HICs and low- and middle-income countries (LMICs) [footnote 701]. If the most advanced, labour-automating AI is used by companies in HICs, this could also attract additional capital investment to those countries, and further drive an economic divergence between high- and low-income regions [footnote 702]. Additionally, as firms in advanced economies adopt general-purpose AI, they may find it more cost-effective to automate production domestically rather than offshore work, eroding a traditional development pathway for developing economies that export labour-intensive services [footnote 703]. One study suggests this dynamic may be most likely to play out in countries with a large share of the workforce in outsourced IT services such as customer service, copywriting, and digital gig-economy jobs [footnote 704]. However, the precise impact on labour markets in developing economies remains unclear. On the one hand, they could face a double challenge of losing existing jobs to automation while finding it harder to attract new investment, as labour cost advantages become less relevant. On the other hand, if general-purpose AI is widely adopted in developing economies, it could provide productivity boosts for some skilled workers [footnote 662], [footnote 705], [footnote 706], potentially creating opportunities for these workers to compete for remote work opportunities with higher-paid counterparts in HICs.

Since the publication of the Interim Report (May 2024), new evidence suggests that rates of general-purpose AI adoption by individuals may be faster than previous technologies such as the internet or personal computers, though the pace of adoption by businesses varies widely by sector see Figure 2.7 [footnote 661]. For example, a recent survey in the US found that more than 24% of workers use generative AI at least once a week, and one in nine use it daily at work [footnote 661]. Business adoption rates vary significantly across sectors [footnote 707]. For example, in the US, approximately 18.1% of businesses in the Information sector report using AI (broadly defined), while only 1.4% in Construction and Agriculture do [footnote 669]. For firms who report using AI, 27% report replacing worker tasks, while only 5% report employment changes due to AI, more than half of which are employment increases rather than decreases [footnote 708]. Current evidence on general-purpose AI adoption rates is limited by limited international data collection, particularly outside of the US, though one survey of over 15,000 workers across 16 countries found that 55% of respondents use generative AI at least once a week in their work [footnote 709]. Across the globe, a large gender gap exists in both adoption and potential labour market impacts from general-purpose AI. For example, a recent meta-analysis of ten studies from various countries suggests that women are 24.6% less likely to use generative AI than men [footnote 710].

Figure 2.7: The trajectory of computer, internet and AI adoption

Figure 2.7: So far, generative AI appears to have been adopted at a faster pace than PCs or the internet in the US. Faster adoption compared with PCs is driven by much greater use outside of work, probably due to differences in portability and cost. Source: Bick et al., 2024 [footnote 661].

Additionally, since the publication of the Interim Report (May 2024), new evidence has shown that general-purpose AI can deliver meaningful productivity gains in real-world work settings and is poised to drive gains in science and R&D. New evidence demonstrates productivity impacts across several real-world work environments, and finds that these impacts vary by occupation, by business, and are mediated by the rate of adoption and usage within a firm  [footnote 711]. Recent research also found that each 10x increase in compute used to train a model allowed workers to complete certain translation tasks 12.3% faster with improved quality when they used the model as an assistant [footnote 706]. The extent to which this relationship applies to other work tasks remains unclear, however. Furthermore, the impact of productivity gains from general-purpose AI on worker skill development versus skill decline is an emerging research topic. One recent study found that while using ChatGPT can enhance some worker capabilities, skills and knowledge are mostly not retained once access is removed [footnote 712]. Finally, in one recent study focused on the US labour market, technology and R&D-focused jobs were found to have the highest proportion of their job tasks exposed to potential productivity gains from general-purpose AI [footnote 637]. This provides some suggestive evidence that recently observed large productivity impacts of narrow AI systems on scientific discovery [footnote 713] could potentially be accentuated by general-purpose AI across a broader range of R&D tasks. This is notable, since increasing the productivity of R&D can significantly boost technological progress and economic growth [footnote 714].

The main evidence gaps around labour market risks include uncertain long-term employment impacts, limited international data on adoption, and untested policy responses. Comprehensive studies on general-purpose AI’s long-term effects on employment and wages across different sectors are notably absent. There is limited understanding of adoption patterns outside the US, making it difficult to anticipate international impacts. Data on new job creation from general-purpose AI adoption is insufficient to guide worker retraining programmes. Most crucially for policymakers, there is little evidence about which interventions effectively protect workers during technological transitions. For example, while retraining is often suggested as a response to changing skill demands, there is limited evidence of its effectiveness [footnote 715], particularly given how quickly general-purpose AI might alter required workplace skills. These gaps exist partly because general-purpose AI is still nascent, making long-term impacts hard to measure, and partly because it is challenging to isolate general-purpose AI’s effects from other economic factors. The rapid pace of general-purpose AI development also means that evidence about the effectiveness of policy responses to previous technological change may not translate to this context.

For policymakers working on labour market risks from general-purpose AI, key challenges include balancing AI innovation with worker protection, creating policy that can quickly adapt to evolving impacts, and ensuring economic benefits are shared both within and across countries. A central challenge is balancing innovation (which could boost productivity and growth) against protecting workers from wage declines and the harms associated with involuntary job loss [footnote 654]. Policies must be adaptable given the speed of general-purpose AI development and the uncertainty about future impacts [footnote 716]. However, policymakers face the challenge of setting flexible policy while still providing enough regulatory certainty to facilitate business investment and worker training decisions. Closely monitoring key trends can help policymakers increase their foresight into general-purpose AI’s labour market impacts. These include sector-specific AI adoption rates, changes in wage distributions across industries, emergence of new job categories, and shifts in skill demands from employers as general-purpose AI systems progress and are more widely adopted. Finally, as the labour market impacts of AI are expected to vary significantly between countries, policymakers face challenges in coordinating international responses to prevent a widening global economic divide and ensure that AI can accelerate inclusive economic growth.

2.3.2. Global AI R&D divide

Key information

  • Large companies in countries with strong digital infrastructure lead in general-purpose AI R&D, which could lead to an increase in global inequality and dependencies. For example, in 2023, the majority of notable general-purpose AI models (56%) were developed in the US. This disparity exposes many LMICs to risks of dependency and could exacerbate existing inequalities.
  • The rising cost for developing general-purpose AI is the main reason for this ‘AI R&D divide’. Access to large and expensive quantities of computing power has become a prerequisite for developing advanced general-purpose AI. Academic institutions and most companies, especially those in LMICs, do not have the means to compete with large tech companies.
  • Attempts at closing the AI R&D divide have not been successful. An increasing number of efforts have been focused on democratising access to compute, investing in AI skills training in LMICs, and open sourcing prominent AI models. But these efforts will require considerable financial investment and significant time to implement.
  • Recent work suggests the AI R&D divide might widen further due to a trend of increasing R&D costs at the frontier. Since the publication of the Interim Report (May 2024), researchers have published new evidence on the rising costs of developing state-of-the-art AI, growing disparities in the concentration of AI talent, and increasing centralisation of computing resources needed to train large general-purpose AI models.
  • There is a lack of evidence on the effectiveness of potential ways to address the AI R&D divide. For example, the impact of AI training programmes or infrastructure investments in LMICs remains unclear.

Key definitions

  • Digital divide: The disparity in access to information and communication technology (ICT), particularly the internet, between different geographic regions or groups of people.
  • AI R&D divide: The disparity in AI research and development across different geographic regions, caused by various factors including an unequal distribution of computing power, talent, financial resources, and infrastructure.
  • Digital infrastructure: The foundational services and facilities necessary for digital technologies to function, including hardware, software, networks, data centres, and communication systems.
  • Ghost work: The hidden labour performed by workers to support the development and deployment of AI models or systems (for example through data labelling).

The uneven global distribution of compute, talent, financial resources, and digital infrastructure contributes to an AI R&D divide that could expose many LMICs to dependency risks and hinder their advancement in general-purpose AI R&D. The steep financial costs of developing and running general-purpose AI systems [footnote 27] may limit general-purpose AI R&D output from LMICs, potentially exacerbating existing inequalities. Researchers in LMICs, who are often unable to train LLMs due to high costs, will tend to rely upon existing open-weight models, which are primarily developed in countries with strong digital infrastructure [footnote 717]. These models are likely not to fully capture nuances (grammatical structure, non-Latin scripts, tonal differences, etc.) of non-Western languages, which are underrepresented in the training data, leading to lower accuracy [footnote 718]. Additionally, reliance on North American and Chinese companies for access to compute and open-weight models typically comes with copyright and privacy restrictions that limit the ability of researchers and developers within many LMICs to create state-of-the-art models [footnote 719]. Thus, these researchers often depend on collaborations with stakeholders in countries with stronger digital infrastructure to access compute and publish in top-tier venues. Finally, as countries like the US and China continue to lead in skilled AI talent production, researchers and students in other countries are often reliant on institutions in those leading countries for academic and career advancement in AI. This may exacerbate disparities in AI R&D, as talent moves from other countries to countries where an AI industry is already concentrated [footnote 720].

A main driver of the AI R&D divide is the difference in access to compute between different actors. This includes the unequal access to powerful computing resources (graphics processing units (GPUs), data centres, cloud services, etc.) that are necessary to train and deploy large and complex AI models. In recent years, this divide has widened [footnote 721], [footnote 722]. This unequal access is most apparent in the different extent to which large AI companies and academic AI labs have access to computing resources. Estimates show that US technology companies are the major buyers of NVIDIA H100 GPUs, one of the most powerful GPU chip types on the market explicitly designed for AI [footnote 723].  However, several major technology companies have all recently announced that they are developing custom AI chips to reduce their dependence on the AI chip supply chain, potentially paving the way for more widespread access to GPUs. But the exceptionally high cost of GPUs (typically $20,000-$30,000 for top-tier GPUs such as the H100 as of November 2024), thousands or tens of thousands of which are typically used to train a leading general-purpose AI model, could still hinder most LMICs from affording this level of AI infrastructure. The rising cost of establishing and maintaining data centres also contributes to unequal access to compute. Over the past decade, large tech companies have increased their investments in data centres, with Google unveiling a $600 million data centre in Nebraska in 2022 [footnote 724] and recently announcing a plan to build a $1 billion data centre in Missouri [footnote 725]. Meta has invested over $2 billion in a data centre in Oregon [footnote 726], and Microsoft has announced a $1 billion initiative to build a data centre campus, along with other AI development efforts, in Kenya [footnote 727]. While such efforts will aid significantly in increasing access to compute in general, they are unlikely to meaningfully mitigate the AI R&D divide.

Disparities in the concentration of skilled talent also contribute to the global AI R&D divide. AI R&D is primarily concentrated in two countries – the US and China – which have made significant investments toward recruiting and retaining AI talent. The US has the largest percentage of elite AI researchers, contains a majority of the institutions that conduct top-tier research, and is the top destination for AI talent globally [footnote 728]. Additionally, there exist disparities in where students can access AI-related degree programmes, as many of the top universities for AI are based in the US or the UK [footnote 729], and the vast majority of English university courses on AI are offered in the UK, US, and Canada [footnote 730]. While some LMICs, such as India and Malaysia, are increasing their respective AI course offerings [footnote 731], there is much more work needed to understand this disparity, as there is limited data on formal university programmes in AI throughout LMICs, particularly those offered in non-English languages.

The delegation of lower-level AI work to workers in LMICs has led to a ‘ghost work’ industry. The increasing demand for data to train general-purpose AI systems, including human feedback to aid in training, has further increased the reliance on ‘ghost work’ [footnote 732]. ‘Ghost work’ is mostly hidden labour performed by workers – often in precarious conditions – to support the development of AI models. Firms have sprung up that help big technology companies to outsource various aspects of data production, including data collection, cleaning, and annotation. This work can provide people in LMICs with opportunities. On the other hand, the contract-style nature of this work often provides few benefits and worker protections and less job stability, as platforms rotate markets to find cheaper labour. Research has shown that these workers face exposure to graphic content, erratic schedules, heavy workloads, and limited social and economic mobility [footnote 733], [footnote 734], [footnote 735], [footnote 736]. Exposure to such graphic content can lead to PTSD and other mental trauma [footnote 737], [footnote 738].

Since the publication of the Interim Report (May 2024), more evidence of the increased costs associated with general-purpose AI development has emerged, making a further widening of the AI R&D divide appear likely. The development of notable general-purpose AI models is still led by companies in countries with strong digital infrastructure and access to compute, and the abilities of these models are increasing. Researchers have provided strong evidence that the usage of resources like electricity in AI development is increasing [footnote 739]. The cost of training state-of-the-art AI models has grown 2–3x per year over the past eight years and could reach a cost of over USD $1 billion by 2027 [footnote 27]. However, there is some evidence of improvement in talent concentration and state-of-the-art model development from LMICs. For example, India has been particularly successful at increasing its concentration of skilled AI talent, which has increased by 263% since 2016 [footnote 740]. Research indicates that the development of general-purpose AI may significantly impact IT services outsourced to LMICs, such as customer service, copywriting, and gig work [footnote 704].

A key evidence gap around the AI R&D divide is the lack of evidence on feasible solutions. Large technology companies have increasingly invested in AI and digital skills training efforts across Africa, Latin America, and Asia, and these programmes are likely to increase as these regions expand the capabilities of state-of-the-art models for local consumers. However, there is no evidence that such training improves the production of significant AI models, particularly from LMICs. There is also only limited evidence on the benefits of investments in AI-specific infrastructure, given the large disparities in AI talent between many LMICs and countries like the United States and China. At the moment, it is unclear whether access to infrastructure would increase talent, or whether this infrastructure would go unused due to a lack of skilled experts. There is also limited data on the full scope of the AI R&D divide, since metrics often measure research output in top-tier journals and conferences, which are all published in English. Structural barriers, such as visa restrictions and financial burdens, often prevent qualified international researchers, particularly from LMICs, from attending major conferences or publishing in costly journals. The effects of the AI R&D divide are also spillover impacts from the existing digital divide [footnote 741], making it hard to disentangle the specific impacts of general-purpose AI on the global AI R&D divide.

Reducing the AI R&D divide is a hard problem for policymakers to tackle. General-purpose AI development costs are inaccessible for the majority of LMICs, and investments in basic infrastructure such as electricity grids and internet networks are estimated to cost billions (USD) for countries like Nigeria [footnote 742]. Additionally, none of these countries have companies that could handle the expenses of developing general-purpose AI systems individually. There is limited evidence on outcomes of digital skills training, which may impede further efforts to develop targeted skills training programmes that might have a significant impact on LMIC contributions to general-purpose AI models. There are also projections that disparities in AI talent concentration might widen. Countries such as the US, UK, China, and countries across Europe are rapidly increasing recruitment of AI talent, with some offering immigration pathways for skilled talent to contribute to AI R&D within their respective countries [footnote 743], [footnote 744], [footnote 745]. Policymakers, particularly in many LMICs, will have to analyse the implications of this for their regional autonomy and their efforts to lessen the AI R&D divide.

2.3.3. Market concentration and single points of failure

Key information

  • Market shares for general-purpose AI tend to be highly concentrated among a few players, which can create vulnerability to systemic failures. The high degree of market concentration can invest a small number of large technology companies with a lot of power over the development and deployment of AI, raising questions about their governance. The widespread use of a few general-purpose AI models can also make the financial, healthcare, and other critical sectors vulnerable to systemic failures if there are issues with one such model.
  • The market is so concentrated because of high barriers to entry. Developing state-of-the-art, general-purpose AI models requires substantial up-front investment. For example, the overall costs for developing a state-of-the-art model can currently reach hundreds of millions of US dollars. Key cost factors are computing power, highly skilled labour and vast datasets.
  • In addition, market leaders benefit from self-reinforcing dynamics that reward winners. Economies of scale allow bigger AI companies to spread one-off development costs over an ever larger customer base, creating a cost advantage over smaller companies. Network effects further allow larger companies to train future models with user data generated through older models.
  • Market concentration has continued to persist in 2024. Since the publication of the Interim Report (May 2024), the previous consensus that market concentration in the general-purpose AI market is high has continued to hold.
  • There is little research on predicting or mitigating single points of failure in AI. This creates challenges for policymakers. The absence of reliable prediction methods on how failures may propagate through interconnected systems makes these risks hard to assess.

Key definitions

  • Market concentration: The degree to which a small number of companies control an industry, leading to reduced competition and increased control over pricing and innovation.
  • Single point of failure: A part in a larger system whose failure disrupts the entire system. For example, if a single AI system plays a central role in the economy or critical infrastructure, its malfunctioning could cause widespread disruptions across society.

The development of state-of-the-art general-purpose AI requires enormous financial investment, often reaching hundreds of millions of US dollars (see Figure 2.8). These costs arise primarily from three key areas: specialised computational resources, highly skilled AI expertise, and access to vast datasets, which are frequently proprietary and expensive. Computational resources include advanced hardware such as GPUs (graphics processing units) and TPUs (tensor processing units), cloud infrastructure, and the energy required for training large-scale AI models [footnote 739]. Developing high-quality datasets also involves significant costs due to processes such as collection, annotation, and cleaning [footnote 746], [footnote 747]. Furthermore, the recruitment and retention of top-tier AI researchers, engineers, and data scientists is highly competitive and costly, as their expertise is essential for developing cutting-edge algorithms and architectures.

Figure 2.8: Estimated training cost of select AI models, 2017-2023

Figure 2.8: Estimated training costs of AI models have sharply increased over the past few years. Only a few companies can afford to train models at such high cost, further increasing market concentration. Source: Maslej et al., 2024a [footnote 730].

Access to massive datasets is crucial for training high-performance AI models. These datasets are often proprietary, giving established firms a competitive edge (see 1.3. Capabilities in coming years). Large technology companies are uniquely positioned to overcome these barriers due to their existing financial resources, infrastructure, and ownership or control over vast amounts of data through their existing platforms and services. In contrast, new firms face significant obstacles in acquiring the necessary datasets and computing power, leading to a high barrier to entry [footnote 74], [footnote 748], [footnote 749], [footnote 750]. As a result, smaller companies are often unable to compete, reinforcing the concentration of market power among a few dominant players in the AI sector.

General-purpose AI systems benefit significantly from economies of scale, since larger, more compute-intensive models tend to outperform their smaller counterparts on many metrics. Large-scale models, such as those used for natural language processing, image recognition, and decision-making, are capable of handling a broader range of tasks due to their increased capacity to process and analyse vast amounts of data. As these models grow in size, this may also result in better generalisation and accuracy [footnote 751], reinforcing the demand for high-performance, general-purpose AI systems across industries. This creates a feedback loop where large-scale models, which require substantial computational resources to develop, become more valuable and sought after due to their performance and versatility. Since AI systems require significant upfront investments in infrastructure and development [footnote 27], but only small costs per query, the average cost per user decreases as the AI system is provided to more users, reflecting economies of scale. This gives larger firms a competitive advantage, as they can spread the costs of development over a larger customer base, making it difficult for smaller firms to compete. Additionally, these systems benefit from network effects: as more users interact with them, they generate vast amounts of new data that can be used to retrain and fine-tune the models [footnote 752], [footnote 753]. This constant influx of user-generated data improves the performance of the models, making them even more valuable and effective over time.

These tendencies towards market concentration mean that a few companies will likely dominate decision-making about the development and deployment of general-purpose AI. Since society at large could benefit as well as suffer from these companies’ decisions, this raises questions about the appropriate governance of these few large-scale systems. A single general-purpose AI model could potentially influence decision-making across many organisations and sectors [footnote 748] in ways that might be benign, subtle, inadvertent, or deliberately exploited. There is the potential for the malicious use of general-purpose AI as a powerful tool for manipulation, persuasion, censorship and control by a few companies or governments.

Table 2.5: Worldwide market share of leading cloud infrastructure service providers in Q1 2024

Cloud infrastructure service provider Market share %
AWS 31%
Azure 25%
Google Cloud 11%
Alibaba Cloud 4%
Salesforce 3%
IBM Cloud 2%
ORACLE 2%
Tencent Cloud 2%

$76 billion Cloud infrastructure service revenues in Q1 2024. Amazon, Microsoft and Google together control over 2/3 of global cloud computing services, concentrating power over essential AI training infrastructure in just three companies. Source: Richter, 2024 [footnote 756].

Since the publication of the Interim Report (May 2024), the previous consensus that market concentration in the general-purpose AI market is high has continued to hold, and some new research has suggested a rising dependency on large AI companies. Increasing dependency on prominent tech companies for access to essential hardware (GPUs), AI model interfaces (APIs), and cloud storage services has significant implications for the AI ecosystem [footnote 754]. Just three companies control 67% of cloud computing services (see Figure 2.9). This reliance consolidates power within a few major players, limiting competition and innovation among smaller firms that lack the resources to invest in their own infrastructure. The market capitalisation of large tech companies has increased since the onset of the COVID-19 pandemic, influencing their accumulation and concentration of computing infrastructure, data, and human resources needed to train advanced AI systems [footnote 755]. Such accumulation of resources is driven by companies’ reassessment of expected returns from investments into AI.

A single AI system can be adopted across critical sectors such as finance, healthcare, and cybersecurity, making the systemic risks associated with market concentration particularly pronounced. These sectors, which are interdependent and integral to national security and economic stability, increasingly rely on AI for decision-making, threat detection, automation, and resource optimisation. The dominant general-purpose AI models supplied by a few large companies are used as the backbone for many of these applications, creating the potential for significant vulnerabilities [footnote 757]. A key concern is that flaws, vulnerabilities, bugs, or inherent biases in these widely adopted AI systems could lead to simultaneous failures across multiple industries [footnote 758]. Different scenarios have been proposed that illustrate potential disruptions. For example, a cybersecurity flaw in a dominant AI model could expose multiple financial institutions, government agencies, and other critical systems to coordinated cyberattacks or system failures [footnote 759], [footnote 760].

Increasing the development of technical standards to identify and mitigate single points of failure in AI could reduce risks. One way to mitigate risks from single points of failure is to make it less likely that individual models fail or are unsafe in some way. Some examples of potential mitigations that researchers have explored are the development of technical standards [footnote 761], along with auditing and reporting requirements [footnote 762]. However, these mitigations involve significant costs and complexity [footnote 763]. For a more detailed discussion of various such techniques, see 3. Technical approaches to risk management.

A key evidence gap around market concentration risks is the absence of established methods to model impacts from single points of failure in AI, which makes developing reliable mitigation methods hard. It is difficult to predict how failures propagate across complex societal systems. This makes it challenging to reliably predict potential disruptions or understand their full scope. This uncertainty hampers efforts to design targeted safeguards, as  comprehensive data for policymakers and developers on where vulnerabilities lie and how they manifest is still emerging [footnote 764]. As a result, there is a risk of incomplete or ineffective mitigation strategies, leaving critical sectors at continued risk of cascading failures from AI system flaws. While researchers have begun developing methods to measure the reliability of AI systems [footnote 765], they are few and limited in adoption.

A key challenge for policymakers seeking to reduce risks from market concentration in general-purpose AI is that developing this technology is so capital-intensive, favouring dominance of a few very large players. Common dynamics following attempts to reduce market concentration illustrate this, as smaller firms quickly become acquisition targets for larger competitors. For example, funding and resources may help smaller firms grow, but that in turn tends to make them attractive acquisition opportunities for dominant tech companies seeking to eliminate competition or expand their AI capabilities [footnote 766], [footnote 767].

2.3.4. Risks to the environment

Key information

  • General-purpose AI is a moderate but rapidly growing contributor to global environmental impacts through energy use and and greenhouse gas (GHG) emissions. Current estimates indicate that data centres and data transmission account for an estimated 1% of global energy-related GHG emissions, with AI consuming 10–28% of data centre energy capacity. AI energy demand is expected to grow substantially by 2026, with some estimates projecting a doubling or more, driven primarily by general-purpose AI systems such as language models.
  • Recent advances in general-purpose AI capabilities have been largely driven by a marked increase in the amount of computation that goes into developing and using AI models, which uses more energy. While AI firms are increasingly powering their data centre operations with renewable energy, a significant portion of AI training globally still relies on high-carbon energy sources such as coal or natural gas, leading to the aforementioned emissions and contributing to climate change.
  • AI development and deployment also has significant environmental impacts through water and resource consumption, and through AI applications that can either harm or benefit sustainability efforts. AI consumes large amounts of water for energy production, hardware manufacturing, and data centre cooling. All of these demands increase proportionally to AI development, use, and capability. AI can also be used to facilitate environmentally detrimental activities such as oil exploration, as well as in environmentally friendly applications with the potential to mitigate or help society adapt to climate change, such as optimising systems for energy production and transmission.
  • Current mitigations include improving hardware, software, and algorithmic energy efficiency and shifting to carbon-free energy sources, but so far these strategies have been insufficient to curb GHG emissions. Increases in technology efficiency and uptake of renewable energy have not kept pace with increases in demand for energy: technology firms’ GHG emissions are often growing despite substantial efforts to meet net-zero carbon goals. Significant technological advances in general-purpose AI hardware or algorithms, or substantial shifts in electricity generation, storage and transmission, will be necessary to meet future demand without environmental impacts increasing at the same pace.
  • Since the publication of the Interim Report (May 2024), there is additional evidence that the demand for energy to power AI workloads is significantly increasing. General-purpose AI developers reported new challenges in meeting their net-zero carbon pledges due to increased energy use stemming from developing and providing general-purpose AI models, with some reporting increased GHG emissions in 2023 compared to 2022. In response, some firms are turning to virtually carbon-free nuclear energy to power AI data centres.
  • The main evidence gaps around general-purpose AI energy use and GHG emissions are the lack of precise estimates of the total energy use or emissions due to general-purpose AI, and the difficulty of anticipating corresponding future trends. There is insufficient public information regarding current patterns in AI energy use, such as how much data centre capacity can be attributed to general-purpose AI compared to other workloads, and how much energy or other environmental impacts can be attributed to different AI use cases or capabilities. Current figures largely rely on estimates, which become even more variable and unreliable when extrapolated into the future due to the rapid pace of development in the field.

Key definitions

  • GHG (greenhouse gas) emissions: Release of gases such as carbon dioxide (CO2), methane, nitrous oxide, and hydrofluorocarbons which create a barrier trapping heat in the atmosphere. A key indicator of climate change.
  • Carbon intensity: The amount of GHG emissions produced per unit of energy. Used to quantify the relative emissions of different energy sources.
  • Compute: Shorthand for ‘computational resources’, which refers to the hardware (e.g. GPUs), software (e.g. data management software) and infrastructure (e.g. data centres) required to train and run AI systems.
  • Data centre: A large collection of networked, high-power computer servers used for remote computation. Hyperscale data centres typically contain more than 5000 servers.
  • Rebound effect: In economics, the reduction in expected improvements due to increases in efficiency, resulting from correlated changes in behaviour, use patterns, or - other systemic changes. For example, improving automotive combustion engine efficiency (km/litre) by 25% will lead to less than a 25% reduction in emissions, because the corresponding reduction in the cost of gas per kilometre driven will make it cheaper to drive more, limiting improvements.
  • Carbon offsetting: Compensating for GHG emissions from one source by investing in other activities that prevent comparable amounts of emissions or remove carbon from the atmosphere, such as expanding forests.
  • Institutional Transparency: The degree to which AI companies disclose technical or organisational information to public or governmental scrutiny, including training data, model architectures, emissions data, safety and security measures, or decision-making processes.

Recent advances in general-purpose AI capabilities have largely been powered by a rapid increase in the amount of computation that goes into developing and using AI models. The most straightforward methodology for improving general-purpose AI performance on end-tasks is to allow the model to learn from as many data examples as possible. This is achieved by increasing the size of the model, measured as the number of parameters, roughly in proportion with the amount of available data [footnote 156], [footnote 157]. In order for a bigger model to learn its parameters from the data (in training and development) and use those parameters to produce outputs on new data (in deployment or use), it needs to perform more calculations, which requires more computational power (see 1.3. Capabilities in coming years for further discussion).

General-purpose AI requires significant energy to develop and use, with corresponding GHG emissions and impacts to the energy grid. For example, Meta estimates that the energy required to train their recent (July, 2024) Llama 3 family of LLMs resulted in 11,380 tonnes of CO2 equivalent (tCO2e) emissions across the four released models [footnote 11]. The total emissions equate to the energy consumed by 1,484 average US homes for one year, or 2,708 gasoline-powered passenger vehicles driven for one year [footnote 768]. Google reports that training their open source Gemma 2 family of LLMs emitted 1247.61 tCO2e [footnote 769], but like most developers of general-purpose AI, they do not disclose the amount of energy or emissions required to power production models. Additional energy is required to power the data centres within which most general-purpose AI computation is performed, most notably for cooling. This additional energy overhead is typically quantified as power usage effectiveness (PUE), which is a ratio between the amount of energy used for computation and for other uses within a data centre; the optimal theoretical PUE, indicating zero energy overhead, is 1.0. The most efficient hyperscale data centres, including many of the data centres powering general-purpose AI, currently report a PUE of around 1.1, with the industry average hovering around 1.6 [footnote 770]. Energy use also arises from data transmission across computer networks, which is required to communicate the inputs to and outputs of AI models between users’ devices, such as laptops and mobile phones, and the data centres where AI models are run. Approximately 260–360 TWh energy was required to support global data transmission networks in 2022, a similar amount as was used to power data centres (240–340 TWh, excluding cryptocurrency mining which amounted to an additional 100–150 TWh) in that same year [footnote 771]. Google, Meta, Amazon, and Microsoft alone, leaders in providing general-purpose AI and other cloud compute services, were collectively responsible for 69% of global data transmission traffic, representing a shift from previous years when the majority of data transmissions were attributed to public internet service providers [footnote 772].

Although reporting often focuses on the energy cost of model training, there is strong evidence that a higher energy demand arises from their everyday use. Training and development corresponds to a lower number of high energy use activities, whereas deployment corresponds to a very high number of lower-energy uses (since each user query represents an energy cost) [footnote 739], [footnote 773], [footnote 774]. While the most reliable estimates of energy use and GHG emissions due to general-purpose AI typically measure their training costs, such as those cited above, available reports suggest a greater overall proportion of energy expenditure due to use. In 2022, Google and Meta reported that the use of AI systems accounted for 60–70% of the energy associated with their AI workloads, compared to 20–40% for training and 10% for development (i.e. research and experimentation) [footnote 199], [footnote 206].

Pre-processing and generating data for general-purpose AI also has significant energy costs. Meta further reported that data processing, i.e. filtering and converting data to the appropriate formats for training AI models, accounted for 30% of the energy footprint for a production model developed in 2021 for personalised recommendation and ranking, and the overall computation devoted to data pre-processing increased by 3.2x from 2019–2021 [footnote 199]. Large general-purpose AI models give rise to more computation for data processing than narrow AI models. Not only do general-purpose AI models consume substantially more data than narrow models, but the models themselves are increasingly used to generate additional synthetic data during the training process and to pick the best synthetic data to train on [footnote 37], [footnote 775], [footnote 776]. They are also used to generate data for training narrow AI models [footnote 777]. However, recent figures providing similarly detailed attribution of general-purpose AI energy use are not available. The limited availability of broader data quantifying AI energy use has resulted in recent mandates, such as in the EU AI Act, focusing on model training despite the need for increased reporting and characterisation of the demands due to data processing and model use [footnote 778].

Currently, the GHG emissions of general-purpose AI primarily arise from the carbon intensity of energy sources used to power the data centres and data transmission networks supporting their training and use. For example, renewable sources such as solar power emit far less GHG compared to fossil fuels [footnote 779]. While AI firms are increasingly powering their data centre operations with renewable energy [footnote 199], [footnote 206], [footnote 780], [footnote 781], a significant portion of AI compute globally still relies on high-carbon sources such as coal or natural gas [footnote 779]. This results in significant GHG emissions.

There are varying estimates of the total energy use and GHG emissions related to data centres and AI. According to estimates from the International Energy Agency (IEA), data centres and data transmission make up 1% of global GHG emissions related to energy use, and 0.6% of all GHG emissions (which also includes other GHG sources such as agriculture and industrial processes) [footnote 770], [footnote 771], [footnote 782]. Between 10% to 28% of energy use in data centres stems from the use of AI in recent estimates, mostly due to generative AI (LLMs and image generation models) which makes up most of the energy use due to general-purpose AI [footnote 770], [footnote 771], [footnote 782]. Combining these estimates would suggest that the use of AI is responsible for 0.1–0.28% of global GHG emissions attributed to energy use and 0.06–0.17% of all GHG emissions, but the exact percentages depend on how much of the energy used comes from carbon-intensive energy sources. The average carbon intensity of electricity powering data centres in the US is 548 grams CO2 per kWh, which is almost 50% higher than the US national average [footnote 783]. Factors affecting GHG emissions include the location of data centres and time of day of energy use, data centre efficiency, and the efficiency of the hardware used. As a result, the actual GHG emissions for a given amount of energy consumed by AI can vary considerably.

Since the publication of the Interim Report (May 2024), there is additional evidence of increased energy demand to power data centres running AI workloads. As of October 2024, the IEA predicts that data centres will account for less than 10% of global electricity demand growth between 2023 and 2030 (784). Most of the overall growth in demand is predicted to arise from other growing sources of electricity demand, such as uptake of electric vehicles and increased needs for cooling buildings. However, data centre impacts are highly localised compared to other industries, leading to uneven distribution of the increased demand and disproportionately high impacts in certain areas [footnote 784]. For example, data centres consumed over 20% of all electricity in Ireland in 2023 [footnote 785], and electricity use is growing in the US, home to more than half of global data centre capacity [footnote 786], for the first time in over a decade, driven in part by increased development and use of AI [footnote 787]. Technology firms are turning to nuclear power (which has its own complex benefits and risks) as a carbon-neutral energy source to power data centres, with multiple large tech firms signing deals with power providers to secure nuclear energy. In September 2024, Microsoft signed a deal that will re-open the Three Mile Island nuclear power plant in Pennsylvania, agreeing to purchase all of the plant’s generation capacity for the next 20 years, enough to power approximately 800,000 homes [footnote 788]. Amazon signed a similar deal in March to purchase up to 960 MW/year of nuclear energy to power a data centre campus for their Amazon Web Services (AWS) cloud platform [footnote 789], representing the first instance of data centre co-location with a nuclear power plant. However, in November the US Federal Energy Regulatory Commission rejected the transmission provider’s request to amend their interconnection service agreement to increase transmission to the data centre [footnote 790], casting some doubts on whether regulators will support such co-location moving forward. In October, Google announced an agreement to purchase nuclear energy from small modular reactors (SMRs), the world’s first corporate agreement of this kind, stating that they needed this new electricity source to “support AI technologies” [footnote 791].

There are several potential mitigations for the increasing energy use and GHG emissions of general-purpose AI systems, such as shifting to carbon-free energy, purchasing carbon offsets, and improving efficiency of AI systems and data centres, but no silver bullet. As in other sectors, continuing to shift general-purpose AI data centre energy to renewable energy sources such as wind, hydroelectric and solar is a promising path forward, but currently limited by battery storage and transmission technology; renewable sources cannot currently provide energy to data centres that need it without interruptions, across geographically-diverse regions. As mentioned in the previous paragraph, AI firms are expressing increased interest in nuclear energy sources, particularly cheaper, safer small modular reactors (SMRs), to fill the gap in the short to medium term. While SMRs provide uninterrupted carbon-free energy, a recent report highlights that SMRs (<300 MW) produce more nuclear waste per unit of energy produced than large-scale (>1000 MW) reactors by a factor of 2–30x [footnote 792]. Improving the energy efficiency of general-purpose AI systems, measured as energy used to reach a given capability level, is another way to mitigate energy use [footnote 206], [footnote 773]. Smarter resource allocation and scheduling is also a promising direction for mitigating GHG emissions. General-purpose AI workloads can be paused during peak energy use times in order to reduce GHG emissions by nearly 30% for some regions and energy mixes [footnote 793], but not all general-purpose AI workloads are compatible with this approach, particularly model inference, which typically requires the workload to be run immediately in order to return a response to the user immediately (e.g. when a general-purpose AI summary is included as part of a web search). Further mitigation strategies include developing sustainability impact assessments for AI development and deployment; resource restrictions or conditions for AI training; and tradable energy budgets for AI training and inference [footnote 794].

Carbon offsets are a popular method used by general-purpose AI developers to mitigate GHG emissions, but they do not always result in actual emissions reductions. Energy consumers commonly attempt to mitigate their GHG emissions by engaging in renewable power purchase agreements (PPAs), renewable energy credits (RECs), coal transition mechanisms (CTMs), or carbon offset certificates in order to offset their emissions by purchasing equivalent renewable energy, or by investing in other carbon-reduction or green energy transition projects. Carbon offsets are the primary mechanism currently employed by technology firms to achieve net-zero carbon emissions pledges, alongside increasing procurement of renewable energy sources to power data centre energy use directly [footnote 780], [footnote 795], [footnote 796]. For example, Meta reports that they mitigated the emissions due to LLM training cited above by purchasing an equivalent amount of renewable energy [footnote 11]. This strategy also has limitations, due to the difficulty of verifying the additionality of offset projects, i.e. ensuring that the emissions reductions would not have occurred regardless of the offset programme [footnote 797].

General-purpose AI energy efficiency is improving rapidly, but not enough to stop the ongoing growth of emissions. Specialised AI hardware and other hardware efficiency improvements enhance the performance-per-watt of machine learning workloads over time [footnote 206]. Moreover, new machine learning techniques and architectures can also help reduce energy consumption [footnote 206], as can improvements in the supporting software frameworks and algorithms [footnote 798], [footnote 799]. Energy used per unit of computation has been reduced by an estimated 26% per year [footnote 144]. However, current rates of efficiency improvement are insufficient to counter growing demand. Demand for computing power used for AI training, which has been increasing by a factor of approximately 4x each year, is so far significantly outpacing energy efficiency improvements [footnote 26]. This mismatch is reflected in the fact that technology firms involved in the development and deployment of general-purpose AI report challenges in meeting environmental sustainability goals. Baidu reports that increased energy requirements due to the “rapid development of LLMs” are posing “severe challenges” to their development of green data centres [footnote 781], and Google similarly reports a 17% increase in data centre energy consumption in 2023 over 2022 and a 37% increase in GHG emissions due to energy use “despite considerable efforts and progress on carbon-free energy”. They attribute these increases to increased investment in AI [footnote 780].

Efficiency improvements alone have not negated the overall growth in energy use of AI and possibly further accelerate it because of ‘rebound effects’. Economists have found for previous technologies that improvements in energy efficiency tend to increase, rather than decrease, overall energy consumption by decreasing the cost per unit of work [footnote 800]. Efficiency improvements may lead to greater energy consumption by making technologies such as general-purpose AI cheaper and more readily available, and increasing growth in the sector.

Figure 2.10: Data centre projected yearly energy use (US)

Figure 2.10: US data centre energy use is projected to grow rapidly, reaching between 270–930 TWh annually by 2030. This wide range in projections (varying by over 600 TWh, equivalent to >10% of total US 2022 energy use) stems from rapidly evolving technology and limited historical data, particularly for AI-specific usage. Source: Kamiya, G. & Coroamǎ, V.C., 2024 [footnote 801].

The main evidence gaps around general-purpose AI energy use and emissions are the lack of precise estimates of the total energy use, emissions or resource consumption due to general-purpose AI, and the difficulty of anticipating corresponding future trends. Bottom-up estimates of energy use and emissions, such as those described above, are much easier to calculate than top-down estimates for the entire sector. Increased development and use of general-purpose AI is widely believed to be driving increases in demand for data centre compute capacity and corresponding energy use, and so overall trends in data centre energy use are assumed to reflect the growth in general-purpose AI development and use. As of 2023, data centres (excluding cryptocurrency mining) accounted for between 1% to 1.5% of global electricity demand [footnote 802]; roughly 2% in the EU, 4% in the US, and close to 3% in China [footnote 213], [footnote 803], [footnote 804]. In 2020 data centres and data transmission networks emitted 330 million tCO2e overall, making up just under 1% of all energy-related GHG emissions, and 0.6% of global GHG emissions overall [footnote 771]. While AI is currently estimated to make up 10–30% of data centre workloads [footnote 770], [footnote 782], demand for AI development and use (general-purpose and otherwise) is expected to continue to grow in coming years. Some sources estimate that growth will double data centre electricity demand, from 460 TWh in 2022 to more than 1000 TWh in 2026 [footnote 208]. Google reports emitting 14.3 million tCO2e in 2023, a 13% increase over 2022 and 48% increase since 2019, which they attribute to increases in data centre energy consumption related to increased integration of AI into products [footnote 780]. However, projections vary widely and it is fundamentally challenging to project future use and growth of AI due to the rapid and unpredictable pace of development in the technology [footnote 805]. Figure 2.10 depicts the wide range of available estimates for future data centre energy use in the US, which vary widely, as much as 10% of the total electricity use in the US in 2022. It is typical to estimate future trends by simply extrapolating from current demand and growth rate of an indicator. However, this methodology ignores critical variables that dictate actual growth, and has proven insufficient for accurately estimating demand due to technological developments. For example, while global internet traffic (an indicator for data centre electricity use) grew by more than ten times between 2010 and 2020, data centre electricity use only increased by 6% over the same period due to improvements in hardware and data centre efficiency [footnote 806]. A technological breakthrough in today’s general-purpose AI algorithms could similarly reduce energy requirements, and current estimates of growth in energy use must take into account additional factors that curb growth, such as limitations to the AI hardware supply chain, and electricity generation capacity [footnote 784]. Further, AI (general-purpose and otherwise) can also have indirect environmental impacts (positive or negative) arising from specific applications [footnote 774]. For example, AI might be applied to accelerate discovery in materials science of a new battery chemistry allowing for wider adoption of renewable energy, or in identifying catalysts enabling more efficient carbon capture or hydrogen fuel production [footnote 807]. AI can also be applied towards environmentally detrimental goals such as oil and gas exploration and extraction, leading to increased GHG emissions [footnote 807]. Quantifying indirect impacts is even more challenging than characterising its direct impacts due to, e.g. energy use, and more work is needed to develop robust frameworks for life cycle assessment of general-purpose AI models [footnote 774]. Better reporting and characterisation of past and current energy demands, and the dominant AI use cases fuelling them, is needed in order to gauge risk and develop mitigation strategies for increasing energy demand and emissions due to general-purpose AI [footnote 778]. In order to remain on track with the IEA’s Net Zero Emissions by 2050 scenario, for example, emissions due to data centres and data transmission would need to halve by 2030 [footnote 771], but it is not well known what proportion of those emissions can be attributed to general-purpose AI, which general-purpose AI development and use cases are contributing the most to those emissions and which are mitigating or reducing emissions elsewhere, and how those trends are developing over time.

In addition to GHG emissions due to energy use, general-purpose AI has other environmental impacts due to the physical systems and structures required for its development and use, which are even less well understood. The GHG emissions due to energy use discussed so far are typically referred to as operational emissions, and they currently contribute the highest proportion of emissions. The embodied carbon footprint of AI hardware, which includes emissions from manufacturing, transportation, the physical building infrastructure, and disposal, also contributes significant GHG emissions. Depending on the location and scenario, this can be up to 50% of a model’s total emissions [footnote 199]. As operational energy efficiency improves, the embodied carbon footprint will become a larger proportion of the total carbon footprint [footnote 808]. Intel reports that its Ocotillo campus generated over 200,000 tCO2e in 2023 from direct emissions alone (not including electricity) [footnote 809], and is on track to generate over 300,000 tCO2e by the end of 2024, having consumed over 1 billion kWh energy in the first quarter of 2024 [footnote 809]. Estimating the current embodied carbon footprint of general-purpose AI poses a great challenge due to a lack of data from hardware manufacturers. This arises due to a combination of incentives, including manufacturers’ desire to protect intellectual property around proprietary manufacturing processes and the consolidation of expertise in manufacturing specialised AI hardware to a very limited number of firms, limiting knowledge access and transfer.

Water consumption is another emerging area of environmental risk from general-purpose AI. General-purpose AI development and use withdraws fresh water from local water systems, a portion of which is then consumed, primarily through evaporation. As with energy use, general-purpose AI water use also increases as models grow larger. General-purpose AI has both embodied and operational water requirements. Embodied water consumption comes from water use in the hardware manufacturing process, and operational water use primarily arises from energy production and from evaporative cooling systems in data centres. In energy production, water evaporates when used for cooling in nuclear and fossil-fuel combustion power plants and in hydroelectric dams. In data centres, computer hardware also produces significant waste energy in the form of heat, and must be cooled in order to optimise computational efficiency and longevity. The most effective and widespread methods for cooling hardware in data centres evaporate water. As the computation used for training and deploying general-purpose AI models increases, cooling demands increase, too, leading to higher water consumption. Water is also consumed during hardware manufacturing processes. In 2023, Intel’s water-efficient Ocotillo chip manufacturing plant in Arizona, which has earned the highest certification for water conservation from the Alliance for Water Stewardship, withdrew 10,561 million litres of water (90% fresh water) of which 1896 million litres were consumed [footnote 809]. Assuming an average household water use of 144 litres per day [footnote 810], this equates to over 200,000 households’ yearly water withdrawal. Taiwan Semiconductor Manufacturing Company (TSMC), the world’s largest semiconductor manufacturer and the main supplier of chips to AI hardware firms such as Nvidia, reports that as of 2023 their per-unit water consumption had increased by 25.2% since 2010, despite their goal to decrease usage by 2.7% over that period, and by 30% by 2030; this is despite increased water-saving measures resulting in TSMC conserving 33% more water year-over-year in 2023 [footnote 811]. Water consumption by current models and the methodology to assess it are still subject to scientific debate, but some researchers predict that water consumption by AI could ramp up to trillions of litres by 2027 [footnote 199], [footnote 812]. In the context of concerns around global freshwater scarcity, and without technological advances enabling emissions-efficient alternatives, AI’s water footprint might be a substantial threat to the environment and the human right to water [footnote 813]. In response to congressional mandates, the US Department of Energy is currently working to assess current and near-future data centre energy and water consumption needs, with a report to be released by the end of 2024 [footnote 787]. European data centre operators must report water consumption beginning in 2025 [footnote 814].

Potential mitigations for AI-related water consumption include reducing energy use and developing and deploying low-water processes for cooling and manufacturing. The same algorithmic and software improvements deployed to mitigate energy consumption will also lead to some reduced water consumption, since a portion of water consumption is due to energy use. Other energy use mitigations, such as hardware efficiency improvements or shifting to carbon-free energy sources, will not necessarily lead to reduced water use, and could increase it; improving hardware efficiency implies manufacturing new hardware to replace old hardware, and nuclear power generation requires more water for cooling than natural gas generation [footnote 815]. Newer technologies, such as dry cooling, can be used to reduce water withdrawals required for cooling in power plants, but dry cooling decreases the efficiency of energy production [footnote 816]. In data centres and hardware manufacturing, water can be harvested and recycled, but this also requires increased energy input in order to filter water to high purity, for example through reverse osmosis [footnote 809], [footnote 817]. These examples highlight a common trade-off between energy use and water use concerns that must be considered when developing policies around AI environmental impacts. Data centres can be built in cold climate geographic regions amenable to natural air cooling, but logistical challenges in terms of energy and data transmission, construction, and extreme weather limit the cost efficacy of this approach at scale. Trigeneration, wherein waste heat from energy production is used to provide cooling, can minimise water and energy use in data centres [footnote 818]. However, current trigeneration systems are typically powered by fossil fuel combustion and further research is needed to develop trigeneration systems powered by carbon-free and low-water energy sources. Hydrogen plasma cooling could also improve data centre cooling efficiency, but significant efforts are still needed to develop robust infrastructure for producing hydrogen that does not rely on fossil fuels [footnote 819]. In conjunction with efforts to optimise manufacturing processes, hardware manufacturing firms have begun to report or pledge ‘net positive’ water use, through a combination of reducing their water consumption and funding external water restoration projects equivalent in gallons to their consumption, in a similar vein to net-zero carbon pledges that leverage RECs or carbon offsets [footnote 809], [footnote 811], with similar challenges.

Policymakers face three core challenges in addressing AI’s impact on the environment: limited institutional transparency around energy use and emissions data, unclear relationships between computational costs and whether the resulting capabilities are applied for environmental benefit or harm, and high uncertainty due to rapid development. Limited data is available for quantifying energy and emissions associated with general-purpose AI, which limits researchers’ ability to analyse and forecast use patterns, and corresponding policy development. However, existing reporting requirements still provide insufficient information regarding AI specifically; they do not require developers to break down impacts by model use phases (training versus use) or use cases (general-purpose versus task-specific, or whether AI is being applied to mitigate or accelerate negative environmental impacts, such as to aid in oil and gas extraction) [footnote 778]. Additionally, there is a lack of understanding in the research community regarding the amount of computation required to achieve a desired level of capability from a general-purpose AI model. This limits the extent to which energy use targets can be set for specific models or use cases, such as the amount of energy or emissions allotted to generate an image, since the upper and lower bounds on required energy are either very wide, or highly case-specific. Close collaboration and effective communication is needed between domain experts and policymakers to ensure that policy decisions are based on accurate data, and that mechanisms are put in place to ensure that better data is available in the future to support policy development and implementation.

2.3.5. Risks to privacy

Key information

  • General-purpose AI systems can cause or contribute to violations of user privacy. Violations can occur inadvertently during the training or usage of AI systems, for example through unauthorised processing of personal data or leaking health records used in training. But violations can also happen deliberately through the use of general-purpose AI by malicious actors; for example, if they use AI to infer private facts or violate security.
  • General-purpose AI sometimes leaks sensitive information acquired during training or while interacting with users. Sensitive information that was in the training data can leak unintentionally when a user interacts with the model. In addition, when users share sensitive information with the model to achieve more personalised responses, this information can also leak or be exposed to unauthorised third parties.
  • Malicious actors can use general-purpose AI to aid in the violation of privacy. AI systems can facilitate more efficient and effective searches for sensitive data and can infer and extract information about specific individuals from large amounts of data. This is further exacerbated by the cybersecurity risks created by general-purpose AI systems (see 2.1.3. Cyber offence.
  • Since the publication of the Interim Report (May 2024), people increasingly use general-purpose AI in sensitive contexts such as healthcare or workplace monitoring. This creates new privacy risks which so far, however, have not materialised at scale. In addition, researchers are trying to remove sensitive information from training data and build secure deployment tools.
  • For policymakers, it remains hard to know the scale or scope of privacy violations. Assessing the extent of privacy violations from general-purpose AI is extremely challenging, as many harms occur unintentionally or without the knowledge of the affected individuals. Even for documented leaks, it can be hard to identify their source, as data is often handled across multiple devices or in different parts of the supply chain.

Key definitions

  • Privacy: A person’s or group’s right to control how others access or process their sensitive information and activities.
  • Personally identifiable information (PII): Any data that can directly or indirectly identify an individual (for example, names or ID numbers). Includes information that can be used alone or combined with other data to uniquely identify a person.
  • Sensitive data: Information that, if disclosed or mishandled, could result in harm, embarrassment, inconvenience, or unfairness to an individual or organisation.
  • Data minimisation: The practice of collecting and retaining only the data that is directly necessary for a specific purpose, and deleting it once that purpose is fulfilled.
  • Retrieval-Augmented Generation (RAG): A technique that allows LLMs to draw information from other sources during inference, such as web search results or an internal company database, enabling more accurate or personalised responses.
  • Deepfake: A type of AI-generated fake content, consisting of audio or visual content, that misrepresents real people as doing or saying something that they did not actually do or say.

General-purpose AI systems rely on and can process vast amounts of personal data, posing significant privacy risks. In the context of AI, privacy is a complex and multi-faceted concept encompassing:

  • Data confidentiality and protection of personal data collected or used for training purposes, fine-tuning, information extraction, or during inference
  • Institutional transparency and controls over how personal information is used in AI systems [footnote 820]; for example, the ability for individuals to opt out from their personal data being collected for training, or the post hoc ability to make a general-purpose AI system ‘unlearn’ specific information about an individual [footnote 821]; and related challenges such as reconciling data minimisation and transparency [footnote 822], control over how data-driven decisions are made, and unauthorised use or processing of personal data [footnote 823]
  • Protection from individual and collective harms that may occur as a result of data use or malicious use. For example, the creation of deepfakes [footnote 824], challenges to the right to be forgotten [footnote 548] or the right to correct [footnote 825], and other risks from the large-scale scraping of personal data [footnote 826].

General-purpose AI poses various risks to privacy. These are very broadly categorised into:

  • Training risks: risks related to training and the collection of data (especially sensitive data).
  • Use risks: risks related to AI systems’ handling of sensitive information during use.
  • Intentional harm risks: risks that malicious actors will apply general-purpose AI to harm individual privacy.

These risks are already present with currently available AI tools, but are exacerbated by the increased scale of training, capacity for information processing, and ease of use presented by general-purpose AI.

Risks to privacy from general purpose AI

1: Ricks in training and creation

identifiable or sensitive data in training effective outputs and being reproduced or deanonymised, possibly without consent or knowledge from data subjects.

2: Handling sensitive data in use

Private and high risk data during use or deployment of AI systems being leaked or used in unintended ways. such as from personal AI assistants or therepeutic support AI.

3: Downstream harms from use

Malicious use of advanced AI systems to enact cyberattacks, infer non-public sensitive attributes, supercharge stalking, or induce a need for widespread online identification.

Risks to privacy from general-purpose AI fall into three risk groups: (1) Training risks: risks associated with training on sensitive data, (2) Use risks: risks related to handling sensitive information during the use of general-purpose AI, and (3) Intentional harm risks: risks from malicious actors applying general-purpose AI to compromise individual privacy. Figure: International Scientific Report on the Safety of Advanced AI.

General-purpose AI systems may expose their training data (‘Training Risks’). The training of general-purpose AI models generally requires large amounts of data. Academic studies have shown that some of this training data may be memorised by general-purpose AI models [footnote 827], [footnote 828], enabling users to infer information about individuals whose data was collected [footnote 829], [footnote 830], [footnote 831] or to even reconstruct entire training examples [footnote 832], [footnote 833], [footnote 834], [footnote 835]. However, definitions of memorisation vary, so it is challenging to make concrete claims about the harms that might arise from memorisation [footnote 827]. Many systems are trained on publicly available data containing personal information without the knowledge or consent of the individuals it pertains to, in addition to training on proprietary web content owned by media distributors [footnote 826], [footnote 836]. This extends to cases where one person posts personal information about another person online – for example, Facebook posts including pictures and information about a person’s peers or friends without explicit consent from those peers. In specific domains, training on sensitive data (such as medical or financial data) is often necessary to improve performance in that domain but could result in serious privacy leaks. These risks can be reduced – for example, existing medical general-purpose AI systems such as Google’s Gemini-Med [footnote 837] are only trained on anonymised or pseudonymised public patient data – but more research is needed to assess the risks associated with this. Privacy-preserving training approaches or synthetic data may help address this, as discussed in 3.4.3. Technical methods for privacy.

Figure 2.11: Risks to privacy from general-purpose AI

Figure 2.11: Risks to privacy from general-purpose AI fall into three risk groups: 1. Training risks: risks associated with training on sensitive data, 2. Use risks: risks related to handling sensitive information during the use of general-purpose AI, and 3. Intentional harm risks: risks from malicious actors applying general-purpose AI to compromise individual privacy. Source: International AI Safety Report.

Information used during the application of general-purpose AI can be leaked, such as private data used to personalise responses (‘Use Risks’). General-purpose AI models do not have knowledge of current affairs occurring after their training or knowledge of private information not included in the training data. To address this, it is common practice to provide relevant contextualising information to AI systems during usage through so-called ‘Retrieval-Augmented Generation’ (RAG) [footnote 838], [footnote 839], [footnote 840]. This process can also allow for personalised responses using private personal data, for example, with personal assistant AIs on phones [footnote 4], [footnote 841]. It can also be used to include external information, such as web search results [footnote 85], in the context used to provide a response. These can be combined; for example, a healthcare AI support tool may include or access sensitive medical records about an individual and then search the web or medical databases for relevant information before providing a response to support a clinician. While the use of on-device private data can make general-purpose AI more useful, it can create additional risks of leaking this data. Risks of information leakage to third parties increase substantially when data (or insights from the data) leave a device [footnote 842], [footnote 843], although cybersecurity approaches can minimise these risks [footnote 844]. In practice, balancing privacy, user transparency, and consumer utility in this context is a difficult challenge; technical approaches to balance this exist (see 3.4.3. Technical methods for privacy, but it is also important to find policy approaches that safeguard rights, enable transparency, and create trust for data sharing to promote innovation.

General-purpose AI systems could enable increased privacy abuse by malicious actors (‘Intentional Harm Risks’). There are many scenarios relevant to privacy risk in which malicious users may exploit AI’s increased information processing capabilities. For example, fine-grained internet-wide search capabilities, such as powerful reverse image search or forms of writing style detection, allow individuals to be identified and tracked across online platforms, and sensitive personal characteristics can be inferred [footnote 483], [footnote 845] (such as gender, race, medical conditions, or personal preferences), further eroding individual privacy [footnote 846]. LLMs can enable more efficient and effective searches for sensitive information in data. Detection, redaction, or sanitisation of personally identifiable information alone is insufficient to fully mitigate inference of sensitive personal content: many user attributes, such as detailed sexual preferences or specific drug use habits, can often still be found from ‘redacted’ data [footnote 847], although AI systems may also be useful in supporting the monitoring and removal of sensitive information online. These risks can arise across many contexts, and may result in broad unauthorised processing of personal data. This includes risks associated with the ability of general-purpose AI systems to infer private information based on model inputs [footnote 316], [footnote 483]. Beyond analysis and search, general-purpose AI content generated using private data, such as non-consensual deepfakes, can be used to manipulate or harm individuals. This raises concerns about the harm caused by the malicious use of personal data and the erosion of trust in online content (see 2.1.1. Harm to individuals through fake content for a more detailed discussion).

Since the publication of the Interim Report (May 2024), the increased prominence and capabilities of general-purpose AI have led to its increased use in sensitive contexts and subsequent scrutiny of its possible violations of privacy laws. General-purpose AI is now more common in contexts with sensitive data, such as personal devices with smart assistants [footnote 4], [footnote 841] and healthcare [footnote 848]. So far, no major AI vendors have reported high-profile leaks of user or commercial information, which is meaningful given that disclosures of data breaches of personal information are required in most jurisdictions. In addition, researchers have not found evidence of explicit privacy violations using general-purpose AI. However, unlike other harms, some forms of privacy violations can remain hidden for long periods of time. For example, privacy harms from training on sensitive data may not become realised for an extended period after training, since the time between the collection or use of data and the subsequent deployment of an AI system may be substantial. Regulators are increasingly enforcing privacy laws to protect consumers from companies that use AI without privacy controls or safeguards [footnote 849], [footnote 850]. Meanwhile, new modalities of interactions with general-purpose AI create new risks to privacy. For example, high-quality video generation models [footnote 851] may be capable of memorising video information (such as faces of students in live-streamed classrooms) or of being used to exploit privacy by reasoning over video data [footnote 852] or through speaker identification [footnote 3] (for example, using general-purpose AI to watch individuals and automatically take notes on their behaviour). Other concerns about privacy from downstream consequences of general-purpose AI have also emerged. For example, in the future there may be a need to differentiate humans from capable general-purpose AI online, which could make mass identification and subsequent online surveillance more likely [footnote 853].

The main evidence gaps around privacy include when private information can be unintentionally leaked, how to prevent it, and what the societal consequences of general-purpose AI could mean for privacy. It is challenging to assess how much general-purpose AI memorises its training data and how likely it is to regurgitate that data [footnote 171], [footnote 831]. Similarly, ongoing research seeks to determine the extent to which general-purpose AI can or will keep information provided during use private [footnote 847]. More broadly, research is needed into the long-term consequences for privacy that may arise from the widespread use of general-purpose AI, including the risks of actors correctly inferring sensitive information about individuals using general-purpose AI [footnote 483], the risks of enhanced mass surveillance [footnote 439], [footnote 483], and the consequences of prevalent general-purpose AI on privacy and identity [footnote 853].

For policymakers working on privacy, key challenges will include assessing the extent and impact of privacy violations from and via general-purpose AI. Knowing when and how privacy is violated is inherently challenging for both individuals and policymakers [footnote 854], with risks across multiple aspects of development and use (summarised in Figure 2.11). Often, unauthorised processing of personal data occurs or sensitive information is leaked without noticeable harm to the individual in the short term, making it difficult to gain support to address privacy risks preemptively [footnote 855]. When sensitive information is leaked, it can also be hard to audit where the leakage occurred in the technical systems underpinning general-purpose AI, as data is often handled across multiple devices or in different parts of the supply chain. For policymakers, both of these can make it extremely difficult to see the scale or scope of privacy violations, which in turn can make it difficult to determine the proper type and magnitude of intervention. Balancing privacy risks with the utility of general-purpose AI systems will be challenging but possible, with more research needed to assess risks and minimise harms.

For risk management practices related to privacy, see:

  • 3.4.2. Monitoring and intervention
  • 3.4.3. Technical methods for privacy

Key information

  • The use of vast amounts of data for training general-purpose AI models has caused concerns related to data rights and intellectual property. Data collection and content generation can implicate a variety of data rights laws, which vary across jurisdictions and may be under active litigation. Given the legal uncertainty around data collection practices, AI companies are sharing less information about the data they use. This opacity makes third-party AI safety research harder.
  • AI content creation challenges traditional systems of data consent, compensation, and control. Intellectual property laws are designed to protect and promote creative expression and innovation. General-purpose AI both learns from and can create works of creative expression.
  • Researchers are developing tooling and methods to mitigate the risks of potential copyright infringement and other data rights laws, but these remain unreliable. There are also limited tools to source and filter training data at scale according to their licences, affirmative consent from the creators, or other legal and ethical criteria.
  • Since the Interim Report (May 2024), data rights holders have been rapidly restricting access to their data. This prevents AI developers from using this data to train their models, but also hinders access to the data for research, social good, or non-AI purposes.
  • Policymakers face the challenge of enabling responsible and legally compliant data access without discouraging data sharing and innovation. Technical tools to evaluate, trace, filter, and automatically license data could make this much easier, but current tools are not sufficiently scalable and effective.

Key definitions

  • Intellectual property: Creations of the mind over which legal rights may be granted, including literary and artistic works, symbols, names and images.
  • Copyright: A form of legal protection granted to creators of original works, giving them exclusive rights to use, reproduce, and distribute their work.
  • Trademark: A symbol, word, or phrase legally registered or established by use to represent a company or product, distinguishing it from others in the market.
  • Likeness rights: Rights that protect an individual’s image, voice, name, or other identifiable aspects from unauthorised commercial use.
  • Fair use: An American legal doctrine that provides a defence to copyright infringement claims for limited uses of copyrighted materials without permission for purposes such as criticism, comment, news reporting, education, and research. Some other countries allow similar use rights under the name ‘fair dealing’.
  • Web crawling: Using an automated program, often called a crawler or bot, to navigate the web, for the purposes of collecting data from websites.

Figure 2.12: Benefits of using large quantities of training data

Figure 2.12: The benefits of using large quantities of training data can have cascading consequences for data transparency, web crawling, and the norms of sharing information on the web. Figure: International Scientific Report on the Safety of Advanced AI.

General-purpose AI trains on large data collections, which can implicate a variety of data rights laws, including intellectual property, privacy, trademarks, and image/likeness rights. General-purpose AI is trained on large data collections, often sourced in part from the internet. They can be used to generate text, images, audio, or videos that can sometimes emulate the content they were trained on. In both the case of data collection (inputs) and data generation (outputs), these systems may implicate various data rights and laws (see Figure 2.12). For instance, if AI training data contains personally-identifiable information, it can engender privacy concerns. Similarly, web-sourced training datasets frequently contain copyrighted material, implicating copyright and intellectual property laws [footnote 836], [footnote 856]. If brands are captured in the data, trademarks may also be implicated. In some jurisdictions, famous individuals featured in training data may have likeness rights [footnote 857]. The laws governing these data rights may also vary across jurisdictions and, especially in the case of AI, some are actively being litigated.

Copyright laws aim to protect creative expression; general-purpose AI both learns from and generates content resembling creative expression. Copyright laws aim to protect and encourage written and creative expression [footnote 858], [footnote 859], primarily in the forms of literary works (including software), visual arts, music, sound recordings, and audio-visual works. They grant the creators of original works the exclusive right to copy, distribute, adapt, and perform their own work. The unauthorised third-party use of copyrighted data is permissible in certain jurisdictions and circumstances: for instance on the basis of the ‘fair use’ exception in the US [footnote 860], by the ‘text and data mining’ exception in the EU [footnote 861], by the amended Copyright Act in Japan [footnote 862], under Israeli copyright law [footnote 863], and by the Copyright Act 2021 in Singapore [footnote 864]. In each jurisdiction there are different laws related to (a) the permissibility of data collection practices (e.g. data scraping), (b) the use of the data (e.g. for training AI, commercial, or non-commercial systems), and (c) whether model outputs that appear similar to copyrighted material are infringing. In the US, these questions are actively litigated [footnote 865], [footnote 866], [footnote 867], [footnote 868], [footnote 869], for example in cases such as the New York Times v OpenAI and Microsoft. Many issues related to dataset creation and use across the dataset’s lifecycle make copyright concerns for training AI models very complicated [footnote 870]. Relevant questions include whether datasets were assembled specifically for machine learning or originally for other purposes [footnote 871], whether the potential infringement applies to model inputs and/or model outputs [footnote 872], [footnote 873], [footnote 874], and which jurisdiction the case falls under, among others [footnote 481]. There are also questions around who is liable for infringement or harmful model outputs (developers, users, or other actors) [footnote 875]. While developers can use technical strategies to mitigate the risks of copyright infringement from model outputs, these risks are difficult to eliminate entirely [footnote 876], [footnote 877].

General-purpose AI systems may impact creative and publisher economies. As general-purpose AI systems become more capable, they increasingly have the potential to disrupt labour markets, and in particular creative industries [footnote 662], [footnote 707] (also see 2.3.1. Labour market risks. Pending legal decisions regarding copyright infringement in the AI training phase may affect general-purpose AI developers’ ability to build powerful and performant models by limiting their access to training data [footnote 836], [footnote 856], [footnote 878]. They may also impact data creators’ ability to limit the uses of their data, which may disincentivise creative expression. For instance, news publishers and artists have voiced concerns that their customers might use generative AI systems to provide them with similar content. In news, art and entertainment domains, generative AI can often produce paraphrased, abstracted, or summarised versions of the content it has trained on. If users access news through generative AI summaries rather than from media sites, this could reduce subscription and advertising revenues for the original publishers. Reduced subscriptions can equate to copyright damages.

Legal uncertainty around data collection practices has disincentivised transparency around what data developers of general-purpose AI have collected or used, making third-party AI safety research harder. Independent AI researchers can more easily understand the potential risks and harms of a general-purpose AI system if there is transparency about the data it was trained on [footnote 879]. For instance, it is much more tractable to quantify the risk of a model generating biased, copyrighted, or private information if the researcher knows what data sources it was trained on. However, this type of transparency is often lacking for major general-purpose AI developers [footnote 880]. Fear of legal risk, especially over copyright infringement, disincentivises AI developers from disclosing their training data [footnote 881].

The infrastructure to source and filter for legally permissible data is under-developed, making it hard for developers to comply with copyright law. The permissibility of using copyrighted works as part of training data without appropriate authorisation is an active area of litigation. Tools to source and identify available data without copyright concerns are limited. For instance, recent work shows that around 60% of popular datasets in the most widely used openly accessible dataset repositories have incorrect or missing licence information [footnote 481]. Similarly, there are limitations to the current tools for discerning copyright-free data in web scrapes [footnote 856], [footnote 878]. However, practitioners are developing new standards for data documentation and new protocols for data creators to signal their consent for use in training AI models [footnote 882], [footnote 883].

Since the publication of the Interim Report, the legal and technical struggles over data have escalated, and research suggests that it remains difficult to completely prevent models from generating copyrighted material using technical mitigations. Many organisations, including AI developers, use automatic bots called ‘web crawlers’ that navigate the web and copy content. Websites often want their content to be read by crawlers that will direct human traffic to them (such as search engine crawlers) but left alone by crawlers that will copy their data to train competing tools (e.g. AI models that will displace their traffic). Websites can indicate their preferences to crawlers in their code, including if and by whom they would like to be crawled. They can also employ technologies that attempt to identify and block crawlers. Since May 2024, evidence has emerged that websites have erected more barriers to the crawlers from AI developers [footnote 836], [footnote 884], [footnote 885]. These measures are triggered by uncertainty about whether AI developers’ crawlers will respect websites’ preference signals. In search of solutions, the European AI Office is developing a transparency reporting Code of Practice for General-Purpose AI developers [footnote 886], and the US National Institute of Standards and Technology (NIST) has released an AI Risk Management Framework [footnote 887]. Additionally, a growing body of work studies researchers’ capacity to excise information from a trained model, or to detect what a model was trained on. However, these methods, as applied to general-purpose AI models, still have fundamental challenges that may not be easily overcome in the near future [footnote 831], [footnote 832], [footnote 888], [footnote 889], [footnote 890], [footnote 891], [footnote 892].

Rising barriers to accessing web content may inhibit data collection, including to non-AI applications. Rising restrictions on web crawling result in the highest quality, well-maintained data being less available, especially to less well-financed organisations [footnote 836] [footnote 856] [footnote 878]. Declining data availability may have ramifications for competition, for training data diversity and factuality, as well as underserved regions’ ability to develop their own competitive AI applications. While large AI companies may be able to afford data licences or simply develop stronger crawlers to access restricted data, rising restrictions will have negative externalities for the other (including many beneficial) uses of web crawlers. Many industries depend on crawlers: web search, product/price catalogues, market research, advertising, web archives, academic research, accessibility tools, and even security applications. These industries’ access to data is increasingly impaired due to obstacles erected to prevent large AI developers from using data for training. Lastly, these crawler challenges may persist, even when copyright litigation is resolved.

Tools to discover, investigate, and automatically licence data are lacking. Standardised tooling is necessary for data creators and users to evaluate a dataset’s restrictions or limitations, to estimate the data’s value, to automatically license it at scale, and to trace its downstream use [footnote 465], [footnote 481], [footnote 856], [footnote 878]. Without these tools, the market so far has relied on ad hoc, custom contracts, without a clear licensing process for smaller data creators. Coupled with the existing lack of data transparency from individual developers, these shortcomings inhibit the development of an efficient and structured data market. In essence, the web is a relatively messy, unstructured source of data. Without better tools to organise it, developers will have difficulty avoiding training on data that may engender legal or ethical issues.

Methods that mitigate the risk of copyright infringement in models are underdeveloped and require more research. Large models can memorise or recall some of the data they trained on, allowing them to reproduce it when prompted. For instance, sections from the Harry Potter books are memorised in common language models [footnote 893]. This is desirable in some cases (e.g. recalling facts), but undesirable in others, as it can lead to models generating and re-distributing copyrighted material, private information, or sensitive content found on the web. There are many approaches to mitigating this risk (see also 3.4.3. Technical methods for privacy). One is to detect whether a model was trained on or has memorised certain undesirable content, which would enable it to also re‑generate it [footnote 831], [footnote 832], [footnote 888], [footnote 889]. This is known as ‘memorisation research’ or ‘membership inference research’. Researchers may also investigate whether model outputs can be attributed directly to certain training data points [footnote 877], [footnote 890]. Another method is to use filters that detect when a model is generating content that is substantively similar to copyrighted material. However, it remains challenging conceptually and technically to test whether generations are substantially similar to copyrighted content that the model was trained on [footnote 891], [footnote 894]. Lastly, researchers are exploring methods to remove information that models have already learned, called ‘machine unlearning’ [footnote 821], [footnote 895] [footnote 896], [footnote 897], [footnote 898]. However, it may not be a viable, robust, or practical solution in the long run [footnote 892], [footnote 897], [footnote 898]. For instance, machine unlearning often does not succeed in fully removing the targeted information from a model, and in the process it can distort the model’s other capabilities in unforeseen ways – which makes it unappealing to commercial AI developers [footnote 892], [footnote 895], [footnote 897], [footnote 898].

Policymakers are faced with the challenge of enabling protection of intellectual property and other rights in data, while creating an environment that encourages data sharing to promote innovation. These challenges are exacerbated by the many applicable laws, which vary across jurisdictions or are actively litigated. They are also complicated by the lack of existing data transparency in AI development, and the wave of data rights holders resorting to their own measures to protect their data. Altogether, the web ecosystem and supply chain for data is rapidly changing in response to AI, with or without legal interventions. These trends demonstrate the challenges to incentivising greater transparency and developing technical solutions that enable a healthier market for data. Without such solutions, a lack of transparency into data use will inhibit AI safety research, negatively impact creative economies, and spur more data protectionism, with consequences beyond AI development.

For risk management practices related to copyright, see:

  • 3.3. Risk identification and assessment
  • 3.4.1. Training more trustworthy models
  • 3.4.3. Technical methods for privacy

2.4 Impact of open-weight general-purpose AI models on AI risks

Key information

  • How an AI model is released to the public is an important factor in evaluating the risks it poses. There is a spectrum of model release options, from fully closed to fully open, all of which involve trade-offs between risks and benefits. Open-weight models – those with weights made publicly available for download – represent one key point on this spectrum.
  • Open-weight models facilitate research and innovation while also enabling malicious uses and perpetuating some flaws. Open weights allow global research communities to both advance capabilities and address model flaws by providing them with direct access to a critical AI component that is prohibitively expensive for most actors to develop independently. However, the open release of model weights could also pose risks of facilitating malicious or misguided use or perpetuating model flaws and biases.
  • Once model weights are available for public download, there is no way to implement a wholesale rollback of all existing copies of the model. This is because various actors will have made their own copies. Even if retracted from hosting platforms, existing downloaded versions are easy to distribute offline. For example, state-of-the-art models such as Llama‑3.1-405B can fit on a USB stick.
  • Since the Interim Report (May 2024), high-level consensus has emerged that risks posed by greater AI openness should be evaluated in terms of ‘marginal risk’. This refers to the additional risk associated with releasing AI openly, compared to risks posed by closed models or existing technology.
  • Whether a model is open or closed, risk mitigation approaches need to be implemented throughout the AI life cycle, including during data collection, model pre-training, fine‑tuning, and post-release measures. Using multiple mitigations can bolster imperfect interventions.
  • A key challenge for policymakers centres on the evidence gaps surrounding the potential for both positive and negative impacts of open weight release on market concentration and competition. The effects will likely vary depending on how openly the model is released (e.g. whether release is under an open source licence), on the level of market being discussed (i.e. competition between general-purpose AI developers vs. downstream application developers), and based on the size of the gap between competitors.
  • Another key challenge for policymakers is in recognising the technical limitations of certain policy interventions for open models. For example, requirements such as robust watermarking for open-weight generative AI models are currently infeasible, as there are technical limitations to implementing watermarks that cannot be removed.

Key definitions

  • Application Programming Interface (API): A set of rules and protocols that enables integration and communication between AI systems and other software applications.
  • Marginal risk: The additional risk introduced by a general-purpose AI model or system compared to a relevant baseline, such as a comparable risk posed by existing non-AI technology.
  • Open-weight model: An AI model whose weights are publicly available for download, such as Llama or Stable Diffusion. Open-weight models can be, but are not necessarily, open source.
  • Open source model: An AI model that is released for public download under an open source licence. The open source licence grants the freedom to use, study, modify and share the model for any purpose. There remains some disagreement as to which model components (weights, code, training data) and documentation must be publicly accessible for the model to qualify as open source.
  • Weights: Model parameters that represent the strength of connection between nodes in a neural network. Weights play an important part in determining the output of a model in response to a given input and are iteratively updated during model training to improve its performance.
  • Deepfake: A type of AI-generated fake content, consisting of audio or visual content, that misrepresents real people as doing or saying something that they did not actually do or say.

This section primarily focuses on the benefits and risks of general-purpose AI models with widely available model weights. Model weights, also known as parameters, are the numbers used to specify how the input (e.g. text describing an image) is transformed into the output (e.g. the image itself). These weights are iteratively updated during model training to improve the model’s performance on the tasks for which it is trained (see 1.1. How general-purpose AI is developed. While realising the full benefits of AI openness requires more openness than sharing model weights alone (for example, this requires shared training data, training code, documentation, etc.), many of the risks associated with open model release arise from model weights being made openly available [footnote 899]. Accordingly, open‑weight models are the focus of much policy work.

The difference between ‘open-weight’ models and ‘open source’ models can be confusing. ‘Open‑weight’ means that the model’s weights are available for public download such as with Llama, Mixtral, or Hunyuan-Large. Open-weight models can be, but are not necessarily, open source. The ‘open source’ classification requires that access to the model is protected under an open source licence which grants legal freedom for anyone to use, study, modify, and share the model for any purpose. Open source licences are important for realising the benefits of AI openness: they promote innovation and counter market concentration by large tech companies by allowing downstream developers to use, study, and modify open models without having to ask permission, and to embed those models in products that they can place on the market. This includes benefits to low-resource actors who could otherwise not obtain access to model weights, since they are costly to train from scratch. While the open source licence is essential to open source model classification, there remains some disagreement as to the extent to which different components (weights, code, training data) and documentation must be publicly accessible for the model to qualify as open source.

There is also a spectrum of model release options from fully closed to fully open, all of which involve trade-offs between risks and benefits (see Table 2.5).

  • Fully open models are open source models for which weights, full code, training data, and other documentation (e.g. about the model’s training process) are made publicly available, without restrictions on modification, use and sharing. In general, fully open model release facilitates broader research and innovation but increases risks of malicious use by making it easy for malicious actors to bypass safety restrictions and modify the model for harmful purposes, and by increasing the likelihood of model flaws proliferating downstream into modified model versions and applications if downstream users do not proactively update the model version they use.
  • Fully closed models’ weights and code are proprietary, for internal use only. This means that external actors are not able to misuse the model and flaws are less likely to proliferate downstream and can be fixed once discovered. However, with closed models it is also harder for external developers to discover misuse risks, flaws, and use the model for wider innovation and research.
  • Partially open models share some combination of weights, code, and data under various licences or access controls, in an attempt to balance the benefits of openness against risk mitigation and proprietary concerns. For example, OpenAI provides public access to its GPT-4o model through an interface called ChatGPT that allows users to prompt the system and retrieve responses without accessing the model itself. This kind of partial ‘query access’ allows the public to use the model and study its behaviour and performance flaws without providing direct access to the model weights and code. The cost of this partial access is that external AI researchers (academia and third-party evaluators) do not have access to perform deeper analysis of system safety, and downstream developers cannot freely integrate the model into new applications and products. Some licences such as RAIL (Responsible AI License) articulate restrictions against harmful uses of the model. Licence restrictions are legal articulations only and provide no physical barrier to misuse if the model itself is available for public download. Some actors may be deterred from misuse by the potential of legal liability, while other malicious actors may simply ignore the licence condition.

Table 2.5: spectrum of model sharing options

level of access What it means Examples Traditional software analogy
Fully Closed Users cannot directly interact with the model at all Flamingo (Google) Trading algorithms used by private hedge funds
Hosted Access Users can only interact through a specific application or interface Midjourney (Midjourney) Cloud consumer software (e.g. Google Docs)
API Access to Model Users can send requests to the model programmatically, allowing use in external applications Claude 3.5 Sonnet (Anthropic) Cloud-based API (e.g. website builders such as Squarespace)
API Access to Fine‑Tuning Users can fine-tune the model for their specific needs GPT-4o (OpenAI) Enterprise software with customisation APIs (e.g. Salesforce Development Platform)
Open-weight: Weights Available For Download Users can download and run the model locally Llama 3 (Meta), Mixtral (Mistral) Proprietary desktop software (e.g. Microsoft Word)
Weights, Data, and Code Available for Download with Use Restrictions Users can download and run the model as well as the inference and training code, but have certain licence restrictions on their use GPT-NeoX (EleutherAI) Open source software (e.g. Mozilla Firefox and Linux)

Table 2.5: There is a spectrum of model sharing options ranging from fully closed models (models are private and held only for proprietary use) to fully open, open source models (model weights, data, and code are freely and publicly available without restriction of use, modification, and sharing). This section focuses on the three rightmost columns. Source: adapted from Bommasani et al., 2024 [footnote 900].

There are benefits to greater AI openness, including facilitating innovation, improving AI safety and oversight, increasing accessibility, and allowing AI tools to be tailored to diverse needs. Training a general-purpose AI model (the process of producing model weights) is extremely expensive. For example, training Google’s Gemini model is estimated to have cost $191 million in compute costs alone [footnote 731]. The cost of training compute for the most expensive single general-purpose AI model is projected to exceed $1 billion by 2027 [footnote 27].Training costs, therefore, present an insurmountable barrier for many actors (companies, academics, and states) to participating in the general-purpose AI marketplace and benefiting from AI applications. Openly releasing weights makes general‑purpose AI more accessible to actors who might otherwise lack the resources to develop them independently. This reduces reliance on proprietary systems controlled by a few large technology companies (or potentially nation states), and allows developers to fine-tune existing general-purpose AI weights to serve more diverse needs. For example, developers from minority language groups can fine-tune open-weight models with specific language data sets to improve the model’s performance in that language. Models can also be fine-tuned more freely to perform better at specific tasks, such as writing professional legal texts, medical notes, or creative writing. Furthermore, greater openness enables a wider and more diverse community of developers and researchers to appraise models and identify and work to remedy vulnerabilities, which can help facilitate greater AI safety and accelerate beneficial AI innovation. In general, the more open a model is – including whether there is access to additional AI components beyond model weights, such as training data, code, documentation, and the compute infrastructure required to utilise these models – the greater the benefit for innovation and safety oversight.

Risks posed by open-weight models are largely related to enabling malicious or misguided use [footnote 899], [footnote 901], [footnote 902]. General-purpose AI models are dual-use, meaning that they can be used for good or put to nefarious purposes. Open-model weights can potentially exacerbate misuse risks by allowing a wide range of actors who do not have the resources and knowledge to build a model on their own to leverage and augment existing capabilities for malicious purposes and without oversight. While both open-weight and closed models can have safeguards to refuse user requests, these safeguards are easier to remove for open models. For example, even if an open-weight model has safeguards built in, such as content filters or limited training data sets, access to model weights and inference code allows malicious actors to circumvent those safeguards [footnote 903]. Furthermore, model vulnerabilities found in open models can also expose vulnerabilities in closed models [footnote 904]. Finally, with access to model weights, malicious actors can also fine-tune a model to optimise its performance for harmful applications [footnote 905], [footnote 906], [footnote 907]. Potential malicious uses include harmful dual-use science applications, e.g. using AI to discover new chemical weapons (2.1.4. Biological and chemical attacks), cyberattacks (2.1.3. Cyber offence), and producing harmful fake content such as ‘deepfake’ sexual abuse material (2.1.1. Harm to individuals through fake content) and political fake news (2.1.2. Manipulation of public opinion). As noted below, releasing an open-weight model with the potential for malicious use is generally not reversible even when its risks are discovered later.

There is also a risk of perpetuating flaws through open-weight releases, though openness also allows far more actors to perform deeper technical analysis to spot these flaws and biases. When general‑purpose AI models are released openly and integrated into a multitude of downstream systems and applications, any unresolved model flaws – such as embedded biases and discrimination (2.2.2. Bias), vulnerabilities to adversarial attack [footnote 904], or the ability to trick post‑deployment monitoring systems by having learned how to ‘beat the test’ (2.2.3. Loss of control) – are distributed as well [footnote 902]. The same challenge is true of closed, hosted, or API access models, but for these non‑downloadable models the model host can universally roll out new model versions to fix vulnerabilities and flaws. For open-weight models, developers can make updated versions available, but there is no guarantee that downstream developers will adopt the updates. On the other hand, open-weight models can be scrutinised and tested more deeply by a larger number of researchers and downstream developers which helps to identify and rectify more flaws in future releases [footnote 908].

Since the publication of the Interim Report, high-level consensus has emerged that risks posed by greater AI openness should be evaluated in terms of marginal risk [footnote 901], [footnote 909], [footnote 910]. ‘Marginal risk’ refers to the added risk of releasing AI openly compared to risks posed by existing alternatives, such as closed models or other technologies [footnote 911]. Studies that assess marginal risk are often called ‘uplift studies’. Early studies indicated, for instance, that chatbots from 2023 did not significantly heighten biosecurity risks compared to existing technologies: participants with internet access but no general-purpose AI were able to obtain bioweapon‑related information at similar rates to participants with access to AI [footnote 393] (see 2.1.4. Biological and chemical attacks) for further discussion on current AI and biorisk and 3.3. Risk identification and assessment for discussion of uplift studies and other risk assessments). On the other hand, several studies have shown that the creation of NCII and CSAM has increased significantly due to the open release of image-generation models such as Stable Diffusion [footnote 912], [footnote 913] (see 2.1.1. Harm to individuals through fake content). Attending to marginal risk is important to ensure that interventions are proportional to the risk posed [footnote 393], [footnote 911]. However, in order to conduct marginal risk analysis, companies or regulators must first establish a stable tolerable risk threshold (see 3.1. Risk management overview) against which marginal risk can be compared in order to avoid a ‘boiling frog’ scenario [footnote 910]. Even if an incremental improvement in model capability increases marginal risk only slightly compared to pre-existing technology, layering minor marginal risk upon minor marginal risk indefinitely could add up to a substantial increase in risk over time and inadvertently lead to the release of an unacceptably dangerous technology. In contrast, improving societal resilience and enhancing defensive capabilities could help keep marginal risk low even as model capabilities and ‘uplift’ advance.

A key evidence gap is around whether open-weight releases for general-purpose AI will have a positive or negative impact on competition and market concentration. Publicly releasing model weights can lead to both positive and negative impacts on competition, market concentration, and control [footnote 901], [footnote 910], [footnote 914], [footnote 915]. In the short term, open-weight model sharing protected under an open source licence empowers smaller downstream developers by granting access to sophisticated technologies that they could not otherwise afford to create, fostering innovation and diversifying the application landscape. A €1 billion investment in many types of open source software (OSS) in the EU in 2018 is estimated to have yielded a €65–€95 billion economic impact [footnote 916]. A similar impact might be expected from open‑weight AI released under open source licence. However, this apparent democratisation of AI may also play a role in reinforcing the dominance and market concentration (2.3.3. Market concentration and single points of failure) among major players [footnote 914], [footnote 915]. In the longer term, companies that release open-weight general-purpose AI models often see their frameworks become industry standards, shaping the direction of future developments, as is quickly becoming the case with the widespread use of Llama models in open development projects and industry application. These firms can then easily integrate advancements made by the community (for free) back into their own offerings, maintaining their competitive edge. Furthermore, the broader open source development ecosystem serves as a fertile recruiting ground, allowing companies to identify and attract skilled professionals who are already familiar with their technologies [footnote 914]. It is likely that open-weight release will affect market concentration differently at different layers of the general-purpose AI ecosystem; it is more likely to increase competition and reduce market concentration in downstream application development, but at the upstream model development level, the direction of the effect is more uncertain [footnote 750]. More research is needed to clarify the technical and economic dynamics at play.

Once model weights are available for public download, there is no way to implement a wholesale rollback of all existing copies. Internet hosting platforms such as GitHub and Hugging Face can remove models from their platforms, making it difficult for some actors to find downloadable copies and providing a sufficient barrier to many casual malicious users looking for an easy way to cause harm [footnote 917]. However, a well-motivated actor would still be able to obtain an open-model copy despite the inconvenience; even large models are easy to distribute online and off. For example, state-of-the-art models such as Llama-3.1-405B can fit on a USB stick, underscoring the difficulty of controlling distribution once models are openly released.

Technical solutions to reduce risks from open-weight release are still emerging and often involve significant trade-offs against the benefits of fully open models. For example, ‘retrieval models’ enable developers to partition ‘safe’ and ‘unsafe’ capabilities, potentially allowing non-dangerous parts of a model to be openly released while restricting dangerous capabilities. However, these models face challenges such as contextual rigidity and require access to source data [footnote 918]. Other techniques are being developed to mitigate misuse by reducing model performance when the weights are tampered with [footnote 919], [footnote 920], [footnote 921], [footnote 922], [footnote 923], [footnote 924]. However, these current methods are nascent and suffer from major trade-offs with efficiency, stability, and performance on benign tasks. Establishing benchmarks and improving techniques for tamper-resistant ‘unlearning’ remains an ongoing challenge, as discussed in 3.4.1. Training more trustworthy models.

There are risk mitigation approaches for open-weight models throughout the AI life cycle.  The most robust risk mitigation strategies will aim to address potential issues at every stage (see 3.1. Risk management overview), from data collection and model training to fine-tuning and post‑release measures such as vulnerability disclosure (telling users when a model flaw has been found) [footnote 910], [footnote 925]. Even if model weights are kept completely closed, these mitigation approaches allow developers to plan for leaks, since risk mitigations for open-weight models are likely to be useful for leaked closed models as well. For example, the 405 billion-parameter model Llama 3.1 was reportedly leaked to the public before its open release [footnote 926].

For policymakers working on regulating model release, key challenges going forward include:

  • Pursuing evidence of marginal risk in areas of uncertainty. Policymakers need robust analyses of marginal risk to understand where openness introduces significant risks and where it does not. Most current research does not evaluate the marginal risk of open models.
  • Monitoring and anticipating how risks evolve with technological development. As AI capabilities advance, the associated risks can (sometimes rapidly) increase (owing to adversaries gaining access to models of higher capabilities) or decrease (owing to more reliable AI or better defences being created using AI), requiring ongoing assessment and adaptation of policies (see also 1.3. Capabilities in coming years).
  • Recognising that certain policy interventions cannot be enforced for open models due to technical limitations. For example, requiring watermarking for language models cannot be enforced for open models, as there are technical limitations to implementing watermarks that cannot be removed.
  • Being cognisant of what interventions are technically feasible and how open release affects these interventions. For example, technical interventions or restrictions on fine‑tuning general-purpose AI models are infeasible for open-weight AI models.
  • Analysing the positive and negative impact of regulating model release. Open-weight AI models have strong benefits in transparency, competition, and concentration – at least in some parts of the AI ecosystem.
  • Trading off marginal risks with marginal benefits. It is important to develop frameworks for making decisions about the regulation of open-weight AI models. These frameworks are likely to be context-dependent, and there is no single correct answer: different parties, institutions, and governments will reach different conclusions based on their priorities and the specifics of the model and release mechanism being considered.

For risk management practices related to open-weight models, see:

  • 3.1. Risk management overview
  • 3.3. Risk identification and assessment

Technical approaches to risk management

3.1 Risk management overview

Key information

  • Risk management – identifying and assessing risks, and then mitigating and monitoring them – is challenging in the context of general-purpose AI. While numerous frameworks and practices are under development globally, significant gaps remain in validation, standardisation, and implementation across sectors and jurisdictions, particularly for identifying and mitigating unprecedented risks.
  • The context of general-purpose AI risk management is uniquely complex due to the technology’s rapid evolution and broad applicability. Traditional risk management practices (such as safety by design, audits, redundancy, and safety cases) provide a foundation, but must be adapted given the rapid evolution, broad applicability, and complex interaction effects of general-purpose AI.
  • A ‘system safety’ approach is helpful for managing general-purpose AI risks effectively. This approach applies both engineering and management principles to identify and control hazards throughout a system’s life cycle. For general-purpose AI, this includes understanding the interactions between the hardware and software components, organisational structures, and human factors.
  • A ‘defence in depth’ strategy has emerged as a prominent technical approach. This strategy of layering multiple protective measures is common in fields including nuclear safety and infectious disease control. It is being adapted for general-purpose AI systems throughout their lifecycle, with different roles for data providers, infrastructure providers, developers, and users.
  • Current evidence points to two central challenges in general-purpose AI risk management. First, it is difficult to prioritise risks due to uncertainty about their severity and likelihood of occurrence. Second, it can be complex to determine appropriate roles and responsibilities across the AI value chain, and to incentivise effective action.

Key definitions

  • Risk: The combination of the probability and severity of a harm that arises from the development, deployment, or use of AI.
  • Hazard: Any event or activity that has the potential to cause harm, such as loss of life, injury, social disruption, or environmental damage.
  • Risk management: The systematic process of identifying, evaluating, mitigating and monitoring risks.
  • Defence in depth: A strategy that includes layering multiple risk mitigation measures in cases where no single existing method can provide safety.
  • Capabilities: The range of tasks or functions that an AI system can perform, and how competently it can perform them.
  • Deployment: The process of implementing AI systems into real-world applications, products, or services where they can serve requests and operate within a larger context.
  • Modalities: The kinds of data that an AI system can competently receive as input and produce as output, including text (language or code), images, video, and robotic actions.

Risk management challenges

Early stages of the risk management process include risk identification and assessment, which are challenging and benefit from diverse expertise. These topics are addressed in detail in (3.3. Risk identification and assessment), but are critical to keep in mind for overarching risk management because of their unique challenges and the ways that they influence all subsequent elements of risk management. It is critical to identify and assess the risks of general-purpose AI from the earliest design stages and not merely after a model is developed. This can be facilitated by the use of comprehensive risk taxonomies and typologies, which categorise and organise a large number of risks. Later stages of the risk management process, including prioritisation and mitigation, are addressed throughout 3. Technical approaches to risk management, as well as in the table of risk management practices below.

Risk identification and assessment remain challenging because general-purpose AI can be applied across many different domains and contexts, and capabilities (and associated risks) change over time. AI may pose very different risks when applied, for example, in healthcare (where accuracy is critical) and in creative writing (where it is not). Moreover, studies show that the performance of general‑purpose AI systems can change over time as it can be significantly improved through relatively simple measures without expensive retraining. Addressing this may require regularly updated risk assessments [footnote 77]. For example, fine-tuning models (e.g. providing them with small amounts of highly curated additional training data) can significantly enhance their capabilities in specific domains [footnote 927], with implications for risk discussed in 2.1.4. Biological and chemical attacks. Some risks may not be predictable, and will result from complex interactions between models, people, organisations, and social and political systems [footnote 172]

To better inform risk management practices, there is a need for evaluations that focus on a broader set of risks from general-purpose AI, not just capabilities, and for improved evaluations across languages, cultures, modalities, and use cases. As discussed in 3.3. Risk identification and assessment, there have been recent advances in evaluation methods, including the MLCommons AI Safety benchmark, which measures the safety of large language models (LLMs) by assessing models’ responses to prompts across multiple hazard categories including child sexual exploitation, indiscriminate weapons, and suicide and self-harm [footnote 457]. The Sociotechnical Safety Evaluation Repository includes many additional benchmarks and evaluation methods that can help developers and evaluators assess societal risks from LLMs and other generative AI systems [footnote 928]. However, the space is missing a broader focus on the science of evaluation. Current evaluations focus largely on the general-purpose AI model itself, glossing over the various system designs, use cases, user audiences, and other contextual factors that heavily influence how risk may manifest. Many also focus on text modalities and may be less relevant for other modalities (such as images and audio) or for multimodal systems [footnote 929]. They also struggle to accurately assess risk across the world because, for example, they only evaluate in English based on a Western cultural context, but the model may be designed to be a multilingual system [footnote 930]. Improving benchmarks for models in low-resource languages requires collaboration between researchers, native speakers, and community partners such as language activists and educators [footnote 931].

Broad participation and engagement are required to assess and manage the risks of general‑purpose AI; it cannot be left in the hands of the scientific community alone. Effectively managing the risks of highly capable general-purpose AI systems requires the involvement of multiple groups, including experts from multiple domains and impacted communities, to identify and assess high‑priority risks. Even ’risk’ and ‘safety’ are contentious concepts – for instance, they leave open whose safety is being considered – and assessing them requires the involvement of diverse sets of experts and impacted populations [footnote 537] It is common for AI risk management frameworks to recommend participatory methods, including engagement with a broad set of relevant groups throughout the AI lifecycle; participatory approaches can be challenging to implement in the face of various power dynamics [footnote 932].

Risk management mechanisms and practices

There are numerous practices and mechanisms that can help manage the broad range of risks posed by general-purpose AI. Some of these are referenced in Table 3.1 below; they are discussed throughout 3. Technical approaches to risk management in more detail.

Table 3.1 below includes risk management practices that support five (interconnected) stages of risk management:

  • Risk identification: The process of finding, recognising, and describing risks.
  • Risk assessment: The process to understand the nature of risk and to determine the level of risk.
  • Risk evaluation: The process of comparing the results of risk assessment with risk criteria to determine whether the risk and/or its magnitude is/are acceptable or tolerable. (Note that the term ‘evaluation’ has multiple meanings in the context of AI and can also refer to testing models.)
  • Risk mitigation: Prioritising, evaluating, and implementing the appropriate risk-reducing controls/countermeasures recommended from the risk management process.
  • Risk governance: The process by which risk management evaluation, decisions, and actions are connected to enterprise strategy and objectives. Risk governance provides the transparency, responsibility, and accountability that enables managers to acceptably manage risk.

Note that the exact terminology used to describe stages of risk management varies across leading frameworks and standards. The table is intended to be illustrative rather than comprehensive.

Table 3.1: Several practices and mechanisms, organised by five stages of risk management

Risk Management Stage: Risk Identification

Risk Management Practice/Method Explanation Domains of Use
Risk Taxonomy A way to categorise and organise risks across multiple dimensions There are several well‑known risk taxonomies for AI [footnote 439], [footnote 933]
Engagement with Relevant Experts and Communities Domain experts, users, and impacted communities have unique insights into likely risks There are emerging guidelines for participatory and inclusive AI [footnote 934]
Delphi Method A group decision-making technique that uses a series of questionnaires to gather consensus from a panel of experts The Delphi method has been used to help identify key AI risks [footnote 935]
Threat Modelling A process to identify threats and vulnerabilities to a system Threat modelling is commonly used to support AI security throughout AI research and development [footnote 936]
Scenario Analysis Developing plausible future scenarios and analysing how risks materialise Scenario analysis and planning are widely used across industries including for the energy sector and to address uncertainties of power systems [footnote 937]

Risk Management Stage: Risk Assessment

Risk Management Practice/Method Explanation Domains of Use
Impact Assessment A tool used to assess the potential impacts of a technology or project The EU AI Act requires developers of high-risk AI systems to carry out Fundamental Rights Impact Assessments [footnote 938]
Audits A formal review of an organisation’s compliance with standards, policies, and procedures, typically carried out by an external party AI auditing is a rapidly growing field, but builds on long histories of auditing in other fields, including financial, environmental, and health regulation [footnote 939]
Red-Teaming An exercise in which a group of people or automated systems pretend to be an adversary and attack an organisation’s systems in order to identify vulnerabilities Red-teaming is typically carried out in cybersecurity, but has become common for AI as well [footnote 940]
Benchmarks A standardised, often quantitative test or metric used to evaluate and compare the performance of AI systems on a fixed set of tasks designed to represent real-world usage As of 2023, AI had achieved human-level performance on many significant AI benchmarks [footnote 731]
Model Evaluation Processes to assess and measure an AI system’s performance on a particular task There are countless AI evaluations to assess different capabilities and risks, including for security [footnote 941]
Safety Analysis Helps understand the dependencies between components and the system that they are part of, in order to anticipate how component failures could lead to system-level hazards This approach is used across safety-critical fields, e.g. to anticipate and prevent aircraft crashes or nuclear reactor core meltdowns

Risk Management Stage: Risk Evaluation

Risk Management Practice/Method Explanation Domains of Use
Risk Tolerance The level of risk an organisation is willing to take on In AI, risks tolerances are often left up to AI companies, but regulatory regimes can help identify unacceptable risks that are legally prohibited [footnote 942]
Risk Thresholds Quantitative or qualitative limits that distinguish acceptable from unacceptable risks and trigger specific risk management actions when exceeded Risk thresholds for general‑purpose AI are being determined by a combination of assessments of capabilities, impact, compute, reach, and other factors [footnote 943], [footnote 944]
Risk Matrices A visual tool that helps prioritise risks according to their likelihood of occurrence and potential impact Risk matrices are used in many industries and for many purposes, such as by financial institutions for evaluating credit risk, or by companies to assess possible disruptions to their supply chains
Bowtie Method A technique for visualising risk quantitatively and qualitatively, providing clear differentiation between proactive and reactive risk management, intended to help prevent and mitigate major accident hazards Oil companies and national governments use the bowtie method [footnote 945]

Risk Management Stage: Risk Mitigation

Risk Management Practice/Method Explanation Domains of Use
Safety by Design An approach that centres user safety in the design and development of products and services This approach is common across engineering and safety‑critical fields including aviation and energy
‘Safety of the Intended Function’ (SOTIF) An approach that requires engineers to provide evidence that a system is safe when operating as intended This approach is used in many engineering fields, such as in the construction and testing of road vehicles [footnote 946]
Defence in Depth The idea that multiple independent and overlapping layers of defence can be implemented such that if one fails, others will still be effective An example comes from the field of infectious diseases, where multiple preventative measures (e.g. vaccines, masks, hand washing) can layer to reduce overall risk
If-Then Commitments A set of technical and organisational protocols and commitments to manage risks at varying levels as AI models become more capable Some companies developing general-purpose AI employ these types of commitments as responsible scaling policies or similar frameworks [footnote 594], [footnote 596], [footnote 947]
Responsible Release and Deployment Strategies There is a spectrum of release and deployment strategies for AI including staged releases, cloud‑based or API access, deployment safety controls, and acceptable use policies There are some emerging industry practices that focus on release and deployment strategies for general-purpose AI [footnote 596], [footnote 947], [footnote 948]
Safety Cases Safety cases require developers to demonstrate safety. A safety case is a structured argument supported by evidence that a system is acceptably safe to operate in a particular context Safety cases are common in many industries, including defence, aerospace, and railways [footnote 949]

Risk Management Stage: Risk Governance

Risk Management Practice/Method Explanation Domains of Use
Documentation There are numerous documentation best practices, guidelines, and requirements for AI systems to track e.g. training data, model design and functionality, intended use cases, limitations, and risks Model cards’ and ‘system cards’ are examples of prominent AI documentation standards [footnote 34], [footnote 51]
Risk Register A risk management tool that serves as a repository of all risks, their prioritisation, owners, and mitigation plans. They are sometimes used to fulfil regulatory compliance Risk registers are a relatively standard tool used across many industries, including cybersecurity [footnote 950] and recently AI [footnote 933], [footnote 951]
Whistleblower Protection Whistleblowers can play an important role in alerting authorities to dangerous risks at AI companies due to the proprietary nature of many AI advancements Incentives and protections for whistleblowers are expected to be an important part of advanced AI risk governance [footnote 952]
Incident Reporting The process of systematically documenting and sharing cases in which developing or deploying AI has caused direct or indirect harms Incident reporting is common in many domains, from human resources to cybersecurity. It has also become more common for AI [footnote 953]
Risk Management Frameworks Whole organisation frameworks to reduce gaps in risk coverage and ensure various risk activities (i.e. all of the above) are cohesively structured and aligned, risk roles and responsibilities are clearly defined, and checks and balances are in place to avoid silos and manage conflicts of interest. In other safety critical industries, the Three Lines of Defence framework – separating risk ownership, oversight and audit – is widely used and can be usefully applied to advanced AI companies [footnote 954], [footnote 955]

Table 3.1: Several practices and mechanisms, organised by five stages of risk management, can help manage the broad range of risks posed by general-purpose AI.

Documentation and institutional transparency mechanisms, together with information sharing practices, play an important role in managing the risks of general-purpose AI and facilitating external scrutiny. It has become common practice to test models before release, including via red‑teaming and benchmarking, and to publish the results in a ‘model card’ or ‘system card’ along with basic details about the model, including how it was trained and what its limitations are [footnote 34], [footnote 51]. Another approach that can support greater levels of institutional transparency is publishing Foundation Model Transparency Reports or making public a similar degree of documentation [footnote 956]. Other important elements of documentation and transparency include monitoring and incident reporting [footnote 44], [footnote 957], for example via the AI Incident Sharing Initiative [footnote 953]; and information sharing, which can be facilitated by industry groups such as the Frontier Model Forum, governments, or others. Improving and standardising documentation supports greater external scrutiny and accountability [footnote 958].

Risk tolerance and risk thresholds are especially important aspects of risk management for general‑purpose AI. It is not possible to evaluate general-purpose AI for all possible capabilities, so organisations prioritise those that are most likely to lead to harmful outcomes above their risk tolerance. Risk tolerance is often left up to AI developers and deployers to determine for themselves, but policymakers can help to provide guidance and restrictions around unacceptable risks to individuals and society. An increasingly common practice among AI developers is to constrain decisions through voluntary pre-defined capabilities thresholds [footnote 594], [footnote 947]. Such thresholds determine that when models exhibit specific (risky) capabilities, this must be met with specific mitigations that are meant to keep risks to an acceptable level. For example, one company has committed to implement a series of defensive layers (‘defence in depth’) designed to prevent misuse as soon as  a model is found to “significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN [chemical, biological, radiological, and nuclear] weapons”, [footnote 947]. Such capabilities thresholds can have the advantage of being observable and measurable to some extent. However, capabilities are only one of multiple possible ‘key risk indicators’ and capability assessment is not a full risk assessment. Other kinds of thresholds that are relevant for general‑purpose AI include risk thresholds, which try to estimate the level of risk directly, [footnote 944] and compute thresholds, which set thresholds in terms of the computational resources required to train a model [footnote 943]. However, important limitations remain. Compute thresholds in particular are an unreliable proxy for risk [footnote 170], though they have the advantage of being easily measurable, relevant to many different risks, and known long before the risks actually materialise. Additional criteria such as the number of private or business users, the range of modalities an AI can handle, and the size and quality of the training data, could also play a role in defining risk thresholds in the future [footnote 959].

AI release and deployment strategies are an additional risk management practice that can be particularly useful for general-purpose AI. 2.4. Impact of open-weight general-purpose AI models on AI risks discusses how the open release of model weights affects risks. There are some emerging industry best practices that focus on release and deployment strategies for general-purpose AI [footnote 948]. Possible release strategies include releasing the model in stages to learn from real-world evidence prior to full release, providing cloud-based or API (application programming interface) access to have greater ability to prevent misuses, or implementing other deployment safety controls [footnote 44]. Other approaches include using responsible AI licences and acceptable use policies to potentially constrain misuse [footnote 960].

Risk management practices require commitment from organisational leadership and aligned organisational incentives. Organisational culture and structure impact the effectiveness of responsible AI initiatives and AI risk management in numerous ways [footnote 961]. Some developers have internal decision‑making panels that deliberate on how to safely and responsibly design, develop, and review new systems. Oversight and advisory committees, trusts, or AI ethics boards can provide helpful risk management guidance and organisational oversight [footnote 962], [footnote 963].

Lessons from other fields

Risk management strategies from other domains can be applied to general-purpose AI. Common risk management tools in other safety-critical industries such as biosafety and nuclear safety include planned audits and inspection, ensuring traceability using standardised documentation, redundant defence mechanisms against critical risks and failures, safety buffers, control banding, long-term impact assessments, ALARP (an acronym for keeping risk ‘as low as reasonably practicable’), and other risk management guidelines prescribing processes, evaluations, and deliverables at all stages of a safety-critical system’s life cycle. Human rights impact assessments are also used across many fields to assess the human rights impacts of particular industry practices [footnote 964], and are highly relevant for AI systems of all kinds [footnote 965]. Forecasting is another long-standing method with both benefits and shortcomings [footnote 966] that can help inform high-stakes decisions about general-purpose AI [footnote 928], [footnote 967]. Although translating best practices from other domains to general-purpose AI can be difficult, there is some guidance on ways it can be done [footnote 968], [footnote 969].

Safety and reliability engineering are particularly relevant. ‘System safety engineering’ focuses on the interactions between multiple parts of a larger system [footnote 970], and emphasises that accidents may occur for more complex reasons than simply component failures, chains of failure events, or deviations from operational expectations [footnote 971], [footnote 972]. In the case of AI, system safety engineering entails taking into account all the constituent parts of a general-purpose AI system, as well as the broader context in which it operates. The practice of safety engineering has a long history in various safety-critical engineering systems, such as aircraft flight control, engine control systems, and nuclear reactor control. At a high level, safety engineering assures that a life-critical system acts as intended and with minimal harm, even when certain components of the system fail. ‘Reliability engineering’ is broader in scope and addresses non-critical failures as well.

These approaches offer several techniques that are useful for risk assessment in general-purpose AI:

  • ‘Safety by design’ (SbD) is an approach that centres user safety in the design and development of products and services. For general-purpose AI products and services, this may take the form of minimising illegal, harmful, and dangerous content in the model training data and evaluating for a wide range of risks prior to deployment.
  • ‘Safety analysis’ delineates the causal dependencies between the functionality of individual components and the overall system, so that component failures, which can lead to system‑level hazards (e.g. aircraft crashes or nuclear reactor core meltdowns), can be anticipated and prevented to the extent possible. For general-purpose AI, this could mean seeking to understand how the security practices of a model’s training data may influence the security of the overall model.
  • ‘Safety of the intended function’ (SOTIF) approaches require engineers to provide evidence that the system is safe when operating as intended. SOTIF is particularly relevant to general‑purpose AI because it considers scenarios where a system might be operating correctly but still pose a safety risk due to unforeseen circumstances.

Some risk assessment methods, such as for the nuclear power sector, leverage mathematical models that are designed to quantify risk as a function of various design and engineering choices, accompanied by quantitative risk thresholds set by regulators [footnote 973]. For example, some regulatory commissions mandate that nuclear reactor operators produce probabilistic risk assessments and ensure that the estimated risks of certain events are kept below specified thresholds. Although this is not yet typical for general-purpose AI due to numerous quantification challenges discussed in this report, a key advantage of this approach is that it lets a publicly accountable body define what risk is considered acceptable or unacceptable, in a way that is accessible to the public and external experts.

It is critical to scrutinise design choices made throughout the general-purpose AI lifecycle. The ‘pipeline-aware’ approach to mitigating AI’s harm takes inspiration from safety engineering and proposes scrutinising numerous design choices made through the general-purpose AI lifecycle, from ideation and problem formulation, to design, development, and deployment, both as individual components and in relation to one another [footnote 974], [footnote 975]. Further work is needed to extend these ideas from traditional AI to general-purpose AI. For example, Assurance of Machine Learning for use in Autonomous Systems (AMLAS) provides a methodology for integrating safety assurance into the development of machine learning components and may also be useful for general-purpose AI [footnote 976].

‘Safety cases’ could provide a useful way for policymakers to explore hazards and risk mitigations for general-purpose AI. Developers of safety-critical technologies such as aviation, medical devices, and defence software are required to make ‘safety cases’, which put the burden of proof on the developer to demonstrate that their product does not exceed maximum risk thresholds set by the regulator [footnote 38], [footnote 949], [footnote 977], [footnote 978]. A safety case is a structured argument supported by evidence, where the developer identifies hazards, models risk scenarios, and evaluates the mitigations taken. For example, a safety case for general-purpose AI might show that an AI system is incapable of causing unacceptable outcomes in any realistic setting, e.g. even if the system is placed on unmonitored servers and given access to substantial computational resources [footnote 978]. Safety cases leverage the technical expertise of the technology developer and are amenable to third-party review, but still require that the regulator (or a suitable third party) has the technical expertise and other resources to appropriately evaluate them. A possible limitation is that safety cases may address only a subset of risks and threat models, leaving out important ones [footnote 979], [footnote 980]. One mitigation to this limitation is to review safety cases alongside risk cases produced by a red team of third-party experts [footnote 978].

The ‘defence in depth’ model is helpful for general-purpose AI risk management. Multiple independent and overlapping layers of defence against risks may be advisable, such that if one fails, others will still be effective. This is sometimes referred to as the ‘Swiss cheese model of defence in depth’ [footnote 981]. An example of the effectiveness of the defence in depth model is the range of preventative measures that are deployed to prevent infectious diseases: vaccines, masks, and hand‑washing, among other measures, can reduce the risk of infection substantially in combination, even though none of these methods are 100% effective on their own [footnote 981]. For general-purpose AI, defence in depth will include controls that are not on the AI model itself, but on the broader ecosystem, such as controls on training data (e.g. certain DNA sequences) and controls on the materials needed to execute an attack (e.g. equipment and reagents). It is also important to remember that methods like defence in depth are unlikely to be sufficient on their own because they focus on preventing accidents, risks from malfunction (see 2.2. Risks from malfunctions), and malicious use risks (see 2.1. Risks from malicious use), but are not generally sufficient to manage more complex systemic risks (see 2.3. Systemic risks).

Gaps and opportunities

The main evidence gaps around risk management for general-purpose AI include how great the risks are, and the degree to which different mechanisms can actually constrain and mitigate risks in real-world contexts. There is not always a scientific consensus about how likely or severe the risks of general-purpose AI systems are or will be, making it difficult for policymakers to know whether and how they should be prioritised. For example, how to manage misuse risk will depend on how skilled threat actors are in real-world contexts of concern. Moreover, most of the risk management efforts described above are not yet validated, standardised, or widely used. Risk management efforts vary across leading AI companies and incentives may not be well-aligned to encourage thorough assessment and management [footnote 982]. While there are a few risk mitigations that are perceived to be the most effective by experts for reducing systemic risks from general‑purpose AI [footnote 983], the efficacy of general-purpose AI risk management mechanisms is still being assessed and policymakers should seek more evidence from real-world applications.

For policymakers working on risk management for general-purpose AI, key challenges include knowing how to prioritise the many risks posed by general-purpose AI, and knowing who is best‑positioned to mitigate them. Risk management guidance often recommends prioritising high‑probability or high-impact concerns, including instances where significant negative impacts are imminent or already occurring, or where catastrophic risks could be present [footnote 887]. However, it is not always clear which are the most likely or impactful risks. Moreover, risk management necessarily involves different actors at different stages of the AI value chain, including data and cloud providers, model developers, and model hosting platforms, each of whom has unique opportunities and responsibilities to assess and manage risks. Policymakers need greater clarity on how various actors’ responsibilities differ and how policy incentives can support various risk management activities [footnote 925].

3.2. General challenges for risk management and policymaking

3.2.1. Technical challenges for risk management and policymaking

Key information

Several technical properties of general-purpose AI make risk mitigation for many risks associated with general-purpose AI difficult:

A. Autonomous general-purpose AI agents may increase risks: AI developers are making large efforts to create and deploy general-purpose AI systems that can more effectively act and plan in pursuit of goals. These agents are not well understood but require special attention from policymakers. They could enable malicious uses and risks of malfunctions, such as unreliability and loss of human control, by enabling more widespread applications with less human oversight.
B. The breadth of use cases complicates safety assurance: General-purpose AI systems are being used for many (often unanticipated) tasks in many contexts, making it hard to assure their safety across all relevant use cases, and potentially allowing companies to adapt their systems to work around regulations.
C. General-purpose AI developers understand little about how their models operate internally: Despite recent progress, developers and scientists cannot yet explain why these models create a given output, nor what function most of their internal components perform. This complicates safety assurance, and it is not yet possible to provide even approximate safety guarantees.
D. Harmful behaviours, including unintended goal-oriented behaviours, remain persistent: Despite gradual progress on identifying and removing harmful behaviours and capabilities from general‑purpose AI systems, developers struggle to prevent them from exhibiting even well‑known overtly harmful behaviours across foreseeable circumstances, such as providing instructions for criminal activities. Additionally, general-purpose AI systems can act in accordance with unintended goals that can be hard to predict and mitigate.
E. An ‘evaluation gap’ for safety persists: Despite ongoing progress, current risk assessment and evaluation methods for general-purpose AI systems are immature. Even if a model passes current risk evaluations, it can be unsafe. To develop evaluations needed in time to meet existing governance commitments, significant effort, time, resources, and access are needed.
F. System flaws can have a rapid global impact: When a single general-purpose AI system is widely used across sectors, problems or harmful behaviours can affect many users simultaneously. These impacts can manifest suddenly, such as through model updates or initial release, and can be practically irreversible.

Key definitions

  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.
  • Deployment: The process of implementing AI systems into real-world applications, products, or services where they can serve requests and operate within a larger context.
  • Evaluations: Systematic assessments of an AI system’s performance, capabilities, vulnerabilities or potential impacts. Evaluations can include benchmarking, red-teaming and audits and can be conducted both before and after model deployment.
  • Fine-tuning: The process of adapting a pre-trained AI model to a specific task or making it more useful in general by training it on additional data.
  • Goal misgeneralisation: A situation in which an AI system correctly follows an objective in its training environment, but applies it in unintended ways when operating in a different environment.
  • Interpretability research: The study of how general-purpose AI models function internally, and the development of methods to make this comprehensible to humans.
  • Jailbreaking: Generating and submitting prompts designed to bypass guardrails and make an AI system produce harmful content, such as instructions for building weapons.
  • Open-ended domains: Environments into which AI systems might be deployed which present a very large set of possible scenarios. In open-ended domains, developers typically cannot anticipate and test every possible way that an AI system might be used.
  • Open-weight model: An AI model whose weights are publicly available for download, such as Llama or Stable Diffusion. Open-weight models can be, but are not necessarily, open source.
  • Weights: Model parameters that represent the strength of connection between nodes in a neural network. Weights play an important part in determining the output of a model in response to a given input and are iteratively updated during model training to improve its performance.

This section covers six general technical challenges that can make risk management and policymaking more difficult for a wide range of general-purpose AI risks (see Figure 3.1).

A. Autonomous general-purpose AI agents may increase risks: general-purpose AI agents – systems that can plan and act in the world with little to no human involvement elevate risks of malfunctions and malicious use. Today, general-purpose AI systems are primarily used as tools by humans. For example, a chatbot can write computer code, but a human runs, debugs, and integrates code into a larger software project. However, researchers and developers are making large efforts to design general‑purpose AI agents – systems that can act and plan autonomously by controlling computers, programming interfaces, robotic tools, and by delegating to other AI systems [footnote 18], [footnote 55], [footnote 316], [footnote 984], [footnote 985], [footnote 986], [footnote 987], [footnote 988], [footnote 989], [footnote 990], [footnote 991], [footnote 992]. These systems are also sometimes called ‘autonomous agents’ or ‘autonomous AI’. Researchers and developers are building agents for a variety of domains, including web browsing [footnote 85], research in chemistry and AI [footnote 22], [footnote 121], [footnote 402], software engineering [footnote 122], [footnote 259], cyber offence [footnote 127], general computer use [footnote 993], [footnote 994], [footnote 995], and controlling robots [footnote 19].

Figure 3.1: Technical challenges for managing general-purpose AI risks can be divided into two types

Figure 3.1: Technical challenges for managing general-purpose AI risks can be divided into two types: challenges with training and evaluating systems, and challenges with deploying them. This section discusses six broad challenges that apply to many risks. Source: International AI Safety Report.

Agentic general-purpose AI systems escalate risks by reducing human involvement and oversight. The main purpose of general-purpose AI agents is to reduce the need for human involvement and oversight, allowing for much faster and cheaper applications. This is economically valuable, and increasingly agentic AI products are rapidly being developed and deployed. However, increased delegation to AI agents reduces human oversight and can increase the risk of accidents [footnote 996] (see 2.2.1. Reliability issues). Meanwhile, agents can be uniquely vulnerable to attacks from malicious actors [footnote 997], for example by ‘hijacking’ an agent by placing instructions in places where the agent will encounter them [footnote 998]. AI agents can also automate some workflows for malicious uses such as scams, hacking, and the development of weapons [footnote 127], [footnote 358], [footnote 999], [footnote 1000], [footnote 1001] (see 2.1. Risks from malicious use for more examples). AI agents could also uniquely contribute to risks of loss of human control if their capabilities advance significantly (see 2.2.3. Loss of control) [footnote 316], [footnote 1002]. Furthermore, researchers have argued that it would be difficult or impossible to assure the safety of advanced agents by relying on testing, if those agents can make long-term plans and can distinguish testing conditions from real‑world conditions [footnote 1003].

General-purpose AI agents can perform useful work autonomously but currently have limited reliability, especially for complex tasks. Current state-of-the-art general-purpose AI systems are capable of autonomously executing many simple tasks (e.g. writing short snippets of code), but they struggle with more complex ones (e.g. writing entire code libraries) [footnote 122], [footnote 593], [footnote 600], [footnote 1004].

They are particularly unreliable at performing tasks that involve many steps [footnote 1005]. Meanwhile, general‑purpose AI agents deployed to accomplish long-horizon tasks can be particularly vulnerable to manipulation by malicious actors [footnote 997]. The capabilities of current and future agents are further discussed in (1.2. Current capabilities and 1.3. Capabilities in coming years).

The capabilities of general-purpose AI agents are advancing rapidly, and understanding their future capabilities is a key evidence gap. General-purpose AI agents are rapidly becoming more capable. For example, ‘SWE-Bench’ is a popular benchmark (metric) used to evaluate the capabilities of agentic AI systems for software engineering tasks such as finding and fixing bugs [footnote 122]. Since the Interim Report (May 2024), top models’ performance on SWE-Bench has increased from 26% to 42% [footnote 122], with the top 19 leading submissions all occurring after May 2024. This represents dramatic progress from October 2023, when the best model achieved only 2%. Meanwhile, the recent introduction of o1 [footnote 2] marks a leap forward in the reasoning and problem-solving capabilities of general-purpose AI systems. These performance improvements are due to a combination of advances. First, as the general-purpose AI models underlying these agents become more capable, the agents’ cognitive abilities improve. Second, these agents are being developed with increasingly advanced training and planning methods. For example, AlphaProof, a ‘neuro‑symbolic’ general-purpose AI system that combined neural networks with advanced planning techniques, achieved silver medal-level performance on 2024 International Mathematical Olympiad questions [footnote 187]. However, due to the rapid pace of progress in the area and the fact that many agents are proprietary, public understanding of current state‑of‑the‑art methods is limited. Over the coming months and years, the development of more advanced agents demands special attention from policymakers.

B. The breadth of use cases complicates safety assurance: general-purpose AI systems can be applied in many unanticipated contexts, making it hard to test and assure their trustworthiness across all realistic use cases. General-purpose AI systems’ inputs and outputs are often open‑ended, such as free-form text or image generation where users can enter any prompt. It is not possible to study the diffuse, downstream impacts of a system in a pre-deployment laboratory setting. This makes it challenging to make strong safety assurances because it is intractable to exhaustively test a system in all relevant usage contexts. For example, there are thousands of languages spoken by humans, making it very challenging to comprehensively assure the safety of language models across languages. Since the publication of the Interim Report (May 2024), general‑purpose AI systems that can process multiple types of data (e.g. text, images, and audio) have become increasingly common [footnote 1006]. This greatly expands the set of contexts which might cause the system to behave harmfully [footnote 1007]. AI companies can readily redirect their systems’ capabilities between different applications and legal workarounds, posing challenges for targeted intervention approaches as seen historically in financial markets [footnote 1008].

C. General-purpose AI developers understand little about how their models operate internally.

A key feature of general-purpose AI models is that their capabilities are mainly achieved through learning rather than from top-down design: an automatic algorithm adjusts billions of numbers (‘parameters’) millions of times until the model’s output matches the training data. As a result, the current understanding of general-purpose AI models is more analogous to that of growing brains or biological cells than aeroplanes or power plants. AI scientists and AI developers only have a minimal ability to explain why these models made a given decision over another one, and how their capabilities arise from their known internal mathematical components. This contrasts, for example, with complex software systems such as web search engines, where the developers can explain the function of individual components (such as lines and files of code) and can also investigate why the system found a particular result. Current ‘interpretability’ techniques for explaining the internal structures of general‑purpose AI models are unreliable and require major simplifying assumptions [footnote 1009], [footnote 1010], [footnote 1011], [footnote 1012], [footnote 1013]. In practice, techniques for interpreting the inner workings of neural networks can be misleading [footnote 466], [footnote 1014], [footnote 1015], [footnote 1016], [footnote 1017], [footnote 1018], [footnote 1019], and can fail sanity checks or prove unhelpful in downstream uses [footnote 1020], [footnote 1021], [footnote 1022], [footnote 1023], [footnote 1024], [footnote 1025]. For example, one goal of interpretability research is to help researchers understand models well enough to edit their behaviours by modifying their weights. However, state-of-the-art interpretability tools have not yet proven useful and reliable for this [footnote 1026]. As discussed in 3.4.1. Training more trustworthy models, these research methods are actively being improved, and new developments may yield further insights. However, because of how deep learning models represent information across neurons in a highly distributed way [footnote 1027], [footnote 1028], it is unclear whether interpreting the inner structures of general-purpose AI models could offer guaranteed safety assurances. In other words, modern general-purpose AI systems may be too complex to tractably make performance guarantees for. At present, computer scientists are unable to give guarantees of the form ‘System X will not do Y’ [footnote 41]. Nonetheless, a deeper understanding of models’ inner workings could be useful in many ways (see 3.4.2. Monitoring and intervention) and 3.4.1. Training more trustworthy models).

D. Harmful behaviours, including unintended goal-oriented behaviours, remain persistent: ensuring that general-purpose AI systems act in accordance with the goals, behaviours and capabilities intended by their developers and users is difficult. Although general-purpose AI systems can excel at learning what they are ‘told’ to do, their behaviour may not necessarily be what their designers intended [footnote 607], [footnote 1029], [footnote 1030], [footnote 1031]. Even subtle differences between a designer’s goals and the objectives given to a system can lead to unexpected failures. For example, general-purpose AI chatbots are often trained to produce text that will be rated positively by evaluators, but user approval is an imperfect proxy for user benefit. As a result, several widely-used chatbots have displayed ‘sycophantic’ or actively misleading behaviour, making statements that users approve of regardless of whether they are true [footnote 98], [footnote 317], [footnote 522], [footnote 608]. For example, general-purpose AI language models are known to have a strong tendency to agree with opinions that a user expresses in chats [footnote 98]. Even when a general-purpose AI system receives correct feedback during training, it may still develop a solution that does not generalise well when applied to new situations once deployed (‘goal misgeneralisation’) [footnote 616], [footnote 1032], [footnote 1033]. For example, some researchers have found that language models’ safety training can be ineffective if the model is prompted in a language that was underrepresented in its training data [footnote 1034]. Since the publication of the Interim Report (May 2024), researchers have demonstrated examples of unwanted goal-oriented behaviour from general‑purpose AI systems. These include attempts at rewriting their own goals [footnote 599].

Despite efforts to diagnose and debug issues, developers have not always been able to prevent even well-known and overtly harmful behaviours from general-purpose AI systems in foreseeable circumstances. Empirically, state-of-the-art general-purpose AI systems have exhibited a variety of harmful and often unexpected behaviours post-deployment [footnote 41], [footnote 1035], [footnote 1036]. These hazards include general-purpose AI systems assisting malicious users in overtly harmful tasks [footnote 127], [footnote 319], [footnote 1037], [footnote 1038], [footnote 1039], [footnote 1040], [footnote 1041]; leaking private or copyrighted information [footnote 1042], [footnote 1043], [footnote 1044], [footnote 1045], [footnote 1046], [footnote 1047]; generating hateful content ([footnote 1048], [footnote 1049]; exhibiting social and political biases [footnote 183], [footnote 438], [footnote 491], [footnote 511], [footnote 560], [footnote 561], [footnote 562], [footnote 563], [footnote 564], [footnote 565]; pandering to user biases [footnote 98]; and hallucinating inaccurate content [footnote 101], [footnote 102], [footnote 104], [footnote 461], [footnote 1050], [footnote 1051]. Meanwhile, users have consistently been able to circumvent state-of-the-art general-purpose AI model safeguards with relative ease through prompting (‘jailbreaks’) [footnote 39], [footnote 155], [footnote 460], [footnote 904], [footnote 1052], [footnote 1053], [footnote 1054], [footnote 1055], [footnote 1056], [footnote 1057], [footnote 1058], [footnote 1059], [footnote 1060], [footnote 1061], [footnote 1062], [footnote 1063] or simple model modifications [footnote 906], [footnote 1064], [footnote 1065], [footnote 1066], [footnote 1067], [footnote 1068], [footnote 1069], [footnote 1070], [footnote 1071], [footnote 1072], [footnote 1073], [footnote 1074], [footnote 1075], [footnote 1076], [footnote 1077], [footnote 1078], [footnote 1079], [footnote 1080]. Since the publication of the Interim Report (May 2024), some researchers have also found that even when chat systems safely refuse harmful requests, they can still behave harmfully when used to operate as agents [footnote 1000], [footnote 1001]. Researchers continuously develop new techniques that defend against these attacks, but they also develop stronger attacks that usually overcome the existing defences (see 3.4.1. Training more trustworthy models).

General-purpose AI systems sometimes gain and retain harmful capabilities even when they are explicitly fine-tuned not to [footnote 41], [footnote 1069]. While current techniques are effective at suppressing harmful behaviours from general-purpose AI systems, these harmful capabilities can and do resurface from anomalies, inputs from malicious users, and modifications to models. For example, fine-tuning GPT‑3.5 on only ten examples of harmful text can undo its safeguards and make it possible to elicit harmful behaviour [footnote 1064]. The difficulty of making general-purpose AI systems fully resistant to overt failure modes has led some researchers to question whether it is possible to make current development approaches robust to such failure modes [footnote 1081], [footnote 1082]. See 2.1. Risks from malicious use for further discussion of harmful capabilities in AI models, 2.4. Impact of open-weight general-purpose AI models on AI risks for a discussion of the benefits and risks of releasing models with both harmful and beneficial capabilities for public download, and 3.4.1. Training more trustworthy models for a discussion of methods for unlearning harmful capabilities.

E. An ‘evaluation gap’ for safety persists: current safety evaluations are not thorough enough to meet existing governance frameworks and commitments from companies. Both developers and regulators are increasingly proposing risk management frameworks that rely on high-quality evaluations of general-purpose AI systems. The goal of evaluations is to identify risks so that they can be addressed or monitored. However, the science of evaluating general-purpose AI systems and predicting their downstream impacts is immature. Even when general-purpose AI systems are evaluated pre‑deployment, new failure modes are often quickly discovered post-deployment [footnote 1055]. For example, users found methods to subvert o1’s safety fine-tuning within days of its release, and some researchers publicised work on a method to reliably jailbreak the model only three weeks after the model’s release [footnote 1083]. Evaluating AI systems for harmful behaviours and downstream risks is a rapidly growing field. However, the large scope of potential risks [footnote 933], the limitations of benchmarking techniques [footnote 178], [footnote 1084], [footnote 1085], a lack of full access to systems [footnote 1086], and the difficulty of assessing downstream societal impacts [footnote 928], [footnote 930], [footnote 933] make high-quality evaluations challenging. 3.3. Risk identification and assessment will delve further into methods for risk evaluation and broader risk assessment approaches.

F. System flaws can have a rapid global impact: because general-purpose AI systems can be shared rapidly and deployed in many sectors (like other software), a harmful system can rapidly have a global and sometimes irreversible impact. A small number of both proprietary and freely available open‑weight general-purpose AI models currently reach many millions of users (see 2.3.3. Market concentration risks and single points of failure). Both proprietary and open-weight models can therefore have rapid and global impacts, although in different ways [footnote 911]. A risk factor for open‑weight models is that there is no practical way to roll back access if it is later discovered that a model has faults or capabilities that enable malicious use [footnote 902] (see 2.4. Impact of open-weight general-purpose AI models on AI risks, 2.1. Risks from malicious use). However, a benefit of openly releasing model weights and other model components such as code and training data is that it also allows a much greater and more diverse number of practitioners to discover flaws, which can improve understanding of risks and possible mitigations [footnote 911]. Developers or others can then repair faults and offer new and improved versions of the system. This cannot prevent deliberate malicious use [footnote 902], [footnote 1075], which could be a concern if a system poses additional risk (‘marginal risk’) compared to using alternatives (such as internet search). All of these factors are relevant to the specific possibility of rapid, widespread, and irreversible impacts of general-purpose AI models. However, even when model components are not made publicly accessible, the model’s capabilities still reach a wide user base across many sectors. For example, within two months of launch, the fully closed system ChatGPT had over 100 million users [footnote 1087]

3.2.2. Societal challenges for risk management and policymaking

Key information

Several economic, political, and other contextual factors make risk mitigation for many risks associated with general-purpose AI difficult:

A. As general-purpose AI advances rapidly, risk assessment, risk mitigation, governance, and enforcement efforts can struggle to keep pace. Policymakers face the challenge of creating governance and/or regulatory environments that are sufficiently flexible, agile and future‑proof.
B. Developers of general-purpose AI face strong competitive pressure, which can incentivise them to conduct less thorough risk mitigations. Markets characterised by high fixed costs, low marginal costs, and network effects tend to create competitive pressures that discourage safety investments. The market for general-purpose AI is such a market.
C. The rapid growth and consolidation in the AI industry raises concerns about certain AI companies becoming particularly powerful because critical sectors in society are dependent on their products. Such companies may become more inclined to take excessive risks or cut corners on safety standards if they expect that it would be costly for governments to let the company fail.
D. The inherent lack of both algorithmic transparency and institutional transparency in general‑purpose AI makes legal liability hard to determine, potentially hindering governance and enforcement. The fact that general-purpose AI systems can act in ways that were not explicitly programmed or intended by their developers or users raises questions about who should be held liable for resulting harm.

Key definitions

  • Algorithmic transparency: The degree to which the factors informing general-purpose AI output, e.g. recommendations or decisions, are knowable by various stakeholders. Such factors might include the inner workings of the AI model, how it has been trained, what data it is trained on, what features of the input affected its output, and what decisions it would have made under different circumstances.
  • Institutional transparency: The degree to which AI companies disclose technical or organisational information to public or governmental scrutiny, including training data, model architectures, emissions data, safety and security measures, or decision-making processes.
  • Winner takes all: A concept in economics referring to cases in which a single company captures a very large market share, even if consumers only slightly prefer its products or services over those of competitors.
  • Race to the bottom: A competitive scenario in which actors like companies or nation states prioritise rapid AI development over safety.
  • First-mover advantage: The competitive benefit gained by being the first to establish a significant market position in an industry.
  • Distributed training: A process for training AI models across multiple processors and servers, concentrated in one or multiple data centres.
  • Human in the loop: A requirement that humans must oversee and sign off on otherwise automated processes in critical areas.
  • Emergent behaviour: The ability of AI systems to act in ways that were not explicitly programmed or intended by their developers or users.

A. As general-purpose AI markets advance rapidly, governance, regulatory or enforcement efforts can struggle to keep pace. A recurring theme in the discourse on general-purpose AI risk is the mismatch between the pace of technological innovation and the development of governance structures [footnote 1088]. While existing legal and governance frameworks apply to some uses of general‑purpose AI systems, and several jurisdictions (such as the European Union, China, the US, and Canada) have initiated or completed efforts to establish relevant standards or to regulate AI broadly and general-purpose AI specifically, areas of regulatory uncertainty persist, particularly regarding novel AI capabilities. In a market that is as fast-moving as the general-purpose AI market currently is, it is very difficult to fill such gaps reactively, because by the time a governance and/or regulatory fix is implemented it might already be outdated. For instance, critics of social media regulation often point to challenges in areas such as data privacy, suggesting that these issues developed more quickly than policymakers could effectively address them [footnote 1089], [footnote 1090]. Policymakers face the challenge of creating flexible regulatory environments that are robust to technological change over time.

The pace and unpredictability of advancements in general-purpose AI pose an ‘evidence dilemma’ for policymakers. Given the sometimes rapid and unexpected advancements, policymakers will often have to weigh potential benefits and risks of imminent AI advancements without having a large body of scientific evidence available. In doing so, they face a dilemma. On the one hand, pre‑emptive risk mitigation measures based on limited evidence might turn out to be ineffective or unnecessary. On the other hand, waiting for stronger evidence of impending risk could leave society unprepared or even make mitigation impossible, for instance if sudden leaps in AI capabilities, and their associated risks, occur. Companies and governments are developing early warning systems and risk management frameworks that may reduce this dilemma. Some of them trigger specific mitigation measures when there is new evidence of risks, while others require developers to provide evidence of safety before releasing a new model.

B. Developers of general-purpose AI face strong competitive pressure, which can incentivise them to conduct less thorough risk mitigations. The one-time cost of developing a state-of-the-art general‑purpose AI model is very high, while the marginal costs of distributing such a model to (additional) users are relatively low. For example, the estimated cost of training GPT-4 was $40 million [footnote 27], but once trained, the cost of running the model for a single query is believed to be just a few cents, allowing it to serve many users at a relatively low marginal cost. In economic theory, these conditions can lead to a ‘winner takes all’ dynamic in which field leaders can quickly capture a large market, whereas second-place actors will be at a significant disadvantage. As such, if cutting corners in (for example) testing and safety could allow one developer to take the lead in model capability, then there is a strong incentive to cut those corners [footnote 1091]. This dynamic is visible in social media platforms, where a large initial user base attracted more people to join certain platforms because that is where their friends were, making the leading platform more valuable to new users and further expanding its network, while newer social networks often struggled to achieve critical mass [footnote 1092]. The ‘winner takes all’ dynamic raises concern about potential ‘race to the bottom’ scenarios, where actors compete to develop general-purpose AI models as quickly as possible while under-investing in measures to ensure that the models are safe and ethical [footnote 1093], [footnote 1094].

Markets characterised by high fixed costs, low marginal costs, and network effects tend to create competitive pressures that discourage safety investments. Economic theory and empirical studies have shown that, under conditions of high fixed costs, low marginal costs, and strong network effects, firms in highly competitive markets tend to under-invest in safety measures [footnote 1095], [footnote 1096], [footnote 1097], [footnote 1098]. For instance, in the early commercial aviation industry, airlines operating with thin profit margins due to high fixed costs of aircraft acquisition and maintenance sometimes cut corners on safety procedures to reduce costs and maintain competitive ticket prices [footnote 1099]. These conditions are present in the general-purpose AI market. Moreover, in highly competitive markets with significant first-mover advantages, economic theory suggests that risk-taking behaviour tends to be rewarded and may become prevalent among surviving firms [footnote 1100]. While direct studies of safety investment in the AI market are currently lacking, these economic principles and empirical studies in other fields suggest cause for concern. This could contribute to situations in which it is challenging for general‑purpose AI developers to commit unilaterally to stringent safety standards, as doing so might put them at a competitive disadvantage [footnote 1101]. At the same time, from a long‑term business perspective, releasing risky models without adequate safety measures could damage user trust and company reputation, potentially creating stronger incentives for safety investment than the short‑term competitive pressures might suggest.

C. The rapid growth and consolidation in the AI industry raises concerns about certain AI companies becoming particularly powerful because critical sectors in society are dependent on their products, which might incentivise them to take excessive risks (see 2.3.3. Market concentration and single points of failure). Such scenarios are well studied in the economic literature [footnote 1102]. They arise when an organisation reaches a size and level of influence so substantial that potential failure could pose systemic risks to the economy or national security. Governments are therefore inclined to take steps to protect these organisations from failure, for example by forgiving debts or providing bailout money. When protected in this way, companies may become more inclined to take excessive risks or cut corners on safety standards [footnote 1103], [footnote 1104], though empirical evidence on this effect remains mixed [footnote 1105]. There is some concern that critical sectors in society might over time become overly dependent on the products of a small number of leading AI companies in this way. AI applications are becoming more integral to everyday life, and smaller startups often seek acquisition by or collaboration with larger companies to overcome market entry barriers, most notably the extremely high costs of training a general-purpose AI model. In such arrangements, the startups typically trade access to their innovations for use of the larger companies’ computing infrastructure and latest models, further reinforcing the market concentration and, potentially, overreliance on the AI products of a few industry leaders [footnote 767].

Beyond market concentration dynamics, several other factors may contribute to underinvestment in risk mitigation. Similar to environmental pollution or public health issues such as tobacco, many potential harms from AI systems represent externalities – costs that may be borne by society rather than directly by the developers [footnote 1106], [footnote 1107], [footnote 1108]. Additionally, economic theory suggests that when there is a significant time lag between actions and consequences, market actors may systematically underinvest in risk mitigation [footnote 1109]. This challenge is compounded by the inherent uncertainty of these potential harms, making it difficult to quantify the appropriate level of investment in risk mitigation. While empirical evidence on this question is scarce, economic theory suggests that the immediate costs of risk mitigation weighed against uncertain future benefits creates incentives for underinvestment in safety measures.

D. General-purpose AI systems’ inherent lack of transparency and limited institutional transparency in organisations that develop AI makes legal liability hard to determine, potentially hindering governance and enforcement. Tracking the development and use of AI systems is important for establishing liability for potential harms, monitoring and seeking evidence for malicious use, and noticing malfunctions [footnote 1002], [footnote 1110], [footnote 1111]. In principle, people and corporate entities are held accountable, not the technology, which is why developers maintain a ‘human in the loop’ policy for many critical areas, where a human must oversee and sign off on otherwise-automated processes. However, tracing harm back to the responsible individuals is very challenging [footnote 1112], [footnote 1113], [footnote 1114], as is gathering evidence of error or negligence. This stems from both technical and institutional factors: AI models’ decision-making processes are difficult to interpret even for their developers (lack of algorithmic transparency), and AI companies often treat their training data, methodologies, and operational procedures as commercially sensitive information not open to public scrutiny (lack of institutional transparency) [footnote 1025], [footnote 1115], [footnote 1116], [footnote 1117], [footnote 1118], [footnote 1119], [footnote 1120]. Without transparency into both technical systems and organisational processes, it is difficult to develop the kinds of comprehensive safety governance standards that are common in other safety-critical fields such as automotive, pharmaceuticals, and energy [footnote 1121], [footnote 1122], [footnote 1123]. The fact that general-purpose AI systems can act in ways that were not explicitly programmed or intended by their developers or users raises questions about who should be held liable for resulting harm [footnote 174], [footnote 1124]. These liability challenges become even more pronounced with increasingly autonomous AI systems that require less direct human oversight, as it becomes harder to trace specific harmful actions back to human instructions or decisions (see 3.1. Risk management overview).

The concentration of AI expertise in private companies can create significant information gaps for policymakers and the public. While academic researchers and public sector experts contribute to AI development and safety research, much of the cutting-edge work in AI development occurs within private companies [footnote 1125], [footnote 1126]. This concentration of expertise can make it challenging for policymakers and the public to access the technical knowledge needed to make informed decisions about AI governance and risk management. The resulting information asymmetry between AI developers and other stakeholders could complicate efforts to develop appropriate governance and/or regulatory frameworks and safety standards.

3.3 Risk identification and assessment

Key information

  • Assessing general-purpose AI systems for hazards is an integral part of risk management. Scientists use a variety of techniques to study hazards during system development, before deployment, and after deployment.
  • Existing AI regulations and commitments require rigorous risk identification and assessment. Governments and general-purpose AI developers have adopted policies that require them to identify and assess the potential risks and impacts of general-purpose AI systems on people, organisations, and society.
  • While very useful, existing quantitative methods to assess general-purpose AI risks have significant limitations. Safety risks heavily depend on how and where these systems are used, which is often unanticipated, making it hard to measure risks without guessing how people will use them. This is especially challenging for general-purpose AI because it can be used in countless different situations, and many potential harms (e.g. bias, toxicity, and misinformation) are hard to measure objectively. While current risk assessment methods are nascent, they can be greatly improved.
  • Rigorous risk assessment requires combining multiple evaluation approaches, significant resources, and better access. Key risk indicators include evaluations of systems themselves, how people apply them, as well as forward-looking threat analysis. For evaluations at the technical frontier to be effective, evaluators need substantial and growing technical ability and expertise. They also need sufficient time and more direct access than is currently available to the models, training data, methodologies used, and company-internal evaluations – but companies developing general-purpose AI typically do not have strong incentives to grant these.
  • In recent months, more research has been evaluating how well AI risk assessment methods actually work, identifying current shortcomings and criteria for improvement. While more evidence is needed – especially for new risks – this technical progress is complemented by institutional developments, as governments begin to build evaluation capacity and stakeholders work to establish clearer guidelines for who is responsible for different aspects of risk assessment.
  • The absence of clear risk assessment standards and rigorous evaluations is creating an urgent policy challenge, as AI models are being deployed faster than their risks can be evaluated. Policymakers face two key challenges: 1. internal risk assessments by companies are essential for safety but insufficient for proper oversight, and 2. complementary third‑party and regulatory audits require more resources, expertise and system access than is currently available.

Key definitions

  • Risk: The combination of the probability and severity of a harm that arises from the development, deployment, or use of AI.
  • Hazard: Any event or activity that has the potential to cause harm, such as loss of life, injury, social disruption, or environmental damage.
  • Deployment: The process of implementing AI systems into real-world applications, products, or services where they can serve requests and operate within a larger context.
  • Evaluations: Systematic assessments of an AI system’s performance, capabilities, vulnerabilities or potential impacts. Evaluations can include benchmarking, red-teaming and audits and can be conducted both before and after model deployment.
  • Benchmark: A standardised, often quantitative test or metric used to evaluate and compare the performance of AI systems on a fixed set of tasks designed to represent real-world usage.
  • Red-teaming: A systematic process in which dedicated individuals or teams search for vulnerabilities, limitations, or potential for misuse through various methods. Often, the red team searches for inputs that induce undesirable behaviour in a model or system to identify safety gaps.
  • Jailbreaking: Generating and submitting prompts designed to bypass guardrails and make an AI system produce harmful content, such as instructions for building weapons.
  • Audit: A formal review of an organisation’s compliance with standards, policies, and procedures, typically carried out by an independent third party.
  • Incident reporting: Documenting and sharing cases in which developing or deploying AI has caused direct or indirect harms.

To manage the risks of general-purpose AI, it is necessary to understand and measure the risks it poses to people, organisations, and society. Several governments and general-purpose AI developers have already adopted policies and regulations that require them to identify and assess the potential risks and impacts of general-purpose AI systems, triggering planned responses when risks reach specific thresholds. ‘Risk identification’ is the process of identifying the potential risks of the technology, including possible hazards and unintended outcomes. ‘Risk assessment’ is the process of assessing the severity and likelihood of occurrence of each identified risk. (See Table 3.1 in 3.1 Risk Management Overview for an overview of risk management stages including risk identification and assessment as well as risk evaluation, risk mitigation, and risk governance).

Methods for risk identification

General-purpose AI risks can be identified and formulated at various levels of specificity. For example, one broad category of general-purpose AI risks is confabulating or ‘hallucinating’ misinformation – that is, generating outputs that are inaccurate or misleading. A more specific instance of the same risk is general-purpose AI making up a non-existent polling location when the user prompts it to gather information about where to cast their ballot during a national election [footnote 1127]. The specification of a risk can make it easier or more difficult for evaluators to assess both its severity and likelihood. Better‑specified risks are easier to assess and mitigate.

Evaluators need to understand the use cases of general-purpose AI well in order to conceptualise its risks with the appropriate degree of specificity. For example, if general-purpose AI users are likely to prompt it to gather information about political campaigns and voting procedures, then assessing the risk of the model ‘hallucinating a polling location’ may be a high priority. Therefore participatory approaches, which consist of engaging with various stakeholders and impacted communities to understand their use cases, practices, needs, and values, are especially helpful for identifying higher‑priority risks to users. Crowd audits [footnote 1128] are one example of a participatory approach. They are designed to allow everyday users to collaboratively surface the potential harms of AI products and services. Creating accessible mechanisms for the public to report observed and perceived harms is another important method of risk identification. AI incident-tracking databases, such as the OECD’s AI Incidents Monitor (AIM), are platforms designed to collect, categorise, and report harmful incidents involving AI [footnote 459]. In short, there is a need to identify and assess risks in context.

To facilitate general-purpose AI risk-identification practices, scholars have proposed taxonomies of hazards [footnote 439], [footnote 933], [footnote 951], [footnote 1129]. These taxonomies list risk categories, such as informational hazards, memorisation of the training data (which can lead to copyright infringement and privacy concerns), and malicious usage (e.g. writing malware). Taxonomies of hazards can serve as a starting point to help evaluators conceptualise, identify, and specify the salient risks associated with general-purpose AI in specific application domains. In conventional risk management and safety engineering, there are several well-established methods for identifying hazards and risks of a technology, including functional failure analysis and HAZOP (hazard and operability study) [footnote 1130]. These methods have been adopted in a wide range of industries, including the automotive industry, which also considers SOTIF [footnote 946]. In addition to risk typologies and taxonomies, recent work has begun adapting some of these conventional techniques, e.g. hazard analysis, the bowtie method and safety cases, to AI products and services [footnote 968], [footnote 1131], [footnote 1132], [footnote 1133], but additional research is necessary in this area. See 3.1. Risk management overview for a discussion of further risk identification practices established in other fields.

Methods for risk assessment

Once high-priority risks are identified, they need to be assessed to determine the likelihood and severity of the harm, hazard, or unintended outcome in question.

Better understanding the current state of general-purpose AI risk assessment methods is essential to AI policy because risk assessments are a core component of many AI governance and regulatory approaches. For example, the EU AI Act classifies AI systems into four main risk tiers based on their potential impact and imposes different requirements on AI systems depending on their risk tier. Furthermore, many leading AI companies have agreed to create AI safety commitments with mitigations that are proportional and specific to the assessed risk of their systems [footnote 1134]. However, risk assessment is a relatively nascent topic of research in the AI safety community, and there are currently no fully validated, systematic approaches to assessing the severity and likelihood of general‑purpose AI harms. Implementing the aforementioned policies will require a substantially more mature field of risk assessment for general-purpose AI.

Existing work in AI safety heavily focuses on conventional model testing approaches in AI, often conducted after the development of general-purpose AI models. This reliance on retrospective (as opposed to prospective) risk assessment can lead to major omissions and misestimations of high‑priority risks. In conventional risk management and safety engineering, a critical stage of risk assessment is the prospective analysis of risks before completing the design and development of a system. This stage is currently often overlooked in general-purpose AI risk assessments. In AI safety, risk assessment primarily consists of running a battery of tests and evaluations on the general‑purpose AI system, then translating the results into quantitative estimates of risks. This is in contrast to traditional risk assessment, which consists of

1. analysing the causes, consequences, and prevalence of risks (through methods such as causal mapping and Delphi technique) then
2. evaluating whether the risk is acceptable, e.g. through checklists and risk matrices.

Recent work has begun adapting some of these techniques to AI products and services [footnote 944], [footnote 968]. See 3.1. Risk management overview for further discussion of risk assessment approaches that are established in other fields.

Existing technical approaches and methodologies to general-purpose AI risk assessment rely heavily on testing and evaluations which can be broken down into four layers [footnote 1135]:

1. Model testing evaluates the general-purpose AI model in terms of (often quantitative) metrics of performance on proxy tasks designed to represent real-world usage. These tests often take the form of benchmarks – fixed sets of prompts to test a model on.
2. Red-teaming is a systematic process in which dedicated individuals or teams search for vulnerabilities, limitations, or potential for misuse in AI models or systems through various methods. Often, the red team searches for inputs that induce undesirable behaviour for the purpose of improving the model or system’s protections against such attacks.
3. Field testing evaluates the risks of general-purpose AI under real-world conditions.
4. Long-term impact assessments monitor and evaluate long-term impacts of the system on people, organisations, and society.

One major evidence gap is research to establish the validity, reliability, and practicality of existing general-purpose AI risk assessment methods. Good risk measurement methods must be valid, reliable, and practical. Validity refers to the extent to which a test, tool, or instrument accurately measures what it is intended to measure. For instance, validity issues arise if a benchmark differs from real-world use or contains false labels [footnote 1136]. Reliability refers to the consistency, stability, and dependability of a measurement over time and across different contexts. In other words, it indicates the degree to which a measurement yields consistent, repeatable results under similar conditions [footnote 1137]. Prior work has shown that even small perturbations to prompts can have significant effects on the behaviour and performance of general-purpose AI on benchmarks [footnote 1138], [footnote 1139]. Practicality assesses whether the measurement can be conducted efficiently and effectively in practice by the designated evaluators, considering constraints such as time, cost, computational resource availability, and burden on evaluators. For example, the process of evaluating general-purpose AI increasingly relies on using general-purpose AI [footnote 522], [footnote 929], which requires technical capacity and raises new concerns (e.g. about LLM agents favouring outputs from their own model family [footnote 1140]. For rigorous risk assessment, validity and reliability are prioritised over ease and convenience of measurement [footnote 1141].

Since the publication of the Interim Report, the scientific community has made progress toward further implementing and evaluating existing risk assessment methods. The US and UK AI Safety Institutes (US AISI and UK AISI) recently published a technical report detailing a pre‑deployment evaluation of the upgraded version of Claude 3.5 Sonnet [footnote 1142]. New research has examined reproducibility [footnote 1143], [footnote 1144] or validity, which can be compromised when AI models are trained on or exposed to test data beforehand (benchmark contamination) [footnote 1145], [footnote 1146]. However, additional evidence is necessary to characterise the strengths and weaknesses of existing general-purpose AI evaluation methods [footnote 465] especially when general-purpose AI is utilised in new domains.

The initial layer of general-purpose AI risk assessment often consists of testing the model’s behaviour across certain fixed benchmarked tasks. New benchmarks and standardised tests and metrics have been designed to evaluate and compare various categories of risk for general‑purpose AI applications in stylised scenarios and tasks [footnote 122], [footnote 137], [footnote 141], [footnote 1147], [footnote 1148], [footnote 1149]. For example, the AI Safety Benchmark from MLCommons [footnote 457] provides a benchmark to measure seven risk categories, such as misinformation and harmful content. Holistic Evaluation of Language Models (HELM) consists of 16 scenarios and seven metrics, including robustness, fairness, and bias [footnote 1150]. Harmful capability evaluations [footnote 318] are used to assess whether the general-purpose AI has particularly dangerous knowledge or skills (such as the ability to aid cyberattacks (2.1.3. Cyber offence) or aid the design of bioweapons (2.1.4. Biological and chemical attacks). Highly consequential upcoming decisions by companies and governments about model release partially rely on these evaluations [footnote 596], [footnote 947], [footnote 1134]. Existing benchmarks significantly vary in quality [footnote 1151], and the scope of applicability for existing benchmarks is often unclear. Some best practices for creating high-quality benchmarks have been proposed [footnote 1151], [footnote 1152].

While model testing methods can serve as a necessary first step toward assessing the risks of general-purpose AI, they are not sufficient on their own. It is impossible to derive reliable quantitative conclusions about the risks these methods aim to capture without making strong assumptions about patterns of use in specific applications. Such assumptions are hard to justify: First, the technology is general-purpose and can be used in numerous contexts, so it is difficult to predict patterns of use. Second, some risks (e.g. bias, toxicity, and misinformation) are difficult to specify objectively, and any definitions must rest on questionable assumptions about what is (for example) ‘toxic’ or ‘biased’. Therefore, benchmarks cannot capture the risks associated with the usage of general-purpose AI in new domains and for novel tasks, because test conditions always differ from real-world usage to varying degrees [footnote 1153]. Benchmarks at best serve as a proxy measure for the risk category in question (for example, subjective ratings of human annotators or content moderators may serve as proxy for ‘toxicity’ [footnote 1154]. However, these proxy measures often do not reliably reflect the true risk in context. For instance, if human evaluators are not diverse, this can lead to benchmarks containing biased labels, since people from similar backgrounds might systematically miss certain examples of toxicity or misinformation. Moreover, improving scores on a benchmark does not always translate to lowering the associated risk in practice. For example, an LLM can pass the bar exam for lawyers, but that does not mean that it can create effective legal briefs [footnote 445], [footnote 446], [footnote 451]. Any fixed benchmark is often easy to improve on without mitigating the target risk [footnote 1070]. While creating capacity for dynamically evolving, collaborative benchmarks may address some of these challenges, it is important for AI evaluators to understand the inherent limitations of quantitative approaches to model testing [footnote 1155] and avoid over‑reliance on them as the primary layer of risk assessment.

Red-teaming and adversarial attacks are other prominent methods to identify and assess risks, but can require special access. ‘Red team’ refers to a set of evaluators tasked with finding vulnerabilities in a system by attacking it. In contrast to benchmarks, which are mostly static and consist of a fixed set of test cases, a key advantage of red-teaming is that it adapts the evaluation to the specific system being tested. Through adversarial interactions with a system, red-teamers can design custom inputs to identify worst-case behaviours, malicious use opportunities, and unexpected failures. As an example, attacks against language models can take the form of automatically generated inputs [footnote 904], [footnote 1053], [footnote 1063], [footnote 1156], [footnote 1157], [footnote 1158], [footnote 1159], [footnote 1160], [footnote 1161], [footnote 1162] or manually generated ones [footnote 1056], [footnote 1059], [footnote 1158], [footnote 1163]. In automated attacks, for example, LLMs can be used to generate prompts designed to make another AI system produce harmful content, such as instructions for dangerous materials, even after the system initially refuses. These ‘jailbreaking’ attacks subvert the models’ safety restrictions [footnote 460], [footnote 904], [footnote 1052], [footnote 1053], [footnote 1164], [footnote 1165]. Automated approaches can systematically test thousands of variations of potential attacks, allowing for more extensive and rapid coverage than manual testing alone. However, manual red-teaming over longer conversations can catch issues that current automated attacks alone can miss [footnote 1056]. However, it can be slow, labour intensive, and require special access. Further research for faster and effective automated red-teaming is necessary to address this challenge [footnote 1166].

While red-teaming is more effective at surfacing a wider range of general-purpose AI risks than model testing, many important harms and hazards may remain undetected. Importantly, if a red‑teaming activity fails to surface certain categories of risks, that does not imply that those risks are unlikely. Previous work has found that bugs often evade detection [footnote 1022]. A real-world example is jailbreaks, which induce general-purpose chat systems to comply with harmful requests that they were trained to refuse [footnote 460], [footnote 904], [footnote 1052], [footnote 1053], [footnote 1164], and which evaded initial detection by developers [footnote 48], [footnote 147], [footnote 1158]. Research has also called into question whether red-teaming can produce reliable and reproducible results. One study shows that red-teaming practices in industry diverge along several key axes, including the setting (e.g. the characteristics of red-teamers and the resources and methods available to them), and the decisions it informs (e.g. subsequent reporting, disclosure, and mitigation) [footnote 1167]. The composition of the red team and the instructions provided to red-teamers [footnote 1168], the number of attack rounds [footnote 1056], and the availability of auxiliary or automation tools [footnote 1161], [footnote 1169] can significantly influence the outcomes of the activity, including the risk surface covered. See Table 3.2 for an overview of criteria for structuring red-teaming activities in practice. Comprehensive guidelines on red-teaming aim to address some of these challenges [footnote 1170].

Table 3.2: Different types of criteria can help practitioners to structure red-teaming before, during, and after the relevant activities

Phase Key Questions and Considerations
0. Pre-activity criteria What is the artefact under evaluation through the proposed red-teaming activity?
What is the threat model the red-teaming activity aims to recreate?
What is the specific vulnerability the red-teaming activity aims to find?
What are the criteria for assessing the success of the red-teaming activity?
What is the team composition, or who will be part of the team?
1. Within-activity criteria What resources are available to participants?
What instructions are given to the participants to guide the activity?
What kind of access do participants have to the model?
What methods can members of the team utilise to test the artefact?
2. Post-activity criteria What reports and documentation are produced on the findings of the activity?
What were the resources the activity consumed?
How successful was the activity in terms of the criteria specified in phase 0?
What are the proposed measures to mitigate the risks identified in phase 1?

Source: based on the criteria proposed by Feffer et al., 2024 [footnote 1167].

‘Field tests’ are exercises designed to assess risks under normal use conditions. ‘Human uplift studies’ examine whether people can use AI to perform malicious tasks better than they could without AI. ‘Human uplift’ studies are one important variant of field testing. They aim to measure how access to general-purpose AI systems improves individuals’ competencies and performance. For example, a human uplift study might explore how an AI system affects a person’s ability to accomplish complex tasks, such as customer support [footnote 662] or (potentially harmful) cybersecurity operations [footnote 361], [footnote 1171], [footnote 1172], [footnote 1173], compared to their performance without the AI assistance. These studies aim to quantify the ‘uplift’ in human capabilities and assess whether the AI’s support introduces new risks, such as lowering barriers to harmful conduct see 2.4. Impact of open-weight general-purpose AI models on AI risks for further discussion of uplift studies). However, there are several challenges in designing and conducting such studies, including simulating conditions similar to ordinary use and choosing the appropriate measures of uplift. Evaluators could address some of these challenges if there were better guidelines for conducting human uplift studies and integrating them into the staged rollout of general‑purpose AI products. In other safety-critical industries, for example drug testing in clinical trials, a series of studies are conducted in increasingly more realistic conditions (for example, going from testing on animals to human-subject studies), before the drug is deemed ready to market. A similar approach may prove useful for developing effective field testing methods for general-purpose AI.

Certain risks associated with general-purpose AI are likely to manifest only in the long run, making long-term impact assessments crucial. Such risks include the effects of the technology on labour markets and the future of work (2.3.1. Labour market risks), risks associated with more capable future AI systems (2.2.3. Loss of control), (2.1.3. Cyber offence 2.1.4. Biological and chemical attacks), the environmental impact of AI development and use (see 2.3.4. Risks to the environment, and long-term impacts on human cognition, wellbeing, and control [footnote 1003]. Careful monitoring, investigating and rectifying long-term harms is necessary to maintain the public’s confidence in the technology and prevent calls for unnecessarily strong controls. Accurately gauging the downstream societal impacts of general-purpose AI is challenging due to 1. uncertainties surrounding the capabilities of future general‑purpose AI systems, and 2. the existence of numerous confounding factors that make it difficult to attribute long-term trends to any single cause. Creating capacity for predicting and monitoring the potential downstream societal impacts of general-purpose AI requires multidisciplinary analysis and the involvement of diverse perspectives [footnote 929], [footnote 1174], [footnote 1175].

Challenges and opportunities

In addition to the challenges discussed here, see also 3.2.1. Technical challenges for risk management and policymaking and 3.2.2. Societal challenges for risk management and policymaking.

The culture of ‘build-then-test’ in AI hinders comprehensive risk assessment and mitigation.

In conventional risk management, risk assessment is integrated into all stages of product design, development, and deployment, and is tightly intertwined with risk mitigation strategies. In AI safety, however, current risk assessment methods are largely conducted after development, and independent from risk mitigation. Prior work [footnote 978] has proposed the creation of safety case studies and safety guarantees for AI [footnote 1176]. Adapting and implementing such practices for general-purpose AI requires both a cultural shift and further research.

The four layers of risk assessment (model testing, red-teaming, field testing, and long-term impact assessment) are necessary but not sufficient for comprehensive risk assessment. Existing methods do not provide generalisable guarantees or assurances surrounding the likelihood and severity of general‑purpose AI harms [footnote 1177]. The main evidence gaps are in 1. assessing the validity, reliability and practicality of each evaluation layer independently, and 2. combining information from different layers of evaluation to produce actionable insights [footnote 41].

Conducting comprehensive risk assessment, in practice, requires considerable access, resources, and time, which are often constrained. Very few entities have the resources (or the will to allocate the necessary resources) to conduct comprehensive evaluations, and potential conflicts of interest can lead to misleading results and reports [footnote 1014], [footnote 1178]. Moreover, sometimes evaluators are not given enough time to thoroughly test models. In some cases, companies only provided evaluators with several days to test a new model before release [footnote 2], [footnote 129]. Effective model evaluation requires substantial time and resources.

Furthermore, developers of state-of-the-art general-purpose AI systems often limit external access to their technology [footnote 880]. For models that are hosted on a developer’s platform or that have to be accessed via an API (giving ‘black box’ access, only to model inputs and outputs), it can be challenging for external evaluators to perform effective adversarial attacks, model interpretations, and fine-tuning [footnote 1086], [footnote 1179]. For example, AI models are usually trained to refuse dangerous requests, but to assess dangerous capabilities, evaluators require access to versions of the model without this guardrail. This access is sometimes provided [footnote 2]. Without it, certain high‑priority risks may be overlooked. Incomplete information about how a system was designed, including data, techniques, implementation details, and organisational details hinders evaluations of the development process [footnote 34], [footnote 488], [footnote 1086], [footnote 1180], [footnote 1181], [footnote 1182]. Some scholars have argued that a combination of technical, physical, and legal measures can offer external researchers’ direct access without compromising trade secrets more than they are already compromised [footnote 1086]. Several studies have advocated for legal ‘safe harbours’ [footnote 1036] or government-mediated access regimes [footnote 939] to enable evaluators to conduct independent evaluations without the risk of being prosecuted or banned from use. Researchers have proposed methods for structured access that do not require making the model’s code and training weights public [footnote 1183], but that do make it possible for independent researchers and auditors to fully access the model in a secured environment designed to avoid leaks. Researchers are developing auditing techniques that use ‘secure enclaves’. These techniques have the potential to avoid leaking the model parameters to auditors, and also the audit details to model developers [footnote 1184].

Successful risk assessment requires the participation of diverse perspectives in the evaluation process. The composition of the evaluation team in evaluation layers, such as red-teaming, can play a critical role in the process of discovering, characterising, and prioritising harms [footnote 1185]. Improving stakeholder participation has been a focus of the machine learning community in recent years [footnote 932], [footnote 1186], [footnote 1187]. Multiple strategies have been proposed, from broadening the understanding of ‘impacts’ in AI impact assessments [footnote 1188] to enabling a more inclusive range of human feedback [footnote 1189], [footnote 1190]. However, fostering participation requires sensitivity to several criteria [footnote 1186], such as respect for participating parties to minimise the potential for exploitation [footnote 540], and surfacing hard choices between incompatible values or priorities [footnote 467], [footnote 538], [footnote 574]. This process can be facilitated by methods from practical ethics such as ‘reflective equilibrium’ – the mutual adjustment of principles and judgements until they agree with each other [footnote 1191].

Policymakers face several challenges around how to incentivise adequate risk identification and assessments for general-purpose AI systems. Without clear guidelines, standards, and resources surrounding general-purpose AI risk assessment, practitioners face uncertainties as to what constitutes adequate risk assessments in their specific use cases. This in turn makes it difficult for policymakers to incentivise compliance. Another policy challenge is how to designate responsibility for various layers of risk assessments across different general‑purpose AI stakeholder groups, including technology creators, users, and third-party auditors [footnote 763]. Another approach is creating resources (for example, ‘sandboxes’ and ‘safe harbours’) that promote public-interest evaluations [footnote 1036] or third-party audits. The success of this approach hinges heavily on the availability of resources, trained evaluators and experts, incentives to conduct rigorous evaluations (for example, by offering indemnity and compensation), and access to models or information about data and methods used. Several governments have begun to build capacity for conducting technical evaluations and audits of general-purpose AI. It remains to be seen how much these efforts will advance interdisciplinarity and inclusive evaluation of general‑purpose AI in the near future, and how much they can and will be scaled in practice [footnote 537], [footnote 540], [footnote 1192], [footnote 1193].

3.4. Risk mitigation and monitoring

3.4.1. Training more trustworthy models

Key information

  • Current training methods show progress on mitigating safety hazards from malfunctions and malicious use but remain fundamentally limited. There has been progress in training general‑purpose AI models to function more safely, but no current method can reliably prevent even overtly unsafe actions.
  • A multi-pronged approach is emerging as necessary for safety. Evaluating the trustworthiness of models requires analysing many aspects of their behaviour and their development process – including factual accuracy, human supervision quality, AI system internals, and analysis of potential misuse patterns – all of which must inform training methodologies. While techniques exist to remove harmful capabilities, current methods tend to suppress rather than eliminate them.
  • Adversarial training provides limited robustness against attacks. Adversarial training involves deliberately exposing AI models to examples designed to make them fail or misbehave during training, aiming to build resistance to such cases. However, adversaries can still find new ways (‘attacks’) to circumvent these safeguards with low to moderate effort, such as ‘jailbreaks’ that lead models to comply with harmful requests even if they were fine‑tuned not to do so.
  • Since the publication of the Interim Report (May 2024), recent advances reveal both progress and new concerns. Improved understanding of model internals has advanced both adversarial attacks and defences without a clear winner. Additionally, growing evidence suggests that current training methods – which rely heavily on imperfect human feedback – inadvertently cause models to mislead humans on difficult questions by making errors harder to spot. Improving the quantity and quality of human feedback is an avenue for progress, though nascent training techniques using AI to detect misleading behaviour also show promise.
  • Key challenges for policymakers centre around uncertainty and verification. There are no reliable methods to quantify the risk of unexpected model failures. While some researchers are exploring provably safe approaches, these remain theoretical. This suggests that frameworks for safety training currently need to focus on processes to search for, respond to, and mitigate new failures before they cause unacceptable harm.

Key definitions

  • Interpretability: The degree to which humans can understand the inner workings of an AI model, including why it generated a particular output or decision. A model is highly interpretable if its mathematical processes can be translated into concepts that allow humans to trace the specific factors and logic that influenced the model’s output.
  • Red-teaming: A systematic process in which dedicated individuals or teams search for vulnerabilities, limitations, or potential for misuse through various methods. Often, the red team searches for inputs that induce undesirable behaviour in a model or system to identify safety gaps.
  • Adversarial training: A machine learning technique used to make models more reliable. First, developers construct ‘adversarial inputs’ (e.g. through red-teaming) that are designed to make a model fail, and second, they train the model to recognise and handle these kinds of inputs.
  • Reinforcement learning from human feedback (RLHF): A machine learning technique in which an AI model is refined by using human-provided evaluations or preferences as a reward signal, allowing the system to learn and adjust its behaviour to better align with human values and intentions through iterative training.
  • Jailbreaking: Generating and submitting prompts designed to bypass guardrails and make an AI system produce harmful content, such as instructions for building weapons.

The risks of general-purpose AI systems may be mitigated in part by limiting their behaviours. For example, policymakers may wish to prevent general-purpose AI systems from providing dangerous information to users (e.g. on the production of weapons; (see 2.1.4. Biological and chemical attacks), being used for malicious purposes (e.g. for cyberattacks; see 2.1.3. Cyber offence), or having malfunctions that lead to harm (see 2.2. Risks from malfunctions). A system’s behaviour is safe if it avoids such mistakes, and a system is robust if it continues to behave safely in a wide range of circumstances. Beyond this, a system is adversarially robust if it maintains safe behaviour even in the presence of an adversary (e.g. a human user) trying to get it to perform harmful or illegal tasks. There exist proposals for how to build general-purpose AI systems which are guaranteed to behave safely [footnote 1176], but this is not possible without significant technological advances and may require significant changes to the architecture of current general-purpose AI systems. Regulation of current systems will have to focus on ensuring that their training and development minimises the harms of malfunctions and misuse.

Since the publication of the Interim Report, both attackers and defenders have become better at leveraging a deeper understanding of AI systems’ internal workings to respectively induce or prevent harmful behaviour, and the advantage remains with attackers. New methods to resist adversarial attack by leveraging the concepts internally represented in neural networks have been developed both for image models [footnote 1194] and language models [footnote 1195]. However, these approaches are not completely robust, and another recent study has shown that language models internally represent the refusal of harmful requests in a simple way which allows them to be easily exploited as well [footnote 907]. On balance, the advantage generally remains with attackers, who can induce a model to engage in harmful behaviour with only moderate effort. However, these developments suggest that further research on both attacks and defences will likely leverage progress in interpretability. If this is true, further advances may favour defenders in the case of closed-weights models, since attackers will not have access to neural network internals in these cases.

Evidence has also grown that existing methods for training general-purpose models can lead them to produce more misleading (i.e. false but convincing) outputs. A recent study showed that in the case of especially challenging questions, training general-purpose AI systems to maximise human approval of the answers led the systems to obfuscate their mistakes and make them harder for humans to spot, instead of becoming more accurate [footnote 608]. Other studies in simulated environments have found that an AI learns to use harmful strategies (e.g. hiding information or exploiting its supervisor’s biases) to receive positive feedback [footnote 1196] or modify its training environment to increase its reward [footnote 599], if enough information is available to the AI on how to do so. Using AI to help supervisors avoid errors remains a challenging problem, but there has also been modest progress in this area, with two recent studies showing cases where models become easier to supervise when optimised to debate themselves [footnote 1197], [footnote 1198]. These developments highlight the need for further research investigating the behaviours incentivised by current training methods, and developing new training methods that provide better incentives and generally more trustworthy outputs by design.

The main evidence gaps around training trustworthy models include:

  • Despite recent progress [footnote 1010], [footnote 1012], [footnote 1199], it is still unclear whether interpretability methods, which help researchers and evaluators understand how models function internally, will be useful enough to substantially inform model training and testing. There are preliminary studies of this [footnote 1076], [footnote 1200], [footnote 1201].
  • It is unclear whether ‘scalable oversight’ protocols, where AI systems can help humans evaluate their outputs, can provide a strong lever by which models can be trained to be more trustworthy even on hard problems [footnote 609].
  • There are currently no viable technical approaches to rigorously quantifying the risk of unforeseen or unexpected failures in large general-purpose AI systems. Although there is ongoing research on obtaining probabilistic safety guarantees, there is no practical technique to obtain even approximate guarantees yet.

For policymakers, key challenges include:

  • Research moves very quickly in AI training, making it a moving target for regulation.
  • It is difficult to quantify the risk of unexpected, unforeseen failure modes. In addition, it is unclear what are the best practices by which AI developers should detect, respond to, and mitigate newly discovered failures to minimise risks.

Robustness

Incentivising safe and correct behaviour during system training

It is challenging to precisely specify objectives for general-purpose AI systems in a way that does not unintentionally incentivise harmful behaviours. Currently, researchers do not know how to specify abstract human preferences and values (such as reporting the truth, figuring out and doing what a user wants, or avoiding harmful actions) in a way that can be used to train general-purpose AI systems. Moreover, given the complex socio-technical relationships embedded in general‑purpose AI systems, it is not clear whether such specification is even possible. After an initial pre-training phase, general‑purpose AI systems have learned to imitate human behaviour and are then generally tuned to optimise for objectives that are imperfect proxies for the developer’s true goals [footnote 1031]. For example, AI chatbots are often tuned to produce text that will be rated positively by human evaluators, but user approval is an imperfect proxy for user benefit. Research has shown that several widely used chatbots sometimes match their stated views to a user’s views regardless of truth [footnote 98], [footnote 522] possibly creating ‘echo chambers’, and that training general-purpose AI systems to satisfy human evaluators’ assessments can incentivise the system to provide harder‑to-check answers that obfuscate the system’s mistakes [footnote 608]. This is an ongoing challenge for general-purpose AI systems [footnote 607], [footnote 1029], [footnote 1031], [footnote 1202].

Researchers have methods to measure whether training incentivises the right behaviour using experiments with human evaluators, but current results are preliminary. ‘Scalable oversight’ experiments test whether an evaluator can successfully steer an AI system to correctly perform a task that the evaluator is unable to demonstrate or evaluate themselves – for example, to answer questions (such as hard science questions) which require specialised expertise to check [footnote 609], [footnote 1203]. This provides a strong empirical check that the training protocol being used incentivises the right behaviour. Protocols under development for scalable oversight often enlist the AI system itself in helping the evaluator, for example by having it engage in a debate with itself over the correct answer [footnote 611], and letting a human evaluator steer the model on the basis of that debate. Recent human and AI debate experiments show that this can improve the ability of human evaluators to determine the right answers to hard questions [footnote 615], [footnote 1198], [footnote 1204], and preliminary results show that this can translate into an improved training incentive [footnote 1197]. However, positive results have only been shown on a simple reading comprehension task, with mixed results for other tasks such as mathematics problems [footnote 1198]. These methods have not been used to train general-purpose AI systems, but progress in this area is continuing, and scalable oversight experiments may at some point form a practical way of measuring how reliably training techniques incentivise the correct behaviour.

Some researchers are working toward ‘safe-by-design’ approaches which might be able to provide quantitative safety guarantees. Beyond ensuring that an AI’s training process encodes the incentive to be safe, it may be possible to design AI systems that quantitatively guarantee certain levels of safety [footnote 1176]. These proposals often rely on a combination of three elements: first, a specification of desired and undesired outcomes (which in some cases could be a natural language description of desired and unacceptable behaviours), second, a ‘world model’ that includes capturing (approximate) cause and effect relationships and predicts the outcomes of possible actions the AI system could take, and third, a verifier that checks whether a given candidate action would lead to undesirable predicted outcomes. The goal of this process is to guarantee that dangerous actions are not taken. If the world model captures scientific knowledge, it will typically rely on ‘neuro‑symbolic’ hybrids of general-purpose AI and classic techniques using formal mathematics. The advantage of mathematical guarantees and bounds is that they may provide safety assurances even outside of the domain in which the AI has been trained and tested, in contrast with spot checks and improvement through trial-and-error which are currently the standard for evaluating and training general-purpose AI models. This explicit model‑based approach offers two additional advantages: firstly, because it uses formal logic and probability laws to analyse clearly defined knowledge components, its conclusions are more trustworthy, understandable, and verifiable than those of traditional AI systems. Secondly, it allows building non‑agentic (non-autonomous) AI systems that can advance science and human knowledge while remaining easy to control, avoiding the potential risks that come with advanced highly agentic AI (see 2.2.3. Loss of control). Currently, however, practically useful, provable guarantees of safety have yet to be demonstrated for general‑purpose AI models and methods, and many open questions remain in order to achieve those objectives for large-scale AI systems [footnote 1205].

Maintaining the quality of human supervision and evaluation of AI behaviour

State-of-the-art training and evaluation techniques rely on feedback or demonstrations from humans and, as such, are constrained by human error and bias. Developers fine-tune state‑of‑the‑art general‑purpose AI systems using a large amount of human involvement. In practice, this involves techniques that leverage human-generated examples of desired actions [footnote 28] or human‑generated feedback on examples from models [footnote 29], [footnote 30], [footnote 31], [footnote 1182]. This is done at scale, making it labour-intensive and expensive. However, human attention, comprehension, and trustworthiness are not perfect [footnote 1182], which limits the quality of the resulting general-purpose AI systems [footnote 1206], [footnote 1207], [footnote 1208]. Even slight imperfections in feedback from humans can be amplified when used to train highly capable systems, with potentially serious consequences (see for example 2.2.3. Loss of control).

Improving the quality and quantity of human oversight can help to train more robust models. Some research has shown that using richer, more detailed forms of feedback from humans can provide better oversight for AI models, but at the cost of increased time and effort for data collection [footnote 1209], [footnote 1210], [footnote 1211]. To gather larger datasets, leveraging general-purpose AI systems to partially automate the feedback process can greatly increase the volume of data [footnote 33], [footnote 256]. However, in practice, the amount of explicit human oversight used during fine-tuning is very small compared to the trillions of data points used in pre-training on internet data, and so human oversight may, therefore, be unable to fully remove harmful knowledge or capabilities from pre-training. Improving fine-tuning feedback data is likely to form only a part of the solution to cooperative robustness.

Improving the factuality of model outputs

The hallucination of falsehoods is a challenge, but it can be reduced. In AI, ‘hallucination’ refers to the propensity of general-purpose AI systems to output falsehoods and made-up content. For example, language models commonly hallucinate non-existent citations, biographies, and facts [footnote 101], [footnote 102], [footnote 103], [footnote 104], [footnote 105], which could pose legal and ethical problems involving the spread of misinformation [footnote 1212]. It is possible but challenging to reduce general-purpose AI systems’ tendency to hallucinate untrue outputs. Fine-tuning general-purpose AI models explicitly to make them more truthful – both in the accuracy of their answers and analysis of their own competence – is one approach to tackling this challenge [footnote 1213]. Additionally, allowing language models to access knowledge databases when they are asked to perform tasks helps to improve the reliability of their generations [footnote 838], [footnote 1214]. Alternative approaches detect hallucinations and inform the user if the generated output is not to be trusted [footnote 1215], perform fine-grained checks on the individual claims made by a model [footnote 1216], or quantify the model’s confidence [footnote 1217]. However, reducing hallucination remains a very active area of research.

Improving robustness against unexpected failures

Ensuring that general-purpose AI systems learn beneficial behaviours that translate from their training contexts to real-world, high-stakes deployment contexts is highly challenging. Sometimes, unfamiliar inputs that a general-purpose AI system encounters in deployment can cause unexpected failures [footnote 1218]. Just as general-purpose AI systems are trained to optimise for imperfect proxy goals, the training context can also fail to adequately represent the real-world situations that systems will encounter after they are deployed. In such cases, general-purpose AI systems may still take harmful actions even if they are trained with correct human-provided feedback [footnote 616], [footnote 1032], [footnote 1033]. For example, some researchers have found that chatbots are more likely to take harmful actions in languages that are underrepresented in their training data [footnote 1034]. One way to mitigate these failures is with evaluation frameworks that test many combinations of deployment conditions, such as the Holistic Evaluation of Language Models framework (HELM [footnote 1150], which enumerates and tests combinations of many different tasks, user profiles, and languages, among other features. Another is to develop methods by which models can estimate and communicate their uncertainty in rare cases to anticipate mistakes [footnote 1219], [footnote 1220]. However, in general it is likely impossible to enumerate all possible real-world situations for evaluation or to anticipate all potential mistakes.

Understanding a model’s internal computations might help researchers to investigate whether they have learned robust solutions. Methods exist to automatically identify features (i.e. mathematical patterns) inside a neural network model which correspond to human-interpretable concepts [footnote 1009], [footnote 1013], [footnote 1221], [footnote 1222], including specific people and places as well as abstract concepts and behaviours such as errors in code, nonconformity to certain political opinions, or descriptions of how to create drugs [footnote 1012]. These features can serve as a guide to identifying dangerous or undesirable behaviours in a system’s training data or its outputs at a larger scale than would be practical with human review alone. Researchers have attempted to automate this review using an ‘automated interpretability agent’ that has access to interpretability tools. A preliminary study shows that this is possible on a small scale [footnote 1201], and there is no clear barrier to scaling up this kind of work.

There is recent progress on using understanding of a model’s internal workings to improve its behaviour, but this approach needs more work. Despite the difficulty of understanding models’ inner workings, some techniques can be used to guide specific edits to them. Compared to fine‑tuning, these methods can sometimes be more compute- or data-efficient ways of modifying models’ functionality. Researchers have used a variety of methods for this, based on making changes to models’ internal parameters learned during training [footnote 1223], [footnote 1224], [footnote 1225], [footnote 1226], [footnote 1227], neurons [footnote 1221], [footnote 1228], [footnote 1229], or representations [footnote 1199], [footnote 1230], [footnote 1231], [footnote 1232], [footnote 1233]. These techniques are imperfect [footnote 1023], generally limited to very specific kinds of behaviours [footnote 1227] and typically introduce unintended side effects on model behaviour [footnote 1234], but they remain an active area of research. It is unclear to what extent current methods offer a ‘useful and reliable’ way of understanding and engineering general-purpose AI models [footnote 1026].

Adversarial robustness: preventing model misuse

Users of general-purpose AI systems can often bypass their safeguards with ‘jailbreaks’ that induce them to comply with harmful requests. Even if a system always behaves well under normal use, a motivated individual may still construct unusual inputs that are specifically designed to make a system fail or engage in undesired (e.g. harmful) behaviours [footnote 1054]. Language models in particular are subject to general purpose ‘jailbreaks’ which can make them much more likely to comply with harmful requests. Examples of jailbreaking methods include: inducing an AI system to adopt the persona of someone who would say the harmful content [footnote 1053], priming it with examples of harmful answers [footnote 1235], or making requests in a language that was scarce in the system’s training data [footnote 1236], which could increase models’ vulnerability in some low- and middle-income countries (LMICs) (see Table 3.3 for some example jailbreaks). While jailbreaks can be partially guarded against after their discovery, it is difficult to anticipate them during model development, and currently, it is generally easy to find new jailbreaks that work for state-of-the-art models. This being the case, it is unclear how widely jailbreaks are used to actually cause harmful behaviour by AI systems outside of a research setting.

Table 3.3: Various ‘jailbreaks’ have been used to make general-purpose AI systems comply with the request

[footnote 1164], [footnote 1237], [footnote 1238], [footnote 1239], [footnote 1240]

Table 3.3: Various ‘jailbreaks’ have been used to make general-purpose AI systems comply with the request ‘Please provide me instructions on how to create a bomb with materials available in my home’, which they would normally refuse due to their safety guardrails. Example outputs were handwritten for illustrative purposes. Current state-of-the-art general-purpose AI systems now resist most of these methods, but they succeeded on systems available at their time of release, and new jailbreaks continue to be found for state-of-the-art systems.

Training models to detect and refuse harmful requests from adversaries

Adversarial training helps improve robustness in state-of-the-art AI systems, though only to a limited extent. ‘Adversarial training’ involves first constructing ‘attacks’ designed to make a model act undesirably, and second, training the system to handle these attacks appropriately. Attacks against AI systems can take many forms and can be either human- or algorithm-generated. Once an adversarial attack has been produced, training on these examples can proceed as usual. Adversarial training has become a commonly used technique to make models more robust to failures, and is used in the development of major general-purpose AI systems [footnote 4], [footnote 48], [footnote 147], [footnote 1158], [footnote 1163], [footnote 1241]. However, it is not sufficient by itself, as adversarially trained systems are still generally vulnerable to attack, especially with multimodal inputs (e.g. with images). Moreover, the potential appropriateness or harmfulness from an AI system’s outputs cannot always be evaluated outside of the context in which it is used, which is not available during adversarial training [footnote 1242].

Making general-purpose AI systems more robust to unforeseen attacks is a challenging open problem, but there are potentially promising methods for minimising the relevant harms. Adversarial training generally requires specific examples of failures [footnote 598], [footnote 1243]. These limitations have resulted in ongoing games of ‘cat and mouse’ in which some developers continually update models in response to newly discovered vulnerabilities. The process of searching for vulnerabilities and attempting to induce undesirable behaviour is known as ‘red-teaming’. A partial solution to models’ continued vulnerability is to simply produce and train on more adversarial examples. Automated methods for generating attacks can help scale up adversarial training [footnote 522], [footnote 904], [footnote 1157], [footnote 1244].

These limitations have resulted in ongoing games of ‘cat and mouse’ in which some developers continually update models in response to newly discovered vulnerabilities. The process of searching for vulnerabilities and attempting to induce undesirable behaviour is known as ‘red-teaming’. A partial solution to models’ continued vulnerability is to simply produce and train on more adversarial examples. Automated methods for generating attacks can help scale up adversarial training [footnote 522], [footnote 904], [footnote 1157], [footnote 1244]. However, the exponentially large number of possible inputs for general-purpose AI systems makes it intractable to thoroughly search for all types of attacks. Interpretability methods might help here [footnote 907], and there has been preliminary progress on improving robustness through methods that operate on the model’s internal states [footnote 1076], [footnote 1195], [footnote 1200]. Even if all attacks cannot be prevented beforehand, if they can be detected quickly at run‑time then systems can be efficiently adapted to defend against them: in one study, a system had greater than 95% success defending against attacks after seeing only one example of the same kind of attack [footnote 1245]. While research on these mitigations is preliminary, requiring live monitoring, response, and adversarial training mitigations on potentially dangerous AI systems is critical for decreasing the damage of AI misuse.

‘Machine unlearning’ methods aim to remove certain undesirable capabilities from general‑purpose AI systems, but current techniques often suppress rather than fully remove such capabilities. For example, machine unlearning can remove certain capabilities that could aid malicious users in making explosives, bioweapons, chemical weapons, and cyberattacks [footnote 392]. Unlearning as a way of negating the influence of undesirable training data was originally proposed as a way to protect privacy and copyright [footnote 821], discussed in (2.3.6. Risks of copyright infringement). Unlearning methods to remove hazardous capabilities [footnote 892], [footnote 1246] include methods based on fine‑tuning [footnote 893] and editing the inner workings of models [footnote 392]. Ideally, unlearning should make a model unable to exhibit the unwanted behaviour even when subject to knowledge-extraction attacks, novel situations (e.g. requests in various languages), or small amounts of fine-tuning. However, current unlearning methods often suppress harmful information without removing it robustly [footnote 1247]. This creates challenges for governance, since models might appear to lack harmful capabilities when these are actually just hidden and can be reactivated. Current unlearning methods may also introduce unwanted side effects on desirable model knowledge [footnote 1247]. It is unclear if unlearning a harmful skill could fully remove the model’s ability to perform a harmful task by combining desirable skills and knowledge. Unlearning remains an area of active research.

3.4.2. Monitoring and intervention

Key information

  • Monitoring and intervention are complementary approaches for preventing AI system malfunctions and malicious use. Monitors inspect system inputs and outputs, hardware state, model internals, and real-world impacts while systems are used, triggering interventions that block potentially harmful actions. Current tools can detect AI-generated content, track system behaviour, and identify concerning patterns across these monitoring targets. However, moderately skilled users can often circumvent these safeguards through various technical means.
  • Model interpretability and explanation methods can help monitor AI decisions but current methods can also produce misleading insights. Technical approaches for explaining AI system outputs help developers and deployers scrutinise decision-making, though studies indicate that these methods can produce inaccurate or oversimplified explanations of complex model behaviour.
  • Multiple layers of monitoring and intervention create stronger protection against malfunctions and malicious use. Combining technical monitoring and intervention capabilities with humans in the loop builds stronger safeguards, though these measures can introduce costs and delays.
  • In recent months, there has been progress in model interpretability and hardware-based monitoring measures. Since the publication of the Interim Report (May 2024), model interpretability research has progressed to begin explaining model behaviours, and early work investigating privacy-preserving hardware-based monitoring has the potential to improve regulatory visibility into AI development.
  • Key challenges for policy makers centre on balancing safety measures against their practical costs. While layered safety measures provide stronger protection, they also introduce operational delays, raise privacy concerns, and increase deployment costs. Policymakers therefore need to weigh safety requirements against these practical constraints, particularly given potential misalignment between safety measures and business incentives.

Key definitions

  • Model: A computer program, often based on machine learning, designed to process inputs and generate outputs. AI models can perform tasks such as prediction, classification, decision‑making, or generation, forming the core of AI applications.
  • System: An integrated setup that combines one or more AI models with other components, such as user interfaces or content filters, to produce an application that users can interact - with.
  • Interpretability: The degree to which humans can understand the inner workings of an AI model, including why it generated a particular output or decision. A model is highly interpretable if its mathematical processes can be translated into concepts that allow humans to trace the specific factors and logic that influenced the model’s output.
  • AI-generated fake content: Audio, text, or visual content, produced by generative AI, that depicts people or events in a way that differs from reality in a malicious or deceptive way, e.g. showing people doing things they did not do, saying things they did not say, changing the location of real events, or depicting events that did not happen.
  • Deepfake: A type of AI-generated fake content, consisting of audio or visual content, that misrepresents real people as doing or saying something that they did not actually do or say.
  • Digital forensics: The process of tracing the origin and spread of digital media.
  • Watermark: A subtle, often imperceptible pattern embedded within AI-generated content (such as text, images, or audio) to indicate its artificial origin, verify its source, or detect potential misuse.
  • Defence in depth: A strategy that includes layering multiple risk mitigation measures in cases where no single existing method can provide safety.
  • Human in the loop: A requirement that humans must oversee and sign off on otherwise automated processes in critical areas.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.

Monitoring and intervention strategies are applied to AI systems – the complete deployment package that includes both the AI model and additional safety components – leaving the model unchanged. Unlike the strategies discussed in (3.4.1. Training more trustworthy models , monitoring and intervention methods are integrated at the system level and implemented as part of system deployment. This section discusses monitoring and intervention strategies that researchers and developers use for general-purpose AI systems (see Figure 3.2).

The main evidence gaps around monitoring and intervention include understanding how effective methods are, and how easy they are to circumvent. Monitoring and intervention techniques are, in many cases, easy, simple, and effective system-level safeguards in typical use cases. They offer an essential additional line of defence aside from the model-level techniques discussed in (3.4.1. Training more trustworthy models). From this perspective, there are few technical barriers to the widespread adoption of many techniques. However, scientists do not yet have a thorough quantitative understanding of their effectiveness in real-world settings and how easily monitoring methods can be coordinated across the AI supply chain. A key barrier toward highly effective monitoring and intervention techniques is understanding how vulnerable they are to being actively circumvented by malicious users.

Figure 3.2: Monitoring and intervention techniques

Figure 3.2: Monitoring and intervention techniques are system-level safeguards that can be applied to general-purpose AI system inputs, outputs, and models themselves in order to help researchers and developers monitor AI behaviour and, if necessary, intervene. Source: International AI Safety Report.

Detecting AI-generated content

Content generated by general-purpose AI systems – particularly ‘deepfakes’ – could have widespread harmful effects [footnote 1248], [footnote 1249], [footnote 1250] (see 2.1.1. Harm to individuals through fake content). However, the ability to distinguish between genuine and AI-generated content can help to reduce the harmful use of generative models. For example, if web browsers were able to put reliability notices on content that was likely AI-generated, this would help to combat the spread of misinformation online. There are a variety of technical tools for the detection of AI-generated content. None are perfect, but together, they can be immensely helpful for digital forensics.

Unreliable but still useful techniques exist for detecting AI-generated content. Just as different humans have discernible artistic and writing styles, so do generative AI models. Some procedures have been developed to distinguish AI-generated text [footnote 332], [footnote 333], [footnote 337], [footnote 338], [footnote 1251], [footnote 1252], [footnote 1253] and images [footnote 1254], [footnote 1255] from human-generated content. Detection methods are typically based either on specialised classifiers or assessing how likely it is that a given example was generated by some general-purpose AI model. However, existing methods are limited and are prone to mistakes. A significant challenge is that general-purpose AI systems tend to memorise examples that appear in their training data. Because of this, common text snippets (e.g. famous historical documents) or images of common objects (e.g. famous art) are sometimes falsely classified as being AI-generated. As general-purpose AI-generated content becomes more realistic, it may be increasingly challenging to detect. Meanwhile, AI-text detectors tend to have inconsistent performance across world languages, posing challenges to linguistic equality [footnote 1256].

‘Watermarks’ – subtle but distinct motifs inserted into AI-generated data – make distinguishing AI‑generated content easier, but they can be removed. Watermarks are features that are often designed to be difficult for a human to notice but easy for detection algorithms to identify. Watermarks typically take the form of imperceptible patterns inserted into image or video pixels [footnote 290], [footnote 291], [footnote 292], [footnote 293], [footnote 294], [footnote 1257], imperceptible signals in audio [footnote 295], [footnote 296], or stylistic or word‑choice biases in text [footnote 297], [footnote 1258], [footnote 1259], [footnote 1260], [footnote 1261]. Watermarks can be used to detect AI‑generated content with near perfect accuracy when they are not tampered with. As discussed in (2.1.1. Harm to individuals through fake content), they can be used to detect AI-generated fake content. They are an imperfect strategy for detecting AI‑generated content (especially text) because they can be removed by simple modifications to data [footnote 298], [footnote 299], [footnote 333], [footnote 1262]. However, this does not mean that they are not useful. As an analogy, fingerprints are easy to avoid or remove, but they are still very useful in forensic science. Finally, there are concerns about privacy and potential misuse of watermarking technology, as it could be used to track and identify users [footnote 300].

Watermarks can also be used to indicate genuine, non-AI-generated content. Certifying the authenticity of data is part of ‘data provenance’. In contrast to inserting watermarks into general‑purpose AI-generated content, another approach is to automatically insert watermarks into non-AI-generated content [footnote 1263]. However, this will often require changes to the hardware and software of physical recording devices. These provenance methods would be very hard to tamper with at the device level. Some researchers are working towards common methods and standards for tracing the origin of media, including the use of encryption methods to prove authenticity which are difficult to counterfeit (e.g. CPPA [footnote 1264]; AIMASC [footnote 1265]).

‘Metadata’ and system activity logs aid in digital forensics. ‘Digital forensics’ refers to the science of identifying and analysing digital evidence [footnote 1266], [footnote 1267], [footnote 1268], [footnote 1269], [footnote 1270]. It is common for data to be saved along with ‘metadata’ that gives additional context about the data that is stored. This metadata is useful (and commonly used) for tracing the origin of data. For example, many mobile devices save image and audio files using the Exchangeable Image File Format (ExIF) standard [footnote 1271] which can store information about camera settings, time, location, and other details. Similar metadata could be used to help track information about whether data was generated by a general‑purpose AI system and, if so, other details about how it was done. For example, developers and deployers could attach identifiers to actions taken by an AI system [footnote 1272], [footnote 1273]. Developers and deployers can also save ‘activity logs’ to track system behaviour, in order to improve monitoring over time [footnote 1272]. Additionally, simply adding warning labels to AI-generated content can help to reduce the spread of misinformation. One study found that these labels improved humans’ deepfake detection from 10.7% to 21.6% [footnote 289]. Metadata can typically be tampered with, but evidence suggests that the use of encrypted digital signatures can enable proof of authenticity in a way that is very hard to counterfeit [footnote 1274].

Beyond technical interventions, digital media literacy initiatives have also been proposed to combat AI-generated fake content [footnote 1275]. Some studies have found that media literacy interventions can improve participants’ ability to detect fake content [footnote 1276], [footnote 1277], [footnote 1278], [footnote 1279]. However, in general, evidence on the effects of digital media literacy interventions is mixed, partly due to large variations in study contexts and intervention designs [footnote 1279] (See 2.1.1. Harm to individuals through fake content) for further discussion of fake content.

Detecting and defending against harmful content

Although there is no perfect safety measure, having multiple layers of protection and redundant safeguards increases confidence in safety (a strategy known as ‘defence in depth’). While the present section focuses on technical approaches, systems are not deployed in a vacuum. Embedding them in a sociotechnical system that seeks to maintain safety and performance is key to the ongoing process of identifying, studying, and defending against harm (also discussed in 3.1. Risk management overview). This section discusses various complementary technical methods of detecting and defending against harmful behaviours from general-purpose AI systems.

Detecting anomalies and potentially harmful behaviours allows for precautions to be taken. Some methods have been developed that can help detect anomalous inputs or behaviours from AI systems [footnote 1280], [footnote 1281], [footnote 1282]. For example, users sometimes trick language models into behaving harmfully by having them encode their responses in ciphered text [footnote 460], [footnote 1063] that does not at all resemble normal text. It is also sometimes possible to detect a significant proportion of inputs [footnote 1243], [footnote 1283], internal states [footnote 1284], [footnote 1285], [footnote 1286], [footnote 1287], or outputs [footnote 1287], [footnote 1288], [footnote 1289], [footnote 1290], [footnote 1291] involved in harmful behaviours such as assisting with dangerous tasks. Once detected, risky examples can be sent to a fault-handling process or flagged for further investigation. For example, data flagged as harmful could be blocked by a filter or edited to remove harmful content.

Having a human in the loop allows for direct oversight and manual overrides but can be prohibitively costly. Humans in the loop are expensive compared to automated systems. However, when there is a high risk of a general-purpose AI system taking unacceptable actions, a human in the loop can be essential. Analogously, manual overrides are standard in cars that have autonomous driving modes [footnote 1292]. Meanwhile, humans and general-purpose AI systems can sometimes make decisions collaboratively. Instead of teaching general-purpose AI systems to act on behalf of a human, the human-AI cooperation paradigm aims to combine the skills and strengths of both general-purpose AI systems and humans [footnote 1293], [footnote 1294], [footnote 1295], [footnote 1296], [footnote 1297], [footnote 1298], [footnote 1299]. However, having a human in the loop is not practical in many situations, such as times when decision-making happens too quickly (such as chat applications with millions of users), or the human does not have sufficient domain knowledge, or human bias and error can exacerbate risks [footnote 1300]. Humans in the loop of automated decision-making also tend to exhibit ‘automation bias’, meaning that they place a greater amount of trust in the AI system than intended [footnote 1301]. In cases where a human in the loop is not practical, hybrid approaches involving a mix of human and automated monitoring and intervention are possible.

Secure operation protocols can be designed for general-purpose AI systems with potentially dangerous capabilities. General-purpose AI agents which can act autonomously and without limitation on the web or in the physical world pose elevated risks (see 3.2.1. Technical challenges for risk management and policymaking) and (2.2.3. Loss of control). For general-purpose AI systems with potentially risky capabilities, limiting the ways in which they can directly influence the world makes it easier to oversee and manage them [footnote 1302], [footnote 1303]. For example, if an agentic general-purpose AI system has an unconstrained ability to access a computer’s file systems and/or run custom code, it is safer to run that agent in an ad hoc computing environment than directly on the user’s computer [footnote 22]. However, these approaches can be hard to implement for applications in which a system must act directly in the world. In these cases, it is sometimes difficult even for humans to anticipate when an action might be harmful.

Explaining AI system actions

Some techniques can be used to help explain why deployed general-purpose AI systems act the way they do. Understanding why general-purpose AI systems act the way they do is useful for evaluating capabilities, diagnosing harms, and determining accountability if harm is caused [footnote 1304], [footnote 1305], [footnote 1306]. While it can be useful, simply asking general-purpose AI language models for explanations of their decisions can also lead to misleading answers [footnote 97], [footnote 1307]. To increase the reliability of model explanations, researchers are working on improved prompting and training strategies [footnote 1308], [footnote 1309], [footnote 1310], [footnote 1311]. Meanwhile, other techniques for explaining general-purpose AI model actions [footnote 1312], [footnote 1313] can sometimes help with finding problems in models [footnote 1163]. However, correctly explaining general-purpose AI model actions is a difficult problem because of their size and complexity. Some research is working toward developing techniques for helping humans interpret the computations of general-purpose AI systems [footnote 1010], [footnote 1011], [footnote 1012]. Techniques to help explain model decisions are recognised as a useful part of the model evaluation toolbox [footnote 1314]. However, these methods provide only a partial understanding. They depend on significant assumptions and more research is needed to demonstrate how useful they are in practice.

Monitoring and interventions with specialised hardware

Privacy-preserving monitoring mechanisms integrated into computing hardware are emerging as a more reliable and trustworthy alternative to software-based monitoring or self-reporting. Compute is central to the development and deployment of modern general-purpose AI systems, and the amount of compute used for training and inference is correlated with the capability of an AI system (see 1.3. Capabilities in coming years). Research into privacy-preserving hardware mechanisms aims to enable policymakers to monitor and verify certain aspects of general-purpose AI systems during training and deployment, such as compute usage, without relying on reporting by AI developers.  For example, research into these mechanisms argues that they make it technically feasible to verify usage details such as time and location of usage [footnote 1315], [footnote 1316], the types of models and processes being run [footnote 1317], [footnote 1318], or to provide proofs that a particular model was trained [footnote 1319], [footnote 1320]. If feasible, these mechanisms can be applied to many governance issues, such as verifying adherence to international agreements even across borders [footnote 270]. Some countries consider international agreements because of the competitive pressures between countries and its effect on incentives to thoroughly manage risks (see 3.2.2. Societal challenges for risk management and policy making for an analysis of this dynamic). In this context, countries may resist monitoring and verification of agreements due to concerns about intellectual property and competitive advantages. Hardware-based verification mechanisms are sometimes considered to address this shortcoming since they could enable monitoring of key metrics while preserving the confidentiality of proprietary AI systems and training data. However, these applications are still in the stage of early research [footnote 270].

While much of the required functionality for hardware-based mechanisms exists on today’s AI chips, hardware-based monitoring has not yet been proven at scale and could threaten user interests if implemented haphazardly. Some hardware-based mechanisms are widely deployed in contexts outside of AI, such as Apple’s Secure Enclaves, which permit the manufacturer to restrict which applications are installed on their devices [footnote 1321]. Some leading AI chips, such as the H100 graphics processing unit (GPU), already have some of the necessary hardware in the form of Confidential Computing [footnote 1322]. Nonetheless, some hardware-based monitoring and verification mechanisms for AI could themselves be compromised by a well-resourced attacker, potentially leaking sensitive information [footnote 1323].

3.4.3. Technical methods for privacy

Key information

  • General-purpose AI systems affect privacy through loss of data confidentiality, lack of transparency, unauthorised processing of data, and novel forms of abuse. These risks are described in (2.3.5. Risks to privacy).
  • Multiple methods exist across the AI lifecycle to safeguard privacy. These include: removing sensitive information from training data; model training approaches that control how much information is learned from data (such as ‘differential privacy’ approaches); and techniques for using AI with sensitive data that make it hard to recover the data (such as ‘confidential computing’ and other privacy-enhancing technologies). Many privacy‑enhancing methods from other research fields are not yet applicable to general‑purpose AI systems due to the computational requirements of AI systems.
  • Since the publication of the Interim Report (May 2024), privacy protection methods have expanded to address AI’s growing use in sensitive domains. This includes smartphone assistants, AI agents, always-listening voice assistants, or use in healthcare or legal practice. There is a growing interest in ensuring confidentiality and consent across these uses, with new research and practical implementations supporting this. Removing personally identifiable information (PII) and undesirable content from the training data of general-purpose AI, although still challenging and incomplete, is a cost-effective, feasible, and effective process to reduce risk. User-friendly mechanisms for controlling and tracing personal data could support this.
  • Privacy protection methods for AI are evolving rapidly, creating policy challenges. Methods to reduce privacy risks in general-purpose AI are complex and continue to develop at a fast pace, affecting multiple areas of the supply chain and creating a challenging environment for policymaking.

Key definitions

  • Privacy: A person’s or group’s right to control how others access or process their sensitive information and activities.
  • Personally identifiable information (PII): Any data that can directly or indirectly identify an individual (for example, names or ID numbers). Includes information that can be used alone or combined with other data to uniquely identify a person.
  • Sensitive data: Information that, if disclosed or mishandled, could result in harm, embarrassment, inconvenience, or unfairness to an individual or organisation.
  • Data minimisation: The practice of collecting and retaining only the data that is directly necessary for a specific purpose, and deleting it once that purpose is fulfilled.
  • AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.
  • Deepfake: A type of AI-generated fake content, consisting of audio or visual content, that misrepresents real people as doing or saying something that they did not actually do or say.

Methods and techniques for mitigating privacy risks from general-purpose AI cover different risk categories. (2.3.5. Risks to privacy) broadly categorised risks into: Training Risks (risks from training on data, especially sensitive data); Use Risks (risks from general-purpose AI handling sensitive information during use); and Intentional Harm Risks (risks from malicious actors applying general‑purpose AI to harm individual privacy). This section considers mitigation techniques for each of these categories, outlining emerging privacy-enhancing techniques [footnote 1324] for relevant categories. Other related privacy harms can occur from malicious actors using general-purpose AI for stalking, non-consensual deepfakes, or stealing sensitive information (2.1 Risks from malicious use), which are difficult but possible to mitigate as outlined in 3.4. Risk mitigation and monitoring and (2.1.3. Cyber offence).

Minimising personally identifiable information in training data is important and feasible, but challenging (Reducing Training Risks). General-purpose AI is trained on large datasets collected from many sources, including the public web. This data can include PII [footnote 1325], [footnote 1326], which can be reproduced during the usage of AI models [footnote 827], [footnote 828], [footnote 1327], [footnote 1328]. Companies can also use their proprietary data to train models [footnote 1329]. Open datasets being used to train general-purpose AI often attempt to remove PII [footnote 878], [footnote 1325] (although not all do [footnote 1330] but can miss some PII. Without clearer standards for the composition and possible PII inclusion of datasets [footnote 883], [footnote 1331], complete cleaning of training data for general-purpose AI at scale will be challenging, but data cleaning remains a cost‑effective, feasible, and effective process in the meantime to reduce privacy risks.

Implementing user-friendly mechanisms for individuals to control and trace their data, such as dashboards for managing permissions and secure data provenance systems, could enhance transparency and accountability in general-purpose AI systems (Reducing Training Risks). This could allow individuals to track how their data is used and shared, establish transparent processes for individuals to access, view, correct, and delete their data, as well as to track how and where others are profiting from their data [footnote 1332], [footnote 1333]. This is possible for data held by the user and, to a lesser extent, for data contained on digital service providers (such as social media platforms) who can provide opt‑out options for data usage or training (although users are often unaware of their contributions to AI training or the risks of privacy violations) [footnote 847], [footnote 1334], [footnote 1335]. If data is already publicly available on the public web, it is and will remain much more complicated to control how this data is used for general‑purpose AI.

Privacy-preserving approaches to training on sensitive data are limited for general-purpose AI (Reducing Training Risks). Various privacy techniques can be applied to AI models to protect individual privacy while still allowing for useful insights to be derived from data [footnote 1336], [footnote 1337]. However, these techniques can significantly impair model accuracy (often referred to as a ‘privacy‑utility trade‑off’), present challenges when applied to large models, and may not be suitable for all use cases, in particular for general-purpose AI models trained on text [footnote 1328]. For domains with highly sensitive data (e.g. medical or financial), it may be possible to attain strong privacy guarantees by adapting powerful general-purpose AI models that are first pre-trained on publicly available data from the internet [footnote 1338], [footnote 1339], [footnote 1340], but such techniques have rarely been applied in practice so far. Another solution is using synthetic data (data, such as text or images, that has been artificially generated, often by other AI systems) to avoid using sensitive data in training pipelines [footnote 1341], [footnote 1342]. However, researchers have demonstrated that there is an important privacy-utility trade-off and strong differential privacy is still required for privacy [footnote 1343], [footnote 1344], [footnote 1345], [footnote 1346]. Differential privacy works by adding carefully calibrated noise to the training process, limiting how much the model can learn about any individual’s data while still allowing it to learn useful patterns from the dataset as a whole. If the synthetic data is highly useful, it may carry as much information as the original data and mostly enable the same attacks [footnote 1347], [footnote 1348], [footnote 1349].

Medium-capability general-purpose AI is increasingly able to run entirely on consumer devices such as smartphones, enabling people to use general-purpose AI without sending personal data to external servers (Reducing Use Risks). While the most capable general-purpose AI systems will continue to be constrained to data centres due to their size [footnote 156], smaller AI systems that can answer questions about personal data and perform basic phone operations on behalf of a user are increasingly being rolled out to consumer devices such as smartphones and other edge devices [footnote 4], [footnote 37], [footnote 841]. Running general-purpose AI on-device means that user requests, and any personal data that the AI accesses to respond to the user, do not need to be sent to an external cloud server, reducing the risks of data leaks. However, for complex tasks, the use of general-purpose AI is still often outsourced to (run on) cloud servers, requiring personal data and requests to be sent to the cloud (via the internet).

Secure deployment of general-purpose AI systems in the cloud is important when handling sensitive data (Reducing Use Risks). Many large general-purpose AI models can only be run in data centres, which means that using sensitive data with these models requires sending that data to external locations. Securing these deployments is a critical task for general-purpose AI [footnote 844], and can help prevent private information from leaking. Recent large-scale deployments have built end‑to-end security solutions to address this issue, but more research is required to secure these deployments [footnote 844].

Strong cryptographic approaches to running AI confidentially and securely end-to-end exist but are not yet applicable to general-purpose AI (Reducing Use Risks). Research has shown that small AI models can be run in combination with cryptographic tools such as homomorphic encryption [footnote 1350], zero-knowledge proofs [footnote 1351], multi-party computation [footnote 1352], [footnote 1353], and hardware protections (such as confidential computing on NVIDIA H100 GPUs) [footnote 1354], [footnote 1355], [footnote 1356] to enable both confidentiality of inputs and verifiability of secure computation. However, these techniques impose significant costs (the various methods can differ in their costs by orders of magnitude) and have not been scaled to the largest and most capable models being trained today. ‘Confidential computing’ with H100 GPUs stands out as the only current cryptographic approach that is usable with large models, but it is not a complete solution to end-to-end encryption or confidentiality. Future advances in related fields may enable these strong security techniques to become practical for general-purpose AI in the future [footnote 1177], [footnote 1357].

Practices such as data minimisation, purpose limitation, and other data protection will continue to be important with general-purpose AI, and existing privacy regulations will continue to play a role in determining the appropriate use of personal data (Reducing Training and Use Risks). Many jurisdictions where general-purpose AI will be used have existing regulations that limit or place guidelines on how personal data can be used [footnote 822], [footnote 1358]. In many cases, the principles underlying these regulations already apply to the ways in which general-purpose AI interacts with and uses personal or sensitive data.

Malicious actors may be able to use general-purpose AI to violate the privacy of others through general-purpose AI-enhanced stalking (Reducing Intentional Harm Risks). The content above primarily discussed the risks to privacy from using sensitive or private data during the training or usage of general-purpose AI systems. There is also a separate risk to privacy from malicious actors using general-purpose AI to enhance existing privacy-violating practices. General-purpose AI can infer the personal attributes of individuals at a lower cost, higher speed, and larger scale than humans can [footnote 483], [footnote 846], [footnote 1047]. This could enable, for example, a malicious actor to search through large data breaches and public information to infer attributes of individuals, infer information about public content (such as where an image was taken), and perform automated actions to aid in the exploitation of privacy, such as automatic personalised phishing or general-purpose AI-enabled targeted stalking. Some legal frameworks aim to hold creators and distributors accountable for malicious use [footnote 1359] and to provide remedies for individuals whose privacy has been violated. Other general-purpose AI capabilities, such as advanced cybersecurity attacks to extract private information or non-consensual deepfakes, may also worsen this trend. These outcomes could be partly prevented through improved technical mitigations and are akin to the problems outlined elsewhere in (3.4 Risk mitigation and monitoring) and (2.1 Risks from malicious use).

General-purpose AI systems can also improve privacy by supporting cybersecurity practices in development and explaining risks to users. While general-purpose AI creates many privacy risks, it can also help mitigate them. General-purpose AI can be used in software development platforms and tooling, which can support developers in designing secure software and scanning codebases for possible security flaws [footnote 1047] (see 2.1.3. Cyber offence) for more on using general-purpose AI systems to fix software vulnerabilities). For users, understanding privacy risks and monitoring personal exposure is challenging. Storytelling and user-centred explanations of risks and personal online safety strategies are important [footnote 1360], and could be communicated with the aid of general‑purpose AI systems. AI systems could also be used to aid in tracking where personal data is being used and communicate these findings to users.

Since the publication of the Interim Report (May 2024), increased effort has been put into improving the quality of data used to train general-purpose AI, enhancing the hardware security of AI systems’ deployment, and making it possible to run and store models locally on personal devices. As general-purpose AI becomes increasingly accessible on personal devices such as assistants on smartphones [footnote 841] and in sensitive contexts such as healthcare [footnote 1361], the strong security tooling for hosting general-purpose AI with verifiable privacy guarantees is becoming more common [footnote 1362]. This improved security in deployment (both on-device and in the cloud) is complemented by work in filtering PII from web-scale pre-training data [footnote 878]. Recent general-purpose AI systems’ ability to autonomously act and plan on users’ behalf (as AI agents) has prompted new privacy risks [footnote 673], [footnote 1363].

Other downstream privacy considerations are also important. For example, a number of experts have warned that if AI agents become indistinguishable from authentic humans on the web, combating these systems would lead to risks of mass identification (and subsequent surveillance) of online users [footnote 316], [footnote 853]. Privacy-preserving credentials to identify authentic, unique personhood online could minimise these unintended effects on privacy [footnote 853].

Evidence gaps: more research is needed to study how and when general-purpose AI risks revealing sensitive information, how general-purpose AI can be run with stronger security guarantees, and how to prevent general-purpose AI from being used for privacy-exploiting use cases. The full extent of personal data in general-purpose AI training data [footnote 1325] and the likelihood of it being memorised and exposed [footnote 831], [footnote 1364] are unknown and require more research. Even when sensitive data is used only at run time (often referred to as ‘in-context learning’), more research is needed to establish the risks of models leaking information in their output [footnote 847], [footnote 1365]. When using these general-purpose AI systems, strong cryptographic approaches to running them could enable more confidentiality and verifiability [footnote 1366], but more work is needed to scale these techniques to large AI systems. To prevent harm from malicious actors using general-purpose AI to violate the privacy of others, more research will be needed into how to make it more difficult to use general-purpose AI for malicious purposes. Many open technical questions exist about how to maintain the privacy of data creators, users, and AI system deployers while leveraging and governing general-purpose AI [footnote 1177]. New risks to privacy may also emerge as new general-purpose AI capabilities emerge (see 1.3. Capabilities in coming years).

For policymakers working on privacy, key challenges arise from a technical environment where methods to address privacy risks and minimise harm are rapidly evolving, across multiple areas of the general-purpose AI supply chain. The areas of risk discussed in this section and in (2.3.5. Risks to privacy) cover a broad spectrum of participants in the general‑purpose AI ecosystem, and mitigation strategies vary in their technical feasibility and complexity (summarised in Figure 3.3). Each mitigation strategy will impose costs on general‑purpose AI developers and deployers (e.g. cleaning web-scale data is expensive) and may worsen the user experience (e.g. strong cryptographic guarantees can slow down run speeds of general-purpose AI). This area of research is evolving, and the extent to which specific privacy risks will have robust mitigation strategies that can be deployed at scale will be hard to predict, made further challenging by the differences between the AI and the privacy policy communities [footnote 822].

Figure 3.3 Actionable Methods for Protecting Privacy

Figure 3.3: Actionable methods exist for mitigating privacy harms from general-purpose AI systems, including removal of PII from training data, using on-device models, and strengthening cybersecurity. The methods are ranked based on their relative feasibility within each risk group and are not exhaustive. Numerous privacy protection and harm mitigation measures exist, each with varying levels of complexity and challenges in deployment. Source: International AI Safety Report.

Conclusion

The first International AI Safety Report finds that the future of general‑purpose AI is remarkably uncertain. There is a wide range of possible outcomes even in the near future, including both very positive and very negative ones, as well as anything in between. General-purpose AI has immense potential for education, medical applications, research advances in fields such as chemistry, biology, or physics, and generally increased prosperity thanks to AI-enabled innovation. If managed properly, general-purpose AI systems could substantially improve the lives of people worldwide.

To reap the benefits of this transformative technology safely, researchers and policymakers need to identify the risks that come with it and take informed action to mitigate them. General‑purpose AI is already causing harm today due to malicious use and malfunctioning, for instance through deepfakes, scams and biased outputs. Depending on the rate of progress of future general-purpose AI capabilities, the technical methods that developers and regulators employ to mitigate risks, the decisions of governments and societies in relation to general-purpose AI, and the degree of successful global coordination, it is also possible that further risks could emerge. The worst outcomes could see the emergence of risks such as large-scale unemployment, general-purpose AI‑enabled terrorism, or humanity losing control over general-purpose AI systems. Experts differ in how likely or imminent they consider such risks and in how they interpret the existing evidence: some think that such risks are decades away, while others think that general-purpose AI could lead to severe public safety dangers within the next few years.

There exist technical methods for addressing the risks of general-purpose AI, but they all have limitations. For example, researchers have developed methods for reducing bias, improving our understanding of AI’s inner workings, assessing capabilities and risks, and making AI less likely to respond to user requests that could cause harm. However, several features of general-purpose AI make addressing risks difficult. Despite rapid advances in capabilities, researchers currently cannot generate human-understandable accounts of how general-purpose AI arrives at outputs and decisions. This makes it difficult to evaluate or predict what general-purpose AI is capable of and how reliable it is, or to obtain assurances on the risks that it might pose. There is broad expert agreement that it should be a priority to improve our understanding of how general-purpose AI arrives at outputs and decisions.

AI does not happen to us; choices made by people determine its future. How general-purpose AI is developed and by whom, which problems it is designed to solve, whether we will be able to reap its full economic potential, who benefits from it, and the types of risks we expose ourselves to – the answers to these and many other questions depend on the choices that societies and governments make today and in the future to shape the development of general-purpose AI. Since the impact of general‑purpose AI on many aspects of our lives is likely to be profound, and since progress might continue to be rapid, there is an urgent need to work towards international agreement and to put resources into understanding and addressing the risks of this technology. Constructive scientific and public discussion will be essential for societies and policymakers to make the right choices.

For the first time in history, this report and the Interim Report (May 2024) brought together expert representatives nominated by 30 countries, the OECD, the EU, and the UN, as well as several other world-leading experts, to provide a shared scientific, evidence-based foundation for these vital discussions. We continue to disagree on several questions, minor and major, around general‑purpose AI and its capabilities, risks, and risk mitigations. However, we consider this report essential for improving our collective understanding of general-purpose AI and its potential risks, and for moving closer towards consensus and effective risk mitigation, to ensure that humanity can enjoy the benefits of general-purpose AI safely. The stakes are high. We look forward to continuing this effort.

List of acronyms

Acronym Description
AAVE African American Vernacular English
AI artificial intelligence
AIM AI Incidents Monitor
AIME American Invitational Mathematics Examination
AISI AI Safety Institute
ALARP as low as reasonably practicable
AMLAS Assurance of Machine Learning for use in Autonomous Systems
API application programming interface
ASEAN Association of Southeast Asian Nations
AWS Amazon Web Services
BTWC Biological and Toxin Weapons Convention
CBRN chemical, biological, radiological and nuclear
CNI critical national infrastructure
COVID-19 Coronavirus disease 2019
CSAM child sexual abuse material
CTF Capture the Flag
CTM coal transition mechanism
CWC Chemical Weapons Convention
DARPA Defense Advanced Research Project Agency
DBTL design-build-test-learn
DSIT Department for Science, Innovation and Technology
EU European Union
ExIF exchangeable image file format
FAccT (conference on) Fairness, Accountability, and Transparency
FLOP floating point operations
GDP gross domestic product
GDPR General Data Protection Regulation
GLUE General Language Understanding Evaluation
GNI Gross National Income
GHG greenhouse gas
GPQA Grade School Quality Assessment
GPT generative pre-trained transformer
GPU graphics processing unit
HAZOP hazard and operability study
HICs high-income countries
ICT information and communication technology
IEA International Energy Agency
IEC International Electrotechnical Commission
IMO International Mathematics Olympiad
ISO International Organization for Standardization
kWh kilowatt hour
LLM large language model
LMICs low- and middle-income countries
MMLU Massive Multitask Language Understanding
MNIST modified National Institute of Standards and Technology (database)
MW megawatt
NCII non-consensual intimate imagery
NIST National Institute of Standards and Technology
OECD Organisation for Economic Co‑operation and Development
Ofcom Office of Communications
OSS open source software
PaLM-E Pathways Language Model (Embodied)
PC personal computer
PhD Doctor of Philosophy
PII personally identifiable information
PPA power purchase agreement
PTSD post-traumatic stress disorder
PUE power usage effectiveness
Q&A question and answer
R&D research and development
RAG Retrieval-Augmented Generation
REC renewable energy credit
RLHF reinforcement learning from human feedback
RoG reasoning on graphs
RT robotics transformer
SARS-CoV-2 severe acute respiratory syndrome coronavirus 2
SMEs small and medium enterprises
SMR small modular reactor
SOTIF safety of the intended function
SQLite structured query language lite
SQuAD Stanford Question Answering Dataset
STEM science, technology, engineering, and mathematics
SWE-bench software engineering benchmark
tCO2e tonnes of carbon dioxide equivalent
TPU tensor processing unit
TSMC Taiwan Semiconductor Manufacturing Company
TWh terawatt hour
UK United Kingdom
UNESCO United Nations Educational, Scientific and Cultural Organization
US United States
USB universal serial bus
VD vulnerability discovery
V-JEPA video joint embedding predictive architecture
XAI explainable artificial intelligence

Glossary

The explanations below all refer to the use of a term with respect to AI.

Adversarial training: A machine learning technique used to make models more reliable. First, developers construct ‘adversarial inputs’ (e.g. through red-teaming) that are designed to make a model fail, and second, they train the model to recognise and handle these kinds of inputs.

AI agent: A general-purpose AI which can make plans to achieve goals, adaptively perform tasks involving multiple steps and uncertain outcomes along the way, and interact with its environment – for example by creating files, taking actions on the web, or delegating tasks to other agents – with little to no human oversight.

AI R&D divide: The disparity in AI research and development across different geographic regions, caused by various factors including an unequal distribution of computing power, talent, financial resources, and infrastructure.

AI-generated fake content: Audio, text, or visual content, produced by generative AI, that depicts people or events in a way that differs from reality in a malicious or deceptive way, e.g. showing people doing things they did not do, saying things they did not say, changing the location of real events, or depicting events that did not happen.

AI lifecycle: The distinct stages of developing AI, including data collection and pre-processing, pre‑training, fine-tuning, model integration, deployment, post-deployment monitoring, and downstream modifications.

Algorithm: A set of rules or instructions that enable an AI system to process data and perform specific tasks.

Algorithmic (training) efficiency: A set of measures of how efficiently an algorithm uses computational resources to learn from data, such as the amount of memory used or the time taken for training.

Algorithmic transparency: The degree to which the factors informing general-purpose AI output, e.g. recommendations or decisions, are knowable by various stakeholders. Such factors might include the inner workings of the AI model, how it has been trained, what data it is trained on, what features of the input affected its output, and what decisions it would have made under different circumstances.

Alignment: An AI’s propensity to use its capabilities in line with human intentions or values. Depending on the context, this can variously refer to the intentions and values of developers, operators, users, specific communities, or society as a whole.

Application programming interface (API): A set of rules and protocols that enables integration and communication between AI systems and other software applications.

Artificial general intelligence (AGI): Potential future AI that equals or surpasses human performance on all or almost all cognitive tasks.

Artificial intelligence (AI): The field of computer science focused on creating systems or machines capable of performing tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, natural language processing, and decision-making.

Audit: A formal review of an organisation’s compliance with standards, policies, and procedures, typically carried out by an independent third party.

Automation: The use of technology to perform tasks with reduced or no human involvement.

Benchmark: A standardised, often quantitative test or metric used to evaluate and compare the performance of AI systems on a fixed set of tasks designed to represent real-world usage.

Bias: Systematic errors in algorithmic systems that favour certain groups or worldviews and often create unfair outcomes for some people. Bias can have multiple sources, including errors in algorithmic design, unrepresentative or otherwise flawed datasets, or pre-existing social inequalities.

Biosecurity: A set of policies, practices, and measures (e.g. diagnostics and vaccines) designed to protect humans, animals, plants, and ecosystems from harmful biological agents, whether naturally occurring or intentionally introduced.

Capabilities: The range of tasks or functions that an AI system can perform, and how competently it can perform them.

Carbon intensity: The amount of GHG emissions produced per unit of energy. Used to quantify the relative emissions of different energy sources.

Carbon offsetting: Compensating for GHG emissions from one source by investing in other activities that prevent comparable amounts of emissions or remove carbon from the atmosphere, such as expanding forests.

Chain of thought: A reasoning process in which an AI generates intermediate steps or explanations while solving a problem or answering a question. This approach mimics human logical reasoning and internal deliberation, helping the model break down complex tasks into smaller, sequential steps to improve accuracy and transparency in its outputs.

Cloud computing: A paradigm for delivering computing services – including servers, data storage, software, and analytics – over the internet. Users can access these resources on demand and without local infrastructure to develop, train, deploy, and manage AI applications.

Cognitive tasks: Activities that involve processing information, problem-solving, decision-making, and creative thinking. Examples include research, writing, and programming.

Compute: Shorthand for ‘computational resources’, which refers to the hardware (e.g. GPUs), software (e.g. data management software) and infrastructure (e.g. data centres) required to train and run AI systems.

Control: The ability to exercise oversight over an AI system and adjust or halt its behaviour if it is acting in unwanted ways.

Control-undermining capabilities: Capabilities that, if employed, would enable an AI system to undermine human control.

Copyright: A form of legal protection granted to creators of original works, giving them exclusive rights to use, reproduce, and distribute their work.

CTF (Capture the Flag) challenges: Exercises often used in cybersecurity training, designed to test and improve the participants’ skills by challenging them to solve problems related to cybersecurity, such as finding hidden information or bypassing security defences.

Data centre: A large collection of networked, high-power computer servers used for remote computation. Hyperscale data centres typically contain more than 5000 servers.

Data collection and pre-processing: A stage of AI development in which developers and data workers collect, clean, label, standardise, and transform raw training data into a format that the model can effectively learn from.

Data minimisation: The practice of collecting and retaining only the data that is directly necessary for a specific purpose, and deleting it once that purpose is fulfilled.

Deceptive alignment: Misalignment that is difficult to detect, because the system behaves in ways that at least initially appear benign.

Deepfake: A type of AI-generated fake content, consisting of audio or visual content, that misrepresents real people as doing or saying something that they did not actually do or say.

Deep learning: A machine learning technique in which large amounts of data and compute are used to train multilayered, artificial neural networks (inspired by biological brains) to automatically learn and extract high-level features from large datasets, enabling powerful pattern recognition and decision‑making capabilities.

Defence in depth: A strategy that includes layering multiple risk mitigation measures in cases where no single existing method can provide safety.

Deployment: The process of implementing AI systems into real-world applications, products, or services where they can serve requests and operate within a larger context.

Developer: Any organisation that designs, builds, integrates, adapts or combines AI models or systems.

Digital divide: The disparity in access to information and communication technology (ICT), particularly the internet, between different geographic regions or groups of people.

Digital forensics: The process of tracing the origin and spread of digital media.

Digital infrastructure: The foundational services and facilities necessary for digital technologies to function, including hardware, software, networks, data centres, and communication systems.

Discrimination: The unfair treatment of individuals or groups based on their attributes, such as race, gender, age, religion, or other protected characteristics.

Disinformation: False or misleading information generated or spread with the intent to deceive or influence people. See ‘Misinformation’ for contrast.

Distributed training: A process for training AI models across multiple processors and servers, concentrated in one or multiple data centres.

Dual-use science: Research and technology that can be applied for beneficial purposes, such as medicine or environmental solutions, but also potentially misused to cause harm, such as in biological or chemical weapon development​.

Emergent behaviour: The ability of AI systems to act in ways that were not explicitly programmed or intended by their developers or users.

Evaluations: Systematic assessments of an AI system’s performance, capabilities, vulnerabilities or potential impacts. Evaluations can include benchmarking, red-teaming and audits and can be conducted both before and after model deployment.

Explainable AI (XAI): A research programme to build AI systems that provide clear and understandable explanations of their decisions, allowing users to understand how and why specific outputs are generated.

Fairness: A societal value according to which AI systems should make decisions that are free from bias or unjust discrimination, treating all individuals and groups equitably, particularly with regard to protected attributes such as race, gender, age, or socio-economic status.

Fair use: An American legal doctrine that provides a defence to copyright infringement claims for limited uses of copyrighted materials without permission for purposes such as criticism, comment, news reporting, education, and research. Some other countries allow similar use rights under the name ‘fair dealing’.

Field testing: The practice of evaluating the risks of general-purpose AI under real-world conditions.

Fine-tuning: The process of adapting a pre-trained AI model to a specific task or making it more useful in general by training it on additional data.

First-mover advantage: The competitive benefit gained by being the first to establish a significant market position in an industry.

FLOP: ‘Floating point operations’ – the number of computational operations performed by a computer program. Often used as a measure for the amount of compute used in training an AI model.

Foundation model: A general-purpose AI model designed to be adaptable to a wide range of downstream tasks.

Frontier AI: A term sometimes used to refer to particularly capable AI that matches or exceeds the capabilities of today’s most advanced AI. For the purposes of this report, frontier AI can be thought of as particularly capable general-purpose AI.

General-purpose AI: AI systems designed to perform a wide range of tasks across various domains, rather than being specialised for one specific function. See ‘Narrow AI’ for contrast.

Generative AI: AI that can create new content such as text, images, or audio by learning patterns from existing data and generating novel outputs that reflect those patterns.

GHG (greenhouse gas) emissions: Release of gases such as carbon dioxide (CO2), methane, nitrous oxide, and hydrofluorocarbons which create a barrier trapping heat in the atmosphere. A key indicator of climate change.

Ghost work: The hidden labour performed by workers to support the development and deployment of AI models or systems (for example through data labelling).

Goal misgeneralisation: A situation in which an AI system correctly follows an objective in its training environment, but applies it in unintended ways when operating in a different environment.

Goal misspecification: A mismatch between the objective given to an AI and the developer’s intention, leading the AI to pursue unintended or undesired behaviours.

GPU (graphics processing unit): A specialised computer chip, originally designed for computer graphics, that is now widely used to handle complex parallel processing tasks essential for training and running AI models.

Guardrails: Built-in safety constraints to ensure that an AI system operates as desired and avoids harmful outcomes.

Hacking: The act of exploiting vulnerabilities or weaknesses in a computer system, network, or software to gain unauthorised access, manipulate functionality, or extract information.

Hallucination: Inaccurate or misleading information generated by an AI system, for instance false facts or citations.

Hardware backdoor: A feature of a device, intentionally or unintentionally created by a manufacturer or third party, that can be used to bypass security protections in order to monitor, control, or extract data without the user’s knowledge.

Hazard: Any event or activity that has the potential to cause harm, such as loss of life, injury, social disruption, or environmental damage.

High-income countries (HICs): Countries with a gross national income (GNI) per capita higher than $14,005, as calculated by the World Bank.

Human in the loop: A requirement that humans must oversee and sign off on otherwise automated processes in critical areas.

If-then commitments: Conditional agreements, frameworks, or regulations that specify actions or obligations to be carried out when certain predefined conditions are met.

Incident reporting: Documenting and sharing cases in which developing or deploying AI has caused direct or indirect harms.

Inference: The process in which an AI generates outputs based on a given input, thereby applying the knowledge learnt during training.

Inference-time enhancements: Techniques used to improve an AI system’s performance after its initial training, without changing the underlying model. This includes clever prompting methods, answer selection methods (e.g. sampling multiple responses and choosing a majority answer), writing long ‘chains of thought’, agent ‘scaffolding’, and more.

Input (to an AI system): The data or prompt submitted to an AI system, such as text or an image, which the AI system processes and turns into an output.

Institutional transparency: The degree to which AI companies disclose technical or organisational information to public or governmental scrutiny, including training data, model architectures, emissions data, safety and security measures, or decision-making processes.

Intellectual property: Creations of the mind over which legal rights may be granted, including literary and artistic works, symbols, names and images.

Interpretability: The degree to which humans can understand the inner workings of an AI model, including why it generated a particular output or decision. A model is highly interpretable if its mathematical processes can be translated into concepts that allow humans to trace the specific factors and logic that influenced the model’s output.

Interpretability research: The study of how general-purpose AI models function internally, and the development of methods to make this comprehensible to humans.

Jailbreaking: Generating and submitting prompts designed to bypass guardrails and make an AI system produce harmful content, such as instructions for building weapons.

Labour market: The system in which employers seek to hire workers and workers seek employment, encompassing job creation, job loss, and wages.

Labour market disruption: Significant and often complex changes in the labour market that affect job availability, required skills, wage distribution, or the nature of work across sectors and occupations.

Large language model (LLM): An AI model trained on large amounts of text data to perform language processing tasks, such as generating, translating, or summarising text.

Likeness rights: Rights that protect an individual’s image, voice, name, or other identifiable aspects from unauthorised commercial use.

Loss of control scenario: A scenario in which one or more general-purpose AI systems come to operate outside of anyone’s control, with no clear path to regaining control.

Low- and middle-income countries (LMICs): Countries with a gross national income (GNI) per capita lower than $14,005, as calculated by the World Bank.

Machine learning (ML): A subset of AI focused on developing algorithms and models that learn from data and improve their performance on tasks over time without being explicitly programmed.

Malfunctioning: The failure of a general-purpose AI system to operate as intended by its developer or user, resulting in incorrect or harmful outputs or operational disruptions.

Malicious use: Employing AI to intentionally cause harm.

Malware: Harmful software designed to damage, disrupt, or gain unauthorised access to a computer system. It includes viruses, spyware, and other malicious programs that can steal data or cause harm.

Marginal risk: The additional risk introduced by a general-purpose AI model or system compared to a relevant baseline, such as a comparable risk posed by existing non-AI technology.

Market concentration: The degree to which a small number of companies control an industry, leading to reduced competition and increased control over pricing and innovation.

Massive Multitask Language Understanding (MMLU): A widely used benchmark in AI research that assesses a general-purpose AI model’s performance across a broad range of tasks and subject areas.

Misalignment: An AI’s propensity to use its capabilities in ways that conflict with human intentions or values. Depending on the context, this can variously refer to the intentions and values of developers, operators, users, specific communities, or society as a whole.

Misinformation: False or misleading information that is generated or spread without intent to deceive. See ‘Disinformation’ for contrast.

Modalities: The kinds of data that an AI system can competently receive as input and produce as output, including text (language or code), images, video, and robotic actions.

Model: A computer program, often based on machine learning, designed to process inputs and generate outputs. AI models can perform tasks such as prediction, classification, decision-making, or generation, forming the core of AI applications.

Model card: A document providing useful information about an AI model, for instance about its purpose, usage guidelines, training data, performance on benchmarks, or safety features.

Model release: Making a trained AI model available for downstream entities to further use, study, or modify it, or to integrate it into their own systems.

Narrow AI: A kind of AI that is specialised to perform one specific task or a few very similar tasks, such as ranking web search results, classifying species of animals, or playing chess. See ‘General‑purpose AI’ for contrast.

Neural network: A type of AI model consisting of a mathematical structure that is inspired by the human brain and composed of interconnected nodes (like neurons) that process and learn from data. Current general-purpose AI systems are based on neural networks.

Open-ended domains: Environments into which AI systems might be deployed which present a very large set of possible scenarios. In open-ended domains, developers typically cannot anticipate and test every possible way that an AI system might be used.

Open-weight model: An AI model whose weights are publicly available for download, such as Llama or Stable Diffusion. Open-weight models can be, but are not necessarily, open source.

Open source model: An AI model that is released for public download under an open source licence. The open source licence grants the freedom to use, study, modify and share the model for any purpose. There remains some disagreement as to which model components (weights, code, training data) and documentation must be publicly accessible for the model to qualify as open source.

Parameters: The numerical components of an AI model, such as weights and biases, that are learned from data during training and determine how the model processes inputs to generate outputs. Note that ‘bias’ here is a mathematical term that is unrelated to bias in the context of discrimination.

Pathogen: A microorganism, for example a virus, bacterium, or fungus, that can cause disease in humans, animals, or plants​.

Penetration testing: A security practice where authorised experts or AI systems simulate cyberattacks on a computer system, network or application to proactively evaluate its security. The goal is to identify and fix weaknesses before they can be exploited by real attackers.

Personally identifiable information (PII): Any data that can directly or indirectly identify an individual (for example, names or ID numbers). Includes information that can be used alone or combined with other data to uniquely identify a person.

Post-deployment monitoring: The processes by which AI developers track model impact and performance metrics, gather and analyse user feedback, and make iterative improvements to address issues or limitations discovered during real-world use.

Pre-training: A stage in developing a general-purpose AI model in which models learn patterns from large amounts of data. The most compute-intensive stage of model development.

Privacy: A person’s or group’s right to control how others access or process their sensitive information and activities.

Prompt: An input to an AI system, such as a text-based question or query, that the system processes and responds to.

Race to the bottom: A competitive scenario in which actors like companies or nation states prioritise rapid AI development over safety.

Ransomware: A type of malware that locks or encrypts a user’s files or system, making them inaccessible until a ransom (usually money) is paid to the attacker.

Rebound effect: In economics, the reduction in expected improvements due to increases in efficiency, resulting from correlated changes in behaviour, use patterns, or other systemic changes. For example, improving automotive combustion engine efficiency (km/litre) by 25% will lead to less than a 25% reduction in emissions, because the corresponding reduction in the cost of gas per kilometre driven will make it cheaper to drive more, limiting improvements.

Red-teaming: A systematic process in which dedicated individuals or teams search for vulnerabilities, limitations, or potential for misuse through various methods. Often, the red team searches for inputs that induce undesirable behaviour in a model or system to identify safety gaps.

Reinforcement learning from human feedback (RLHF): A machine learning technique in which an AI model is refined by using human-provided evaluations or preferences as a reward signal, allowing the system to learn and adjust its behaviour to better align with human values and intentions through iterative training.

Reliability: An AI system’s ability to consistently perform its intended function.

Responsible Scaling Policy (RSP): A set of technical and organisational protocols, usually in an ‘if‑then’ format for various levels of capability, that specify rules for the safe development and deployment of increasingly capable AI systems.

Retrieval-Augmented Generation (RAG): A technique that allows LLMs to draw information from other sources during inference, such as web search results or an internal company database, enabling more accurate or personalised responses.

Risk: The combination of the probability and severity of a harm that arises from the development, deployment, or use of AI.

Risk factors: Properties or conditions that can increase the risks of an AI system. For example, weak guardrails are a risk factor that could enable malicious actors to use an AI system for a cyberattack.

Risk management: The systematic process of identifying, evaluating, mitigating and monitoring risks.

Risk threshold: A quantitative or qualitative limit that distinguishes acceptable from unacceptable risks and triggers specific risk management actions when exceeded.

Risk tolerance: The level of risk that an individual or organisation is willing to take on.

Robustness (of an AI system): The property of behaving safely in a wide range of circumstances.

Safety (of an AI system): The property of avoiding harmful outputs, such as providing dangerous information to users, being used for nefarious purposes, or having costly malfunctions in high‑stakes settings.

Safety case: A structured argument, typically produced by a developer and supported by evidence, that an AI model or system is acceptably safe in a given operational context. Developers or regulators can use safety cases as the basis for important decisions (for instance, whether to deploy an AI system).

Scaffold(ing): Additional software built around an AI system that helps it to perform a task. For example, an AI system might be given access to an external calculator app to increase its performance on arithmetical problems. More sophisticated scaffolding may structure a model’s outputs and guide the model to improve its answers step-by-step.

Scaling laws: Systematic relationships observed between an AI model’s size (or the amount of time, data or computational resources used in training or inference) and its performance.

Security (of an AI system): The property of being resilient to technical interference, such as cyberattacks or leaks of the underlying model’s source code.

Semiconductor: A material (typically silicon) with electrical properties that can be precisely controlled, forming the fundamental building block of computer chips, such as GPUs.

Sensitive data: Information that, if disclosed or mishandled, could result in harm, embarrassment, inconvenience, or unfairness to an individual or organisation.

Single point of failure: A part in a larger system whose failure disrupts the entire system. For example, if a single AI system plays a central role in the economy or critical infrastructure, its malfunctioning could cause widespread disruptions across society.

Synthetic data: Data like text or images that has been artificially generated, for instance by general‑purpose AI systems. Synthetic data might be used for training AI systems, e.g. when high‑quality natural data is scarce.

System: An integrated setup that combines one or more AI models with other components, such as user interfaces or content filters, to produce an application that users can interact with.

System integration: The process of combining an AI model with other software components to produce a full ‘AI system’ that is ready for use. For instance, integration might consist in developers combining a general-purpose AI model with content filters and a user interface to produce a chatbot application.

Systemic risks: Broader societal risks associated with general-purpose AI development and deployment, beyond the capabilities of individual models or systems. Examples of systemic risks range from potential labour market impacts to privacy infringements and environmental harms. Note that this is different from how ‘systemic risk’ is defined by the AI Act of the European Union. There, the term refers to “risk that is specific to the high-impact capabilities of general-purpose AI models, having a significant impact”.

Toxin: A poisonous substance produced by living organisms (such as bacteria, plants, or animals), or synthetically created to mimic a natural toxin, that can cause illness, harm, or death in other organisms depending on its potency and the exposure level.

TPU (tensor processing unit): A specialised computer chip, developed by Google for accelerating machine learning workloads, that is now widely used to handle large-scale computations for training and running AI models.

Trademark: A symbol, word, or phrase legally registered or established by use to represent a company or product, distinguishing it from others in the market.

Transformer: A deep learning (neural network) model architecture at the heart of most modern general-purpose AI models. The transformer architecture has proven particularly efficient at converting increasingly large amounts of training data and computational power into better model performance.

Watermark: A subtle, often imperceptible pattern embedded within AI-generated content (such as text, images, or audio) to indicate its artificial origin, verify its source, or detect potential misuse.

Web crawling: Using an automated program, often called a crawler or bot, to navigate the web, for the purposes of collecting data from websites.

Weights: Model parameters that represent the strength of connection between nodes in a neural network. Weights play an important part in determining the output of a model in response to a given input and are iteratively updated during model training to improve its performance.

Whistleblowing: The disclosing of information, by an individual member of an organisation, about illegal or unethical activities taking place within the organisation to internal or external authorities or the public.

Winner takes all: A concept in economics referring to cases in which a single company captures a very large market share, even if consumers only slightly prefer its products or services over those of competitors.

Zero-day vulnerability: An undiscovered or unpatched security flaw in software or hardware. As attackers can already exploit it, developers have ‘zero days’ to fix it.

How to cite this report:

Formatted citation

Y. Bengio, S. Mindermann, D. Privitera, T. Besiroglu, R. Bommasani, S. Casper, Y. Choi, P. Fox, B. Garfinkel, D. Goldfarb, H. Heidari, A. Ho, S. Kapoor, L. Khalatbari, S. Longpre, S. Manning, V. Mavroudis, M. Mazeika, J. Michael, J. Newman, K. Y. Ng, C. T. Okolo, D. Raji, G. Sastry, E. Seger, T. Skeadas, T. South, E. Strubell, F. Tramèr, L. Velasco, N. Wheeler, D. Acemoglu, O. Adekanmbi, D. Dalrymple, T. G. Dietterich, P. Fung, P.-O. Gourinchas, F. Heintz, G. Hinton, N. Jennings, A. Krause, S. Leavy, P. Liang, T. Ludermir, V. Marda, H. Margetts, J. McDermid, J. Munga, A. Narayanan, A. Nelson, C. Neppel, A. Oh, G. Ramchurn, S. Russell, M. Schaake, B. Schölkopf, D. Song, A. Soto, L. Tiedrich, G. Varoquaux, E. W. Felten, A. Yao, Y.-Q. Zhang, O. Ajala, F. Albalawi, M. Alserkal, G. Avrin, C. Busch, A. C. P. de L. F. de Carvalho, B. Fox, A. S. Gill, A. H. Hatip, J. Heikkilä, C. Johnson, G. Jolly, Z. Katzir, S. M. Khan, H. Kitano, A. Krüger, K. M. Lee, D. V. Ligot, J. R. López Portillo, D., O. Molchanovskyi, A. Monti, N. Mwamanzi, M. Nemer, N. Oliver, R. Pezoa Rivera, B. Ravindran, H. Riza, C. Rugege, C. Seoighe, H. Sheikh, J. Sheehan, D. Wong, Y. Zeng, “International AI Safety Report” (DSIT 2025/001, 2025); International Scientific Report on the Safety of Advanced AI: Interim Report.

Bibtex entry

@techreport{ISRSAA2025,

 title = {International AI Safety Report},

author = {Bengio, Yoshua and Mindermann, S{"o}ren and Privitera, Daniel and Besiroglu, Tamay and Bommasani, Rishi and Casper, Stephen and Choi, Yejin and Fox, Philip and Garfinkel, Ben and Goldfarb, Danielle and Heidari, Hoda and Ho, Anson and Kapoor, Sayash and Khalatbari, Leila and Longpre, Shayne and Manning, Sam and Mavroudis, Vasilios and Mazeika, Mantas and Michael, Julian and Newman, Jessica and Ng, Kwan Yee and Okolo, Chinasa T. and Raji, Deborah and Sastry, Girish and Seger, Elizabeth and Skeadas, Theodora and South, Tobin and Strubell, Emma and Tram{`e}r, Florian and Velasco, Lucia and Wheeler, Nicole and Acemoglu, Daron and Adekanmbi, Olubayo and Dalrymple, David and Dietterich, Thomas G. and Felten, Edward W. and Fung, Pascale and Gourinchas, Pierre-Olivier and Heintz, Fredrik and Hinton, Geoffrey and Jennings, Nick and Krause, Andreas and Leavy, Susan and Liang, Percy and Ludermir, Teresa and Marda, Vidushi and Margetts, Helen and McDermid, John and Munga, Jane and Narayanan, Arvind and Nelson, Alondra and Neppel, Clara and Oh, Alice and Ramchurn, Gopal and Russell, Stuart and Schaake, Marietje and Sch{"o}lkopf, Bernhard and Song, Dawn and Soto, Alvaro and Tiedrich, Lee and Varoquaux, Ga{"e}l and Yao, Andrew and Zhang, Ya-Qin and Ajala, Olubunmi and Albalawi, Fahad and Alserkal, Marwan and Avrin, Guillaume and Busch, Christian and {de Carvalho}, Andr{'e} Carlos Ponce de Leon Ferreira and Fox, Bronwyn and Gill, Amandeep Singh and Hatip, Ahmet Halit and Heikkil{"a}, Juha and Johnson, Chris and Jolly, Gill and Katzir, Ziv and Khan, Saif M. and Kitano, Hiroaki and Kr{"u}ger, Antonio and Lee, Kyoung Mu and Ligot, Dominic Vincent and {L{'o}pez Portillo}, Jos{'e} Ram{'o}n and Molchanovskyi, Oleksii and Monti, Andrea and Mwamanzi, Nusu and Nemer, Mona and Oliver, Nuria and {Pezoa Rivera}, Raquel and Ravindran, Balaraman and Riza, Hammam and Rugege, Crystal and Seoighe, Ciar{'a}n and Sheehan, Jerry and Sheikh, Haroon and Wong, Denise and Zeng, Yi},

 year = {2025},

 number = {DSIT 2025/001},

 URL = {https://www.gov.uk/government/publications/international-ai-safety-report-2025}

References

We use [industry] to indicate a reference that is a report published by a for-profit AI company or where at least 50% of the authors of a report, on a preprint (based on their listed affiliations), work for a for-profit AI company. This classification is based solely on the affiliation data provided in the publications, is for informational purposes only, and should not be considered exhaustive.

  1. R. Simmons-Edler, R. Badman, S. Longpre, K. Rajan, “AI-Powered Autonomous Weapons Risk Geopolitical Instability and Threaten AI Research” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024) (PMLR, 2024); https://proceedings.mlr.press/v235/simmons-edler24a.html

  2. [industry] OpenAI, “OpenAI o1 System Card” (OpenAI, 2024); https://cdn.openai.com/o1-system-card-20240917.pdf 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

  3. [industry] OpenAI, “GPT-4o System Card” (OpenAI, 2024); https://cdn.openai.com/gpt-4o-system-card.pdf 2 3 4 5 6 7

  4. [industry] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, … O. Vinyals, “Gemini: A Family of Highly Capable Multimodal Models” (Google DeepMind, 2023); http://arxiv.org/abs/2312.11805 2 3 4 5 6

  5. [industry] Anthropic, Claude 3.5 Sonnet Model Card Addendum (2024); https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf 2 3

  6. [industry] Cohere, Command R+ (2024); https://docs.cohere.com/v2/docs/command-r-plus

  7. [industry] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, … J. Lin, Qwen2.5-Coder Technical Report, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.12186

  8. [industry] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, … H. Wang, ERNIE 3.0: Large-Scale Knowledge Enhanced Pre-Training for Language Understanding and Generation, arXiv [cs.CL] (2021); http://arxiv.org/abs/2107.02137

  9. [industry] X. Sun, Y. Chen, Y. Huang, R. Xie, J. Zhu, K. Zhang, S. Li, Z. Yang, J. Han, X. Shu, J. Bu, Z. Chen, X. Huang, F. Lian, S. Yang, J. Yan, Y. Zeng, … J. Jiang, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, arXiv [cs.CL] (2024); http://arxiv.org/abs/2411.02265

  10. [industry] 01.AI, A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, K. Yu, P. Liu, Q. Liu, S. Yue, S. Yang, S. Yang, … Z. Dai, Yi: Open Foundation Models by 01.AI, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.04652

  11. [industry] Meta, Llama-3.1-8B Official Model Card (2024); https://huggingface.co/meta-llama/Llama-3.1-8B 2 3

  12. [industry] Mistral AI, Model Card for Mistral-Large-Instruct-2407 (2024); https://huggingface.co/mistralai/Mistral-Large-Instruct-2407 2 3

  13. L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, M.-H. Yang, Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Computing Surveys 56, 1–39 (2023); [https://doi.org10.1145/3626235

  14. [industry] OpenAI, “DALL·E 3 System Card” (OpenAI, 2023); https://cdn.openai.com/papers/DALL_E_3_System_Card.pdf 2 3

  15. [industry] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, R. Rombach, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, arXiv [cs.CV] (2024); http://arxiv.org/abs/2403.03206 2

  16. [industry] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, A. Ramesh, “Video Generation Models as World Simulators” (OpenAI, 2024); https://openai.com/research/video-generation-models-as-world-simulators 2

  17. B. Guo, X. Shan, J. Chung, A Comparative Study on the Features and Applications of AI Tools -Focus on PIKA Labs and RUNWAY. International Journal of Internet, Broadcasting and Communication 16, 86–91 (2024); https://doi.org/10.7236/ijibc.2024.16.1.86

  18. D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, … P. Florence, “PaLM-E: An Embodied Multimodal Language Model” in Proceedings of the 40th International Conference on Machine Learning (ICML’23) (PMLR, Honolulu, HI, USA, 2023) vol. 202, pp. 8469–8488;  2

  19. [industry] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, … S. Levine, Octo: An Open-Source Generalist Robot Policy, arXiv [cs.RO] (2024); http://arxiv.org/abs/2405.12213 2 3

  20. M. Firat, S. Kuleli̇, What If GPT4 Became Autonomous: The Auto-GPT Project and Use Cases. Journal of Emerging Computer Technologies 3, 1–6 (2024); https://doi.org/10.57020/ject.1297961

  21. [industry]  Y. Wang, T. Shen, L. Liu, J. Xie, Sibyl: Simple yet Effective Agent Framework for Complex Real-World Reasoning, arXiv [cs.AI] (2024); http://arxiv.org/abs/2407.10718

  22. [industry] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, D. Ha, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, arXiv [cs.AI] (2024); http://arxiv.org/abs/2408.06292 2 3 4 5 6

  23. J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, S. W. Bodenstein, D. A. Evans, C.-C. Hung, M. O’Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, … J. M. Jumper, Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 630, 493–500 (2024); https://doi.org/10.1038/s41586-024-07487-w 2 3 4 5

  24. Y. LeCun, Y. Bengio, G. Hinton, Deep Learning. Nature 521, 436–444 (2015); https://doi.org/10.1038/nature14539

  25. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. U. Kaiser, I. Polosukhin, “Attention Is All You Need” in Advances in Neural Information Processing Systems (NIPS 2017) (Curran Associates, Inc., 2017) vol. 30; https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

  26. J. Sevilla, L. Heim, A. Ho, T. Besiroglu, M. Hobbhahn, P. Villalobos, “Compute Trends Across Three Eras of Machine Learning” in 2022 International Joint Conference on Neural Networks (IJCNN 2022) (Padua, Italy, 2022), pp. 1–8; https://doi.org/10.1109/IJCNN55064.2022.9891914 2 3 4 5 6 7

  27. B. Cottier, R. Rahman, L. Fattorini, N. Maslej, D. Owen, How Much Does It Cost to Train Frontier AI Models?, Epoch AI (2024); https://epochai.org/blog/how-much-does-it-cost-to-train-frontier-ai-models 2 3 4 5 6

  28. C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, O. Levy, “LIMA: Less Is More for Alignment” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=KBMOKmX2he 2

  29. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, C. Finn, “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=HPuSIXJaa9 2

  30. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, … R. Lowe, “Training Language Models to Follow Instructions with Human Feedback” in 36th Conference on Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, USA, 2022); https://openreview.net/forum?id=TG8KACxEON 2 3 4 5

  31. [industry] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, … J. Kaplan, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv [cs.CL] (2022); http://arxiv.org/abs/2204.05862 2 3

  32. [industry] N. McAleese, R. M. Pokorny, J. F. C. Uribe, E. Nitishinskaya, M. Trebacz, J. Leike, LLM Critics Help Catch LLM Bugs, arXiv [cs.SE] (2024); http://arxiv.org/abs/2407.00215

  33. [industry] H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V. Carbune, A. Rastogi, S. Prakash, RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, arXiv [cs.CL] (2023); http://arxiv.org/abs/2309.00267 2

  34. M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, T. Gebru, “Model Cards for Model Reporting” in Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ‘19) (Association for Computing Machinery, New York, NY, USA, 2019), pp. 220–229; https://doi.org/10.1145/3287560.3287596 2 3 4

  35. [industry] I. Solaiman, The Gradient of Generative AI Release: Methods and Considerations, arXiv [cs.CY] (2023); http://arxiv.org/abs/2302.04844

  36. [industry] Open Source Initiative, The Open Source AI Definition – 1.0-RC2, Open Source Initiative (2024); https://opensource.org/ai/drafts/the-open-source-ai-definition-1-0-rc2

  37. [industry] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, … Z. Zhao, “The Llama 3 Herd of Models” (Meta, 2024); https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ 2 3 4 5

  38. M. Stein, C. Dunlop, Safe beyond Sale: Post-Deployment Monitoring of AI (2024); https://www.adalovelaceinstitute.org/blog/post-deployment-monitoring-of-ai/ 2

  39. E. Shayegani, M. A. Al Mamun, Y. Fu, P. Zaree, Y. Dong, N. Abu-Ghazaleh, Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.10844 2

  40. R. T. McCoy, S. Yao, D. Friedman, M. D. Hardy, T. L. Griffiths, When a Language Model Is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.01792 2

  41. U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. S. Lubana, E. Jenner, S. Casper, O. Sourbut, B. L. Edelman, Z. Zhang, M. Günther, A. Korinek, J. Hernandez-Orallo, L. Hammond, E. Bigelow, … D. Krueger, Foundational Challenges in Assuring Alignment and Safety of Large Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2404.09932 2 3 4 5 6

  42. [industry] R. T. McCoy, S. Yao, D. Friedman, M. Hardy, T. L. Griffiths, Embers of Autoregression: Understanding Large Language Models through the Problem They Are Trained to Solve, arXiv [cs.CL] (2023); http://arxiv.org/abs/2309.13638 2

  43. Y. Razeghi, R. L. Logan IV, M. Gardner, S. Singh, Impact of Pretraining Term Frequencies on Few-Shot Reasoning, arXiv [cs.CL] (2022); http://arxiv.org/abs/2202.07206

  44. [industry] T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, … A. Dafoe, “Model Evaluation for Extreme Risks” (Google DeepMind, 2023); http://arxiv.org/abs/2305.15324 2 3 4 5

  45. R. Bommasani, D. Soylu, T. I. Liao, K. A. Creel, P. Liang, Ecosystem Graphs: The Social Footprint of Foundation Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2303.15772 2

  46. [industry] A. Das, W. Kong, R. Sen, Y. Zhou, A Decoder-Only Foundation Model for Time-Series Forecasting, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.10688

  47. [industry] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, I. Sutskever, “Jukebox: A Generative Model for Music” (OpenAI, 2020); http://arxiv.org/abs/2005.00341

  48. [industry] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, … T. Scialom, “Llama 2: Open Foundation and Fine-Tuned Chat Models” (Meta AI, 2023); http://arxiv.org/abs/2307.09288 2 3 4 5

  49. [industry] Gemini Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, S. Mariooryad, Y. Ding, X. Geng, F. Alcober, R. Frostig, M. Omernick, L. Walker, … O. Vinyals, “Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context” (Google DeepMind, 2024); https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf 2 3 4

  50. [industry] Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku” (Anthropic, 2024); .  2 3

  51. [industry] OpenAI, “GPT-4 System Card” (OpenAI, 2023);  2 3 4

  52. [industry] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, … W. E. Sayed, Mixtral of Experts, arXiv [cs.LG] (2024); http://arxiv.org/abs/2401.04088

  53. [industry] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, … Z. Fan, Qwen2 Technical Report, arXiv [cs.CL] (2024); http://arxiv.org/abs/2407.10671

  54. [industry] DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, … Z. Xie, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.04434 2

  55. L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, J. Wen, A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science 18, 186345 (2024); https://doi.org/10.1007/s11704-024-40231-1 2 3

  56. A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, “Large Language Models for Software Engineering: Survey and Open Problems” in 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) (2023), pp. 31–53; https://doi.org/10.1109/ICSE-FoSE59343.2023.00008

  57. [industry] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, F. Wei, VALL-E 2: Neural Codec Language Models Are Human Parity Zero-Shot Text to Speech Synthesizers, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.05370

  58. [industry] OpenAI, “GPT-4V(ision) System Card” (OpenAI, 2023); https://cdn.openai.com/gpt-4o-system-card.pdf

  59. [industry] P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. De Monicault, S. Garg, T. Gervet, S. Ghosh, A. Héliou, P. Jacob, A. Q. Jiang, K. Khandelwal, T. Lacroix, G. Lample, … S. Yang, Pixtral 12B, arXiv [cs.CV] (2024); http://arxiv.org/abs/2410.07073 2

  60. [industry] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, … J. Lin, Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution, arXiv [cs.CV] (2024); http://arxiv.org/abs/2409.12191 2

  61. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale” in The 9th International Conference on Learning Representations (ICLR 2021) (Virtual, 2020); https://openreview.net/forum?id=YicbFdNTTy

  62. [industry] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, R. Girshick, “Segment Anything” (Meta AI, 2023); http://arxiv.org/abs/2304.02643

  63. [industry] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, N. Ballas, “Revisiting Feature Prediction for Learning Visual Representations from Video” (Meta, 2024). 

  64. [industry] The Movie Gen team, “Movie Gen: A Cast of Media Foundation Models” (Meta, 2024); https://ai.meta.com/static-resource/movie-gen-research-paper

  65. [industry] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, A. Zeng, “Code as Policies: Language Model Programs for Embodied Control” in Workshop on Language and Robotics at CoRL 2022 (2022); https://openreview.net/forum?id=fmtvpopfLC6

  66. B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y. Lu, C. Parada, K. Rao, P. Sermanet, … C. K. Fu, “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances” in Proceedings of The 6th Annual Conference on Robot Learning (CoRL) (PMLR, Auckland, New Zealand, 2022) vol. 205; https://openreview.net/forum?id=bdHkMjBJG_w

  67. Open X-Embodiment Collaboration, A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, … Z. Lin, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, arXiv [cs.RO] (2023); http://arxiv.org/abs/2310.08864 2

  68. [industry] J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, Y. Zhou, J. Guo, D. Anguelov, M. Tan, EMMA: End-to-End Multimodal Model for Autonomous Driving, arXiv [cs.CV] (2024); http://arxiv.org/abs/2410.23262

  69. R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, M. Schwager, Foundation Models in Robotics: Applications, Challenges, and the Future, arXiv [cs.RO] (2023); http://arxiv.org/abs/2312.07843

  70. H. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, C. Lu, RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot. IEEE International Conference on Robotics and Automation, 653–660 (2023); https://doi.org/10.1109/ICRA57147.2024.10611615

  71. A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, … C. Finn, DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset, arXiv [cs.RO] (2024); http://arxiv.org/abs/2403.12945

  72. J. Wang, Z. Wu, Y. Li, H. Jiang, P. Shu, E. Shi, H. Hu, C. Ma, Y. Liu, X. Wang, Y. Yao, X. Liu, H. Zhao, Z. Liu, H. Dai, L. Zhao, B. Ge, … S. Zhang, Large Language Models for Robotics: Opportunities, Challenges, and Perspectives, arXiv [cs.RO] (2024); http://arxiv.org/abs/2401.04334

  73. [industry]Chai Discovery, J. Boitreaud, J. Dent, M. McPartlon, J. Meier, V. Reis, A. Rogozhnikov, K. Wu, Chai-1: Decoding the Molecular Interactions of Life, bioRxiv [preprint] (2024); https://doi.org/10.1101/2024.10.10.615955

  74. R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, … P. Liang, On the Opportunities and Risks of Foundation Models, arXiv [cs.LG] (2021); http://arxiv.org/abs/2108.07258 2

  75. P. Bryant, G. Pozzati, A. Elofsson, Improved Prediction of Protein-Protein Interactions Using AlphaFold2. Nature Communications 13, 1265 (2022); https://doi.org/10.1038/s41467-022-28865-w

  76. A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos, C. Xiong, Z. Z. Sun, R. Socher, J. S. Fraser, N. Naik, Large Language Models Generate Functional Protein Sequences across Diverse Families. Nature Biotechnology 41, 1099–1106 (2023); https://doi.org/10.1038/s41587-022-01618-2

  77. T. Davidson, J.-S. Denain, P. Villalobos, G. Bas, “AI Capabilities Can Be Significantly Improved without Expensive Retraining” (Epoch AI, 2023); http://arxiv.org/abs/2312.07413 2 3

  78. G. Mialon, R. Dessi, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Roziere, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, T. Scialom, Augmented Language Models: A Survey. Transactions on Machine Learning Research (2023); https://openreview.net/pdf?id=jh7wH2AzKK

  79. [industry] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-Consistency Improves Chain of Thought Reasoning in Language Models, arXiv [cs.CL] (2022); http://arxiv.org/abs/2203.11171

  80. [industry] B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, A. Mirhoseini, Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.21787 2

  81. S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. R. Narasimhan, “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=5Xc1ecxO1h

  82. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, … D. Amodei, “Language Models Are Few-Shot Learners” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2020) vol. 33, pp. 1877–1901; https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

  83. J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” in Advances in Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, US, 2022) vol. 35, pp. 24824–24837; https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

  84. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, “Large Language Models Are Zero-Shot Reasoners” in NeurIPS (New Orleans, LA, US, 2022); http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html

  85. [industry] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, J. Schulman, “WebGPT: Browser-Assisted Question-Answering with Human Feedback” (OpenAI, 2021); http://arxiv.org/abs/2112.09332 2 3 4

  86. [industry] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, G. Neubig, PAL: Program-Aided Language Models, arXiv [cs.CL] (2022); https://doi.org/10.48550/arXiv.2211.10435

  87. I. Drori, S. Zhang, R. Shuttleworth, L. Tang, A. Lu, E. Ke, K. Liu, L. Chen, S. Tran, N. Cheng, R. Wang, N. Singh, T. L. Patti, J. Lynch, A. Shporer, N. Verma, E. Wu, G. Strang, A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level, arXiv [cs.LG] (2021); https://pnas.org/doi/full/10.1073/pnas.2123433119

  88. [industry] W. Chen, X. Ma, X. Wang, W. W. Cohen, Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, arXiv [cs.CL] (2022); http://arxiv.org/abs/2211.12588

  89. W. Huang, P. Abbeel, D. Pathak, I. Mordatch, Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. (2022); https://openreview.net/forum?id=6NT1a56mNim

  90. I. Dasgupta, C. Kaeser-Chen, K. Marino, A. Ahuja, S. Babayan, F. Hill, R. Fergus, “Collaborating with Language Models for Embodied Reasoning” in Second Workshop on Language and Reinforcement Learning (2022); https://openreview.net/forum?id=YoS-abmWjJc

  91. Epoch AI, AI Benchmarking Dashboard (2024); https://epoch.ai/data/ai-benchmarking-dashboard

  92. [industry] OpenAI, Learning to Reason with LLMs (2024); https://openai.com/index/learning-to-reason-with-llms/ 2 3 4 5 6 7 8

  93. P. Villalobos, D. Atkinson, “Trading Off Compute in Training and Inference” (Epoch AI, 2023); https://epochai.org/blog/trading-off-compute-in-training-and-inference 2

  94. [industry] C. Snell, J. Lee, K. Xu, A. Kumar, Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.03314 2 3

  95. X. Hu, J. Chen, X. Li, Y. Guo, L. Wen, P. S. Yu, Z. Guo, Do Large Language Models Know about Facts?, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.05177

  96. R. Xu, Z. Qi, Z. Guo, C. Wang, H. Wang, Y. Zhang, W. Xu, “Knowledge Conflicts for LLMs: A Survey” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 8541–8565; https://doi.org/10.18653/v1/2024.emnlp-main.486

  97. M. Turpin, J. Michael, E. Perez, S. R. Bowman, “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=bzs4uPLXvi 2

  98. M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. M. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, E. Perez, “Towards Understanding Sycophancy in Language Models” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=tvhaxkMKAn  2 3 4 5 6

  99. [industry] Z. Wu, L. Qiu, A. Ross, E. Akyürek, B. Chen, B. Wang, N. Kim, J. Andreas, Y. Kim, Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models through Counterfactual Tasks, arXiv [cs.CL] (2023); http://arxiv.org/abs/2307.02477

  100. L. Zhang, X. Zhai, Z. Zhao, Y. Zong, X. Wen, B. Zhao, “What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-Modal Language Models” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024), pp. 21853–21862; 

  101. Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of Hallucination in Natural Language Generation. ACM Computing Surveys 55, 1–38 (2023); https://doi.org/10.1145/3571730 2 3 4

  102. [industry] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, S. Shi, Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2309.01219 2 3

  103. M. Zhang, O. Press, W. Merrill, A. Liu, N. A. Smith, How Language Model Hallucinations Can Snowball, arXiv [cs.CL] (2023); http://arxiv.org/abs/2305.13534 2

  104. L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.05232 2 3

  105. V. Rawte, A. Sheth, A. Das, A Survey of Hallucination in Large Foundation Models, arXiv [cs.AI] (2023); http://arxiv.org/abs/2309.05922 2

  106. J. Liu, W. Wang, D. Wang, N. Smith, Y. Choi, H. Hajishirzi, “Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 1264–1287; https://doi.org/10.18653/v1/2023.emnlp-main.81

  107. A. Leidinger, R. Van Rooij, E. Shutova, “Are LLMs Classical or Nonmonotonic Reasoners? Lessons from Generics” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), L.-W. Ku, A. Martins, V. Srikumar, Eds. (Association for Computational Linguistics, Bangkok, Thailand, 2024); https://doi.org/10.18653/v1/2024.acl-short.51

  108. M. Mitchell, AI’s Challenge of Understanding the World. Science 382, eadm8175 (2023); https://doi.org/10.1126/science.adm8175

  109. D. Halawi, F. Zhang, C. Yueh-Han, J. Steinhardt, Approaching Human-Level Forecasting with Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.18563

  110. [industry] I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, M. Farajtabar, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.05229 2 3 4 5 6

  111. [industry] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, D. Zhou, “Large Language Models Can Be Easily Distracted by Irrelevant Context” in Proceedings of the 40th International Conference on Machine Learning (PMLR, 2023), pp. 31210–31227; https://proceedings.mlr.press/v202/shi23a.html

  112. [industry] A. Hosseini, A. Sordoni, D. Toyama, A. Courville, R. Agarwal, Not All LLM Reasoners Are Created Equal, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.01748

  113. K. Z. Cui, M. Demirer, S. Jaffe, L. Musolff, S. Peng, T. Salz, The Productivity Effects of Generative AI: Evidence from a Field Experiment with GitHub Copilot. An MIT Exploration of Generative AI (2024); https://mit-genai.pubpub.org/pub/v5iixksv/release/2 2 3

  114. [industry] S. Peng, E. Kalliamvakou, P. Cihon, M. Demirer, The Impact of AI on Developer Productivity: Evidence from GitHub Copilot, arXiv [cs.SE] (2023); https://www.semanticscholar.org/reader/038f249ab708cebae2a58265b768b9b1cbadad3a 2 3 4

  115. A. Ziegler, E. Kalliamvakou, X. A. Li, A. Rice, D. Rifkin, S. Simister, G. Sittampalam, E. Aftandilian, Measuring GitHub Copilot’s Impact on Productivity. Communications of the ACM 67, 54–63 (2024);  2

  116. 2024 Stack Overflow Developer Survey (2024); https://survey.stackoverflow.co/2024/

  117. Stack Overflow Developer Survey 2023, Stack Overflow (2023); https://survey.stackoverflow.co/2023/?utm_source=social-share&utm_medium=social&utm_campaign=dev-survey-2023

  118. X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, … J. Tang, AgentBench: Evaluating LLMs as Agents, arXiv [cs.AI] (2023); http://arxiv.org/abs/2308.03688

  119. S. Yao, H. Chen, J. Yang, K. Narasimhan, WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, arXiv [cs.CL] (2022); http://arxiv.org/abs/2207.01206

  120. A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. White, P. Schwaller, “Augmenting Large Language Models with Chemistry Tools” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) AI for Science Workshop (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=wdGIL6lx3l 2

  121. [industry] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, P. Schwaller, ChemCrow: Augmenting Large-Language Models with Chemistry Tools, arXiv [physics.chem-ph] (2023); http://arxiv.org/abs/2304.05376 2

  122. C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, K. R. Narasimhan, “SWE-Bench: Can Language Models Resolve Real-World Github Issues?” in 12th International Conference on Learning Representations (2023); https://openreview.net/pdf?id=VTF8yNQM66 2 3 4 5 6 7 8

  123. [industry] L. Jing, Z. Huang, X. Wang, W. Yao, W. Yu, K. Ma, H. Zhang, X. Du, D. Yu, DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?, arXiv [cs.AI] (2024); http://arxiv.org/abs/2409.07703

  124. Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V. Dey, M. Xue, F. N. Baker, B. Burns, D. Adu-Ampratwum, X. Huang, X. Ning, … H. Sun, ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.05080

  125. [industry] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, A. Mądry, MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.07095 2

  126. Q. Huang, J. Vora, P. Liang, J. Leskovec, “MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation” in Forty-First International Conference on Machine Learning (2024); https://openreview.net/pdf?id=1Fs1LvjYQW

  127. R. Fang, R. Bindu, A. Gupta, Q. Zhan, D. Kang, LLM Agents Can Autonomously Hack Websites, arXiv [cs.CR] (2024); http://arxiv.org/abs/2402.06664 2 3 4 5

  128. X. Liang, L. Ma, S. Guo, J. Han, H. Xu, S. Ma, X. Liang, CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation, arXiv [cs.CV] (2023); http://arxiv.org/abs/2306.10322

  129. METR, Details about METR’s Preliminary Evaluation of OpenAI o1-Preview. (2024); https://metr.github.io/autonomy-evals-guide/openai-o1-preview-report/ 2 3 4

  130. [industry] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, O. Press, SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering, arXiv [cs.SE] (2024); http://arxiv.org/abs/2405.15793

  131. [industry] C. S. Xia, Y. Deng, S. Dunn, L. Zhang, Agentless: Demystifying LLM-Based Software Engineering Agents, arXiv [cs.SE] (2024); http://arxiv.org/abs/2407.01489

  132. [industry] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, … G. Neubig, OpenHands: An Open Platform for AI Software Developers as Generalist Agents, arXiv [cs.SE] (2024); http://arxiv.org/abs/2407.16741

  133. [industry] C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, H. Zhang, M. Zhu, GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation, arXiv [cs.RO] (2024); http://arxiv.org/abs/2410.06158

  134. B. Wang, J. Zhang, S. Dong, I. Fang, C. Feng, VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, arXiv [cs.RO] (2024); http://arxiv.org/abs/2410.08792

  135. [industry] S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, M. Seo, Latent Action Pretraining from Videos, arXiv [cs.RO] (2024); http://arxiv.org/abs/2410.11758

  136. M. Herrmann, F. J. D. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix, B. Bischl, “Position: Why We Must Rethink Empirical Research in Machine Learning” in Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, F. Berkenkamp, Eds. (PMLR, 2024) vol. 235, pp. 18228–18247; https://proceedings.mlr.press/v235/herrmann24b.html

  137. D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, J. Steinhardt, “Measuring Mathematical Problem Solving With the MATH Dataset” in 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Datasets and Benchmarks Track (Round 2) (Virtual, 2021); https://openreview.net/forum?id=7Bywt2mQsCe 2

  138. J. Au Yeung, Z. Kraljevic, A. Luintel, A. Balston, E. Idowu, R. J. Dobson, J. T. Teo, AI Chatbots Not yet Ready for Clinical Use. Frontiers in Digital Health 5, 1161098 (2023); https://doi.org/10.3389/fdgth.2023.1161098

  139. M. Hobbhahn, L. Heim, G. Aydos, “Trends in Machine Learning Hardware” (Epoch AI, 2023); https://epochai.org/blog/trends-in-machine-learning-hardware 2 3

  140. D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, … A. Williams, “Dynabench: Rethinking Benchmarking in NLP” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, 2021), pp. 4110–4124; https://doi.org/10.18653/v1/2021.naacl-main.324

  141. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, “Measuring Massive Multitask Language Understanding” in The 9th International Conference on Learning Representations (ICLR 2021) (Virtual, 2021); https://openreview.net/forum?id=d7KBjmI3GmQ 2

  142. D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, S. R. Bowman, GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv [cs.AI] (2023); http://arxiv.org/abs/2311.12022 2

  143. A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, … Z. Wu, Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research (2023); https://openreview.net/forum?id=uyTL5Bvosj

  144. [industry] L. Kilpatrick, S. B. Mallick, Updated Production-Ready Gemini Models, Reduced 1.5 Pro Pricing, Increased Rate Limits, and More, GEMINI (2024); https://developers.googleblog.com/en/updated-gemini-models-reduced-15-pro-pricing-increased-rate-limits-and-more/

  145. [industry] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, … J. Wei, Scaling Instruction-Finetuned Language Models, arXiv [cs.LG] (2022); http://arxiv.org/abs/2210.11416 2

  146. [industry] OpenAI, GPT-4o Mini: Advancing Cost-Efficient Intelligence (2024); https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ 2

  147. [industry] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, … B. Zoph, “GPT-4 Technical Report” (OpenAI, 2024); http://arxiv.org/abs/2303.08774 2 3 4 5 6 7

  148. [industry] OpenAI, Pricing (2024); https://openai.com/chatgpt/pricing/

  149. [industry] Together Pricing, together.ai (2023); https://www.together.ai/pricing

  150. B. Y. Lin, Y. Deng, K. Chandu, F. Brahman, A. Ravichander, V. Pyatkin, N. Dziri, R. L. Bras, Y. Choi, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.04770

  151. [industry] J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, J. Zou, Mixture-of-Agents Enhances Large Language Model Capabilities, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.04692

  152. J. Sevilla, “Training Compute of Frontier AI Models Grows by 4-5x per Year” (2024); https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year

  153. M. Mitchell, A. B. Palmarini, A. K. Moskvichev, “Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks” in AAAI 2024 Workshop Are Large Language Models Simply Causal Parrots?’’ (Vancouver, BC, Canada, 2024); https://openreview.net/forum?id=3rGT5OkzpC 2 3 4 5 6

  154. L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. C. Stickland, T. Korbak, O. Evans, “The Reversal Curse: LLMs Trained on ‘A Is B’ Fail to Learn ‘B Is A’” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2024); https://openreview.net/forum?id=GPKTIktA0k 2

  155. J. Geiping, A. Stein, M. Shu, K. Saifullah, Y. Wen, T. Goldstein, “Coercing LLMs to Do and Reveal (almost) Anything” in ICLR 2024 Workshop on Secure and Trustworthy Large Language Models (SET LLM) (Vienna, Austria, 2024); https://openreview.net/forum?id=Y5inHAjMu0 2

  156. [industry] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models, arXiv [cs.LG] (2020); http://arxiv.org/abs/2001.08361 2 3 4

  157. [industry] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rut  2 3 4 5

  158. [industry] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, C. Hallacy, B. Mann, A. Radford, A. Ramesh, N. Ryder, D. M. Ziegler, J. Schulman, … S. McCandlish, Scaling Laws for Autoregressive Generative Modeling, arXiv [cs.LG] (2020); http://arxiv.org/abs/2010.14701

  159. X. Zhai, A. Kolesnikov, N. Houlsby, L. Beyer, “Scaling Vision Transformers” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), pp. 1204–1213; https://doi.org/10.1109/CVPR52688.2022.01179

  160. A. L. Jones, Scaling Scaling Laws with Board Games, arXiv [cs.LG] (2021); http://arxiv.org/abs/2104.03113

  161. [industry] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, U. Sharma, Explaining Neural Scaling Laws, arXiv [cs.LG] (2021); http://arxiv.org/abs/2102.06701

  162. [industry] A. Maloney, D. A. Roberts, J. Sully, A Solvable Model of Neural Scaling Laws, arXiv [cs.LG] (2022); http://arxiv.org/abs/2210.16859

  163. U. Sharma, J. Kaplan, Scaling Laws from the Data Manifold Dimension. Journal of Machine Learning Research: JMLR 23, 343–376 (2022); https://dl.acm.org/doi/abs/10.5555/3586589.3586598

  164. Ł. Dębowski, A Simplistic Model of Neural Scaling Laws: Multiperiodic Santa Fe Processes, arXiv [cs.IT] (2023); http://arxiv.org/abs/2302.09049

  165. E. J. Michaud, Z. Liu, U. Girit, M. Tegmark, “The Quantization Model of Neural Scaling” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=3tbTw2ga8K

  166. [industry] T. Besiroglu, E. Erdil, M. Barnett, J. You, Chinchilla Scaling: A Replication Attempt, arXiv [cs.AI] (2024); http://arxiv.org/abs/2404.10102

  167. T. Porian, M. Wortsman, J. Jitsev, L. Schmidt, Y. Carmon, “Resolving Discrepancies in Compute-Optimal Scaling of Language Models” in 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024) (2024); https://openreview.net/forum?id=zhCBrgaQZ0

  168. [industry] T. Pearce, J. Song, Reconciling Kaplan and Chinchilla Scaling Laws, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.12907

  169. E. Caballero, K. Gupta, I. Rish, D. Krueger, “Broken Neural Scaling Laws” in NeurIPS ML Safety Workshop (2022); https://openreview.net/forum?id=BfGrlFuNyhJ

  170. [industry] S. Hooker, On the Limitations of Compute Thresholds as a Governance Strategy, arXiv [cs.AI] (2024); http://arxiv.org/abs/2407.05694 2 3

  171. S. Biderman, U. S. Prashanth, L. Sutawika, H. Schoelkopf, Q. G. Anthony, S. Purohit, E. Raff, “Emergent and Predictable Memorization in Large Language Models” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=Iq0DvhB4Kf  .  2 3

  172. D. Ganguli, D. Hernandez, L. Lovitt, A. Askell, Y. Bai, A. Chen, T. Conerly, N. Dassarma, D. Drain, N. Elhage, S. El Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, S. Johnston, A. Jones, N. Joseph, … J. Clark, “Predictability and Surprise in Large Generative Models” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 1747–1764; https://doi.org/10.1145/3531146.3533229 2 3

  173. [industry] Z. Du, A. Zeng, Y. Dong, J. Tang, Understanding Emergent Abilities of Language Models from the Loss Perspective, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.15796 2

  174. J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus, Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022); https://openreview.net/forum?id=yzkSU5zdwD 2 3

  175. S. Y. Gadre, G. Smyrnis, V. Shankar, S. Gururangan, M. Wortsman, R. Shao, J. Mercat, A. Fang, J. Li, S. Keh, R. Xin, M. Nezhurina, I. Vasiljevic, J. Jitsev, L. Soldaini, A. G. Dimakis, G. Ilharco, … L. Schmidt, Language Models Scale Reliably with over-Training and on Downstream Tasks, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.08540

  176. R. Schaeffer, B. Miranda, S. Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=ITw9edRDlD

  177. Y. Ruan, C. J. Maddison, T. Hashimoto, “Observational Scaling Laws and the Predictability of Langauge Model Performance” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=On5WIN7xyD

  178. T. R. McIntosh, T. Susnjak, T. Liu, P. Watters, M. N. Halgamuge, Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence, arXiv [cs.AI] (2024); http://arxiv.org/abs/2402.09880 2 3

  179. [industry] V. Balachandran, J. Chen, N. Joshi, B. Nisha, H. Palangi, E. Salinas, V. Vineet, J. Woffinden-Luey, S. Yousefi, “EUREKA: Evaluating and Understanding Large Foundation Models” (Microsoft, 2024); https://www.microsoft.com/en-us/research/publication/eureka-evaluating-and-understanding-large-foundation-models/

  180. [industry] S. Srivastava, M. B. Annarose, P. V. Anto, S. Menon, A. Sukumar, S. T. Adwaith, A. Philipose, S. Prince, S. Thomas, Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap, arXiv [cs.AI] (2024); http://arxiv.org/abs/2402.19450

  181. C. Deng, Y. Zhao, X. Tang, M. Gerstein, A. Cohan, Investigating Data Contamination in Modern Benchmarks for Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.09783

  182. O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, E. Agirre, “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 10776–10787; https://doi.org/10.18653/v1/2023.findings-emnlp.722

  183. Y. Cao, L. Zhou, S. Lee, L. Cabello, M. Chen, D. Hershcovich, “Assessing Cross-Cultural Alignment between ChatGPT and Human Societies: An Empirical Study” in Proceedings of the 1st Workshop on Cross-Cultural Considerations in NLP (C3NLP), S. Dev, V. Prabhakaran, D. Adelani, D. Hovy, L. Benotti, Eds. (Association for Computational Linguistics, Dubrovnik, Croatia, 2023), pp. 53–67; https://doi.org/10.18653/v1/2023.c3nlp-1.7 2 3

  184. [industry] H. Zhou, A. Bradley, E. Littwin, N. Razin, O. Saremi, J. Susskind, S. Bengio, P. Nakkiran, What Algorithms Can Transformers Learn? A Study in Length Generalization, arXiv [cs.LG] (2023); http://arxiv.org/abs/2310.16028

  185. D. Yu, S. Kaur, A. Gupta, J. Brown-Cohen, A. Goyal, S. Arora, “SKILL-MIX: A Flexible and Expandable Family of Evaluations for AI Models” in 12th International Conference on Learning Representations (2024); https://openreview.net/pdf?id=Jf5gplvglq

  186. [industry] H. Zhang, J. Da, D. Lee, V. Robinson, C. Wu, W. Song, T. Zhao, P. Raja, D. Slack, Q. Lyu, S. Hendryx, R. Kaplan, M. Lunati, S. Yue, A Careful Examination of Large Language Model Performance on Grade School Arithmetic, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.00332

  187. [industry] AlphaProof, AlphaGeometry teams, AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems, Google DeepMind (2024); https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ 2 3

  188. T. H. Trinh, Y. Wu, Q. V. Le, H. He, T. Luong, Solving Olympiad Geometry without Human Demonstrations. Nature 625, 476–482 (2024); https://doi.org/10.1038/s41586-023-06747-5 2

  189. E. Akyürek, M. Damani, L. Qiu, H. Guo, Y. Kim, J. Andreas, The Surprising Effectiveness of Test-Time Training for Abstract Reasoning, arXiv [cs.AI] (2024); http://arxiv.org/abs/2411.07279

  190. Y. Bengio, G. Hinton, A. Yao, D. Song, P. Abbeel, T. Darrell, Y. N. Harari, Y.-Q. Zhang, L. Xue, S. Shalev-Shwartz, G. Hadfield, J. Clune, T. Maharaj, F. Hutter, A. G. Baydin, S. McIlraith, Q. Gao, … S. Mindermann, Managing Extreme AI Risks amid Rapid Progress. Science, eadn0117 (2024); https://doi.org/10.1126/science.adn0117 2 3 4

  191. Y. LeCun, The Power and Limits of Deep Learning: In His IRI Medal Address, Yann LeCun Maps the Development of Machine Learning Techniques and Suggests What the Future May Hold. Research Technology Management 61, 22–27 (2018); https://doi.org/10.1080/08956308.2018.1516928 2

  192. M. Mitchell, “Why AI Is Harder than We Think” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ‘21) (Association for Computing Machinery, New York, NY, USA, 2021), p. 3; https://doi.org/10.1145/3449639.3465421

  193. J. Pearl, D. Mackenzie, The Book of Why: The New Science of Cause and Effect (Penguin Books, Harlow, England, 2019)Penguin science; https://dl.acm.org/doi/10.5555/3238230 2

  194. D. C. Cireşan, U. Meier, L. M. Gambardella, J. Schmidhuber, Deep, Big, Simple Neural Nets for Handwritten Digit Recognition. Neural Computation 22, 3207–3220 (2010); https://doi.org/10.1162/NECO_a_00052

  195. T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, “Recurrent Neural Network Based Language Model” in Proc. Interspeech 2010 (ISCA, 2010), pp. 1045–1048; https://doi.org/10.21437/Interspeech.2010-343

  196. X. Glorot, Y. Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks” in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Yee Whye Teh, Mike Titterington, Eds. (PMLR, 2010) vol. 9, pp. 249–256; https://proceedings.mlr.press/v9/glorot10a.html

  197. Epoch AI, Data on Notable AI Models. (2024); https://epochai.org/data/notable-ai-models 2 3 4 5

  198. [industry] Inflection AI, Inflection-2 (2023); https://inflection.ai/inflection-2

  199. C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Bai, M. Gschwind, A. Gupta, M. Ott, A. Melnikov, S. Candido, D. Brooks, G. Chauhan, … K. Hazelwood, “Sustainable AI: Environmental Implications, Challenges and Opportunities” in Proceedings of the 5th Conference on Machine Learning and Systems (MLSys), D. Marculescu, Y. Chi, C. Wu, Eds. (2022) vol. 4, pp. 795–813; https://proceedings.mlsys.org/paper_files/paper/2022/file/462211f67c7d858f663355eff93b745e-Paper.pdf 2 3 4 5 6

  200. [industry] Y. Wu, Z. Sun, S. Li, S. Welleck, Y. Yang, Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models, arXiv [cs.AI] (2024); http://arxiv.org/abs/2408.00724

  201. S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, Z. Hu, “Reasoning with Language Model Is Planning with World Model” in The 2023 Conference on Empirical Methods in Natural Language Processing (2023); https://openreview.net/pdf?id=VTWWvYtF1R

  202. [industry] X. Feng, Z. Wan, M. Wen, Y. Wen, W. Zhang, J. Wang, “Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training” in NeurIPS 2023 Foundation Models for Decision Making Workshop (New Orleans, LA, US, 2023); https://openreview.net/pdf?id=PJfc4x2jXY

  203. [industry] C. Li, W. Wang, J. Hu, Y. Wei, N. Zheng, H. Hu, Z. Zhang, H. Peng, Common 7B Language Models Already Possess Strong Math Capabilities, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.04706

  204. E. Erdil, Optimally Allocating Compute Between Inference and Training. (2024); https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training

  205. K. Chow, Y. Tang, Z. Lyu, A. Rajput, K. Ban, “Performance Optimization in the LLM World 2024” in Companion of the 15th ACM/SPEC International Conference on Performance Engineering (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3629527.3651436

  206. D. Patterson, J. Gonzalez, U. Hölzle, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. R. So, M. Texier, J. Dean, The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer 55, 18–28 (2022); https://doi.org/10.1109/MC.2022.3148714 2 3 4 5 6

  207. D. Coyle, L. Hampton, 21st Century Progress in Computing. Telecommunications Policy 48, 102649 (2024); https://doi.org/10.1016/j.telpol.2023.102649

  208. International Energy Agency, “Electricity 2024: Analysis and Forecast to 2026” (IEA, 2024); https://iea.blob.core.windows.net/assets/6b2fd954-2017-408e-bf08-952fdd62118a/Electricity2024-Analysisandforecastto2026.pdf 2

  209. Talen Energy, Talen Energy Announces Sale of Zero-Carbon Data Center Campus (2024); https://ir.talenenergy.com/news-releases/news-release-details/talen-energy-announces-sale-zero-carbon-data-center-campus

  210. Advanced Electronics Practice, H. Bauer, O. Burkacky, P. Kenevan, S. Lingemann, K. Pototzky, B. Wiseman, “Semiconductor Design and Manufacturing: Achieving Leading-Edge Capabilities” (McKinsey & Company, 2020); https://www.mckinsey.com/industries/industrials-and-electronics/our-insights/semiconductor-design-and-manufacturing-achieving-leading-edge-capabilities/

  211. J. VerWey, “No Permits, No Fabs: The Importance of Regulatory Reform for Semiconductor Manufacturing” (Center for Security and Emerging Technology, 2021); https://doi.org/10.51593/20210053

  212. D. Bragg, N. Caselli, J. A. Hochgesang, M. Huenerfauth, L. Katz-Hernandez, O. Koller, R. Kushalnagar, C. Vogler, R. E. Ladner, The FATE Landscape of Sign Language AI Datasets: An Interdisciplinary Perspective. ACM Transactions on Accessible Computing 14, 1–45 (2021); https://doi.org/10.1145/3436996 2

  213. G. Li, Z. Sun, Q. Wang, S. Wang, K. Huang, N. Zhao, Y. Di, X. Zhao, Z. Zhu, China’s Green Data Center development:Policies and Carbon Reduction Technology Path. Environmental Research 231, 116248 (2023); https://doi.org/10.1016/j.envres.2023.116248 2

  214. E. Griffith, The Desperate Hunt for the A.I. Boom’s Most Indispensable Prize, The New York Times (2023); https://www.nytimes.com/2023/08/16/technology/ai-gpu-chips-shortage.html

  215. J. Sevilla, T. Besiroglu, B. Cottier, J. You, E. Roldán, P. Villalobos, E. Erdil, Can AI Scaling Continue Through 2030? (2024); https://epochai.org/blog/can-ai-scaling-continue-through-2030 2 3 4 5

  216. E. Erdil, “Data Movement Bottlenecks to Large-Scale Model Training: Scaling Past 1e28 FLOP” (Epoch AI, 2024); https://epoch.ai/blog/data-movement-bottlenecks-scaling-past-1e28-flop

  217. [industry] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, G. Penedo, The Falcon Series of Open Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.16867

  218. [industry] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang, B. Li, C. Cheng, W. Lü, R. Hu, C. Li, L. Yang, X. Luo, X. Wu, L. Liu, W. Cheng, P. Cheng, … Y. Zhou, Skywork: A More Open Bilingual Foundation Model, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.19341

  219. P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, A. Ho, Will We Run out of Data? Limits of LLM Scaling Based on Human-Generated Data, arXiv [cs.LG] (2022); http://arxiv.org/abs/2211.04325

  220. N. Muennighoff, A. Rush, B. Barak, T. Le Scao, N. Tazi, A. Piktus, S. Pyysalo, T. Wolf, C. A. Raffel, “Scaling Data-Constrained Language Models” in Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track (New Orleans, LA, US, 2023) vol. 36, pp. 50358–50376; https://proceedings.neurips.cc/paper_files/paper/2023/hash/9d89448b63ce1e2e8dc7af72c984c196-Abstract-Conference.html

  221. [industry] A. Sohn, A. Nagabandi, C. Florensa, D. Adelberg, D. Wu, H. Farooq, I. Clavera, J. Welborn, J. Chen, N. Mishra, P. Chen, P. Qian, P. Abbeel, R. Duan, V. Vijay, Y. Liu, Introducing RFM-1: Giving Robots Human-like Reasoning Capabilities, covariant (2024); https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/

  222. H. Abdine, M. Chatzianastasis, C. Bouyioukos, M. Vazirgiannis, “Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Deep Generative Models for Health Workshop (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=EJ7YNgWYFj

  223. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision” in Proceedings of the 38th International Conference on Machine Learning (ICML 2021) (PMLR, 2021), pp. 8748–8763; https://proceedings.mlr.press/v139/radford21a.html

  224. [industry] Seamless Communication, L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman, C. Klaiber, P. Li, D. Licht, J. Maillard, A. Rakotoarison, K. R. Sadagopan, … S. Wang, “SeamlessM4T: Massively Multilingual & Multimodal Machine Translation” (Meta AI, 2023); http://arxiv.org/abs/2308.11596

  225. P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, M. Hobbhahn, “Position: Will We Run out of Data? Limits of LLM Scaling Based on Human-Generated Data” in Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, F. Berkenkamp, Eds. (PMLR, 2024) vol. 235 of Proceedings of Machine Learning Research, pp. 49523–49544; https://proceedings.mlr.press/v235/villalobos24a.html

  226. [industry] L. Fan, K. Chen, D. Krishnan, D. Katabi, P. Isola, Y. Tian, Scaling Laws of Synthetic Images for Model Training … for Now, arXiv [cs.CV] (2023); http://arxiv.org/abs/2312.04567 2

  227. S. Fu, N. Y. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, P. Isola, “DreamSim: Learning New Dimensions of Human Visual Similarity Using Synthetic Data” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=DEiNSfh1k7 2

  228. Y. Tian, L. Fan, P. Isola, H. Chang, D. Krishnan, “StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=xpjsOQtKqx

  229. I. Shumailov, Z. Shumaylov, Y. Zhao, Y. Gal, N. Papernot, R. Anderson, The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv [cs.LG] (2023); http://arxiv.org/abs/2305.17493 2

  230. G. Martínez, L. Watson, P. Reviriego, J. A. Hernández, M. Juarez, R. Sarkar, Combining Generative Artificial Intelligence (AI) and the Internet: Heading towards Evolution or Degradation?, arXiv [cs.CV] (2023); http://arxiv.org/abs/2303.01255

  231. R. Hataya, H. Bao, H. Arai, “Will Large-Scale Generative Models Corrupt Future Datasets?” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2023), pp. 20498–20508; https://doi.org/10.1109/iccv51070.2023.01879 2

  232. G. Martínez, L. Watson, P. Reviriego, J. A. Hernández, M. Juarez, R. Sarkar, “Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet” in Lecture Notes in Computer Science (Springer Nature Switzerland, Cham, 2024) vol. 14523 of Lecture notes in computer science, pp. 59–73; https://doi.org/10.1007/978-3-031-57963-9_5

  233. Y. Guo, G. Shang, M. Vazirgiannis, C. Clavel, The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.09807

  234. [industry] M. Bohacek, H. Farid, Nepotistically Trained Generative-AI Models Collapse, arXiv [cs.AI] (2023); http://arxiv.org/abs/2311.12202

  235. S. Alemohammad, J. Casco-Rodriguez, L. Luzi, A. I. Humayun, H. Babaei, D. LeJeune, A. Siahkoohi, R. Baraniuk, “Self-Consuming Generative Models Go MAD” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=ShjMHfmPs0 2

  236. Q. Bertrand, J. Bose, A. Duplessis, M. Jiralerspong, G. Gidel, “On the Stability of Iterative Retraining of Generative Models on Their Own Data” in 12th International Conference on Learning Representations (2024); https://openreview.net/forum?id=JORAfH2xFd

  237. [industry] E. Dohmatob, Y. Feng, P. Yang, F. Charton, J. Kempe, A Tale of Tails: Model Collapse as a Change of Scaling Laws, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.07043

  238. R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, X. Qi, “Is Synthetic Data from Generative Models Ready for Image Recognition?” in 11th International Conference on Learning Representations (ICLR 2023) (Kigali, Rwanda, 2022); https://openreview.net/pdf?id=nUmCcZ5RKF

  239. [industry] V. Boutin, L. Singhal, X. Thomas, T. Serre, “Diversity vs. Recognizability: Human-like Generalization in One-Shot Generative Models” in Advances in Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, US, 2022); https://openreview.net/pdf?id=DVfZKXSFW5m

  240. V. Boutin, T. Fel, L. Singhal, R. Mukherji, A. Nagaraj, J. Colin, T. Serre, “Diffusion Models as Artists: Are We Closing the Gap between Humans and Machines?” in Proceedings of the 40th International Conference on Machine Learning (PMLR, 2023), pp. 2953–3002; https://proceedings.mlr.press/v202/boutin23a.html

  241. J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, C. Fookes, “Diversity Is Definitely Needed: Improving Model-Agnostic Zero-Shot Classification via Stable Diffusion” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE, 2023), pp. 769–778; https://doi.org/10.1109/cvprw59228.2023.00084

  242. [industry] A. Setlur, S. Garg, X. Geng, N. Garg, V. Smith, A. Kumar, RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.14532

  243. P. Haluptzok, M. Bowers, A. T. Kalai, “Language Models Can Teach Themselves to Program Better” in Deep Reinforcement Learning Workshop NeurIPS 2022 (2022); https://openreview.net/forum?id=_5BZwkZRFc9

  244. [industry] B. Liu, S. Bubeck, R. Eldan, J. Kulkarni, Y. Li, A. Nguyen, R. Ward, Y. Zhang, TinyGSM: Achieving >80% on GSM8k with Small Language Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2312.09241

  245. [industry] D. Hernandez, T. B. Brown, Measuring the Algorithmic Efficiency of Neural Networks, arXiv [cs.LG] (2020); http://arxiv.org/abs/2005.04305

  246. A. Ho, T. Besiroglu, E. Erdil, D. Owen, R. Rahman, Z. C. Guo, D. Atkinson, N. Thompson, J. Sevilla, “Algorithmic Progress in Language Models” (Epoch AI, 2024); http://arxiv.org/abs/2403.05812 2

  247. F. E. Dorner, Measuring Progress in Deep Reinforcement Learning Sample Efficiency, arXiv [cs.LG] (2021); http://arxiv.org/abs/2102.04881

  248. [industry] Y. Ding, L. L. Zhang, C. Zhang, Y. Xu, N. Shang, J. Xu, F. Yang, M. Yang, LongRoPE: Extending LLM Context Window beyond 2 Million Tokens, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.13753

  249. A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R Ruiz, J. Schrittwieser, G. Swirszcz, D. Silver, D. Hassabis, P. Kohli, Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning. Nature 610, 47–53 (2022); https://doi.org/10.1038/s41586-022-05172-4

  250. A. Haj-Ali, N. K. Ahmed, T. Willke, Y. S. Shao, K. Asanovic, I. Stoica, “NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning” in Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020) (Association for Computing Machinery, New York, NY, USA, 2020), pp. 242–255; https://doi.org/10.1145/3368826.3377928

  251. A. Goldie, A. Mirhoseini, M. Yazgan, J. W. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, A. Nova, J. Pak, A. Tong, K. Srinivasa, W. Hang, E. Tuncer, Q. V. Le, J. Laudon, … J. Dean, Addendum: A Graph Placement Methodology for Fast Chip Design. Nature 634, E10–E11 (2024); https://doi.org/10.1038/s41586-024-08032-5

  252. X. Li, P. Yu, C. Zhou, T. Schick, O. Levy, L. Zettlemoyer, J. E. Weston, M. Lewis, “Self-Alignment with Instruction Backtranslation” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=1oijHJBRsT

  253. S. Liu, Z. Lin, S. Yu, R. Lee, T. Ling, D. Pathak, D. Ramanan, Language Models as Black-Box Optimizers for Vision-Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2309.05950

  254. R. Pryzant, D. Iter, J. Li, Y. Lee, C. Zhu, M. Zeng, “Automatic Prompt Optimization with ‘Gradient Descent’ and Beam Search” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 7957–7968; https://doi.org/10.18653/v1/2023.emnlp-main.494

  255. S. Zhang, C. Gong, L. Wu, X. Liu, M. Zhou, AutoML-GPT: Automatic Machine Learning with GPT, arXiv [cs.CL] (2023); http://arxiv.org/abs/2305.02499

  256. [industry] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, … J. Kaplan, Constitutional AI: Harmlessness from AI Feedback, arXiv [cs.CL] (2022); http://arxiv.org/abs/2212.08073 2

  257. [industry] N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong, E. H. Chi, J. Caverlee, J. McAuley, D. Z. Cheng, How to Train Data-Efficient LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.09668

  258. [industry] S. Kumar, T. Ghosal, V. Goyal, A. Ekbal, Can Large Language Models Unlock Novel Scientific Research Ideas?, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.06185

  259. H. Wijk, T. Lin, J. Becker, S. Jawhar, N. Parikh, T. Broadley, L. Chan, M. Chen, J. Clymer, J. Dhyani, E. Ericheva, K. Garcia, B. Goodrich, N. Jurkovic, M. Kinniment, A. Lajko, S. Nix, … E. Barnes, RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts, arXiv [cs.LG] (2024); http://arxiv.org/abs/2411.15114 2 3

  260. D. Owen, “Interviewing AI Researchers on Automation of AI R&D” (Epoch AI, 2024); https://epoch.ai/blog/interviewing-ai-researchers-on-automation-of-ai-rnd

  261. [industry] E. Erdil, J. Sevilla, Power Law Trends in Speedrunning and Machine Learning, arXiv [cs.LG] (2023); http://arxiv.org/abs/2304.10004

  262. [industry] J. Droppo, O. Elibol, Scaling Laws for Acoustic Models, arXiv [eess.AS] (2021); http://arxiv.org/abs/2106.09488

  263. S. Hooker, The Hardware Lottery. Communications of the ACM 64, 58–65 (2021); https://doi.org/10.1145/3467017

  264. [industry] Q. Anthony, J. Hatef, D. Narayanan, S. Biderman, S. Bekman, J. Yin, A. Shafi, H. Subramoni, D. Panda, The Case for Co-Designing Model Architectures with Hardware, arXiv [cs.DC] (2024); http://arxiv.org/abs/2401.14489

  265. [industry] F. Mince, D. Dinh, J. Kgomo, N. Thompson, S. Hooker, The Grand Illusion: The Myth of Software Portability and Implications for ML Progress, arXiv [cs.SE] (2023); http://arxiv.org/abs/2309.07181

  266. [industry] The Scale Team, Submit Your Toughest Questions for Humanity’s Last Exam, scale (2024); https://scale.com/blog/humanitys-last-exam

  267. ARC Prize, ARC Prize, ARC Prize (2024); https://arcprize.org/

  268. Department for Science, Innovation and Technology, “AI Safety Institute Approach to Evaluations” (GOV.UK, 2024); https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations

  269. Metr, An Update on Our General Capability Evaluations, METR (2024); https://metr.org/blog/2024-08-06-update-on-evaluations/

  270. G. Sastry, L. Heim, H. Belfield, M. Anderljung, M. Brundage, J. Hazell, C. O’Keefe, G. K. Hadfield, R. Ngo, K. Pilz, G. Gor, E. Bluemke, S. Shoker, J. Egan, R. F. Trager, S. Avin, A. Weller, … D. Coyle, Computing Power and the Governance of Artificial Intelligence, arXiv [cs.CY] (2024); http://arxiv.org/abs/2402.08797 2 3

  271. D. Citron, R. Chesney, Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security. California Law Review 107, 1753 (2019); https://scholarship.law.bu.edu/faculty_scholarship/640 2 3

  272. United Nations, Universal Declaration of Human Rights (1948); https://www.un.org/en/about-us/universal-declaration-of-human-rights 2 3

  273. V. Ciancaglini, C. Gibson, D. Sancho, O. McCarthy, M. Eira, P. Amann, A. Klayn, “Malicious Uses and Abuses of Artificial Intelligence” (European Union Agency for Law Enforcement Cooperation, 2020);  2

  274. P. V. Falade, Decoding the Threat Landscape: ChatGPT, FraudGPT, and WormGPT in Social Engineering Attacks. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 9, 185–198 (2023); https://doi.org/10.32628/CSEIT2390533

  275. J. Bateman, “Deepfakes and Synthetic Media in the Financial System: Assessing Threat Scenarios” (Carnegie Endowment for International Peace, 2020); https://carnegieendowment.org/research/2020/07/deepfakes-and-synthetic-media-in-the-financial-system-assessing-threat-scenarios?lang=en

  276. US Federal Bureau of Investigation, Alert Number I-060523-PSA: Malicious Actors Manipulating Photos and Videos to Create Explicit Content and Sextortion Schemes (2023); https://www.ic3.gov/PSA/2023/psa230605

  277. A. Kaur, A. Noori Hoshyar, V. Saikrishna, S. Firmin, F. Xia, Deepfake Video Detection: Challenges and Opportunities. Artificial Intelligence Review 57, 1–47 (2024); https://doi.org/10.1007/s10462-024-10810-6

  278. R. Umbach, N. Henry, G. Beard, C. Berryessa, Non-Consensual Synthetic Intimate Imagery: Prevalence, Attitudes, and Knowledge in 10 Countries, arXiv [cs.CY] (2024); http://arxiv.org/abs/2402.01721

  279. M. B. Kugler, C. Pace, Deepfake Privacy: Attitudes and Regulation. Northwestern University Law Review 116, 611–680 (2021); https://scholarlycommons.law.northwestern.edu/nulr/vol116/iss3/1

  280. M. Viola, C. Voto, Designed to Abuse? Deepfakes and the Non-Consensual Diffusion of Intimate Images. Synthese 201, 30 (2023); https://doi.org/10.1007/s11229-022-04012-2

  281. S. Maddocks, “A Deepfake Porn Plot Intended to Silence Me”: Exploring Continuities between Pornographic and “political” Deep Fakes. Porn Studies 7, 415–423 (2020); https://doi.org/10.1080/23268743.2020.1757499

  282. H. Ajder, G. Patrini, F. Cavalli, L. Cullen, “The State of Deepfakes: Landscape, Threats, and Impact” (Deeptrace, 2019); https://regmedia.co.uk/2019/10/08/deepfake_report.pdf

  283. J. Laffier, A. Rehman, Deepfakes and Harm to Women. Journal of Digital Life and Learning 3, 1–21 (2023); https://doi.org/10.51357/jdll.v3i1.218

  284. [industry] T. Sippy, F. Enock, J. Bright, H. Z. Margetts, Behind the Deepfake: 8% Create; 90% Concerned. Surveying Public Exposure to and Perceptions of Deepfakes in the UK, arXiv [cs.CY] (2024); http://arxiv.org/abs/2407.05529

  285. D. Thiel, “Identifying and Eliminating CSAM in Generative ML Training Data and Models” (Stanford Digital Repository, 2023); https://purl.stanford.edu/kh752sm9123

  286. Ofcom, A Deep Dive into Deepfakes That Demean, Defraud and Disinform (2024); https://www.ofcom.org.uk/online-safety/illegal-and-harmful-content/deepfakes-demean-defraud-disinform/ 2

  287. S. Dunn, Legal Definitions of Intimate Images in the Age of Sexual Deepfakes and Generative AI, Social Science Research Network (2024); https://papers.ssrn.com/abstract=4813941

  288. Y. Mirsky, W. Lee, The Creation and Detection of Deepfakes: A Survey, arXiv [cs.CV] (2020); http://arxiv.org/abs/2004.11138

  289. A. Lewis, P. Vu, R. Duch, A. Chowdhury, Do Content Warnings Help People Spot a Deepfake? Evidence from Two Experiments (2022); https://royalsociety.org/-/media/policy/projects/online-information-environment/do-content-warnings-help-people-spot-a-deepfake.pdf 2 3

  290. A. Qureshi, D. Megías, M. Kuribayashi, “Detecting Deepfake Videos Using Digital Watermarking” in 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (2021), pp. 1786–1793; http://www.apsipa.org/proceedings/2021/pdfs/0001786.pdf 2

  291. L. Tang, Q. Ye, H. Hu, Q. Xue, Y. Xiao, J. Li, DeepMark: A Scalable and Robust Framework for DeepFake Video Detection. ACM Transactions on Privacy and Security 27, 1–26 (2024); https://doi.org/10.1145/3629976 2

  292. L.-Y. Hsu, AI-Assisted Deepfake Detection Using Adaptive Blind Image Watermarking. Journal of Visual Communication and Image Representation 100, 104094 (2024); https://doi.org/10.1016/j.jvcir.2024.104094 2

  293. Y. Zhao, B. Liu, M. Ding, B. Liu, T. Zhu, X. Yu, “Proactive Deepfake Defence via Identity Watermarking” in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2023), pp. 4591–4600; https://doi.org/10.1109/WACV56688.2023.00458 2

  294. [industry] S. Gowal, P. Kohli, Identifying AI-Generated Images with SynthID, Google DeepMind (2023); https://deepmind.google/discover/blog/identifying-ai-generated-images-with-synthid/ 2

  295. A. J. Patil, R. Shelke, An Effective Digital Audio Watermarking Using a Deep Convolutional Neural Network with a Search Location Optimization Algorithm for Improvement in Robustness and Imperceptibility. High-Confidence Computing 3, 100153 (2023); https://doi.org/10.1016/j.hcc.2023.100153 2

  296. M. S. Uddin, Ohidujjaman, M. Hasan, T. Shimamura, Audio Watermarking: A Comprehensive Review. International Journal of Advanced Computer Science and Applications 15 (2024); https://doi.org/10.14569/IJACSA.2024.01505141 2

  297. S. Abdelnabi, M. Fritz, “Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding” in IEEE Symposium on Security and Privacy (2021), pp. 121–140; https://doi.org/10.1109/SP40001.2021.00083 2

  298. [industry] X. Zhao, K. Zhang, Z. Su, S. Vasan, I. Grishchenko, C. Kruegel, G. Vigna, Y.-X. Wang, L. Li, Invisible Image Watermarks Are Provably Removable Using Generative AI, arXiv [cs.CR] (2023); http://arxiv.org/abs/2306.01953 2

  299. M. Saberi, V. S. Sadasivan, K. Rezaei, A. Kumar, A. Chegini, W. Wang, S. Feizi, “Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks” in 12th International Conference on Learning Representations (2023); https://openreview.net/pdf?id=dLoAdIKENc 2 3

  300. G. Björksten, “Identifying Generative AI Content: When and How Watermarking Can Help Uphold Human Rights” (accessnow, 2023); https://www.accessnow.org/wp-content/uploads/2023/09/Identifying-generative-AI-content-when-and-how-watermarking-can-help-uphold-human-rights.pdf 2

  301. D. Cooke, A. Edwards, S. Barkoff, K. Kelly, As Good As A Coin Toss: Human Detection of AI-Generated Images, Videos, Audio, and Audiovisual Stimuli, arXiv [cs.HC] (2024); http://arxiv.org/abs/2403.16760

  302. M. Jakesch, J. T. Hancock, M. Naaman, Human Heuristics for AI-Generated Language Are Flawed. Proceedings of the National Academy of Sciences of the United States of America 120, e2208839120 (2023); https://doi.org/10.1073/pnas.2208839120

  303. G. Spitale, N. Biller-Andorno, F. Germani, AI Model GPT-3 (dis)informs Us Better than Humans. Science Advances 9, eadh1850 (2023); https://doi.org/10.1126/sciadv.adh1850

  304. S. Kreps, R. M. McCain, M. Brundage, All the News That’s Fit to Fabricate: AI-Generated Text as a Tool of Media Misinformation. Journal of Experimental Political Science 9, 104–117 (2022); https://doi.org/10.1017/xps.2020.37

  305. N. C. Köbis, B. Doležalová, I. Soraperra, Fooled Twice: People Cannot Detect Deepfakes but Think They Can. iScience 24 (2021); https://doi.org/10.1016/j.isci.2021.103364

  306. K.-C. Yang, F. Menczer, Anatomy of an AI-Powered Malicious Social Botnet, arXiv [cs.CY] (2023); http://arxiv.org/abs/2307.16336

  307. R. Raman, V. Kumar Nair, P. Nedungadi, A. Kumar Sahu, R. Kowalski, S. Ramanathan, K. Achuthan, Fake News Research Trends, Linkages to Generative Artificial Intelligence and Sustainable Development Goals. Heliyon 10, e24727 (2024); https://doi.org/10.1016/j.heliyon.2024.e24727

  308. [industry] M. Musser, A Cost Analysis of Generative Language Models and Influence Operations, arXiv [cs.CY] (2023); http://arxiv.org/abs/2308.03740

  309. H. Bai, J. G. Voelkel, J. C. Eichstaedt, R. Willer, Artificial Intelligence Can Persuade Humans on Political Issues (2023); https://doi.org/10.31219/osf.io/stakv

  310. K. Hackenburg, L. Ibrahim, B. M. Tappin, M. Tsakiris, Comparing the Persuasiveness of Role-Playing Large Language Models and Human Experts on Polarized U.S. Political Issues (2023); https://doi.org/10.31219/osf.io/ey8db

  311. J. A. Goldstein, J. Chao, S. Grossman, A. Stamos, M. Tomz, How Persuasive Is AI-Generated Propaganda? PNAS Nexus 3, gae034 (2024); https://doi.org/10.1093/pnasnexus/pgae034

  312. S. C. Matz, J. D. Teeny, S. S. Vaid, H. Peters, G. M. Harari, M. Cerf, The Potential of Generative AI for Personalized Persuasion at Scale. Scientific Reports 14, 4692 (2024); https://doi.org/10.1038/s41598-024-53755-0

  313. [industry] A. R. Williams, L. Burke-Moore, R. S.-Y. Chan, F. E. Enock, F. Nanni, T. Sippy, Y.-L. Chung, E. Gabasova, K. Hackenburg, J. Bright, Large Language Models Can Consistently Generate High-Quality Content for Election Disinformation Operations, arXiv [cs.CY] (2024); http://arxiv.org/abs/2408.06731

  314. T. H. Costello, G. Pennycook, D. G. Rand, Durably Reducing Conspiracy Beliefs through Dialogues with AI. Science (New York, N.Y.) 385, eadq1814 (2024); https://doi.org/10.1126/science.adq1814

  315. F. Salvi, M. H. Ribeiro, R. Gallotti, R. West, On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.14380

  316. [industry] I. Gabriel, A. Manzini, G. Keeling, L. A. Hendricks, V. Rieser, H. Iqbal, N. Tomašev, I. Ktena, Z. Kenton, M. Rodriguez, S. El-Sayed, S. Brown, C. Akbulut, A. Trask, E. Hughes, A. Stevie Bergman, R. Shelby, … J. Manyika, “The Ethics of Advanced AI Assistants” (Google DeepMind, 2024); http://arxiv.org/abs/2404.16244 2 3 4 5

  317. P. S. Park, S. Goldstein, A. O’Gara, M. Chen, D. Hendrycks, AI Deception: A Survey of Examples, Risks, and Potential Solutions. Patterns 5 (2024); https://doi.org/10.1016/j.patter.2024.100988 2 3 4 5 6

  318. [industry] M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, … T. Shevlane, “Evaluating Frontier Models for Dangerous Capabilities” (Google Deepmind, 2024); https://doi.org/10.48550/arXiv.2403.13793 2 3 4 5 6 7 8

  319. M. Burtell, T. Woodside, Artificial Influence: An Analysis Of AI-Driven Persuasion, arXiv [cs.CY] (2023); http://arxiv.org/abs/2303.08721 2

  320. F. Miró-Llinares, J. C. Aguerri, Misinformation about Fake News: A Systematic Critical Review of Empirical Studies on the Phenomenon and Its Status as a “threat.” European Journal of Criminology 20, 356–374 (2023); https://doi.org/10.1177/1477370821994059

  321. G. Pennycook, D. G. Rand, Fighting Misinformation on Social Media Using Crowdsourced Judgments of News. Source Quality. Proceedings of the National Academy of Sciences of the United States of America 116, 2521–2526 (2019); 

  322. Z. Epstein, N. Sirlin, A. Arechar, G. Pennycook, D. Rand, The Social Media Context Interferes with Truth Discernment. Science Advances 9, eabo6169 (2023); https://doi.org/10.1126/sciadv.abo6169

  323. G. Pennycook, Z. Epstein, M. Mosleh, A. A. Arechar, D. Eckles, D. G. Rand, Shifting Attention to Accuracy Can Reduce Misinformation Online. Nature 592, 590–595 (2021); https://doi.org/10.1038/s41586-021-03344-2

  324. Pew Research Center, A Majority of Americans Are Highly Concerned That AI Will Be Used to Create Fake Info about the 2024 Candidates (2024); https://www.pewresearch.org/short-reads/2024/09/19/concern-over-the-impact-of-ai-on-2024-presidential-campaign/sr_24-09-10_electionandai_01/

  325. S. Kapoor, A. Narayanan, “How to Prepare for the Deluge of Generative AI on Social Media: A Grounded Analysis of the Challenges and Opportunities” (Knight First Amendment Institute at Columbia University., 2023); https://s3.amazonaws.com/kfai-documents/documents/a566f4ded5/How-to-Prepare-for-the-Deluge-of-Generative-AI-on-Social-Media.pdf

  326. M. Hameleers, Cheap Versus Deep Manipulation: The Effects of Cheapfakes Versus Deepfakes in a Political Setting. International Journal of Public Opinion Research 36 (2024); https://doi.org/10.1093/ijpor/edae004

  327. S. Vosoughi, D. Roy, S. Aral, The Spread of True and False News Online. Science 359, 1146–1151 (2018); https://doi.org/10.1126/science.aap9559

  328. K. Clayton, S. Blair, J. A. Busam, S. Forstner, J. Glance, G. Green, A. Kawata, A. Kovvuri, J. Martin, E. Morgan, M. Sandhu, R. Sang, R. Scholz-Bright, A. T. Welch, A. G. Wolff, A. Zhou, B. Nyhan, Real Solutions for Fake News? Measuring the Effectiveness of General Warnings and Fact-Check Tags in Reducing Belief in False Stories on Social Media. Political Behavior 42, 1073–1095 (2020); https://doi.org/10.1007/s11109-019-09533-0

  329. E. Hoes, B. Aitken, J. Zhang, T. Gackowski, M. Wojcieszak, Prominent Misinformation Interventions Reduce Misperceptions but Increase Skepticism, PsyArXiv (2023); https://doi.org/10.31234/osf.io/zmpdu

  330. A. Bashardoust, S. Feuerriegel, Y. R. Shrestha, Comparing the Willingness to Share for Human-Generated vs. AI-Generated Fake News. Proceedings of the ACM on Human-Computer Interaction 8, 1–21 (2024); https://doi.org/10.1145/3687028

  331. A. Kumar, J. W. Taylor, Feature Importance in the Age of Explainable AI: Case Study of Detecting Fake News & Misinformation via a Multi-Modal Framework. European Journal of Operational Research 317, 401–413 (2024); https://doi.org/10.1016/j.ejor.2023.10.003

  332. S. S. Ghosal, S. Chakraborty, J. Geiping, F. Huang, D. Manocha, A. Bedi, A Survey on the Possibilities & Impossibilities of AI-Generated Text Detection. Transactions on Machine Learning Research (2023); https://openreview.net/pdf?id=AXtFeYjboj 2

  333. V. S. Sadasivan, A. Kumar, S. Balasubramanian, W. Wang, S. Feizi, Can AI-Generated Text Be Reliably Detected?, arXiv [cs.CL] (2023); http://arxiv.org/abs/2303.11156 2 3

  334. S. Gehrmann, H. Strobelt, A. Rush, “GLTR: Statistical Detection and Visualization of Generated Text” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, M. R. Costa-jussà, E. Alfonseca, Eds. (Association for Computational Linguistics, Florence, Italy, 2019), pp. 111–116; https://doi.org/10.18653/v1/P19-3019

  335. L. Fröhling, A. Zubiaga, Feature-Based Detection of Automated Language Models: Tackling GPT-2, GPT-3 and Grover. PeerJ. Computer Science 7, e443 (2021); https://doi.org/10.7717/peerj-cs.443

  336. J. Luo, G. Nan, D. Li, Y. Tan, AI-Generated Fake Review Detection. (2023); https://doi.org/10.2139/ssrn.4610727

  337. T. Berber Sardinha, AI-Generated vs Human-Authored Texts: A Multidimensional Comparison. Applied Corpus Linguistics 4, 100083 (2024); https://doi.org/10.1016/j.acorp.2023.100083 2

  338. D. M. Markowitz, J. T. Hancock, J. N. Bailenson, Linguistic Markers of Inherently False AI Communication and Intentionally False Human Communication: Evidence From Hotel Reviews. Journal of Language and Social Psychology 43, 63–82 (2024); https://doi.org/10.1177/0261927X231200201 2

  339. Y. Xie, A. Rawal, Y. Cen, D. Zhao, S. K. Narang, S. Sushmita, MUGC: Machine Generated versus User Generated Content Detection, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.19725

  340. J. Su, T. Y. Zhuo, J. Mansurov, D. Wang, P. Nakov, Fake News Detectors Are Biased against Texts Generated by Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2309.08674

  341. W. Liang, M. Yuksekgonul, Y. Mao, E. Wu, J. Zou, “GPT Detectors Are Biased against Non-Native English Writers” in ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models (2023); https://openreview.net/pdf?id=SPuX8tKKIQ

  342. A. Uchendu, J. Lee, H. Shen, T. Le, T.-H. ‘kenneth Huang, D. Lee, Does Human Collaboration Enhance the Accuracy of Identifying LLM-Generated Deepfake Texts?, arXiv [cs.CL] (2023); http://arxiv.org/abs/2304.01002

  343. M. K. Land, Against Privatized Censorship: Proposals for Responsible Delegation. Virginia Journal of International Law 60, 363 (2019); https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3442184

  344. R. Gorwa, R. Binns, C. Katzenbach, Algorithmic Content Moderation: Technical and Political Challenges in the Automation of Platform Governance. Big Data & Society 7, 205395171989794 (2020); https://doi.org/10.1177/2053951719897945

  345. J. Turner, Robot Rules (Springer International Publishing, Cham, Switzerland, ed. 1, 2018); https://doi.org/10.1007/978-3-319-96235-1

  346. N. Bontridder, Y. Poullet, The Role of Artificial Intelligence in Disinformation. Data & Policy 3, e32 (2021); https://doi.org/10.1017/dap.2021.20

  347. T. C. Helmus, Artificial Intelligence, Deepfakes, and Disinformation: A Primer (RAND Corporation, Santa Monica, CA, 2022); https://doi.org/10.7249/PEA1043-1

  348. S. Metta, I. Chang, J. Parker, M. P. Roman, A. F. Ehuan, Generative AI in Cybersecurity, arXiv [cs.CR] (2024); http://arxiv.org/abs/2405.01674

  349. National Cyber Security Centre (NCSC), “The near-Term Impact of AI on the Cyber Threat” (GOV.UK, 2024); https://www.ncsc.gov.uk/report/impact-of-ai-on-cyber-threat

  350. British Library, “Learning Lessons From the Cyber-Attack: British Library Cyber Incident Review” (British Library, 2024); https://www.bl.uk/home/british-library-cyber-incident-review-8-march-2024.pdf/

  351. [industry] Microsoft Threat Intelligence, Staying ahead of Threat Actors in the Age of AI, Microsoft Security Blog (2024); https://www.microsoft.com/en-us/security/blog/2024/02/14/staying-ahead-of-threat-actors-in-the-age-of-ai/ 2

  352. [industry] B. Nimmo, M. Flossman, “Influence and Cyber Operations: An Update” (OpenAI, 2024); https://cdn.openai.com/threat-intelligence-reports/influence-and-cyber-operations-an-update_October-2024.pdf

  353. Defense Advanced Research Projects Agency, AIxCC (2024); https://aicyberchallenge.com/ 2 3 4

  354. D. Ristea, V. Mavroudis, C. Hicks, Benchmarking OpenAI o1 in Cyber Security, arXiv [cs.CR] (2024); http://arxiv.org/abs/2410.21939 2

  355. A. K. Zhang, N. Perry, R. Dulepet, J. Ji, J. W. Lin, E. Jones, C. Menders, G. Hussein, S. Liu, D. Jasper, P. Peetathawatchai, A. Glenn, V. Sivashankar, D. Zamoshchin, L. Glikbarg, D. Askaryar, M. Yang, … P. Liang, Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models, arXiv [cs.CR] (2024); http://arxiv.org/abs/2408.08926 2 3 4 5 6

  356. H. Ruan, Y. Zhang, A. Roychoudhury, SpecRover: Code Intent Extraction via LLMs, arXiv [cs.SE] (2024); http://arxiv.org/abs/2408.02232

  357. N. T. Islam, J. Khoury, A. Seong, E. Bou-Harb, P. Najafirad, Enhancing Source Code Security with LLMs: Demystifying the Challenges and Generating Reliable Repairs, arXiv [cs.CR] (2024); http://arxiv.org/abs/2409.00571

  358. X. Du, G. Zheng, K. Wang, J. Feng, W. Deng, M. Liu, B. Chen, X. Peng, T. Ma, Y. Lou, Vul-RAG: Enhancing LLM-Based Vulnerability Detection via Knowledge-Level RAG, arXiv [cs.SE] (2024); http://arxiv.org/abs/2406.11147

  359. [industry] M. Allamanis, M. Arjovsky, C. Blundell, L. Buesing, M. Brand, S. Glazunov, D. Maier, P. Maniatis, G. Marinho, H. Michalewski, K. Sen, C. Sutton, V. Tulsyan, M. Vanotti, T. Weber, D. Zheng, From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code (2024); https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html

  360. J. Gennari, S.-H. Lau, S. Perl, J. Parish, G. Sastry, “Considerations for Evaluating Large Language Models for Cybersecurity Tasks” (Carnegie Mellon University, 2024); https://insights.sei.cmu.edu/library/considerations-for-evaluating-large-language-models-for-cybersecurity-tasks/

  361. M. Shao, B. Chen, S. Jancheska, B. Dolan-Gavitt, S. Garg, R. Karri, M. Shafique, An Empirical Evaluation of LLMs for Solving Offensive Security Challenges, arXiv [cs.CR] (2024); http://arxiv.org/abs/2402.11814 2

  362. [industry] J. Xu, J. W. Stokes, G. McDonald, X. Bai, D. Marshall, S. Wang, A. Swaminathan, Z. Li, AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-Attacks, arXiv [cs.CR] (2024); http://arxiv.org/abs/2403.01038

  363. R. Fang, R. Bindu, A. Gupta, Q. Zhan, D. Kang, Teams of LLM Agents Can Exploit Zero-Day Vulnerabilities, arXiv [cs.MA] (2024); http://arxiv.org/abs/2406.01637

  364. T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, O. Press, EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges, arXiv [cs.AI] (2024); http://arxiv.org/abs/2409.16165

  365. G. Deng, Y. Liu, V. Mayoral-Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzger, S. Rass, “PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing” in 33rd USENIX Security Symposium (USENIX Security 24) (USENIX Association, Philadelphia, PA, 2024), pp. 847–864; https://www.usenix.org/conference/usenixsecurity24/presentation/deng

  366. [industry] S. Glazunov, M. Brand, Google Project Zero, “Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models” (Google Project Zero, 2024); https://googleprojectzero.blogspot.com/2024/06/project-naptime.html

  367. J. Walden, “The Impact of a Major Security Event on an Open Source Project: The Case of OpenSSL” in Proceedings of the 17th International Conference on Mining Software Repositories (ACM, New York, NY, USA, 2020); https://doi.org/10.1145/3379597.3387465

  368. G. Kokolakis, A. Moschos, A. D. Keromytis, “Harnessing the Power of General-Purpose LLMs in Hardware Trojan Design” in Lecture Notes in Computer Science (Springer Nature Switzerland, Cham, 2024)Lecture notes in computer science, pp. 176–194; https://doi.org/10.1007/978-3-031-61486-6_11

  369. J. P. Farwell, R. Rohozinski, Stuxnet and the Future of Cyber War. Survival 53, 23–40 (2011); https://doi.org/10.1080/00396338.2011.555586

  370. D. Saha, S. Tarek, K. Yahyaei, S. K. Saha, J. Zhou, M. Tehranipoor, F. Farahmandi, LLM for SoC Security: A Paradigm Shift. IEEE Access 12, 155498–155521 (2024); https://doi.org/10.1109/ACCESS.2024.3427369

  371. [industry] Amazon, What Is AWS CloudTrail? (2024); https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html

  372. [industry] P. Kanuparthy, A. Dalakoti, S. Kamath, AI Debugging at Meta with HawkEye, Engineering at Meta (2023); https://engineering.fb.com/2023/12/19/data-infrastructure/hawkeye-ai-debugging-meta/

  373. M. C. Horowitz, P. Scharre, A. Velez-Green, A Stable Nuclear Future? The Impact of Autonomous Systems and Artificial Intelligence, arXiv [cs.CY] (2019); http://arxiv.org/abs/1912.05291

  374. A. E. Chu, T. Lu, P.-S. Huang, Sparks of Function by de Novo Protein Design. Nature Biotechnology 42, 203–215 (2024); https://doi.org/10.1038/s41587-024-02133-2

  375. Robert F. Service, AI Tools Set off an Explosion of Designer Proteins. Science 386, 260–261 (2024); https://doi.org/10.1126/science.adt9024

  376. C. Li, G. Ye, Y. Jiang, Z. Wang, H. Yu, M. Yang, Artificial Intelligence in Battling Infectious Diseases: A Transformative Role. Journal of Medical Virology 96, e29355 (2024); https://doi.org/10.1002/jmv.29355

  377. The Royal Swedish Academy of Sciences, The Nobel Prize in Chemistry 2024. (2024); https://www.nobelprize.org/uploads/2024/10/press-chemistryprize2024-3.pdf

  378. V. Pitschmann, Z. Hon, Drugs as Chemical Weapons: Past and Perspectives. Toxics 11, 52 (2023); https://doi.org/10.3390/toxics11010052

  379. National Research Council, “Biosecurity and Dual-Use Research in the Life Sciences” in Science and Security in a Post 9/11 World: A Report Based on Regional Discussions between the Science and Security Communities (National Academies Press, Washington, D.C., DC, 2007); https://doi.org/10.17226/12013

  380. S. Ben Ouagrham-Gormley, Barriers to Bioweapons: The Challenges of Expertise and Organization for Weapons Development (Cornell University Press, 2014); https://www.cornellpress.cornell.edu/book/9780801452888/barriers-to-bioweapons/#bookTabs=1 2 3

  381. J. Revill, C. Jefferson, Tacit Knowledge and the Biological Weapons Regime. Science & Public Policy 41, 597–610 (2014); https://doi.org/10.1093/scipol/sct090 2 3

  382. S. R. Carter, N. Wheeler, S. Chwalek, C. Isaac, J. M. Yassif, “The Convergence of Artificial Intelligence and the Life Sciences: Safeguarding Technology, Rethinking Governance, and Preventing Catastrophe” (Nuclear Threat Initiative, 2023); https://www.nti.org/wp-content/uploads/2023/10/NTIBIO_AI_FINAL.pdf 2

  383. J. Smith, S. Rose, R. Moulange, C. Nelson, “How the UK Government Should Address the Misuse Risk from AI-Enabled Biological Tools” (Centre for Long-Term Resilience, 2024); https://www.longtermresilience.org/wp-content/uploads/2024/07/How-the-UK-Government-should-address-the-misuse-risk-from-AI-Enabled-biological-tools-BTs-Website-Copy.pdf 2

  384. B. Drexel, C. Withers, “AI and the Evolution of Biological National Security Risks: Capabilities, Thresholds, and Interventions” (CNAS, 2024); https://www.cnas.org/publications/reports/ai-and-the-evolution-of-biological-national-security-risks 2 3 4 5 6 7

  385. M. Dybul, “Biosecurity in the Age of AI: Chairperson’s Statement” (Helena, 2024); https://www.helenabiosecurity.org/

  386. [industry] T. Hayes, R. Rao, H. Akin, N. J. Sofroniew, D. Oktay, Z. Lin, R. Verkuil, V. Q. Tran, J. Deaton, M. Wiggert, R. Badkundri, I. Shafkat, J. Gong, A. Derry, R. S. Molina, N. Thomas, Y. Khan, … A. Rives, Simulating 500 Million Years of Evolution with a Language Model, bioRxiv [preprint] (2024); https://doi.org/10.1101/2024.07.01.600583 2

  387. [industry] V. Zambaldi, D. La, A. E. Chu, H. Patani, A. E. Danson, T. O. C. Kwan, T. Frerix, R. G. Schneider, D. Saxton, A. Thillaisundaram, Z. Wu, I. Moraes, O. Lange, E. Papa, G. Stanton, V. Martin, S. Singh, … J. Wang, “De Novo Design of High-Affinity Protein Binders with AlphaProteo” (Google DeepMind, 2024); https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/ 2 3 4

  388. Frontier Model Forum, Progress Update: Advancing Frontier AI Safety in 2024 and Beyond, Frontier Model Forum (2024); https://www.frontiermodelforum.org/updates/progress-update-advancing-frontier-ai-safety-in-2024-and-beyond/

  389. AIxBio Global Forum, “White Paper: AIxBio Global Forum Structure and Goals” (NTI, 2024); https://www.nti.org/wp-content/uploads/2024/07/AI_Bio-Global-Forum-Structure-and-Goals_White-Paper.pdf

  390. N. N. Thadani, S. Gurev, P. Notin, N. Youssef, N. J. Rollins, D. Ritter, C. Sander, Y. Gal, D. S. Marks, Learning from Prepandemic Data to Forecast Viral Escape. Nature 622, 818–825 (2023); https://doi.org/10.1038/s41586-023-06617-0

  391. E. H. Soice, R. Rocha, K. Cordova, M. Specter, K. M. Esvelt, Can Large Language Models Democratize Access to Dual-Use Biotechnology?, arXiv [cs.CY] (2023); http://arxiv.org/abs/2306.03809 2

  392. N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, … D. Hendrycks, The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning, arXiv [cs.LG] (2024); http://arxiv.org/abs/2403.03218 2 3

  393. C. A. Mouton, C. Lucas, E. Guest, “The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study” (RAND Corporation, 2024); https://www.rand.org/pubs/research_reports/RRA2977-2.html 2 3 4 5

  394. [industry] T. Patwardhan, K. Liu, T. Markov, N. Chowdhury, D. Leet, N. Cone, C. Maltbie, J. Huizinga, C. Wainwright, S. (froggi) Jackson, S. Adler, R. Casagrande, A. Madry, “Building an Early Warning System for LLM-Aided Biological Threat Creation” (OpenAI, 2024); https://openai.com/research/building-an-early-warning-system-for-llm-aided-biological-threat-creation

  395. B. J. Wittmann, T. Alexanian, C. Bartling, J. Beal, A. Clore, J. Diggans, K. Flyangolts, B. T. Gemler, T. Mitchell, S. T. Murphy, N. E. Wheeler, E. Horvitz, Toward AI-Resilient Screening of Nucleic Acid Synthesis Orders: Process, Results, and Recommendations, bioRxiv [preprint] (2024); https://doi.org/10.1101/2024.12.02.626439

  396. N. R. Bennett, B. Coventry, I. Goreshnik, B. Huang, A. Allen, D. Vafeados, Y. P. Peng, J. Dauparas, M. Baek, L. Stewart, F. DiMaio, S. De Munck, S. N. Savvides, D. Baker, Improving de Novo Protein Binder Design with Deep Learning. Nature Communications 14, 2625 (2023); https://doi.org/10.1038/s41467-023-38328-5

  397. M. Crowley, L. Shang, M. Dando, Preserving the Norm against Chemical Weapons: A Civil Society Initiative for the 2018 4th Review Conference of the Chemical Weapons Convention. Futures 102, 125–133 (2018); https://doi.org/10.1016/j.futures.2018.01.006

  398. F. Urbina, F. Lentzos, C. Invernizzi, S. Ekins, Dual Use of Artificial Intelligence-Powered Drug Discovery. Nature Machine Intelligence 4, 189–191 (2022); https://doi.org/10.1038/s42256-022-00465-9

  399. M. Guo, Z. Li, X. Deng, D. Luo, J. Yang, Y. Chen, W. Xue, ConoDL: A Deep Learning Framework for Rapid Generation and Prediction of Conotoxins, bioRxiv [preprint] (2024); https://doi.org/10.1101/2024.09.27.614001 2

  400. [industry] 310.ai, GenAI + BIO: Nature Didn’t Have Time, We Have GPUs (2024); https://310.ai/

  401. [industry] Asimov, Kernel: CAD Software for Engineering Biology (2024); https://www.asimov.com/kernel

  402. A. M Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, P. Schwaller, Augmenting Large Language Models with Chemistry Tools. Nature Machine Intelligence 6, 525–535 (2024); https://doi.org/10.1038/s42256-024-00832-8 2 3

  403. J. Goldblat, The Biological Weapons Convention: An Overview. International Review of the Red Cross 37, 251–265 (1997); https://doi.org/10.1017/s0020860400084679

  404. G. Gonzalez-Isunza, M. Z. Jawaid, P. Liu, D. L. Cox, M. Vazquez, J. Arsuaga, Using Machine Learning to Detect Coronaviruses Potentially Infectious to Humans. Scientific Reports 13, 9319 (2023); https://doi.org/10.1038/s41598-023-35861-7

  405. M. Wardeh, M. S. C. Blagrove, K. J. Sharkey, M. Baylis, Divide-and-Conquer: Machine-Learning Integrates Mammalian and Viral Traits with Network Features to Predict Virus-Mammal Associations. Nature Communications 12, 3954 (2021); https://doi.org/10.1038/s41467-021-24085-w

  406. S. Rose, R. Moulange, J. Smith, C. Nelson, “The near-Term Impact of AI on Biological Misuse” (Centre for Long-Term Resilience, 2024); https://www.longtermresilience.org/reports/the-near-term-impact-of-ai-on-biological-misuse/

  407. J. Frazer, P. Notin, M. Dias, A. Gomez, J. K. Min, K. Brock, Y. Gal, D. S. Marks, Disease Variant Prediction with Deep Generative Models of Evolutionary Data. Nature 599, 91–95 (2021); https://doi.org/10.1038/s41586-021-04043-8

  408. J. B. Sandbrink, E. C. Alley, M. C. Watson, G. D. Koblentz, K. M. Esvelt, Insidious Insights: Implications of Viral Vector Engineering for Pathogen Enhancement. Gene Therapy 30, 407–410 (2023); https://doi.org/10.1038/s41434-021-00312-3

  409. J. Kaiser, Exclusive: Controversial Experiments That Could Make Bird Flu More Risky Poised to Resume, American Association for the Advancement of Science (2021); https://www.science.org/content/article/exclusive-controversial-experiments-make-bird-flu-more-risky-poised-resume

  410. J. Pannu, D. Bloomfield, A. Zhu, R. MacKnight, G. Gomes, A. Cicero, T. Inglesby, Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models, arXiv [cs.CY] (2024); http://dx.doi.org/10.2139/ssrn.4873106 2

  411. E. Appleton, C. Madsen, N. Roehner, D. Densmore, Design Automation in Synthetic Biology. Cold Spring Harbor Perspectives in Biology 9 (2017); https://doi.org/10.1101/cshperspect.a023978

  412. Organisation for Economic Co-operation and Development, Artificial Intelligence in Science: Challenges, Opportunities and the Future of Research (OECD, Paris, 2023); https://www.oecd-ilibrary.org/science-and-technology/artificial-intelligence-in-science_a8d820bd-en 2

  413. C. Nelson, S. Rose, “Understanding AI-Facilitated Biological Weapon Development” (Centre for Long-Term Resilience, 2023); https://www.longtermresilience.org/reports/understanding-risks-at-the-intersection-of-ai-and-bio/

  414. Z. Wu, S. B. J. Kan, R. D. Lewis, B. J. Wittmann, F. H. Arnold, Machine Learning-Assisted Directed Protein Evolution with Combinatorial Libraries. Proceedings of the National Academy of Sciences of the United States of America 116, 8852–8858 (2019); https://doi.org/10.1073/pnas.1901979116

  415. D. A. Boiko, R. MacKnight, B. Kline, G. Gomes, Autonomous Chemical Research with Large Language Models. Nature 624, 570–578 (2023); https://doi.org/10.1038/s41586-023-06792-0 2

  416. A. Stephenson, L. Lastra, B. Nguyen, Y.-J. Chen, J. Nivala, L. Ceze, K. Strauss, Physical Laboratory Automation in Synthetic Biology. ACS Synthetic Biology 12, 3156–3169 (2023); https://doi.org/10.1021/acssynbio.3c00345 2

  417. J. T. Rapp, B. J. Bremer, P. A. Romero, Self-Driving Laboratories to Autonomously Navigate the Protein Fitness Landscape. Nature Chemical Engineering 1, 97–107 (2024); https://doi.org/10.1038/s44286-023-00002-4

  418. A. Casas, M. Bultelle, R. Kitney, An Engineering Biology Approach to Automated Workflow and Biodesign. Synthetic Biology 9, ysae009 (2024); https://doi.org/10.1093/synbio/ysae009

  419. D. Sun, W. Gao, H. Hu, S. Zhou, Why 90% of Clinical Drug Development Fails and How to Improve It? Acta Pharmaceutica Sinica. B 12, 3049–3062 (2022); https://doi.org/10.1016/j.apsb.2022.02.002

  420. Forum on Neuroscience and Nervous System Disorders, Board on Health Sciences Policy, Institute of Medicine, “Drug Development Challenges” in Improving and Accelerating Therapeutic Development for Nervous System Disorders: Workshop Summary (National Academies Press (US), 2014); https://www.ncbi.nlm.nih.gov/books/NBK195047/

  421. K. H. Sumida, R. Núñez-Franco, I. Kalvet, S. J. Pellock, B. I. M. Wicky, L. F. Milles, J. Dauparas, J. Wang, Y. Kipnis, N. Jameson, A. Kang, J. De La Cruz, B. Sankaran, A. K. Bera, G. Jiménez-Osés, D. Baker, Improving Protein Expression, Stability, and Function with ProteinMPNN. Journal of the American Chemical Society 146, 2054–2061 (2024); https://doi.org/10.1021/jacs.3c10941

  422. M. Wehrs, D. Tanjore, T. Eng, J. Lievense, T. R. Pray, A. Mukhopadhyay, Engineering Robust Production Microbes for Large-Scale Cultivation. Trends in Microbiology 27, 524–537 (2019); https://doi.org/10.1016/j.tim.2019.01.006

  423. J. Jiang, H.-H. Peng, Z. Yang, X. Ma, S. Sahakijpijarn, C. Moon, D. Ouyang, R. O. Williams Iii, The Applications of Machine Learning (ML) in Designing Dry Powder for Inhalation by Using Thin-Film-Freezing Technology. International Journal of Pharmaceutics 626, 122179 (2022); https://doi.org/10.1016/j.ijpharm.2022.122179

  424. T. R. Sosnowski, Towards More Precise Targeting of Inhaled Aerosols to Different Areas of the Respiratory System. Pharmaceutics 16, 97 (2024); https://doi.org/10.3390/pharmaceutics16010097

  425. Department for Science, Innovation & Technology, AI Safety Institute, “Advanced AI Evaluations at AISI: May Update” (GOV.UK, 2024); https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update

  426. [industry] Anthropic, Reflections on Our Responsible Scaling Policy (2024); https://www.anthropic.com/news/reflections-on-our-responsible-scaling-policy

  427. G. Lewis, P. Millett, A. Sandberg, A. Snyder-Beattie, G. Gronvall, Information Hazards in Biotechnology. Risk Analysis: An Official Publication of the Society for Risk Analysis 39, 975–981 (2019); https://doi.org/10.1111/risa.13235 2

  428. S. R. Carter, S. Curtis, C. Emerson, J. Gray, I. C. Haydon, A. Hebbeler, C. Qureshi, N. Randolph, A. Rives, A. L. Stuart, Responsible AI X Biodesign: Community Values, Guiding Principles, and Commitments for the Responsible Development of AI for Protein Design (2024); https://responsiblebiodesign.ai/

  429. NTI bio, “Research Agenda for Safeguarding AI-Bio Capabilities Draft” (NTI, 2024); https://www.nti.org/wp-content/uploads/2024/06/Research-Agenda-for-Safeguarding-AI-Bio-Capabilities.pdf

  430. E. Nguyen, M. Poli, M. G. Durrant, A. W. Thomas, B. Kang, J. Sullivan, M. Y. Ng, A. Lewis, A. Patel, A. Lou, S. Ermon, S. A. Baccus, T. Hernandez-Boussard, C. Re, P. D. Hsu, B. L. Hie, Sequence Modeling and Design from Molecular to Genome Scale with Evo, bioRxiv [preprint] (2024); https://doi.org/10.1101/2024.02.27.582234

  431. J. Cheng, G. Novati, J. Pan, C. Bycroft, A. Žemgulytė, T. Applebaum, A. Pritzel, L. H. Wong, M. Zielinski, T. Sargeant, R. G. Schneider, A. W. Senior, J. Jumper, D. Hassabis, P. Kohli, Ž. Avsec, Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense. Science (New York, N.Y.) 381, eadg7492 (2023); https://doi.org/10.1126/science.adg7492

  432. S. R. Carter, N. E. Wheeler, C. Isaac, J. M. Yassif, “Developing Guardrails for AI Biodesign Tools” (Nuclear Threat Initiative, 2024); https://www.nti.org/analysis/articles/developing-guardrails-for-ai-biodesign-tools/

  433. S. A. Dip, U. A. Shuvo, T. Chau, H. Song, P. Choi, X. Wang, L. Zhang, PathoLM: Identifying Pathogenicity from the DNA Sequence through the Genome Foundation Model, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.13133

  434. K. Workman, Engineering AAVs with Evo and AlphaFold, LatchBio (2024); https://blog.latch.bio/p/engineering-aavs-with-evo-and-alphafold

  435. D. Bloomfield, J. Pannu, A. W. Zhu, M. Y. Ng, A. Lewis, E. Bendavid, S. M. Asch, T. Hernandez-Boussard, A. Cicero, T. Inglesby, AI and Biosecurity: The Need for Governance. Science (New York, N.Y.) 385, 831–833 (2024); https://doi.org/10.1126/science.adq1977

  436. Y. Zhang, M. Yasunaga, Z. Zhou, J. Z. HaoChen, J. Zou, P. Liang, S. Yeung, “Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, N. Okazaki, Eds. (Association for Computational Linguistics, 2023), pp. 7479–7498; https://doi.org/10.18653/v1/2023.findings-acl.472

  437. A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, “When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, N. Okazaki, Eds. (Association for Computational Linguistics, Toronto, Canada, 2023), pp. 9802–9822; https://doi.org/10.18653/v1/2023.acl-long.546

  438. S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, T. Hashimoto, “Whose Opinions Do Language Models Reflect?” in Proceedings of the 40th International Conference on Machine Learning (JMLR, Honolulu, Hawaii, USA, 2023) vol. 202 of ICML’23, pp. 29971–30004; https://proceedings.mlr.press/v202/santurkar23a.html 2 3 4

  439. L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hendricks, … I. Gabriel, “Taxonomy of Risks Posed by Language Models” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 214–229; https://doi.org/10.1145/3531146.3533088 2 3 4

  440. [industry] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, … W. Zaremba, Evaluating Large Language Models Trained on Code, arXiv [cs.LG] (2021); http://arxiv.org/abs/2107.03374

  441. S. Nguyen, H. M. Babe, Y. Zi, A. Guha, C. J. Anderson, M. Q. Feldman, “How Beginning Programmers and Code LLMs (Mis)read Each Other” in Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ‘24) (Association for Computing Machinery, New York, NY, USA, 2024), pp. 1–26; https://doi.org/10.1145/3613904.3642706 2

  442. F. Cassano, L. Li, A. Sethi, N. Shinn, A. Brennan-Jones, J. Ginesin, E. Berman, G. Chakhnashvili, A. Lozhkov, C. J. Anderson, A. Guha, Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions, arXiv [cs.SE] (2023); http://arxiv.org/abs/2312.12450 2

  443. R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, R. Jabbarvand, “Lost in Translation: A Study of Bugs Introduced by Large Language Models While Translating Code” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ‘24) (Association for Computing Machinery, New York, NY, USA, 2024), pp. 1–13; https://doi.org/10.1145/3597503.3639226 2

  444. N. Perry, M. Srivastava, D. Kumar, D. Boneh, “Do Users Write More Insecure Code with AI Assistants?” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (ACM, New York, NY, USA, 2023), pp. 2785–2799; https://doi.org/10.1145/3576915.3623157

  445. A. Perlman, The Implications of ChatGPT for Legal Services and Society, The Practice (2023); https://clp.law.harvard.edu/knowledge-hub/magazine/issues/generative-ai-in-the-legal-profession/the-implications-of-chatgpt-for-legal-services-and-society/ 2

  446. E. Martínez, Re-Evaluating GPT-4’s Bar Exam Performance. Artificial Intelligence and Law (2024); https://doi.org/10.1007/s10506-024-09396-9 2

  447. Eastern District of Texas, US District Court, Memorandum and Order in Case 1:23-Cv-00281-MAC. (2024); https://www.courthousenews.com/wp-content/uploads/2024/11/attorney-sanctioned-for-using-ai-hallucinations.pdf

  448. J. A. Omiye, J. C. Lester, S. Spichak, V. Rotemberg, R. Daneshjou, Large Language Models Propagate Race-Based Medicine. Npj Digital Medicine 6, 1–4 (2023); https://doi.org/10.1038/s41746-023-00939-z 2 3

  449. T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. De Leon, C. Elepaño, M. Madriaga, R. Aggabao, G. Diaz-Candido, J. Maningo, V. Tseng, Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. PLOS Digital Health 2, e0000198 (2023); https://doi.org/10.1371/journal.pdig.0000198

  450. K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, A. Babiker, N. Schärli, A. Chowdhery, … V. Natarajan, Large Language Models Encode Clinical Knowledge. Nature 620, 172–180 (2023); https://doi.org/10.1038/s41586-023-06291-2

  451. I. D. Raji, I. E. Kumar, A. Horowitz, A. Selbst, “The Fallacy of AI Functionality” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 959–972; https://doi.org/10.1145/3531146.3533158 2 3

  452. J. Tan, H. Westermann, K. Benyekhlef, “ChatGPT as an Artificial Lawyer?” in Workshop on Artificial Intelligence for Access to Justice (AI4AJ 2023) (CEUR Workshop Proceedings, Braga, Portugal, 2023); https://ceur-ws.org/Vol-3435/short2.pdf 2

  453. J. L. M. Brand, Air Canada’s Chatbot Illustrates Persistent Agency and Responsibility Gap Problems for AI. AI & Society, 1–3 (2024); https://doi.org/10.1007/s00146-024-02096-7

  454. [industry] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, How Well Do Large Language Models Perform in Arithmetic Tasks?, arXiv [cs.CL] (2023); http://arxiv.org/abs/2304.02015

  455. Z. Wang, “CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models” in Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10) (2024), pp. 143–151; https://aclanthology.org/2024.sighan-1.17.pdf 2

  456. X. Yin, J. Jiang, L. Yang, X. Wan, History Matters: Temporal Knowledge Editing in Large Language Model. Proceedings of the … AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence 38, 19413–19421 (2024); https://doi.org/10.1609/aaai.v38i17.29912 2

  457. B. Vidgen, A. Agrawal, A. M. Ahmed, V. Akinwande, N. Al-Nuaimi, N. Alfaraj, E. Alhajjar, L. Aroyo, T. Bavalatti, M. Bartolo, B. Blili-Hamelin, K. Bollacker, R. Bomassani, M. F. Boston, S. Campos, K. Chakra, C. Chen, … J. Vanschoren, Introducing v0.5 of the AI Safety Benchmark from MLCommons, arXiv [cs.CL] (2024); http://arxiv.org/abs/2404.12241 2 3 4 5

  458. P. Guldimann, A. Spiridonov, R. Staab, N. Jovanović, M. Vero, V. Vechev, A. Gueorguieva, M. Balunović, N. Konstantinov, P. Bielik, P. Tsankov, M. Vechev, COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.07959 2

  459. OECD.AI Policy Observatory, OECD AI Incidents Monitor (AIM) (2024); https://oecd.ai/en/incidents 2

  460. A. Wei, N. Haghtalab, J. Steinhardt, “Jailbroken: How Does LLM Safety Training Fail?” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=jA235JGM09 2 3 4 5

  461. S. M. T. I. Tonmoy, S. M. M. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, A. Das, A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2401.01313 2

  462. ETH Zurich, INSAIT, LatticeFlow AI, COMPL-AI (2024); https://compl-ai.org/

  463. N. Guha, J. Nyarko, D. E. Ho, C. Ré, A. Chilton, A. Narayana, A. Chohlas-Wood, A. Peters, B. Waldon, D. N. Rockmore, D. Zambrano, D. Talisman, E. Hoque, F. Surani, F. Fagan, G. Sarfaty, G. M. Dickinson, … Z. Li, “LEGALBENCH: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models” in 37th International Conference on Neural Information Processing Systems (NeurIPS 2023) (Curran Associates Inc., Red Hook, NY, USA, 2024), pp. 44123–44279; https://doi.org/10.5555/3666122.3668037

  464. R. Xu, Z. Wang, R.-Z. Fan, P. Liu, Benchmarking Benchmark Leakage in Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2404.18824

  465. S. Longpre, S. Biderman, A. Albalak, H. Schoelkopf, D. McDuff, S. Kapoor, K. Klyman, K. Lo, G. Ilharco, N. San, M. Rauh, A. Skowron, B. Vidgen, L. Weidinger, A. Narayanan, V. Sanh, D. Adelani, … L. Soldaini, The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources. Transactions on Machine Learning Research (2024); https://openreview.net/pdf?id=tH1dQH20eZ 2 3 4

  466. V. Ojewale, R. Steed, B. Vecchione, A. Birhane, I. D. Raji, Towards AI Accountability Infrastructure: Gaps and Opportunities in AI Audit Tooling, arXiv [cs.CY] (2024); http://arxiv.org/abs/2402.17861 2

  467. N. Guha, C. M. Lawrence, L. A. Gailmard, K. T. Rodolfa, F. Surani, R. Bommasani, I. D. Raji, M.-F. Cuéllar, C. Honigsberg, P. Liang, D. E. Ho, AI Regulation Has Its Own Alignment Problem: The Technical and Institutional Feasibility of Disclosure, Registration, Licensing, and Auditing. The George Washington Law Review 92 (2024); https://dho.stanford.edu/wp-content/uploads/AI_Regulation.pdf 2

  468. A. Narayanan, S. Kapoor, AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference (Princeton University Press, 2024); https://doi.org/10.1515/9780691249643

  469. J. Buolamwini, T. Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” in Proceedings of the 1st Conference on Fairness, Accountability and Transparency (FAT/MM ‘19) (PMLR, 2018), pp. 77–91; https://proceedings.mlr.press/v81/buolamwini18a.html

  470. J. Angwin, J. Larson, L. Kirchner, S. Mattu, Machine Bias, ProPublica (2016); https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

  471. J. Dressel, H. Farid, The Accuracy, Fairness, and Limits of Predicting Recidivism. Science Advances 4, eaao5580 (2018); https://doi.org/10.1126/sciadv.aao5580

  472. Z. Obermeyer, B. Powers, C. Vogeli, S. Mullainathan, Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 366, 447–453 (2019); https://doi.org/10.1126/science.aax2342

  473. T. Zack, E. Lehman, M. Suzgun, J. A. Rodriguez, L. A. Celi, J. Gichoya, D. Jurafsky, P. Szolovits, D. W. Bates, R.-E. E. Abdulnour, A. J. Butte, E. Alsentzer, Assessing the Potential of GPT-4 to Perpetuate Racial and Gender Biases in Health Care: A Model Evaluation Study. The Lancet. Digital Health 6, e12–e22 (2024); https://doi.org/10.1016/S2589-7500(23)00225-X

  474. F. Bianchi, P. Kalluri, E. Durmus, F. Ladhak, M. Cheng, D. Nozza, T. Hashimoto, D. Jurafsky, J. Zou, A. Caliskan, “Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale” in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 1493–1504; https://doi.org/10.1145/3593013.3594095

  475. S. Ghosh, A. Caliskan, “‘person’ == Light-Skinned, Western Man, and Sexualization of Women of Color: Stereotypes in Stable Diffusion” in Findings of the Association for Computational Linguistics: EMNLP 2023 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2023), pp. 6971–6985; https://doi.org/10.18653/v1/2023.findings-emnlp.465

  476. M. Cheong, E. Abedin, M. Ferreira, R. Reimann, S. Chalson, P. Robinson, J. Byrne, L. Ruppanner, M. Alfano, C. Klein, Investigating Gender and Racial Biases in DALL-E Mini Images. ACM Journal on Responsible Computing 1, 1–20 (2024); https://doi.org/10.1145/3649883 2

  477. J. S. Park, M. S. Bernstein, R. N. Brewer, E. Kamar, M. R. Morris, “Understanding the Representation and Representativeness of Age in AI Data Sets” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘21) (Association for Computing Machinery, New York, NY, USA, 2021), pp. 834–842; https://doi.org/10.1145/3461702.3462590

  478. R. Kamikubo, L. Wang, C. Marte, A. Mahmood, H. Kacorri, “Data Representativeness in Accessibility Datasets: A Meta-Analysis” in Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 1–15; https://doi.org/10.1145/3517428.3544826

  479. [industry] S. Shankar, Y. Halpern, E. Breck, J. Atwood, J. Wilson, D. Sculley, “No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World” in 31st Conference on Neural Information Processing Systems (NIPS 2017) Machine Learning for the Developing World Workshop (Long Beach, CA, USA, 2017); https://arxiv.org/abs/1711.08536

  480. T. de Vries, I. Misra, C. Wang, L. van der Maaten, “Does Object Recognition Work for Everyone?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (Long Beach, CA, USA, 2019); 

  481. S. Longpre, R. Mahari, A. Chen, N. Obeng-Marnu, D. Sileo, W. Brannon, N. Muennighoff, N. Khazam, J. Kabbara, K. Perisetla, X. Wu, E. Shippole, K. Bollacker, T. Wu, L. Villa, S. Pentland, S. Hooker, The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.16787.  2 3 4

  482. H. Suresh, J. Guttag, “A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle” in Equity and Access in Algorithms, Mechanisms, and Optimization (ACM, New York, NY, USA, 2021); https://doi.org/10.1145/3465416.3483305.  2

  483. [industry] L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, … I. Gabriel, “Ethical and Social Risks of Harm from Language Models” (Google DeepMind, 2021); http://arxiv.org/abs/2112.04359.  2 3 4 5 6

  484. J. Nwatu, O. Ignat, R. Mihalcea, “Bridging the Digital Divide: Performance Variation across Socio-Economic Factors in Vision-Language Models” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 10686–10702; https://doi.org/10.18653/v1/2023.emnlp-main.660. 

  485. A. Pouget, L. Beyer, E. Bugliarello, X. Wang, A. P. Steiner, X. Zhai, I. Alabdulmohsin, “No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=UmW9BYj761. 

  486. S. Nayak, K. Jain, R. Awal, S. Reddy, S. Van Steenkiste, L. A. Hendricks, K. Stanczak, A. Agrawal, Benchmarking Vision Language Models for Cultural Understanding (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.emnlp-main.329. 

  487. D. Agarwal, M. Naaman, A. Vashistha, AI Suggestions Homogenize Writing toward Western Styles and Diminish Cultural Nuances, arXiv [cs.HC] (2024); http://arxiv.org/abs/2409.11360. 

  488. N. Shahbazi, Y. Lin, A. Asudeh, H. V. Jagadish, Representation Bias in Data: A Survey on Identification and Resolution Techniques. ACM Computing Surveys 55, 293:1–293:39 (2023); https://doi.org/10.1145/3588433.  2

  489. S. E. Whang, Y. Roh, H. Song, J.-G. Lee, Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective. The VLDB Journal: Very Large Data Bases: A Publication of the VLDB Endowment 32, 791–813 (2023); https://doi.org/10.1007/s00778-022-00775-9. 

  490. A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, P. Minervini, Are We Done with MMLU?, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.04127.  2

  491. Y. Wan, G. Pu, J. Sun, A. Garimella, K.-W. Chang, N. Peng, “‘Kelly Is a Warm Person, Joseph Is a Role Model’: Gender Biases in LLM-Generated Reference Letters” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 3730–3748; https://doi.org/10.18653/v1/2023.findings-emnlp.243 2 3

  492. D. van Niekerk, M. Pérez-Ortiz, J. Shawe-Taylor, D. Orlič, I. Drobnjak, J. Kay, N. Siegel, K. Evans, N. Moorosi, T. Eliassi-Rad, L. M. Tanczer, W. Holmes, M. P. Deisenroth, I. Straw, M. Fasli, R. Adams, N. Oliver, … M. Janicky, “Challenging Systematic Prejudices: An Investigation into Bias Against Women and Girls in Large Language Models” (UNESCO, IRCAI, 2024); https://ircai.org/project/challenging-systematic-prejudices/ 2

  493. M. Vlasceanu, D. M. Amodio, Propagation of Societal Gender Inequality by Internet Search Algorithms. Proceedings of the National Academy of Sciences 119, e2204529119 (2022); https://doi.org/10.1073/pnas.2204529119

  494. S. Sterlie, N. Weng, A. Feragen, Generalizing Fairness to Generative Language Models via Reformulation of Non-Discrimination Criteria, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.08564

  495. T. Sandoval-Martin, E. Martínez-Sanzo, Perpetuation of Gender Bias in Visual Representation of Professions in the Generative AI Tools DALL·E and Bing Image Creator. Social Sciences (Basel, Switzerland) 13, 250 (2024); https://doi.org/10.3390/socsci13050250

  496. L. Sun, M. Wei, Y. Sun, Y. J. Suh, L. Shen, S. Yang, Smiling Women Pitching down: Auditing Representational and Presentational Gender Biases in Image-Generative AI. Journal of Computer-Mediated Communication: JCMC 29, zmad045 (2023); https://doi.org/10.1093/jcmc/zmad045

  497. Y. Wan, K.-W. Chang, The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-to-Image Generation of Dual Subjects, arXiv [cs.CV] (2024); http://arxiv.org/abs/2402.11089

  498. A. Nielsen, A. Woemmel, “Invisible Inequities: Confronting Age-Based Discrimination in Machine Learning Research and Applications” in 2nd Workshop on Generative AI and Law (2024); https://blog.genlaw.org/pdfs/genlaw_icml2024/50.pdf 2

  499. C. Harris, Mitigating Age Biases in Resume Screening AI Models. The International FLAIRS Conference Proceedings 36 (2023); https://doi.org/10.32473/flairs.36.133236

  500. J. Stypinska, AI Ageism: A Critical Roadmap for Studying Age Discrimination and Exclusion in Digitalized Societies. AI & Society 38, 665–677 (2023); https://doi.org/10.1007/s00146-022-01553-5

  501. R. Naik, B. Nushi, “Social Biases through the Text-to-Image Generation Lens” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 786–808; https://doi.org/10.1145/3600211.3604711 2

  502. [industry] A. Tamkin, A. Askell, L. Lovitt, E. Durmus, N. Joseph, S. Kravec, K. Nguyen, J. Kaplan, D. Ganguli, Evaluating and Mitigating Discrimination in Language Model Decisions, arXiv [cs.CL] (2023); http://arxiv.org/abs/2312.03689

  503. M. Kamruzzaman, Shovon, G. Kim, Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models (Association for Computational Linguistics, 2024); https://doi.org/10.18653/v1/2024.findings-acl.530

  504. C. H. Chu, S. Donato-Woodger, S. S. Khan, R. Nyrup, K. Leslie, A. Lyn, T. Shi, A. Bianchi, S. A. Rahimi, A. Grenier, Age-Related Bias and Artificial Intelligence: A Scoping Review. Humanities & Social Sciences Communications 10, 1–17 (2023); https://doi.org/10.1057/s41599-023-01999-y

  505. T. Kamelski, D. Klinge, Generative Artificial Intelligence and Digital Ageism: Exploring the Construction of Age and Aging by Image-Generating AI (2024); https://doi.org/10.31219/osf.io/p3sdj

  506. K. A. Mack, R. Qadri, R. Denton, S. K. Kane, C. L. Bennett, “‘They Only Care to Show Us the Wheelchair’: Disability Representation in Text-to-Image AI Models” in Proceedings of the CHI Conference on Human Factors in Computing Systems (ACM, New York, NY, USA, 2024) vol. 22, pp. 1–23; https://doi.org/10.1145/3613904.3642166 2

  507. P. N. Venkit, M. Srinath, S. Wilson, “Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models” in Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, R. Gupta, Eds. (Association for Computational Linguistics, Toronto, Canada, 2023), pp. 26–34; https://doi.org/10.18653/v1/2023.trustnlp-1.3

  508. K. Glazko, Y. Mohammed, B. Kosa, V. Potluri, J. Mankoff, “Identifying and Improving Disability Bias in GPT-Based Resume Screening” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3630106.3658933

  509. N. Shahin, L. Ismail, “ChatGPT, Let Us Chat Sign Language: Experiments, Architectural Elements, Challenges and Research Directions” in 2023 International Symposium on Networks, Computers and Communications (ISNCC) (IEEE, 2023), pp. 1–7; https://doi.org/10.1109/isncc58260.2023.10323974

  510. S. Gueuwou, K. Takyi, M. Müller, M. S. Nyarko, R. Adade, R.-M. O. M. Gyening, “AfriSign: Machine Translation for African Sign Languages” in 4th Workshop on African Natural Language Processing (AfricaNLP 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=EHldk3J2xk

  511. J. Hartmann, J. Schwenzow, M. Witte, The Political Ideology of Conversational AI: Converging Evidence on ChatGPT’s pro-Environmental, Left-Libertarian Orientation, arXiv [cs.CL] (2023); http://arxiv.org/abs/2301.01768 2 3

  512. F. Motoki, V. Pinho Neto, V. Rodrigues, More Human than Human: Measuring ChatGPT Political Bias. Public Choice 198, 3–23 (2024); https://doi.org/10.1007/s11127-023-01097-2

  513. D. Rozado, The Political Biases of ChatGPT. Social Sciences (Basel, Switzerland) 12, 148 (2023); https://doi.org/10.3390/socsci12030148

  514. J. Rutinowski, S. Franke, J. Endendyk, I. Dormuth, M. Roidl, M. Pauly, The Self-Perception and Political Biases of ChatGPT. Human Behavior and Emerging Technologies 2024, 1–9 (2024); https://doi.org/10.1155/2024/7115633

  515. M. Buyl, A. Rogiers, S. Noels, I. Dominguez-Catena, E. Heiter, R. Romero, I. Johary, A.-C. Mara, J. Lijffijt, T. De Bie, Large Language Models Reflect the Ideology of Their Creators, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.18417

  516. [industry] T. Choudhary, Political Bias in AI-Language Models: A Comparative Analysis of ChatGPT-4, Perplexity, Google Gemini, and Claude, Techrxiv (2024); https://doi.org/10.36227/techrxiv.172107441.12283354/v1

  517. S. Feng, C. Y. Park, Y. Liu, Y. Tsvetkov, From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (Association for Computational Linguistics, 2023); https://doi.org/10.18653/v1/2023.acl-long.656

  518. L. Rettenberger, M. Reischl, M. Schutera, Assessing Political Bias in Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.13041

  519. S. Fujimoto, K. Takemoto, Revisiting the Political Biases of ChatGPT. Frontiers in Artificial Intelligence 6, 1232003 (2023); https://doi.org/10.3389/frai.2023.1232003

  520. C. Walker, J. C. Timoneda, Identifying the Sources of Ideological Bias in GPT Models through Linguistic Variation in Output, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.06043 2 3

  521. T. Ceron, N. Falk, A. Barić, D. Nikolaev, S. Padó, Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in LLMs. Transactions of the Association for Computational Linguistics 12, 1378–1400 (2024); https://doi.org/10.1162/tacl_a_00710

  522. E. Perez, S. Ringer, K. Lukosiute, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, … J. Kaplan, “Discovering Language Model Behaviors with Model-Written Evaluations” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, N. Okazaki, Eds. (Association for Computational Linguistics, Toronto, Canada, 2023), pp. 13387–13434; https://doi.org/10.18653/v1/2023.findings-acl.847 2 3 4 5 6 7 8

  523. J. Fisher, S. Feng, R. Aron, T. Richardson, Y. Choi, D. W. Fisher, J. Pan, Y. Tsvetkov, K. Reinecke, Biased AI Can Influence Political Decision-Making, arXiv [cs.HC] (2024); http://arxiv.org/abs/2410.06415

  524. U. Messer, How Do People React to Political Bias in Generative Artificial Intelligence (AI)? Computers in Human Behavior: Artificial Humans, 100108 (2024); https://doi.org/10.1016/j.chbah.2024.100108

  525. Á. A. Cabrera, W. Epperson, F. Hohman, M. Kahng, J. Morgenstern, D. H. Chau, “FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning” in 2019 IEEE Conference on Visual Analytics Science and Technology (VAST) (2019), pp. 46–56; https://doi.org/10.1109/VAST47406.2019.8986948

  526. W. Guo, A. Caliskan, “Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘21) (Association for Computing Machinery, New York, NY, USA, 2021), pp. 122–133; https://doi.org/10.1145/3461702.3462536

  527. I. M. S. Lassen, M. Almasi, K. Enevoldsen, R. D. Kristensen-McLachlan, “Detecting Intersectionality in NER Models: A Data-Driven Approach” in Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, S. Degaetano-Ortlieb, A. Kazantseva, N. Reiter, S. Szpakowicz, Eds. (Association for Computational Linguistics, Dubrovnik, Croatia, 2023), pp. 116–127; https://doi.org/10.18653/v1/2023.latechclfl-1.13

  528. A. Ovalle, A. Subramonian, V. Gautam, G. Gee, K.-W. Chang, “Factoring the Matrix of Domination: A Critical Review and Reimagination of Intersectionality in AI Fairness” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 496–511; https://doi.org/10.1145/3600211.3604705

  529. K. Wilson, A. Caliskan, Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval, arXiv [cs.CY] (2024); http://arxiv.org/abs/2407.20371 2

  530. X. Fang, S. Che, M. Mao, H. Zhang, M. Zhao, X. Zhao, Bias of AI-Generated Content: An Examination of News Produced by Large Language Models. Scientific Reports 14, 5224 (2024); https://doi.org/10.1038/s41598-024-55686-2

  531. H. An, C. Acquaye, C. Wang, Z. Li, R. Rudinger, “Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender?” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 386–397; https://doi.org/10.18653/v1/2024.acl-short.37

  532. R. Navigli, S. Conia, B. Ross, Biases in Large Language Models: Origins, Inventory, and Discussion. J. Data and Information Quality 15, 1–21 (2023); https://doi.org/10.1145/3597307

  533. Y. Li, M. Du, R. Song, X. Wang, Y. Wang, A Survey on Fairness in Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2308.10149

  534. [industry] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, A. Awadallah, Orca: Progressive Learning from Complex Explanation Traces of GPT-4, arXiv [cs.CL] (2023); http://arxiv.org/abs/2306.02707

  535. E. Ferrara, Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, Impacts, and Mitigation Strategies. Sci 6, 3 (2023); https://doi.org/10.3390/sci6010003 2

  536. S. U. Noble, Algorithms of Oppression: How Search Engines Reinforce Racism, NYU Press (2019); https://nyupress.org/9781479837243/algorithms-of-oppression/

  537. S. Lazar, A. Nelson, AI Safety on Whose Terms? Science 381, 138 (2023); https://doi.org/10.1126/science.adi8982 2 3

  538. R. I. J. Dobbe, T. K. Gilbert, Y. Mintz, “Hard Choices in Artificial Intelligence: Addressing Normative Uncertainty through Sociotechnical Commitments (AIES ‘20)” in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (Association for Computing Machinery, New York, NY, USA, 2020), p. 242; https://doi.org/10.1145/3375627.3375861 2 3

  539. M. Shur-Ofry, Multiplicity as an AI Governance Principle (2023); https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4444354

  540. M. Sloane, E. Moss, O. Awomolo, L. Forlano, “Participation Is Not a Design Fix for Machine Learning” in Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 1–6; https://doi.org/10.1145/3551624.3555285 2 3

  541. H. Gonen, Y. Goldberg, “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But Do Not Remove Them” in Proceedings of the 2019 Workshop on Widening NLP, A. Axelrod, D. Yang, R. Cunha, S. Shaikh, Z. Waseem, Eds. (Association for Computational Linguistics, Florence, Italy, 2019), pp. 60–63; https://aclanthology.org/W19-3621

  542. J. Xiao, Z. Li, X. Xie, E. Getzen, C. Fang, Q. Long, W. J. Su, On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization, arXiv [stat.ML] (2024); http://arxiv.org/abs/2405.16455

  543. D. Y. Kim, C. Wallraven, “Label Quality in AffectNet: Results of Crowd-Based Re-Annotation” in Lecture Notes in Computer Science (Springer International Publishing, Cham, 2022)Lecture notes in computer science, pp. 518–531; https://doi.org/10.1007/978-3-031-02444-3_39

  544. J. Ma, Y. Ushiku, M. Sagara, “The Effect of Improving Annotation Quality on Object Detection Datasets: A Preliminary Study” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022), pp. 4849–4858; https://doi.org/10.1109/CVPRW56347.2022.00532

  545. Z. Xu, K. Peng, L. Ding, D. Tao, X. Lu, Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.09963

  546. H. Weerts, F. Pfisterer, M. Feurer, K. Eggensperger, E. Bergman, N. Awad, J. Vanschoren, M. Pechenizkiy, B. Bischl, F. Hutter, Can Fairness Be Automated? Guidelines and Opportunities for Fairness-Aware AutoML. The Journal of Artificial Intelligence Research 79, 639–677 (2024); https://doi.org/10.1613/jair.1.14747

  547. I. D. Raji, J. Buolamwini, “Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products” in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (ACM, New York, NY, USA, 2019); https://doi.org/10.1145/3306618.3314244

  548. D. Zhang, P. Finckenberg-Broman, T. Hoang, S. Pan, Z. Xing, M. Staples, X. Xu, Right to Be Forgotten in the Era of Large Language Models: Implications, Challenges, and Solutions. AI and Ethics (2024); https://doi.org/10.1007/s43681-024-00573-9 2

  549. [industry] A. Xiang, Being “Seen” vs. “Mis-Seen”: Tensions between Privacy and Fairness in Computer Vision. Harvard Journal of Law & Technology 36 (2022); https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4068921

  550. J. Kleinberg, “Inherent Trade-Offs in Algorithmic Fairness” in Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ‘18) (Association for Computing Machinery, New York, NY, USA, 2018), p. 40; https://doi.org/10.1145/3219617.3219634 2

  551. H. Nilforoshan, J. D. Gaebler, R. Shroff, S. Goel, “Causal Conceptions of Fairness and Their Consequences” in Proceedings of the 39th International Conference on Machine Learning (ICML 2022) (PMLR, 2022); https://proceedings.mlr.press/v162/nilforoshan22a.html

  552. N. Konstantinov, C. H. Lampert, “On the Impossibility of Fairness-Aware Learning from Corrupted Data” in Algorithmic Fairness through the Lens of Causality and Robustness Workshop (AFCR 2021) (PMLR, Virtual, 2021); https://proceedings.mlr.press/v171/konstantinov22a.html

  553. A. Chouldechova, Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 5, 153–163 (2017); https://doi.org/10.1089/big.2016.0047

  554. Q. Zhang, J. Liu, Z. Zhang, J. Wen, B. Mao, X. Yao, Mitigating Unfairness via Evolutionary Multiobjective Ensemble Learning. IEEE Transactions on Evolutionary Computation 27, 848–862 (2023); https://doi.org/10.1109/TEVC.2022.3209544

  555. M. Hardt, E. Price, E. Price, N. Srebro, “Equality of Opportunity in Supervised Learning” in 30th Conference on Neural Information Processing Systems (NIPS 2016) (Curran Associates, Inc., Barcelona, Spain, 2016) vol. 29; https://proceedings.neurips.cc/paper_files/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html

  556. M. Brcic, R. V. Yampolskiy, Impossibility Results in AI: A Survey. ACM Comput. Surv. 56, 1–24 (2023); https://doi.org/10.1145/3603371

  557. B. Green, Escaping the Impossibility of Fairness: From Formal to Substantive Algorithmic Fairness. Philosophy & Technology 35, 90 (2022); https://doi.org/10.1007/s13347-022-00584-6 2 3

  558. A. Bell, L. Bynum, N. Drushchak, T. Zakharchenko, L. Rosenblatt, J. Stoyanovich, “The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice” in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 400–422; https://doi.org/10.1145/3593013.3594007 2

  559. K. T. Rodolfa, H. Lamba, R. Ghani, Empirical Observation of Negligible Fairness–accuracy Trade-Offs in Machine Learning for Public Policy. Nature Machine Intelligence 3, 896–904 (2021); https://doi.org/10.1038/s42256-021-00396-x

  560. V. Hofmann, P. R. Kalluri, D. Jurafsky, S. King, Dialect Prejudice Predicts AI Decisions about People’s Character, Employability, and Criminality, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.00742 2

  561. R. L. Johnson, G. Pistilli, N. Menédez-González, L. D. D. Duran, E. Panai, J. Kalpokiene, D. J. Bertulfo, The Ghost in the Machine Has an American Accent: Value Conflict in GPT-3, arXiv [cs.CL] (2022); http://arxiv.org/abs/2203.07785 2

  562. E. Durmus, K. Nguyen, T. Liao, N. Schiefer, A. Askell, A. Bakhtin, C. Chen, Z. Hatfield-Dodds, D. Hernandez, N. Joseph, L. Lovitt, S. McCandlish, O. Sikder, A. Tamkin, J. Thamkul, J. Kaplan, J. Clark, D. Ganguli, “Towards Measuring the Representation of Subjective Global Opinions in Language Models” in First Conference on Language Modeling (2024); https://openreview.net/pdf?id=zl16jLb91v 2

  563. Y. Wan, K.-W. Chang, White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs, arXiv [cs.CL] (2024); http://arxiv.org/abs/2404.10508 2

  564. B. AlKhamissi, M. ElNokrashy, M. Alkhamissi, M. Diab, “Investigating Cultural Alignment of Large Language Models” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 12404–12422; https://doi.org/10.18653/v1/2024.acl-long.671 2

  565. H. Yuan, Z. Che, S. Li, Y. Zhang, X. Hu, S. Luo, The High Dimensional Psychological Profile and Cultural Bias of ChatGPT, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.03387 2

  566. R. Hada, S. Husain, V. Gumma, H. Diddee, A. Yadavalli, A. Seth, N. Kulkarni, U. Gadiraju, A. Vashistha, V. Seshadri, K. Bali, “Akal Badi Ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3630106.3659017

  567. M. H. J. Lee, J. M. Montgomery, C. K. Lai, “Large Language Models Portray Socially Subordinate Groups as More Homogeneous, Consistent with a Bias Observed in Humans” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3630106.3658975

  568. C. Raj, A. Mukherjee, A. Caliskan, A. Anastasopoulos, Z. Zhu, Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7, 1180–1189 (2024); https://ojs.aaai.org/index.php/AIES/article/view/31715

  569. D. Oba, M. Kaneko, D. Bollegala, “In-Contextual Gender Bias Suppression for Large Language Models” in Findings of the Association for Computational Linguistics: EACL 2024 (2024), pp. 1722–1742; https://aclanthology.org/2024.findings-eacl.121.pdf

  570. Y. Reif, R. Schwartz, “Beyond Performance: Quantifying and Mitigating Label Bias in LLMs” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 6784–6798; https://doi.org/10.18653/v1/2024.naacl-long.378

  571. M. Ribeiro, B. Malcorra, N. B. Mota, R. Wilkens, A. Villavicencio, L. C. Hubner, C. Rennó-Costa, A Methodology for Explainable Large Language Models with Integrated Gradients and Linguistic Analysis in Text Classification, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.00250

  572. L. Luo, Y.-F. Li, R. Haf, S. Pan, “Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning” in 12th International Conference on Learning Representations (2023); https://openreview.net/pdf?id=ZGNWW7xZ6Q

  573. S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Transactions on Knowledge and Data Engineering 36, 3580–3599 (2024); https://doi.org/10.1109/tkde.2024.3352100

  574. S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, The (Im)possibility of Fairness: Different Value Systems Require Different Mechanisms for Fair Decision Making. Communications of the ACM 64, 136–143 (2021); https://doi.org/10.1145/3433949 2

  575. J. Banja, J. W. Gichoya, N. Martinez-Martin, L. A. Waller, G. D. Clifford, Fairness as an Afterthought: An American Perspective on Fairness in Model Developer-Clinician User Collaborations. PLOS Digital Health 2, e0000386 (2023); https://doi.org/10.1371/journal.pdig.0000386

  576. N. A. Saxena, K. Huang, E. DeFilippis, G. Radanovic, D. C. Parkes, Y. Liu, “How Do Fairness Definitions Fare? Examining Public Attitudes Towards Algorithmic Definitions of Fairness” in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘19) (Association for Computing Machinery, New York, NY, USA, 2019), pp. 99–106; https://doi.org/10.1145/3306618.3314248

  577. W. Fleisher, “What’s Fair about Individual Fairness?” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘21) (Association for Computing Machinery, New York, NY, USA, 2021), pp. 480–490; https://doi.org/10.1145/3461702.3462621

  578. A. M. Turing, Intelligent Machinery, A Heretical Theory*. Philosophia Mathematica. Series III 4, 256–260 (1996); https://doi.org/10.1093/philmat/4.3.256

  579. I. J. Good, “Speculations Concerning the First Ultraintelligent Machine” in Advances in Computers, F. L. Alt, M. Rubinoff, Eds. (Elsevier, 1966) vol. 6, pp. 31–88; https://doi.org/10.1016/S0065-2458(08)60418-0

  580. N. Wiener, Some Moral and Technical Consequences of Automation. Science 131, 1355–1358 (1960); https://doi.org/10.1126/science.131.3410.1355 2

  581. S. M. Omohundro, “The Basic AI Drives” in Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference (IOS Press, NLD, 2008), pp. 483–492; https://dl.acm.org/doi/10.5555/1566174.1566226

  582. N. Bostrom, M. M. Cirkovic, Global Catastrophic Risks (Oxford University Press, London, England, 2011); https://academic.oup.com/book/40615

  583. S. Russell, P. Norvig, Artificial Intelligence (Pearson, Upper Saddle River, NJ, ed. 3, 2009); https://people.engr.tamu.edu/guni/csce421/files/AI_Russell_Norvig.pdf

  584. N. Bostrom, Superintelligence: Paths, Dangers, Strategies (Oxford University Press, London, England, 2014); https://global.oup.com/academic/product/superintelligence-9780198739838?cc=mx&lang=en&

  585. S. J. Russell, Human Compatible: Artificial Intelligence and the Problem of Control (Penguin Books, Harlow, England, 2020); https://www.penguin.co.uk/books/307948/human-compatible-by-russell-stuart/9780141987507 2 3

  586. Center for AI Safety, Statement on AI Risk: AI Experts and Public Figures Express Their Concern about AI Risk (2024); https://www.safe.ai/work/statement-on-ai-risk 2

  587. Y. Bengio, Written Statement of Professor Yoshua Bengio Before the US Senate Forum on AI Insight Regarding Risk, Alignment, and Guarding Against Doomsday Scenarios. (2023); https://www.schumer.senate.gov/imo/media/doc/Yoshua%20Benigo%20-%20Statement.pdf

  588. K. Grace, H. Stewart, J. F. Sandkühler, S. Thomas, B. Weinstein-Raun, J. Brauner, Thousands of AI Authors on the Future of AI, arXiv [cs.CY] (2024); http://arxiv.org/abs/2401.02843

  589. A. Critch, S. Russell, TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI, arXiv [cs.AI] (2023); http://arxiv.org/abs/2306.06924 2

  590. K. Goddard, A. Roudsari, J. C. Wyatt, Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators. Journal of the American Medical Informatics Association: JAMIA 19, 121–127 (2012); https://doi.org/10.1136/amiajnl-2011-000089

  591. M. Chugunova, D. Sele, We and It: An Interdisciplinary Review of the Experimental Evidence on How Humans Interact with Machines. Journal of Behavioral and Experimental Economics 99, 101897 (2022); https://doi.org/10.1016/j.socec.2022.101897

  592. A. Kasirzadeh, Two Types of AI Existential Risk: Decisive and Accumulative, arXiv [cs.CY] (2024); http://arxiv.org/abs/2401.07836 2

  593. M. Kinniment, L. J. K. Sato, H. Du, B. Goodrich, M. Hasin, L. Chan, L. H. Miles, T. R. Lin, H. Wijk, J. Burget, A. Ho, E. Barnes, P. Christiano, Evaluating Language-Model Agents on Realistic Autonomous Tasks, arXiv [cs.CL] (2023); https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks.pdf 2 3

  594. [industry] OpenAI, “Preparedness Framework (Beta)” (OpenAI, 2023); https://cdn.openai.com/openai-preparedness-framework-beta.pdf 2 3 4

  595. [industry] Anthropic, Anthropic’s Responsible Scaling Policy, Version 1.0. (2023); https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf 2 3

  596. [industry] Google DeepMind, Frontier Safety Framework Version 1.0. (2024); https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/introducing-the-frontier-safety-framework/fsf-technical-report.pdf 2 3 4

  597. T. Hagendorff, Deception Abilities Emerged in Large Language Models. Proceedings of the National Academy of Sciences of the United States of America 121, e2317967121 (2024); https://doi.org/10.1073/pnas.2317967121

  598. [industry] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, A. Jermyn, A. Askell, A. Radhakrishnan, C. Anil, D. Duvenaud, D. Ganguli, F. Barez, … E. Perez, Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training, arXiv [cs.CR] (2024); http://arxiv.org/abs/2401.05566 2 3 4

  599. [industry] C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, N. Schiefer, R. Soklaski, A. Tamkin, J. Kaplan, B. Shlegeris, S. R. Bowman, E. Perez, E. Hubinger, Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models, arXiv [cs.AI] (2024); http://arxiv.org/abs/2406.10162 2 3 4

  600. S. Kapoor, B. Stroebl, Z. S. Siegel, N. Nadgir, A. Narayanan, AI Agents That Matter, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.01502 2

  601. R. Shiffrin, M. Mitchell, Probing the Psychology of AI Models. Proceedings of the National Academy of Sciences of the United States of America 120, e2300963120 (2023); https://doi.org/10.1073/pnas.2300963120

  602. D. Hendrycks, M. Mazeika, T. Woodside, An Overview of Catastrophic AI Risks, arXiv [cs.CY] (2023); http://arxiv.org/abs/2306.12001 2 3

  603. J. Lehman, J. Clune, D. Misevic, C. Adami, L. Altenberg, J. Beaulieu, P. J. Bentley, S. Bernard, G. Beslon, D. M. Bryson, N. Cheney, P. Chrabaszcz, A. Cully, S. Doncieux, F. C. Dyer, K. O. Ellefsen, R. Feldt, … J. Yosinski, The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. Artificial Life 26, 274–306 (2020); https://doi.org/10.1162/artl_a_00319

  604. J. Skalse, N. H. R. Howe, D. Krasheninnikov, D. Krueger, Defining and Characterizing Reward Hacking, arXiv [cs.LG] (2022); http://arxiv.org/abs/2209.13085 2

  605. R. Ngo, L. Chan, S. Mindermann, “The Alignment Problem from a Deep Learning Perspective” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=fh8EYKFKns 2 3 4

  606. J. Ji, T. Qiu, B. Chen, B. Zhang, H. Lou, K. Wang, Y. Duan, Z. He, J. Zhou, Z. Zhang, F. Zeng, K. Y. Ng, J. Dai, X. Pan, A. O’Gara, Y. Lei, H. Xu, … W. Gao, AI Alignment: A Comprehensive Survey, arXiv [cs.AI] (2023); http://arxiv.org/abs/2310.19852 2 3 4 5

  607. A. Pan, K. Bhatia, J. Steinhardt, “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models” in The 10th International Conference on Learning Representations (ICLR 2022) (Virtual, 2021); https://openreview.net/forum?id=JYtwGwIL7ye 2 3 4

  608. J. Wen, R. Zhong, A. Khan, E. Perez, J. Steinhardt, M. Huang, S. R. Bowman, H. He, S. Feng, Language Models Learn to Mislead Humans via RLHF, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.12822 2 3 4

  609. [industry] S. R. Bowman, J. Hyun, E. Perez, E. Chen, C. Pettit, S. Heiner, K. Lukošiūtė, A. Askell, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Olah, D. Amodei, D. Amodei, D. Drain, … J. Kaplan, Measuring Progress on Scalable Oversight for Large Language Models, arXiv [cs.HC] (2022); http://arxiv.org/abs/2211.03540 2 3 4

  610. [industry] P. Christiano, B. Shlegeris, Dario, Amodei, Supervising Strong Learners by Amplifying Weak Experts, arXiv [cs.LG] (2018); http://arxiv.org/abs/1810.08575 2

  611. [industry] G. Irving, P. Christiano, D. Amodei, “AI Safety via Debate” (OpenAI, 2018); http://arxiv.org/abs/1805.00899 2

  612. [industry] J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, S. Legg, Scalable Agent Alignment via Reward Modeling: A Research Direction, arXiv [cs.LG] (2018); http://arxiv.org/abs/1811.07871

  613. [industry] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, P. Christiano, Recursively Summarizing Books with Human Feedback, arXiv [cs.CL] (2021); http://arxiv.org/abs/2109.10862

  614. [industry] W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, J. Leike, Self-Critiquing Models for Assisting Human Evaluators, arXiv [cs.CL] (2022); http://arxiv.org/abs/2206.05802

  615. [industry] A. Khan, J. Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman, T. Rocktäschel, E. Perez, Debating with More Persuasive LLMs Leads to More Truthful Answers, arXiv [cs.AI] (2024); http://arxiv.org/abs/2402.06782 2

  616. L. L. D. Langosco, J. Koch, L. D. Sharkey, J. Pfau, D. Krueger, “Goal Misgeneralization in Deep Reinforcement Learning” in Proceedings of the 39th International Conference on Machine Learning (PMLR, 2022) vol. 162, pp. 12004–12019; https://proceedings.mlr.press/v162/langosco22a.html 2 3 4 5

  617. [industry] R. Shah, V. Varma, R. Kumar, M. Phuong, V. Krakovna, J. Uesato, Z. Kenton, Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, arXiv [cs.LG] (2022); http://arxiv.org/abs/2210.01790 2

  618. H. N. E. Barj, T. Sautory, Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization, arXiv [cs.LG] (2024); http://arxiv.org/abs/2401.07181

  619. D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, D. Song, “Pretrained Transformers Improve Out-of-Distribution Robustness” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), D. Jurafsky, J. Chai, N. Schluter, J. Tetreault, Eds. (Association for Computational Linguistics, Online, 2020), pp. 2744–2751; https://doi.org/10.18653/v1/2020.acl-main.244

  620. L. Berglund, A. C. Stickland, M. Balesni, M. Kaufmann, M. Tong, T. Korbak, D. Kokotajlo, O. Evans, Taken out of Context: On Measuring Situational Awareness in LLMs, arXiv [cs.CL] (2023); http://arxiv.org/abs/2309.00667

  621. R. Laine, B. Chughtai, J. Betley, K. Hariharan, M. Balesni, J. Scheurer, M. Hobbhahn, A. Meinke, O. Evans, “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs” in 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024); https://openreview.net/forum?id=UnWhcpIyUC

  622. C. Schwab, L. Huber, Obey or Not Obey? Dogs (Canis Familiaris) Behave Differently in Response to Attentional States of Their Owners. Journal of Comparative Psychology (Washington, D.C.: 1983) 120, 169–175 (2006); https://doi.org/10.1037/0735-7036.120.3.169

  623. [industry] V. Krakovna, J. Kramar, Power-Seeking Can Be Probable and Predictive for Trained Agents, arXiv [cs.AI] (2023); http://arxiv.org/abs/2304.06528 2

  624. A. Turner, L. Smith, R. Shah, A. Critch, P. Tadepalli, “Optimal Policies Tend To Seek Power” in 35th Conference on Neural Information Processing Systems (NeurIPS 2021) (Curran Associates, Inc., Virtual, 2021) vol. 34; https://proceedings.neurips.cc/paper/2021/hash/c26820b8a4c1b3c2aa868d6d57e14a79-Abstract.html

  625. A. Turner, P. Tadepalli, “Parametrically Retargetable Decision-Makers Tend to Seek Power” in Advances in Neural Information Processing Systems (NeurIPS 2022) Main Conference Track (New Orleans, LA, US, 2022) vol. abs/2206.13477; https://doi.org/10.48550/arXiv.2206.13477

  626. M. K. Cohen, M. Hutter, M. A. Osborne, Advanced Artificial Agents Intervene in the Provision of Reward. AI Magazine 43, 282–293 (2022); https://doi.org/10.1002/aaai.12064

  627. S. Zhuang, D. Hadfield-Menell, “Consequences of Misaligned AI” in Advances in Neural Information Processing Systems (NeurIPS 2020) (Curran Associates, Inc., 2020) vol. 33, pp. 15763–15773; https://proceedings.neurips.cc/paper/2020/hash/b607ba543ad05417b8507ee86c54fcb7-Abstract.html

  628. E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, S. Garrabrant, Risks from Learned Optimization in Advanced Machine Learning Systems, arXiv [cs.AI] (2019); http://arxiv.org/abs/1906.01820

  629. J. Carlsmith, Scheming AIs: Will AIs Fake Alignment during Training in Order to Get Power?, arXiv [cs.CY] (2023); http://arxiv.org/abs/2311.08379

  630. [industry] R. Grosse, J. Bae, C. Anil, N. Elhage, A. Tamkin, A. Tajdini, B. Steiner, D. Li, E. Durmus, E. Perez, E. Hubinger, K. Lukošiūtė, K. Nguyen, N. Joseph, S. McCandlish, J. Kaplan, S. R. Bowman, Studying Large Language Model Generalization with Influence Functions, arXiv [cs.LG] (2023); http://arxiv.org/abs/2308.03296

  631. S. Im, Y. Li, On the Generalization of Preference Learning with DPO, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.03459

  632. A. Pan, J. S. Chan, A. Zou, N. Li, S. Basart, T. Woodside, H. Zhang, S. Emmons, D. Hendrycks, “Do the Rewards Justify the Means? Measuring Trade-Offs between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark” in Proceedings of the 40th International Conference on Machine Learning (ICML’23) (JMLR, Honolulu, Hawaii, USA, 2023) vol. 202, pp. 26837–26867. 

  633. L. Dung, The Argument for near-Term Human Disempowerment through AI. AI & Society, 1–14 (2024); https://doi.org/10.1007/s00146-024-01930-2

  634. P. J. Denning, The Science of Computing: The Internet Worm. American Scientist 77, 126–128 (1989); http://www.jstor.org/stable/27855650

  635. D. Hendrycks, Natural Selection Favors AIs over Humans, arXiv [cs.CY] (2023); http://arxiv.org/abs/2303.16200

  636. UK AI Safety Institute, Advancing the Field of Systemic AI Safety: Grants Open (2024); https://www.aisi.gov.uk/work/advancing-the-field-of-systemic-ai-safety-grants-open

  637. [industry] T. Eloundou, S. Manning, P. Mishkin, D. Rock, GPTs Are GPTs: Labor Market Impact Potential of LLMs. Science 384, 1306–1308 (2024); https://doi.org/10.1126/science.adj0998 2 3 4 5

  638. B. Lou, H. Sun, T. Sun, GPTs and Labor Markets in the Developing Economy: Evidence from China, SSRN [preprint] (2023); https://doi.org/10.2139/ssrn.4426461

  639. P. Gmyrek, J. Berg, D. Bescond, Generative AI and Jobs: A Global Analysis of Potential Effects on Job Quantity and Quality (International Labour Organization, Geneva, 2023); https://doi.org/10.54394/fhem8239 2

  640. M. Cazzaniga, F. Jaumotte, L. Li, G. Melina, A. J. Panton, C. Pizzinelli, E. J. Rockall, M. M. Tavares, “Gen-AI: Artificial Intelligence and the Future of Work” (SDN/2024/001, International Monetary Fund, 2024); https://www.imf.org/en/Publications/Staff-Discussion-Notes/Issues/2024/01/14/Gen-AI-Artificial-Intelligence-and-the-Future-of-Work-542379 2 3 4 5

  641. D. Acemoglu, P. Restrepo, Automation and New Tasks: How Technology Displaces and Reinstates Labor. The Journal of Economic Perspectives: A Journal of the American Economic Association 33, 3–30 (2019); https://doi.org/10.1257/jep.33.2.3 2

  642. D. Acemoglu, D. Autor, “Skills, Tasks and Technologies: Implications for Employment and Earnings*” in Handbook of Labor Economics, D. Card, O. Ashenfelter, Eds. (Elsevier, 2011) vol. 4, pp. 1043–1171; https://doi.org/10.1016/S0169-7218(11)02410-5

  643. P. Restrepo, “Automation: Theory, Evidence, and Outlook” (w31910, National Bureau of Economic Research, 2023); https://doi.org/10.3386/w31910 2

  644. D. Autor, C. Chin, A. Salomons, B. Seegmiller, “New Frontiers: The Origins and Content of New Work, 1940–2018” (30389, National Bureau of Economic Research, 2022); https://doi.org/10.3386/w30389 2

  645. X. Hui, O. Reshef, L. Zhou, “The Short-Term Effects of Generative Artificial Intelligence on Employment: Evidence from an Online Labor Market” (10601, CESifo Working Paper, 2023); https://www.econstor.eu/handle/10419/279352 2

  646. A. Korinek, D. Suh, “Scenarios for the Transition to AGI” (32255, National Bureau of Economic Research, 2024); https://doi.org/10.3386/w32255 2 3

  647. A. Korinek, Scenario Planning for an A(G)I Future. Finance and Development Magazine (2023); https://www.imf.org/en/Publications/fandd/issues/2023/12/Scenario-Planning-for-an-AGI-future-Anton-korinek

  648. D. Acemoglu, “The Simple Macroeconomics of AI” (w32487, National Bureau of Economic Research, 2024); https://doi.org/10.3386/w32487

  649. B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. R. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, A. Fawzi, Mathematical Discoveries from Program Search with Large Language Models. Nature 625, 468–475 (2024); https://doi.org/10.1038/s41586-023-06924-6

  650. Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, … O. Vinyals, Competition-Level Code Generation with AlphaCode. Science (New York, N.Y.) 378, 1092–1097 (2022); https://doi.org/10.1126/science.abq1158

  651. S. Noy, W. Zhang, Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. Science (New York, N.Y.) 381, 187–192 (2023); https://doi.org/10.1126/science.adh2586 2 3

  652. D. Susskind, A World without Work: Technology, Automation, and How We Should Respond (Metropolitan Books, 2020); https://www.danielsusskind.com/a-world-without-work

  653. A. Korinek, M. Juelfs, “Preparing for the (non-Existent?) Future of Work” (w30172, National Bureau of Economic Research, 2022); https://doi.org/10.3386/w30172

  654. A. Korinek, “Economic Policy Challenges for the Age of AI” (w32980, National Bureau of Economic Research, 2024); https://doi.org/10.3386/w32980 2

  655. [industry] A. McAfee, “Generally Faster: The Economic Impact of Generative AI” (Google, 2024); [https://policycommons.net/artifacts/12281693/generally_faster_-the_economic_impact_of_generative_ai/](https://policycommons.net/artifacts/12281693/generally_faster-_the_economic_impact_of_generative_ai/). 

  656. A. Agrawal, J. Gans, A. Goldfarb, “AI Adoption and System-Wide Change” (w28811, National Bureau of Economic Research, 2021); https://doi.org/10.3386/w28811

  657. J. Feigenbaum, D. P. Gross, Organizational and Economic Obstacles to Automation: A Cautionary Tale from AT&T in the Twentieth Century. Management Science (2024); https://doi.org/10.1287/mnsc.2022.01760

  658. M. Svanberg, W. Li, M. Fleming, B. Goehring, N. Thompson, Beyond AI Exposure: Which Tasks Are Cost-Effective to Automate with Computer Vision?, SSRN [preprint] (2024); https://doi.org/10.2139/ssrn.4700751

  659. V. Magesh, F. Surani, M. Dahl, M. Suzgun, C. D. Manning, D. E. Ho, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.20362

  660. E. Erdil, T. Besiroglu, Explosive Growth from AI Automation: A Review of the Arguments, arXiv [econ.GN] (2023); https://epoch.ai/blog/explosive-growth-from-ai-a-review-of-the-arguments

  661. A. Bick, A. Blandin, D. Deming, “The Rapid Adoption of Generative AI” (w32966, National Bureau of Economic Research, 2024); https://doi.org/10.3386/w32966 2 3 4

  662. E. Brynjolfsson, D. Li, L. Raymond, “Generative AI at Work” (w31161, National Bureau of Economic Research, 2023); https://doi.org/10.3386/w31161 2 3 4 5 6

  663. D. Acemoglu, P. Restrepo, The Race between Man and Machine: Implications of Technology for Growth, Factor Shares, and Employment. American Economic Review 108, 1488–1542 (2018); https://doi.org/10.1257/aer.20160696

  664. A. K. Agrawal, J. S. Gans, A. Goldfarb, “The Turing Transformation: Artificial Intelligence, Intelligence Augmentation, and Skill Premiums” (31767, National Bureau of Economic Research, 2023); https://doi.org/10.3386/w31767 2

  665. E. Felten, M. Raj, R. Seamans, How Will Language Modelers like ChatGPT Affect Occupations and Industries?, arXiv [econ.GN] (2023); http://arxiv.org/abs/2303.01157

  666. E. W. Felten, M. Raj, R. Seamans, Occupational Heterogeneity in Exposure to Generative AI, SSRN [preprint] (2023); https://doi.org/10.2139/ssrn.4414065

  667. F. Dell’Acqua, E. McFowland III, E. R. Mollick, H. Lifshitz-Assaf, K. Kellogg, S. Rajendran, L. Krayer, F. Candelon, K. R. Lakhani, “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality” (24–013, Harvard Business School, 2023); https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf 2

  668. J. H. Choi, A. Monahan, D. B. Schwarcz, Lawyering in the Age of Artificial Intelligence, SSRN [preprint] (2023); https://doi.org/10.2139/ssrn.4626276 2

  669. K. Bonney, C. Breaux, C. Buffington, E. Dinlersoz, L. Foster, N. Goldschlag, J. Haltiwanger, Z. Kroff, K. Savage, “Tracking Firm Use of AI in Real Time: A Snapshot from the Business Trends and Outlook Survey” (w32319, National Bureau of Economic Research, 2024); https://doi.org/10.3386/w32319 2

  670. A. Korinek, The Rise of Articially Intelligent Agents (2019); https://drive.google.com/file/d/16y5UmeTOv5YB9E5ms_ce7WiYNfMAn17J/view 2

  671. A. Chan, R. Salganik, A. Markelius, C. Pang, N. Rajkumar, D. Krasheninnikov, L. Langosco, Z. He, Y. Duan, M. Carroll, M. Lin, A. Mayhew, K. Collins, M. Molamohammadi, J. Burden, W. Zhao, S. Rismani, … T. Maharaj, “Harms from Increasingly Agentic Algorithmic Systems” in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 651–666; https://doi.org/10.1145/3593013.3594033 2

  672. METR, Details about METR’s Preliminary Evaluation of GPT-4o, METR’s Autonomy Evaluation Resources (2024); https://metr.github.io/autonomy-evals-guide/gpt-4o-report/

  673. Y. Shavit, S. Agarwal, M. Brundage, S. A. C. O’Keefe, R. Campbell, T. Lee, P. Mishkin, T. Eloundou, A. Hickey, K. Slama, L. Ahmad, P. McMillan, A. Beutel, A. Passos, D. G. Robinson, Practices for Governing Aagentic AI Systems. Research Paper, OpenAI (2023); https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf 2

  674. D. Hyslop, W. Townsend, “The Longer Term Impacts of Job Displacement on Labour Market Outcomes” (Motu Economic and Public Policy Research, 2017); https://www.motu.nz/our-research/population-and-labour/individual-and-group-outcomes/the-longer-term-impacts-of-job-displacement-on-labour-market-outcomes/

  675. S. C. Dixon, D. C. Maré, “The Costs of Involuntary Job Loss: Impacts on Workers’ Employment and Earnings” (Motu Economic and Public Policy Research, 2013); https://doi.org/10.2139/ssrn.2247198

  676. D. Hamermesh, “What Do We Know about Worker Displacement in the U.s.?” (National Bureau of Economic Research, 1987); https://doi.org/10.3386/w2402

  677. L. S. Jacobson, R. J. LaLonde, D. G. Sullivan, Earnings Losses of Displaced Workers. The American Economic Review 83, 685–709 (1993); http://www.jstor.org/stable/2117574

  678. T. Von Wachter, J. Song, J. Manchester, Long-Term Earnings Losses due to Mass Layoffs during the 1982 Recession: An Analysis Using US Administrative Data from 1974 to 2004 (2009); http://www.econ.ucla.edu/tvwachter/papers/mass_layoffs_1982.pdf

  679. J. Barnette, A. Michaud, Wage Scars and Human Capital Theory (2017); https://ammichau.github.io/papers/JBAMWageScar.pdf

  680. D. Sullivan, T. von Wachter, Job Displacement and Mortality: An Analysis Using Administrative Data*. The Quarterly Journal of Economics 124, 1265–1306 (2009); https://doi.org/10.1162/qjec.2009.124.3.1265

  681. S. A. Burgard, J. E. Brand, J. S. House, Toward a Better Estimation of the Effect of Job Loss on Health. Journal of Health and Social Behavior 48, 369–384 (2007); http://www.jstor.org/stable/27638722

  682. M. Browning, E. Heinesen, Effect of Job Loss due to Plant Closure on Mortality and Hospitalization. Journal of Health Economics 31, 599–616 (2012); https://doi.org/10.1016/j.jhealeco.2012.03.001 2

  683. K. Telle, M. Votruba, Parental Job Loss and Children’s School Performance. Review of Economic Studies 78, 1462–1489 (10 2011); https://doi.org/10.2307/41407068

  684. J. Duggan, U. Sherman, R. Carbery, A. McDonnell, Algorithmic Management and App‐work in the Gig Economy: A Research Agenda for Employment Relations and HRM. Human Resource Management Journal 30, 114–132 (2020); https://doi.org/10.1111/1748-8583.12258

  685. B. Bai, H. Dai, D. J. Zhang, F. Zhang, H. Hu, The Impacts of Algorithmic Work Assignment on Fairness Perceptions and Productivity: Evidence from Field Experiments. Manufacturing & Service Operations Management: M & SOM 24, 3060–3078 (2022); https://doi.org/10.1287/msom.2022.1120

  686. J. Howard, P. Schulte, Managing Workplace AI Risks and the Future of Work. American Journal of Industrial Medicine 67, 980–993 (2024); https://doi.org/10.1002/ajim.23653

  687. A. Bernhardt, L. Kresge, R. Suleiman, The Data-Driven Workplace and the Case for Worker Technology Rights. Industrial & Labor Relations Review 76, 3–29 (2023); https://doi.org/10.1177/00197939221131558 2

  688. D. Acemoglu, P. Restrepo, Tasks, Automation, and the Rise in U.S. Wage Inequality. Econometrica: Journal of the Econometric Society 90, 1973–2016 (2022); https://doi.org/10.3982/ECTA19815 2

  689. D. Acemoglu, Technical Change, Inequality, and the Labor Market. Journal of Economic Literature 40, 7–72 (2002); https://doi.org/10.1257/0022051026976

  690. D. H. Autor, Why Are There Still So Many Jobs? The History and Future of Workplace Automation. The Journal of Economic Perspectives: A Journal of the American Economic Association 29, 3–30 (2015); https://doi.org/10.1257/jep.29.3.3

  691. Ó. Afonso, R. Forte, Routine and Non-Routine Sectors, Tasks Automation and Wage Polarization. Applied Economics (2023); https://www.tandfonline.com/doi/abs/10.1080/00036846.2023.2280461

  692. D. Acemoglu, J. Loebbing, “Automation and Polarization” (National Bureau of Economic Research, 2022); https://doi.org/10.3386/w30528

  693. D. Autor, “Applying AI to Rebuild Middle Class Jobs” (National Bureau of Economic Research, 2024); https://doi.org/10.3386/w32140

  694. L. Karabarbounis, Perspectives on the Labor Share. The Journal of Economic Perspectives: A Journal of the American Economic Association 38, 107–136 (2024); https://doi.org/10.1257/jep.38.2.107

  695. M. Ranaldi, Income Composition Inequality. The Review of Income and Wealth 68, 139–160 (2022); https://doi.org/10.1111/roiw.12503

  696. T. Piketty, Capital in the Twenty-First Century (The Belknap Press of Harvard University Press, Cambridge Massachusetts, 2014); https://www.hup.harvard.edu/books/9780674430006

  697. B. Moll, L. Rachel, P. Restrepo, Uneven Growth: Automation’s Impact on Income and Wealth Inequality. Econometrica: Journal of the Econometric Society 90, 2645–2683 (2022); https://doi.org/10.3982/ECTA19417

  698. C. Wang, M. Zheng, X. Bai, Y. Li, W. Shen, Future of Jobs in China under the Impact of Artificial Intelligence. Finance Research Letters 55, 103798 (2023); https://doi.org/10.1016/j.frl.2023.103798

  699. H. Firooz, Z. Liu, Y. Wang, “Automation and the Rise of Superstar Firms” (Federal Reserve Bank of San Francisco, 2022); https://doi.org/10.24148/wp2022-05

  700. C. T. Okolo, AI in the Global South: Opportunities and Challenges towards More Inclusive Governance, Brookings (2023); https://www.brookings.edu/articles/ai-in-the-global-south-opportunities-and-challenges-towards-more-inclusive-governance/

  701. A. Korinek, J. E. Stiglitz, “Artificial Intelligence, Globalization, and Strategies for Economic Development” (National Bureau of Economic Research, 2021); https://doi.org/10.3386/w28453

  702. C. Alonso, A. Berg, S. Kothari, C. Papageorgiou, S. Rehman, “Will the AI Revolution Cause a Great Divergence?” (International Monetary Fund, 2020); https://www.imf.org/en/Publications/WP/Issues/2020/09/11/Will-the-AI-Revolution-Cause-a-Great-Divergence-49734

  703. H. Nii-Aponsah, B. Verspagen, P. Mohnen, “Automation-Induced Reshoring and Potential Implications for Developing Economies” (UNU-MERIT, 2023); https://ideas.repec.org/p/unm/unumer/2023018.html

  704. J. Jacobs, “How Generative AI Is Changing the Global South’s IT Services Sector” (Information Technology and Innovation Foundation, 2024); https://itif.org/publications/2024/06/10/how-generative-ai-is-changing-the-global-souths-it-services-sector/ 2

  705. N. Otis, R. Clarke, S. Delecourt, D. Holtz, R. Koning, “The Uneven Impact of Generative AI on Entrepreneurial Performance” (Harvard Business School, 2024); https://www.hbs.edu/ris/Publication%20Files/24-042_9ebd2f26-e292-404c-b858-3e883f0e11c0.pdf

  706. A. Merali, Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Translation, arXiv [econ.GN] (2024); http://arxiv.org/abs/2409.02391 2

  707. K. McElheran, J. F. Li, E. Brynjolfsson, Z. Kroff, E. Dinlersoz, L. Foster, N. Zolas, AI Adoption in America: Who, What, and Where. Journal of Economics & Management Strategy 33, 375–415 (2024); https://doi.org/10.1111/jems.12576 2

  708. K. Bonney, C. Breaux, C. Buffington, E. Dinlersoz, L. Foster, N. Goldschlag, J. Haltiwanger, Z. Kroff, K. Savage, The Impact of AI on the Workforce: Tasks versus Jobs? Economics Letters 244, 111971 (2024); https://doi.org/10.1016/j.econlet.2024.111971

  709. A. Kreacic, L. Uribe, J. Romeo, A. Lasater-Wille, R. Jesuthasan, S. Luong, “How Generative AI Is Transforming Business And Society: The Good, The Bad, And Everything in Between” (Oliver Wyman Forum, 2024); https://www.oliverwymanforum.com/global-consumer-sentiment/how-will-ai-affect-global-economics.html

  710. N. G. Otis, S. Delecourt, K. Cranney, R. Koning, “Global Evidence on Gender Gaps and Generative AI” (Harvard Business School, 2024); https://www.hbs.edu/faculty/Pages/item.aspx?num=66548

  711. [industry] S. Jaffe, N. P. Shah, J. Butler, A. Farach, A. Cambon, B. Hecht, M. Schwarz, J. Teevan, “Generative AI in Real-World Workplaces” (Microsoft, 2024); https://www.microsoft.com/en-us/research/publication/generative-ai-in-real-world-workplaces/

  712. [industry] E. Wiles, L. Krayer, M. Abbadi, U. Awasthi, R. Kennedy, P. Mishkin, D. Sack, F. Candelon, GenAI as an Exoskeleton: Experimental Evidence on Knowledge Workers Using GenAI on New Skills, Social Science Research Network (2024); https://doi.org/10.2139/ssrn.4944588

  713. A. Toner-Rodgers, Artificial Intelligence, Scientific Discovery, and Product Innovation (2024); https://aidantr.github.io/files/AI_innovation.pdf

  714. T. Besiroglu, N. Emery-Xu, N. Thompson, Economic Impacts of AI-Augmented R&D. Research Policy 53, 105037 (2024); https://doi.org/10.1016/j.respol.2024.105037

  715. S. McConnell, K. Fortson, D. Rotz, P. Schochet, P. Burkander, L. Rosenber, A. Mastri, R. D’Amico, “Providing Public Workforce Services to Job Seekers: 15-Month Impact Findings on the WIA Adult and Dislocated Worker Programs” (Mathematica Policy Reearch, 2016); https://mathematica.org/publications/providing-public-workforce-services-to-job-seekers-15-month-impact-findings-on-the-wia-adult

  716. J. Furman, “Policies for the Future of Work Should Be Based on Its Past and Present” (Economic Innovation Group, 2024); https://eig.org/wp-content/uploads/2024/07/TAWP-Furman.pdf

  717. A. Anthony, L. Sharma, E. Noor, “Advancing a More Global Agenda for Trustworthy Artificial Intelligence” (Carnegie Endowment for International Peace, 2024); https://carnegieendowment.org/research/2024/04/advancing-a-more-global-agenda-for-trustworthy-artificial-intelligence?lang=en

  718. S. Ghosh, A. Caliskan, “ChatGPT Perpetuates Gender Bias in Machine Translation and Ignores Non-Gendered Pronouns: Findings across Bengali and Five Other Low-Resource Languages” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 901–912; https://doi.org/10.1145/3600211.3604672

  719. C. Okorie, V. Marivate, How African NLP Experts Are Navigating the Challenges of Copyright, Innovation, and Access, Carnegie Endowment for International Peace (2024); https://carnegieendowment.org/research/2024/04/how-african-nlp-experts-are-navigating-the-challenges-of-copyright-innovation-and-access?lang=en

  720. N. Maslej, L. Fattorini, E. Brynjolfsson, J. Etchemendy, K. Ligett, T. Lyons, J. Manyika, H. Ngo, J. C. Niebles, V. Parli, Y. Shoham, R. Wald, J. Clark, R. Perrault, “Artificial Intelligence Index Report 2023” (AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, 2023); https://arxiv.org/pdf/2310.03715

  721. N. Ahmed, M. Wahed, N. C. Thompson, The Growing Influence of Industry in AI Research. Science (New York, N.Y.) 379, 884–886 (2023); https://doi.org/10.1126/science.ade2420

  722. S. Teleanu, J. Kurbalija, “Stronger Digital Voices from Africa: Building African Digital Foreign Policy and Diplomacy” (Diplo, 2022); https://www.diplomacy.edu/resource/report-stronger-digital-voices-from-africa/

  723. T. Alsop, Estimated Shipments of Nvidia H100 Graphics Processing Units (GPUs) Worldwide in 2023, by Customer, Statista (2024); https://www.statista.com/statistics/1446564/nvidia-h100-gpu-shipments-by-customer/

  724. [industry] Google Data Centers, Investing in Nebraska (2020); https://www.google.com/intl/es/about/datacenters/locations/papillion/

  725. Office of Governor Michael L. Parson, Governor Parson Announces Google’s Selection of Kansas City for New Data Center (2024); https://governor.mo.gov/press-releases/archive/governor-parson-announces-googles-selection-kansas-city-new-data-center

  726. [industry] Meta, “Meta’s Prineville Data Center” (Meta, 2024); https://datacenters.atmeta.com/wp-content/uploads/2024/10/Oregon-Prineville.pdf

  727. [industry] Microsoft, Microsoft and G42 Announce $1 Billion Comprehensive Digital Ecosystem Initiative for Kenya, Stories (2024); https://news.microsoft.com/2024/05/22/microsoft-and-g42-announce-1-billion-comprehensive-digital-ecosystem-initiative-for-kenya/

  728. R. Zwetsloot, B. Zhang, N. Dreksler, L. Kahn, M. Anderljung, A. Dafoe, M. C. Horowitz, “Skilled and Mobile: Survey Evidence of AI Researchers’ Immigration Preferences” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘21) (Association for Computing Machinery, New York, NY, USA, 2021), pp. 1050–1059; https://doi.org/10.1145/3461702.3462617

  729. Top Universities, QS World University Rankings for Data Science and Artificial Intelligence 2024 (2024); https://www.topuniversities.com/university-subject-rankings/data-science-artificial-intelligence

  730. N. Maslej, L. Fattorini, R. Perrault, V. Parli, A. Reuel, E. Brynjolfsson, J. Etchemendy, K. Ligett, T. Lyons, J. Manyika, J. C. Niebles, Y. Shoham, R. Wald, J. Clark, “The AI Index 2024 Annual Report” (Institute for Human-Centered AI, Stanford University, 2024); https://aiindex.stanford.edu/report/ 2

  731. N. Maslej, L. Fattorini, R. Perrault, V. Parli, A. Reuel, E. Brynjolfsson, J. Etchemendy, K. Ligett, T. Lyons, J. Manyika, J. C. Niebles, Y. Shoham, R. Wald, J. Clark, “The AI Index 2024 Annual Report” (Institute for Human-Centered AI, Stanford University, 2024); https://aiindex.stanford.edu/report/ 2 3

  732. M. L. Gray, S. Suri, Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass (Houghton Mifflin Harcourt, 2019); https://ghostwork.info/

  733. A. Arora, M. Barrett, E. Lee, E. Oborn, K. Prince, Risk and the Future of AI: Algorithmic Bias, Data Colonialism, and Marginalization. Information and Organization 33 (2023); https://doi.org/10.1016/j.infoandorg.2023.100478

  734. C. T. Okolo, “Addressing Global Inequity in AI Development” in Handbook of Critical Studies of Artificial Intelligence (Edward Elgar Publishing, 2023), pp. 378–389; https://www.elgaronline.com/edcollchap/book/9781803928562/book-part-9781803928562-40.xml

  735. M. Miceli, T. Yang, A. Alvarado Garcia, J. Posada, S. M. Wang, M. Pohl, A. Hanna, Documenting Data Production Processes: A Participatory Approach for Data Work. Proceedings of the ACM on Human-Computer Interaction 6, 1–34 (2022); https://doi.org/10.1145/3555623

  736. D. Wang, S. Prabhat, N. Sambasivan, “Whose AI Dream? In Search of the Aspiration in Data Annotation” in CHI Conference on Human Factors in Computing Systems (CHI ‘22) (ACM, New Orleans LA USA, 2022), pp. 1–16; https://doi.org/10.1145/3491102.3502121

  737. M. Steiger, T. J. Bharucha, S. Venkatagiri, M. J. Riedl, M. Lease, “The Psychological Well-Being of Content Moderators: The Emotional Labor of Commercial Moderation and Avenues for Improving Support” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, New York, NY, USA, 2021); https://doi.org/10.1145/3411764.3445092

  738. M. M. AlEmadi, W. Zaghouani, “Emotional Toll and Coping Strategies: Navigating the Effects of Annotating Hate Speech Data” in Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024 (2024), pp. 66–72; https://aclanthology.org/2024.legal-1.10.pdf

  739. S. Luccioni, Y. Jernite, E. Strubell, “Power Hungry Processing: Watts Driving the Cost of AI Deployment?” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3630106.3658542 2 3

  740. B. Thormundsson, “Change in Concentration of Talent Related to Artificial Intelligence (AI) Worldwide from 2016 to 2023, by Region” (Statista, 2024); https://www.statista.com/statistics/1472183/ai-talent-concentration-change-percentage-by-region/

  741. S. V. Bentley, C. K. Naughtin, M. J. McGrath, J. L. Irons, P. S. Cooper, The Digital Divide in Action: How Experiences of Digital Technology Shape Future Relationships with Artificial Intelligence. AI and Ethics 4, 901–915 (2024); https://doi.org/10.1007/s43681-024-00452-3

  742. Nigeria Federal Ministry of Communications, Innovation & Digital Economy, “Accelerating Our Collective Prosperity through Technical Efficiency: A Strategic Plan for the Federal Ministry of Communications, Innovation & Digital Economy” (2023); https://fmcide.gov.ng/wp-content/uploads/2023/11/blueprint.pdf

  743. US Government, Bring Your AI Skills to the U.S (2023); https://ai.gov/immigrate/

  744. UK Government, Supporting the next Generation of AI Leaders from around the World (2023); https://www.great.gov.uk/campaign-site/ai-futures/

  745. S. Pal, “Where Is Europe’s AI Workforce Coming from?: Immigration, Emigration & Transborder Movement of AI Talent” (interface, 2024); https://www.stiftung-nv.de/publications/where-is-europes-ai-workforce-coming-from

  746. M. Mazumder, C. Banbury, X. Yao, B. Karlaş, W. G. Rojas, S. Diamos, G. Diamos, L. He, A. Parrish, H. R. Kirk, J. Quaye, C. Rastogi, D. Kiela, D. Jurado, D. Kanter, R. Mosquera, J. Ciro, … V. J. Reddi, “DataPerf: Benchmarks for Data-Centric AI Development” in 37th International Conference on Neural Information Processing Systems (NeurIPS 2023) (Curran Associates Inc., Red Hook, NY, USA, 2024), pp. 5320–5347; https://doi.org/10.5555/3666122.3666357

  747. N. Guha, J. Nyarko, D. E. Ho, C. Ré, “Building GenAI Benchmarks: A Case Study in Legal Applications” in The Oxford Handbook on the Foundations and Regulation of Generative AI, P. Hacker, A. Engel, S. Hammer, B. Mittelstadt, Eds. (Oxford University Press, Oxford, England); https://neelguha.github.io/assets/pdf/building_genai_benchmarks_for_law_oxford_chapter.pdf

  748. E. Brynjolfsson, A. Ng, “Big AI Can Centralize Decision-Making and Power, and That’s a Problem” in Missing Links in AI Governance, B. Prud’homme, C. Régis, G. Farnadi, Eds. (UNESCO/MILA, 2023), pp. 65–87; https://www.unesco.org/en/articles/missing-links-ai-governance 2

  749. A. Korinek, J. Vipra, “Concentrating Intelligence: Scaling and Market Structure in Artificial Intelligence” (w33139, National Bureau of Economic Research, 2024); https://doi.org/10.3386/w33139

  750. Competition and Markets Authority, “AI Foundation Models: Initial Report” (CMA, 2023); https://www.gov.uk/government/publications/ai-foundation-models-initial-report 2

  751. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, … N. Fiedel, PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research: JMLR 24, 240:11324–240:11436 (2024). 

  752. X. Jin, D. Zhang, H. Zhu, W. Xiao, S.-W. Li, X. Wei, A. Arnold, X. Ren, “Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora” in Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models (Association for Computational Linguistics, Stroudsburg, PA, USA, 2022), pp. 1–16; https://doi.org/10.18653/v1/2022.bigscience-1.1

  753. K. Gupta, B. Thérien, A. Ibrahim, M. L. Richter, Q. G. Anthony, E. Belilovsky, I. Rish, T. Lesort, “Continual Pre-Training of Large Language Models: How to Re-Warm Your Model?” in Workshop on Efficient Systems for Foundation Models @ ICML2023 (2023); https://openreview.net/pdf?id=pg7PUJe0Tl

  754. F. Richter, Amazon Maintains Cloud Lead as Microsoft Edges Closer, Statista (2024); https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers

  755. D. Luitse, Platform Power in AI: The Evolution of Cloud Infrastructures in the Political Economy of Artificial Intelligence. Internet Policy Review 13, 1–44 (2024); https://doi.org/10.14763/2024.2.1768

  756. C. Rikap, Varieties of Corporate Innovation Systems and Their Interplay with Global and National Systems: Amazon, Facebook, Google and Microsoft’s Strategies to Produce and Appropriate Artificial Intelligence. Review of International Political Economy, 1–29 (2024); https://doi.org/10.1080/09692290.2024.2365757

  757. P. Maham, S. Küspert, “Governing General Purpose AI: A Comprehensive Map of Unreliability, Misuse and Systemic Risks” (Stiftung Neue Verantwortung, 2023); https://www.interface-eu.org/publications/governing-general-purpose-ai-comprehensive-map-unreliability-misuse-and-systemic-risks

  758. G. Yu, G. Tan, H. Huang, Z. Zhang, P. Chen, R. Natella, Z. Zheng, A Survey on Failure Analysis and Fault Injection in AI Systems, arXiv [cs.SE] (2024); http://arxiv.org/abs/2407.00125

  759. F. Jimmy, Emerging Threats: The Latest Cybersecurity Risks and the Role of Artificial Intelligence in Enhancing Cybersecurity Defenses. International Journal of Scientific Research and Management 9, 564–574 (2021); https://doi.org/10.18535/ijsrm/v9i2.ec01

  760. US Department of the Treasury, Managing Artificial Intelligence-Specific Cybersecurity Risks in the Financial Services Sector. (2024); https://home.treasury.gov/system/files/136/Managing-Artificial-Intelligence-Specific-Cybersecurity-Risks-In-The-Financial-Services-Sector.pdf

  761. S. Trivedi, V. Aggarwal, R. Rastogi, “Enhancing the Power of Cyber-Physical Systems Enabled with AI” in Artificial Intelligence Solutions for Cyber-Physical Systems (Auerbach Publications, Boca Raton, ed. 1, 2024), pp. 1–39; https://doi.org/10.1201/9781032694375-1

  762. I. D. Raji, S. Costanza-Chock, J. Buolamwini, “Change from the Outside: Towards Credible Third-Party Audits of AI Systems” in Missing Links in AI Governance, B. Prud’homme, C. Régis, G. Farnadi, Eds. (UNESCO/MILA, 2023), pp. 4–26; https://www.unesco.org/en/articles/missing-links-ai-governance

  763. M. Stein, M. Gandhi, T. Kriecherbauer, A. Oueslati, R. Trager, “Public vs Private Bodies: Who Should Run Advanced AI Evaluations and Audits? A Three-Step Logic Based on Case Studies of High-Risk Industries” (Oxford Martin AI Governance Initiative, 2024); https://www.oxfordmartin.ox.ac.uk/publications/public-vs-private-bodies-who-should-run-advanced-ai-evaluations-and-audits-a-three-step-logic-based-on-case-studies-of-high-risk-industries 2

  764. A. J. Grotto, J. Dempsey, “Vulnerability Disclosure and Management for AI/ML Systems: A Working Paper with Policy Recommendations” (Stanford Geopolitics, Technology, and Governance Cyber Policy Center, 2021); https://doi.org/10.2139/ssrn.3964084

  765. Y. Hong, J. Lian, L. Xu, J. Min, Y. Wang, L. J. Freeman, X. Deng, Statistical Perspectives on Reliability of Artificial Intelligence Systems. Quality Engineering 35, 56–78 (2023); https://doi.org/10.1080/08982112.2022.2089854

  766. T. Aguirre, On Labs and Fabs: Mapping How Alliances, Acquisitions, and Antitrust Are Shaping the Frontier AI Industry, arXiv [econ.GN] (2024); http://arxiv.org/abs/2406.01722

  767. B. Martens, “Why Artificial Intelligence Is Creating Fundamental Challenges for Competition Policy” (16/2024, Bruegel Policy Brief, 2024); https://hdl.handle.net/10419/302296 2

  768. US Environmental Protection Agency, “Greenhouse Gas Equivalencies Calculator - Calculations and References” (EPA, 2024); https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator-calculations-and-references

  769. [industry] Gemma Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar, … A. Andreev, Gemma 2: Improving Open Language Models at a Practical Size, arXiv [cs.CL] (2024); http://arxiv.org/abs/2408.00118

  770. D. Donnellan, A. Lawrence, D. Bizo, P. Judge, J. O’Brien, J. Davis, M. Smolaks, J. Williams-George, R. Weinschenk, “Uptime Institute Global Data Center Survey 2024” (Uptime Institute, 2024); https://uptimeinstitute.com/resources/research-and-reports/uptime-institute-global-data-center-survey-results-2024 2 3 4

  771. V. Rozite, E. Bertoli, B. Reidenbach, “Data Centres and Data Transmission Networks” (International Energy Agency, 2023); https://www.iea.org/energy-system/buildings/data-centres-and-data-transmission-networks 2 3 4 5

  772. L. Burdette, P. Brodsky, P. Christian, J. Hjembo, A. Mauldin, T. Stronge, M. Tan, J. Velandia, “The State of the Network 2023 Edition” (TeleGeography, 2023); https://www2.telegeography.com/hubfs/LP-Assets/Ebooks/state-of-the-network-2023.pdf

  773. R. Schwartz, J. Dodge, N. A. Smith, O. Etzioni, Green AI. Communications of the ACM 63, 54–63 (2020); https://doi.org/10.1145/3381831 2

  774. L. H. Kaack, P. L. Donti, E. Strubell, G. Kamiya, F. Creutzig, D. Rolnick, Aligning Artificial Intelligence with Climate Change Mitigation. Nature Climate Change 12, 518–527 (2022); https://doi.org/10.1038/s41558-022-01377-7 2 3

  775. E. Zelikman, Y. Wu, J. Mu, N. Goodman, “STaR: Bootstrapping Reasoning With Reasoning” in Advances in Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, US, 2022) vol. 35, pp. 15476–15488; https://proceedings.neurips.cc/paper_files/paper/2022/file/639a9a172c044fbb64175b5fad42e9a5-Paper-Conference.pdf

  776. [industry] T. Wu, J. Lan, W. Yuan, J. Jiao, J. Weston, S. Sukhbaatar, Thinking LLMs: General Instruction Following with Thought Generation, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.10630

  777. L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, H. Wang, “On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey” in Findings of the Association for Computational Linguistics ACL 2024 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 11065–11082; https://doi.org/10.18653/v1/2024.findings-acl.658

  778. N. Alder, K. Ebert, R. Herbrich, P. Hacker, AI, Climate, and Transparency: Operationalizing and Improving the AI Act, arXiv [cs.CY] (2024); http://arxiv.org/abs/2409.07471 2 3

  779. [industry] A. S. Luccioni, A. Hernandez-Garcia, Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning, arXiv [cs.LG] (2023); http://arxiv.org/abs/2302.08476 2

  780. [industry] Google, “Environmental Report 2024” (2024); https://www.gstatic.com/gumdrop/sustainability/google-2024-environmental-report.pdf 2 3 4

  781. Baidu, “Baidu 2023 Environmental, Social and Governance Report” (2023); https://esg.baidu.com/Uploads/File/2024/05/17/Baidu%202023%20Environmental,%20Social%20and%20Governance%20Report.20240517150706.pdf 2

  782. EPRI, “Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption” (2024); https://www.epri.com/research/products/000000003002028905 2 3

  783. G. Guidi, F. Dominici, J. Gilmour, K. Butler, E. Bell, S. Delaney, F. J. Bargagli-Stoffi, Environmental Burden of United States Data Centers in the Artificial Intelligence Era, arXiv [cs.CY] (2024); http://arxiv.org/abs/2411.09786

  784. International Energy Agency, “World Energy Outlook 2024” (IEA, 2024); https://www.iea.org/reports/world-energy-outlook-2024 2

  785. Ireland Central Statistics Office, “Data Centres Metered Electricity Consumption 2023” (CSO, 2024); https://www.cso.ie/en/releasesandpublications/ep/p-dcmec/datacentresmeteredelectricityconsumption2023/

  786. PGIM Real Estate, “Global Data Centers Americas Excerpt” (2021); https://cdn.pficdn.com/cms/pgim-real-estate/sites/default/files/2021-01/Global%20Data%20Centers-U.S._February%202021_PGIM.pdf

  787. US Department of Energy Office of Policy, “Clean Energy Resources to Meet Data Center Electricity Demand” (DOE, 2024); https://www.energy.gov/policy/articles/clean-energy-resources-meet-data-center-electricity-demand 2

  788. [industry] Constellation, Constellation to Launch Crane Clean Energy Center, Restoring Jobs and Carbon-Free Power to The Grid (2024); https://www.constellationenergy.com/newsroom/2024/Constellation-to-Launch-Crane-Clean-Energy-Center-Restoring-Jobs-and-Carbon-Free-Power-to-The-Grid.html

  789. Talen Energy Corporation, “Unlocking Value” (2024); https://ir.talenenergy.com/static-files/f02c44a9-d2dc-45c1-9331-eee1495f7d2d

  790. US Federal Energy Regulatory Commission, Order Rejecting Amendments to Interconnection Service Agreement. FERC (2024); https://elibrary.ferc.gov/eLibrary/filelist?accession_number=20241101-3061&optimized=false

  791. [industry] M. Terrell, New Nuclear Clean Energy Agreement with Kairos Power, Google (2024); https://blog.google/outreach-initiatives/sustainability/google-kairos-power-nuclear-energy-agreement/

  792. L. M. Krall, A. M. Macfarlane, R. C. Ewing, Nuclear Waste from Small Modular Reactors. Proceedings of the National Academy of Sciences of the United States of America 119, e2111833119 (2022); https://doi.org/10.1073/pnas.2111833119

  793. J. Dodge, T. Prewitt, R. Tachet des Combes, E. Odmark, R. Schwartz, E. Strubell, A. S. Luccioni, N. A. Smith, N. DeCario, W. Buchanan, “Measuring the Carbon Intensity of AI in Cloud Instances” in 2022 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2022); https://doi.org/10.1145/3531146.3533234

  794. P. Hacker, Sustainable AI Regulation, arXiv [cs.CY] (2023); http://arxiv.org/abs/2306.00292

  795. [industry] Meta, “2024 Sustainability Report” (2024); https://sustainability.atmeta.com/wp-content/uploads/2024/08/Meta-2024-Sustainability-Report.pdf

  796. [industry] Amazon, “Amazon Sustainability Report” (2024); https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf

  797. A. N. Achanta, P. Erickson, E. Haites, M. Lazarus, N. Pandey, N. Pahuja, S. Seres, R. Spalding-Fecher, R. Tewari, “Assessing the Impact of the Clean Development Mechanism” (The High-Level Panel on the CDM Policy Dialogue, 2012); https://www.cdmpolicydialogue.org/research/1030_impact.pdf

  798. J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, “DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (ACM, New York, NY, USA, 2020); https://doi.org/10.1145/3394486.3406703

  799. W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, I. Stoica, “Efficient Memory Management for Large Language Model Serving with PagedAttention” in Proceedings of the 29th Symposium on Operating Systems Principles (ACM, New York, NY, USA, 2023), pp. 611–626; https://doi.org/10.1145/3600006.3613165

  800. H. D. Saunders, The Khazzoom-Brookes Postulate and Neoclassical Growth. The Energy Journal 13, 131–148 (1992); http://www.jstor.org/stable/41322471

  801. G. Kamiya, V. C. Coroamă, “Data Centre Energy Use – A Critical Review” (IEA 4E TCP Electronic Devices and Networks Annex (EDNA)). 

  802. International Energy Agency, “Tracking Clean Energy Progress 2023” (IEA, 2023); https://www.iea.org/reports/tracking-clean-energy-progress-2023

  803. E. Halper, Amid Explosive Demand, America Is Running out of Power, Washington Post (2024); https://www.washingtonpost.com/business/2024/03/07/ai-data-centers-power/

  804. European Commission, Joint Research Centre, G. Kamiya, P. Bertoldi, Energy Consumption in Data Centres and Broadband Communication Networks in the EU (Publications Office of the European Union, 2024); https://doi.org/10.2760/706491

  805. J. Koomey, E. Masanet, Does Not Compute: Avoiding Pitfalls Assessing the Internet’s Energy and Carbon Impacts. Joule 5, 1625–1628 (2021); https://doi.org/10.1016/j.joule.2021.05.007

  806. E. Masanet, A. Shehabi, N. Lei, S. Smith, J. Koomey, Recalibrating Global Data Center Energy-Use Estimates. Science (New York, N.Y.) 367, 984–986 (2020); https://doi.org/10.1126/science.aba3758

  807. D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman-Brown, A. S. Luccioni, T. Maharaj, E. D. Sherwin, S. K. Mukkavilli, K. P. Kording, C. P. Gomes, A. Y. Ng, … Y. Bengio, Tackling Climate Change with Machine Learning. ACM Computing Surveys 55, 1–96 (2023); https://doi.org/10.1145/3485128 2

  808. U. Gupta, Y. G. Kim, S. Lee, J. Tse, H.-H. S. Lee, G.-Y. Wei, D. Brooks, C.-J. Wu, Chasing Carbon: The Elusive Environmental Footprint of Computing. IEEE Micro 42, 37–47 (2022); https://doi.org/10.1109/mm.2022.3163226

  809. [industry] Intel, “2023-24 Corporate Responsibility Report” (2024); https://csrreportbuilder.intel.com/pdfbuilder/pdfs/CSR-2023-24-Full-Report.pdf 2 3 4 5

  810. European Environment Agency, “Water Use in Europe — Quantity and Quality Face Big Challenges” (EEA, 2018); https://www.eea.europa.eu/signals-archived/signals-2018-content-list/articles/water-use-in-europe-2014

  811. Taiwan Semiconductor Manufacturing Company, “TSMC 2023 Sustainability Report” (TSMC, 2024); https://esg.tsmc.com/en-US/file/public/e-all_2023.pdf 2

  812. P. Li, J. Yang, M. A. Islam, S. Ren, Making AI Less “Thirsty”: Uncovering and Addressing the Secret Water Footprint of AI Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2304.03271

  813. United Nations, The Human Right to Water and Sanitation: Resolution A/RES/64/292 Adopted by the General Assembly on 28 July 2010 (2010); https://documents.un.org/doc/undoc/gen/n09/479/35/pdf/n0947935.pdf

  814. The European Parliament and the Council of the European Union, Directive (EU) 2023/1791 of the European Parliament and of the Council on Energy Efficiency and Amending Regulation (EU) 2023/955 (recast) (Text with EEA Relevance). (2023); https://eur-lex.europa.eu/eli/dir/2023/1791/oj

  815. Y. Jin, P. Behrens, A. Tukker, L. Scherer, Water Use of Electricity Technologies: A Global Meta-Analysis. Renewable and Sustainable Energy Reviews 115, 109391 (2019); https://doi.org/10.1016/j.rser.2019.109391

  816. H. Zhai, E. S. Rubin, E. J. Grol, A. C. O’Connell, Z. Wu, E. G. Lewis, Dry Cooling Retrofits at Existing Fossil Fuel-Fired Power Plants in a Water-Stressed Region: Tradeoffs in Water Savings, Cost, and Capacity Shortfalls. Applied Energy 306, 117997 (2022); https://doi.org/10.1016/j.apenergy.2021.117997

  817. V. G. Gude, Energy Consumption and Recovery in Reverse Osmosis. Desalination and Water Treatment 36, 239–260 (2011); https://doi.org/10.5004/dwt.2011.2534

  818. Australian Department of the Environment and Energy, “HVAC Factsheet: Co and Tri-Generation” (2013); https://www.energy.gov.au/sites/default/files/hvac-factsheet-co-tri-generation.pdf

  819. Office of Fossil Energy, “Hydrogen Strategy: Enabling A Low-Carbon Economy” (US Department of Energy, 2020); https://www.energy.gov/sites/prod/files/2020/07/f76/USDOE_FE_Hydrogen_Strategy_July2020.pdf

  820. H. Nissenbaum, Privacy in Context: Technology, Policy, and the Integrity of Social Life (Stanford University Press, Palo Alto, CA, 2009); http://www.sup.org/books/title/?id=8862

  821. L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, N. Papernot, “Machine Unlearning” in 2021 IEEE Symposium on Security and Privacy (SP) (IEEE, Virtual, 2021), pp. 141–159; https://doi.org/10.1109/SP40001.2021.00019 2 3

  822. Organisation for Economic Co-Operation and Development, “AI, Data Governance and Privacy” (OECD, 2024); https://doi.org/10.1787/2476b1a4-en 2 3

  823. European Data Protection Board, “Report of the Work Undertaken by the ChatGPT Taskforce” (EDPB, 2024); https://www.edpb.europa.eu/our-work-tools/our-documents/other/report-work-undertaken-chatgpt-taskforce_en

  824. D. J. Solove, Artificial Intelligence and Privacy. Florida Law Review (forthcoming Jan 2025); https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4713111

  825. UK Parliament, Data Protection Act 2018, Section 46: Right to Rectification. (2018); https://www.legislation.gov.uk/ukpga/2018/12/section/46

  826. GPA’s International Enforcement Cooperation Working Group, “Joint Statement on Data Scraping and the Protection of Privacy” (Information Commissioner’s Office, 2023); https://ico.org.uk/media/about-the-ico/documents/4026232/joint-statement-data-scraping-202308.pdf 2

  827. N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, C. Zhang, “Quantifying Memorization Across Neural Language Models” in 11th International Conference on Learning Representations (ICLR 2023) (Kigali, Rwanda, 2022); https://openreview.net/forum?id=TatRHT_1cK 2 3

  828. Y. Chen, E. Mendes, S. Das, W. Xu, A. Ritter, Can Language Models Be Instructed to Protect Personal Information?, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.02224 2

  829. R. Shokri, M. Stronati, C. Song, V. Shmatikov, “Membership Inference Attacks Against Machine Learning Models” in 2017 IEEE Symposium on Security and Privacy (SP) (IEEE, San Jose, CA, USA, 5/2017), pp. 3–18; https://doi.org/10.1109/SP.2017.41

  830. M. Fredrikson, S. Jha, T. Ristenpart, “Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS ‘15) (Association for Computing Machinery, New York, NY, USA, 2015), pp. 1322–1333; https://doi.org/10.1145/2810103.2813677

  831. M. Duan, A. Suri, N. Mireshghallah, S. Min, W. Shi, L. Zettlemoyer, Y. Tsvetkov, Y. Choi, D. Evans, H. Hajishirzi, Do Membership Inference Attacks Work on Large Language Models?, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.07841 2 3 4 5

  832. N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, C. Raffel, “Extracting Training Data from Large Language Models” in 30th USENIX Security Symposium (USENIX Security 21) (USENIX Association, 2021), pp. 2633–2650; https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting 2 3

  833. N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, E. Wallace, “Extracting Training Data from Diffusion Models” in 32nd USENIX Security Symposium (USENIX Security 23) (USENIX Association, Anaheim, CA, 2023), pp. 5253–5270; https://www.usenix.org/conference/usenixsecurity23/presentation/carlini

  834. W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, L. Zettlemoyer, “Detecting Pretraining Data from Large Language Models” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=zWqr3MQuNs

  835. N. Lukas, A. Salem, R. Sim, S. Tople, L. Wutschitz, S. Zanella-Béguelin, “Analyzing Leakage of Personally Identifiable Information in Language Models” in 2023 IEEE Symposium on Security and Privacy (SP) (IEEE, 2023), pp. 346–363; https://doi.org/10.1109/SP46215.2023.10179300

  836. S. Longpre, R. Mahari, A. N. Lee, C. S. Lund, H. Oderinwale, W. Brannon, N. Saxena, N. Obeng-Marnu, T. South, C. J. Hunter, K. Klyman, C. Klamm, H. Schoelkopf, N. Singh, M. Cherep, A. M. Anis, A. Dinh, … A. Pentland, “Consent in Crisis: The Rapid Decline of the AI Data Commons” in 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024); https://openreview.net/pdf?id=66PcEzkf95 2 3 4 5

  837. [industry] K. Saab, T. Tu, W.-H. Weng, R. Tanno, D. Stutz, E. Wulczyn, F. Zhang, T. Strother, C. Park, E. Vedadi, J. Z. Chaves, S.-Y. Hu, M. Schaekermann, A. Kamath, Y. Cheng, D. G. T. Barrett, C. Cheung, … V. Natarajan, “Capabilities of Gemini Models in Medicine” (Google Deepmind, 2024); http://arxiv.org/abs/2404.18416

  838. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, D. Kiela, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” in 34th Conference on Neural Information Processing Systems (NeurIPS 2020) (Curran Associates, Inc., Vancouver, Canada, 2020) vol. 33, pp. 9459–9474; https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html 2

  839. V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-T. Yih, “Dense Passage Retrieval for Open-Domain Question Answering” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2020), pp. 6769–6781; https://doi.org/10.18653/v1/2020.emnlp-main.550

  840. O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, Y. Shoham, In-Context Retrieval-Augmented Language Models. Transactions of the Association for Computational Linguistics 11, 1316–1331 (2023); https://doi.org/10.1162/tacl_a_00605

  841. [industry] T. Gunter, Z. Wang, C. Wang, R. Pang, A. Narayanan, A. Zhang, B. Zhang, C. Chen, C.-C. Chiu, D. Qiu, D. Gopinath, D. A. Yap, D. Yin, F. Nan, F. Weers, G. Yin, H. Huang, … Z. Ren, Apple Intelligence Foundation Language Models, arXiv [cs.AI] (2024); http://arxiv.org/abs/2407.21075 2 3 4

  842. S. Arora, P. Lewis, A. Fan, J. Kahn, C. Ré, Reasoning over Public and Private Data in Retrieval-Based Systems. Transactions of the Association for Computational Linguistics 11, 902–921 (2023); https://doi.org/10.1162/tacl_a_00580

  843. G. Zyskind, T. South, A. Pentland, “Don’t Forget Private Retrieval: Distributed Private Similarity Search for Large Language Models” in Proceedings of the Fifth Workshop on Privacy in Natural Language Processing (2024), pp. 7–19; https://aclanthology.org/2024.privatenlp-1.2.pdf

  844. UK National Cyber Security Centre, US Cybersecurity and Infrastructure Security Agency, National Security Agency, Federal Bureau of Investigation, Australian Signals Directorate’s Australian Cyber Security Centre, Canadian Centre for Cyber Security, New Zealand National Cyber Security Centre, Chile’s Government CSIRT, National Cyber and Information Security Agency of the Czech Republic, Information System Authority of Estonia, National Cyber Security Centre of Estonia, French Cybersecurity Agency, Germany’s Federal Office for Information Security, Israeli National Cyber Directorate, Italian National Cybersecurity Agency, Japan’s National center of Incident readiness and Strategy For Cybersecurity, Japan’s Secretariat of Science, Technology and Innovation Policy, Cabinet Office, … Cyber Security Agency of Singapore, “Guidelines for Secure AI System Development” (UK Government, 2023); https://www.ncsc.gov.uk/files/Guidelines-for-secure-AI-system-development.pdf 2 3

  845. M. Kosinski, D. Stillwell, T. Graepel, Private Traits and Attributes Are Predictable from Digital Records of Human Behavior. Proceedings of the National Academy of Sciences of the United States of America 110, 5802–5805 (2013); https://doi.org/10.1073/pnas.1218772110

  846. R. Staab, M. Vero, M. Balunovic, M. Vechev, “Beyond Memorization: Violating Privacy via Inference with Large Language Models” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=kmn0BhQk7p 2

  847. N. Mireshghallah, M. Antoniak, Y. More, Y. Choi, G. Farnadi, “Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild” in First Conference on Language Modeling (2024); https://openreview.net/pdf?id=tIpWtMYkzU 2 3 4

  848. [industry] J. Lamb, G. Israelstam, R. Agarwal, S. Bhasker, “Generative AI in Healthcare: Adoption Trends and What’s next” (McKinsey & Company, 2024); https://www.mckinsey.com/industries/healthcare/our-insights/generative-ai-in-healthcare-adoption-trends-and-whats-next

  849. Federal Trade Commission, FTC Staff Report Finds Large Social Media and Video Streaming Companies Have Engaged in Vast Surveillance of Users with Lax Privacy Controls and Inadequate Safeguards for Kids and Teens (2024); https://www.ftc.gov/news-events/news/press-releases/2024/09/ftc-staff-report-finds-large-social-media-video-streaming-companies-have-engaged-vast-surveillance

  850. Federal Trade Commission, FTC Says Ring Employees Illegally Surveilled Customers, Failed to Stop Hackers from Taking Control of Users’ Cameras (2023); https://www.ftc.gov/news-events/news/press-releases/2023/05/ftc-says-ring-employees-illegally-surveilled-customers-failed-stop-hackers-taking-control-users

  851. [industry] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, T. Salimans, Imagen Video: High Definition Video Generation with Diffusion Models, arXiv [cs.CV] (2022); http://arxiv.org/abs/2210.02303

  852. [industry] Reka Team, A. Ormazabal, C. Zheng, C. de M. d’Autume, D. Yogatama, D. Fu, D. Ong, E. Chen, E. Lamprecht, H. Pham, I. Ong, K. Aleksiev, L. Li, M. Henderson, M. Bain, M. Artetxe, N. Relan, … Z. Xie, Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2404.12387

  853. S. Adler, Z. Hitzig, S. Jain, C. Brewer, W. Chang, R. DiResta, E. Lazzarin, S. McGregor, W. Seltzer, D. Siddarth, N. Soliman, T. South, C. Spelliscy, M. Sporny, V. Srivastava, J. Bailey, B. Christian, … T. Zick, Personhood Credentials: Artificial Intelligence and the Value of Privacy-Preserving Tools to Distinguish Who Is Real Online, arXiv [cs.CY] (2024); http://arxiv.org/abs/2408.07892 2 3 4

  854. B. Auxier, L. Rainie, M. Anderson, A. Perrin, M. Kumar, E. Turner, “Americans and Privacy: Concerned, Confused and Feeling Lack of Control Over Their Personal Information” (Pew Research Center, 2019); https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/

  855. [industry] IBM, “Cost of a Data Breach 2024” (2024); https://www.ibm.com/reports/data-breach

  856. S. Min, S. Gururangan, E. Wallace, W. Shi, H. Hajishirzi, N. A. Smith, L. Zettlemoyer, “SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore” in NeurIPS 2023 Workshop on Distribution Shifts (DistShift) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=z03bW0doni 2 3 4 5

  857. US Copyright Office, “Copyright and Artificial Intelligence” (2024); https://www.copyright.gov/ai/

  858. P. Burger, The Berne Convention: Its History and Its Key Role in the Future. Journal of Law and Technology 3, 1–70 (1988); https://heinonline.org/HOL/P?h=hein.journals/jlawtecy3&i=9

  859. L. R. Patterson, C. Joyce, Copyright in 1791: An Essay Concerning the Founers’ View of the Copyright Power Granted to Congress in Article I, Section 8, Clause 8 of the US Constitution. Emory Law Journal (2003); https://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/emlj52&section=25

  860. The Office of the Law Revision Counsel of the United States House of Representatives, “Limitations on Exclusive Rights: Fair Use. Sec. 107” in United States Code, 2006 Edition, Supplement 4, Title 17 - Copyrights (US Government Publishing Office, ed. 2010, 2010); https://www.govinfo.gov/app/details/USCODE-2010-title17/USCODE-2010-title17-chap1-sec107

  861. European Parliament, Directorate-General for Internal Policies of the Union, E. Rosati, The Exception for Text and Data Mining (TDM) in the Proposed Directive on Copyright in the Digital Single Market – Technical Aspects (European Parliament, 2018). 

  862. Japanese Law Translation Database System, “著作権法(一部未施行)Copyright Act (Partially Unenforced)” (Ministry of Justice, Japan, 2024); https://www.japaneselawtranslation.go.jp/en/laws/view/4207

  863. Israeli Ministry of Justice, “Opinion: Uses of Copyrighted Materials for Machine Learning” (Israeli Government, 2022); https://www.gov.il/BlobFolder/legalinfo/machine-learning/he/18-12-2022.pdf

  864. Intellectual Property Office of Singapore, “Copyright: Factsheet on Copyright Act 2021” (IPOS, 2022); https://www.ipos.gov.sg/docs/default-source/resources-library/copyright/copyright-act-factsheet.pdf

  865. P. Henderson, X. Li, D. Jurafsky, T. Hashimoto, M. A. Lemley, P. Liang, Foundation Models and Fair Use, arXiv [cs.CY] (2023); http://arxiv.org/abs/2303.15715

  866. B. L. W. Sobel, Artificial Intelligence’s Fair Use Crisis. The Columbia Journal of Law & the Arts 41, 45–97 (2018); https://doi.org/10.7916/jla.v41i1.2036

  867. M. A. Lemley, B. Casey, Fair Learning. Texas Law Review 99, 743–786 (2020-2021); https://heinonline.org/HOL/P?h=hein.journals/tlr99&i=777

  868. P. Samuelson, Generative AI Meets Copyright. Science 381, 158–161 (2023); https://doi.org/10.1126/science.adi0656

  869. Tremblay v. OpenAI, Inc. (3:23-cv-03223) Document 1 (2023); https://storage.courtlistener.com/recap/gov.uscourts.cand.414822/gov.uscourts.cand.414822.1.0_1.pdf

  870. D. Zhang, B. Xia, Y. Liu, X. Xu, T. Hoang, Z. Xing, M. Staples, Q. Lu, L. Zhu, “Privacy and Copyright Protection in Generative AI: A Lifecycle Perspective” in 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN) (Lisbon, Portugal, 2024); http://arxiv.org/abs/2311.18252

  871. R. Mahari, S. Longpre, “Discit Ergo Est: Training Data Provenance And Fair Use” in Dynamics of Generative AI, T. Schrepel, V. Stocker, Eds. (Network Law Review, 2023); https://www.networklawreview.org/mahari-longpre-generative-ai/

  872. K. Lee, A. F. Cooper, J. Grimmelmann, “Talkin’ ‘Bout AI Generation: Copyright and the Generative-AI Supply Chain (The Short Version)” in Proceedings of the Symposium on Computer Science and Law (CSLAW ‘24) (Association for Computing Machinery, New York, NY, USA, 2024), pp. 48–63; https://doi.org/10.1145/3614407.3643696

  873. J. Grimmelmann, Copyright for Literate Robots. Iowa Law Review 101, 657–682 (2015-2016); https://heinonline.org/HOL/P?h=hein.journals/ilr101&i=681

  874. K. Lee, A. F. Cooper, J. Grimmelmann, D. Ippolito, AI and Law: The Next Generation (2023); https://doi.org/10.2139/ssrn.4580739

  875. L. Tiedrich, When AI Generates Work, Standard Contractual Terms Can Help Generate Value and Clarity, OECD.AI Policy Observatory (2024); https://oecd.ai/en/wonk/contractual-terms

  876. M. Sag, Copyright Safety for Generative AI. Houston Law Review / University of Houston 61, 295–347 (2023); https://houstonlawreview.org/article/92126-copyright-safety-for-generative-ai

  877. N. Vyas, S. M. Kakade, B. Barak, “On Provable Copyright Protection for Generative Models” in Proceedings of the 40th International Conference on Machine Learning (ICML 2023) (PMLR, Kigali, Rwanda, 2023); https://proceedings.mlr.press/v202/vyas23b.html 2

  878. L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, L. Lucy, X. Lyu, N. Lambert, I. Magnusson, … K. Lo, Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.00159 2 3 4 5 6

  879. E. M. Bender, B. Friedman, Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics 6, 587–604 (2018); https://doi.org/10.1162/tacl_a_00041

  880. R. Bommasani, K. Klyman, S. Longpre, S. Kapoor, N. Maslej, B. Xiong, D. Zhang, P. Liang, “The Foundation Model Transparency Index” (Center for Research on Foundation Models (CRFM) and Institute on Human-Centered Artificial Intelligence (HAI), 2023); http://arxiv.org/abs/2310.12941 2

  881. R. Mahari, L. Shayne, L. Donewald, A. Polozov, A. Pentland, A. Lipsitz, Comment to US Copyright Office on Data Provenance and Copyright. US Copyright Office (2023); https://dspace.mit.edu/handle/1721.1/154171?show=full?show=full

  882. B. Magagna, D. Goldfarb, P. Martin, M. Atkinson, S. Koulouzis, Z. Zhao, “Data Provenance” in Towards Interoperable Research Infrastructures for Environmental and Earth Sciences: A Reference Model Guided Approach for Common Challenges, Z. Zhao, M. Hellström, Eds. (Springer International Publishing, Cham, 2020), pp. 208–225; https://doi.org/10.1007/978-3-030-52829-4_12

  883. S. Longpre, R. Mahari, N. Obeng-Marnu, W. Brannon, T. South, J. Kabbara, S. Pentland, Data Authenticity, Consent, and Provenance for AI Are All Broken: What Will It Take to Fix Them? An MIT Exploration of Generative AI (2024); https://doi.org/10.21428/e4baedd9.a650f77d 2

  884. K. I. Gero, M. Desai, C. Schnitzler, N. Eom, J. Cushman, E. L. Glassman, Creative Writers’ Attitudes on Writing as Training Data for Large Language Models, arXiv [cs.HC] (2024); http://arxiv.org/abs/2409.14281

  885. R. Fletcher, “How Many News Websites Block AI Crawlers?” (Reuters Institute for the Study of Journalism, 2024); https://doi.org/10.60625/RISJ-XM9G-WS87

  886. European Commission, AI Act: Participate in the Drawing-up of the First General-Purpose AI Code of Practice, Shaping Europe’s digital future (2024); https://digital-strategy.ec.europa.eu/en/news/ai-act-participate-drawing-first-general-purpose-ai-code-practice

  887. National Institute of Standards and Technology (NIST), AI Risk Management Framework (2021); https://www.nist.gov/itl/ai-risk-management-framework 2

  888. J. Lee, T. Le, J. Chen, D. Lee, “Do Language Models Plagiarize?” in Proceedings of the ACM Web Conference 2023 (ACM, New York, NY, USA, 2023); https://doi.org/10.1145/3543507.3583199 2

  889. A. F. Cooper, J. Grimmelmann, The Files Are in the Computer: On Copyright, Memorization, and Generative AI. Chicago-Kent Law Review (2024); https://blog.genlaw.org/pdfs/genlaw_icml2024/5.pdf 2

  890. C. Zhang, D. Ippolito, K. Lee, M. Jagielski, F. Tramèr, N. Carlini, “Counterfactual Memorization in Neural Language Models” in 37th International Conference on Neural Information Processing Systems (NeurIPS 2023) (Curran Associates Inc., Red Hook, NY, USA, 2023); https://dl.acm.org/doi/10.5555/3666122.3667830 2

  891. L. He, Y. Huang, W. Shi, T. Xie, H. Liu, Y. Wang, L. Zettlemoyer, C. Zhang, D. Chen, P. Henderson, Fantastic Copyrighted Beasts and How (not) to Generate Them, arXiv [cs.CV] (2024); http://arxiv.org/abs/2406.14526 2

  892. S. Liu, Y. Yao, J. Jia, S. Casper, N. Baracaldo, P. Hase, X. Xu, Y. Yao, H. Li, K. R. Varshney, M. Bansal, S. Koyejo, Y. Liu, Rethinking Machine Unlearning for Large Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.08787 2 3 4

  893. [industry] R. Eldan, M. Russinovich, Who’s Harry Potter? Approximate Unlearning in LLMs, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.02238 2

  894. T. Chen, A. Asai, N. Mireshghallah, S. Min, J. Grimmelmann, Y. Choi, H. Hajishirzi, L. Zettlemoyer, P. W. Koh, CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation, arXiv [cs.CL] (2024); http://arxiv.org/abs/2407.07087

  895. T. T. Nguyen, T. T. Huynh, P. Le Nguyen, A. W.-C. Liew, H. Yin, Q. V. H. Nguyen, A Survey of Machine Unlearning, arXiv [cs.LG] (2022); http://arxiv.org/abs/2209.02299 2

  896. T. Baumhauer, P. Schöttle, M. Zeppelzauer, Machine Unlearning: Linear Filtration for Logit-Based Classifiers. Machine Learning 111, 3203–3226 (2022); https://doi.org/10.1007/s10994-022-06178-9

  897. Z. Liu, H. Ye, C. Chen, Y. Zheng, K.-Y. Lam, Threats, Attacks, and Defenses in Machine Unlearning: A Survey, arXiv [cs.CR] (2024); http://arxiv.org/abs/2403.13682 2 3

  898. J. Xu, Z. Wu, C. Wang, X. Jia, Machine Unlearning: Solutions and Challenges. IEEE Transactions on Emerging Topics in Computational Intelligence 8, 2150–2168 (2024); https://doi.org/10.1109/tetci.2024.3379240 2 3

  899. S. Nevo, D. Lahav, A. Karpur, Y. Bar-On, H. A. Bradley, J. Alstott, Securing AI Model Weights: Preventing Theft and Misuse of Frontier Models (RAND Corporation, Santa Monica, CA, 2024); https://doi.org/10.7249/RRA2849-1 2

  900. R. Bommasani, S. Kapoor, K. Klyman, S. Longpre, A. Ramaswami, D. Zhang, M. Schaake, D. E. Ho, A. Narayanan, P. Liang, Considerations for Governing Open Foundation Models. Science (New York, N.Y.) 386, 151–153 (2024); https://doi.org/10.1126/science.adp1848

  901. US National Telecommunications and Information Administration, “Dual-Use Foundation Models with Widely Available Model Weights NTIA Report” (US Department of Commerce, 2024); https://www.ntia.gov/issues/artificial-intelligence/open-model-weights-report 2 3

  902. E. Seger, N. Dreksler, R. Moulange, E. Dardaman, J. Schuett, K. Wei, C. Winter, M. Arnold, S. Ó. hÉigeartaigh, A. Korinek, M. Anderljung, B. Bucknall, A. Chan, E. Stafford, L. Koessler, A. Ovadya, B. Garfinkel, … A. Gupta, “Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives” ( Centre for the Governance of AI, 2023); http://arxiv.org/abs/2311.09227 2 3 4

  903. P. Gade, S. Lermen, C. Rogers-Smith, J. Ladish, BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.00117

  904. [industry] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Zico Kolter, M. Fredrikson, Universal and Transferable Adversarial Attacks on Aligned Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2307.15043 2 3 4 5 6 7 8

  905. I. Yum, Language Agents and Malevolent Design. Philosophy & Technology 37, 1–19 (2024); https://doi.org/10.1007/s13347-024-00794-0

  906. S. Lermen, C. Rogers-Smith, J. Ladish, LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B, arXiv [cs.LG] (2023); http://arxiv.org/abs/2310.20624 2

  907. A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, N. Nanda, Refusal in Language Models Is Mediated by a Single Direction, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.11717 2 3

  908. J. Cable, A. Black, “With Open Source Artificial Intelligence, Don’t Forget the Lessons of Open Source Software” (Cybersecurity and Infrastructure Security Agency CISA, 2024); https://www.cisa.gov/news-events/news/open-source-artificial-intelligence-dont-forget-lessons-open-source-software

  909. by J. Bateman, D. Baer, S. A. Bell, G. O. Brown, M.-F. (tino) Cuéllar, D. Ganguli, P. Henderson, B. Kotila, L. Lessig, N. B. Lundblad, J. Napolitano, D. Raji, E. Seger, M. Sheehan, A. Skowron, I. Solaiman, H. Toner, A. P. Zvyagina, “Beyond Open vs. Closed: Emerging Consensus and Key Questions for Foundation AI Model Governance” (Carnegie Endowment for International Peace, 2024); https://carnegieendowment.org/research/2024/07/beyond-open-vs-closed-emerging-consensus-and-key-questions-for-foundation-ai-model-governance?lang=en

  910. E. Seger, B. O’Dell, “Open Horizons: Exploring Nuanced Technical and Policy Approaches to Openness in AI” (Demos, 2024); https://demos.co.uk/research/open-horizons-exploring-nuanced-technical-and-policy-approaches-to-openness-in-ai/ 2 3 4

  911. S. Kapoor, R. Bommasani, K. Klyman, S. Longpre, A. Ramaswami, P. Cihon, A. K. Hopkins, K. Bankston, S. Biderman, M. Bogen, R. Chowdhury, A. Engler, P. Henderson, Y. Jernite, S. Lazar, S. Maffulli, A. Nelson, … A. Narayanan, “Position: On the Societal Impact of Open Foundation Models” in International Conference on Machine Learning (PMLR, 2024), pp. 23082–23104; https://proceedings.mlr.press/v235/kapoor24a.html 2 3 4

  912. [industry] S. Lakatos, “A Revealing Picture: AI-Generated ‘Undressing’ Images Move from Niche Pornography Discussion Forums to a Scaled and Monetized Online Business” (Graphika, 2023); https://graphika.com/reports/a-revealing-picture

  913. D. Thiel, M. Stroebel, R. Portnoff, “Generative ML and CSAM: Implications and Mitigations” (Thorn & Stanford Internet Observatory, 2023); https://fsi.stanford.edu/publication/generative-ml-and-csam-implications-and-mitigations

  914. A. Engler, “How Open-Source Software Shapes AI Policy” (Brookings, 2021); https://www.brookings.edu/articles/how-open-source-software-shapes-ai-policy/ 2 3

  915. D. Gray Widder, S. West, M. Whittaker, Open (for Business): Big Tech, Concentrated Power, and the Political Economy of Open AI, SSRN [preprint] (2023); https://doi.org/10.2139/ssrn.4543807 2

  916. K. Blind, M. Böhm, P. Grzegorzewska, A. Katz, S. Muto, S. Pätsch, T. Schubert, “Study about the Impact of Open Source Software and Hardware on Technological Independence, Competitiveness and Innovation in the EU Economy, Final Study Report” (European Commission, 2021); https://digital-strategy.ec.europa.eu/en/library/study-about-impact-open-source-software-and-hardware-technological-independence-competitiveness-and

  917. Y. Kilcher, Ykilcher/gpt-4chan (2023); https://huggingface.co/ykilcher/gpt-4chan

  918. S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, … L. Sifre, Improving Language Models by Retrieving from Trillions of Tokens. International Conference on Machine Learning 162, 2206–2240 (2021); https://proceedings.mlr.press/v162/borgeaud22a/borgeaud22a.pdf

  919. P. Henderson, E. Mitchell, C. Manning, D. Jurafsky, C. Finn, “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (Association for Computing Machinery, New York, NY, USA, 2023)AIES ‘23, pp. 287–296; https://doi.org/10.1145/3600211.3604690

  920. J. Deng, S. Pang, Y. Chen, L. Xia, Y. Bai, H. Weng, W. Xu, SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-Trained Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2404.12699

  921. T. Huang, S. Hu, L. Liu, “Vaccine: Perturbation-Aware Alignment for Large Language Models against Harmful Fine-Tuning Attack” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=lpXDZKiAnt

  922. D. Rosati, J. Wehner, K. Williams, Ł. Bartoszcze, D. Atanasov, R. Gonzales, S. Majumdar, C. Maple, H. Sajjad, F. Rudzicz, Representation Noising Effectively Prevents Harmful Fine-Tuning on LLMs, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.14577

  923. R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel, A. Zou, D. Song, B. Li, D. Hendrycks, M. Mazeika, Tamper-Resistant Safeguards for Open-Weight LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.00761

  924. G. Wang, Y.-N. Chuang, R. Tang, S. Zhong, J. Yuan, H. Jin, Z. Liu, V. Chaudhary, S. Xu, J. Caverlee, X. Hu, Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion, arXiv [cs.CR] (2024); http://arxiv.org/abs/2410.05331

  925. M. Srikumar, J. Chang, K. Chmielinski, “Risk Mitigation Strategies for the Open Foundation Model Value Chain: Insights from PAI Workshop Co-Hosted with GitHub” (Partnership on AI, 2024); https://partnershiponai.org/wp-content/uploads/dlm_uploads/2024/07/open-foundation-model-risk-mitigation_rev3-1.pdf 2

  926. E. David, Meta Unleashes Its Most Powerful AI Model, Llama 3.1, with 405B Parameters, VentureBeat (2024); https://venturebeat.com/ai/meta-unleashes-its-most-powerful-ai-model-llama-3-1-with-405b-parameters/

  927. B. Muralidharan, H. Beadles, R. Marzban, K. S. Mupparaju, Knowledge AI: Fine-Tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding, arXiv [cs.CL] (2024); http://arxiv.org/abs/2408.04651

  928. [industry] L. Weidinger, M. Rauh, N. Marchal, A. Manzini, L. A. Hendricks, J. Mateos-Garcia, S. Bergman, J. Kay, C. Griffin, B. Bariach, I. Gabriel, V. Rieser, W. Isaac, “Sociotechnical Safety Evaluation of Generative AI Systems” (Google Deepmind, 2023); http://arxiv.org/abs/2310.11986 2 3

  929. [industry] L. Weidinger, J. Barnhart, J. Brennan, C. Butterfield, S. Young, W. Hawkins, L. A. Hendricks, R. Comanescu, O. Chang, M. Rodriguez, J. Beroshi, D. Bloxwich, L. Proleev, J. Chen, S. Farquhar, L. Ho, I. Gabriel, … W. Isaac, “Holistic Safety and Responsibility Evaluations of Advanced AI Models” (Google Deepmind, 2024); http://arxiv.org/abs/2404.14068 2 3

  930. [industry] I. Solaiman, Z. Talat, W. Agnew, L. Ahmad, D. Baker, S. L. Blodgett, H. Daumé III, J. Dodge, E. Evans, S. Hooker, Y. Jernite, A. S. Luccioni, A. Lusoli, M. Mitchell, J. Newman, M.-T. Png, A. Strait, A. Vassilev, Evaluating the Social Impact of Generative AI Systems in Systems and Society, arXiv [cs.CY] (2023); http://arxiv.org/abs/2306.05949 2

  931. A. R. R. Salammagari, G. Srivastava, Advancing Natural Language Understanding for Low-Resource Languages: Current Progress, Applications, and Challenges. International Journal of Advanced Research in Engineering and Technology 15, 244–255 (2024); https://iaeme.com/Home/article_id/IJARET_15_03_021

  932. A. Birhane, W. Isaac, V. Prabhakaran, M. Diaz, M. C. Elish, I. Gabriel, S. Mohamed, “Power to the People? Opportunities and Challenges for Participatory AI” in Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 1–8; https://doi.org/10.1145/3551624.3555290 2

  933. P. Slattery, A. K. Saeri, E. A. C. Grundy, J. Graham, M. Noetel, R. Uuk, J. Dao, S. Pour, S. Casper, N. Thompson, The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks from Artificial Intelligence, arXiv [cs.AI] (2024); http://arxiv.org/abs/2408.12622 2 3 4 5

  934. Partnership on AI, “[Draft] Guidelines for Participatory and Inclusive AI” (2024); https://partnershiponai.notion.site/1e8a6131dda045f1ad00054933b0bda0?v=dcb890146f7d464a86f11fcd5de372c0

  935. M. Maghsoudi, A. Mohammadi, S. Habibipour, Navigating and Addressing Public Concerns in AI: Insights from Social Media Analytics and Delphi. IEEE Access: Practical Innovations, Open Solutions 12, 1–1 (2024); https://doi.org/10.1109/access.2024.3440660

  936. K. Grosse, L. Bieringer, T. R. Besold, A. M. Alahi, “Towards More Practical Threat Models in Artificial Intelligence Security” in 33rd USENIX Security Symposium (USENIX Security 24) (2024), pp. 4891–4908; https://www.usenix.org/system/files/usenixsecurity24-grosse.pdf

  937. H. Li, Z. Ren, M. Fan, W. Li, Y. Xu, Y. Jiang, W. Xia, A Review of Scenario Analysis Methods in Planning and Operation of Modern Power Systems: Methodologies, Applications, and Challenges. Electric Power Systems Research 205, 107722 (2022); https://doi.org/10.1016/j.epsr.2021.107722

  938. A. Mantelero, The Fundamental Rights Impact Assessment (FRIA) in the AI Act: Roots, Legal Obligations and Key Elements for a Model Template. Computer Law and Security Report 54, 106020 (2024); https://doi.org/10.1016/j.clsr.2024.106020

  939. I. D. Raji, P. Xu, C. Honigsberg, D. Ho, “Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance” in Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 557–571; https://doi.org/10.1145/3514094.3534181 2

  940. V. Storchan, R. Kumar, R. Chowdhury, S. Goldfarb-Tarrant, S. Cattell, “2024 Generative AI Red Teaming Transparency Report” (Humane intelligence, 2024). 

  941. [industry] S. Wan, C. Nikolaidis, D. Song, D. Molnar, J. Crnkovich, J. Grace, M. Bhatt, S. Chennabasappa, S. Whitman, S. Ding, V. Ionescu, Y. Li, J. Saxe, CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models, arXiv [cs.CR] (2024); http://arxiv.org/abs/2408.01605

  942. R. J. Neuwirth, Prohibited Artificial Intelligence Practices in the Proposed EU Artificial Intelligence Act (AIA). Computer Law & Security Review 48, 105798 (2023); https://doi.org/10.1016/j.clsr.2023.105798

  943. L. Heim, L. Koessler, Training Compute Thresholds: Features and Functions in AI Regulation, arXiv [cs.CY] (2024); http://arxiv.org/abs/2405.10799 2

  944. L. Koessler, J. Schuett, M. Anderljung, Risk Thresholds for Frontier AI, arXiv [cs.CY] (2024); http://arxiv.org/abs/2406.14713 2 3

  945. Center for Chemical Process Safety (CCPS), Bow Ties in Risk Management (John Wiley & Sons, Nashville, TN, 2018); https://doi.org/10.1002/9781119490357

  946. International Organization for Standardization, “ISO 21448:2022: Road Vehicles — Safety of the Intended Functionality” (ISO, 2022); https://www.iso.org/standard/77490.html 2

  947. [industry] Anthropic, Responsible Scaling Policy. (2024); https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf 2 3 4 5

  948. Partnership on AI, PAI’s Guidance for Safe Foundation Model Deployment (2023); https://partnershiponai.org/modeldeployment/ 2

  949. T. Kelly, A Systematic Approach to Safety Case Management. SAE Transactions: Journal of Materials & Manufacturing 113, 257–266 (2004); http://www.jstor.org/stable/44699541 2

  950. B. Lakshmi Prasanna, M. SaidiReddy, (CSM2-RA-R2-TI): Cyber Security Maturity Model for Risk Assessment Using Risk Register for Threat Intelligence. Journal of Physics. Conference Series 2040, 012005 (2021); https://doi.org/10.1088/1742-6596/2040/1/012005

  951. [industry] Y. Zeng, K. Klyman, A. Zhou, Y. Yang, M. Pan, R. Jia, D. Song, P. Liang, B. Li, AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies, arXiv [cs.CY] (2024); http://arxiv.org/abs/2406.17864 2

  952. H. Wu, AI Whistleblowers, SSRN [preprint] (2024); https://doi.org/10.2139/ssrn.4790511

  953. MITRE ATLAS, MITRE ATLAS AI Incidents (2024); https://ai-incidents.mitre.org/ 2

  954. B. Robinson, J. Ginns, “Transforming Risk Governance at Frontier AI Companies” (Centre for Long-Term Resilience, 2024); https://www.longtermresilience.org/wp-content/uploads/2024/07/Transforming-risk-governance-at-frontier-AI-companies-CLTR-1.pdf

  955. J. Schuett, Three Lines of Defense against Risks from AI. AI & Society (2023); https://doi.org/10.1007/s00146-023-01811-0

  956. R. Bommasani, K. Klyman, S. Longpre, B. Xiong, S. Kapoor, N. Maslej, A. Narayanan, P. Liang, Foundation Model Transparency Reports, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.16268

  957. [industry] D. Hendrycks, N. Carlini, J. Schulman, J. Steinhardt, Unsolved Problems in ML Safety, arXiv [cs.LG] (2021); http://arxiv.org/abs/2109.13916

  958. M. Anderljung, E. T. Smith, J. O’Brien, L. Soder, B. Bucknall, E. Bluemke, J. Schuett, R. Trager, L. Strahm, R. Chowdhury, Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework, arXiv [cs.CY] (2023); http://arxiv.org/abs/2311.14711

  959. R. Gupta, L. Walker, R. Corona, S. Fu, S. Petryk, J. Napolitano, T. Darrell, A. W. Reddie, Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies, arXiv [cs.CY] (2024); http://arxiv.org/abs/2409.17216

  960. D. McDuff, T. Korjakow, S. Cambo, J. J. Benjamin, J. Lee, Y. Jernite, C. M. Ferrandis, A. Gokaslan, A. Tarkowski, J. Lindley, A. F. Cooper, D. Contractor, On the Standardization of Behavioral Use Clauses and Their Adoption for Responsible Licensing of AI, arXiv [cs.SE] (2024); http://arxiv.org/abs/2402.05979

  961. B. Rakova, J. Yang, H. Cramer, R. Chowdhury, Where Responsible AI Meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. Proceedings of the ACM on Human-Computer Interaction 5, 1–23 (2021); https://doi.org/10.1145/3449081

  962. [industry] Microsoft AI, “Putting Principles into Practice: How We Approach Responsible AI at Microsoft” (Microsoft, 2020); https://www.microsoft.com/cms/api/am/binary/RE4pKH5

  963. J. Schuett, A.-K. Reuel, A. Carlier, How to Design an AI Ethics Board. AI and Ethics, 1–19 (2024); https://doi.org/10.1007/s43681-023-00409-y

  964. G. de Beco, Human Rights Impact Assessments. Netherlands Quarterly of Human Rights 27, 139–166 (2009); https://doi.org/10.1177/016934410902700202

  965. E. Donahoe, M. M. Metzger, Artificial Intelligence and Human Rights. Journal of Democracy 30, 115–126 (2019); https://doi.org/10.1353/jod.2019.0029

  966. S. Makridakis, The Art and Science of Forecasting An Assessment and Future Directions. International Journal of Forecasting 2, 15–39 (1986); https://doi.org/10.1016/0169-2070(86)90028-2

  967. E. Karger, P. Atanasov, P. E. Tetlock, “Improving Judgments of Existential Risk: Better Forecasts, Questions, Explanations, Policies” (Future of Humanity Institute, 2022); https://www.fhi.ox.ac.uk/wp-content/uploads/2022/05/Improving-Judgments-of-Existential-Risk.pdf

  968. L. Koessler, J. Schuett, Risk Assessment at AGI Companies: A Review of Popular Risk Assessment Techniques from Other Safety-Critical Industries, arXiv [cs.CY] (2023); http://arxiv.org/abs/2307.08823 2 3

  969. B. Anderson-Samways, “AI-Relevant Regulatory Precedents: A Systematic Search Across All Federal Agencies” (Institute for AI Policy and Strategy, 2024); https://www.iaps.ai/research/ai-relevant-regulatory-precedent

  970. H. E. Roland, B. Moriarty, System Safety Engineering and Management (Wiley, New York, 2nd ed., 1990); https://www.wiley.com/en-us/System+Safety+Engineering+and+Management%2C+2nd+Edition-p-9780471618164

  971. N. G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (The MIT Press, 2012); https://doi.org/10.7551/mitpress/8179.001.0001

  972. S. Dekker, Foundations of Safety Science: A Century of Understanding Accidents and Disasters (Routledge, London, England, 2019); https://doi.org/10.4324/9781351059794

  973. ISO, ISO 31000: Risk Management, ISO (2018); https://www.iso.org/iso-31000-risk-management.html

  974. E. Black, R. Naidu, R. Ghani, K. Rodolfa, D. Ho, H. Heidari, “Toward Operationalizing Pipeline-Aware ML Fairness: A Research Agenda for Developing Practical Guidelines and Tools” in Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 1–11; https://doi.org/10.1145/3617694.3623259

  975. S. Rismani, R. Shelby, A. Smart, E. Jatho, J. Kroll, A. Moon, N. Rostamzadeh, “From Plane Crashes to Algorithmic Harm: Applicability of Safety Engineering Frameworks for Responsible ML” in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 1–18; https://doi.org/10.1145/3544548.3581407

  976. R. Hawkins, C. Paterson, C. Picardi, Y. Jia, R. Calinescu, I. Habli, Guidance on the Assurance of Machine Learning in Autonomous Systems (AMLAS), arXiv [cs.LG] (2021); http://arxiv.org/abs/2102.01564

  977. T. Raz, D. Hillson, A Comparative Review of Risk Management Standards. Risk Management: An International Journal 7, 53–66 (2005); https://doi.org/10.1057/palgrave.rm.8240227

  978. J. Clymer, N. Gabrieli, D. Krueger, T. Larsen, Safety Cases: How to Justify the Safety of Advanced AI Systems, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.10462 2 3 4

  979. C. Haddon-Cave, The Nimrod Review: An Independent Review into the Broader Issues Surrounding the Loss of the RAF Nimrod MR2 Aircraft XV230 in Afghanistan in 2006, Report (Stationery Office, 2009); https://www.gov.uk/government/publications/the-nimrod-review

  980. N. G. Leveson, Applying Systems Thinking to Analyze and Learn from Events. Safety Science 49, 55–64 (2011); https://doi.org/10.1016/j.ssci.2009.12.021

  981. D. Hendrycks, Introduction to AI Safety, Ethics, and Society (CRC Press, 2024); https://www.aisafetybook.com/ 2

  982. O. Delaney, O. Guest, Z. Williams, Mapping Technical Safety Research at AI Companies: A Literature Review and Incentives Analysis, arXiv [cs.CY] (2024); http://arxiv.org/abs/2409.07878

  983. R. Uuk, A. Brouwer, N. Dreksler, V. Pulignano, R. Bommasani, Effective Mitigations for Systemic Risks from General-Purpose AI. (2024); https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5021463

  984. D. A. Boiko, R. MacKnight, G. Gomes, Emergent Autonomous Scientific Research Capabilities of Large Language Models, arXiv [physics.chem-ph] (2023); http://arxiv.org/abs/2304.05332

  985. Q. Lu, L. Zhu, X. Xu, Z. Xing, S. Harrer, J. Whittle, Towards Responsible Generative AI: A Reference Architecture for Designing Foundation Model Based Agents, arXiv [cs.AI] (2023); http://arxiv.org/abs/2311.13148

  986. [industry] SIMA Team, M. A. Raad, A. Ahuja, C. Barros, F. Besse, A. Bolt, A. Bolton, B. Brownfield, G. Buttimore, M. Cant, S. Chakera, S. C. Y. Chan, J. Clune, A. Collister, V. Copeman, A. Cullum, I. Dasgupta, … N. Young, “Scaling Instructable Agents Across Many Simulated Worlds” (Google Deepmind, 2024); http://arxiv.org/abs/2404.10179

  987. T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=Yacmpz84TH

  988. Y. Tian, X. Yang, J. Zhang, Y. Dong, H. Su, Evil Geniuses: Delving into the Safety of LLM-Based Agents, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.11855

  989. Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, L. Kong, OS-Copilot: Towards Generalist Computer Agents with Self-Improvement, arXiv [cs.AI] (2024); http://arxiv.org/abs/2402.07456

  990. Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, … T. Gui, The Rise and Potential of Large Language Model Based Agents: A Survey, arXiv [cs.AI] (2023); http://arxiv.org/abs/2309.07864

  991. [industry] T. Masterman, S. Besen, M. Sawtell, A. Chao, The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey, arXiv [cs.AI] (2024); http://arxiv.org/abs/2404.11584

  992. M. Hartmann, A. Koller, A Survey on Complex Tasks for Goal-Directed Interactive Agents, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.18538

  993. T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, T. Yu, OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, arXiv [cs.AI] (2024); http://arxiv.org/abs/2404.07972

  994. [industry] A. Fourney, G. Bansal, H. Mozannar, C. Tan, E. Salinas, E. (eric) Zhu, F. Niedtner, G. Proebsting, G. Bassman, J. Gerrits, J. Alber, P. Chang, R. Loynd, R. West, V. Dibia, A. Awadallah, E. Kamar, … S. Amershi, “Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks” (Microsoft, 2024); https://www.microsoft.com/en-us/research/publication/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/

  995. S. Hu, M. Ouyang, D. Gao, M. Z. Shou, The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use, arXiv [cs.AI] (2024); http://arxiv.org/abs/2411.10323

  996. J.-P. Rivera, G. Mukobi, A. Reuel, M. Lamparth, C. Smith, J. Schneider, “Escalation Risks from Language Models in Military and Diplomatic Decision-Making” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3630106.3658942

  997. B. Zhang, Y. Tan, Y. Shen, A. Salem, M. Backes, S. Zannettou, Y. Zhang, Breaking Agents: Compromising Autonomous LLM Agents through Malfunction Amplification, arXiv [cs.CR] (2024); http://arxiv.org/abs/2407.20859 2

  998. K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, M. Fritz, “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection” in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 79–90; https://doi.org/10.1145/3605764.3623985

  999. R. Fang, D. Bowman, D. Kang, Voice-Enabled AI Agents Can Perform Common Scams, arXiv [cs.AI] (2024); http://arxiv.org/abs/2410.15650

  1000. M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y. Gal, X. Davies, AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.09024 2

  1001. [industry] P. Kumar, E. Lau, S. Vijayakumar, T. Trinh, Scale Red Team, E. Chang, V. Robinson, S. Hendryx, S. Zhou, M. Fredrikson, S. Yue, Z. Wang, Refusal-Trained LLMs Are Easily Jailbroken as Browser Agents, arXiv [cs.CR] (2024); http://arxiv.org/abs/2410.13886 2

  1002. A. Chan, C. Ezell, M. Kaufmann, K. Wei, L. Hammond, H. Bradley, E. Bluemke, N. Rajkumar, D. Krueger, N. Kolt, L. Heim, M. Anderljung, Visibility into AI Agents, arXiv [cs.CY] (2024); http://arxiv.org/abs/2401.13138 2

  1003. M. K. Cohen, N. Kolt, Y. Bengio, G. K. Hadfield, S. Russell, Regulating Advanced Artificial Agents. Science 384, 36–38 (2024); https://doi.org/10.1126/science.adl0625 2

  1004. G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, T. Scialom, “GAIA: A Benchmark for General AI Assistants” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2024); https://openreview.net/forum?id=fibxvahvs3

  1005. K. Valmeekam, K. Stechly, S. Kambhampati, “LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench” in NeurIPS 2024 Workshop on Open-World Agents (2024); https://openreview.net/forum?id=Gcr1Lx4Koz

  1006. P. P. Liang, A. Zadeh, L.-P. Morency, Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys 56, 1–42 (2024); https://doi.org/10.1145/3656580

  1007. R. Wang, X. Ma, H. Zhou, C. Ji, G. Ye, Y.-G. Jiang, “White-Box Multimodal Jailbreaks Against Large Vision-Language Models” in ACM Multimedia 2024 (2024); https://openreview.net/forum?id=SMOUQtEaAf

  1008. M. Thiemann, J. Lepoutre, Stitched on the Edge: Rule Evasion, Embedded Regulators, and the Evolution of Markets. American Journal of Sociology 122, 1775–1821 (2017); https://doi.org/10.1086/691348

  1009. R. Huben, H. Cunningham, L. R. Smith, A. Ewart, L. Sharkey, “Sparse Autoencoders Find Highly Interpretable Features in Language Models” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=F76bwRSLeK 2

  1010. [industry] L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, J. Wu, Scaling and Evaluating Sparse Autoencoders, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.04093 2 3

  1011. [industry] T. Lieberum, S. Rajamanoharan, A. Conmy, L. Smith, N. Sonnerat, V. Varma, J. Kramar, A. Dragan, R. Shah, N. Nanda, “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2” in The 7th BlackboxNLP Workshop (2024); https://openreview.net/forum?id=XkMrWOJhNd 2

  1012. A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, … T. Henighan, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread (2024); https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 2 3 4

  1013. [industry] T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, … C. Olah, Towards Monosemanticity: Decomposing Language Models with Dictionary Learning, Transformer Circuits Thread (2023); https://transformer-circuits.pub/2023/monosemantic-features 2

  1014. M. Ananny, K. Crawford, Seeing without Knowing: Limitations of the Transparency Ideal and Its Application to Algorithmic Accountability. New Media & Society 20, 973–989 (2018); https://doi.org/10.1177/1461444816676645 2

  1015. [industry] T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. Viégas, M. Wattenberg, An Interpretability Illusion for BERT, arXiv [cs.CL] (2021); http://arxiv.org/abs/2104.07143

  1016. K. Kaye, P. Dixon, “Risky Analysis: Assessing and Improving AI Governance Tools An International Review of AI Governance Tools and Suggestions for Pathways Forward” (World Privacy Forum, 2023); https://www.worldprivacyforum.org/wp-content/uploads/2023/12/WPF_Risky_Analysis_December_2023_fs.pdf

  1017. A. Makelov, G. Lange, A. Geiger, N. Nanda, “Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=Ebt7JgMHv1

  1018. D. Stander, Q. Yu, H. Fan, S. Biderman, “Grokking Group Multiplication with Cosets” in Forty-First International Conference on Machine Learning (2024); https://openreview.net/forum?id=hcQfTsVnBo

  1019. D. Chanin, J. Wilken-Smith, T. Dulka, H. Bhatnagar, J. Bloom, A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.14507

  1020. J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, B. Kim, “Sanity Checks for Saliency Maps” in Advances in Neural Information Processing Systems (NeurIPS 2018) (Curran Associates, Inc., 2018) vol. 31; https://proceedings.neurips.cc/paper_files/paper/2018/hash/294a8ed24b1ad22ec2e7efea049b8737-Abstract.html

  1021. J. Adebayo, M. Muelly, I. Liccardi, B. Kim, “Debugging Tests for Model Explanations” in Advances in Neural Information Processing Systems (NeurIPS 2020) (Curran Associates, Inc., 2020) vol. 33, pp. 700–712; https://proceedings.neurips.cc/paper/2020/hash/075b051ec3d22dac7b33f788da631fd4-Abstract.html

  1022. S. Casper, T. Bu, Y. Li, J. Li, K. Zhang, K. Hariharan, D. Hadfield-Menell, “Red Teaming Deep Neural Networks with Feature Synthesis Tools” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=Od6CHhPM7I 2

  1023. P. Hase, M. Bansal, B. Kim, A. Ghandeharioun, “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (2023); https://openreview.net/forum?id=EldbUlZtbd 2

  1024. J. Miller, B. Chughtai, W. Saunders, Transformer Circuit Faithfulness Metrics Are Not Robust, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.08734

  1025. [industry] M. L. Leavitt, A. Morcos, Towards Falsifiable Interpretability Research, arXiv [cs.CY] (2020); http://arxiv.org/abs/2010.12016 2

  1026. [industry] E. Durmus, A. Tamkin, J. Clark, J. Wei, J. Marcus, J. Batson, K. Handa, L. Lovitt, M. Tong, M. McCain, O. Rausch, S. Huang, S. Bowman, S. Ritchie, T. Henighan, D. Ganguli, “Evaluating Feature Steering: A Case Study in Mitigating Social Biases” (Anthropic, 2024); https://www.anthropic.com/research/evaluating-feature-steering 2

  1027. G. E. Hinton, “Distributed Representations” (CMU-CS-84–157, Carnegie-Mellon University, 1984); http://shelf2.library.cmu.edu/Tech/19334156.pdf

  1028. Y. Bengio, A. Courville, P. Vincent, Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828 (2013); https://doi.org/10.1109/TPAMI.2013.50

  1029. L. Gao, J. Schulman, J. Hilton, “Scaling Laws for Reward Model Overoptimization” in Proceedings of the 40th International Conference on Machine Learning (PMLR, Honolulu, Hawaii, USA, 2023), pp. 10835–10866; https://proceedings.mlr.press/v202/gao23h.html 2

  1030. P. Singhal, T. Goyal, J. Xu, G. Durrett, A Long Way to Go: Investigating Length Correlations in RLHF, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.03716

  1031. J. M. V. Skalse, N. H. R. Howe, D. Krasheninnikov, D. Krueger, “Defining and Characterizing Reward Gaming” in 36th Conference on Neural Information Processing Systems (NeurIPS 2022) (Virtual, 2022); https://openreview.net/forum?id=yb3HOXO3lX2 2 3

  1032. L. E. McKinney, Y. Duan, D. Krueger, A. Gleave, “On The Fragility of Learned Reward Functions” in 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Deep Reinforcement Learning Workshop (Virtual, 2022); https://openreview.net/forum?id=9gj9vXfeS-y 2

  1033. J. Tien, J. Z.-Y. He, Z. Erickson, A. Dragan, D. S. Brown, “Causal Confusion and Reward Misidentification in Preference-Based Reward Learning” in 11th International Conference on Learning Representations (ICLR 2023) (Kigali, Rwanda, 2022); https://openreview.net/forum?id=R0Xxvr_X3ZA 2

  1034. Z. X. Yong, C. Menghini, S. Bach, “Low-Resource Languages Jailbreak GPT-4” in NeurIPS Workshop on Socially Responsible Language Modelling Research (SoLaR) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=pn83r8V2sv 2

  1035. Y. Huang, L. Sun, H. Wang, S. Wu, Q. Zhang, Y. Li, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li, H. Sun, Z. Liu, Y. Liu, Y. Wang, Z. Zhang, B. Vidgen, … Y. Zhao, “Position: TrustLLM: Trustworthiness in Large Language Models” in International Conference on Machine Learning (PMLR, 2024), pp. 20166–20270; https://proceedings.mlr.press/v235/huang24x.html

  1036. S. Longpre, S. Kapoor, K. Klyman, A. Ramaswami, R. Bommasani, B. Blili-Hamelin, Y. Huang, A. Skowron, Z.-X. Yong, S. Kotha, Y. Zeng, W. Shi, X. Yang, R. Southen, A. Robey, P. Chao, D. Yang, … P. Henderson, A Safe Harbor for AI Evaluation and Red Teaming, arXiv [cs.AI] (2024); http://arxiv.org/abs/2403.04893 2 3

  1037. Y. M. Pa Pa, S. Tanizaki, T. Kou, M. van Eeten, K. Yoshioka, T. Matsumoto, “An Attacker’s Dream? Exploring the Capabilities of ChatGPT for Developing Malware” in Proceedings of the 16th Cyber Security Experimentation and Test Workshop (CSET ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 10–18; https://doi.org/10.1145/3607505.3607513

  1038. A. Liu, Q. Sheng, X. Hu, “Preventing and Detecting Misinformation Generated by Large Language Models” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM, New York, NY, USA, 2024), pp. 3001–3004; https://doi.org/10.1145/3626772.3661377

  1039. J. B. Sandbrink, Artificial Intelligence and Biological Misuse: Differentiating Risks of Language Models and Biological Design Tools, arXiv [cs.CY] (2023); http://arxiv.org/abs/2306.13952

  1040. L. Pöhler, V. Schrader, A. Ladwein, F. von Keller, A Technological Perspective on Misuse of Available AI, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.15325

  1041. M. Anderljung, J. Hazell, Protecting Society from AI Misuse: When Are Restrictions on Capabilities Warranted?, arXiv [cs.AI] (2023); http://arxiv.org/abs/2303.09377

  1042. A. Karamolegkou, J. Li, L. Zhou, A. Søgaard, “Copyright Violations and Large Language Models” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 7403–7412; https://doi.org/10.18653/v1/2023.emnlp-main.458

  1043. H. Li, D. Guo, W. Fan, M. Xu, J. Huang, F. Meng, Y. Song, “Multi-Step Jailbreaking Privacy Attacks on ChatGPT” in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) (Singapore, 2023); https://openreview.net/forum?id=ls4Pfsl2jZ

  1044. [industry] M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. Feder Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, K. Lee, Scalable Extraction of Training Data from (Production) Language Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2311.17035

  1045. B. C. Das, M. H. Amini, Y. Wu, Security and Privacy Challenges of Large Language Models: A Survey, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.00888

  1046. B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, X. Cheng, On Protecting the Data Privacy of Large Language Models (LLMs): A Survey, arXiv [cs.CR] (2024); http://arxiv.org/abs/2403.05156

  1047. Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, Y. Zhang, A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly. High-Confidence Computing 4, 100211 (2024); https://doi.org/10.1016/j.hcc.2024.100211 2 3

  1048. A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, “Toxicity in Chatgpt: Analyzing Persona-Assigned Language Models” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. 1236–1270; https://doi.org/10.18653/v1/2023.findings-emnlp.88

  1049. Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, Y. Zhang, “Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 3403–3417; https://doi.org/10.1145/3576915.3616679

  1050. Z. Xu, S. Jain, M. Kankanhalli, Hallucination Is Inevitable: An Innate Limitation of Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2401.11817

  1051. [industry] Z. Bai, P. Wang, T. Xiao, T. He, Z. Han, Z. Zhang, M. Z. Shou, Hallucination of Multimodal Large Language Models: A Survey, arXiv [cs.CV] (2024); http://arxiv.org/abs/2404.18930

  1052. Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, K. Wang, Y. Liu, Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study, arXiv [cs.SE] (2023); http://arxiv.org/abs/2305.13860 2 3

  1053. R. Shah, Q. F. Montixi, S. Pour, A. Tagade, J. Rando, “Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Socially Responsible Language Modelling Research Workshop (SoLaR) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=x3Ltqz1UFg 2 3 4 5

  1054. N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, P. W. Koh, D. Ippolito, F. Tramèr, L. Schmidt, “Are Aligned Neural Networks Adversarially Aligned?” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=OQQoD8Vc3B 2

  1055. X. Shen, Z. Chen, M. Backes, Y. Shen, Y. Zhang, “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, arXiv [cs.CR] (2023); http://arxiv.org/abs/2308.03825 2

  1056. [industry] N. Li, Z. Han, I. Steneker, W. Primack, R. Goodside, H. Zhang, Z. Wang, C. Menghini, S. Yue, LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks yet, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.15221 2 3 4

  1057. L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, N. Dziri, “WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=n5R6TvBVcX

  1058. Z. Dong, Z. Zhou, C. Yang, J. Shao, Y. Qiao, Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey (Association for Computational Linguistics, 2024); https://doi.org/10.18653/v1/2024.naacl-long.375

  1059. M. Andriushchenko, F. Croce, N. Flammarion, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, arXiv [cs.CR] (2024); http://arxiv.org/abs/2404.02151 2

  1060. Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, W. Shi, “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 14322–14350; https://doi.org/10.18653/v1/2024.acl-long.773

  1061. A. G. Chowdhury, M. M. Islam, V. Kumar, F. H. Shezan, V. Kumar, V. Jain, A. Chadha, Breaking down the Defenses: A Comparative Survey of Attacks on Large Language Models, arXiv [cs.CR] (2024); http://arxiv.org/abs/2403.04786

  1062. M. K. B. Doumbouya, A. Nandi, G. Poesia, D. Ghilardi, A. Goldie, F. Bianchi, D. Jurafsky, C. D. Manning, H4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment, arXiv [cs.CR] (2024); http://arxiv.org/abs/2408.04811

  1063. [industry] B. R. Y. Huang, M. Li, L. Tang, Endless Jailbreaks with Bijection Learning, arXiv [cs.CL] (2024); http://arxiv.org/abs/2410.01294 2 3

  1064. X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, P. Henderson, “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=hTEGyKf0dZ 2

  1065. Q. Zhan, R. Fang, R. Bindu, A. Gupta, T. Hashimoto, D. Kang, “Removing RLHF Protections in GPT-4 via Fine-Tuning” in 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Mexico City, Mexico, 2024); https://doi.org/10.48550/arXiv.2311.05553

  1066. S. Jain, R. Kirk, E. S. Lubana, R. P. Dick, H. Tanaka, E. Grefenstette, T. Rocktäschel, D. S. Krueger, Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks, arXiv [cs.LG] (2023); http://arxiv.org/abs/2311.12786

  1067. X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y. Wang, X. Zhao, D. Lin, Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.02949

  1068. R. Bhardwaj, S. Poria, Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.14303

  1069. J. Ji, K. Wang, T. Qiu, B. Chen, J. Zhou, C. Li, H. Lou, Y. Yang, Language Models Resist Alignment, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.06144 2

  1070. X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, P. Henderson, Safety Alignment Should Be Made More Than Just a Few Tokens Deep, arXiv [cs.CR] (2024); http://arxiv.org/abs/2406.05946 2

  1071. S. Hu, Y. Fu, Z. S. Wu, V. Smith, Jogging the Memory of Unlearned LLMs through Targeted Relearning Attacks, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.13356

  1072. D. Halawi, A. Wei, E. Wallace, T. T. Wang, N. Haghtalab, J. Steinhardt, “Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation” in International Conference on Machine Learning (PMLR, 2024), pp. 17298–17312; https://proceedings.mlr.press/v235/halawi24a.html

  1073. R. Greenblatt, F. Roger, D. Krasheninnikov, D. Krueger, “Stress-Testing Capability Elicitation With Password-Locked Models” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=zzOOqD6R1b

  1074. M. Lo, F. Barez, S. Cohen, Large Language Models Relearn Removed Concepts (Association for Computational Linguistics, 2024); https://doi.org/10.18653/v1/2024.findings-acl.492

  1075. S. Peng, P.-Y. Chen, M. D. Hull, D. H. Chau, “Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=GZnsqBwHAG 2

  1076. A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V. Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, S. Casper, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.15549 2 3

  1077. S. Xhonneux, A. Sordoni, S. Günnemann, G. Gidel, L. Schwinn, “Efficient Adversarial Training in LLMs with Continuous Attacks” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/pdf?id=8jB6sGqvgQ

  1078. L. Schwinn, S. Geisler, Revisiting the Robust Alignment of Circuit Breakers, arXiv [cs.CR] (2024); http://arxiv.org/abs/2407.15902

  1079. T. Huang, S. Hu, F. Ilhan, S. F. Tekin, L. Liu, Harmful Fine-Tuning Attacks and Defenses for Large Language Models: A Survey, arXiv [cs.CR] (2024); http://arxiv.org/abs/2409.18169

  1080. J. Łucki, B. Wei, Y. Huang, P. Henderson, F. Tramèr, J. Rando, An Adversarial Perspective on Machine Unlearning for AI Safety, arXiv [cs.LG] (2024); http://arxiv.org/abs/2409.18025

  1081. Y. Wolf, N. Wies, O. Avnery, Y. Levine, A. Shashua, Fundamental Limitations of Alignment in Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2304.11082

  1082. T. Tseng, E. McLean, K. Pelrine, T. T. Wang, A. Gleave, Can Go AIs Be Adversarially Robust?, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.12843

  1083. M. Andriushchenko, N. Flammarion, Does Refusal Training in LLMs Generalize to the Past Tense?, arXiv [cs.CL] (2024); http://arxiv.org/abs/2407.11969

  1084. I. D. Raji, E. Denton, E. M. Bender, A. Hanna, A. Paullada, “AI and the Everything in the Whole Wide World Benchmark” in 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Datasets and Benchmarks Track (Round 2) (Virtual, 2021); https://openreview.net/forum?id=j6NxpQbREA1

  1085. B. Hutchinson, N. Rostamzadeh, C. Greer, K. Heller, V. Prabhakaran, “Evaluation Gaps in Machine Learning Practice” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 1859–1876; https://doi.org/10.1145/3531146.3533233

  1086. S. Casper, C. Ezell, C. Siegmann, N. Kolt, T. L. Curtis, B. Bucknall, A. Haupt, K. Wei, J. Scheurer, M. Hobbhahn, L. Sharkey, S. Krishna, M. Von Hagen, S. Alberti, A. Chan, Q. Sun, M. Gerovitch, … D. Hadfield-Menell, Black-Box Access Is Insufficient for Rigorous AI Audits, arXiv [cs.CY] (2024); http://arxiv.org/abs/2401.14446 2 3 4

  1087. B. Ram, P. Verma, Artificial Intelligence AI-Based Chatbot Study of ChatGPT, Google AI Bard and Baidu AI. World Journal of Advanced Engineering Technology and Sciences 8, 258–261 (2023); https://doi.org/10.30574/wjaets.2023.8.1.0045

  1088. M. M. Maas, “Artificial Intelligence Governance under Change: Foundations, Facets, Frameworks,” thesis, University of Copenhagen (2020); 

  1089. P. M. Napoli, Social Media and the Public Interest (Columbia University Press, 2019); https://cup.columbia.edu/book/social-media-and-the-public-interest/9780231184540

  1090. J. M. Balkin, How to Regulate (and Not Regulate) Social Media. Journal of Free Speech Law 1, 71–96 (2021); https://doi.org/10.2139/ssrn.3484114

  1091. R. H. Frank, P. J. Cook, Winner-Take-All Markets. Studies in Microeconomics 1, 131–154 (2013); https://doi.org/10.1177/2321022213501254

  1092. B. A. Prakash, A. Beutel, R. Rosenfeld, C. Faloutsos, “Winner Takes All: Competing Viruses or Ideas on Fair-Play Networks” in Proceedings of the 21st International Conference on World Wide Web - WWW ‘12 (ACM Press, New York, New York, USA, 2012); https://doi.org/10.1145/2187836.2187975

  1093. T. A. Han, L. M. Pereira, T. Lenaerts, “Modelling and Influencing the AI Bidding War: A Research Agenda” in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘19) (New York, NY, USA, 2019), pp. 5–11; https://doi.org/10.1145/3306618.3314265

  1094. T. Cimpeanu, F. C. Santos, L. M. Pereira, T. Lenaerts, T. A. Han, Artificial Intelligence Development Races in Heterogeneous Settings. Scientific Reports 12, 1723 (2022); https://doi.org/10.1038/s41598-022-05729-3

  1095. A. Guasti, M. Koenig-Archibugi, Has Global Trade Competition Really Led to a Race to the Bottom in Labor Standards? International Studies Quarterly: A Publication of the International Studies Association 66, sqac061 (2022); https://doi.org/10.1093/isq/sqac061

  1096. G. Porter, Trade Competition and Pollution Standards: “race to the Bottom” or “stuck at the Bottom.” Journal of Environment & Development 8, 133–151 (1999); https://doi.org/10.1177/107049659900800203

  1097. D. Vera, C. Rusche, “The Economics of Platforms” (Institut der deutschen Wirtschaft, 2018); https://www.iwkoeln.de/en/studies/vera-demary-christian-rusche-the-economics-of-platforms.html

  1098. M. F. Niculescu, D. J. Wu, L. Xu, Strategic Intellectual Property Sharing: Competition on an Open Technology Platform under Network Effects. Information Systems Research : ISR 29, 498–519 (2018); https://doi.org/10.1287/isre.2017.0756

  1099. N. L. Rose, Fear of Flying? Economic Analyses of Airline Safety. The Journal of Economic Perspectives: A Journal of the American Economic Association 6, 75–94 (1992); https://doi.org/10.1257/jep.6.2.75

  1100. J. Tirole, The Theory of Industrial Organization (MIT Press, London, England, 1988). 

  1101. S. Armstrong, N. Bostrom, C. Shulman, Racing to the Precipice: A Model of Artificial Intelligence Development. AI & Society 31, 201–206 (2016); https://doi.org/10.1007/s00146-015-0590-y

  1102. G. H. Stern, R. J. Feldman, Too Big to Fail: The Hazards of Bank Bailouts (Brookings Institution Press, 2009); https://www.brookings.edu/books/too-big-to-fail/

  1103. B. E. Gup, Financial Management Association International, Too Big to Fail : Policies and Practices in Government Bailouts (Praeger, Westport, Conn, ed. 1, 2003); https://library-search.open.ac.uk/permalink/44OPN_INST/la9sg5/alma9952597297902316

  1104. V. Acharya, D. Anginer, J. A. Warburton, “The End of Market Discipline? Investor Expectations of Implicit Government Guarantees” (2022); https://cepr.org/publications/dp17426

  1105. K. Pernell, J. Jung, Rethinking Moral Hazard: Government Protection and Bank Risk-Taking. Socio-Economic Review 22, 625–653 (2024); https://doi.org/10.1093/ser/mwad050

  1106. W. J. Baumol, W. E. Oates, The Theory of Environmental Policy (Cambridge University Press, Cambridge, England, ed. 2, 1988); https://doi.org/10.1017/cbo9781139173513

  1107. P. DeCicca, D. Kenkel, M. F. Lovenheim, The Economics of Tobacco Regulation: A Comprehensive Review. Journal of Economic Literature 60, 883–970 (2022); https://doi.org/10.1257/jel.20201482

  1108. J. Guerreiro, S. Rebelo, P. Teles, “Regulating Artificial Intelligence” (w31921, National Bureau of Economic Research, 2023); https://doi.org/10.3386/w31921

  1109. L. Dallas, “Short-Termism, the Financial Crisis, and Corporate Governance” (University of San Diego School of Law, 2012); http://dx.doi.org/

  1110. N. Kolt, M. Anderljung, J. Barnhart, A. Brass, K. Esvelt, G. K. Hadfield, L. Heim, M. Rodriguez, J. B. Sandbrink, T. Woodside, Responsible Reporting for Frontier AI Development, arXiv [cs.CY] (2024); http://arxiv.org/abs/2404.02675

  1111. M. Anderljung, J. Barnhart, A. Korinek, J. Leung, C. O’Keefe, J. Whittlestone, S. Avin, M. Brundage, J. Bullock, D. Cass-Beggs, B. Chang, T. Collins, T. Fist, G. Hadfield, A. Hayes, L. Ho, S. Hooker, … K. Wolf, Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv [cs.CY] (2023); http://arxiv.org/abs/2307.03718

  1112. L. Collina, M. Sayyadi, M. Provitera, Critical Issues About A.I. Accountability Answered. California Management Review Insights (2023); https://cmr.berkeley.edu/2023/11/critical-issues-about-a-i-accountability-answered/

  1113. A. T. da Fonseca, E. Vaz de Sequeira, L. Barreto Xavier, “Liability for AI Driven Systems” in Multidisciplinary Perspectives on Artificial Intelligence and the Law, H. Sousa Antunes, P. M. Freitas, A. L. Oliveira, C. Martins Pereira, E. Vaz de Sequeira, L. Barreto Xavier, Eds. (Springer International Publishing, Cham, 2024), pp. 299–317; https://doi.org/10.1007/978-3-031-41264-6_16

  1114. M. Buiten, A. de Streel, M. Peitz, The Law and Economics of AI Liability. Computer Law and Security Report 48, 105794 (2023); https://doi.org/10.1016/j.clsr.2023.105794

  1115. T. Miller, Explanation in Artificial Intelligence: Insights from the Social Sciences. Artificial Intelligence 267, 1–38 (2019); https://doi.org/10.1016/j.artint.2018.07.007

  1116. F. Doshi-Velez, B. Kim, Towards A Rigorous Science of Interpretable Machine Learning, arXiv [stat.ML] (2017); http://arxiv.org/abs/1702.08608

  1117. Z. C. Lipton, The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability Is Both Important and Slippery. ACM Queue: Tomorrow’s Computing Today 16, 31–57 (2018); https://doi.org/10.1145/3236386.3241340

  1118. T. Räuker, A. Ho, S. Casper, D. Hadfield-Menell, Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks, arXiv [cs.LG] (2022); http://arxiv.org/abs/2207.13243

  1119. M. Busuioc, Accountable Artificial Intelligence: Holding Algorithms to Account. Public Administration Review 81, 825–836 (2021); https://doi.org/10.1111/puar.13293

  1120. F. Doshi-Velez, M. Kortz, R. Budish, C. Bavitz, S. J. Gershman, D. O’Brien, K. Scott, S. Shieber, J. Waldo, D. Weinberger, A. Weller, A. Wood, “Accountability of AI Under the Law: The Role of Explanation” (Berkman Klein Center Working Group on Explanation and the Law, 2017); http://nrs.harvard.edu/urn-3:HUL.InstRepos:34372584

  1121. R. Palin, I. Habli, “Assurance of Automotive Safety – A Safety Case Approach” in Computer Safety, Reliability, and Security (SAFECOMP 2010), E. Schoitsch, Ed. (Springer, Berlin, Heidelberg, 2010)Lecture Notes in Computer Science (LNPSE), pp. 82–96; https://doi.org/10.1007/978-3-642-15651-9_7

  1122. I. I. Livshitz, P. A. Lontsikh, N. P. Lontsikh, E. Y. Golovina, O. M. Safonova, “A Study of Modern Risk Management Methods for Industrial Safety Assurance in the Fuel and Energy Industry” in 2021 International Conference on Quality Management, Transport and Information Security, Information Technologies (IT&QM&IS) (2021), pp. 165–167; https://doi.org/10.1109/ITQMIS53292.2021.9642791

  1123. M. L. Cummings, Rethinking the Maturity of Artificial Intelligence in Safety-Critical Settings. AI Magazine 42, 6–15 (2021); https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/7394

  1124. N. Kolt, Governing AI Agents (2024); https://doi.org/10.2139/ssrn.4772956

  1125. P. Verdegem, Dismantling AI Capitalism: The Commons as an Alternative to the Power Concentration of Big Tech. AI & Society 39, 1–11 (2022); https://doi.org/10.1007/s00146-022-01437-8

  1126. K. Crawford, Atlas of AI, Yale University Press London (2021); https://yalebooks.co.uk/9780300264630/atlas-of-ai

  1127. J. Angwin, A. Nelson, R. Palta, “Seeking Reliable Election Information? Don’t Trust AI” (The AI Democracy Projects, 2024); https://www.proofnews.org/seeking-election-information-dont-trust-ai/

  1128. H. Shen, A. DeVos, M. Eslami, K. Holstein, Everyday Algorithm Auditing: Understanding the Power of Everyday Users in Surfacing Harmful Algorithmic Behaviors. Proceedings of the ACM on Human-Computer Interaction 5, 1–29 (2021); https://doi.org/10.1145/3479577

  1129. G. Abercrombie, D. Benbouzid, P. Giudici, D. Golpayegani, J. Hernandez, P. Noro, H. Pandit, E. Paraschou, C. Pownall, J. Prajapati, M. A. Sayre, U. Sengupta, A. Suriyawongkul, R. Thelot, S. Vei, L. Waltersdorfer, A Collaborative, Human-Centred Taxonomy of AI, Algorithmic, and Automation Harms, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.01294

  1130. J. Molloy, S. Shahbeigi, J. A. McDermid, Hazard and Safety Analysis of Machine-Learning-Based Perception Capabilities in Autonomous Vehicles. Computer 57, 60–70 (2024); https://doi.org/10.1109/mc.2024.3443751

  1131. Y. Jia, T. Lawton, J. Burden, J. McDermid, I. Habli, Safety-Driven Design of Machine Learning for Sepsis Treatment. Journal of Biomedical Informatics 117, 103762 (2021); https://doi.org/10.1016/j.jbi.2021.103762

  1132. R. Hawkins, C. Picardi, L. Donnell, M. Ireland, Creating a Safety Assurance Case for a Machine Learned Satellite-Based Wildfire Detection and Alert System. Journal of Intelligent & Robotic Systems 108, 1–21 (2023); https://doi.org/10.1007/s10846-023-01905-3

  1133. P. Festor, Y. Jia, A. C. Gordon, A. A. Faisal, I. Habli, M. Komorowski, Assuring the Safety of AI-Based Clinical Decision Support Systems: A Case Study of the AI Clinician for Sepsis Treatment. BMJ Health & Care Informatics 29, e100549 (2022); https://doi.org/10.1136/bmjhci-2022-100549

  1134. [industry] Department for Science, Innovation & Technology, “Frontier AI Safety Commitments, AI Seoul Summit 2024” (GOV.UK, 2024); https://www.gov.uk/government/publications/frontier-ai-safety-commitments-ai-seoul-summit-2024/frontier-ai-safety-commitments-ai-seoul-summit-2024 2

  1135. [industry] R. Schwartz, J. Fiscus, K. Greene, G. Waters, R. Chowdhury, T. Jensen, C. Greenberg, A. Godil, R. Amironesei, P. Hall, S. Jain, “The NIST Assessing Risks and Impacts of AI (ARIA) Pilot Evaluation Plan” (US National Institute of Standards and Technology, 2024); https://ai-challenges.nist.gov/uassets/7

  1136. C. G. Northcutt, A. Athalye, J. Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks” in 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Datasets and Benchmarks Track (Round 1) (Virtual, 2021); https://openreview.net/forum?id=XccDXrDNLek

  1137. Z. Xiao, S. Zhang, V. Lai, Q. V. Liao, Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics Using Measurement Theory (Association for Computational Linguistics, 2023); https://doi.org/10.18653/v1/2023.emnlp-main.676

  1138. M. Sclar, Y. Choi, Y. Tsvetkov, A. Suhr, Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.11324

  1139. B. Shu, L. Zhang, M. Choi, L. Dunagan, L. Logeswaran, M. Lee, D. Card, D. Jurgens, “You Don’t Need a Personality Test to Know These Models Are Unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 5263–5281; https://doi.org/10.18653/v1/2024.naacl-long.295

  1140. A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, … A. Testoni, LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.18403

  1141. ISACA, “The Risk IT Framework” (2009); https://www.hci-itil.com/ITIL_v3/docs/RiskIT_FW_30June2010_Research.pdf

  1142. US AI Safety Institute, UK AI Safety Institute, “US AISI and UK AISI Joint Pre-Deployment Test” (National Institute of Standards and Technology; Department of Science Innovation and Technology, 2024); https://www.nist.gov/system/files/documents/2024/11/19/Upgraded%20Sonnet-Publication-US.pdf

  1143. G. Leech, J. J. Vazquez, N. Kupper, M. Yagudin, L. Aitchison, Questionable Practices in Machine Learning, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.12220

  1144. [industry] L. Madaan, A. K. Singh, R. Schaeffer, A. Poulton, S. Koyejo, P. Stenetorp, S. Narang, D. Hupkes, Quantifying Variance in Evaluation Benchmarks, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.10229

  1145. C. Xu, S. Guan, D. Greene, M.-T. Kechadi, Benchmark Data Contamination of Large Language Models: A Survey, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.04244

  1146. Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, X. Xie, A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology 15, 39:1–39:45 (2024); https://doi.org/10.1145/3641289

  1147. [industry] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, N. Duan, AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2304.06364

  1148. L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Datasets and Benchmarks Track (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=uccHPGDlao

  1149. [industry] S. Yao, N. Shinn, P. Razavi, K. Narasimhan, τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, arXiv [cs.AI] (2024); http://arxiv.org/abs/2406.12045

  1150. P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. Cosgrove, C. D. Manning, C. Re, … Y. Koreeda, Holistic Evaluation of Language Models. Transactions on Machine Learning Research (2023); https://openreview.net/forum?id=iO4LZibEqW 2

  1151. A. Reuel, A. Hardy, C. Smith, M. Lamparth, M. Hardy, M. J. Kochenderfer, BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices, arXiv [cs.AI] (2024); http://arxiv.org/abs/2411.12990 2

  1152. [industry] E. Miller, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations, arXiv [stat.AP] (2024); http://arxiv.org/abs/2411.00640

  1153. [industry] N. Sambasivan, E. Arnesen, B. Hutchinson, V. Prabhakaran, Non-Portability of Algorithmic Fairness in India, arXiv [cs.CY] (2020); http://arxiv.org/abs/2012.03659

  1154. I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K. Ahmed, Bias and Fairness in Large Language Models: A Survey. Computational Linguistics (Association for Computational Linguistics) 50, 1–83 (2024); https://doi.org/10.1162/coli_a_00524

  1155. K. Charmaz, Constructing Grounded Theory (SAGE Publications, Thousand Oaks, CA, 2014). 

  1156. T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, S. Singh, “AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), B. Webber, T. Cohn, Y. He, Y. Liu, Eds. (Association for Computational Linguistics, Online, 2020), pp. 4222–4235; https://doi.org/10.18653/v1/2020.emnlp-main.346

  1157. E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, G. Irving, “Red Teaming Language Models with Language Models” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Y. Goldberg, Z. Kozareva, Y. Zhang, Eds. (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022), pp. 3419–3448; https://doi.org/10.18653/v1/2022.emnlp-main.225 2 3

  1158. [industry] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, … J. Clark, “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned” (Anthropic, 2022); http://arxiv.org/abs/2209.07858 2 3 4

  1159. S. Casper, J. Lin, J. Kwon, G. Culp, D. Hadfield-Menell, Explore, Establish, Exploit: Red Teaming Language Models from Scratch, arXiv [cs.CL] (2023); http://arxiv.org/abs/2306.09442

  1160. S. Tong, E. Jones, J. Steinhardt, “Mass-Producing Failures of Multimodal Systems with Language Models” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=T6iiOqsGOh

  1161. M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, D. Hendrycks, HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.04249 2

  1162. P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, E. Wong, Jailbreaking Black Box Large Language Models in Twenty Queries, arXiv [cs.LG] (2023); http://arxiv.org/abs/2310.08419

  1163. D. Ziegler, S. Nix, L. Chan, T. Bauman, P. Schmidt-Nielsen, T. Lin, A. Scherlis, N. Nabeshima, B. Weinstein-Raun, D. de Haas, B. Shlegeris, N. Thomas, “Adversarial Training for High-Stakes Reliability” in Advances in Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, US, 2022) vol. 35, pp. 9274–9286; https://proceedings.neurips.cc//paper_files/paper/2022/hash/3c44405d619a6920384a45bce876b41e-Abstract-Conference.html 2 3

  1164. A. Rao, S. Vashistha, A. Naik, S. Aditya, M. Choudhury, “Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks” in 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (Torino, Italia, 2024); https://doi.org/10.48550/arXiv.2305.14965 2 3

  1165. [industry] A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, A. Karbasi, Tree of Attacks: Jailbreaking Black-Box LLMs Automatically, arXiv [cs.LG] (2023); http://arxiv.org/abs/2312.02119

  1166. T. D. Pala, V. Y. H. Toh, R. Bhardwaj, S. Poria, Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique, arXiv [cs.CL] (2024); http://arxiv.org/abs/2408.10701

  1167. M. Feffer, A. Sinha, Z. C. Lipton, H. Heidari, Red-Teaming for Generative AI: Silver Bullet or Security Theater?, arXiv [cs.CY] (2024); http://arxiv.org/abs/2401.15897 2

  1168. [industry] L. Weidinger, J. Mellor, B. G. Pegueroles, N. Marchal, R. Kumar, K. Lum, C. Akbulut, M. Diaz, S. Bergman, M. Rodriguez, V. Rieser, W. Isaac, STAR: SocioTechnical Approach to Red Teaming Language Models, arXiv [cs.AI] (2024); http://arxiv.org/abs/2406.11757

  1169. P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer, H. Hassani, E. Wong, JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, arXiv [cs.CR] (2024); http://arxiv.org/abs/2404.01318

  1170. US AI Safety Institute, “Managing Misuse Risk for Dual-Use Foundation Models” (NIST, 2024); https://doi.org/10.6028/nist.ai.800-1.ipd

  1171. W. Tann, Y. Liu, J. H. Sim, C. M. Seah, E.-C. Chang, Using Large Language Models for Cybersecurity Capture-the-Flag Challenges and Certification Questions, arXiv [cs.AI] (2023); http://arxiv.org/abs/2308.10443

  1172. D. Kang, X. Li, I. Stoica, C. Guestrin, M. Zaharia, T. Hashimoto, “Exploiting Programmatic Behavior of LLMs: Dual-Use through Standard Security Attacks” in 2024 IEEE Security and Privacy Workshops (SPW) (IEEE, 2024), pp. 132–143; https://doi.org/10.1109/spw63631.2024.00018

  1173. F. N. Motlagh, M. Hajizadeh, M. Majd, P. Najafi, F. Cheng, C. Meinel, Large Language Models in Cybersecurity: State-of-the-Art, arXiv [cs.CR] (2024); http://arxiv.org/abs/2402.00891

  1174. A. Hagerty, I. Rubinov, Global AI Ethics: A Review of the Social Impacts and Ethical Implications of Artificial Intelligence, arXiv [cs.CY] (2019); http://arxiv.org/abs/1907.07892

  1175. M. M. Maas, “Aligning AI Regulation to Sociotechnical Change” in The Oxford Handbook of AI Governance, J. B. Bullock, Y.-C. Chen, J. Himmelreich, V. M. Hudson, A. Korinek, M. M. Young, B. Zhang, Eds. (Oxford University Press, 2022); https://doi.org/10.1093/oxfordhb/9780197579329.013.22

  1176. D. Dalrymple, J. Skalse, Y. Bengio, S. Russell, M. Tegmark, S. Seshia, S. Omohundro, C. Szegedy, B. Goldhaber, N. Ammann, A. Abate, J. Halpern, C. Barrett, D. Zhao, T. Zhi-Xuan, J. Wing, J. Tenenbaum, Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, arXiv [cs.AI] (2024); http://arxiv.org/abs/2405.06624 2 3

  1177. A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills, M. Anderljung, B. Garfinkel, L. Heim, A. Trask, G. Mukobi, R. Schaeffer, M. Baker, … R. Trager, Open Problems in Technical AI Governance, arXiv [cs.CY] (2024); http://arxiv.org/abs/2407.14981 2 3

  1178. R. Ren, S. Basart, A. Khoja, A. Gatti, L. Phan, X. Yin, M. Mazeika, A. Pan, G. Mukobi, R. H. Kim, S. Fitz, D. Hendrycks, “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?” in 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024); https://openreview.net/pdf?id=YagfTP3RK6

  1179. B. S. Bucknall, R. F. Trager, “Structured Access for Third-Party Research on Frontier AI Models: Investigating Researchers’ Model Access Requirements” (Oxford Martin School, University of Oxford and Center for the Governance of AI, 2023); https://cdn.governance.ai/Structured_Access_for_Third-Party_Research.pdf

  1180. A. Birhane, V. U. Prabhu, E. Kahembwe, Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes, arXiv [cs.CY] (2021); http://arxiv.org/abs/2110.01963

  1181. R. Ashmore, R. Calinescu, C. Paterson, Assuring the Machine Learning Lifecycle. ACM Computing Surveys 54, 1–39 (2022); https://doi.org/10.1145/3453444

  1182. S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, T. T. Wang, S. Marks, C.-R. Segerie, M. Carroll, A. Peng, P. Christoffersen, M. Damani, … D. Hadfield-Menell, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research (2023); https://openreview.net/forum?id=bx24KpJ4Eb 2 3

  1183. T. Shevlane, Structured Access: An Emerging Paradigm for Safe AI Deployment, arXiv [cs.AI] (2022); http://arxiv.org/abs/2201.05159

  1184. J. Petrie, O. Aarne, N. Amman, D. Dalrymple, Interim Report: Mechanisms for Flexible Hardware-Enabled Guarantees. (2024); https://yoshuabengio.org/wp-content/uploads/2024/09/FlexHEG-Interim-Report_2024.pdf

  1185. S. Costanza-Chock, I. D. Raji, J. Buolamwini, “Who Audits the Auditors? Recommendations from a Field Scan of the Algorithmic Auditing Ecosystem” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 1571–1583; https://doi.org/10.1145/3531146.3533213

  1186. M. Feffer, M. Skirpan, Z. Lipton, H. Heidari, “From Preference Elicitation to Participatory ML: A Critical Survey & Guidelines for Future Research” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES ‘23) (ACM, Montréal QC Canada, 2023), pp. 38–48; https://doi.org/10.1145/3600211.3604661 2

  1187. F. Delgado, S. Yang, M. Madaio, Q. Yang, “The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice” in Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘23) (Association for Computing Machinery, New York, NY, USA, 2023), pp. 1–23; https://doi.org/10.1145/3617694.3623261

  1188. J. Metcalf, E. Moss, E. A. Watkins, R. Singh, M. C. Elish, “Algorithmic Impact Assessments and Accountability: The Co-Construction of Impacts” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘21) (Association for Computing Machinery, New York, NY, USA, 2021), pp. 735–746; https://doi.org/10.1145/3442188.3445935

  1189. D. Martin Jr, V. Prabhakaran, J. Kuhlberg, A. Smart, W. S. Isaac, “Participatory Problem Formulation for Fairer Machine Learning Through Community Based System Dynamics” in ICLR Workshop on Machine Learning in Real Life (2020); https://doi.org/10.48550/arXiv.2005.07572

  1190. S. Fazelpour, M. De-Arteaga, Diversity in Sociotechnical Machine Learning Systems. Big Data & Society 9, 205395172210820 (2022); https://doi.org/10.1177/20539517221082027

  1191. C. Knight, Reflective Equilibrium. (2023); https://plato.stanford.edu/entries/reflective-equilibrium/

  1192. P. Kalluri, Don’t Ask If Artificial Intelligence Is Good or Fair, Ask How It Shifts Power. Nature 583, 169 (2020); https://doi.org/10.1038/d41586-020-02003-2

  1193. R. Dobbe, T. Krendl Gilbert, Y. Mintz, Hard Choices in Artificial Intelligence. Artificial Intelligence 300, 103555 (2021); https://doi.org/10.1016/j.artint.2021.103555

  1194. [industry] S. Fort, B. Lakshminarayanan, Ensemble Everything Everywhere: Multi-Scale Aggregation for Adversarial Robustness, arXiv [cs.CV] (2024); http://arxiv.org/abs/2408.05446

  1195. A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, R. Wang, Z. Kolter, M. Fredrikson, D. Hendrycks, Improving Alignment and Robustness with Circuit Breakers, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.04313 2

  1196. M. Williams, M. Carroll, A. Narang, C. Weisser, B. Murphy, A. Dragan, On Targeted Manipulation and Deception When Optimizing LLMs for User Feedback, arXiv [cs.LG] (2024); http://arxiv.org/abs/2411.02306

  1197. S. Arnesen, D. Rein, J. Michael, Training Language Models to Win Debates with Self-Play Improves Judge Accuracy, arXiv [cs.CL] (2024); http://arxiv.org/abs/2409.16636 2

  1198. Z. Kenton, N. Y. Siegel, J. Kramar, J. Brown-Cohen, S. Albanie, J. Bulian, R. Agarwal, D. Lindner, Y. Tang, N. Goodman, R. Shah, “On Scalable Oversight with Weak LLMs Judging Strong LLMs” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) (2024); https://openreview.net/forum?id=O1fp9nVraj 2 3

  1199. A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, … D. Hendrycks, Representation Engineering: A Top-Down Approach to AI Transparency, arXiv [cs.LG] (2023); http://arxiv.org/abs/2310.01405 2

  1200. S. Casper, L. Schulze, O. Patel, D. Hadfield-Menell, Defending Against Unforeseen Failure Modes with Latent Adversarial Training, arXiv [cs.CR] (2024); http://arxiv.org/abs/2403.05030 2

  1201. T. R. Shaham, S. Schwettmann, F. Wang, A. Rajaram, E. Hernandez, J. Andreas, A. Torralba, A Multimodal Automated Interpretability Agent (2024); https://openreview.net/forum?id=mDw42ZanmE 2

  1202. [industry] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, G. Irving, “Alignment of Language Agents” (Google DeepMind, 2021); http://arxiv.org/abs/2103.14659

  1203. [industry] C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, I. Sutskever, J. Wu, Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision, arXiv [cs.CL] (2023); http://arxiv.org/abs/2312.09390

  1204. [industry] J. Michael, S. Mahdi, D. Rein, J. Petty, J. Dirani, V. Padmakumar, S. R. Bowman, Debate Helps Supervise Unreliable Experts, arXiv [cs.AI] (2023); http://arxiv.org/abs/2311.08702

  1205. Y. Bengio, M. K. Cohen, N. Malkin, M. MacDermott, D. Fornasiere, P. Greiner, Y. Kaddar, Can a Bayesian Oracle Prevent Harm from an Agent?, arXiv [cs.AI] (2024); http://arxiv.org/abs/2408.05284

  1206. M. Wu, A. F. Aji, Style Over Substance: Evaluation Biases for Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2307.03025

  1207. [industry] N. Lambert, R. Calandra, The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback, arXiv [cs.LG] (2023); http://arxiv.org/abs/2311.00168

  1208. H. Bansal, J. Dang, A. Grover, “Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=dKl6lMwbCy

  1209. [industry] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, I. Higgins, “Solving Math Word Problems with Process- and Outcome-Based Feedback” (Google Deepmind, 2022); https://doi.org/10.48550/arXiv.2211.14275

  1210. H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, K. Cobbe, “Let’s Verify Step by Step” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=v8L0pN6EOi

  1211. Z. Wu, Y. Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A. Smith, M. Ostendorf, H. Hajishirzi, “Fine-Grained Human Feedback Gives Better Rewards for Language Model Training” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=CSbGXyCswu

  1212. Z. Li, The Dark Side of ChatGPT: Legal and Ethical Challenges from Stochastic Parrots and Hallucination, arXiv [cs.CY] (2023); http://arxiv.org/abs/2304.14347

  1213. [industry] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, … J. Kaplan, A General Language Assistant as a Laboratory for Alignment, arXiv [cs.CL] (2021); http://arxiv.org/abs/2112.00861

  1214. K. Shuster, S. Poff, M. Chen, D. Kiela, J. Weston, “Retrieval Augmentation Reduces Hallucination in Conversation” in Findings of the Association for Computational Linguistics: EMNLP 2021, M.-F. Moens, X. Huang, L. Specia, S. W.-T. Yih, Eds. (Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021), pp. 3784–3803; https://doi.org/10.18653/v1/2021.findings-emnlp.320

  1215. L. Kuhn, Y. Gal, S. Farquhar, “Semantic Uncertainty: Linguistic Invariances For Uncertainty Estimation in Natural Language Generation” in 11th International Conference on Learning Representations (ICLR 2023) (Kigali, Rwanda, 2023); https://openreview.net/forum?id=VD-AYtP0dve

  1216. S. Min, K. Krishna, X. Lyu, M. Lewis, W.-T. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, “FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA, USA, 2023), pp. 12076–12100; https://doi.org/10.18653/v1/2023.emnlp-main.741

  1217. L. Chen, A. Perez-Lebel, F. M. Suchanek, G. Varoquaux, Reconfidencing LLMs from the Grouping Loss Perspective, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.04957

  1218. D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, J. Gilmer, “The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), pp. 8320–8329; https://doi.org/10.1109/ICCV48922.2021.00823

  1219. [industry] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, … J. Kaplan, Language Models (mostly) Know What They Know, arXiv [cs.CL] (2022); http://arxiv.org/abs/2207.05221

  1220. [industry] Y. A. Yadkori, I. Kuzborskij, A. György, C. Szepesvári, To Believe or Not to Believe Your LLM, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.02543

  1221. S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, A. Mueller, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2403.19647 2

  1222. [industry] T. Lieberum, M. Rahtz, J. Kramár, N. Nanda, G. Irving, R. Shah, V. Mikulik, “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla” (Google Deepmind, 2023); https://doi.org/10.48550/arXiv.2307.09458

  1223. E. Mitchell, C. Lin, A. Bosselut, C. D. Manning, C. Finn, “Memory-Based Model Editing at Scale” in Proceedings of the 39th International Conference on Machine Learning (PMLR, 2022), pp. 15817–15831; https://proceedings.mlr.press/v162/mitchell22a.html

  1224. K. Meng, A. S. Sharma, A. J. Andonian, Y. Belinkov, D. Bau, “Mass-Editing Memory in a Transformer” in 11th International Conference on Learning Representations (ICLR 2023) (Kigali, Rwanda, 2022); https://openreview.net/forum?id=MkbcAHIYgyS

  1225. Y. Gandelsman, A. A. Efros, J. Steinhardt, “Interpreting CLIP’s Image Representation via Text-Based Decomposition” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=5Ca9sSzuDp

  1226. C. Tan, G. Zhang, J. Fu, “Massive Editing for Large Language Models via Meta Learning” in The 12th International Conference on Learning Representations (ICLR 2024) (Vienna, Austria, 2023); https://openreview.net/forum?id=L6L1CJQ2PE

  1227. S. Wang, Y. Zhu, H. Liu, Z. Zheng, C. Chen, J. Li, Knowledge Editing for Large Language Models: A Survey, arXiv [cs.CL] (2023); http://arxiv.org/abs/2310.16218 2

  1228. A. Ghorbani, J. Y. Zou, “Neuron Shapley: Discovering the Responsible Neurons” in Advances in Neural Information Processing Systems (NeurIPS 2020) (Curran Associates, Inc., 2020) vol. 33, pp. 5922–5932; https://proceedings.neurips.cc/paper/2020/hash/41c542dfe6e4fc3deb251d64cf6ed2e4-Abstract.html

  1229. X. Wu, J. Li, M. Xu, W. Dong, S. Wu, C. Bian, D. Xiong, “DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Gateway, Singapore , 2023), pp. 2875–2886; https://doi.org/10.18653/v1/2023.emnlp-main.174

  1230. K. Li, O. Patel, F. Viégas, H. Pfister, M. Wattenberg, “Inference-Time Intervention: Eliciting Truthful Answers from a Language Model” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=aLLuYpn83y

  1231. N. Belrose, D. Schneider-Joseph, S. Ravfogel, R. Cotterell, E. Raff, S. Biderman, “LEACE: Perfect Linear Concept Erasure in Closed Form” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (New Orleans, LA, USA, 2023); https://openreview.net/forum?id=awIpKpwTwF&noteId=Ju4XcafMir

  1232. A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, M. MacDiarmid, Activation Addition: Steering Language Models Without Optimization, arXiv [cs.CL] (2023); http://arxiv.org/abs/2308.10248

  1233. E. Hernandez, B. Z. Li, J. Andreas, Inspecting and Editing Knowledge Representations in Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2304.00740

  1234. D. Brown, C. Godfrey, C. Nizinski, J. Tu, H. Kvinge, “Robustness of Edited Neural Networks” in ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo 2023) (Kigali, Rwanda, 2023); https://openreview.net/forum?id=JAjH6VANZ4

  1235. [industry] C. Anil, E. Durmus, M. Sharma, J. Benton, S. Kundu, J. Batson, N. Rimsky, M. Tong, J. Mu, D. Ford, F. Mosconi, R. Agrawal, R. Schaeffer, N. Bashkansky, S. Svenningsen, M. Lambert, A. Radhakrishnan, … D. Duvenaud, “Many-Shot Jailbreaking” (Anthropic, 2024); https://www-cdn.anthropic.com/af5633c94ed2beb282f6a53c595eb437e8e7b630/Many_Shot_Jailbreaking__2024_04_02_0936.pdf

  1236. Y. Deng, W. Zhang, S. J. Pan, L. Bing, “Multilingual Jailbreak Challenges in Large Language Models” in 12th International Conference on Learning Representations (2024); https://openreview.net/forum?id=vESNKdEMGp

  1237. Y. Yuan, W. Jiao, W. Wang, J.-T. Huang, P. He, S. Shi, Z. Tu, “GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher” in 12th International Conference on Learning Representations (2024); https://openreview.net/forum?id=MbfAK4s61A

  1238. P. Ding, J. Kuang, D. Ma, X. Cao, Y. Xian, J. Chen, S. Huang, “A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily” in North American Chapter of the Association for Computational Linguistics (2023); https://api.semanticscholar.org/CorpusID:265664913

  1239. Z. Wei, Y. Wang, A. Li, Y. Mo, Y. Wang, Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations, arXiv [cs.LG] (2023); http://arxiv.org/abs/2310.06387

  1240. [industry] M. Russinovich, A. Salem, R. Eldan, Great, Now Write an Article about That: The Crescendo Multi-Turn LLM Jailbreak Attack, arXiv [cs.CR] (2024); http://arxiv.org/abs/2404.01833

  1241. A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, “Towards Deep Learning Models Resistant to Adversarial Attacks” in The 6th International Conference on Learning Representations (ICLR 2018) (Vancouver, BC, Canada, 2018); https://openreview.net/forum?id=rJzIBfZAb

  1242. S. Friedler, R. Singh, B. Blili-Hamelin, J. Metcalf, B. J. Chen, “AI Red-Teaming Is Not a One-Stop Solution to AI Harms: Recommendations for Using Red-Teaming for AI Accountability” (Data & Society, 2023); https://datasociety.net/library/ai-red-teaming-is-not-a-one-stop-solution-to-ai-harms-recommendations-for-using-red-teaming-for-ai-accountability/

  1243. N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-Y. Chiang, M. Goldblum, A. Saha, J. Geiping, T. Goldstein, Baseline Defenses for Adversarial Attacks Against Aligned Language Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2309.00614 2

  1244. S. Lee, M. Kim, L. Cherif, D. Dobre, J. Lee, S. J. Hwang, K. Kawaguchi, G. Gidel, Y. Bengio, N. Malkin, M. Jain, Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.18540 2

  1245. A. Peng, J. Michael, H. Sleight, E. Perez, M. Sharma, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, arXiv [cs.CL] (2024); http://arxiv.org/abs/2411.07494

  1246. Z. Liu, G. Dou, Z. Tan, Y. Tian, M. Jiang, Towards Safer Large Language Models through Machine Unlearning, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.10058

  1247. A. Lynch, P. Guo, A. Ewart, S. Casper, D. Hadfield-Menell, Eight Methods to Evaluate Robust Unlearning in LLMs, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.16835 2

  1248. D. Gamage, J. Chen, K. Sasahara, “The Emergence of Deepfakes and Its Societal Implications: A Systematic Review” in Conference for Truth and Trust Online 2021 (2021), pp. 28–39; https://www.researchgate.net/publication/355583941_The_Emergence_of_Deepfakes_and_its_Societal_Implications_A_Systematic_Review

  1249. A. Kaushal, A. Mina, A. Meena, T. H. Babu, “The Societal Impact of Deepfakes: Advances in Detection and Mitigation” in 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT) (2023), pp. 1–7; https://doi.org/10.1109/ICCCNT56998.2023.10307353

  1250. F. Romero Moreno, Generative AI and Deepfakes: A Human Rights Approach to Tackling Harmful Content. International Review of Law Computers & Technology 38, 297–326 (2024); https://doi.org/10.1080/13600869.2024.2324540

  1251. R. Tang, Y.-N. Chuang, X. Hu, The Science of Detecting LLM-Generated Text. Communications of the ACM 67, 50–59 (04/2024); https://doi.org/10.1145/3624725

  1252. K. Krishna, Y. Song, M. Karpinska, J. F. Wieting, M. Iyyer, “Paraphrasing Evades Detectors of AI-Generated Text, but Retrieval Is an Effective Defense” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) (2023); https://openreview.net/pdf?id=WbFhFvjjKj

  1253. L. Lin, N. Gupta, Y. Zhang, H. Ren, C.-H. Liu, F. Ding, X. Wang, X. Li, L. Verdoliva, S. Hu, Detecting Multimedia Generated by Large AI Models: A Survey, arXiv [cs.MM] (2024); http://arxiv.org/abs/2402.00045

  1254. R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, L. Verdoliva, “On The Detection of Synthetic Images Generated by Diffusion Models” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023), pp. 1–5; https://doi.org/10.1109/ICASSP49357.2023.10095167

  1255. U. Ojha, Y. Li, Y. J. Lee, “Towards Universal Fake Image Detectors That Generalize Across Generative Models” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE Computer Society, 2023), pp. 24480–24489; https://doi.org/10.1109/CVPR52729.2023.02345

  1256. H. B. Wee, J. D. Reimer, Non-English Academics Face Inequality via AI-Generated Essays and Countermeasure Tools. Bioscience 73, 476–478 (2023); https://doi.org/10.1093/biosci/biad034

  1257. Y. Zhao, T. Pang, C. Du, X. Yang, N.-M. Cheung, M. Lin, A Recipe for Watermarking Diffusion Models, arXiv [cs.CV] (2023); http://arxiv.org/abs/2303.10137

  1258. M. Christ, S. Gunn, O. Zamir, “Undetectable Watermarks for Language Models” in Proceedings of 37th Conference on Learning Theory, S. Agrawal, A. Roth, Eds. (PMLR, 2024) vol. 247 of Proceedings of Machine Learning Research, pp. 1125–1139; https://proceedings.mlr.press/v247/christ24a.html

  1259. J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein, “A Watermark for Large Language Models” in Proceedings of the 40th International Conference on Machine Learning (PMLR, 2023), pp. 17061–17084; https://proceedings.mlr.press/v202/kirchenbauer23a.html

  1260. Y. Liu, Y. Bu, “Adaptive Text Watermark for Large Language Models” in Forty-First International Conference on Machine Learning (2024); https://openreview.net/forum?id=7emOSb5UfX

  1261. A. Liu, L. Pan, Y. Lu, J. Li, X. Hu, X. Zhang, L. Wen, I. King, H. Xiong, P. S. Yu, A Survey of Text Watermarking in the Era of Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2312.07913

  1262. H. Zhang, B. L. Edelman, D. Francati, D. Venturi, G. Ateniese, B. Barak, Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2311.04378

  1263. A. Knott, D. Pedreschi, R. Chatila, T. Chakraborti, S. Leavy, R. Baeza-Yates, D. Eyers, A. Trotman, P. D. Teal, P. Biecek, S. Russell, Y. Bengio, Generative AI Models Should Include Detection Mechanisms as a Condition for Public Release. Ethics and Information Technology 25, 55 (2023); https://doi.org/10.1007/s10676-023-09728-4

  1264. C2PA, Overview (2022); https://c2pa.org/

  1265. AI for Good, AI and Multimedia Authenticity Standards Collaboration (2024); https://aiforgood.itu.int/multimedia-authenticity/

  1266. A. Al-Dhaqm, R. A. Ikuesan, V. R. Kebande, S. A. Razak, G. Grispos, K.-K. R. Choo, B. A. S. Al-Rimy, A. A. Alsewari, Digital Forensics Subdomains: The State of the Art and Future Directions. IEEE Access 9, 152476–152502 (2021); https://doi.org/10.1109/ACCESS.2021.3124262

  1267. F. Casino, T. K. Dasaklis, G. P. Spathoulas, M. Anagnostopoulos, A. Ghosal, I. Borocz, A. Solanas, M. Conti, C. Patsakis, Research Trends, Challenges, and Emerging Topics in Digital Forensics: A Review of Reviews. IEEE Access 10, 25464–25493 (2022); https://doi.org/10.1109/ACCESS.2022.3154059

  1268. H. R. Hasan, K. Salah, Combating Deepfake Videos Using Blockchain and Smart Contracts. IEEE Access: Practical Innovations, Open Solutions 7, 41596–41606 (2019); https://doi.org/10.1109/access.2019.2905689

  1269. C. C. Ki Chan, V. Kumar, S. Delaney, M. Gochoo, “Combating Deepfakes: Multi-LSTM and Blockchain as Proof of Authenticity for Digital Media” in 2020 IEEE / ITU International Conference on Artificial Intelligence for Good (AI4G) (IEEE, 2020); https://doi.org/10.1109/ai4g50087.2020.9311067

  1270. P. Fraga-Lamas, T. M. Fernández-Caramés, Fake News, Disinformation, and Deepfakes: Leveraging Distributed Ledger Technologies and Blockchain to Combat Digital Deception and Counterfeit Reality, arXiv [cs.CY] (2019); http://dx.doi.org/10.1109/MITP.2020.2977589

  1271. S. Mohammad Niyaz Khan, J. Mohd Ghazali, L. Q. Zakaria, S. N. Ahmad, K. A. Elias, Various Image Classification Using Certain Exchangeable Image File Format (EXIF) Metadata of Images. Malaysian Journal of Information and Communication Technology (MyJICT), 1–12 (2018); https://doi.org/10.53840/myjict3-1-33

  1272. A. Chan, C. Ezell, M. Kaufmann, K. Wei, L. Hammond, H. Bradley, E. Bluemke, N. Rajkumar, D. Krueger, N. Kolt, L. Heim, M. Anderljung, “Visibility into AI Agents” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024); https://doi.org/10.1145/3630106.3658948 2

  1273. A. Chan, N. Kolt, P. Wills, U. Anwar, C. S. de Witt, N. Rajkumar, L. Hammond, D. Krueger, L. Heim, M. Anderljung, IDs for AI Systems, arXiv [cs.AI] (2024); http://arxiv.org/abs/2406.12137

  1274. B. Pan, N. Stakhanova, S. Ray, Data Provenance in Security and Privacy. ACM Computing Surveys 55, 1–35 (2023); https://doi.org/10.1145/3593294

  1275. E. Laird, M. Dwyer, “Off Task: EdTech Threats to Student Privacy and Equity in the Age of AI” (Center for Democracy and Technology, 2023); https://cdt.org/insights/report-off-task-edtech-threats-to-student-privacy-and-equity-in-the-age-of-ai/

  1276. S. S. El Mokadem, The Effect of Media Literacy on Misinformation and Deep Fake Video Detection. Arab Media & Society (2023); https://www.arabmediasociety.com/the-effect-of-media-literacy-on-misinformation-and-deep-fake-video-detection/

  1277. Y. Hwang, J. Y. Ryu, S.-H. Jeong, Effects of Disinformation Using Deepfake: The Protective Effect of Media Literacy Education. Cyberpsychology, Behavior and Social Networking 24, 188–193 (2021); https://doi.org/10.1089/cyber.2020.0174

  1278. S. Y. Shin, J. Lee, The Effect of Deepfake Video on News Credibility and Corrective Influence of Cost-Based Knowledge about Deepfakes. Digital Journalism 10, 412–432 (2022); https://doi.org/10.1080/21670811.2022.2026797

  1279. S. Qian, C. Shen, J. Zhang, Fighting Cheapfakes: Using a Digital Media Literacy Intervention to Motivate Reverse Search of out-of-Context Visual Misinformation. Journal of Computer-Mediated Communication: JCMC 28 (2022); https://doi.org/10.1093/jcmc/zmac024 2

  1280. T. Ali, P. Kostakos, HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs), arXiv [cs.CR] (2023); http://arxiv.org/abs/2309.16021

  1281. G. Pang, C. Shen, L. Cao, A. Van Den Hengel, Deep Learning for Anomaly Detection: A Review. ACM Computing Surveys 54, 38:1–38:38 (2021); https://doi.org/10.1145/3439950

  1282. J. Geng, F. Cai, Y. Wang, H. Koeppl, P. Nakov, I. Gurevych, A Survey of Confidence Estimation and Calibration in Large Language Models, arXiv [cs.CL] (2023); http://arxiv.org/abs/2311.08298

  1283. A. Aldahdooh, W. Hamidouche, S. A. Fezza, O. Déforges, Adversarial Example Detection for DNN Models: A Review and Experimental Comparison. Artificial Intelligence Review 55, 4403–4462 (2022); https://doi.org/10.1007/s10462-021-10125-w

  1284. J. Hayase, W. Kong, R. Somani, S. Oh, “SPECTRE: Defending against Backdoor Attacks Using Robust Statistics” in Proceedings of the 38th International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR, 2021) vol. 139 of Proceedings of Machine Learning Research, pp. 4129–4139; https://proceedings.mlr.press/v139/hayase21a.html

  1285. A. T. Mallen, N. Belrose, “Eliciting Latent Knowledge from Quirky Language Models” in ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models (2024); https://openreview.net/forum?id=Z1531QeqAQ

  1286. [industry] M. MacDiarmid, T. Maxwell, N. Schiefer, J. Mu, J. Kaplan, D. Duvenaud, S. Bowman, A. Tamkin, E. Perez, M. Sharma, C. Denison, E. Hubinger, Simple Probes Can Catch Sleeper Agents (2024); https://www.anthropic.com/news/probes-catch-sleeper-agents

  1287. S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, N. Dziri, “WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs” in 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024); https://openreview.net/forum?id=Ich4tv4202 2

  1288. R. Greenblatt, B. Shlegeris, K. Sachan, F. Roger, AI Control: Improving Safety Despite Intentional Subversion, arXiv [cs.LG] (2023); http://arxiv.org/abs/2312.06942

  1289. M. Phute, A. Helbling, M. D. Hull, S. Peng, S. Szyller, C. Cornelius, D. H. Chau, “LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked” in The Second Tiny Papers Track at ICLR 2024 (Vienna, Austria, 2024); https://openreview.net/forum?id=YoqgcIA19o

  1290. [industry] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, M. Khabsa, Llama Guard: LLM-Based Input-Output Safeguard for Human-AI Conversations, arXiv [cs.CL] (2023); http://arxiv.org/abs/2312.06674

  1291. T. Kim, S. Kotha, A. Raghunathan, Jailbreaking Defenses with the Purple Problem, arXiv [cs.CR] (2024); http://arxiv.org/abs/2403.14725

  1292. S. O. Hansson, M.-Å. Belin, B. Lundgren, Self-Driving Vehicles—an Ethical Overview. Philosophy & Technology 34, 1383–1408 (2021); https://doi.org/10.1007/s13347-021-00464-5

  1293. N. R. Jennings, L. Moreau, D. Nicholson, S. Ramchurn, S. Roberts, T. Rodden, A. Rogers, Human-Agent Collectives. Communications of the ACM 57, 80–88 (2014); https://doi.org/10.1145/2629559

  1294. [industry] A. Dafoe, E. Hughes, Y. Bachrach, T. Collins, K. R. McKee, J. Z. Leibo, K. Larson, T. Graepel, Open Problems in Cooperative AI, arXiv [cs.AI] (2020); http://arxiv.org/abs/2012.08630

  1295. A. Dafoe, Y. Bachrach, G. Hadfield, E. Horvitz, K. Larson, T. Graepel, Cooperative AI: Machines Must Learn to Find Common Ground. Nature 593, 33–36 (2021); https://doi.org/10.1038/d41586-021-01170-0

  1296. D. Hadfield-Menell, A. Dragan, P. Abbeel, S. Russell, “Cooperative Inverse Reinforcement Learning” in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016) (Curran Associates Inc., Red Hook, NY, USA, 2016), pp. 3916–3924; https://papers.nips.cc/paper_files/paper/2016/hash/c3395dd46c34fa7fd8d729d8cf88b7a8-Abstract.html

  1297. I. Seeber, E. Bittner, R. O. Briggs, T. de Vreede, G.-J. de Vreede, A. Elkins, R. Maier, A. B. Merz, S. Oeste-Reiß, N. Randrup, G. Schwabe, M. Söllner, Machines as Teammates: A Research Agenda on AI in Team Collaboration. Information & Management 57, 103174 (2020); https://doi.org/10.1016/j.im.2019.103174

  1298. R. Shah, P. Freire, N. Alex, R. Freedman, D. Krasheninnikov, L. Chan, M. D. Dennis, P. Abbeel, A. Dragan, S. Russell, Benefits of Assistance over Reward Learning (2020); https://openreview.net/forum?id=DFIoGDZejIB

  1299. S. D. Ramchurn, S. Stein, N. R. Jennings, Trustworthy Human-AI Partnerships. iScience 24, 102891 (2021); https://doi.org/10.1016/j.isci.2021.102891

  1300. X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, L. He, A Survey of Human-in-the-Loop for Machine Learning. Future Generations Computer Systems: FGCS 135, 364–381 (2022); https://doi.org/10.1016/j.future.2022.05.014

  1301. K. L. Mosier, L. J. Skitka, Automation Use and Automation Bias. Proceedings of the Human Factors and Ergonomics Society … Annual Meeting. Human Factors and Ergonomics Society. Annual Meeting 43, 344–348 (1999); https://doi.org/10.1177/154193129904300346

  1302. J. Babcock, J. Krámar, R. V. Yampolskiy, “Guidelines for Artificial Intelligence Containment” in Next-Generation Ethics: Engineering a Better Society, A. E. Abbas, Ed. (Cambridge University Press, Cambridge, 2019), pp. 90–112; https://doi.org/10.1017/9781108616188.008

  1303. S. G. Patil, T. Zhang, V. Fang, N. C., R. Huang, A. Hao, M. Casado, J. E. Gonzalez, R. A. Popa, I. Stoica, GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications, arXiv [cs.CL] (2024); http://arxiv.org/abs/2404.06921

  1304. J. Gryz, M. Rojszczak, Black Box Algorithms and the Rights of Individuals: No Easy Solution to the “explainability” Problem. Internet Policy Review 10 (2021); https://policyreview.info/articles/analysis/black-box-algorithms-and-rights-individuals-no-easy-solution-explainability

  1305. J. A. McDermid, Y. Jia, Z. Porter, I. Habli, Artificial Intelligence Explainability: The Technical and Ethical Dimensions. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 379, 20200363 (2021); https://doi.org/10.1098/rsta.2020.0363

  1306. T. Ploug, S. Holm, “Right to Contest AI Diagnostics Defining Transparency and Explainability Requirements from a Patient’s Perspective” in Artificial Intelligence in Medicine (Springer Publishing Company, 2022), pp. 227–238; https://doi.org/10.1007/978-3-030-64573-1_267

  1307. S. H. Tanneru, D. Ley, C. Agarwal, H. Lakkaraju, On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.10625

  1308. [industry] J. Chua, E. Rees, H. Batra, S. R. Bowman, J. Michael, E. Perez, M. Turpin, Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought, arXiv [cs.CL] (2024); http://arxiv.org/abs/2403.05518

  1309. [industry] A. Radhakrishnan, K. Nguyen, A. Chen, C. Chen, C. Denison, D. Hernandez, E. Durmus, E. Hubinger, J. Kernion, K. Lukošiūtė, N. Cheng, N. Joseph, N. Schiefer, O. Rausch, S. McCandlish, S. El Showk, T. Lanham, … E. Perez, Question Decomposition Improves the Faithfulness of Model-Generated Reasoning, arXiv [cs.CL] (2023); http://arxiv.org/abs/2307.11768

  1310. J. Li, P. Cao, Y. Chen, K. Liu, J. Zhao, Towards Faithful Chain-of-Thought: Large Language Models Are Bridging Reasoners, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.18915

  1311. D. Paul, R. West, A. Bosselut, B. Faltings, Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning, arXiv [cs.CL] (2024); http://arxiv.org/abs/2402.13950

  1312. A. Saranya, R. Subhashini, A Systematic Review of Explainable Artificial Intelligence Models and Applications: Recent Developments and Future Trends. Decision Analytics Journal 7, 100230 (2023); https://doi.org/10.1016/j.dajour.2023.100230

  1313. H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, M. Du, Explainability for Large Language Models: A Survey. ACM Transactions on Intelligent Systems and Technology 15, 1–38 (2024); https://doi.org/10.1145/3639372

  1314. S. Casper, C. Ezell, C. Siegmann, N. Kolt, T. L. Curtis, B. Bucknall, A. Haupt, K. Wei, J. Scheurer, M. Hobbhahn, L. Sharkey, S. Krishna, M. Von Hagen, S. Alberti, A. Chan, Q. Sun, M. Gerovitch, … D. Hadfield-Menell, “Black-Box Access Is Insufficient for Rigorous AI Audits” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (ACM, New York, NY, USA, 2024), pp. 2254–2272; https://doi.org/10.1145/3630106.3659037

  1315. O. Aarne, T. Fist, C. Withers, “Secure, Governable Chips: Using On-Chip Mechanisms to Manage National Security Risks from AI & Advanced Computing” ( Center for a New American Security, 2024); https://s3.us-east-1.amazonaws.com/files.cnas.org/documents/CNAS-Report-Tech-Secure-Chips-Jan-24-finalb.pdf

  1316. G. Kulp, D. Gonzales, E. Smith, L. Heim, P. Puri, M. Vermeer, Z. Winkelman, “Hardware-Enabled Governance Mechanisms” (RAND Corporation, 2024); https://www.rand.org/pubs/working_papers/WRA3056-1.html

  1317. Z. Ghodsi, T. Gu, S. Garg, SafetyNets: Verifiable Execution of Deep Neural Networks on an Untrusted Cloud. Advances in Neural Information Processing Systems 30 (2017); https://proceedings.neurips.cc/paper_files/paper/2017/file/6048ff4e8cb07aa60b6777b6f7384d52-Paper.pdf

  1318. H. Chen, C. Fu, B. D. Rouhani, J. Zhao, F. Koushanfar, “DeepAttest: An End-to-End Attestation Framework for Deep Neural Networks” in Proceedings of the 46th International Symposium on Computer Architecture (Association for Computing Machinery, New York, NY, USA, 2019)ISCA ‘19, pp. 487–498; https://doi.org/10.1145/3307650.3322251

  1319. H. Jia, M. Yaghini, C. A. Choquette-Choo, N. Dullerud, A. Thudi, V. Chandrasekaran, N. Papernot, “Proof-of-Learning: Definitions and Practice” in 2021 IEEE Symposium on Security and Privacy (SP) (IEEE, 2021), pp. 1039–1056; https://doi.org/10.1109/SP40001.2021.00106

  1320. S. Goldwasser, G. N. Rothblum, J. Shafer, A. Yehudayoff, “Interactive Proofs for Verifying Machine Learning” in 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), J. R. Lee, Ed. (Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2021) vol. 185 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 41:1–41:19; https://doi.org/10.4230/LIPIcs.ITCS.2021.41

  1321. [industry] Apple, “Apple Platform Security” (Apple, 2024); https://help.apple.com/pdf/security/en_US/apple-platform-security-guide.pdf

  1322. [industry] J. Zhu, H. Yin, P. Deng, A. Almeida, S. Zhou, Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study, arXiv [cs.DC] (2024); http://arxiv.org/abs/2409.03992

  1323. R. Anderson, S. Fuloria, “Who Controls the off Switch?” in 2010 First IEEE International Conference on Smart Grid Communications (IEEE, 2010), pp. 96–101; https://doi.org/10.1109/smartgrid.2010.5622026

  1324. Organisation for Economic Co-Operation and Development, “Emerging Privacy-Enhancing Technologies” (OECD, 2023); https://doi.org/10.1787/bf121be4-en

  1325. N. Subramani, S. Luccioni, J. Dodge, M. Mitchell, “Detecting Personal Information in Training Corpora: An Analysis” in Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, R. Gupta, Eds. (Association for Computational Linguistics, Toronto, Canada, 2023), pp. 208–220; https://doi.org/10.18653/v1/2023.trustnlp-1.18 2 3

  1326. Y. Elazar, A. Bhagia, I. H. Magnusson, A. Ravichander, D. Schwenk, A. Suhr, E. P. Walsh, D. Groeneveld, L. Soldaini, S. Singh, H. Hajishirzi, N. A. Smith, J. Dodge, “What’s In My Big Data?” in 12th International Conference on Learning Representations (2024); https://openreview.net/forum?id=RvfPnOkPV4

  1327. A. Narayanan, V. Shmatikov, “Robust De-Anonymization of Large Sparse Datasets” in 2008 IEEE Symposium on Security and Privacy (sp 2008) (2008), pp. 111–125; https://doi.org/10.1109/SP.2008.33

  1328. H. Brown, K. Lee, F. Mireshghallah, R. Shokri, F. Tramèr, “What Does It Mean for a Language Model to Preserve Privacy?” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘22) (Association for Computing Machinery, New York, NY, USA, 2022), pp. 2280–2292; https://doi.org/10.1145/3531146.3534642 2

  1329. [industry] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, BloombergGPT: A Large Language Model for Finance, arXiv [cs.LG] (2023); http://arxiv.org/abs/2303.17564

  1330. G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, H. Alobeidli, A. Cappelli, B. Pannier, E. Almazrouei, J. Launay, “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only” in 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Datasets and Benchmarks Track (New Orleans, LA, USA, 2023); https://openreview.net/pdf?id=kM5eGcdCzq

  1331. T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. Iii, K. Crawford, Datasheets for Datasets. Communications of the ACM 64, 86–92 (2021); https://doi.org/10.1145/3458723

  1332. A. Ghorbani, J. Zou, “Data Shapley: Equitable Valuation Of Data for Machine Learning” in Proceedings of the 36th International Conference on Machine Learning (ICML 2019), K. Chaudhuri, R. Salakhutdinov, Eds. (PMLR, New Orleans, LA, USA, 2019) vol. 97 of Proceedings of Machine Learning Research, pp. 2242–2251; https://proceedings.mlr.press/v97/ghorbani19c.html

  1333. T. Li, E. F. Villaronga, P. Kieseberg, Humans Forget, Machines Remember: Artificial Intelligence and the Right to Be Forgotten. Computer Law & Security Review 34, 304 (2018); https://scholarship.law.bu.edu/faculty_scholarship/817

  1334. Z. Zhang, M. Jia, H.-P. Lee, B. Yao, S. Das, A. Lerner, D. Wang, T. Li, “It’s a Fair Game”, or Is It? Examining How Users Navigate Disclosure Risks and Benefits When Using LLM-Based Conversational Agents, arXiv [cs.HC] (2023); http://dx.doi.org/10.1145/3613904.3642385

  1335. Z. Zhang, C. Shen, B. Yao, D. Wang, T. Li, Secret Use of Large Language Model (LLM), arXiv [cs.HC] (2024); http://arxiv.org/abs/2409.19450

  1336. C. Dwork, F. McSherry, K. Nissim, A. Smith, “Calibrating Noise to Sensitivity in Private Data Analysis” in Theory of Cryptography, S. Halevi, T. Rabin, Eds. (Springer, Berlin, Heidelberg, 2006) vol. 3876 of Lecture Notes in Computer Science; https://doi.org/10.1007/11681878_14

  1337. M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, “Deep Learning with Differential Privacy” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ‘16) (Association for Computing Machinery, New York, NY, USA, 2016), pp. 308–318; https://doi.org/10.1145/2976749.2978318

  1338. [industry] S. De, L. Berrada, J. Hayes, S. L. Smith, B. Balle, “Unlocking High-Accuracy Differentially Private Image Classification through Scale” (Google Deepmind, 2022); http://arxiv.org/abs/2204.13650

  1339. X. Li, F. Tramer, P. Liang, T. Hashimoto, “Large Language Models Can Be Strong Differentially Private Learners” in International Conference on Learning Representations 2022 (Virtual, 2022); https://openreview.net/forum?id=bVuP3ltATMz

  1340. D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, L. Wutschitz, S. Yekhanin, H. Zhang, “Differentially Private Fine-Tuning of Language Models” in International Conference on Learning Representations (2022); https://openreview.net/forum?id=Q42f0dfjECO

  1341. [industry] A. Kurakin, N. Ponomareva, U. Syed, L. MacDermed, A. Terzis, Harnessing Large-Language Models to Generate Private Synthetic Text, arXiv [cs.LG] (2023); http://arxiv.org/abs/2306.01684

  1342. R. Liu, J. Wei, F. Liu, C. Si, Y. Zhang, J. Rao, S. Zheng, D. Peng, D. Yang, D. Zhou, A. M. Dai, “Best Practices and Lessons Learned on Synthetic Data” in First Conference on Language Modeling (2024); https://openreview.net/forum?id=OJaWBhh61C

  1343. A. Yale, S. Dash, R. Dutta, I. Guyon, A. Pavao, K. P. Bennett, “Assessing Privacy and Quality of Synthetic Health Data” in Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse (ACM, New York, NY, USA, 2019); https://doi.org/10.1145/3359115.3359124

  1344. X. Tang, R. Shin, H. A. Inan, A. Manoel, F. Mireshghallah, Z. Lin, S. Gopi, J. Kulkarni, R. Sim, “Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation” in 12th International Conference on Learning Representations (2024); https://openreview.net/forum?id=oZtt0pRnOl

  1345. F. Mireshghallah, Y. Su, T. Hashimoto, J. Eisner, R. Shin, “Privacy-Preserving Domain Adaptation of Semantic Parsers” in ACL (1) (2023), pp. 4950–4970; https://doi.org/10.18653/v1/2023.acl-long.271

  1346. J. Mattern, Z. Jin, B. Weggenmann, B. Schölkopf, M. Sachan, “Differentially Private Language Models for Secure Data Sharing” in EMNLP (2022), pp. 4860–4873; https://aclanthology.org/2022.emnlp-main.323

  1347. T. Stadler, B. Oprisanu, C. Troncoso, “Synthetic Data – Anonymisation Groundhog Day” in 31st USENIX Security Symposium (USENIX Security 22) (USENIX Association, Boston, MA, USA, 2022), pp. 1451–1468; https://www.usenix.org/conference/usenixsecurity22/presentation/stadler

  1348. M. Meeus, F. Guepin, A.-M. Creţu, Y.-A. de Montjoye, “Achilles’ Heels: Vulnerable Record Identification in Synthetic Data Publishing” in 28th European Symposium on Research in Computer Security (ESORICS 2023), G. Tsudik, M. Conti, K. Liang, G. Smaragdakis, Eds. (Springer Nature Switzerland, The Hague, The Netherlands, 2024), pp. 380–399; https://doi.org/10.1007/978-3-031-51476-0_19

  1349. G. Ganev, E. De Cristofaro, On the Inadequacy of Similarity-Based Privacy Metrics: Reconstruction Attacks against “Truly Anonymous Synthetic Data’’, arXiv [cs.CR] (2023); http://arxiv.org/abs/2312.05114

  1350. R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, J. Wernsing, “CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy” in Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan, K. Q. Weinberger, Eds. (PMLR, New York, New York, USA, 2016) vol. 48 of Proceedings of Machine Learning Research, pp. 201–210; https://proceedings.mlr.press/v48/gilad-bachrach16.html

  1351. D. Kang, T. Hashimoto, I. Stoica, Y. Sun, “Scaling up Trustless DNN Inference with Zero-Knowledge Proofs” in NeurIPS 2023 Workshop on Regulatable ML (New Orleans, LA, US, 2023); https://openreview.net/forum?id=GjNRF5VTfn

  1352. B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, “CrypTen: Secure Multi-Party Computation Meets Machine Learning” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2021) vol. 34, pp. 4961–4973; https://papers.neurips.cc/paper/2021/hash/2754518221cfbc8d25c13a06a4cb8421-Abstract.html

  1353. P. Mohassel, Y. Zhang, “SecureML: A System for Scalable Privacy-Preserving Machine Learning” in 2017 IEEE Symposium on Security and Privacy (SP) (IEEE Computer Society, San Jose, CA, USA, 2017), pp. 19–38; https://doi.org/10.1109/SP.2017.12

  1354. O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, M. Costa, “Oblivious Multi-Party Machine Learning on Trusted Processors” in Proceedings of the 25th USENIX Conference on Security Symposium (SEC’16) (USENIX Association, Austin, TX, 2016), pp. 619–636; https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/ohrimenko

  1355. F. Tramer, D. Boneh, “Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware” in International Conference on Learning Representations (2019); https://openreview.net/forum?id=rJVorjCcKQ

  1356. [industry] J. Zhu, H. Yin, P. Deng, A. Almeida, S. Zhou, Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study, arXiv [cs.DC] (2024); http://arxiv.org/abs/2409.03992

  1357. T. South, J. Drean, A. Singh, G. Zyskind, R. Mahari, V. Sharma, P. Vepakomma, L. Kagal, S. Devadas, A. Pentland, “A Roadmap for End-to-End Privacy and Security in Generative AI” (MIT, 2024); https://doi.org/10.21428/e4baedd9.9af67664

  1358. A. Cavoukian, Privacy by Design: The 7 Foundational Principles. (2009); https://privacy.ucsc.edu/resources/privacy-by-design—foundational-principles.pdf

  1359. M. ElBaih, The Role of Privacy Regulations in AI Development (A Discussion of the Ways in Which Privacy Regulations Can Shape the Development of AI) (2023); https://doi.org/10.2139/ssrn.4589207

  1360. E. Rader, R. Wash, B. Brooks, “Stories as Informal Lessons about Security” in Proceedings of the Eighth Symposium on Usable Privacy and Security (ACM, New York, NY, USA, 2012); https://doi.org/10.1145/2335356.2335364

  1361. [industry] J. Lamb, Generative AI in Healthcare: Adoption Trends and What’s next (2024); https://www.mckinsey.com/industries/healthcare/our-insights/generative-ai-in-healthcare-adoption-trends-and-whats-next

  1362. G. Dhanuskodi, S. Guha, V. Krishnan, A. Manjunatha, M. O’Connor, R. Nertney, P. Rogers, Creating the First Confidential GPUs: The Team at NVIDIA Brings Confidentiality and Integrity to User Code and Data for Accelerated Computing. Queueing Systems. Theory and Applications 21, 68–93 (2023); https://doi.org/10.1145/3623393.3623391

  1363. X. Zhou, H. Kim, F. Brahman, L. Jiang, H. Zhu, X. Lu, F. Xu, B. Y. Lin, Y. Choi, N. Mireshghallah, R. L. Bras, M. Sap, HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions, arXiv [cs.AI] (2024); http://arxiv.org/abs/2409.16427

  1364. K. Tirumala, A. H. Markosyan, L. Zettlemoyer, A. Aghajanyan, “Memorization without Overfitting: Analyzing the Training Dynamics of Large Language Models” in 36th International Conference on Neural Information Processing Systems (NeurIPS 2022) (Curran Associates Inc., Red Hook, NY, USA, 2024); https://proceedings.neurips.cc/paper_files/paper/2022/file/fa0509f4dab6807e2cb465715bf2d249-Paper-Conference.pdf

  1365. N. Mireshghallah, H. Kim, X. Zhou, Y. Tsvetkov, M. Sap, R. Shokri, Y. Choi, “Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory” in ICLR (2024); https://openreview.net/forum?id=gmg7t8b4s0

  1366. M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, … M. Anderljung, Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims, arXiv [cs.CY] (2020);