Independent Review of The Future of Compute: Final report and recommendations
Updated 6 March 2023
The Rt Hon Jeremy Hunt MP
Chancellor of the Exchequer
The Rt Hon Michelle Donelan MP
Secretary of State for Science, Innovation and Technology
Dear Chancellor of the Exchequer and Secretary of State for Science, Innovation and Technology,
I was delighted to be asked to lead this review on the future of compute for the UK. Having spent three decades in academia as an AI researcher, and more recently in the technology industry, I have witnessed the transformative potential of compute first hand. Modern compute has given us the ability to simulate and model complex phenomena, and the power to use data-driven machine learning technologies to provide powerful AI tools for society. Compute underpins much of what we do today and our dependence on it will only grow. Much like electricity, rail travel, and the internet, large scale compute is part of the infrastructure of modern life, and its effect on society is hard to measure. Many existing economic sectors depend on compute, and new discoveries that will significantly advance the health and prosperity of society rely on compute.
I have experienced many transformative technological changes in computing throughout my career, but at no time have I felt the immensity of the technological opportunity that we have now. The next decade is sure to bring even more advances that will continue to astonish us. The UK needs to be well prepared to take advantage of these opportunities to sustain growth and cement its position as a Science and Technology Superpower. This review on the future of compute makes specific recommendations to the government that are essential to achieve the UK’s ambition.
Compute can be thought of in three, somewhat overlapping, broad areas. First is compute for AI. AI is absolutely transformative, and depends heavily on compute, with the largest AI models using many exaflops of compute. The UK has great talent in AI with a vibrant start-up ecosystem, but public investment in AI compute is seriously lagging. The economic value of AI is undeniable - the world’s trillion dollar tech companies are all betting heavily on AI. A second area is compute for modelling and simulation, which is used widely across the sciences (e.g. physics, biology, climate) and engineering (design, simulation). Public facilities are essential resources operating at local, national and international scales to enable computations that cannot be achieved at smaller scales. Third, we have cloud computing, often provided by the private sector, which has proven to be a tremendous catalyst for SMEs and has made on demand computing a core part of building efficient and flexible businesses. All three of these types of compute are converging - AI compute can be done on the cloud, HPC access models are becoming more cloud-like, and HPC is supporting more and more AI workloads.
We need several interventions if we want compute to unlock the world-leading high-growth potential of the UK. I would like to highlight a few recommendations here. First, we need a strategic vision, roadmap and national coordination. The UK’s public compute infrastructure is fragmented and we do not currently have a long-term plan. We need a national coordination body to deliver the vision for compute, that can provide long-term stability and adapt to the rapid pace of change in compute technology. Second, we need to make immediate investments in the path to exascale compute, using a phased approach outlined in this review, so that we are not falling behind our peers. Third, we need to increase capacity for AI research immediately to power the UK’s impressive AI research community and plan for further AI capacity as part of our exascale system.
In approaching this review, we have focused on identifying how the UK can invest in the right technologies, with the right architectures, to meet the needs of all users, broaden access and reap the true benefits of compute. That is why the recommendations of this review should be viewed as a long-term, holistic package that supports the creation of a vibrant ecosystem. This includes investing in domestic skills and attracting and retaining talent as a priority, supporting users at all levels of expertise and increasing awareness. We should collaborate with international partners, such as the US, Japan, and the EU, where we have had longstanding and valuable joint programmes which would be beneficial to retain. Any facilities that we build should be built and operated sustainably, and incorporate best practice in secure computing.
These recommendations are critical for many of the government’s priorities. I recognise the potential difficulty of implementing these recommendations during a time of economic and fiscal challenges. However, the potential growth that compute could unlock across the economy, and within our domestic tech sector, is significant. Indeed, I personally envisage a future compute ecosystem in which the private sector can build upon these investments to be at the forefront of affordable compute provision for all users. This is not the time to limit our future by delaying needed investments.
I have thoroughly enjoyed my time on this review and wish to thank the panel members Sue Daley, Shaheen Sayed, Graham Spittle and Anne Trefethen. Their diverse expertise spanning academia and industry, and their knowledge of compute, has been indispensable. I also wish to thank the review’s Secretariat for their tremendous contribution that went into producing this review.
Sincerely,
Zoubin Ghahramani FRS
Professor, University of Cambridge
Vice President of Research, Google
Introduction from the expert panel
Compute is a material part of modern life. It is among the critical technologies lying behind innovation, economic growth and scientific discoveries. Compute improves our everyday lives. It underpins all the tools, services and information we hold on our handheld devices - from search engines and social media, to streaming services and accurate weather forecasts. This technology may be invisible to the public, but life today would be very different without it.
Sectors across the UK economy, both new and old, are increasingly reliant upon compute. By leveraging the capability that compute provides, businesses of all sizes can extract value from the enormous quantity of data created every day; reduce the cost and time required for research and development (R&D); improve product design; accelerate decision making processes; and increase overall efficiency. Compute also enables advancements in transformative technologies, such as AI, which themselves lead to the creation of value and innovation across the economy. This all translates into higher productivity and profitability for businesses and robust economic growth for the UK as a whole.
Compute powers modelling, simulations, data analysis and scenario planning, and thereby enables researchers to develop new drugs; find new energy sources; discover new materials; mitigate the effects of climate change; and model the spread of pandemics. Compute is required to tackle many of today’s global challenges and brings invaluable benefits to our society.
Compute’s effects on society and the economy have already been and, crucially, will continue to be transformative. The scale of compute capabilities keeps accelerating at pace. The performance of the world’s fastest compute has grown by a factor of 626 since 2010. The compute requirements of the largest machine learning models have grown 10 billion times over the last 10 years. We expect compute demand to grow significantly as compute capability continues to increase. Technology today operates very differently to 10 years ago and, in a decade’s time, it will have changed once again.
Yet, despite compute’s value to the economy and society, the UK lacks a long-term vision for compute. The ecosystem is fragmented and complex for users to navigate. Existing compute capabilities are not fit to serve all users, particularly AI users, and are falling behind those of other advanced economies. As of November 2022, the UK had only a 1.3% share of global compute capacity and did not have a system in the top 25 of the Top 500 global systems. Meanwhile, other countries continue to bolster their compute capabilities, with many already testing, building or planning for exascale systems - the next generation of computing technology. Infrastructure investment in the UK is the result of piecemeal procurement and there are no coordinated efforts to mitigate the environmental impact of compute and ensure secure access to infrastructure. A scarcity of compute skills and limited access to future EU systems lead to an uncertain outlook for UK compute.
We were asked, as a panel with expertise in compute, to understand the UK’s compute needs over the next decade; identify cost-effective, future-facing interventions that may be required to ensure research and industry have access to internationally competitive compute; and establish a view of the role of compute in delivering the Integrated Review and securing our status as a Science Superpower this decade.
In line with the report of the Government Office for Science (GO-Science), we define compute or advanced compute as computer systems where processing power, memory, data storage and network are assembled at scale to tackle computational tasks beyond the capabilities of everyday computers. We have taken a long-term approach, considering what actions the government needs to take, now and in the decades to come. Underpinning our work is the belief that compute will unlock growth and innovation across the whole economy and will lead to transformative discoveries, improving the wellbeing of all citizens.
We have considered the broad range of compute users — established adopters and emerging users across academia, industry and the public sector. We have assessed the state of UK infrastructure, explored barriers to access and considered international best practices. We have identified the actions necessary to meet user needs, ensure the UK has cutting-edge compute capabilities and create a strong compute ecosystem.
Targeted intervention is required to fully deliver the value of compute and meet the government’s wider objectives. The extent of the cost and risk associated with computing technology means that private investment on its own will not be able to unleash the full potential of compute. Government action is urgently needed to ensure the UK keeps pace with computing advancements, addresses users’ needs and remains globally competitive. The breadth of users and compute capabilities means there is not a one-size-fits-all approach, and government and industry must work together to ensure success.
This review presents a set of 10 recommendations to the government on how the UK can harness the power of compute to achieve economic growth and address society’s greatest challenges. The government should unlock the world-leading, high-growth potential of compute through the creation of a strategic vision, increased coordination and broader use of compute. It should build world-class, sustainable compute capabilities. This includes investment in an exascale facility through a phased approach, to ensure ecosystem readiness and value for money. The government should also provide the resources necessary to train the largest AI models, support greater access to compute through the cloud and invest in sustainable compute infrastructure at all levels. Lastly, the government must empower the compute community by creating a sustainable skills pipeline, ensuring the security of compute systems and fostering international collaborations.
Crucially, creating a strong compute ecosystem and improving UK compute capability requires long-term action, extending to the next decade and beyond. All 10 recommendations should be implemented as part of a holistic approach to compute, and should be established within the government’s strategic vision. Only by harnessing the benefits of compute can the government realise its ambitions for the UK to sustain economic growth; achieve net zero; secure its status as a science and technology superpower; be a global AI superpower; and build a competitive and innovative digital economy. Compute is required to achieve each and every one of these objectives. Further, compute is closely linked to other core policies, including (but not limited to) digital competition, data, digital skills, semiconductors, quantum technology and sectoral policy.
We recognise that the cost of building world-leading systems is considerable. Even in difficult economic climates, it remains essential to plan for and invest in the future. By acting now, the government not only will ensure the UK remains a prosperous country, but will also deliver invaluable societal benefits. The UK is currently an international technology hub, a leader in research and innovation and hosts world-leading universities. It ranks third globally for investment and innovation in AI and in the development and uptake of advanced digital technologies. To capitalise on and further grow these strengths, the government must ensure the country has the necessary compute resources now, over the next decade and beyond. Inaction will be to the detriment of the UK’s scientific capability, innovative economy and overall international reputation.
We must urgently plan for the future of compute. While the challenges here are great, so are the opportunities. The recommendations outlined in this review aim to ensure the UK is on the path to success.
Professor Zoubin Ghahramani FRS
Vice President of Research at Google and Professor of Information Engineering at the University of Cambridge
Sue Daley
Director of Technology and Innovation, TechUK
Shaheen Sayed
Senior Managing Director, Accenture UK and Ireland
Dr Graham Spittle CBE
Dean of Innovation at Edinburgh University
Professor Anne Trefethen FREng
Pro-Vice Chancellor and Professor of Scientific Computing, University of Oxford
Report outline
The Government Office for Science’s 2021 report on large-scale computing and the Alan Turing Institute and Technopolis’ analysis of the digital research infrastructure requirements for AI have provided an overview of the UK’s compute ecosystem. Both reports have called for strategic planning, greater coordination, continued public investment and collaboration across academia, government and industry. The review builds upon and expands these findings and provides actionable recommendations to create a world-class, cutting-edge compute ecosystem in the UK.
The following chapters present the key elements of the compute ecosystem, illustrating the UK’s position within the international landscape, the drivers of compute demand, the current infrastructure and future requirements and the government’s role in creating a vibrant ecosystem.
Chapter 1, The significance of compute for the UK, sets out how compute can help meet the UK’s goals, presents the current compute landscape and explores the government’s role in creating a thriving ecosystem.
Chapter 2, The international landscape of compute, presents compute as an international issue, not just a domestic one. It looks at the UK’s position within the international landscape and illustrates best practices adopted by other countries as part of their compute policy.
Chapter 3, The demand for compute in the UK, presents the different user needs and barriers to compute uptake. It also looks at the potential users of the future and how to stimulate compute demand.
Chapter 4, Meeting the UK’s compute needs, outlines the requirements to meet demand for compute in the UK, now and in the future. It looks at what the UK needs across the ecosystem, covering exascale, cloud and AI infrastructure, the importance of software and the environmental impact of compute systems.
Chapter 5, Creating a vibrant compute ecosystem, describes the need for a long-term vision and better coordination. It illustrates how, by addressing these strategic requirements, the UK will be able to create a vibrant compute ecosystem, build a sustainable skills pipeline, ensure secure access to infrastructure, and establish beneficial partnerships.
Recommendations sets out the review’s actions for the government. These outline the need for the government to be ambitious and visionary in its approach to compute, the tangible actions required to bolster the UK’s compute infrastructure and the specific measures that will maximise the value of the UK’s compute ecosystem through empowering the community.
List of recommendations
A. Unlock the world-leading, high-growth potential of UK compute
B. Build world-class, sustainable compute capabilities
C. Empower the compute community
Proposed timeframe
Summer 2023: Establish AI Research Resource
2023: Publish vision
2023: Launch exascale phase 1
2023: Continue to collaborate internationally
Spring 2024: Publish roadmap
2024: Establish coordinating body
Beyond 2024: Continue involvement with OCRE
2025: Interoperability project
2026: Deliver full exascale
2033: Refresh strategic vision
Glossary of terms
Term | Definition |
---|---|
Accelerator | Specialised hardware for certain computational workloads. |
Artificial Intelligence (AI) | Machines that perform tasks normally performed by human intelligence, especially when the machines learn from data how to do those tasks. |
Cloud computing | An access model where computing infrastructure is accessed on-demand via the internet. |
Commercial cloud | A company that provides computing resources as a service through the cloud. |
Compute | Computer systems where processing power, memory, data storage and network are assembled at scale to tackle computational tasks beyond the capabilities of everyday computers. The review recognises ‘compute’ as an umbrella term including, but not limited to, advanced compute, high performance computing (HPC), high throughput computing (HTC), large scale computing (LSC) and supercomputing. Cloud computing does not fall entirely under this umbrella, but it is considered as an access model for compute if used for high computational loads. |
Central Processing Unit (CPU) | The unit within a computer which contains the circuitry necessary to execute software instructions. CPUs are suited to completing a wide variety of workloads, distinguishing them from other types of processing unit (for example, GPUs) which are more specialised. |
Exascale | A high-performance computing system that is capable of at least one Exaflop per second (a system that can perform more than 10^18 floating point operations per second). |
Flops | Floating point operations per second (the number of calculations involving rational numbers that a computer can perform per second). |
Graphics Processing Unit (GPU) | Accelerator hardware optimised for certain calculations often associated with image processing and related applications. Foundational to modern AI advances. |
Petascale | A high-performance computing system that is capable of at least one Petaflop per second (a system that can perform more than 10^15 floating point operations per second). |
1. The significance of compute for the UK
Key findings
- Compute is foundational to the lives of all citizens, both now and in the future. It impacts social and professional lives, across academic research, government and industry. Investing in compute will bring wide-ranging benefits.
- Compute is essential for achieving the UK’s ambitions for science and technology and economic growth. Compute drives productivity and efficiency and is vital to support technologies such as Artificial Intelligence (AI), which will have a profound impact on the economy.
- The UK is falling behind on compute and the government will need to take substantive action if it is to achieve its ambitions. The UK must have a thriving ecosystem that is fit for the future, delivers the necessary infrastructure and enables access for existing and new users.
1.1 Introduction
Compute plays an integral role in modern life. Compute powers the technologies used today, underpins world class research and drives the economy by enabling businesses to be more productive and develop new products and services. It shapes how individuals live, with compute underpinning services essential to modern life such as communication, travelling, shopping or weather forecasting. It powers smart assistants on handheld devices, search engines and social media. It enables cutting-edge technologies, such as roadworthy autonomous vehicles, that would have been impossible only 15 years ago.
A strong compute ecosystem is integral to delivering the UK’s ambitions around economic growth and its status as a Science and Technology Superpower, today and in the future. The UK is an international leader in AI and will require continued investment into compute to remain at the forefront. Many industries, sectors, products and services already rely on compute, and this reliance is only likely to increase given the pace of scientific and technological innovation. The potential benefits of compute in areas like clean energy and healthcare — to name just two sectors — could have transformative impacts on the economy.
As compute technologies are changing rapidly and investments can be risky and expensive, the market alone will be unable to meet demand. For the UK to remain competitive in research and innovation, academia and industry need to be able to access low cost compute at the times they need it. Investment into the UK’s current facilities has not kept pace with the evolving needs of users, restricting their ability to access compute.
It is vital to support the UK’s compute ecosystem immediately through public investment. This chapter sets out the importance of compute to the UK and the role this technology can play in meeting government ambitions.
1.2 Understanding compute
There is no standardised definition for compute across academia, government and industry. For the purposes of this review, compute is defined as follows:
‘Compute’ or ‘advanced compute’ refers to computer systems where processing power, memory, data storage and network are assembled at scale to tackle computational tasks beyond the capabilities of everyday computers.
The review recognises ‘compute’ as an umbrella term including, but not limited to, advanced compute, high performance computing (HPC), high throughput computing (HTC), large scale computing (LSC) and supercomputing. Cloud computing does not fall entirely under this umbrella, but it is considered as an access model for compute if used for high computational loads. There are numerous technologies associated with compute, from different types of accelerators (e.g. Graphics and Tensor Processing Units) and distributed computing, to emerging and novel paradigms (e.g. neuromorphic and quantum computing). The review does not explore these technologies in detail, but recognises that future strategic decisions for compute should consider how they interact.
Whilst compute or the process of computation occurs as data calculations on a processor, it relies on so much more: the software that controls the system and instructs the processor; the other hardware described in the definition above; the skilled operators; and access to quality data.
Software, hardware and skills have been considered in detail as part of this review. Data is a broad topic, and beyond the remit of this review, but it should be recognised that compute is only possible with access to quality and secure datasets. This will become increasingly vital to unlock the benefits of AI and machine learning techniques.
Unlocking the power of data
21st century economies will be defined by new scientific and technological developments in AI, quantum technologies and robotics. To be world class in these areas, the UK needs to be world-class in its approach to data sharing.
The National Data Strategy, specifically Mission 1 (‘Unlocking the value of data across the economy’), is a central part of the government’s wider ambition for a thriving economy. For the strategy’s objectives to be realised, organisations must be equipped to access, interpret and use data effectively. Compute is the technology that enables organisations to make sense of data and thus benefit from it.
Higher compute uptake and the delivery of Mission 1 are therefore closely intertwined. Unlocking the significant opportunities from data access and data use through compute, and vice versa, will be critical to keeping the UK at the forefront of science, research, innovation and technological development, and to creating a world-class digital economy.
The scale of compute
One way to measure computational power is in ‘flops’ — floating point operations per second. A floating point operation is roughly equivalent to a single arithmetic calculation (addition, subtraction, multiplication or division) involving two numbers, and can be done at varying levels of precision.
While a leading smartphone today is capable of 10^12 flops (a trillion flops or a ‘teraflop’ per second), compute reaches beyond 10^15 flops (a thousand trillion flops or a ‘petaflop’ per second). The most powerful system, as defined by the Top500 list, is the USA’s exascale system Frontier. It is more than an order of magnitude more powerful, operating at 10^18 flops, or a million trillion flops (an ‘exaflop’ per second).
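The gaps between these scales are easier to see with a worked calculation. The following is a minimal Python sketch, not part of the review's analysis: the workload size is a hypothetical figure chosen only for illustration, and the calculation assumes an idealised machine that sustains its peak rate throughout.

```python
# Orders of magnitude quoted in the text, in floating point operations per second.
TERAFLOP = 10**12  # roughly a leading smartphone
PETAFLOP = 10**15  # the scale at which the text starts to speak of 'compute'
EXAFLOP = 10**18   # exascale systems such as Frontier


def seconds_to_complete(total_operations: float, flops: float) -> float:
    """Idealised run time, assuming the peak rate is sustained for the whole job."""
    return total_operations / flops


# Hypothetical workload size (an assumption, not a figure from the review).
WORKLOAD_OPERATIONS = 10**21

for name, rate in [("teraflop", TERAFLOP), ("petaflop", PETAFLOP), ("exaflop", EXAFLOP)]:
    secs = seconds_to_complete(WORKLOAD_OPERATIONS, rate)
    print(f"{name}: {secs:,.0f} seconds ({secs / 86_400:,.2f} days)")
```

Each step up in scale is a factor of 1,000, so a job that would occupy a teraflop device for decades becomes a matter of minutes at exascale under these idealised assumptions.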
The performance of the world’s fastest compute has grown by a factor of 626 since 2010, while the compute requirements of the largest machine learning models have grown by a factor of 10 billion in the same timeframe. According to OpenAI, computational demand to train AI is currently doubling every 3-4 months. Their 2018 analysis focuses on the ‘modern era’, which starts in 2012 with AlexNet, a neural network, and describes the AI models that have led to breakthroughs in vision and language learning unimaginable only a decade ago.
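These cumulative growth factors can be converted into annualised rates and doubling times for comparison. The sketch below is illustrative only: it assumes a 12-year window (2010 to 2022), which the text does not state exactly, and the function names are hypothetical.

```python
import math


def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over the period."""
    return total_factor ** (1 / years)


def doubling_time_months(total_factor: float, years: float) -> float:
    """Months needed to double, assuming steady exponential growth over the period."""
    return 12 * years * math.log(2) / math.log(total_factor)


YEARS = 12  # assumption: 2010 to 2022

hardware_factor = 626    # growth in the world's fastest compute, from the text
model_factor = 10 * 10**9  # growth in largest ML model requirements, from the text

print(f"Fastest systems: x{annual_growth_factor(hardware_factor, YEARS):.2f} per year, "
      f"doubling every {doubling_time_months(hardware_factor, YEARS):.1f} months")
print(f"Largest ML models: x{annual_growth_factor(model_factor, YEARS):.2f} per year, "
      f"doubling every {doubling_time_months(model_factor, YEARS):.1f} months")
```

Changing `YEARS` shows how sensitive the implied doubling time is to the window chosen. Under the 12-year assumption, model requirements double far faster than the performance of the fastest systems.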
The power of more compute
More compute and the resulting increase in model resolution has significant implications for research and innovation. This was particularly evident during the coronavirus (COVID-19) pandemic.
ARCHER, the UK’s most powerful public system in 2020 and predecessor to ARCHER2, could only model parts of the COVID-19 virus. According to evidence gathered by the review, this increased the time to find binding sites in support of vaccine development. Comparatively, Japan’s Fugaku — launched ahead of schedule in 2020 to provide researchers with compute for intensive COVID-19 research and about 100 times more powerful than ARCHER — was used to simulate how COVID-19 might spread from person to person via aerosolized droplets. This had a significant impact on many governments’ policies, particularly with regards to the role of face masks. The Japanese team that conducted the study was awarded a Gordon Bell Special Prize, a recognition of the impact of their research.
1.3 Achieving the UK’s ambitions through compute
The Prime Minister has committed to building a more prosperous future and growing the economy, and the UK has bold ambitions across research, technology and innovation. Compute underpins or supports several government priorities: it powers innovation and productivity across the economy; it is essential for cutting-edge scientific research; it drives the development of transformative technologies such as AI, new materials and quantum computing; and it helps ensure national security.
Compute underpins government ambitions, including:
- the Plan for Growth and Innovation Strategy, which set out the government’s plans to support economic growth through innovation
- the Integrated Review, which outlines the need to deliver prosperity and security at home and abroad, and shape the open international order of the future
- the UK Digital Strategy, which sets out the government’s vision for harnessing digital transformation and building a more inclusive, competitive and innovative digital economy
- the National AI Strategy, which outlines the ten year plan to make Britain a global AI superpower, and the upcoming white paper on regulation
- the National Data Strategy, which sets out the UK’s vision to harness the power of responsible data use to boost productivity; create new businesses and jobs; improve public services; support a fairer society; and drive scientific discovery
- the Net Zero Strategy, which sets out policies and proposals for decarbonising all sectors of the UK economy to meet the UK’s net zero target by 2050
- the National Cyber Strategy, which outlines the government’s approach to protecting and promoting the UK’s interests in cyberspace and ensures that the UK continues to be a leading responsible and democratic cyber power
- the upcoming National Quantum Strategy, which will set out a ten-year vision for the UK to be a leading quantum enabled economy
- the upcoming Semiconductor Strategy, which will set out how the UK can best protect and grow its domestic industry, secure greater supply chain resilience and consider the security implications of this vital technology
These ambitions simply cannot be achieved without access to world-leading compute capabilities. The UK is currently a global science and technology superpower, but, without a thriving compute ecosystem, the UK’s technological, scientific and economic capability is at risk. As the recommendations of this review are taken forward, it is vital to ensure greater coordination among government strategies and policies that relate to compute.
Compute for a more prosperous future
Compute has the potential to unlock productivity as sectors across the economy make better and more extensive use of data analysis, simulation and AI technologies. Compute is an enabler of large parts of the UK’s digital sector, which itself is one of the fastest growing industries, contributing nearly £140 billion to the economy in 2021.[footnote 4] Research has demonstrated the return on compute investment in the US across different sectors, with insurance, oil and gas and financial sectors realising the greatest returns in terms of profit and cost savings.[footnote 5]
The size of the UK HPC market has grown steadily over recent years and is expected to grow significantly over the next five years, with forecasts indicating a compound annual growth rate (CAGR) of up to 11%.[footnote 6] Internal analysis by the Department for Science, Innovation and Technology indicates that demand projections for cloud computing and HPC could be up to 1.8 and 2.7 times larger by 2032 than in 2022 respectively. Whilst HPC and cloud computing are an incomplete representation of the compute market, they are the best available information and a strong indicator of future trends.
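To put these projections on a common footing, a compound annual growth rate can be converted into a cumulative multiplier and back. This is a minimal Python sketch under the assumption of steady compounding. The helper names are hypothetical, and the 10-year horizon is read from the 2022 to 2032 dates in the text.

```python
def compound_multiplier(cagr: float, years: float) -> float:
    """Cumulative growth multiplier after compounding at cagr for the given years."""
    return (1 + cagr) ** years


def implied_cagr(multiplier: float, years: float) -> float:
    """Constant annual growth rate that produces the multiplier over the given years."""
    return multiplier ** (1 / years) - 1


# Market forecast from the text: a CAGR of up to 11% over five years.
print(f"11% CAGR over 5 years: x{compound_multiplier(0.11, 5):.2f}")

# Demand projections from the text: up to 1.8x (cloud) and 2.7x (HPC), 2022 to 2032.
for label, multiplier in [("cloud computing", 1.8), ("HPC", 2.7)]:
    print(f"{label}: x{multiplier} over 10 years implies {implied_cagr(multiplier, 10):.1%} per year")
```

The two functions are inverses of each other, so either form of a projection can be checked against the other.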
Figure 1A illustrates the significant potential benefits of compute based on economic analysis carried out to support recent public and private investments. These include improvements to the scale, efficiency and quality of research, increased productivity in the private sector and greater international collaborations. For example, the potential benefit range for Engineering and Physical Sciences Research Council (EPSRC) investment is already very high, but their analysis suggests the additional spillover benefits may be significant too, estimated to be £5.8 billion.[footnote 7]
Economic analysis of recently funded compute systems
There are substantial benefits to investing in compute. Three examples of investment in UK compute infrastructure demonstrate the economic impact on the UK economy.
HPC investments by EPSRC cost £466 million and are anticipated to lead to benefits of between £3 billion and £9.1 billion to the UK economy.[footnote 8] This results in a benefit cost ratio of 6.5 to 19.5. The estimate includes the spillover impact of HPC research on UK output, contributions from industry impact and benefits to those receiving training and skills development.
The Met Office invested £1.2 billion in a 10-year supercomputer, with an expected social value of £13.74 billion and a benefit cost ratio of 9.[footnote 9] The winning tenderer, Microsoft, has also committed to investing £45 million in UK skills development and to reducing the UK’s carbon impacts.
Finally, $100 million was invested in Cambridge-1, with an estimated potential value of around $825 million over the next 10 years and a benefit cost ratio of 12.1.[footnote 10]
Figure 1A: Economic analysis of recently funded compute systems
Investment | Benefit Cost Ratio |
---|---|
EPSRC HPC (2019) | 6.5-19.5 |
Met Office Supercomputer (2021) | 9 |
NVIDIA Cambridge-1 (2021) | 12.1 |
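A benefit cost ratio is, at its simplest, benefit divided by cost. The sketch below applies that division to the EPSRC headline figures quoted above. It assumes plain, undiscounted division; the appraisal method behind the published ratios is not described in this section, so simple division of headline figures will not necessarily reproduce every published ratio exactly. The function and variable names are illustrative.

```python
def benefit_cost_ratio(benefit_millions: float, cost_millions: float) -> float:
    """Simple (undiscounted) benefit cost ratio: benefit divided by cost."""
    if cost_millions <= 0:
        raise ValueError("cost must be positive")
    return benefit_millions / cost_millions


# EPSRC HPC headline figures from the text, in millions of pounds.
EPSRC_COST = 466.0
EPSRC_BENEFIT_LOW = 3_000.0
EPSRC_BENEFIT_HIGH = 9_100.0

epsrc_low = benefit_cost_ratio(EPSRC_BENEFIT_LOW, EPSRC_COST)
epsrc_high = benefit_cost_ratio(EPSRC_BENEFIT_HIGH, EPSRC_COST)

print(f"EPSRC HPC benefit cost ratio: {epsrc_low:.1f} to {epsrc_high:.1f}")
```

Simple division gives a range of roughly 6.4 to 19.5, close to the published 6.5 to 19.5.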
There is also evidence that increased business uptake of compute can lead to regional growth and development. Regional programmes such as LCR 4.0 and Cheshire & Warrington 4.0 provide support to UK SMEs and supply chains in the adoption of compute and AI technologies.[footnote 11] LCR 4.0 provided over 300 manufacturing SMEs in the Liverpool City Region (LCR) with technical and business support. Approximately 50% of businesses were supported with access to compute.[footnote 12] The project is on target to create 955 jobs within the Liverpool City Region and generate £31 million GVA, though this cannot all be attributed to the use of compute.
Evidence collected by the review revealed examples of the positive impact of compute on business productivity, with spillover effects throughout the value chain. As use cases for compute increase, it can be expected that efficiency gains and productivity benefits across the economy will also grow - but only if users can exploit compute capabilities fully.
Compute for AI
The National AI Strategy describes AI as a general-purpose technology and sets out the government’s ambitions to be a global AI superpower. Compute-intensive AI-augmented R&D has the potential to increase the rate of productivity growth significantly - by a factor of 2 compared to average rates in the last 75 years or a factor of 3 since 2000.[footnote 13]
Progress at the frontier of AI has been rapid. For example, London-based DeepMind has used AI to predict how proteins take shape, cutting cancer drug discovery times from years to under a month. AI models such as ChatGPT and Stable Diffusion can mimic human abilities in language, reasoning and drawing. Fundamental advances in AI such as these come with huge commercial potential and are permeating throughout businesses and the wider economy.
Richard Sutton, a leading AI professor, determined that the biggest lesson of 70 years of AI research is that the most effective methods, by a large margin, are those that leverage and scale with compute. Training models requires massive investment in compute, with usage at the frontier doubling every six months. UK researchers must have access to sufficient specialised accelerator-driven compute to operate at the frontier of AI research and realise the social and economic benefits of AI.
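To give a sense of what a six-month doubling time implies, the sketch below computes the cumulative growth in compute usage over a given period. It assumes the doubling time stays constant, which is a simplification for illustration only; the function name is hypothetical.

```python
def compute_growth(months_elapsed: float, doubling_months: float = 6.0) -> float:
    """Growth multiple after `months_elapsed`, given a fixed doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (months_elapsed / doubling_months)


one_year = compute_growth(12)    # two doublings
five_years = compute_growth(60)  # ten doublings

print(f"After 1 year:  x{one_year:,.0f}")
print(f"After 5 years: x{five_years:,.0f}")
```

On this assumption, frontier usage would quadruple each year and grow about a thousandfold over five years.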
Compute for scientific advantage and societal goals
The importance of computation for scientific research is huge and, to a large degree, immeasurable. The creation and analysis of large amounts of data is crucial for the development of science, the outcomes of which are difficult to foresee.
Figure 1B: The scale of the challenge — applications for compute
As outlined in Figure 1B, compute is required to tackle many of today’s global challenges and to unlock breakthroughs across several disciplines and sectors. For example:
- Our climate, weather and earth: Earth observation data and climate modelling for more accurate weather forecasts and to understand the impacts and tipping points of climate change.
- Our universe: Simulate phenomena like gravitational waves and dark matter to develop our fundamental understanding.
- Our biology: Simulate how our bodies work and the functions of our genes to develop personalised medicine; use AI to search through endless data to match illness and potential drug treatments.
- Engineering and materials: Simulate jet engines and plasma in a fusion reactor to engineer a low carbon future, and discover new materials with AI.
- Our society and culture: Use insights and trends to create solutions to improve society.
There is huge potential for compute to solve research and social challenges quicker than ever before — it took 13 years to first sequence a human genome; compute can now do this in less than a day.
Compute for weather and climate predictions[footnote 15]
Every weather forecast and climate prediction that the UK Met Office makes today is generated through compute. Weather modelling starts with raw observations. The Met Office takes in 215 billion weather observations from all over the world each day, consisting of temperature, pressure, wind speed and direction, humidity and other atmospheric variables.
The Cray XC40 supercomputing system is the latest in a long history of compute technology used by the Met Office. Completed in 2016, it consists of three separate machines that are capable of 14 petaflops. This compute capability is essential for the scale and complexity of the weather and climate models run by the Met Office. To reflect increasing data availability and demand for more local and accurate forecasting, the Met Office is investing in the next generation of compute.
Weather forecasting helps the UK with everything from routing flights and shipping, to managing the provision of key utilities such as energy and water. Climate prediction enables the UK to identify, and plan effectively how to mitigate, the impacts of climate change. This means that, by the end of its life, the Met Office’s current supercomputing system is predicted to have enabled an additional £1.6 billion of socio-economic benefits across the UK.
Using compute and AI to develop fusion technologies for a low carbon future[footnote 16]
Realising the UK’s ambition for fusion to provide a carbon-free energy supply early in the 2040s will require a radical transformation of the engineering process. Fusion power plants must be designed ‘in-silico’. A collaboration between the Science and Technology Facilities Council (STFC) Hartree Centre and the UK Atomic Energy Authority (UKAEA) is applying the latest supercomputing and data science expertise to accelerate developments in fusion energy, across an environment increasingly being referred to as the Industrial Metaverse.
Such work is seeding the development of a whole new industrial sector: the pioneering SMEs, universities and engineering giants that will deliver and commercialise fusion. Worldwide this has attracted $5 billion in venture capital. In the era of net zero, this is a race, a necessity and a tremendous economic opportunity for the UK.
For this fledgling sector to thrive, new infrastructure is needed, powered by extreme scale computing and AI. The fusion process, whether relying upon powerful lasers or magnetic confinement of a plasma ten times hotter than the core of the sun, has long been referred to as an ‘exascale’ grand challenge. From modelling the turbulence that causes heat to leak from the plasma core, to simulating the materials needed to build a robust power plant that can tolerate such an extreme environment, systems must be designed using new, state-of-the-art tools. These include simulation at the exascale, advanced methods in uncertainty quantification and algorithms built upon the convergence of AI and HPC, all made possible by recent, disruptive developments in compute technology and machine learning. Many of the methods being developed by the UKAEA and the Hartree Centre will translate into other science and industry sectors, improving UK skills and capability growth.
1.4 UK compute today
To understand the UK’s requirements for compute capability over the next decade, consideration must be given to the current landscape. The GO-Science report offers a comprehensive overview of the compute ecosystem and the interdependence of hardware, software and skills. This review builds on that analysis, showing that the UK’s compute capability lags internationally and that users face challenges in accessing the resources they require.
Public compute systems are often categorised into tiers based on capability. UK public compute and commercial cloud facilities have been outlined in Figure 1C. UK Research and Innovation (UKRI) operates tier 1 and 2 systems, providing national and regional compute capability in the UK with universities providing local tier 3 systems. The UK also has limited access to some international pre-exascale and exascale systems. Commercial cloud offers additional resource to UK researchers. The following chapters explore how the UK’s current fragmented landscape does not meet user needs and fails to provide clarity on future provision.
Figure 1C: UK public compute and commercial cloud facilities
1.5 Policy implications
There is a clear strategic case for investing in compute to achieve the UK’s economic, scientific and societal ambitions. This is founded upon four principal factors: compute will produce significant economic and social benefits; compute is essential to achieve UK ambitions; the market alone will not provide all the necessary compute supply; and UK compute capability is falling behind internationally.
Commissioning this review signalled the government’s interest in giving strategic consideration to compute policy and investment. However, investment into compute infrastructure needs to be combined with long-term strategic planning across the whole ecosystem to maximise the benefits of compute uptake for all users. Government must work with industry, academia and the public sector to succeed.
Action required: Unlock the world-leading, high-growth potential of UK compute
Compute is critical for achieving the government’s ambitions in the long and short term. These include: creating a more prosperous future through economic growth and productivity; securing the UK’s status as a global Science and Technology Superpower; and being a world leader in AI. The UK cannot achieve these goals without having a compute ecosystem and infrastructure that is fit for the future. Action is needed to unlock the world-leading, high-growth potential of UK compute and ensure the UK remains competitive internationally.
2. The international landscape of compute
Key findings
- Internationally, compute is recognised as a strategic resource required to be competitive in global science and innovation. Other countries are moving ahead of the UK through strategic planning and public investment into compute resources, which translate into world-class scientific outputs and innovation.
- To secure the UK’s status as a global Science and Technology Superpower by 2030 and to be at the forefront of shaping international thinking around digital innovation, a strong domestic compute ecosystem and enhanced compute capability are required. The UK needs to act now to catch up with competitors and to realise its international, scientific and tech ambitions.
2.1 Introduction
As noted in Chapter 1, compute is critical to understanding and tackling many domestic and global challenges. Internationally, the role of compute as an essential enabler of global innovation and cutting-edge science, and as a critical component of any strong digital economy, is widely recognised. This is reflected in the long-term, strategic approach that other countries have adopted, as well as their commitment to public investment into compute.
However, despite compute’s strategic importance, the UK is falling behind. In 2022, the US broke the exaflop barrier with their Frontier system. In comparison, the UK’s most powerful public system, ARCHER2, is 56 times less powerful. The UK has a long and proud tradition of computer science and cutting-edge research, with world-class strengths in areas including innovation, software development and machine learning. However, to maximise its expertise and ensure it remains at the forefront of global competitiveness, technological innovation, and scientific discoveries, the UK needs an internationally competitive compute ecosystem. Therefore, investment into and use of compute systems should be looked at through a global lens.
This chapter explores the international compute landscape, the UK’s position within it and best practices adopted internationally to build a vibrant compute ecosystem.
2.2 International compute capability
While other countries have continued to invest in compute, Figure 2A shows the UK has lost ground since 2005, a time when the UK and Japan were peers and only the US had greater compute capability. As of November 2022, the UK was trailing in compute capability (Rmax), ranking 10th globally. The EU’s EuroHPC programme is further driving other countries’ compute capability ahead of the UK’s, starting with pre-exascale systems in Finland and Italy.
Figure 2A: Growth in total compute performance between 2005 and 2022[footnote 17]
Countries are also investing in AI compute. In addition to its national research cloud, Canada funds dedicated computing resources of hundreds of top-spec GPUs for each of its three national AI institutes; the US are looking to establish a National AI Research Resource; France has extended its Jean Zay system to include accelerators; and Italy hosts the latest EuroHPC system Leonardo with 10 exaflops of AI performance. In comparison, the UK does not have a dedicated AI compute resource and has a limited number of accelerators in national systems.
Advanced economies are investing in next generation infrastructure and exascale capability, threatening to widen the gap with the UK further. Figure 2B shows the most powerful compute systems in the world. The US and China have already deployed exascale infrastructure, with multiple systems expected to be operational by 2025. The US are now looking further ahead to systems 5-10 times as powerful as Frontier, with at least one of these planned for 2025-2030.[footnote 19] The EU’s EuroHPC programme is deploying three pre-exascale systems (in Finland, Italy and Spain), and will deploy its first exascale system in Germany in 2024. Japan’s Fugaku marked the transition to exascale in 2020 with the first multi-hundred petaflop system and it is paving the way for exaflop-capable infrastructure. Australia and Singapore have teamed up to explore collaboration on exascale, including capability development and readiness of underlying infrastructure.
Figure 2B: Most powerful compute systems by location (November 2022)[footnote 20]
Location | System | Maximal achieved performance (Rmax, petaflops) | Installation year | TOP500 rank |
---|---|---|---|---|
USA | Frontier | 1,102 | 2021 | 1 |
Japan | Supercomputer Fugaku | 442 | 2020 | 2 |
Finland* | LUMI | 309 | 2022 | 3 |
Italy* | Leonardo | 175 | 2022 | 4 |
USA | Summit | 149 | 2018 | 5 |
UK | ARCHER2 | 20 | 2021 | 28 |
*EuroHPC facilities
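The gap between the UK and the leading systems can be read directly from the Rmax column of Figure 2B. The sketch below computes each system’s performance relative to ARCHER2. It uses the rounded petaflop values from the table, so the results are approximate and can differ slightly from ratios derived from unrounded figures; the dictionary and function names are illustrative.

```python
# Rmax values in petaflops, as rounded in Figure 2B (November 2022).
RMAX_PETAFLOPS = {
    "Frontier": 1102,
    "Fugaku": 442,
    "LUMI": 309,
    "Leonardo": 175,
    "Summit": 149,
    "ARCHER2": 20,
}


def relative_performance(systems: dict, reference: str) -> dict:
    """Each system's Rmax divided by the Rmax of the reference system."""
    base = systems[reference]
    return {name: rmax / base for name, rmax in systems.items()}


ratios = relative_performance(RMAX_PETAFLOPS, "ARCHER2")

for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name:<9} {ratio:5.1f}x ARCHER2")
```

With the rounded table values, Frontier comes out at roughly 55 times ARCHER2, in line with the figure of about 56 quoted in section 2.1.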
UK researchers currently have access to EuroHPC systems such as LUMI and are able to pursue partnerships with European colleagues to increase the likelihood of successful allocation. However, UK researchers will not have direct access to future EU systems such as Jupiter, the EU’s first exascale system scheduled for 2023/24. Indirect access might be possible through negotiation.
2.3 The impact of a robust compute capability
A robust compute capability, paired with a healthy ecosystem that maximises the investment in infrastructure, leads to world-leading results in science, technology and innovation. For instance, Frontier will further integrate AI applications with modelling and simulation, enabling the US to achieve advances in medicine, astrophysics, bioinformatics, cosmology, climate, physics and nuclear energy. As the US continues to develop its compute capability over the decade, it will be able to tackle computationally intensive problems, such as complex physical systems or those with high fidelity requirements.[footnote 21]
Case study: Frontier powering medical research
Frontier’s exascale capability was leveraged by researchers at the US Department of Energy’s Oak Ridge National Laboratory to conduct a study that could result in health-related breakthroughs. Researchers used Frontier to perform a large-scale scan of biomedical literature in order to find potential links among symptoms, diseases, conditions and treatments, understanding connections between different conditions and potentially leading to new treatments. By using Frontier, scanning time was significantly reduced, with the system able to search more than 7 million data points from 18 million medical publications in only 11.7 minutes. The study identified four sets of paths that will need further investigation through clinical trials.
World-changing scientific discoveries powered by world-class compute infrastructure enhance a country’s international standing in science and technology. The UK’s strengths in computing (especially in software development, machine learning and, increasingly, green computing) and in science and research more broadly are internationally recognised. This is shown by the fact that UK researchers are able to access the world’s most powerful systems on the basis of their projects’ scientific merit. Evidence gathered by this review shows that, since 2010, UK researchers have been included in 5% of the projects supported by the US’s INCITE programme (which supports large-scale scientific and engineering projects by providing access to compute resources), with the majority having a UK-based Principal Investigator. The same evidence shows that UK researchers were included in 20% of the projects supported by the EU’s PRACE programme since 2010.
However, a lack of a strong domestic compute ecosystem constrains the UK’s potential and threatens its standing as an international leader in science and technology. Low investment in domestic compute capability may have already negatively impacted UK research: a consortium including UK researchers has not won the Gordon Bell Prize — an annual award for outstanding achievement in compute — since 2011, unlike researchers from the US, China, Japan, Switzerland, France and Germany.
Case study: 2022 Gordon Bell Prize winners
A team of French, Japanese and American institutions won the 2022 Gordon Bell Prize for its simulation code, WarpX. The code enables high-speed, high-fidelity kinetic plasma simulations, with the potential of powering advances in nuclear fusion and astrophysics. The code is optimised for the Frontier (US), Summit (US), Fugaku (Japan) and Perlmutter (US) supercomputers, which have different architectures and are all in the top 10 of the November 2022 Top500 list.
Without further investment in domestic compute capability, the UK’s position will continue to decline as other countries press ahead with plans to bolster their compute infrastructure. This will affect the UK’s ability to remain at the forefront of global innovation and secure its status as a Science and Technology Superpower.
2.4 International compute policies
Internationally, the development of domestic sovereign capabilities and of a strong compute ecosystem has been the result of targeted government interventions and long-term planning. Leading countries in compute have adopted clear, strategic measures to ensure an adequate supply, facilitate access to compute, increase uptake among their domestic users and stimulate their domestic supply chains and R&D. The strategic approaches and long-term planning of global compute leaders are driven by the recognition that private investment in new computing technologies is limited by the extent of the cost, risk and technological change involved. Without public intervention, the provision of compute would likely be insufficient in terms of high-end capability, inefficient in terms of capacity allocation, unequal in terms of access and ineffective in stimulating innovation across the whole economy.
International best practices
Countries have adopted a number of approaches to compute designed to meet their particular needs. Some interventions are common to countries that have been successful in creating robust compute ecosystems.
These include:
- A long-term strategy, updated on a rolling basis, that underpins and guides planning and investment around compute. Rolling updates are crucial to ensure strategies remain up to date and include the latest technological developments. Investment plans are also updated on a rolling basis, as they are informed by long-term strategies.
- Having a strong coordinating entity to enable and facilitate access to national systems. National coordination bodies are usually arm’s-length bodies or government-sponsored networks of research institutions. They are responsible for facilitating user access to systems, promoting shared investment plans across national facilities, and providing training.
- An incremental approach to investment that grows and adapts to reflect changing needs. This stems from having a long-term strategic vision and often includes clear plans around investment into next generation hardware, such as exascale, alongside the necessary supporting infrastructure.
- A recognition of the importance of public investment in different infrastructure tiers - from flagship systems (Tier 0) to regional infrastructure (Tier 2). This ensures applications can be scaled up incrementally, allowing innovation to trickle up the compute ecosystem.
- Investment in heterogeneous architectures, as these represent the current direction of technological development. Several countries have invested in heterogeneous systems due to the convergence between AI and more traditional compute applications.
- Investment in hardware coupled with investment in software development, applications and skills to support the community of users that will access new systems.
- Enabling greater uptake of compute by lowering barriers to access. This includes providing cloud access to compute resources, providing targeted support for users and fostering industry’s compute uptake through ad hoc initiatives.
- Adopting effective procurement models that lower costs, mitigate technological risks, and take into account innovative approaches for future upgrades. Joint procurement — either at a national level (among different public facilities/national laboratories) or at an international level (in partnerships with other countries) — can deliver economic as well as technological advantages, particularly when investing in multiple systems with different architectures.
- Investment in supply chains and the wider domestic compute R&D to strengthen the domestic compute industry and develop new technologies.
The panel is of the view that the UK should aim to create a strong and vibrant compute ecosystem by implementing all the above measures. This would help the UK reduce its lag behind other countries, and be a credible international partner at the forefront of digital innovation. If the UK were to adopt only some of these best practices — for example, only improving its long-term vision and coordination — it would become more competitive internationally, but would fall short of becoming a global leader in compute.
Case study: USA
The US have adopted a holistic, strategic approach to compute (spanning heterogeneous systems, data and networking requirements) and have developed a plan containing clear, actionable measures. The National Science and Technology Council Subcommittee on Future Advanced Computing Ecosystems provides strategic direction and ensures coordination between Federal agencies. The Department of Energy (DOE) and the National Science Foundation (NSF) coordinate access to and use of compute for governmental and academic research respectively. The Department of Defence also plays a role, with a military and defence focus.
Forward-looking investments, such as the Exascale Computing Initiative, are fulfilling the US’s strategic planning. The US have also adopted a joint procurement approach, which has seen DOE’s laboratories jointly procuring systems. As they look to develop more advanced capabilities, the US are exploring the possibility of moving from monolithic acquisitions toward a model based on modular upgrades. Future systems will be part of an Advanced Computing Ecosystem and integrated with other DOE facilities.
The US have implemented initiatives to facilitate access to compute, such as the STRIDES initiative (which supports the use of cloud for biomedical research) and the NSF’s CloudBank (which helps the computer science community access and use public clouds for research and education). The National AI Research Resource, currently in development, aims to leverage a combination of HPC and cloud computing to provide AI researchers with access to compute resources and data. The US also have initiatives open to industry, such as the DOE’s INCITE and HPC for Energy Innovation (HPC4EI) programmes (which offer access to compute and technical support for projects aimed at reducing the environmental impact of the manufacturing, materials and energy sectors).
The US invests in compute as a sovereign capability through investment in the full supply chain, starting from semiconductors — a critical component of compute systems. This includes the allocation of $52 billion to maintain technological advantage in the semiconductor sector and build domestic capabilities.
2.5 UK policy implications
The UK lags behind other advanced economies in compute. To secure the UK’s status as a Science and Technology Superpower, action is needed to match international investment in compute and further capitalise on the UK’s world-class research and innovation capability. Only by having a strong domestic compute ecosystem can the UK be seen as an influential player internationally and project its global power as a science and technology leader.
Lessons from those at the forefront of compute suggest that this requires government intervention to provide strategic leadership, ensure sovereign compute capability, and create a vibrant ecosystem. The UK can learn from the approach of others and carry out interventions that match the characteristics of its compute landscape. These points form the basis of this review’s recommendations. The following chapters will explore how the UK can design an ecosystem that meets its domestic requirements, satisfies the needs of its users and improves compute uptake.
3. The demand for compute in the UK
Key findings
- The UK’s compute ecosystem needs to reflect the variety of users, both existing and emerging, and their needs. It needs to ensure a broad offering to support prosperity and growth across the UK.
- For many academic users and researchers, and other pioneers of compute, the current provision of compute is not sufficient. This is limiting the UK’s scientific capability and inhibiting scientific breakthroughs.
- Many more could benefit from using compute, particularly SMEs. Increased use of compute can boost economic growth, but more support is needed to meet user needs and help new users to navigate this landscape.
3.1 Introduction
Building on the findings of the GO-Science report, this chapter assesses the needs of compute users and the challenges they face in accessing and using compute.
There are different types of compute users and many ways to categorise them. This review has considered three main groups — academia, industry and the public sector — as well as users’ different maturity levels within each of these groups. The difference between traditional compute users and the growing community of AI users, who have specific requirements, needs to be recognised.
To distinguish between the maturity of users, this chapter has adopted a categorisation similar to that used by UKRI:
Pioneers are at the cutting-edge of computational research and rely on the most powerful systems. They are a small proportion of users, but demand resources of the highest specifications.
Established users typically use compute in a particular research domain and their technical requirements vary depending on the application.
Emerging users of compute include research disciplines and businesses that traditionally have not been compute-intensive, such as from the life sciences, digital humanities and social sciences.
The compute required for AI is distinct from that of more traditional uses. Traditional compute has historically used general purpose CPUs, whilst AI training and inference can be done at a much lower precision and uses specialised hardware accelerators such as graphics processing units (GPUs). Accelerators are orders of magnitude faster and more energy efficient than CPUs for AI-related tasks.
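The trade-off behind lower-precision arithmetic can be illustrated by storing the same number in different floating-point formats. The sketch below uses Python’s standard `struct` module to round-trip a value through IEEE 754 half, single and double precision and reports the storage size and rounding error for each. It illustrates storage and accuracy only, not speed or energy use, and the choice of formats is an assumption for illustration: the formats used by any particular accelerator vary.

```python
import struct

# IEEE 754 formats supported by the standard library's struct module.
FORMATS = {
    "half (16-bit)": "<e",
    "single (32-bit)": "<f",
    "double (64-bit)": "<d",
}


def round_trip(value: float, fmt: str) -> float:
    """Store `value` in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def precision_report(value: float) -> dict:
    """Map each format name to (bytes per value, absolute rounding error)."""
    report = {}
    for name, fmt in FORMATS.items():
        stored = round_trip(value, fmt)
        report[name] = (struct.calcsize(fmt), abs(stored - value))
    return report


report = precision_report(0.1)
for name, (size, error) in report.items():
    print(f"{name:<16} {size} bytes  rounding error {error:.3e}")
```

Half precision needs a quarter of the storage of double precision at the cost of a larger rounding error, which is a trade that AI workloads can often tolerate.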
Understanding the diversity of requirements is essential to assess the suitability of compute provision and identify where support is needed, both in the short and longer term. This information will be outlined throughout this chapter and is summarised in Figure 3A. It will be key to ensure any future investment in compute infrastructure meets users’ needs.
Figure 3A: Summary of the type of compute, applications and delivery of compute resources by user category
Users | Pioneers | Established Users | Emerging Users | AI Users |
---|---|---|---|---|
Type of compute | Cutting-edge computational research | Large scale modelling, simulations and data science | Small scale modelling and simulations | AI training and AI-based research |
Use | World-leading science, research, development and innovation. Examples: weather; energy; medicine; defence | Use in a particular research domain. Examples: aerospace; manufacturing and engineering | Use in traditionally non-compute-intensive disciplines. Examples: agriculture; engineering | Use in AI training and inference. Examples: autonomous vehicles; health and medicine |
Tiers | All tiers | Tiers 1 and 2, private facilities | Tier 3, commercial cloud | All tiers, commercial cloud |
Specific needs | Powerful systems (1 exaflop) | More accelerators, more capability (up to 150 petaflops) | Awareness, lower costs and technical support | At least 3,000 top-spec accelerators |
Shared needs (common to all users) | Skills, security, data, software, partnerships | | | |
3.2 Compute for academic users, research and scientific discovery
Academic and research communities use compute across many disciplines. Academic users present varied levels of maturity and include pioneers, established users and emerging users. As compute infrastructure underpins the research and innovation ecosystem, a wide range of operational and access models exist for public facilities. The majority of researchers rely on on-premise compute which is free at the point of use.
UKRI’s science case for supercomputing finds that a factor of 10 to 100 increase in UK computing power is needed if UK researchers are to deliver their immediate science goals. Compute infrastructure is essential to the competitiveness of the UK’s academic landscape and, without appropriate levels of access, the UK’s ability to attract international talent will be hindered.
Many universities partner with UKRI to provide access to Tier 2 systems. Academic users often use more than one facility. To access national level compute resources, academics typically submit an application that will be reviewed to assess the technical feasibility and the merit of the proposed research or project.[footnote 22] Currently this model prioritises excellent science rather than mission-led activity. Research centres and universities also gain access to compute through shared procurement frameworks and individual arrangements with commercial cloud providers.
Pioneers and established academic users
As set out in Chapter 1, compute is critical for world-leading science, research, development and innovation. Academic users of all maturities face substantial challenges from the lack of capacity in national compute facilities, resulting in significant unmet demand across all research areas. This is particularly true for those operating at the limits of compute capability. Academic users often use more than one facility and so interoperability is also a factor. As explored in Chapter 4, the UK’s compute ecosystem is fragmented and that limits researchers’ access to compute resources and ability to collaborate.
Emerging academic users
Those with less experience using compute face challenges around how to find compute resources suitable for their needs and how to access them. This is due to limited signposting of the resources offered by different public compute facilities. Emerging users also face challenges in acquiring the skills needed to use compute effectively and in accessing technical support.
AI researchers and accelerator-based compute
Accelerator-based compute is essential to AI breakthroughs and enables the creation of general-purpose technologies. Access to compute allows researchers to conduct small-scale exploratory AI research and larger teams to mature their models. Compute is critical to the most important advances at the frontier of the field, where massive models are trained on huge datasets using thousands of accelerators concurrently for weeks or even months. Research indicates that the amount of compute used to train frontier AI models is currently doubling every 5 to 6 months.
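To give a sense of what a doubling time of 5 to 6 months implies, the short sketch below converts a doubling time into an annual growth multiple and into the time needed for a thousandfold increase. It assumes the doubling time stays constant, which is a simplification; the function names are illustrative and not taken from the review.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Growth multiple over 12 months for a given doubling time in months."""
    return 2.0 ** (12.0 / doubling_months)


def years_to_grow(factor: float, doubling_months: float) -> float:
    """Years needed to grow by `factor`, assuming a constant doubling time."""
    return math.log2(factor) * doubling_months / 12.0


# The two ends of the 5 to 6 month range quoted in the text.
for months in (5.0, 6.0):
    per_year = annual_growth_factor(months)
    to_thousand = years_to_grow(1000.0, months)
    print(f"doubling every {months:.0f} months: "
          f"x{per_year:.1f} per year, x1000 in {to_thousand:.1f} years")
```

Under this assumption, a 6-month doubling time corresponds to fourfold growth per year and a 5-month doubling time to a little over fivefold, so a thousandfold increase would take roughly four to five years.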
Case study: Deep generative modelling of the human brain for healthcare[^23]
Contemporary brain scanners produce finely detailed images, but the tools used to analyse this imaging data have limitations. The gap between the volume of available data and the depth of analysis has direct implications for the quality of care doctors are able to provide patients with neurological diseases, which in turn can lead to years of productive life wasted.
Researchers at UCL and KCL are using AI techniques to tackle this challenge by developing a suite of ‘deep generative models’ which are trained on large-scale brain imaging data. Wherever a disease is associated with characteristic changes on brain imaging, these models enable doctors to identify the pathological component with a high level of precision. Such models therefore have the power to revolutionise neurological care and transform the value doctors can draw from brain imaging in healthcare.
Building deep generative models requires HPC. Specific requirements, such as large clusters of high-memory accelerators with high-bandwidth interconnect, are needed. The value of the approach cannot be demonstrated at smaller, prototype scales. Public compute was not able to support this research, with researchers procuring local compute clusters of dozens of accelerators and relying on free access to the commercial Cambridge-1 supercomputer.
UK AI researchers face significant challenges in obtaining the compute they need. At present, grant money often supports procurement of small systems for local use. The growing ‘compute divide’ between academia and industry is a global phenomenon. Researchers indicate that the relative lack of accelerator compute in the UK makes it particularly challenging to hire and retain talent within academia, in comparison with other countries and companies. Some researchers rely on international or industrial partnerships to pursue their work, leading to a loss of research independence. This also has implications for oversight and safety, and areas of research that have less direct routes to commercialisation. Without better access to more compute, breakthroughs may be prevented from diffusing throughout the economy via spinouts and start-ups.
In addition to their need for accelerators, AI researchers have particular requirements for data access. Large, open access datasets have been catalysts for significant scientific progress in AI. It is common for independent researchers, funded by private donations and start-up sponsorship, to collectively create these datasets. This type of data and AI research should be complementary to non-commercial, fully transparent research. AI researchers need to interrogate the underlying data to understand how training sets influence trained model behaviours.
3.3 Compute for business users
Use of compute is growing in many sectors of the economy. This section outlines the needs of pioneers and established business users (both large and small businesses) and those of emerging business users, AI businesses and AI adopters.
Pioneers and established business users
There are numerous business applications for compute. Access to compute enables businesses to innovate, develop new products and services, and improve productivity. Each of these benefits can help businesses gain a competitive edge by increasing efficiency and operational effectiveness.
Business users access compute in a variety of ways — via their own on-premise systems, commercial cloud or public-sector facilities. Having easy and sufficient access to compute is key. Some businesses, mostly larger enterprises, use in-house systems for their regular workload and use ‘cloud bursting’ to cover spikes in demand. Businesses can also partner with universities or national facilities, either through research collaborations or paying for system time. They generally do this when they are not able to access or fund internal resources, or want to collaborate or de-risk innovation. The majority of businesses that use compute do not currently require the most capable systems (i.e. exascale). Some pioneers may make use of exascale systems over the next decade, especially in partnerships with academia.
Some public facilities support business access to compute through the provision of expertise, compute resources and software. Unilever, for example, explained that access to compute at the Hartree Centre enables them to maintain competitive advantage by unlocking breakthrough innovations faster than competitors.[footnote 24]
Case study: Compute to boost skin’s natural defences
To help the skin protect itself, there is a fine balance of micro-organisms living in perfect harmony, known as a microbiome. The skin also releases proteins (antimicrobial peptides, or AMPs) which protect against harmful bacteria and viruses. In 2022, the Hartree Centre and IBM Research worked with Unilever to find a way to increase the skin’s natural line of defence to protect against infections or other issues, such as body odour, dandruff and eczema.
Unilever knew that niacinamide, an active form of vitamin B3 naturally found in your skin and body, could enhance the level of AMPs in laboratory models. The team used compute facilities at the Hartree Centre for advanced simulations to visualise, with atomic precision, how the niacinamide molecules interact with the proteins and the bacterial membrane. The extra compute power enables the simulation of much more complex systems by using more sophisticated and realistic modelling to represent bacterial membranes. This will help Unilever to develop new skin hygiene products and cosmetics, as well as providing the foundation for new drug development.
Other examples of private-public partnerships include GlaxoSmithKline with Oxford University and Rolls-Royce with EPCC. However, with the exception of the Hartree Centre, there is generally limited capacity available for businesses in public compute facilities.
Case study: Compute to accelerate the discovery of drugs
A 5-year collaboration between Oxford University and GlaxoSmithKline (GSK) was agreed in 2021 to establish the Oxford-GSK Institute of Molecular and Computational Medicine. Through the use of datasets and machine learning, the Institute aims to uncover new indicators and predictors of disease and use them to accelerate the most promising areas for drug discovery. Together they will use patient and molecular information and state-of-the-art platforms to pinpoint the GSK targets that are most likely to succeed and be developed into safe, effective, disease mechanism-based medicines.
Genetic evidence has already been shown to double success rates in clinical studies of new treatments. The digitisation of human biology alongside powerful compute has the potential to improve drug discovery by more closely linking genes to patients.
Case study: Large scale simulations at Rolls-Royce[^25]
Simulation and modelling enabled by compute have transformed the way Rolls-Royce designs and engineers its products. These simulations are not only key to innovation, but can also save the company millions of pounds in design, development and certification.
As the company’s in-house compute is prioritised for production design, access to public systems is crucial for research and development. The ‘Strategic Partnership in Computational Science for Advanced Simulation and Modelling of Virtual Systems’ (ASiMoV) seeks to develop the world’s first high-fidelity simulation of a complete gas-turbine engine in operation. Initiated in 2018, the project is jointly led by EPCC and Rolls-Royce and involves the Universities of Bristol, Cambridge, Oxford and Warwick, as well as two SMEs, CFMS and Zenotech.
Working towards the virtual certification of aircraft engines by 2030, ASiMoV will require a new high-resolution physical model of an entire engine and of the airflow and turbulence during its operation. This requires a trillion-cell model and simulated operation on millions of computing cores. ASiMoV demands the next generation of compute (exascale) to process the unprecedented amount of data robustly, securely and affordably.
Estimated cost savings are measured in tens of millions of pounds per engine programme. For Rolls-Royce, virtual certification could bring a major transformation requiring unprecedented trust, from both the company and certifiers worldwide, in ‘virtual twin’ simulation and fundamental changes to research and development.
Established business users report challenges in system performance, access to technical skills and an increasing demand for flexible and scalable compute architectures. Some stakeholders engaged by the review have reported that commercial cloud can, in some instances, result in lock-in effects and high data extraction costs, making it difficult for users to move between services. Furthermore, the use of public compute often includes a requirement to publish results. This can act as a barrier to some businesses adopting compute, particularly where intellectual property protection is a major concern.[footnote 26] Some businesses also noted that the costs of data management and governance can be an inhibitor for those who might want to use more substantial computing resources.
Lastly, the review considers those researching and developing quantum computing and its applications among the pioneer users of compute. Although the most impactful applications of quantum computing are still being identified, early examples include drug development and logistics optimisation.[footnote 27] The integration of classical compute and quantum systems will be critical as quantum computing develops, with applications relying on the complementary strengths of both systems to produce optimal results. Further assessment of the specific compute requirements of the quantum sector is needed so this nascent technology can thrive. The publication of the government’s National Quantum Strategy will set the government’s long-term vision for accelerating both quantum computing and other quantum technologies, supporting businesses to equip themselves for the future.
Emerging business users
Smaller businesses often have lower levels of technology adoption and could significantly benefit from compute uptake. With SMEs accounting for the majority of the UK business population, encouraging greater adoption of compute could result in significant economic benefits. Whilst businesses that already use compute often have similar requirements irrespective of their size, emerging business users have different needs.
Public facilities can play a key role in supporting emerging business users. They can support new users to access compute and software, in the past only available to academia and large-scale business; provide domain experts to develop business models; build the workflows required; and train staff. Some facilities provide these services for free, removing cost barriers.
The Hartree Centre
The Hartree Centre is a high performance computing, data analytics, artificial intelligence (AI) and hybrid quantum/classical computing research facility created specifically to focus on increasing adoption of advanced compute technologies in UK businesses.
Sectoral teams made up of digital experts and business professionals work with companies to help upskill staff and apply practical digital solutions to individual and industry-wide challenges for enhanced productivity, innovation and economic growth. In its first four years of operation (2013-2017) the centre generated £34.6 million net impact GVA for the UK economy.
In 2021, the Hartree National Centre for Digital Innovation (HNCDI) programme was established to further support businesses and public-sector organisations to advance the pace of discovery through the use of HPC, AI and quantum computing technologies.
Case study: ‘Platform as a service’ HPC for SMEs[^28]
QED Naval is an SME which specialises in supporting design and deployment of ‘green’ electricity generation installations that harness wind, tidal and wave energy. Sophisticated simulations are a key tool and typically have to be completed within challenging budgets and timeframes.
As part of its Subhub project, aiming to cut the cost of deploying tidal turbines, QED Naval needed to carry out simulations using computational fluid dynamics (CFD) software. In expanding its services, QED Naval decided to trial enCORE, the ‘platform as a service’ HPC offering delivered on behalf of the Hartree Centre by channel partner OCF.
Simulation runtimes were 4.2 times faster than those achievable in-house, enabling QED to increase the size of the models they used and run projects more quickly and efficiently, without increasing their overheads. enCORE has now become a key component of the engineering simulation services provided by the company.
Case study: Enhancing data science skills for SMEs
ORCHA is an SME that specialises in evaluating healthcare apps. What a ‘good’ health app looks like varies hugely, so ORCHA aims to help health professionals identify and prescribe the best app for their patients, enabling better monitoring and self-management of health conditions.
Through the LCR 4.0 project, ORCHA worked with data scientists at the Hartree Centre to explore new data-driven techniques that could speed up their evaluation process and develop a more sustainable business model.
Using natural language processing to begin automating the data collection process and creating classifier tools that sped up the categorisation and quality evaluation process by highlighting relevant terminology, the team enabled ORCHA to scale up and offer more insight in its app reviews. Working with the Hartree Centre in AI and data analytics has helped the company continue to grow, future-proofing their product and business by enhancing the speed and efficiency of their evaluation service.
Case study: HPC for optimising abattoir returns
As cattle approach maturity, large and expensive feed inputs reap increasingly small returns on the saleable meat at the abattoirs to which farmers sell their stock. Agri-tech business Innovent Technology uses video of live cattle and carcasses in the abattoir to help farmers optimise these inputs and maximise profit. Key parameters extracted from cattle images can be used to select the most advantageous time for farmers to sell and help abattoirs maximise returns on meat sold to market.
Capturing and analysing the shape and mass of live cattle requires large and computationally expensive operations. These computations can strain conventional computing devices and slow down the software used to monitor these settings. Innovent uses HPC to ensure they are reliably capturing and processing data, and is investing in compute to further reduce processing times. They are also investigating the feasibility of running certain operations on high-end GPUs, to improve processing times. This will enable Innovent to process higher resolution images at higher frame rates, and to extract a larger set of more precise data to inform the decisions of farmers and abattoirs.
As the EU’s former Fortissimo project demonstrates, adoption of compute can dramatically accelerate SME growth. In the UK, the Hartree Centre’s sectoral approach is proving to be successful, but an in-depth sectoral analysis is needed to determine the scale of the demand for compute in different sectors and help prioritise government actions.
Fortissimo - helping European SMEs become more competitive
The EU’s Fortissimo project was created to help manufacturing SMEs overcome the cost, skills and awareness barriers to adopting compute. Coordinated by the UK, it established a virtual marketplace that connected industry players with experts at National Competence Centres providing cloud-based HPC, expertise, applications and tools.
The pay-per-use model adopted by the project proved to be an efficient way of helping businesses to considerably lower their production costs, reduce the time needed to bring products to market and improve their products and services. Between 2013 and 2018, 92 SMEs from different European countries and sectors participated, with success stories presenting the benefits across the value chain.
Many smaller businesses remain unaware of the commercial benefits that compute could bring to their businesses. SMEs have limited time, capacity and skills and so require greater support. Even where businesses appreciate the benefits of compute, they still need support and resources to realise them. Policies and support programmes are already in place, but they need to be made accessible and attractive to the companies targeted.
AI companies and AI adopters
AI business users include those who develop innovative technologies (AI-based companies or AI developers) and businesses who adopt AI products (AI users). Compute is core to these companies’ ability to unlock the benefits of AI. The National AI Strategy has outlined the need to transition to an AI-enabled economy by supporting AI businesses and enabling organisations to harness the power of AI. It is estimated that around 15% of UK businesses have adopted at least one AI technology, with large companies more likely to be at the forefront of AI adoption.
Case study: Compute to enable self-driving cars[^29]
Wayve is a British automated vehicle (AV) start-up founded in 2017 by experts in artificial intelligence from the University of Cambridge. The company uses a novel approach of end-to-end machine learning to give vehicles the embodied intelligence to drive in complex urban environments.
Wayve is pioneering AV2.0, a next-generation autonomous driving system that can quickly and safely adapt to new use cases and driving domains anywhere in the world. Its ‘driving intelligence’ AI models learn to drive from petabytes of driving data and can adapt those learnings to different vehicle types and new geographies without re-engineering. Wayve expects its AI-driven approach to scale significantly faster than traditional approaches to AV development, unlocking new markets and encouraging the wider adoption of self-driving vehicles.
Wayve recently began commercial trials with Asda and Ocado Group, pushing its machine learning workloads to new levels of scale. Wayve is partnering with Microsoft to leverage the supercomputing infrastructure needed to accelerate the development of billion-parameter foundation models for autonomous vehicle systems. As these applications advance, supercomputing capabilities are crucial for processing the immense data required to simulate, validate and train AI models that enable safe and secure autonomous driving.
Case study: AI compute for battery maintenance
Faraday Battery is a start-up looking to manufacture rechargeable battery packs for large electric vehicles like trains and buses. The company was looking for a way to measure and quantify the health of a battery cell in real-time and display warnings of cell health.
Using compute for AI and data analysis, the Hartree Centre worked with Faraday Battery to investigate data from individual battery cells and develop tools to measure their normal operation. Machine learning enabled the prediction and quantification of variables affecting cell health, highlighting cells needing maintenance in real-time.
Being able to monitor the health of batteries helps manufacturers to understand how well their batteries are performing overall, detect failures sooner and prevent serious issues. It could also help with overall maintenance scheduling and cost-saving by replacing individual cells, rather than full battery packs, when needed. Future net zero transport providers could also keep track of their resources more efficiently.
Almost all AI users use commercial cloud. This includes developers using large-scale systems to train their models and users deploying cloud-based pre-trained models or AI services. Industrial AI research has been enabled by massive access to accelerator-driven compute through the cloud, from the exploratory phase of research when trialling different model settings, through to final training runs once optimal settings have been found. These training runs can have compute costs into the millions of exaflops for the largest models, equivalent to using compute for weeks or months continuously.[footnote 30]
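The link between a total compute budget and weeks or months of continuous running can be illustrated with a simple calculation. The sketch below reads "exaflops" here as a count of operations (one exaflop taken as 1e18 floating-point operations) and divides by an assumed sustained throughput. Both input values are hypothetical round numbers chosen for illustration, not figures from the review or from any specific system.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_exaflop: float, sustained_petaflops: float) -> float:
    """Days of continuous running needed to perform `total_exaflop`
    (units of 1e18 operations) on a system that sustains
    `sustained_petaflops` (units of 1e15 operations per second)."""
    total_ops = total_exaflop * 1e18
    ops_per_second = sustained_petaflops * 1e15
    return total_ops / ops_per_second / SECONDS_PER_DAY


# Hypothetical inputs: one million exaflop of total work on a system
# sustaining 100 petaflop/s throughout the run.
days = training_days(total_exaflop=1e6, sustained_petaflops=100.0)
print(f"{days:.0f} days of continuous compute")
```

With these assumed inputs the run lasts roughly 116 days, which is consistent with the "weeks or months" described above. Real training runs achieve only a fraction of a system's peak throughput, so the sustained figure matters more than the headline one.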
Meanwhile, programmes such as Digital Catapult’s Machine Intelligence Garage help businesses of all sizes to gain access to compute and expertise. It also assists AI and machine learning companies with the application of their technology, to enable them to bring compute-enabled solutions to other sectors such as creative industries. The programme was oversubscribed and could support ten times as many teams.
3.4 Compute for public-sector users
Government departments and agencies use compute in a wide variety of domains, with compute being provided via a mix of both classical on-premise systems and cloud-based services. There is large variation in the maturity of users in the public sector. The Met Office has one of the most advanced compute systems in the world. Other established public sector use cases include environmental, defence and health applications. However, the National Data Strategy sets out that there is massive untapped potential in the way the government uses data. Compute will be critical in achieving the objective to transform government’s use of data to drive efficiency and improve public services. This section explores the public sector pioneers of compute and emerging users.
Public-sector pioneers
The Met Office is the largest government user of compute. In 2020, up to £1.2 billion was invested in compute for a weather and climate supercomputer, keeping the UK at the forefront of science and enhancing the country’s resilience to the changing climate.[footnote 31] The Ministry of Defence uses compute in weapons production and weapons platforms management, cryptography, combat simulation and missile defence, and is increasingly investing in AI, space and cyberspace compute. The Atomic Weapons Establishment simulates the use and storage of nuclear weapons. The Department for Environment, Food and Rural Affairs (DEFRA) uses big data analytics in policy models ranging from flood defence to agricultural compliance and food chain analysis. NHS Digital uses compute to support clinicians and patients and improve treatments.
Case study: AI for Peatlands Project[^32]
The AI for Peatlands project uses AI models to detect man-made drainage features in peatlands. These drainage features, known as ‘grips’, were historically dug to improve land for agriculture. It has since been discovered that they have caused peatlands to release large volumes of stored CO2, increased flood risk and destroyed vital habitat for a range of wildlife.
The project leverages compute provided by the Data Analytics & Science Hub (DASH) — a cloud-based platform that provides scientists and analysts across DEFRA with compute and accessible datasets. The aim is to locate drainage features, so that teams can block them, trapping the water back into the peat and turning it back into a carbon sink. The DASH platform, in collaboration with Microsoft Azure, has enabled the project team to run deep learning experiments and utilise image classification, semantic segmentation and object detection techniques.
This work has used deep learning models to accurately map grip clusters, grip channels and peat dams. The next stage will be to develop more accurate machine learning models and offer greater insight into the peatland quality, alongside producing a national map of peatland surface features. This will require a larger amount of data and additional computational power.
Compute is increasingly being used at the local level for data analysis. The Hartree Centre is currently exploring several projects in healthcare, social care and social housing to support public bodies across the country. For instance, one potential project looks to explore how compute and AI might be used to look at images of mould in properties to determine its seriousness. The DiRAC HPC facility has also supported a series of Innovation Placements with the Guy’s and St Thomas’ Trust and the NHS Get It Right First Time (GIRFT) programme. These placements have developed AI-based tools for predicting which patients will fail to turn up for appointments and assessing clinical needs of frail patients.[footnote 34]
Emerging public-sector users
The potential of compute to revolutionise the public sector, the functions it delivers and the policies it sets, is significant. For example, compute can be used for real time decision making and more sophisticated computational models to help government understand the consequences of policy choices. However, as with smaller businesses, awareness and access outside of established users appears to be limited. The review has found evidence gathering for public-sector use of compute challenging. Aligning the work of the cross-government Integrated Data Service, which provides access to leading cloud technologies and a broad range of data sources to enable faster and collaborative analysis for robust policy-making, could play an important role.
3.5 Common challenges facing users
UK users of compute have widely varying requirements. This creates specific challenges depending on the user group and maturity level. However, there are also a number of common challenges faced by users that need to be addressed to unlock and maximise the benefits of compute, now and in the future.
Awareness and support to access compute
Better information about the benefits of compute as well as the availability of and access to public resources is needed for both existing and future users. Stakeholders have also indicated that uncertainty around the future of national provision means that users may defer investment decisions for their own compute resources until plans for national provision are clearer. Effective signposting is needed to provide different communities with the information they need.
This lack of awareness disproportionately affects SMEs. Evidence demonstrated a clear consensus that smaller businesses need help to understand the opportunities brought by compute adoption and how to access compute resources. Awareness events run by facilities are reported to be successful at addressing this issue. While the government’s investment in the Hartree Centre demonstrates commitment to supporting new users, there remain many more potential users to identify, educate and support. Increased awareness would support new users and could give further impetus to the development of compute-dependent industries, stimulating compute demand and the associated economic benefits.
Furthermore, gaining access to compute and learning how to use it creates additional costs for users. While the UK government has taken positive steps to address some of the cost barriers that inhibit cloud adoption — announcing in Autumn Budget 2021 that R&D tax reliefs would be reformed to include data licences and cloud computing services costs — few stakeholders engaged by this review were aware of the initiative.
Action required: Broaden the use of compute
The government should raise awareness and support the adoption of compute across the economy, particularly among SMEs. The government should take a proactive approach, improving signposting of available resources and increasing coordination across the compute ecosystem. This should include supporting the work of public facilities, considering how facilities can work together to help new and emerging users and promoting existing schemes, such as R&D tax credits.
Capacity at public facilities
UK facilities are currently running at capacity, with users struggling to access the compute resources they need. With demand for compute expected to increase, problems arising from limited compute capacity will continue to worsen. The loss of access to EU systems will further limit the compute capacity available to UK users. As the cost of leading-edge compute would be prohibitive for any single business or institution, the government must invest in UK compute capacity. Further assessment of compute capacity will be presented in Chapter 4.
Capability for different users
Facilities need to support diverse workloads, from the most computationally intensive to those requiring specialised architectures, as in the case of AI research. Investing in exascale infrastructure and accelerator-based systems is key to increasing UK compute capability.
More powerful systems are increasingly required to perform cutting-edge, transformative research in academia and industry. Research and development at the frontier of AI is currently only achievable using significant compute capabilities. That said, investing in exascale capability will require supporting the whole compute ecosystem, including increasing compute capacity across other tiers.
In addition to exascale, greater provision of accelerator-based computing systems has been identified as essential to meet the current and future needs of the AI community. At present, there is limited accelerated compute at Tier 1 in the UK, and Tier 2 facilities with accelerators are at maximum capacity or under-resourced.[footnote 35] There is anecdotal evidence that this lack of accelerator-based compute for AI researchers is preventing breakthroughs, pushing AI researchers out of academia and out of the UK.
Access to data and skills
Access to data is integral to users’ ability to exploit compute. However, the UK data sharing landscape is complex and significantly fragmented, with users unsure how or where to access data. Varied access and licensing models as well as data interoperability issues can make it challenging to combine data from multiple sources. Meanwhile, commercial datasets can be prohibitively expensive for researchers to license. Whilst data is out of scope, this review supports the implementation of the National Data Strategy to enable safe and secure access to and sharing of data by recognising the role of compute.
Furthermore, access to the skills necessary to understand and use compute effectively has repeatedly been the biggest issue raised by stakeholders. It is essential that the UK grows a sustainable and diverse skills pipeline to support greater uptake. This issue will be further explored in Chapter 5.
Action required: Create a diverse, healthy and integrated compute ecosystem that meets user needs
The current provision of compute in the UK does not sufficiently meet the needs of either existing or emerging users. An integrated, coordinated, diverse and resilient ecosystem will improve efficiency and maximise compute’s impact for all users.
The government should take a holistic approach to meeting the demands of all users, both new and existing, and recognise the need for exascale and accelerator-based systems. In practice, this means expanding the provision of hardware, software, skills and data across all tiers of public facilities. All components are critical and interdependent. This system-wide view must be at the forefront of the government’s approach to delivering the recommendations set out in this report.
3.6 Policy implications
There is the potential to boost economic growth by broadening the use of compute across the economy — particularly among smaller businesses. However, current compute provision does not meet demand.
The government should ensure the sufficient supply of compute resources and reduce the barriers to entry, supporting all users to access and use compute. This will require addressing the challenges currently faced by users and creating a diverse ecosystem. Longer-term, strategic planning is essential to develop an ecosystem that provides capacity and capability at local, regional and national levels, and delivers services in an integrated way. Chapter 4 and Chapter 5 will explore how the government can meet compute demand and create a vibrant compute ecosystem.
4. Meeting the UK’s compute needs
Key findings
- UK public compute provision is not meeting user demand. Without targeted action, the UK will have insufficient compute capacity and capability at all tiers over the next decade. This means the UK is becoming less attractive to researchers, with potential repercussions for competitiveness and innovation-led economic growth.
- The UK needs a diverse compute infrastructure that includes a growing role for commercial cloud. Commercial cloud needs to be supplemented by public compute to meet demand, particularly from academic researchers.
- The UK must invest in an integrated compute ecosystem with varying capabilities and capacities to maintain its science and technology leadership and grow its economy. This means ensuring that researchers and industry have access to exascale infrastructure, which they need to be internationally competitive, and immediately planning for the next generation of compute.
- The UK needs to increase its AI compute capacity immediately. Ambitions to make the UK an AI superpower require an immediate, significant and sustained investment in accelerator-driven compute capabilities that directly support AI research.
- To build sustainable systems for the future, the UK must produce guidance, promote innovation and shape procurement practices. Managing the environmental impact and energy demand stemming from building and operating compute will be essential, particularly on the path towards exascale.
- The UK must support the development of software and skills to get the most out of its infrastructure. These enablers should be considered alongside any future infrastructure investment decisions.
4.1 Introduction
Access to compute is essential for science and innovation. At present, there remains a gap between what users need and what they are able to access in the UK. The UK’s current publicly funded infrastructure has traditionally focused on a core set of users that require specialist compute. However, these systems are operating at capacity and are frequently oversubscribed. Additionally, the demand for compute has broadened, with AI-specific technologies requiring significant levels of computational power.
At present, there are limited plans for investment in next generation systems to respond to the dual pressure of increased demand and a broadening user base. Without the provision of sufficient compute infrastructure, UK-based science and innovation is restricted. If the status quo persists, more users are expected to seek access to compute outside the UK, with negative consequences for the UK’s talented research base. The UK needs to take action quickly and commit to and build world-class compute infrastructure. This should be done through a mixture of public, commercial and private systems, and in the most sustainable way feasible. Necessary actions include: the development of a UK exascale ecosystem; investment in all tiers of the UK computational infrastructure to support the varying compute needs; a national AI capability; and improved access to public systems via cloud interfaces.
Compute infrastructure has four functions: to process, store, transfer and generate data. This requires heterogeneous systems fit for the growing uses outlined in Chapter 3, underpinned by sufficient operational infrastructure and paired with the necessary software and engineering expertise. This chapter explores the UK’s current and future compute infrastructure landscape and outlines what is needed for it to be fully operational.
4.2 UK’s current system infrastructure landscape
The GO-Science report provides a comprehensive summary of the UK’s compute infrastructure. This comprises a loose collection of publicly supported compute facilities, including a small number of AI compute systems, supplemented by commercial cloud services.
Publicly supported compute
Numerous government departments are involved in decisions on the public funding of compute. UKRI funds national and regional facilities in the UK. This includes: ARCHER2, the UK’s most powerful system; DiRAC, providing targeted compute for theoretical research and GPUs to develop AI and hybrid capabilities; the Hartree Centre, providing support for UK industry and the public sector at Tier 1; as well as a number of regional Tier 2 clusters. In Wales, the Supercomputing Wales cluster was part-funded by the European Regional Development Fund (ERDF) through the Welsh Government, with support from partnering universities.
Many Public Sector Research Establishments (PSREs), Research Council Institutes/Units, Independent Research Organisations and Higher Education Institutions (HEIs) use and fund compute. Operational decisions are often devolved to individual facilities, which act autonomously, each focusing on a particular community of users. The Met Office and AWE, amongst other PSREs, invest in and operate their own systems, which offer limited or no access to other users. There are also about 42 Higher Education Institutions that self-fund local (Tier 3) compute facilities, which are managed locally to support students and research groups.
This range of actors involved in the delivery, funding and operation of compute has led to a fragmented landscape. Whilst there is some coordination of compute through UKRI’s oversight of digital research infrastructure and through the HPC special interest group, there is no overarching strategic planning for public compute infrastructure.
ARCHER2
ARCHER2 is a world-class compute resource for UK researchers, provided by UKRI. After dropping 6 places in 2022, it ranks 28th globally, providing around 20 petaflops (Rmax) of compute over 5,860 nodes. Each node consists of two 2nd generation AMD EPYC processors.
ARCHER2 supports hundreds of applications and about 4,000 users to improve cloud modelling for better weather and climate prediction; predict aircraft jet noise; forecast the dispersion of volcanic ash and gas; and many more.
The allocation of compute resources differs by workflow, but the basic unit of compute consists of a node for an hour. Prioritisation is done according to research grants, specific calls for the research councils and through the PRACE consortia with decisions taken by a panel.
As previously noted, some public facilities work with commercial partners to provide compute. The Hartree National Centre for Digital Innovation is a collaborative five-year programme between the STFC and IBM. It provides compute resources and services to UK businesses and the public sector, including training for mid and early career researchers. The Cambridge Dell Intel Centre and NVIDIA’s Cambridge-1 provide extensive compute resources for healthcare and life sciences research.
Despite the diverse network of public infrastructure, facility directors have reported that their systems are operating at capacity and in many cases are significantly oversubscribed. Furthermore, the allocation process for determining who can access systems is administered by different public bodies. Whilst this can allow for high quality and efficient use of individual systems, a variable and somewhat uncoordinated approach to allocation may have unintended consequences. For example, it could lead to inefficient allocation of resources and restrict utilisation, make it more difficult to plan for future infrastructure and limit the allocation to emerging users. It may also mean that research outcomes are not aligned with national priorities.
Public systems utilisation[footnote 36]
Public compute facilities directors report most systems are at capacity or oversubscribed.*
- ARCHER2 use has averaged 88% since launch in January 2021
- DiRAC is 120% oversubscribed
- Baskerville has averaged a utilisation rate of 82% in 2022, with a peak of 96%
- CSD3 resources have already been fully allocated 6 months ahead
- JADE has reached the target utilisation of 80% of the available academic time
*To note: Systems often run multiple workloads simultaneously and the demanded resources rarely use every node, meaning over 80% is close to maximum capacity in practice.
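As a rough illustration of how these utilisation figures translate into the basic allocation unit of a node for an hour, the sketch below converts ARCHER2’s reported node count into node-hours per year and applies the reported 88% average utilisation. The function name, the 365-day year and the absence of maintenance downtime are illustrative assumptions, not any facility’s accounting method.

```python
# Illustrative sketch only: constants below are assumptions unless noted.
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, no scheduled downtime


def node_hours_per_year(nodes: int, utilisation: float = 1.0) -> float:
    """Node-hours delivered in one year at an average utilisation in [0, 1]."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return nodes * HOURS_PER_YEAR * utilisation


ARCHER2_NODES = 5_860        # node count quoted in this report
ARCHER2_UTILISATION = 0.88   # average use reported since launch

theoretical = node_hours_per_year(ARCHER2_NODES)
delivered = node_hours_per_year(ARCHER2_NODES, ARCHER2_UTILISATION)

print(f"Theoretical capacity: {theoretical:,.0f} node-hours per year")
print(f"At 88% utilisation:   {delivered:,.0f} node-hours per year")
print(f"Unused headroom:      {theoretical - delivered:,.0f} node-hours per year")
```

The remaining headroom is not freely allocatable in practice, since jobs rarely fit the available nodes exactly, which is why utilisation above 80% is treated as close to full.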
Commercial cloud compute
Demand for commercial cloud is increasing quickly. The breadth and variety of configurations on offer is growing (for example, parallelisation, fast networks, and AI accelerators), making cloud suitable for many users. It is widely used as an accessible, flexible and scalable resource, particularly by businesses and the AI community. Facilities directors reported the need to purchase cloud credits on behalf of researchers: for example, the Alan Turing Institute does so on behalf of AI researchers and the Digital Catapult on behalf of AI startups. Researchers report using commercial cloud where public resources are unavailable or when they need more accelerators.
Commercial cloud should be seen as a key component of the UK’s compute infrastructure. It provides rapid access to compute, supplementing the capacity of public systems (‘cloud bursting’) and giving access to other capabilities unavailable elsewhere. Some commercial cloud also provides hybrid use models, as in the case of Microsoft’s partnership with the Met Office, which deploys Azure cloud alongside an HPE system.
However, commercial cloud does have limitations and cannot be seen as a universal substitute for public infrastructure. It may be unsuitable for researchers who need specific networking or compute performance, and for those who need to port large datasets between systems. Some researchers have reported that the cost of commercial cloud is prohibitive, although there are a growing number of more affordable commercial cloud options (such as spot pricing and long-term contracts).
AI compute
As explored in Chapter 3, AI users are heavily dependent on hardware accelerators. However, there are significant shortages in public accelerator capacity in the UK, with fewer than 1,000 NVIDIA A100 GPUs available to researchers.[footnote 37] Facilities directors reported systems that provide accelerator-based compute are running at capacity or oversubscribed. Some public-sector AI researchers have pro-bono access to private computing systems, such as university and NHS access to NVIDIA’s Cambridge-1, or have collaborations with leading private labs. However, such access is ad hoc, preferential and cannot be relied upon to support important public research in the long term.
The UK’s public provision of top-spec accelerators should be seen in the context of other leading AI nations and the private sector. The EU and US are well-positioned, with Leonardo offering 14,000 and Perlmutter 6,000 A100 GPUs respectively. These public systems are not dedicated to AI and resources are shared with other areas of research. In contrast, large private companies have built or leased systems to exclusively train their AI models. For example, in 2020 OpenAI used 10,000 GPUs on Azure, and in 2022 Meta used 21,400 A100 GPUs, including on Azure. Anthropic recently suggested that the US public sector would need accelerator-based resources in the order of 100,000 top-spec GPUs to be truly competitive when compared to private sector laboratories.
Case study: Neural networks for drug and material discovery in quantum chemistry
In 2020, researchers from Imperial College London and DeepMind introduced a novel approach to solving the many-particle Schrödinger equation, which is fundamental to understanding molecules and chemistry. They used a novel formulation of neural networks to approximate the equation that could be trained without the use of external data. The approach matches and often beats the best conventional approaches to quantum chemistry for small molecules. The eventual goal of this fundamental research is to make computational chemistry flexible and precise enough to replace laboratory-based chemistry in many instances. This would speed up and simplify the development of new drugs, catalysts, fuel cells, batteries and electronics, amongst many applications.
This research was enabled by a grant of 850,000 GPU-hours on the German JUWELS Booster supercomputer and would have been impossible to do in the UK as training compute requirements were beyond the capability of UK public facilities. Industrial partners were unable to provide access to internal computing systems, while commercial cloud had prohibitive costs and lacked suitable high-bandwidth interconnect between accelerators.
Academic and industrial research groups in the US, China, and other European countries are now building upon this work, through access to the accelerators needed to drive future advances and capture the commercial value of this research. Without significantly increased UK accelerator capacity, there are serious challenges to the ability of UK researchers to remain competitive in this area.
Access via international compute
Collaboration is essential for delivering world-class innovation and research. International facilities foster global research consortia and support joint endeavours that can provide access to additional compute capacity. The UK is a respected partner in many of these programmes, owing in part to its excellence in software development and skilled engineers.
UK users have access to some EU systems including LUMI (which provides 309 petaflops) and Leonardo (with NVIDIA Ampere GPUs that deliver 10 exaflops of AI performance). Successful bids on these systems often provide researchers with significantly more resources than are available in the UK. For example, one researcher reported that, by accessing EU resources, they received six times more GPU hours than they could access in the UK. By 2027, many existing EU systems will be decommissioned and the UK will not have access to the systems that will succeed them. UK researchers can also access US Department of Energy (DOE) systems, including Frontier, the first exascale system, via the INCITE programme. For example, UCL leads a project using deep learning to speed up drug discovery.
4.3 UK’s future infrastructure requirements
Chapter 3 explained that demand from academic, business and public-sector users will exceed current compute capacity and capabilities. Compute systems require significant planning and investment, with many having a lifespan of around five years, after which they require substantial upgrade or replacement. Maintaining current capability requires planned investment via public or private routes. However, there are currently no public funding commitments for future systems of international or national class (Tier 0 and 1), which provide the most capability and support for pioneer users, and research and innovation.
Existing supply must not only be upgraded or replaced, but also substantially increased to meet growing demand from established and emerging users. In the US, Summit has been available to users since 2018, providing access to nearly 150 petaflops of compute, and in 2022 the US broke the 1 exaflop barrier with Frontier for pioneering public research. The UK must ensure that there is heterogeneous compute across all tiers. The following sections make the case for investment in next generation exascale systems, and the pressing need to support the growing AI community.
The need for exascale
Exascale is the next generation of compute. The GO-Science report set out how such developments are expected to open up a range of new use cases and previously intractable areas of research. The US broke the exascale barrier with Frontier in June 2022, enabling huge workflows of parallel computing to develop new technologies for energy, medicine, and materials. The government has indicated that UKRI’s strategy is to deploy an exascale system by 2025. However, there is currently no funding in place to deliver this ambition.
As exascale becomes more widely available, demand is expected to emerge from research-intensive businesses. There are many sectors that derive significant value from cutting-edge innovation, such as pharmaceutical, material, automotive and engineering, logistics and finance sectors. Provision of exascale compute is crucial for the UK to remain an attractive destination for private R&D investment and world-class research.
Any future exascale system will need to be heterogeneous to suit the variety of users it supports. It will need accelerators to support the AI community and hybrid workflows of traditional users, fast storage for users needing high throughput and sufficient scale to keep developing high performance codes ready for the next generation of compute. An exascale ecosystem must support greater use by businesses, public sector and a broader research community. This will need careful planning, design and compromise that incorporates funding for the development of exascale-ready hardware, software and skills and considers future technological convergence.
Action required: Make immediate investments in the pathway to public exascale capability
Exascale capability is an essential component in achieving the UK’s long-term ambitions to be an AI superpower, deliver world-class research and be a global hub for innovation. Exascale opens up exciting new opportunities in research and innovation to grow the economy, enabling researchers to build more complex models and simulations. It will allow researchers to understand climate change, power the discovery of new drugs and advance the UK’s engineering capabilities. It is a critical component in maximising the UK’s potential in AI. However, it must be ensured that all tiers of compute, as the foundations of exascale, are also maintained at an appropriate level.
The UK must continue to engage closely with international exascale operators to understand the lessons learned in other countries and the delivery challenges involved. For example, collaboration between the Hartree Centre and the US’s Exascale Computing Project (ECP-USA) supports the development of scalable algorithms, scalable AI and exascale software environments. Whilst exascale may no longer be cutting edge at the moment of delivery in the UK, some technological risks remain and the funding commitment is substantial (Frontier cost about $600 million and Jupiter has a budget of €500 million). According to compute suppliers, it would take around 18 months from contract award to build and deliver operational exascale capability. In preparation for the decommissioning of existing Tier 1 systems, including ARCHER2 in 2025 based on facility directors’ estimates, the government must commit to exascale as a matter of urgency.
The path to exascale
There is consensus from across the community that there must be a clear plan for delivering exascale and that the UK needs to start laying the path to exascale today. A plan must allow the community to transition codes to a more powerful and accelerated system and develop the skills needed to maximise the use of exascale technologies.
Given the size of an exascale system, there are different options to deliver it. On balance, the review has concluded that exascale capability should be reached through a phased approach. This would deliver exascale-ready public capability immediately, adding additional hardware that increases the capability to a full exascale ecosystem by 2026. This phased approach maximises the effectiveness of government investment by matching capability with readiness and managing the risk of early adoption.
This ecosystem should be heterogeneous and include accelerators and configurations that reflect the variety of Tier 1 workflows, including but not limited to HPC, HTC, AI and data science. The system relies on investment in all tiers of compute to support user progression and provide a sustained flow of skills to optimise operation.
The key delivery phases should include:
Phase 1: Immediately deliver hardware that supports a wide range of demands from research and business communities. This should provide at least 250 petaflops with enough performance and capacity to support current and future user requirements. This is internationally competitive and an order of magnitude more powerful than ARCHER2 to support exascale-ready code development. An expanded software development programme is necessary.
Phase 2: Deliver hardware that has at least one exaflop of processing power by increasing performance using compatible hardware. This should expand Phase 1, to deploy the most appropriate architectures for the entire community. This should be delivered no later than 2026, and within 2 years of Phase 1 to maximise investment.
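The scale of the two phases relative to ARCHER2 can be checked with simple arithmetic. The sketch below is illustrative only: it uses the figures quoted in this report (around 20 petaflops for ARCHER2, at least 250 petaflops for Phase 1 and at least one exaflop for Phase 2), treats the minimum targets as if they were delivered performance, and compares them directly even though benchmark bases may differ. The variable names are hypothetical.

```python
# Illustrative sketch: figures are the minimum targets quoted in this report.
ARCHER2_PFLOPS = 20       # "around 20 petaflops (Rmax)"
PHASE1_PFLOPS = 250       # Phase 1 minimum
PHASE2_PFLOPS = 1_000     # Phase 2 minimum: one exaflop = 1,000 petaflops

phase1_vs_archer2 = PHASE1_PFLOPS / ARCHER2_PFLOPS   # 12.5x
phase2_vs_phase1 = PHASE2_PFLOPS / PHASE1_PFLOPS     # 4.0x
phase2_vs_archer2 = PHASE2_PFLOPS / ARCHER2_PFLOPS   # 50.0x

print(f"Phase 1 is {phase1_vs_archer2:.1f}x ARCHER2")
print(f"Phase 2 is {phase2_vs_phase1:.1f}x Phase 1")
print(f"Phase 2 is {phase2_vs_archer2:.1f}x ARCHER2")
```

On these figures, Phase 1 is roughly an order of magnitude above ARCHER2, and Phase 2 requires a further fourfold expansion within two years.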
The need for AI compute
AI technologies are increasingly reliant on computationally-intensive models. The UK is currently ill-equipped to support public AI research due to its limited number of top-spec accelerators. The situation is exacerbated by researchers reporting that they incur high costs to use commercial cloud at scale, limiting the ambitions of the problems they can address. Furthermore, there is concern that a lack of specialist compute, awareness of resources and ease of access could make UK academia less attractive to the best of global AI talent, which is scarce and the focus of fierce international competition.
In an increasingly competitive geopolitical landscape, inadequate compute resources mean the UK lacks the agency to steer the development of frontier AI in a manner aligned with the UK’s values and objectives. To meet its ambitions on AI, the government must make immediate, significant and sustained investments in accelerated compute capabilities that directly support AI tools and research.
The Turing Foundation Model
The Alan Turing Institute (ATI) has been developing plans for a sovereign UK AI resource, the Turing Foundation Model. Their vision is to lead ambitious foundational research of building large language models, bringing UK public sector research to the frontier of AI, in partnership with government bodies and the private sector.
ATI are motivated by the fact that most contemporary AI technologies involve massive models trained with extremely large data sets. The compute resources required place such technologies out of reach for UK researchers, leaving only a handful of large, mostly foreign technology companies able to develop them.
In addition to providing a trusted AI resource for the public sector, the project will increase the UK’s compute capability and capacity for future projects, and train hundreds of PhD students at the frontier of AI. Furthermore, the initiative would support innovation and commercialisation through spinoffs and start-ups from the outset.
Accelerator-based compute is required at all tiers, from local university systems to future national exascale systems, where AI research must be a first-class use case. Facilities need to support ambitious large-scale AI projects carried out by research consortia as well as creative, small-scale experimentation done by students and small academic groups. Systems should also have capacity to support the use of large models shared amongst multiple users and should provide a platform for private industry models to be securely shared through structured access to select users.
At the national level, significantly more resource is required to allow AI researchers to conduct on-demand exploratory research and concurrently train large language models. The panel considers that at least 3,000 top-spec accelerators would meet this need in the immediate term. This is based on the current supply of accelerators to UK academics; investment by international competitors; and the scale of resources required to support the growing UK AI community as well as large-scale training runs. This would be greater in ambition than the BLOOM model trained by French researchers on the Jean Zay supercomputer (384 A100s for 4 months) or the GLM-130B model trained by Chinese academics in partnership with a start-up (768 A100s for 2+ months). Accelerator capacity should be scaled over time as user demand increases.
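To put the training runs cited above in context, the following sketch estimates their size in GPU-hours and how long an equivalent workload would occupy a pool of 3,000 accelerators. It is a back-of-envelope illustration: the 730-hour month, the lower bound of two months for GLM-130B, the assumption of perfect scaling across the whole pool and the function names are all assumptions made here, not figures from the panel.

```python
# Illustrative sketch only; see the stated assumptions in the comments.
HOURS_PER_MONTH = 730  # assumption: average month (8,760 hours / 12)


def gpu_hours(gpus: int, months: float) -> float:
    """Total GPU-hours for a run using `gpus` accelerators for `months`."""
    return gpus * months * HOURS_PER_MONTH


def days_on_pool(total_gpu_hours: float, pool_size: int) -> float:
    """Wall-clock days if the whole pool ran the job with perfect scaling
    (an optimistic assumption; real jobs rarely scale linearly)."""
    return total_gpu_hours / pool_size / 24


PROPOSED_POOL = 3_000  # minimum number of top-spec accelerators proposed

runs = {
    "BLOOM (384 A100s, 4 months)": gpu_hours(384, 4),
    "GLM-130B (768 A100s, 2 months, lower bound)": gpu_hours(768, 2),
}

for name, hours in runs.items():
    print(f"{name}: {hours:,.0f} GPU-hours, "
          f"about {days_on_pool(hours, PROPOSED_POOL):.1f} days on the pool")
```

Under these assumptions both runs come to roughly 1.1 million GPU-hours, so a 3,000-accelerator resource could in principle host a run of that size in a few weeks while leaving the rest of the year for other users.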
Recognising that large systems take time to procure and build, and may not meet the urgent need for AI compute capacity, commercial cloud could be used to rapidly meet existing demand and act as a pilot to inform the business case for long-term AI compute provision. This should be done in parallel to investments in existing and new accelerator-based systems. Government could explore a partnership with industry or a public organisation to quickly deliver the necessary compute for AI. In addition to potentially reducing the amount of public investment required, this would have the advantage of pooling expertise and resources, supporting the use of the latest AI software, improving research quality and increasing the flow of ideas across fields.
Similar to the model used by the Digital Research Alliance in Canada, individual accelerator access should be available to any UK AI researcher to support exploratory research and small-scale experimentation. More ambitious large-scale projects should be offered access to the necessary compute through frequent resource allocation calls. Such calls should be open to any UK entity and promote a mixture of (un)conventional research and commercial projects to spark innovation.
Given the increasingly dual-use nature of AI research, the government may wish to consider how this research undergoes ethics and safety oversight. As well as new opportunities, compute-intensive applications of AI pose novel risks. Recognising the UK’s ambition to steer the global development of highly advanced AI systems in a manner aligned with UK values, the government could use compute investments to support the Centre for Data Ethics and Innovation’s roadmap to an effective AI assurance ecosystem, as well as the Office for AI’s pro-innovation approach to regulating AI. This could include support for research into technical mechanisms for monitoring and verification of compute usage for the training and deployment of advanced and possibly dual-use AI systems.
Action required: Immediately and significantly increase compute capacity for AI research
The UK AI community has immediate requirements for large scale accelerator-driven compute to remain internationally competitive and deliver on the UK’s ambitions to be an AI superpower.
The government should establish a UK AI Research Resource for immediate use by academic and commercial users in the AI community. It should provide significant accelerator capacity of at least 3,000 new top-spec AI accelerators. The AI Research Resource should also provide access to a wide range of key datasets and skilled staff to support its use, and be complementary to existing investments and upgrades in accelerator-driven compute.
There also needs to be continuous and sustained investment in accelerator-driven compute capabilities that directly support AI research as a first-class use case. This capability should be present from national exascale systems to compute facilities at Tier 2 and locally at universities. Tier 2 systems in particular should be used to support a diverse breadth of AI-focused accelerator hardware beyond that chosen at the exascale tier.
4.4 Operational infrastructure
Investing in compute infrastructure to meet future demand is crucial, but insufficient by itself. This must be complemented by investment in what makes infrastructure operational: efficient software, skilled developers and operators. Furthermore, it is imperative to ensure that compute infrastructure is as sustainable as possible to limit the environmental impact of compute.
The importance of software
Well-engineered software is a crucial component of computational capability and efficiency, improving the performance of compute and making it easier to use. The UK needs to align investment in software development with hardware investments to get maximum benefit from new infrastructure. The use of AI accelerator systems requires significant investment into research software engineering. Existing software will need to be reengineered or rebuilt to benefit from the increased parallelism and accelerator capabilities of an exascale system. There are also significant performance improvements to be had from algorithmic efficiency or different approaches, such as DeepMind’s optimisation of AlphaGo to AlphaGo Zero. Research software engineers (RSEs) who port and maintain these codes and optimise system performance are a crucial part of operational infrastructure.
UK strength in software
The UK’s strength in software development, particularly research software development, is internationally recognised. The UK’s software sector has been expanding, with software companies making up the majority of the fastest growing tech companies in 2021. When considering the HPC market specifically, the UK’s software component is the second largest by size ($447 million in 2021) in Europe and is expected to grow at the highest 2022-2030 compound annual growth rate (5%). The UK is particularly strong in computational simulation, with UK codes often among the most used on ARCHER2. In the private sector, ENGYS and ICON are examples of UK companies providing computational fluid dynamics software and design optimisation solutions for industry applications.
Many parts of the software stack are built on open source software. International software development efforts, such as those conducted on Summit (porting codes to accelerators) and on Frontier (to develop and test the first exascale codes), are important to the international research community and contribute to the UK’s exascale readiness. The UK should continue to support these initiatives whilst investing in software development programmes that prioritise UK research interests, develop skilled engineers and maintain an excellent international reputation.
Action required: Increase software investment to align with delivery of exascale
The government should invest in initiatives to accelerate code scaling, port codes to accelerators and develop skilled engineers. This should target and benchmark codes already on capability machines and those required by both AI researchers and industry.
Many UK codes need to be reengineered or rebuilt to maximise the benefit of exascale for all users. Existing programmes should broaden their remit to support all software with potential to scale and an active user base. The UK is a leader in software development and should drive further collaboration with leading compute countries. International partnerships should be established and sustained to develop software as part of the UK’s roadmap to exascale capability by negotiating access to international (pre)exascale systems (e.g. US’s Frontier) and leveraging existing access (e.g. EU’s LUMI). This would enable the UK to have exascale-ready tools and applications as soon as domestic exascale capability is deployed.
The importance of cloud access
Cloud access and orchestration of resources has the potential to make compute more accessible to a broad range of users, with workflows able to seamlessly run on commercial cloud, private cloud and on-premise systems. The current technological trajectory points towards more demand for hybrid systems, combining cloud with on-premise compute resources.
Case study: The Euclid space mission
Scientists at the Royal Observatory Edinburgh are using the Science and Technology Facilities Council (STFC) cloud for the Euclid space mission, a European Space Agency mission, to do groundbreaking research on why the universe’s expansion is accelerating and the nature of dark energy. Over a six-year period, the Euclid satellite telescope will observe the Euclid Deep Fields, generating hundreds of thousands of images and analysing several tens of petabytes of data.
The Euclid programme requires access to significant compute in a flexible format. The STFC Cloud has been used to build a huge virtual network of servers that work in parallel with each other and operate like a traditional HPC cluster. The flexibility of cloud allows the programme to expand, contract and reconfigure the resources to meet the needs of specific tasks and for Euclid data processing.
At present, public compute infrastructure makes limited use of cloud technologies. This limits access to facilities and resources, as well as the ability to share and access data. Government should encourage wider adoption of cloud technologies in public systems in order to make more resources available to more researchers.
Improving the interoperability between public systems and commercial cloud would remove key barriers to higher adoption by researchers and businesses. Developing and testing interoperability solutions, such as containerisation and platforms like OpenStack, will improve software availability and the movement of workloads between systems, and support public facilities in adopting best practice. Work is already underway in some Tier 1 and 2 systems to adopt such technologies, but without further support this could lead to variable experience and inconsistent adoption.
Action required: Improve access to and interoperability of public cloud
A lighthouse project should be undertaken to demonstrate the advantages and articulate the challenges of interoperability in public facilities. The project should test various technologies; promote greater collaboration among public facilities and their teams; improve provision of appropriate compute resources and access to important datasets; explore solutions that mitigate vendor lock-in; and support international collaboration. Such a project should also upskill engineers at public facilities, as well as users, to support cloud adoption.
Action can also be taken to signpost cloud as an affordable route to compute. Procurement frameworks can be used to increase the collective purchasing power of institutions. The Open Clouds for Research Environments (OCRE), for example, is a European framework used primarily by academic institutions. Improving the adoption of OCRE across the research community and influencing future iterations of the framework could provide more cloud resources and access models at discounted prices. The UK should ensure that it retains influence in the next OCRE framework.
Action required: Improve the cost-effectiveness of commercial cloud
UKRI, PSREs, and universities should leverage the increased purchasing power of OCRE. The UK should ensure that the next iteration of the framework caters to all users’ collective needs; includes more access models (in addition to pay-as-you-go); lowers data extraction costs; and promotes multi-cloud solutions (that include commercial and private clouds).
Government should publicise international accounting standards and R&D tax credits to reduce the barriers to greater adoption of commercial cloud. This would provide clarity and a better comparison between public systems and commercial cloud.
Meeting the UK’s future compute sustainably
Compute is critical to combating climate change, underpinning work on climate modelling, AI tools for mitigation and advances in alternative energy sources including fusion. However, compute is also a power-hungry technology. Despite energy efficiency innovation that ranks Frontier sixth on the Green500 list, the US exascale system draws up to 21.1 megawatts of power, enough to supply 42,000 homes in the UK.
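To put the figures above in context, the short Python sketch below works through the arithmetic. It is illustrative only: the 21.1 megawatt and 42,000 home figures come from the text, while the assumption of continuous operation at peak power for a non-leap year is ours, and the function names are hypothetical. A megawatt is a unit of power, so energy over time is expressed in megawatt-hours or gigawatt-hours.

```python
# Figures quoted in the text above.
FRONTIER_PEAK_POWER_MW = 21.1
HOMES_SUPPLIED = 42_000

# Assumption for illustration: continuous operation at peak power
# for a non-leap year. Real systems rarely run at peak all year.
HOURS_PER_YEAR = 24 * 365


def implied_power_per_home_kw(total_mw: float, homes: int) -> float:
    """Average continuous power per home, in kilowatts."""
    return total_mw * 1_000 / homes


def annual_energy_gwh(power_mw: float, hours: float = HOURS_PER_YEAR) -> float:
    """Energy in gigawatt-hours if power_mw is drawn for the given hours."""
    return power_mw * hours / 1_000


if __name__ == "__main__":
    per_home = implied_power_per_home_kw(FRONTIER_PEAK_POWER_MW, HOMES_SUPPLIED)
    yearly = annual_energy_gwh(FRONTIER_PEAK_POWER_MW)
    print(f"Implied average draw per home: {per_home:.2f} kW")
    print(f"Upper-bound annual energy at peak power: {yearly:.1f} GWh")
```

The annual figure is an upper bound under the stated assumption, not a measurement of Frontier’s actual consumption.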
An exascale system in the UK would likely have a similar energy demand to Frontier’s and make up about 0.05% of the UK’s current electricity supply, according to the GO-Science report. The specialisation of hardware, for example accelerators, has allowed for significant gains in energy efficiency, but energy demand will continue to grow for the foreseeable future. These systems also need to be cooled, with a significant amount of energy consumed in maintaining a controlled environment. The cooling infrastructure requirements for individual facilities depend on numerous factors including technology and local geography, climate and energy sources. Some cooling systems can be more efficient than others. Liquid cooling uses significantly less power than air cooling, which can constitute up to 40% of total data centre energy consumption. All EuroHPC systems now use water cooling and industry is adopting it in new and existing facilities.
At a national level, the impact of compute’s energy on the grid must be considered as the UK transitions to net zero. A number of UK facilities, such as EPCC and the Met Office, run on renewables. This is also an increasing trend in the private sector, with facilities building on-site renewable generation facilities and grid-connected power storage to mitigate the impact on national supply. For example, 75% of the energy used by the UK’s data centre industry is currently renewable. While using renewable energy does not solve the issues of the energy demand from computing as a whole, it is a good step towards bringing facilities in line with net zero objectives. The sustainability of manufacturing and building compute infrastructure and its components, such as semiconductors, must also be considered.
The government has a role to play in developing standards and guidance for sustainable compute, promoting good practice from across the private and public sector. Guidance should be informed by engagement with power suppliers and network providers and should consider energy efficiency requirements; encourage efficient system design; promote the use of appropriate sustainability contractual clauses; and inform the location of infrastructure. It should also emphasise the need to achieve sustainable computing through innovation in infrastructure and software design. To this end, the government should actively support the growing body of UK expertise in green computing.
The UK’s growing expertise in green computing
The UK is well-positioned to become a leader in green computing. UK public compute facilities are taking steps towards a more sustainable use of compute. UKRI is committed to becoming net zero by 2040, allocating £1.8 million to the net zero digital research infrastructure project to conduct research into net zero computing. DiRAC, a UKRI Tier 1 distributed computing facility, has enforced a cap of 1.2 on data centre Power Usage Effectiveness (PUE), which has encouraged innovation in data centre design including evaporative cooling technologies and use of renewable energy. Meanwhile, solutions to use heat generated by compute in commercial and domestic premises are being tested in Edinburgh. Efforts are also being made to increase software efficiency as well as to collect energy consumption metrics for jobs to inform future compute resource allocations.
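For readers unfamiliar with PUE, it is conventionally defined as total facility energy divided by the energy consumed by the IT equipment alone, so a value of 1.0 would mean no overhead for cooling or power distribution. The Python sketch below shows what a cap of 1.2 implies. The meter readings in the example are hypothetical placeholders, not DiRAC data, and the function names are our own.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of total facility energy not consumed by IT equipment."""
    return 1.0 - 1.0 / pue_value


def meets_cap(pue_value: float, cap: float = 1.2) -> bool:
    """True if the facility is at or below the PUE cap (1.2 as quoted above)."""
    return pue_value <= cap


if __name__ == "__main__":
    # Hypothetical meter readings for one month, for illustration only.
    total_kwh, it_kwh = 1_150.0, 1_000.0
    value = pue(total_kwh, it_kwh)
    print(f"PUE = {value:.2f}, meets cap: {meets_cap(value)}")
    print(f"At a PUE of 1.2, overhead is {overhead_share(1.2):.1%} of total energy")
```

At the 1.2 cap, roughly one sixth of a facility’s energy goes to overheads such as cooling, which is why the cap pushes operators towards more efficient cooling designs.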
In the private sector, UK-based companies are achieving success in sustainability and green computing. For instance, Iceotope is a world leader in immersion cooling systems for computing and data centres, ARM’s energy-efficient processor design has been adopted in compute systems and Graphcore designs energy-efficient Intelligence Processing Units specialised for AI.
There are significant opportunities for the UK to shape the market and create a world-leading sustainable compute ecosystem. Ambitious approaches to procurement are a key part of this. The government could act as a ‘buyer of first resort’, removing market risk in nascent technologies. A similar approach was adopted by the US with transformative results, catalysing explosive growth in computing through the bulk purchase of integrated circuits in 1962 and creating a market for private space launch systems in the 2000s.
Managing the power demands of Frontier
A 2008 study concluded that the power demands of exaflop systems may be as much as 70-155MW, leading the Defense Advanced Research Projects Agency (DARPA) to set an ambitious 20MW per exaflop target. To tackle this issue, the US Office of Science set out to reduce energy consumption in the existing petascale systems. In 2012, the Mira supercomputer at the Argonne Leadership Computing Facility swapped energy-intensive air cooling for more efficient water cooling. The same year, the Oak Ridge Leadership Computing Facility introduced a new type of processing unit, the GPU, in the Titan supercomputer. These were instrumental to the eventual design of Frontier, which uses 21 megawatts for 1.1 exaflops (Rmax).
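The comparison between Frontier and the DARPA target can be checked with simple arithmetic. The Python sketch below is illustrative only: it uses the figures quoted in this box (21 megawatts, 1.1 exaflops Rmax, a 20MW per exaflop target and the 70-155MW projection from 2008), and the function names are hypothetical.

```python
# Figures quoted in the text above.
FRONTIER_POWER_MW = 21.0
FRONTIER_RMAX_EXAFLOPS = 1.1
DARPA_TARGET_MW_PER_EXAFLOP = 20.0
PROJECTION_2008_MW = (70.0, 155.0)  # projected power demand per exaflop


def mw_per_exaflop(power_mw: float, exaflops: float) -> float:
    """Power needed per exaflop of sustained performance, in megawatts."""
    return power_mw / exaflops


def gigaflops_per_watt(power_mw: float, exaflops: float) -> float:
    """Energy efficiency expressed in gigaflops per watt."""
    flops = exaflops * 1e18
    watts = power_mw * 1e6
    return flops / watts / 1e9


def meets_target(power_mw: float, exaflops: float, target: float) -> bool:
    """True if the system is at or below the MW-per-exaflop target."""
    return mw_per_exaflop(power_mw, exaflops) <= target


if __name__ == "__main__":
    ratio = mw_per_exaflop(FRONTIER_POWER_MW, FRONTIER_RMAX_EXAFLOPS)
    eff = gigaflops_per_watt(FRONTIER_POWER_MW, FRONTIER_RMAX_EXAFLOPS)
    ok = meets_target(FRONTIER_POWER_MW, FRONTIER_RMAX_EXAFLOPS,
                      DARPA_TARGET_MW_PER_EXAFLOP)
    low, high = (p / ratio for p in PROJECTION_2008_MW)
    print(f"Frontier: {ratio:.1f} MW per exaflop ({eff:.1f} gigaflops per watt)")
    print(f"Meets 20MW per exaflop target: {ok}")
    print(f"Improvement over 2008 projection: {low:.1f}x to {high:.1f}x")
```

On these figures Frontier comes in at about 19 megawatts per exaflop, just inside the DARPA target.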
The government could consider using challenges, similar to the one set by DARPA, to drive targets in sustainable computing. Given the global scale of the issue, international partnerships can also support the UK’s goal of mitigating the environmental impact of compute by enabling cross-border collaborations on sustainable compute technology.
Action required: Guidance, procurement and targets to support innovation in sustainable compute infrastructure
The government should provide guidance to promote innovation in sustainable computing. The guidance should be produced in line with whole-life carbon project planning and in collaboration with local energy network providers and the national grid. It should cover the design of energy efficient systems; the use of renewable energy; specific sustainability-focused contractual clauses; the location of infrastructure; and the management of water use.
The government should use procurement as a lever for advancing towards net zero goals. Procurement should be used to de-risk cutting-edge sustainability technology, with the government acting as the ‘buyer of first resort’. This approach should be evaluated to see whether it can be deployed in other areas of compute.
In the past, ambitious environmental targets and challenges have been essential to driving innovation in sustainable computing. The government should set similar challenges for compute, beyond exascale.
4.4 Policy implications
The actions identified in this chapter seek to ensure that the UK has access to world-class, sustainable compute capabilities. Without significant investment, the UK will become less competitive as other advanced economies operate or plan next generation compute systems. The UK therefore risks becoming a less attractive destination for top talent and world-class companies, undermining its ability to capitalise on its existing strengths in areas such as AI research and software development. There is also a risk that the disparity in compute resources available to academic researchers compared to private companies will widen further in the AI sector.
The private sector has a role to play alongside the government, but it cannot replace public investment. The UK is well served by commercial cloud and its growing capability means that it has the potential to be used more widely in academic research. However, commercial cloud will not be able to meet all the increased demand, particularly from academic researchers. Public systems at all tiers are also crucial to ensure the provision of compute to a wide range of users and to support public-private collaboration.
International partnerships can play a key role in ensuring the UK can access and test new compute technologies and should complement domestic public investment. Existing arrangements must continue and new ones must be developed, whilst recognising that domestic capability is essential to remain competitive, foster business and academic partnerships and attract talent to the UK. As the UK moves towards a new relationship with the EU without guaranteed access to its newest systems, the government must consider how to provide continued access to the most powerful compute systems that exceed domestic funding potential. It is clear that, without public investment into new infrastructure both now and long-term, the UK will be left behind.
5. Creating a vibrant compute ecosystem
Key findings
- To fully achieve its ambitions on compute and adopt a truly holistic approach to this technology, the UK requires a clear strategic vision and coordination. Long-term planning and a coordinated compute landscape would lead to: more efficient procurement; more effective allocation of compute resources; increased awareness of available compute resources among users; and clear signalling to stakeholders.
- Alongside investment in domestic infrastructure, the UK needs a sustainable skills pipeline for compute as well as secure and trusted systems. This would maximise infrastructure investment, increasing compute uptake and enabling users to reap the benefits of compute.
- Partnerships are integral to creating a competitive domestic compute ecosystem and to ensuring the UK plays a key role within the global compute community.
5.1 Introduction
Previous chapters have made the case for investment in the UK’s domestic compute infrastructure. However, it is essential to adopt a holistic approach to compute and create a vibrant ecosystem. The review endorses the view set out in the GO-Science report that greater coordination and long-term, strategic planning for compute are urgently needed. The fragmented and uncoordinated nature of the UK compute landscape limits the potential that could be derived from existing and new compute resources.
Long-term planning and greater coordination would drive efficiencies in the allocation of compute resources, enable a broader use of compute through increased awareness and trust, and support commercial and international collaboration. It will also help build and attract the skills necessary to be at the forefront of innovation and cutting-edge science. This chapter explores what actions are needed to maximise the return on infrastructure investments.
5.2 The need for a strategic vision and coordination
The GO-Science report highlighted that the UK’s approach to computing is uncoordinated, resulting in inefficient procurement and limited sharing of resources. The evidence collected by the review has reinforced this view. The lack of a strategic approach, coordination and vision translates into piecemeal funding, inefficient procurement, a lack of user awareness and a cumbersome user experience. This also impacts confidence in the UK’s future compute capabilities and competitiveness.
A long-term strategic vision and improved coordination would enable the UK to create a vibrant ecosystem and help address the barriers currently affecting compute uptake, including issues around skills and users’ needs for secure infrastructure. It would also allow the UK to build strong partnerships, both with the private sector and internationally, further improving its compute ecosystem and international standing. These areas are discussed later in this chapter.
A strategic vision for compute
As outlined in Chapter 2, leading countries in compute have long-term strategies linked to specific long-term goals or key societal challenges, as in the case of Australia’s approach to compute. These plans have led to the creation of strong domestic compute ecosystems and internationally competitive facilities.
Case study: Australia’s National Research Infrastructure
Australia has a clearly structured approach to compute. It comprises a 10-year vision, refreshed through 5-year rolling roadmaps, and a 2-year rolling investment plan informed by roadmaps. Roadmaps are developed in collaboration with the research community and are based on strategic policy and priorities for research.
The 2021 roadmap identified eight key challenges: resources technology and critical minerals processing; food and beverage; medical products; recycling and clean energy; defence; space; environment and climate; frontier technologies and modern manufacturing. The challenge-based approach will support the planning and investment around Australia’s National Research Infrastructure over the next five to ten years, aiming at maximising research efforts around key societal issues.
To create a world-class compute ecosystem, the government should urgently outline a long-term strategic vision over the next decade and beyond. The vision will need to reflect the UK’s priorities and ambitions and should take forward the full set of recommendations outlined in this report. The vision should be paired with a clear implementation roadmap.
Action required: A strategic vision for compute
A 10-year strategic vision is needed to provide certainty for users and suppliers and to make decisions on which UK strengths should be leveraged to be internationally competitive. This should be published in 2023, capture all recommendations made by this review and cover all tiers of the compute ecosystem. The vision should include delivery of exascale capability and ambitious objectives to support compute via areas such as skills, sustainability and procurement. This vision needs to be aligned with the government’s major technology and R&D strategies, demonstrating how compute will support their delivery.
The UK must also set out how it will deliver its long-term vision for compute. A UK compute roadmap should be published by spring 2024, setting out specific implementation plans for delivery. It should articulate clear priorities, factor in capacity and resourcing requirements and be based on strong economic analysis of the compute sector. The roadmap should set out clear steps for public investment in the compute ecosystem; the path towards delivering future exascale capability; the plan for investing in future, post-exascale capability; how software will be delivered to maximise the utility of hardware investment; and how skills will be built, to ensure benefits can be fully achieved. The roadmap should be regularly refreshed and underpinned by evidence and user requirements.
It is also important to consider the long-term investment requirements of compute and the role of the UK in developing new compute technologies. Delivering the proposals outlined in Chapter 4 will go part of the way in confirming the UK’s commitment to be an international leader in compute, but it is essential that national planning also considers the UK’s long-term system requirements. Technological advances have been rapid in the last decade and it will not be long before novel hardware architectures and workloads emerge, taking computing into the post-exascale era. The UK must ensure it keeps pace and starts planning for next generation systems to regain its competitive advantage in computing.
Action required: Plan for the UK’s long-term system requirements
The typical lifespan of compute is five to seven years. An exascale system will significantly enhance the UK’s computing power in the medium term, but the government must also prepare for the implementation of future systems to be at the frontier of the next era of computing.
Improving coordination across compute facilities
Better coordination across compute facilities would help drive wider access to compute and more efficient procurement of new systems. The GO-Science report called for a dedicated oversight group tasked with providing effective coordination for large scale computing. The review agrees that a coordinating body could provide strategic leadership and oversight across the full compute ecosystem, holding a holistic view of compute infrastructure and requirements. Such a body should have responsibility for overseeing the development and delivery of the implementation roadmap and advise the government on investments across the UK’s public computing landscape. Consideration should also be given to the interactions between compute, its enablers and the technologies it enables, including data, AI, quantum and semiconductors.
This would bring the UK in line with other countries, with many international competitors having already established national coordination bodies to facilitate access, promote shared investment and provide user support. Such an entity should aim to increase awareness of compute, understand the needs of the ecosystem and advise on making procurement more efficient.
Case study: Gauss Centre for Supercomputing, Germany
The Gauss Centre for Supercomputing (GCS) is the leading coordinating entity in Germany, set up as a joint initiative between the German federal and state governments. By combining the country’s three major facilities, GCS ensures coordination in order to foster scientific research and innovation through greater uptake of compute. To this end, GCS promotes and facilitates access to compute resources, and provides user support and training. GCS also enables and facilitates international collaboration as a hosting member of PRACE. The entity represents Germany in PRACE, providing HPC resources and related services to other PRACE members.
As discussed in Chapter 3, users can find it challenging to navigate the compute landscape and identify resources that meet their needs. Allocation of resources is determined by individual facilities, and there is no central oversight of public resource allocation to ensure that outcomes align with government priorities. Better signposting and a common objective to increase uptake across the full user base would improve awareness and usage of available resources.
Understanding the demand for and availability of compute in the UK will help drive better decision making and ensure that signposting is effectively targeted towards matching users with suppliers. Limited available data currently makes it difficult to accurately monitor supply and demand of compute resources, and therefore determine where further allocation or investment needs to take place. Commercial providers do not publish usage statistics, data on unsuccessful bids is not routinely collected by existing public facilities, and it is difficult to determine unmet private sector demand. This issue could be addressed by identifying clear metrics, surveying users, publishing usage statistics and measuring trends over time.
Improved understanding of compute supply and demand will also help identify how to allocate government funding to the areas of greatest need. Currently, national laboratories, universities, public facilities and UKRI’s research councils conduct individual procurement processes for new compute facilities. As outlined in Chapter 4, the current fragmentation of the compute landscape limits the availability of compute resources to certain users. The UK needs a diverse, broad compute ecosystem that fosters different system architectures at different scales and delivers for everyone. More focused and targeted procurement could, for example, encourage innovation within the supply chain and support R&D in specific areas (e.g. green compute).
Action required: Institute a national coordinating body for compute
The UK needs an expert authority on compute. It should: deliver and update the UK’s vision for compute; deliver the UK compute roadmap; support users by improving the coordination and awareness of public compute infrastructure; and establish a framework for gathering evidence and analysis to measure and monitor UK compute, including through regular surveys of user needs and requirements. There are also opportunities for the coordinating body to improve access to technical support and skills needed for compute, and facilitate public procurement of secure, sustainable compute. Resourcing should reflect the need for sufficient technical expertise to influence decision making effectively. Coordination efforts should identify and advise on interactions with data, AI, quantum, semiconductors and other emerging technologies.
5.3 A sustainable skills pipeline
A strong ecosystem relies on the people who can build, design and operate compute infrastructure; develop the software that runs on it; and enable users to utilise the infrastructure they need. Ensuring the UK has a sustainable skills pipeline for compute is essential to maximise the efficiency of systems, deliver cutting-edge research and innovation and attract further talent and investment. This pipeline relies on compute specific skills, which are founded on more general digital, software and data skills.
The Digital Strategy outlines the government’s commitment to building the skills needed to create a world-leading digital economy. The review welcomes this commitment and emphasises the need to move at pace to ensure that the UK has the digital skills required to build a strong compute ecosystem. Targeted action is required to grow a sustainable and diverse pipeline of digital and STEM skills and build the technical skills needed to support compute users.
Skills for compute users
The benefits of compute to the UK economy, science and innovation ultimately depend upon users’ ability to maximise the power this technology provides. And yet, skills are the most commonly raised barrier for users who want to access and adopt compute. The first challenge for users is often understanding what compute resources are, how to access them and how to succeed in allocation and bidding procedures. Skills requirements also include how to use specific system architectures or how to utilise or design software. The skills barrier is particularly acute for new and emerging users.
Users accessing commercial cloud benefit from a wide range of training and certifications, often provided by the suppliers. These support a broad range of IT abilities and benefit the suppliers by increasing their user base. Many public facilities also offer training programmes to support users’ access. These are essential to enable new users to access systems and write efficient code. ARCHER2 provides a training pathway for different user types. The Hartree Centre provides a range of standard and bespoke training courses for industry users, often in collaboration with partners, such as NVIDIA or IBM. There are also initiatives run at international level, such as PRACE’s extensive free training events in Europe.
Skills programmes should upskill users, as well as technical support staff, for the lifetime of the infrastructure, with these operational costs secured at the time of the infrastructure investment. However, financially-stretched public facilities are unable to offer the full suite of training required to support users’ needs. A lack of effective signposting of training courses further curbs their potential for upskilling users.
Action required: Pair infrastructure investment with skills programmes
Investment in compute infrastructure needs to be matched by investment in skills to ensure the UK reaps the benefit of investment in compute. Users’ access to skills programmes must be streamlined, including through better signposting to foundational data and software skills.
Skills for compute professionals and technical staff
The GO-Science report outlined a shortage of large-scale computing professionals in the UK, ranging from system architects and data engineers, to system operation professionals and software engineers. Compute professionals, such as Research Software Engineers (RSEs) and Infrastructure Engineers, are skilled individuals who can build systems and support users to operate them in the most efficient way. These skills are in high demand in both the public and private sectors. The entire compute community recognises the importance, and scarcity, of technical staff.
In the last decade, substantial work has been done to build and support the UK’s technical compute community, including through the creation of a recognised RSE profession and specific initiatives. For example, the ExCALIBUR Research Software Engineer Knowledge Integration programme aims to equip the UK research community with the skills necessary to seize the opportunities unlocked by exascale capability and increase collaboration between academia and industry. However, despite these valuable initiatives, the review has found that RSEs are increasingly leaving academia due to low salaries, short-term contracts, lack of recognition and uncertain career progression.
There is also anecdotal evidence that the lack of cutting-edge infrastructure in the UK risks creating a brain drain of top talent, with highly-skilled individuals choosing to relocate to countries where they can access state-of-the-art infrastructure. For the UK to build a globally competitive digital economy and be an AI superpower, it must retain and attract technical talent through supporting domestic and global compute skills initiatives.
Action required: Create, attract and retain world-class compute talent
The government must support the pipeline of people with compute skills through investment in computational and digital skills, and the implementation of the Digital Strategy. This includes supporting the digital education pipeline, increasing awareness of career pathways and attracting the best and brightest international talent through visas for digital/research professionals. It should grow and support the talent needed for the UK to use compute, and reward and recognise the technical skills that enable access to it.
For technical professionals, such as Research Software Engineers, the government must further support and extend the work delivered by the community and UKRI to increase professionalisation, enhance recognition and enable clear career progression.
The government should enable international exchanges and industry secondments for technical staff, as well as for users, to increase the UK’s domestic skill base. This will foster cross-border collaborations, build new areas of research, and provide a better understanding of other countries’ infrastructure.
5.4 Support for a secure and resilient compute ecosystem
Trust in the security and durability of compute systems is essential to supporting greater uptake. Security requirements for compute depend upon the nature of the data, the need to transfer data between systems, the underlying hardware design, the needs and mix of users and their organisations, and the evolving threat environment. Different compute users have different security policies and requirements. For example, the location, access and cybersecurity of systems are key considerations for public sector users working with software or data that is sensitive or relates to national security, or where intellectual property needs to be safeguarded. Industry and academic collaborations may require specific security policies, procedures and staff training.
The rise in cross-discipline working is driving an increase in the need to bring together sensitive data with other data sources across different organisations. However, the variable security measures adopted by different public facilities, the increasing diversity of users and connectedness of systems, the increasing heterogeneity of compute uses and the movement of data across different systems can leave room for security vulnerabilities. Improved coordination and strategic planning, integrated with guidance on the latest technological developments, would enhance consistency, assurances and understanding of system security procedures. It is essential to balance data accessibility and sharing to meet user needs with security requirements. Building secure Trusted Research Environments (TREs) is one approach that can provide this balance.
Trusted Research Environments (TREs)
TREs, also known as Data Safe Havens, are highly secure computing environments that facilitate data sharing by providing a high level of security, anonymisation/pseudonymisation and HPC. This provides an opportunity to unlock datasets that would otherwise not be available for research and, importantly, to support users from different disciplines to collaborate. TREs require a fine balance between allowing data access and securing the data kept in the environment.
There are several TREs across the UK, such as NHS Digital’s Secure Data Environment and the Office of National Statistics’ Integrated Data Service (IDS). Each TRE differs, but the Five Safes framework is a central tenet: safe people; projects; settings; data; and outputs.
Many TREs in the UK are used in healthcare applications and played a key role in the fight against COVID-19. During the pandemic, these secure environments accelerated research on the impact of the vaccine on certain communities. Among them is the NHS Digital Trusted Research Environment for England, which enabled studies on the effect of COVID-19 on cardiovascular diseases. NHS Digital TRE is also working with DATA-CAN, enabling important COVID-19 related research on rates of cancer referrals, diagnosis and treatment. The need for TRE’s with high-end data-science capability to explore converged data sets is expected to increase.
System security is a broad and complex topic, and the evidence gathered by this review suggests that further guidance in this space would be welcome. Providing guidance and promoting best practice can be effective in increasing user awareness of security and ensuring UK computing infrastructure remains secure. There are several organisations that provide guidance on securing compute, covering topics on protecting data, governance, access, auditing and supply chains. For example, UKRI launched a programme to support the research and innovation community by setting out principles for information management and knowledge sharing.
The National Cyber Security Centre (NCSC) provides security support to the most critical organisations in the UK, the wider public sector, industry, as well as the general public. There are also a number of cyber security industry standards (such as ISO 27001) available. Activities to raise awareness and promote the importance of compute to UK users, particularly SMEs, should include advice and information on security.
It is also important to consider the development of future public infrastructure with a ‘secure by design’ approach. As the UK compute infrastructure expands and becomes more integrated, security becomes ever more critical. New compute facilities will need to be designed with the latest security principles in mind, in partnership with the NCSC. Taking a risk-based approach to system design can protect the research and industry partnerships enabled by compute infrastructure and support a more resilient ecosystem. An innovative and resilient compute ecosystem should also be underpinned by diverse supply chains. A future strategic approach should consider these issues, in line with wider economic, national security and foreign policy ambitions.
Action required: Promote existing guidance and secure future systems
Users and providers of public compute should be encouraged to follow the latest cybersecurity guidance issued by NCSC. Guidance will not only address standards and best practices for securing compute, but also give users confidence in using appropriate compute, including commercial cloud.
New compute facilities should adopt ‘secure by design’ principles and a risk-based approach that ensure systems can protect the research and industry partnerships they serve. This could include adopting standards, robust technical measures and relevant NCSC guidance based on a consistent risk- and principle-based approach.
5.5 The role of partnerships
Partnerships are another critical factor in creating a vibrant compute ecosystem. Public-private partnerships can deliver multiple benefits to the UK through the creation of more innovative, cost-effective investments. International partnerships can enable the UK to keep pace with global innovation in compute technology; increase compute resources available to UK users; share and build expertise; and develop solutions to tackle global challenges.
Collaboration with the private sector on the delivery of public compute could have substantial benefits, including opening up compute to a greater diversity of suppliers and users; stimulating innovation and new funding models; and increasing domestic R&D. Partnering in the delivery of compute can also help with cost-sharing and drive better commercialisation of compute technology. Furthermore, public compute infrastructure is already used to support a broad range of academic and industry partnerships, as outlined in Chapter 3. This trend is expected to continue in the future, potentially opening up new collaborations at the infrastructure level as well. In seeking a diverse, future-proofed compute ecosystem, the government should explore the role of the private sector in the delivery of all the recommendations outlined in this review.
International partnerships can also drive better outcomes in the procurement, development and deployment of compute infrastructure. Joint procurement helps reduce overall infrastructure costs and, if multiple systems are procured, it could also lower the technological risks associated with different architectures. The EU’s EuroHPC has been seizing these benefits, with multiple countries sharing the cost of new infrastructure and different architectures being tested across the ecosystem. International collaborations tend to focus on and be driven by areas of common interest - such as green data centres, artificial intelligence and machine learning algorithms, or digital twins - and often entail access to international systems. Collaborations are also key to the development of new computing tools and applications, as these are often led by international consortia, and can help address cross-border challenges, such as systems’ energy efficiency.
Case study: The CEA-RIKEN High Performance Computing and Computational Science agreement
The French Alternative Energies and Atomic Energy Commission (CEA) and Japan’s RIKEN have collaborated since 2017 to develop HPC applications in areas such as health, material science and hazard management and improve their compute technologies. They state that combining efforts is a way to accelerate some developments and amplify their dissemination while fostering extra talent generation on both sides.
Having agreed in 2022 to continue collaboration for a further five years, the partnership has expanded its focus to include quantum computing, AI and Big Data, new compute architectures and software development, as well as continuing to develop applications to tackle societal challenges (e.g. disaster prevention).
Besides investing in domestic compute infrastructure, other countries are already leveraging international partnerships to strengthen their compute ecosystem. Striving for ways to drive more collaboration and sharing of resources through international partnerships should therefore form a key part of the UK’s vision for compute. That said, this must support, rather than replace, investment into national capabilities and the overall domestic ecosystem.
Action required: Collaborate with international partners
The UK should build international partnerships to support the UK’s ambitions. The types of partnerships the UK should seek will evolve with time as domestic capabilities strengthen and objectives refine.
The UK should work with international partners to complement domestic interventions. As the UK invests in its domestic infrastructure, international partnerships should be leveraged to ensure UK users can keep accessing the necessary compute resources, particularly heterogeneous and next generation systems. To this end, the government should ensure that the UK retains access to systems under existing agreements (e.g. PRACE), as well as systems the UK has contributed to via Horizon 2020 (e.g. LUMI); and ensure better signposting of international systems already open to UK users (e.g. US DOE INCITE Program). In the future, the government should seek continued close collaboration with leaders in compute, including the US, Japan and the EU.
As the UK expands its own compute capability, it should use partnerships to lower procurement costs and technological risks. The UK should also seek to access, and test, different or novel compute architectures and technologies by partnering with countries that have infrastructure and expertise that complements the UK’s capability.
International partnerships can increase countries’ influence over the international compute landscape, for instance through joint efforts in international standards development. They can also increase countries’ contribution to global scientific endeavours through international research initiatives and consortia. For instance, the consortium of UK compute facilities that worked on the UK’s domestic response to COVID-19 also supported global efforts, contributing to the work of the international US-led COVID-19 High Performance Computing Consortium. Researchers used the combined computational power of multiple UK compute resources to deliver a range of computational models and simulations that helped fight the spread of the pandemic, domestically and globally.
Building strengths in specific areas and leveraging them internationally by collaborating with countries with complementary strengths could deliver significant benefits. The UK has already established strengths in software development and machine learning skills, and has a growing body of expertise in green compute. Maximising the benefits of this existing expertise would improve the UK’s overall domestic compute ecosystem as well as increase the UK’s influence within the international compute landscape.
Action required: Lead and participate in global compute initiatives
Throughout the next decade, the UK should seek to strengthen its standing within the international compute landscape. This should be done by leveraging a range of new and existing international partnerships and initiatives. The UK should aim to capitalise on its strength in research software development; jointly develop sustainable compute technologies and solutions; increase the participation of UK researchers in international projects; and encourage international knowledge sharing. The UK should also seek opportunities to establish partnerships to grow its influence in international standards organisations and multilateral fora, and to support its international development efforts.
5.6 Policy implications
To reap the benefits of compute, investment in infrastructure is not enough. The UK needs to adopt a holistic approach to compute and strive to create a strong, vibrant ecosystem. Setting a long-term strategic vision and improving coordination across the landscape is essential. This would support long-term planning, clearly signal intentions to stakeholders, and help remove barriers currently hindering compute uptake.
To maximise infrastructure investment, it is imperative that both users and facility staff have the right skills to benefit from existing systems and be able to access new ones. At the same time, compute systems need to be secure to create a healthy, diverse and trustworthy ecosystem. The UK should also build partnerships, both with the private sector and internationally, and leverage its existing compute strengths to improve its domestic ecosystem and establish itself as a global leader in compute.
6. Recommendations
The power of compute is clear. To support the government’s ambitions for economic growth, scientific research and technological innovation, urgent action is needed to bolster the UK’s compute infrastructure and create a world-class compute ecosystem.
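There are three key objectives the government should aim to achieve by implementing the review’s recommendations: unlocking the world-leading, high-growth potential of UK compute; building world-class, sustainable compute capabilities; and empowering the compute community.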
6.1 Recommendations
A. Unlocking the world-leading, high-growth potential of UK compute
The UK’s ambitions are for long-term economic growth and to cement its status as a Science and Technology Superpower by 2030. These ambitions rely on the UK’s ability to harness the opportunities that compute provides and maximise the impact of this technology across its economy and science base. The government must be ambitious and visionary to truly use compute to deliver its objectives.
To achieve this, the government should implement the actions detailed in recommendations 1 to 3.
Recommendation 1: Inaugurate a strategic vision and roadmap for compute, for the next decade and beyond
The UK needs a strategic vision for compute. It must articulate the role of compute for economic growth, making the most of UK strengths and cementing its position as a Science and Technology Superpower. This vision needs to be accompanied by an implementation roadmap setting out in detail how the vision will be delivered.
1a) Publish a long-term, strategic vision for compute in 2023, covering a 10 year period.
A 10-year strategic vision is needed to provide certainty for users and suppliers, and to make decisions on which UK strengths should be leveraged to be internationally competitive. The vision should capture all recommendations made by this review and cover all tiers of the compute ecosystem. It should include delivery of exascale capability and ambitious objectives to support areas such as skills, sustainable infrastructure and procurement. The vision needs to be aligned with the government’s major technology and R&D strategies, demonstrating how compute will support the delivery of the government’s objectives.
1b) Publish a roadmap by spring 2024, detailing how the UK’s strategic vision for compute will be implemented.
The government must set out how it will deliver the UK’s long-term vision for compute. A UK compute roadmap should be published, detailing specific implementation plans for delivery. It should articulate clear priorities, factor in capacity and resourcing requirements, and be based on strong economic analysis of the compute sector. The roadmap should set out clear steps for public investment in the compute ecosystem; the path towards delivering future exascale capability; the plan for investing in future, post-exascale capability; how software will be delivered to maximise the utility of hardware investment; and how skills will be built, to ensure benefits can be fully achieved. The roadmap should be regularly refreshed and underpinned by evidence and user requirements.
1c) Begin planning for the UK’s long-term system requirements immediately.
The typical lifespan of computing hardware is five to seven years. An exascale system will significantly enhance the UK’s computing power in the medium term, but the government must also prepare for the implementation of future systems to be at the frontier of the next era of computing.
Recommendation 2: Increase leadership and national coordination to support users and ensure the UK’s vision for compute is delivered
The UK’s public compute infrastructure is fragmented, making it challenging and overly complex to use. Academic and business users, particularly SMEs, find it difficult to navigate the current compute landscape and identify compute resources that meet their requirements. There is a need for simplification, stronger coordination and clear UK-wide leadership. A coordinating entity would provide strategic leadership and oversight across the full compute ecosystem, holding a holistic view of compute infrastructure and requirements.
2a) Institute a national coordinating body to implement the UK’s vision for compute.
The UK needs an expert authority on compute. This should: deliver and update the UK’s vision for compute; deliver the UK compute roadmap; support users by improving the coordination and awareness of public compute infrastructure; and establish a framework for gathering evidence and analysis to measure and monitor UK compute. There are also opportunities to improve access to technical support and skills needed for compute and facilitate public procurement of secure, sustainable compute. Resourcing should reflect the need for sufficient technical expertise and for the ability to influence decision making effectively. Coordination efforts should align with and support activities for AI, quantum and other emerging technologies.
Recommendation 3: Enable increased and broader use of compute by all users, both existing and potential, through the creation of a diverse, healthy and integrated compute ecosystem
The current provision of compute in the UK does not sufficiently meet the needs of either existing or emerging users. An integrated, coordinated, diverse and resilient ecosystem will allow more effective use of compute by all users, improving efficiency and maximising impact. At the same time, the government should raise awareness, and advocate for the utility of compute, to stimulate its adoption across the economy, in particular among SMEs. The government should take a proactive approach to support new users to access and use compute by lowering the barriers to entry.
3a) Ensure that compute users have access to the necessary hardware, software, skills and data to meet their usage requirements.
The government should take a holistic approach to meeting the demand of all users, both new and existing. In practice, this means expanding the provision of hardware, software, skills and data across all public facilities. Each of these components is critical and interdependent. This system-wide view must be at the forefront of the government’s approach to delivering the recommendations set out in this report.
3b) Stimulate demand and broaden the use of compute through targeted awareness building, funding support and skill development programmes.
Broadening the use of compute will help to drive economic growth. The use of compute should be promoted across the economy, particularly by improving signposting and coordination. Clear steps should be taken to support and promote the work of public facilities. In addition, the government should aid the development of a strong skills base, via skills and business development programmes.
B. Building world-class, sustainable compute capabilities
Aspiration alone is not enough: the government must take fast, tangible action. Economic demand and international competition are both accelerating. If the UK does not commit to new infrastructure now and for the long term, it will be left behind.
Investment in compute infrastructure will enable UK industry to grow and become more competitive and UK research to be at the forefront of scientific innovation. Investment must cover the breadth of the UK’s infrastructure: hardware, software and skills, as well as critical elements including data storage, data capabilities and network capacity.
To achieve this, the government should implement the actions detailed in recommendations 4 to 7.
Recommendation 4: Make immediate investments in the pathway to public exascale capability
Exascale capability is an essential component in achieving the UK’s long-term ambitions to be an AI superpower, deliver world-class research and be a global hub for innovation. The UK has just 1.3% of the performance worldwide and its most powerful system ranks 28th. Exascale opens up exciting new opportunities in research and innovation to grow the economy, enabling researchers to build more complex models and simulations. It will allow researchers to understand climate change, power the discovery of new drugs and advance the UK’s engineering capabilities. It is a critical component in maximising the UK’s potential in AI. However, it must be ensured that all tiers of compute, as the foundations of exascale, are also maintained at an appropriate level.
The government should commit to the path to exascale, via the adoption of a phased approach. This would deliver exascale-ready public capability immediately, adding further hardware that increases the capability to full exascale by 2026. This phased approach maximises the effectiveness of government investment by matching capability with readiness and managing the risk of early adoption.
4a) Deliver full exascale capability by 2026.
Deliver hardware of at least one exaflop of processing power that supports a wide range of demands from research and business communities. This should follow a phased approach, delivering exascale-ready public capability immediately (phase 1) and then adding compatible hardware that increases the capability to full exascale (phase 2). Phase 2 should be delivered no later than 2026, and within 2 years of phase 1 to ensure backward compatibility.
4b) Invest in the first stage of exascale as soon as possible, enhancing existing capability immediately ahead of achieving full exascale capability.
To support the pathway towards full exascale capability, the government should provide at least 250 petaflops of exascale-ready public capacity immediately. This should include enough nodes to support future users and be compatible with the delivery of a full exascale system. Capacity at this level would be internationally competitive and an order of magnitude more powerful than ARCHER2. An expanded software development programme is also necessary to support exascale-ready code development.
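As an illustration of the phased arithmetic behind recommendations 4a and 4b, the short sketch below works through the figures quoted above: a phase 1 capacity of 250 petaflops against a target of one exaflop. The function name `phase_plan` and its field names are hypothetical and chosen for illustration only. Reading "an order of magnitude" as a factor of ten is an assumption, so the implied baseline is a reading of the text rather than a measured figure for ARCHER2.

```python
# Illustrative arithmetic only: names are hypothetical, figures come from the text.
PETAFLOPS_PER_EXAFLOP = 1_000  # SI prefixes: 1 exaflop = 1,000 petaflops


def phase_plan(phase1_pflops: float, target_eflops: float) -> dict:
    """Summarise how a phase 1 system relates to a full exascale target."""
    target_pflops = target_eflops * PETAFLOPS_PER_EXAFLOP
    return {
        # Full target expressed in petaflops
        "target_pflops": target_pflops,
        # Fraction of the target already delivered by phase 1
        "phase1_share": phase1_pflops / target_pflops,
        # Minimum capacity phase 2 must add to reach the target
        "phase2_minimum_pflops": max(target_pflops - phase1_pflops, 0),
        # Assumption: "an order of magnitude" is read as a factor of ten,
        # so this is the baseline implied by the text, not a measured value.
        "implied_baseline_pflops": phase1_pflops / 10,
    }


plan = phase_plan(phase1_pflops=250, target_eflops=1)
for name, value in plan.items():
    print(f"{name}: {value}")
```

On these figures, phase 1 would deliver a quarter of the full exascale target, leaving phase 2 to add at least 750 petaflops of compatible hardware.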
4c) Increase software and engineering skills investment to align with delivery of exascale.
The government should invest in initiatives to accelerate code scaling, port codes to accelerators and develop skilled engineers. This should target and benchmark codes already on national capability systems and those required by AI researchers.
Many UK codes need to be reengineered or rebuilt to maximise the benefit of exascale. Existing programmes should broaden their remit to support all software with potential to scale and an active user base. The UK is a leader in software development, and must drive further collaboration with leading compute countries. International partnerships should be established and sustained to develop software as part of the UK’s roadmap to exascale capability by negotiating access to international (pre)exascale systems (e.g. US’s Frontier) and leveraging existing access (e.g. EU’s LUMI). This would enable the UK to have exascale-ready tools and applications as soon as domestic exascale capability is deployed.
Recommendation 5: Improve access to public compute via the cloud through improved interoperability and better procurement
Cloud access has the potential to make compute more accessible to a broad range of users, with workflows able to run between commercial cloud, private cloud and on-premises compute. Public compute infrastructure makes limited use of commercial cloud services. This limits access to facilities and resources as well as the ability to share and access data. Improving the interoperability across public systems, working in partnership with commercial cloud providers to improve access to compute and improving awareness of accounting rules for cloud provision would remove barriers to higher cloud adoption.
5a) Establish a lighthouse project to develop and test interoperability solutions.
A lighthouse project should be undertaken to demonstrate the advantages and articulate the challenges of interoperability in public facilities. The project should test various technologies; promote greater collaboration among public facilities and their teams; improve provision of appropriate compute resources and access to important datasets; explore solutions that mitigate vendor lock-in; and support international collaboration. Such a project should also upskill engineers at public facilities, as well as users, to support cloud adoption.
5b) Ensure continued UK involvement with, and influence over, the Open Clouds for Research Environments (OCRE) framework, beyond 2024.
UKRI, PSREs, and universities should leverage the increased purchasing power of OCRE. The UK should ensure that the next iteration of the framework caters to all users’ collective needs; includes more access models (in addition to pay-as-you-go); lowers data extraction costs; and promotes multi-cloud solutions (that include commercial and private clouds).
5c) Improve awareness of cloud accounting practice to increase access to commercial cloud.
The government should publicise international accounting standards and R&D tax credits to reduce the barriers to greater adoption of commercial cloud. This would provide clarity and a better comparison between public systems and commercial cloud, which should promote value for money.
Recommendation 6: Immediately and significantly increase compute capacity for AI research
The AI community has immediate requirements for large-scale accelerator-driven compute to remain internationally competitive and deliver on the UK’s ambitions to be an AI superpower. Provision of compute for AI as a first-class use case should also be sustained and provided through future facilities, from exascale through to local clusters.
6a) Establish a UK AI Research Resource by summer 2023.
The government should establish a UK AI Research Resource for immediate use by academic and commercial users within the AI community. It should provide significant accelerator capacity of at least 3,000 top-spec AI accelerators, sufficient to support exploratory compute for every UK AI researcher as well as large-scale training runs, and provide access to a wide range of key datasets and skilled staff to support its use. This should be complementary to existing investments and upgrades in accelerator-driven compute.
6b) Ensure that future compute infrastructure, including exascale, provides AI accelerators.
There needs to be continuous and sustained investment in accelerator-driven compute capabilities that directly support AI research as a first-class use case. This capability should be present from national exascale systems through to compute facilities at Tier 2, and locally at universities. The Tier 2 systems in particular should be used to support a diverse breadth of AI-focused accelerator hardware beyond that chosen at the exascale tier.
Recommendation 7: Manage the sustainability of building and using compute through planning, procurement and innovation
Public facilities will need to address the environmental impact arising from compute’s energy demand while balancing service delivery. Future public compute provision should encourage innovation and coordinate best practice.
7a) Develop comprehensive planning guidance for the development of sustainable compute infrastructure, in line with whole-life carbon project planning.
Sustainability needs to be a central feature of the vision and implementation roadmap. The government should provide guidance to promote innovation in sustainable computing. The guidance should be produced in line with whole-life carbon project planning and in collaboration with local energy network providers and the national grid. It should cover the design of energy efficient systems; the use of renewable energy; specific sustainability-focused contractual clauses; the location of infrastructure; and the management of water use.
7b) Use public procurement to support innovative sustainable computing.
The government should use procurement as a lever for advancing towards net zero goals, building on practices such as the Met Office’s requirement for green energy supply for its new supercomputer. Procurement should be used to set ambitious environmental targets, run challenges for innovations in sustainable computing going beyond exascale, as well as de-risk cutting-edge sustainability technology with the government acting as the ‘buyer of first resort’. This approach should be evaluated to see whether it can be deployed in other areas of compute.
7c) Invest in training to promote practices that ensure facilities are being used as efficiently as possible.
Training and skills are an essential part of using systems efficiently. The UK should invest in training on sustainable compute, not only to promote understanding of how to use systems efficiently, but also to facilitate the adoption of more energy-efficient hardware and software solutions. Users must have the skills to write efficient code and understand how to optimise code for minimum environmental impact.
C. Empowering the compute community
To be truly world-leading, the UK compute ecosystem needs to be greater than the sum of its parts. Whilst investment in physical infrastructure is essential, the benefits compute can bring to the UK will result from the skills and endeavours of those who use these systems. Investment in a secure compute infrastructure needs to be matched by investment in skilled people, enabling all users to reap the benefits of compute. The UK should also build partnerships to further support its domestic ecosystem and increase its influence over the international compute landscape.
To achieve this, the government should implement the actions detailed in recommendations 8 to 10.
Recommendation 8: Invest in attracting and retaining world-class researchers and technical talent through targeted investment into domestic and international compute skills programmes
Digital, STEM and compute-specific skills are an essential element of the UK compute ecosystem. Compute-specific skills are needed to develop innovative computational techniques and provide access to compute facilities.
8a) Widen the pipeline of people with compute skills by supporting broader digital skills and talent, as outlined in the Digital Strategy.
Investment in the UK’s computational capability should be aligned with broader investment in computational and digital skills. This includes supporting the digital education pipeline, increasing awareness of career pathways and attracting the best and brightest international talent through visas for digital/research professionals.
8b) Ensure physical infrastructure investments are paired with compute-specific skills programmes, including for the technical staff needed to operate these facilities.
The government must grow and support the talent needed for the UK to use compute and reward and recognise the technical skills that enable access to it. Users’ access to skills programmes must be streamlined, including through better signposting to foundational data and software skills. For technical professionals, such as Research Software Engineers, the government must further support and extend the work delivered by the community and UKRI to increase professionalisation, enhance recognition and enable clear career progression.
8c) Support the development of international collaboration for compute-specific training and staff exchange.
The government should enable international exchanges and industry secondments for users and technical staff to increase the UK’s domestic skill base. This will foster cross-border collaborations, build new areas of research and provide a better understanding of other countries’ computing architectures.
Recommendation 9: Provide guidance and support for appropriately secure compute
To secure and protect the UK’s compute infrastructure, awareness and consistency of best practice must be improved across the ecosystem. A cultural change towards the adoption of a risk-based approach is required to make best use of available resources. Securing data access is at the heart of this and best practice such as Trusted Research Environments should be adopted.
9a) The design of new systems should be supported by security principles developed in collaboration with the National Cyber Security Centre (NCSC).
Security should be an important part of the UK’s strategic approach to compute. The compute ecosystem should work closely with NCSC in developing security guidelines. Security should be considered at the conception of new systems to adopt ‘secure by design’ principles and a risk-based approach that ensure systems can protect the research and industry partnerships they serve.
9b) Signpost security guidance from the National Cyber Security Centre (NCSC).
Users and providers of public compute should follow the latest cyber security guidance issued by NCSC. This will improve consistency and help cultural change. Guidance will address standards and best practices for securing compute, but also give users confidence in using appropriate compute, including commercial cloud.
Recommendation 10: Collaborate with international partners to strengthen the UK’s domestic compute ecosystem and increase its international influence in science and technology
The UK should build international partnerships to support the UK’s ambitions and complement domestic compute interventions. The types of partnerships the UK should seek will evolve with time as domestic capabilities strengthen and objectives are refined.
10a) Work with international partners to complement domestic compute interventions.
While the UK strengthens its domestic compute ecosystem, the government should ensure UK users can access the capacity and capability they need by retaining access to systems under existing agreements (e.g. PRACE). In the future, the government should seek continued close collaboration with leaders in compute, including the US, Japan and the EU. The UK should also establish partnerships as part of its efforts to build domestic skills and attract international talent by leveraging multilateral fora and bilateral agreements.
10b) Explore the role of international partnerships in delivering the best value and technological advantages to the UK.
As the UK expands its own compute capability, it should use partnerships to lower procurement costs and technological risks. For example, this could be achieved by exploring the possibility of jointly procuring new systems, possibly with architectures complementary to the ones already in place in the UK. The UK should also seek to access, and test, different or novel compute architectures and technologies by partnering with countries that have infrastructure and expertise that complements the UK’s capability.
10c) Lead the way in global compute initiatives.
Throughout the next decade, the UK should seek to become a global leader in compute, and secure its status as a Science and Technology Superpower more broadly. This should be done by leveraging a range of new and existing international partnerships and initiatives. The UK should aim to capitalise on its strength in research software development and AI; jointly develop sustainable compute technologies and solutions; increase the participation of UK researchers in international projects; and encourage international knowledge sharing. The UK should also seek to establish partnerships to grow its influence in international standards organisations and to support its international development efforts.
6.2 Future decision points for government
This review has set out 10 recommendations to improve the UK’s compute capability and wider ecosystem. Each and every recommendation is important. A piecemeal approach — a vision without funding; infrastructure without coordination or skills — is unlikely to unlock the UK’s potential. To ensure a focus on delivery, there should be clear accountability and governance for compute within the government. There are actions that could be taken immediately. For instance, developing a vision and implementation roadmap would begin to bring academia and industry together to work towards common goals.
Recognising the challenges of the current economic context, the government must consider the importance of making investments that support long-term economic growth. Funding decisions need to be considered in the context of the government’s wider science and technology priorities and commitments. The case for compute is clear. Decisions on investment in compute should not be taken in isolation, nor can they be ignored.
6.3 The importance of acting now
Action is required now. The UK is falling behind international competitors and does not have a compute ecosystem fit to serve its world-class scientific base and innovative economy. Without intervention, not only will the government be unable to realise its economic, scientific and technological ambitions, but the UK’s internationally recognised strengths in science and technology will risk fading away. To be a global actor and a competitive economy in the 21st century, strong compute capability and a vibrant compute ecosystem are essential.
While there are challenges and risks of inaction, the UK’s great potential must be recognised: it lies in its industries, its academia, its people. As the government invests in the future of compute, so too will industry, leading towards a diverse ecosystem of compute suppliers and users. There is a huge opportunity to further unleash this potential by unlocking compute’s economic and societal benefits. This requires the government to be visionary, committed to delivering its ambitions and determined to guarantee the UK’s prosperity for future generations.
Acknowledgements
The panel would like to thank the many experts and stakeholders who provided valuable evidence and expertise to the review, including all of those who contributed to the call for evidence. We would like to offer specific thanks to those listed below for their time and support on the review.
- Rob Akers, Head of Advanced Computing, UKAEA
- Liz Ashall-Payne, CEO and Co-founder, ORCHA
- Stephen Belcher, Chief Scientific Adviser, Met Office
- Mathew Foulkes, Professor of Physics, Imperial College London
- Richard Gunn, Co-Director Digital Research Infrastructure, UKRI
- Justin Hotard, EVP High Performance Computing and Artificial Intelligence, HPE
- Earl Joseph, CEO, Hyperion Research
- Leigh Lapworth, Head of Computational Science, Rolls Royce
- Simon McIntosh-Smith, Principal Investigator, ExCALIBUR
- John Midgley, Director of Public Policy, AWS
- Justin O’Byrne, Co-Director Digital Research Infrastructure, UKRI
- Parashkev Nachev, Professor of Neurology, UCL Queen Square Institute of Neurology
- Mark Parsons, Executive Director, EPCC, University of Edinburgh
- Thomas Rodden, Chief Scientific Adviser, DCMS
- Katherine Royse, Director, Hartree Centre
- Paul Selwood, Principal Fellow in Supercomputing, Met Office
- Mark Thomson, Executive Chair, STFC, UKRI
- Mark Wilkinson, Director, DiRAC
- Stuart Wilson, Director Global HPC, Atos
- Michael Wooldridge, Director of Foundational AI Research, Alan Turing Institute
- The definition of the ‘digital sector’ is based on the OECD definition of the ‘information society’. This combines the OECD definition of the ‘ICT sector’ with the definition of the ‘content and media sector’.
- The study looked at 175 projects: 26 academic projects, 6 government projects, and 143 industry projects. Earl Joseph, Melissa Riddle, Tom Sorensen, and Steve Conway, ‘The Economic and Societal Benefits of Linux Supercomputers’, Hyperion Research, 2022.
- Data provided by Hyperion, 2022. To note: Hyperion defines ‘high performance computing’ as the entire market for computer servers used by scientists, engineers, analysts, and other groups using computationally data-intensive applications.
- From an approximate comparison with road infrastructure investments to 2020.
- Both of these programmes were supported by the European Regional Development Fund.
- Information provided by the Hartree Centre.
- Besiroglu, T, Nicholas Emery-Xu, and Neil Thompson. ‘Economic impacts of AI-augmented R&D’, 2022.
- Information provided by Met Office.
- Information provided by UKAEA and the Hartree Centre.
- Data from Top500 List Statistics filtered by countries/regions for November 2005 and 2022.
- US Department of Energy, Request for Information - Advanced Computing Ecosystems, 2022.
- Rmax and Rpeak are scores based on systems’ performance using the LINPACK Benchmark.
- US Department of Energy, Request for Information - Advanced Computing Ecosystems, 2022.
- Research requests abroad are often assessed through a peer-review process or managed centrally by the organisation.
- The Hartree Centre, ‘Digital Innovation for Economic Resilience’, 2022.
- Go-Science report, Large Scale Computing: the case for greater UK coordination, 2021.
- STFC Hartree Centre, Hybrid Quantum/Classical and Quantum Computing uptake in UK, 2022.
- See the public data set by Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B. Nanjajjar, Anson Ho, Tamay Besiroglu, Marius Hobbhahn and Jean-Stanislas Denain, ‘Parameter, Compute and Data Trends in Machine Learning’, n.d.
- Met Office, ‘Up to £1.2billion for weather and climate supercomputer’, 2020.
- Information provided by DiRAC.
- The Alan Turing Institute and Technopolis Group, Review of Digital Research Infrastructure Requirements for AI, 2022.
- Information provided by UKRI.
- According to evidence gathered for the review, 456 are at DiRAC Tursa, 320 are at CSD3 Wilkes3 and 208 are at Baskerville. A further 504 previous-generation NVIDIA V100 GPUs are at JADE. The Hartree Centre will be adding a further 90 A100 GPUs by 2025 and DiRAC Tursa will be adding 256 in 2023.