Long-term future of HPC and data center cooling

As data centers grow and processing power increases, the heat these systems generate rises with them, and efficient cooling becomes essential. This blog post covers the current methods and limitations of three primary cooling techniques: air, liquid, and immersion cooling. We then analyze the trend in GPU thermal design power (TDP) over the past 20 years and estimate how long these techniques will remain feasible.

Each of the three current techniques has its unique advantages and challenges, and understanding them is crucial for anyone involved in managing or designing data centers. Let’s take a closer look at each one.

Air-cooled systems:

Traditional data center cooling relies predominantly on fans to move heat away from components. A standard 42U rack typically has a cooling capacity of about 10-15 kW, although highly efficient air cooling can extend this to as much as 30 kW. These systems are relatively straightforward to implement, but their efficacy drops with high-density equipment, a limitation that stems primarily from air’s modest capacity to carry heat. The problem that has driven the industry to seek alternatives is the physical size and power draw of the fans needed to cool the latest high-density servers: larger, more power-hungry fans both occupy valuable space and raise operating costs, making air cooling steadily less practical.

Liquid-cooled systems:

Liquid cooling, whether by direct contact or via cooling loops, can dissipate more heat than air cooling because liquids conduct and carry heat far better than air. A well-engineered liquid-cooled 42U rack can manage a heat load of approximately 50-60 kW, and certain configurations, such as those that incorporate two-phase cooling and Cooling Distribution Units (CDUs), can raise this to as much as 100 kW per rack. However, these systems are more intricate to implement and maintain: they require extensive plumbing across the building, leak prevention, and regular monitoring of fluid quality. Despite these challenges, liquid cooling promises a lower total cost of ownership than air cooling and offers enough capacity to meet processing power demands in the immediate future.
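One way to see why liquids outperform air is to compare how much coolant must flow through a rack to carry away the same heat. The sketch below uses the relation Q = ρ · V̇ · c_p · ΔT with approximate room-temperature properties for air and water. The property values, the 10 K coolant temperature rise, and the choice of plain water as the liquid are all assumptions made for illustration; only the 30 kW load comes from the text above.

```python
# Approximate room-temperature fluid properties (assumed, not from the article).
AIR = {"density": 1.2, "specific_heat": 1005.0}      # kg/m^3, J/(kg*K)
WATER = {"density": 997.0, "specific_heat": 4180.0}  # kg/m^3, J/(kg*K)


def required_flow_m3_per_s(heat_load_w, fluid, delta_t_k):
    """Volumetric flow needed to carry heat_load_w when the coolant warms
    by delta_t_k, from Q = rho * V_dot * c_p * dT."""
    volumetric_heat_capacity = fluid["density"] * fluid["specific_heat"]
    return heat_load_w / (volumetric_heat_capacity * delta_t_k)


heat_load_w = 30_000.0  # the 30 kW upper bound quoted for an air-cooled rack
delta_t_k = 10.0        # assumed inlet-to-outlet coolant temperature rise

air_flow = required_flow_m3_per_s(heat_load_w, AIR, delta_t_k)
water_flow = required_flow_m3_per_s(heat_load_w, WATER, delta_t_k)

print(f"air:   {air_flow:.2f} m^3/s")
print(f"water: {water_flow * 1000:.2f} L/s")
print(f"air needs roughly {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions the gap is several thousandfold, which is consistent with the fan size and power problem described for air cooling. Real coolants such as water-glycol mixes or dielectric fluids have different properties, so the exact ratio will vary.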

Immersion cooling:

Immersion cooling, which submerges components directly in an electrically non-conductive liquid that absorbs heat, has long been recognized for its superior cooling capacity and has recently gained renewed attention. Capable of managing heat loads of 100 kW and generally peaking below 200 kW per 42U rack, the technology owes its high efficiency to the strong thermal transfer capabilities of the cooling liquids employed. Immersion cooling comes in two types: single-phase and two-phase. Single-phase immersion cooling submerges components in a fluid bath that is then circulated and cooled. Two-phase immersion cooling uses a fluid with a low boiling point and exploits the transition from liquid to gas to expel heat. Despite its potential to significantly improve energy efficiency, immersion cooling requires comprehensive modifications to an existing data center, and the sizable tanks occupy considerable floor space.

Now that we have discussed the three leading cooling solutions, let’s examine the observable trend in GPU thermal design power (TDP) and estimate how long these techniques will remain viable.

The following table (Figure A) lists GPUs released over roughly the past two decades and shows how quickly TDP has escalated during this period. The trend underlines the urgency for innovative cooling solutions, as rising GPU TDP continues to challenge the effectiveness of existing methods.

Figure A

GPU	Release Year	TDP (Watts)
NVIDIA GeForce 256	1999	25
NVIDIA GeForce2 GTS	2000	25
NVIDIA GeForce3 Ti 200	2001	31
NVIDIA GeForce4 Ti 4200	2002	45
NVIDIA GeForce FX 5800 Ultra	2003	73
NVIDIA GeForce 6800 Ultra	2004	110
NVIDIA GeForce 7800 GTX	2005	100
NVIDIA GeForce 8800 GTX	2006	155
NVIDIA GeForce 9800 GTX	2008	140
NVIDIA GeForce GTX 280	2008	236
NVIDIA GeForce GTX 480	2010	250
NVIDIA GeForce GTX 580	2010	244
AMD Radeon HD 6970	2010	250
NVIDIA GeForce GTX 680	2012	195
AMD Radeon HD 7970	2012	250
NVIDIA GeForce GTX 780	2013	250
AMD Radeon R9 290X	2013	290
NVIDIA GeForce GTX 980	2014	165
AMD Radeon R9 Fury X	2015	275
NVIDIA GeForce GTX 980 Ti	2015	250
AMD Radeon RX 480	2016	150
NVIDIA GeForce GTX 1060	2016	120
NVIDIA GeForce GTX 1070	2016	150
NVIDIA GeForce GTX 1080	2016	180
NVIDIA GeForce RTX 2070	2018	175
NVIDIA GeForce RTX 2080	2018	215
NVIDIA GeForce RTX 2060	2019	160
AMD Radeon RX 5500 XT	2019	130
AMD Radeon RX 5700 XT	2019	225
NVIDIA GeForce RTX 3070	2020	220
NVIDIA GeForce RTX 3080	2020	320
AMD Radeon RX 6800 XT	2020	300
AMD Radeon RX 6700 XT	2021	230
AMD Radeon RX 6950 XT	2022	335
NVIDIA GeForce RTX 3090 Ti	2022	450
AMD Radeon RX 7900 XT	2022	315
NVIDIA GeForce RTX 4090	2022	450
NVIDIA GeForce RTX 4070	2023	200

To gauge the longevity and efficacy of our current cooling strategies, let’s revisit the capabilities and limitations of air, liquid, and immersion cooling, summarized in the table below.

Figure B

Cooling Method	Typical Cooling Capacity	Upper Limit
Air-Cooled	10-15 kW per 42U rack	Up to 30 kW per 42U rack
Liquid-Cooled	50-60 kW per 42U rack	Up to 100 kW per 42U rack
Immersion Cooling	>100 kW per 42U rack	Varies, but generally < 180 kW per 42U rack
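To make these limits concrete, the following sketch estimates how many GPUs of a given TDP fit within each method’s upper limit. The upper limits are the ones in the table above. The `gpu_share_pct` parameter (the percentage of rack heat attributed to GPUs, with the rest going to CPUs, memory, power supplies, and so on) is a hypothetical value chosen for illustration, and the calculation ignores physical space in the rack.

```python
# Upper-limit rack capacities from the table above, in kW per 42U rack.
RACK_LIMIT_KW = {"air": 30, "liquid": 100, "immersion": 180}


def max_gpus_per_rack(rack_limit_kw, gpu_tdp_w, gpu_share_pct=70):
    """Whole GPUs whose combined TDP fits in the GPU share of the rack's
    heat budget. gpu_share_pct is an assumed figure, not from the article."""
    gpu_budget_w = rack_limit_kw * 1000 * gpu_share_pct // 100
    return gpu_budget_w // gpu_tdp_w


# 320 W is the RTX 3080 row in Figure A; 1000 W is a hypothetical 1 kW GPU.
for tdp_w in (320, 1000):
    for method, limit_kw in RACK_LIMIT_KW.items():
        count = max_gpus_per_rack(limit_kw, tdp_w)
        print(f"{method:>9} at {tdp_w:>4} W per GPU: {count} GPUs")
```

The point of the sketch is the scaling: every increase in per-GPU TDP cuts the number of GPUs a rack can host under a fixed thermal ceiling, whichever cooling method is used.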

Assuming that the GPU Thermal Design Power (TDP) trend observed in Figure A continues to climb at a consistent rate, it is likely that we will see TDP values approaching 1 kW per GPU by the end of 2025, as shown in the figure below.

[Figure: extrapolation of the GPU TDP curve from Figure A]
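An extrapolation of this kind can be set up in a few lines. The sketch below fits both a linear and an exponential trend to a hand-picked subset of Figure A (the highest-TDP row for a few years through 2020) and solves for the year each fit reaches 1 kW. The row selection, the two model choices, and the helper names are assumptions made for illustration; this is not necessarily the method behind the estimate above, and the projected year is very sensitive to which model and rows are used.

```python
import math

# (release year, TDP in watts): highest-TDP row for selected years in
# Figure A. The selection is an assumption made for illustration.
SAMPLES = [
    (1999, 25), (2002, 45), (2004, 110), (2006, 155), (2008, 236),
    (2010, 250), (2013, 290), (2015, 275), (2018, 215), (2020, 320),
]


def least_squares(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def year_reaching(target_w, model):
    """Year at which the fitted trend reaches target_w watts."""
    years = [year for year, _ in SAMPLES]
    tdps = [tdp for _, tdp in SAMPLES]
    if model == "linear":
        intercept, slope = least_squares(years, tdps)
        return (target_w - intercept) / slope
    if model == "exponential":
        # Fit a straight line to log(TDP), i.e. constant percentage growth.
        intercept, slope = least_squares(years, [math.log(t) for t in tdps])
        return (math.log(target_w) - intercept) / slope
    raise ValueError(f"unknown model: {model}")


for model in ("linear", "exponential"):
    print(f"{model:>11} fit reaches 1 kW around {year_reaching(1000, model):.0f}")
```

Adding the 2022 and 2023 rows, or data center accelerators that are not in Figure A, would shift both projections, so the output is best read as an illustration of the procedure.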

Recently, a panel of thermal design experts, architects, and strategists from leading companies including Dell, Intel, Vertiv, and NVIDIA gathered to discuss the future of data center and High-Performance Computing (HPC) cooling. Among the panelists, NVIDIA’s Ali Heydari made a thought-provoking assertion: “the laws of physics are going to dictate the future of cooling.” His observation underscores the inherent limitations of even advanced methods like liquid and immersion cooling, which could reach their limits within the next five years. This looming challenge mirrors the predicament we face today as we transition away from traditional air cooling.

So, how can we leverage physics to address this problem? Intriguingly, the cooling process hinges on two primary physical mechanisms. The first, known as kinetic thermal transport, encompasses both convective and conductive cooling strategies, including air, liquid, and immersion cooling. This mechanism is driven by atomic collisions at the interface between a hot object and a cooler solid, liquid, or gas, resulting in heat transfer from the warmer to cooler areas and thus equalizing the temperature.

The second, radiative thermal transport, is powered by the acceleration of charged particles that interact with the electromagnetic field to emit photons. These photons carry heat energy from a hot surface to a cooler one. However, at server temperatures, the cooling power from the radiated photons is relatively modest, making up only a tiny fraction of the total dissipation compared to convective and conductive cooling.
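The size of this radiative contribution can be estimated with the Stefan-Boltzmann law, P = ε · σ · A · (T_s⁴ − T_a⁴). The sketch below evaluates it for an ideal blackbody surface. The 10 cm × 10 cm area, the 85 °C surface temperature, the 25 °C surroundings, and the emissivity of 1 are assumptions chosen for illustration, and the 320 W comparison figure is the RTX 3080 row from Figure A.

```python
STEFAN_BOLTZMANN = 5.670374419e-8  # W/(m^2*K^4)


def net_radiated_power_w(area_m2, surface_c, ambient_c, emissivity=1.0):
    """Net power radiated by a surface to its surroundings, using
    P = emissivity * sigma * A * (T_surface^4 - T_ambient^4)."""
    surface_k = surface_c + 273.15
    ambient_k = ambient_c + 273.15
    return emissivity * STEFAN_BOLTZMANN * area_m2 * (surface_k**4 - ambient_k**4)


# Assumed geometry and temperatures: a 10 cm x 10 cm plate at 85 C
# radiating to 25 C surroundings as an ideal blackbody (emissivity 1).
power_w = net_radiated_power_w(area_m2=0.01, surface_c=85.0, ambient_c=25.0)
gpu_tdp_w = 320.0  # RTX 3080 row in Figure A

print(f"net radiated power: {power_w:.1f} W")
print(f"share of a {gpu_tdp_w:.0f} W GPU: {100 * power_w / gpu_tdp_w:.1f} %")
```

Under these assumptions the plate radiates only about 5 W, a small percentage of a modern GPU’s heat output, and a real surface with emissivity below 1 would radiate less.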

Recent technological advancements, however, have given rise to metamaterials, or thermal metasurfaces, that can enhance and control the radiative properties of matter. At Maxwell Labs, we’re dedicated to augmenting this second mechanism, radiative thermal transport, so that it becomes a vital player in thermal management strategies. By harnessing large-scale AI-coupled simulation in the design space of thermal metasurfaces, we’re pioneering a new era of materials-based cooling technology where thermal radiation is not an auxiliary player, but the leading factor in heat dissipation.

In the end, we don’t just aim to adapt to the laws of physics; we aim to work with them, leveraging their principles to redefine the future of cooling. At Maxwell Labs, we aren’t just looking at what’s possible today. We’re focusing on what will be necessary tomorrow, pushing the boundaries of physics and innovation to create sustainable, efficient, and high-performance cooling solutions for the next generation of computing.

If you would like to learn more about how we are pioneering this technology, you can connect with us on our contact page.
