Long-term future of HPC and data center cooling

In the ever-evolving world of technology, the need for efficient cooling systems is paramount. As data centers grow and processing power increases, the heat generated by these systems also rises, necessitating advanced cooling techniques. This blog post delves into the current methods and limitations of three primary cooling techniques: air, liquid, and immersion cooling. We will then analyze the trends in processing power over the past 20 years and make a prediction on how long these cooling techniques will continue to be a feasible cooling solution.

Each of the three current techniques has its unique advantages and challenges, and understanding them is crucial for anyone involved in managing or designing data centers. Let’s take a closer look at each one.

Air-cooled systems:

Traditional data center cooling relies predominantly on fans to move heat away from components. A standard 42U rack typically has a cooling capacity of about 10-15 kW, although highly efficient air cooling can extend this to as much as 30 kW. These systems are relatively straightforward to implement, but they struggle with high-density equipment because air carries relatively little heat per unit volume. The challenges that have driven the industry to seek alternatives are rooted in the physical size and power draw of the fans needed to cool the latest high-density servers: larger fans occupy valuable space and raise operational costs, making air cooling less practical going forward.


Liquid-cooled systems:

Liquid cooling, either by direct contact or via cooling loops, can dissipate far more heat than air cooling because liquids have higher thermal conductivity and heat capacity. A well-engineered liquid-cooled 42U rack can manage a heat load of approximately 50-60 kW, and certain configurations, such as those that incorporate two-phase cooling and Coolant Distribution Units (CDUs), can raise this to as much as 100 kW per rack. However, these systems are more complex to implement and maintain: they require extensive plumbing across the building, leak prevention, and regular monitoring of fluid quality. Despite these challenges, liquid cooling promises a lower total cost of ownership than air cooling and offers enough cooling capacity to meet processing power demands in the immediate future.
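To make the gap between air and liquid concrete, the sketch below uses the standard sensible-heat relation (heat removed = density x volumetric flow x specific heat x temperature rise) to estimate how much coolant flow is needed to carry away 30 kW, the upper limit quoted above for an air-cooled rack. The fluid properties are approximate room-temperature textbook values, plain water stands in for the liquid coolant, and the 10 K coolant temperature rise is an assumption chosen for illustration, not a figure from this post.

```python
# Approximate fluid properties near room temperature (textbook values).
AIR = {"density": 1.2, "cp": 1005.0}      # kg/m^3, J/(kg*K)
WATER = {"density": 997.0, "cp": 4180.0}  # kg/m^3, J/(kg*K)


def flow_needed_m3_per_s(heat_w, delta_t_k, fluid):
    """Volumetric flow that carries heat_w watts when the coolant
    warms by delta_t_k kelvin on its way through the rack."""
    return heat_w / (fluid["density"] * fluid["cp"] * delta_t_k)


rack_heat_w = 30_000.0  # 30 kW rack heat load
delta_t_k = 10.0        # assumed coolant temperature rise (illustrative)

air_flow = flow_needed_m3_per_s(rack_heat_w, delta_t_k, AIR)
water_flow = flow_needed_m3_per_s(rack_heat_w, delta_t_k, WATER)

print(f"air:   {air_flow:.2f} m^3/s")
print(f"water: {water_flow * 1000:.2f} L/s")
print(f"air needs about {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions, air needs a few cubic meters per second while water needs well under one liter per second, which is why fan size and fan power become the limiting factor for dense racks.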


Immersion Cooling:

Immersion cooling, a technology that submerges components directly in a non-conductive liquid that absorbs heat, has long been recognized for its superior cooling capacity and has recently gained renewed attention. It can manage heat loads of 100 kW, generally peaking below 200 kW per 42U rack, and its high efficiency arises from the strong thermal transfer capabilities of the cooling liquids employed. Immersion cooling can be classified into two types: single-phase and two-phase. Single-phase immersion cooling submerges components in a fluid bath that is then circulated and cooled. Two-phase immersion cooling uses a fluid with a low boiling point, so that the transition from liquid to gas carries heat away. Despite its potential to significantly improve energy efficiency, immersion cooling requires comprehensive modifications to an existing data center, and the sizable tanks occupy a considerable amount of floor space.

 

Now that we have discussed the three leading cooling solutions, let's examine the observable trend in GPU Thermal Design Power (TDP) and estimate how long these cooling techniques will remain viable.

Figure A below lists GPUs released over roughly the past two decades and highlights the rapid rise in TDP during this period. It underlines the urgency for innovative cooling solutions, as the escalating TDP of GPUs continues to challenge the effectiveness of existing methods.


Figure A

GPU | Release Year | TDP (Watts)
NVIDIA GeForce 256 | 1999 | 25
NVIDIA GeForce2 GTS | 2000 | 25
NVIDIA GeForce3 Ti 200 | 2001 | 31
NVIDIA GeForce4 Ti 4200 | 2002 | 45
NVIDIA GeForce FX 5800 Ultra | 2003 | 73
NVIDIA GeForce 6800 Ultra | 2004 | 110
NVIDIA GeForce 7800 GTX | 2005 | 100
NVIDIA GeForce 8800 GTX | 2006 | 155
NVIDIA GeForce 9800 GTX | 2008 | 140
NVIDIA GeForce GTX 280 | 2008 | 236
NVIDIA GeForce GTX 480 | 2010 | 250
NVIDIA GeForce GTX 580 | 2010 | 244
AMD Radeon HD 6970 | 2010 | 250
NVIDIA GeForce GTX 680 | 2012 | 195
AMD Radeon HD 7970 | 2012 | 250
NVIDIA GeForce GTX 780 | 2013 | 250
AMD Radeon R9 290X | 2013 | 290
NVIDIA GeForce GTX 980 | 2014 | 165
AMD Radeon R9 Fury X | 2015 | 275
NVIDIA GeForce GTX 980 Ti | 2015 | 250
AMD Radeon RX 480 | 2016 | 150
NVIDIA GeForce GTX 1060 | 2016 | 120
NVIDIA GeForce GTX 1070 | 2016 | 150
NVIDIA GeForce GTX 1080 | 2016 | 180
NVIDIA GeForce RTX 2060 | 2019 | 160
AMD Radeon RX 5500 XT | 2019 | 130
NVIDIA GeForce RTX 2070 | 2018 | 175
NVIDIA GeForce RTX 2080 | 2018 | 215
AMD Radeon RX 5700 XT | 2019 | 225
NVIDIA GeForce RTX 3070 | 2020 | 220
NVIDIA GeForce RTX 3080 | 2020 | 320
AMD Radeon RX 6800 XT | 2020 | 300
AMD Radeon RX 6700 XT | 2021 | 230
AMD Radeon RX 6950 XT | 2022 | 313
NVIDIA GeForce RTX 3090 Ti | 2022 | 315
AMD Radeon RX 7900 XT | 2022 | 344
NVIDIA GeForce RTX 4090 | 2022 | 430
NVIDIA GeForce RTX 4070 | 2023 | 344
   

To gain a deeper understanding of the longevity and efficacy of our current cooling strategies, let's revisit the capabilities and limitations of air, liquid, and immersion cooling, summarized in the table below.


Figure 2

Cooling Method | Typical Cooling Capacity | Upper Limit
Air-Cooled | 10-15 kW per 42U rack | Up to 30 kW per 42U rack
Liquid-Cooled | 50-60 kW per 42U rack | Up to 100 kW per 42U rack
Immersion Cooling | >100 kW per 42U rack | Varies, but generally < 180 kW per 42U rack
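As a rough way to read the two tables together, the sketch below converts a per-GPU TDP into a per-rack heat load and checks it against the upper limits in Figure 2. The count of 32 GPUs per rack and the optional overhead fraction for CPUs, memory, and power conversion are hypothetical values chosen only for illustration; real rack configurations vary widely. The function names are likewise illustrative.

```python
# Upper limits per 42U rack in kW, taken from Figure 2.
UPPER_LIMIT_KW = {
    "air": 30.0,
    "liquid": 100.0,
    "immersion": 180.0,
}


def rack_heat_load_kw(gpu_tdp_w, gpus_per_rack, overhead_fraction=0.0):
    """GPU heat per rack in kW, optionally scaled up by a fraction
    that stands in for non-GPU components (an assumption)."""
    gpu_kw = gpu_tdp_w * gpus_per_rack / 1000.0
    return gpu_kw * (1.0 + overhead_fraction)


def viable_methods(load_kw):
    """Cooling methods whose Figure 2 upper limit covers the load."""
    return [name for name, limit in UPPER_LIMIT_KW.items() if load_kw <= limit]


gpus_per_rack = 32  # hypothetical rack density

# 430 W is the highest TDP listed in Figure A; 1000 W is a 1 kW GPU.
for tdp_w in (430, 1000):
    load = rack_heat_load_kw(tdp_w, gpus_per_rack)
    print(f"{tdp_w} W per GPU -> {load:.1f} kW per rack: {viable_methods(load)}")
```

With these assumed numbers, a rack of 430 W GPUs still fits within the air-cooling limit, while a rack of 1 kW GPUs exceeds it on GPU heat alone and needs liquid or immersion cooling.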

Assuming that the GPU Thermal Design Power (TDP) trend observed in Figure A continues to rise at a consistent rate, an assumption few challenge, it is highly likely that we will see TDP values approaching 1 kW per GPU by the end of 2025, as illustrated in the figure below.

*Insert graph that extends GPU curve out to 2023

Recently, an esteemed panel of thermal design experts, architects, and strategists from leading companies like Dell, Intel, Vertiv, and NVIDIA gathered for a meaningful dialogue on the future of data center and High-Performance Computing (HPC) cooling (Link). Among the panelists, NVIDIA’s Ali Heydari made a thought-provoking assertion: “the laws of physics are going to dictate the future of cooling.” His observation underscores the inherent limitations of even advanced cooling methodologies like liquid and immersion cooling, which could likely reach their limits within the next five years. This looming challenge mirrors the current predicament we face as we transition away from traditional air cooling techniques.

So, how can we leverage physics to address this problem? Intriguingly, the cooling process hinges on two primary physical mechanisms. The first, known as kinetic thermal transport, encompasses both convective and conductive cooling strategies, including air, liquid, and immersion cooling. This mechanism is driven by atomic collisions at the interface between a hot object and a cooler solid, liquid, or gas, resulting in heat transfer from the warmer to cooler areas and thus equalizing the temperature.

The second, radiative thermal transport, is powered by the acceleration of charged particles that interact with the electromagnetic field to emit photons. These photons carry heat energy from a hot surface to a cooler one. However, at server temperatures, the cooling power from the radiated photons is relatively modest, making up only a tiny fraction of the total dissipation compared to convective and conductive cooling.
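A back-of-the-envelope estimate with the Stefan-Boltzmann law shows why ordinary radiation contributes so little at these temperatures. The sketch below computes the net power radiated by an ideal blackbody surface to cooler surroundings. The 85 °C surface temperature, 25 °C surroundings, 10 cm x 10 cm area, and emissivity of 1 are illustrative assumptions, not measurements from this post, and an emissivity of 1 gives an upper bound for a conventional surface.

```python
STEFAN_BOLTZMANN = 5.670374419e-8  # W / (m^2 * K^4)


def net_radiated_power_w(area_m2, t_hot_k, t_cold_k, emissivity=1.0):
    """Net power radiated from a surface at t_hot_k to surroundings
    at t_cold_k, using the Stefan-Boltzmann law."""
    return emissivity * STEFAN_BOLTZMANN * area_m2 * (t_hot_k**4 - t_cold_k**4)


area_m2 = 0.10 * 0.10      # assumed 10 cm x 10 cm radiating surface
t_hot_k = 85.0 + 273.15    # assumed hot surface temperature
t_cold_k = 25.0 + 273.15   # assumed temperature of the surroundings

radiated_w = net_radiated_power_w(area_m2, t_hot_k, t_cold_k)
gpu_tdp_w = 430.0          # highest TDP listed in Figure A

print(f"net radiated power: {radiated_w:.1f} W")
print(f"share of a {gpu_tdp_w:.0f} W GPU: {100 * radiated_w / gpu_tdp_w:.1f} %")
```

Under these assumptions, the surface radiates on the order of 5 W, roughly one percent of the heat a high-end GPU from Figure A must shed.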

However, recent technological advancements have given rise to metamaterials or thermal metasurfaces that can enhance and control radiative properties of matter. At Maxwell Labs, we’re dedicated to augmenting this second mechanism – radiative thermal transport – so that it becomes a vital player in thermal management strategies. By harnessing large-scale AI-coupled simulation in the design space of thermal metasurfaces, we’re pioneering a new era of materials-based cooling technology where thermal radiation is not an auxiliary player, but the leading factor in heat dissipation.

In the end, we don’t just aim to adapt to the laws of physics; we aim to work with them, leveraging their principles to redefine the future of cooling. At Maxwell Labs, we aren’t just looking at what’s possible today. We’re focusing on what will be necessary tomorrow, pushing the boundaries of physics and innovation to create sustainable, efficient, and high-performance cooling solutions for the next generation of computing.

If you would like to learn more about how we are pioneering this technology, you can connect with us on our contact page.
