In the ever-evolving world of technology, efficient cooling is paramount. As data centers grow and processing power increases, the heat these systems generate rises with it, necessitating advanced cooling techniques. This blog post delves into the current methods and limitations of three primary cooling techniques: air, liquid, and immersion cooling. We will then analyze trends in processing power over the past 20 years and predict how long these techniques will remain feasible.
Each of the three current techniques has its unique advantages and challenges, and understanding them is crucial for anyone involved in managing or designing data centers. Let’s take a closer look at each one.
Air-cooled systems:
Traditional data center cooling predominantly relies on fans to disperse heat from components. A standard 42U rack typically supports a cooling capacity of about 10-15 kW, although highly efficient air cooling can extend this to as much as 30 kW. These systems are relatively straightforward to implement, but their efficacy suffers with high-density equipment, a limitation rooted in air's modest heat-transfer capacity. The challenges pushing the industry toward alternatives come down to the physical size and power draw of the fans needed to cool the latest high-density servers: larger, hungrier fans both occupy valuable space and escalate operational costs, making the approach increasingly impractical going forward.
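To put those rack figures in context, the airflow a rack needs scales directly with its heat load. Below is a minimal sketch using the textbook relation Q = ṁ·cp·ΔT; the air properties and the 15 K intake-to-exhaust rise are illustrative assumptions, not vendor figures.

```python
# Rough airflow needed to remove a rack's heat load (sketch, not vendor data).
CP_AIR = 1005.0       # J/(kg*K), specific heat of air near room temperature
RHO_AIR = 1.2         # kg/m^3, air density at sea level (assumed)
M3S_TO_CFM = 2118.88  # 1 m^3/s expressed in cubic feet per minute

def required_airflow_cfm(heat_load_w, delta_t_k=15.0):
    """Volumetric airflow so exhaust runs delta_t_k above intake."""
    mass_flow = heat_load_w / (CP_AIR * delta_t_k)   # kg/s of air
    return mass_flow / RHO_AIR * M3S_TO_CFM          # convert to CFM

for load in (15_000, 30_000):  # typical and upper-limit air-cooled racks
    print(f"{load / 1000:.0f} kW rack -> ~{required_airflow_cfm(load):,.0f} CFM")
```

A 30 kW rack already demands thousands of CFM, which is exactly the fan size and power problem described above.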
Liquid-cooled systems:
Liquid cooling, either by direct contact or via cooling loops, dissipates heat far more effectively than air cooling because liquids have inherently better thermal conductivity and heat capacity. A well-engineered liquid-cooled 42U rack can manage a heat load of approximately 50-60 kW, and certain configurations, such as those incorporating two-phase cooling and Coolant Distribution Units (CDUs), can push this to as much as 100 kW per rack. These systems are, however, more complex to implement and maintain: they require extensive plumbing throughout the building, leak prevention, and regular fluid-quality monitoring. Despite these challenges, liquid cooling not only promises a lower total cost of ownership than air cooling, but also provides enough capacity to meet evolving processing-power demands in the immediate future.
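The same first-law arithmetic shows why liquid is so effective: water's volumetric heat capacity is roughly 3,500 times that of air. Below is a minimal sketch; the water-like coolant properties and the 10 K loop delta-T are illustrative assumptions.

```python
# Coolant flow needed for a liquid-cooled rack (sketch; assumes a water or
# water-glycol loop with cp ~4186 J/(kg*K) and an illustrative 10 K rise).
CP_WATER = 4186.0  # J/(kg*K), specific heat of water

def coolant_flow_lpm(heat_load_w, delta_t_k=10.0, density_kg_per_l=1.0):
    """Litres per minute of coolant to carry heat_load_w at delta_t_k."""
    mass_flow = heat_load_w / (CP_WATER * delta_t_k)  # kg/s of coolant
    return mass_flow / density_kg_per_l * 60.0        # convert to L/min

for load in (60_000, 100_000):  # typical and upper-limit liquid-cooled racks
    print(f"{load / 1000:.0f} kW rack -> ~{coolant_flow_lpm(load):.0f} L/min")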
Immersion Cooling:
Immersion cooling—a technology that submerges components directly into a non-conductive liquid that absorbs heat—has long been recognized for its superior cooling capacity, and it has recently gained renewed attention. Capable of managing heat loads of 100 kW and generally peaking below 200 kW per 42U rack, its high efficiency arises from the strong thermal-transfer capabilities of the cooling liquids employed. Immersion cooling comes in two types: single-phase and two-phase. Single-phase immersion cooling submerges components in a fluid bath that is circulated and cooled, while two-phase immersion cooling uses a fluid with a low boiling point, leveraging the transition from liquid to gas to carry heat away. Despite its potential to significantly enhance energy efficiency, immersion cooling necessitates comprehensive modifications to an existing data center, and the sizable tanks required tend to occupy a considerable amount of floor space.
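For the two-phase variant, the governing quantity is the fluid's latent heat of vaporization. Below is a minimal sketch; the ~100 kJ/kg figure is an assumed, representative value for engineered dielectric fluids (real products vary on the order of 80-130 kJ/kg).

```python
# Vapor generation in a two-phase immersion tank (sketch).
LATENT_HEAT = 100_000.0  # J/kg, latent heat of vaporization (assumed value)

def boil_off_rate_kg_s(heat_load_w):
    """Fluid boiled per second; the condenser must return this much liquid."""
    return heat_load_w / LATENT_HEAT

print(f"100 kW tank boils ~{boil_off_rate_kg_s(100_000):.1f} kg of fluid per second")
```

A 100 kW tank boils on the order of a kilogram of fluid every second, all of which the condenser must recapture and return, which helps explain why tank and condenser design dominate immersion deployments.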
Now that we have discussed the three leading cooling solutions available, let's examine the observable trends in GPU Thermal Design Power (TDP) and estimate how long each technique will remain viable. The table below charts GPUs released over roughly the past two decades and highlights the rapid escalation of TDP during this period, underlining the urgency for innovative cooling solutions.
Figure A
| GPU | Release Year | TDP (Watts) |
| --- | --- | --- |
| NVIDIA GeForce 256 | 1999 | 25 |
| NVIDIA GeForce2 GTS | 2000 | 25 |
| NVIDIA GeForce3 Ti 200 | 2001 | 31 |
| NVIDIA GeForce4 Ti 4200 | 2002 | 45 |
| NVIDIA GeForce FX 5800 Ultra | 2003 | 73 |
| NVIDIA GeForce 6800 Ultra | 2004 | 110 |
| NVIDIA GeForce 7800 GTX | 2005 | 100 |
| NVIDIA GeForce 8800 GTX | 2006 | 155 |
| NVIDIA GeForce 9800 GTX | 2008 | 140 |
| NVIDIA GeForce GTX 280 | 2008 | 236 |
| NVIDIA GeForce GTX 480 | 2010 | 250 |
| NVIDIA GeForce GTX 580 | 2010 | 244 |
| AMD Radeon HD 6970 | 2010 | 250 |
| NVIDIA GeForce GTX 680 | 2012 | 195 |
| AMD Radeon HD 7970 | 2012 | 250 |
| NVIDIA GeForce GTX 780 | 2013 | 250 |
| AMD Radeon R9 290X | 2013 | 290 |
| NVIDIA GeForce GTX 980 | 2014 | 165 |
| AMD Radeon R9 Fury X | 2015 | 275 |
| NVIDIA GeForce GTX 980 Ti | 2015 | 250 |
| AMD Radeon RX 480 | 2016 | 150 |
| NVIDIA GeForce GTX 1060 | 2016 | 120 |
| NVIDIA GeForce GTX 1070 | 2016 | 150 |
| NVIDIA GeForce GTX 1080 | 2016 | 180 |
| NVIDIA GeForce RTX 2070 | 2018 | 175 |
| NVIDIA GeForce RTX 2080 | 2018 | 215 |
| NVIDIA GeForce RTX 2060 | 2019 | 160 |
| AMD Radeon RX 5500 XT | 2019 | 130 |
| AMD Radeon RX 5700 XT | 2019 | 225 |
| NVIDIA GeForce RTX 3070 | 2020 | 220 |
| NVIDIA GeForce RTX 3080 | 2020 | 320 |
| AMD Radeon RX 6800 XT | 2020 | 300 |
| AMD Radeon RX 6700 XT | 2021 | 230 |
| AMD Radeon RX 6950 XT | 2022 | 313 |
| NVIDIA GeForce RTX 3090 Ti | 2022 | 450 |
| AMD Radeon RX 7900 XT | 2022 | 344 |
| NVIDIA GeForce RTX 4090 | 2022 | 450 |
| NVIDIA GeForce RTX 4070 | 2023 | 200 |
To gain a deeper understanding of the longevity and efficacy of our current cooling strategies, let's revisit the capabilities and limitations of air, liquid, and immersion cooling summarized in the table below.
Figure B
| Cooling Method | Typical Cooling Capacity | Upper Limit |
| --- | --- | --- |
| Air-Cooled | 10-15 kW per 42U rack | Up to 30 kW per 42U rack |
| Liquid-Cooled | 50-60 kW per 42U rack | Up to 100 kW per 42U rack |
| Immersion Cooling | >100 kW per 42U rack | Varies, but generally < 200 kW per 42U rack |
Assuming the GPU Thermal Design Power (TDP) trend observed in Figure A continues to climb at a similar rate, we are likely to see TDP values approaching 1 kW per GPU by the end of 2025.
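As a back-of-the-envelope check on that projection, one can fit an exponential trend through a handful of flagship TDPs from Figure A and extrapolate. This is a sketch only: the choice of points, the exponential model, and the projection year are illustrative assumptions, not a rigorous forecast.

```python
import math

# Exponential fit (least squares on log-TDP) through a few flagship GPUs
# from Figure A; illustrative extrapolation, not a forecast.
points = [(1999, 25), (2004, 110), (2008, 236), (2013, 290),
          (2020, 320), (2022, 450)]  # (release year, flagship TDP in watts)

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_ln = sum(math.log(y) for _, y in points) / n
slope = sum((x - mean_x) * (math.log(y) - mean_ln) for x, y in points) / \
        sum((x - mean_x) ** 2 for x, _ in points)  # log-growth per year

def projected_tdp(year):
    """TDP implied by the fitted exponential trend."""
    return math.exp(mean_ln + slope * (year - mean_x))

print(f"fitted growth ~{slope:.0%}/year; 2025 flagship ~{projected_tdp(2025):.0f} W")
```

Depending on which points are chosen, this kind of fit lands consumer flagships in the several-hundred-watt range by 2025, and data-center accelerators already run hotter than the consumer parts tabulated above, so the 1 kW mark is sensitive to the product class being extrapolated.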
Recently, an esteemed panel of thermal design experts, architects, and strategists from leading companies like Dell, Intel, Vertiv, and NVIDIA gathered for a meaningful dialogue on the future of data center and High-Performance Computing (HPC) cooling (Link). Among the panelists, NVIDIA’s Ali Heydari made a thought-provoking assertion: “the laws of physics are going to dictate the future of cooling.” His observation underscores the inherent limitations of even advanced cooling methodologies like liquid and immersion cooling, which could likely reach their limits within the next five years. This looming challenge mirrors the current predicament we face as we transition away from traditional air cooling techniques.
So, how can we leverage physics to address this problem? Intriguingly, the cooling process hinges on two primary physical mechanisms. The first, known as kinetic thermal transport, encompasses both convective and conductive cooling strategies, including air, liquid, and immersion cooling. This mechanism is driven by atomic collisions at the interface between a hot object and a cooler solid, liquid, or gas, resulting in heat transfer from the warmer to cooler areas and thus equalizing the temperature.
The second, radiative thermal transport, is powered by the acceleration of charged particles that interact with the electromagnetic field to emit photons. These photons carry heat energy from a hot surface to a cooler one. However, at server temperatures, the cooling power from the radiated photons is relatively modest, making up only a tiny fraction of the total dissipation compared to convective and conductive cooling.
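To see how modest that fraction is, compare the Stefan-Boltzmann law against a typical forced-convection film coefficient. In the sketch below, the 80 °C surface, 0.9 emissivity, and h = 50 W/(m²·K) are assumed illustrative values, not measurements.

```python
# Radiative vs forced-convective heat flux from a hot surface (sketch;
# temperatures, emissivity, and film coefficient are assumed values).
SIGMA = 5.67e-8  # W/(m^2*K^4), Stefan-Boltzmann constant

def radiative_w_per_m2(t_hot_k, t_amb_k, emissivity=0.9):
    """Net radiative flux between a surface and its surroundings."""
    return emissivity * SIGMA * (t_hot_k**4 - t_amb_k**4)

def convective_w_per_m2(t_hot_k, t_amb_k, h=50.0):
    """Convective flux with an assumed forced-air film coefficient h."""
    return h * (t_hot_k - t_amb_k)

t_hot, t_amb = 353.0, 298.0  # ~80 C surface, ~25 C ambient air
rad = radiative_w_per_m2(t_hot, t_amb)
conv = convective_w_per_m2(t_hot, t_amb)
print(f"radiative ~{rad:.0f} W/m^2 vs convective ~{conv:.0f} W/m^2 "
      f"({rad / (rad + conv):.0%} of the total)")
```

At these temperatures, unenhanced radiation contributes on the order of a tenth of what forced air removes, which is precisely the gap that engineered radiative surfaces aim to close.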
However, recent technological advancements have given rise to metamaterials or thermal metasurfaces that can enhance and control radiative properties of matter. At Maxwell Labs, we’re dedicated to augmenting this second mechanism – radiative thermal transport – so that it becomes a vital player in thermal management strategies. By harnessing large-scale AI-coupled simulation in the design space of thermal metasurfaces, we’re pioneering a new era of materials-based cooling technology where thermal radiation is not an auxiliary player, but the leading factor in heat dissipation.
In the end, we don’t just aim to adapt to the laws of physics; we aim to work with them, leveraging their principles to redefine the future of cooling. At Maxwell Labs, we aren’t just looking at what’s possible today. We’re focusing on what will be necessary tomorrow, pushing the boundaries of physics and innovation to create sustainable, efficient, and high-performance cooling solutions for the next generation of computing.
If you would like to learn more about how we are pioneering this technology, you can connect with us on our contact page.