Challenges in Data Center Cooling: An Engineer’s Perspective

Meet the Expert: Alfonso Ortega, PhD


Dr. Alfonso Ortega is the James R. Birle Professor of Energy Technology at Villanova University. He is also the director of the Villanova NSF Center for Energy Smart Electronic Systems and the Laboratory for Advanced Thermal and Fluid Systems (LATFS). He earned his BS in mechanical engineering from the University of Texas at El Paso and his MS and PhD in mechanical engineering from Stanford University.

Dr. Ortega is an internationally recognized researcher in thermal management of data centers and electronic systems. Prior to joining Villanova, he spent 18 years on the faculty of the Department of Aerospace and Mechanical Engineering at the University of Arizona, Tucson, where he directed the Experimental and Computational Heat Transfer Laboratory. He also served as program director for thermal transport and thermal processing at NSF’s Chemical and Transport Systems Division.

Dr. Ortega is a Fellow of the ASME. He received the 2003 SEMI-THERM THERMI Award and the 2017 ITherm Achievement Award for his contributions to the field of electronics thermal measurements.

The Evolution of the Data Center

Data centers aren’t new. Dating back to the 1960s, the earliest data centers were just called computer rooms, and they were exactly what the name suggested: a room full of computing devices, such as the self-contained minicomputers made by DEC (Digital Equipment Corporation), like the VAX, which resembled small refrigerators. Hospitals, schools, and companies had their own computer rooms on-site, and the applications they handled didn’t require much power: things like payroll, credit card transactions, and medical records.

The earliest computer rooms could be cooled primarily with systems adapted from traditional air conditioning. But as data needs grew, so did the size, complexity, and power consumption of these rooms. By the internet boom of the 1990s, simple on-site computer rooms weren’t enough. What emerged was colocation: shared facilities, or colos, that provided data center space, cooling, power, and infrastructure for multiple clients.

“A typical data center, for years and years, was an air-cooled data center,” Dr. Ortega says. “The typical architecture was a raised floor: basically, you had a concrete floor, and then a false floor several feet above it. Cold air was delivered through that gap and up through the false floor’s perforated panels, then sucked into server racks, which are about 2.5 feet wide, eight feet tall, and four feet deep, stacked with pizza-box-size servers. The cooling system was designed so cold air would pick up the heat from the servers, then get tossed out the back as hot air.”

Air cooling systems became more sophisticated over time: lining up racks more strategically, building containment methods to isolate hot and cold air streams, and using refrigerant-based systems to cool return air before sending it back through the system.

For a long time, that worked, until it got too expensive to run the chillers that cooled the air. Some data centers turned to economizer and evaporative cooling strategies: mixing cooler outside air with return air, or evaporating water into the airstream, to reduce its temperature before blowing it back into the data center. But as power demand kept creeping up, new solutions were needed.

“What started to happen was companies like Meta found you could get a lot of economy by going to larger data centers, which are now called hyperscale data centers,” Dr. Ortega says. “Very large companies have very large requirements: they provide cloud services, or online services, which require a lot of memory and data storage.”

As recently as five years ago, most data centers were still cooled with air. The shift from CPU processors to GPU processors has changed things: GPUs are better at the repetitive calculations needed for AI, but they also drive much higher power densities.

“What’s happened is like an arms race,” Dr. Ortega says. “We’ve gotten into a very rapid turnaround on generations of GPU chips: there’s a new generation practically every year, and every generation demands more power, but is also more powerful.”

The Switch to Liquid Cooling

The primary alternative to traditional air cooling is liquid cooling. It isn’t new, either: IBM, Fujitsu, and Cray have used it in supercomputing environments for decades. But what is new is the adoption of liquid cooling for commercial applications. And as the market for bigger, more powerful data centers has grown, the demand for liquid cooling solutions has grown in tandem.

“I’m a cooling specialist, and I can tell you that we passed the horizon of being able to cool these things with air about five years ago,” Dr. Ortega says. “We realized we had to shift over to liquid cooling. Currently, we’re in the first big adoption of liquid cooling technology at scale.”

Direct-to-chip water cooling has taken off, with hyperscale operators accounting for the largest share. The approach places small, liquid-cooled plates, called cold plates, on top of the hottest components inside a server: typically, the chips themselves. Water flows through these cold plates, absorbs the heat, and carries it away. But once again, there are natural limits.
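
The sizing logic behind a direct-to-chip loop is a simple energy balance: the heat the water carries away equals its mass flow rate times its specific heat times its temperature rise across the cold plate. Here is a minimal sketch of that arithmetic; the chip power and allowable temperature rise are illustrative assumptions, not figures from any particular system.

```python
# Rough energy balance for a direct-to-chip water loop (illustrative numbers only).
# Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT)

chip_power_w = 1000.0   # assumed chip (GPU package) power, in watts
delta_t_k = 10.0        # assumed water temperature rise across the cold plate, in kelvin
cp_water = 4186.0       # specific heat of water, J/(kg*K)
rho_water = 997.0       # density of water, kg/m^3

mass_flow_kg_s = chip_power_w / (cp_water * delta_t_k)
volume_flow_lpm = mass_flow_kg_s / rho_water * 1000.0 * 60.0  # liters per minute

print(f"Mass flow:   {mass_flow_kg_s:.4f} kg/s")
print(f"Volume flow: {volume_flow_lpm:.2f} L/min per chip")
```

At these assumed numbers, a 1 kW chip needs on the order of 1.4 liters of water per minute; multiply that across tens of thousands of chips and the pumping and distribution challenge Dr. Ortega describes comes into focus.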

“Water is a miraculous liquid: a great heat transfer fluid,” Dr. Ortega says. “But as the power demand increases, with GPUs getting more powerful, at some point we won’t be able to push any more water through the system. A decision will need to be made: what’s next?”

One limitation is erosion: water pushed through the loop at too high a velocity starts to wear away the metal through pressure and impact. Another limitation is cost: the lower the water’s temperature, the more heat it can remove, but chilling water is expensive. For now, that’s not a problem, as most systems can hold chips at around 85 C, generally within the normal operating range, with supply water as warm as 40 C. But as power increases, the water temperature will need to drop, which means mechanical chilling via refrigerant systems.
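
Why the supply water temperature matters so much can be seen with a back-of-the-envelope thermal budget: the chip runs at roughly the water temperature plus its power times the junction-to-water thermal resistance of the cold plate loop. The sketch below uses the 85 C chip target cited above; the thermal resistance and chip powers are assumptions chosen only for illustration.

```python
# Back-of-the-envelope thermal budget for a liquid-cooled chip (illustrative numbers only).
# T_chip ~ T_water_in + P * R_jw, where R_jw is the junction-to-water thermal resistance.

t_chip_limit_c = 85.0   # chip temperature target cited in the article, in C
r_jw_k_per_w = 0.03     # assumed junction-to-water resistance of the cold plate loop, K/W

for power_w in (700.0, 1000.0, 1500.0, 2000.0):
    max_water_in_c = t_chip_limit_c - power_w * r_jw_k_per_w
    print(f"{power_w:6.0f} W chip -> warmest allowable supply water ~ {max_water_in_c:.0f} C")
```

With these assumed numbers, 40 C water holds up to roughly 1.5 kW per chip; push the power higher and the supply water has to get colder, which is exactly where mechanical chilling enters the picture.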

“The problem is it takes a lot of energy to drive compressors, which are essentially mechanical pumps,” Dr. Ortega says. “Refrigeration is very expensive. It will be the ultimate death of liquid cooling when we can no longer use room temperature water—say 20 to 25 C—for cooling data centers.”

Immersion Cooling and Beyond

Another approach is immersion cooling: placing entire servers in a bath of an electrically non-conducting liquid, then removing heat through natural or pumped circulation. Because the fluid is in direct contact with hot surfaces, it can handle extremely high power densities. But there are significant tradeoffs, the most obvious being that it’s messy: you’re literally submerging your servers in mineral oil or another dielectric fluid.

“It sounds like science fiction, and I believe it is science fiction,” Dr. Ortega says. “I hate to say it, but I just don’t see that in the near future. It’d be a complete paradigm shift. No data center operator is going to want to go down that path at a really large scale.”

But there might be uses for immersion cooling at a smaller scale. Some data providers, like streaming services, are bringing computing closer to the point of use: several small, modular data centers, rather than one large one, can deliver higher bandwidth, lower latency, and better reliability because they sit closer to the users accessing them.

“In my opinion, that might be the application where immersion cooling shines: in these small, contained, modular data centers,” Dr. Ortega says. “If you get a leak, it’s not disastrous like it would be at a huge data center.”

For Dr. Ortega, the future is two-phase cooling: a system in which a refrigerant flows through cold plates placed on top of GPUs and boils as it absorbs their heat. It’s extremely efficient because the energy goes into changing the fluid’s phase from liquid to vapor rather than raising its temperature.
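
The efficiency argument for two-phase cooling comes down to latent versus sensible heat: boiling a fluid absorbs far more energy per kilogram than warming it by a few degrees. The quick comparison below uses a representative latent heat for a low-pressure refrigerant; both values are order-of-magnitude assumptions for illustration, not data from Dr. Ortega’s lab.

```python
# Sensible vs. latent heat absorbed per kilogram of coolant (order-of-magnitude comparison).

cp_water = 4.19           # specific heat of liquid water, kJ/(kg*K)
delta_t_k = 10.0          # assumed temperature rise in a single-phase water loop, K
h_fg_refrigerant = 200.0  # assumed latent heat of vaporization for a low-pressure refrigerant, kJ/kg

sensible_kj_per_kg = cp_water * delta_t_k  # water warming by 10 K
latent_kj_per_kg = h_fg_refrigerant        # refrigerant boiling at constant temperature

print(f"Water, +10 K:         {sensible_kj_per_kg:.0f} kJ/kg")
print(f"Refrigerant, boiling: {latent_kj_per_kg:.0f} kJ/kg "
      f"(~{latent_kj_per_kg / sensible_kj_per_kg:.1f}x per kilogram)")
```

The constant temperature matters as much as the ratio: because the refrigerant boils at a fixed saturation temperature, every cold plate in the loop can sit near the same temperature no matter how much heat it happens to be absorbing.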

“As a research laboratory, we’re working on concepts beyond those currently being used,” Dr. Ortega says. “And we’re working on two-phase cooling systems with the thinking that at some point, not now but at some point, they’re probably going to be needed. And we have to have the technological readiness to be able to shift over to this kind of cooling, this two-phase cooling.”

The Future of Data Center Cooling

Water cooling in data centers works for now, but it won’t forever. There’s no consensus timeline, but pessimists say the current system has another five years of high functionality. And yet this is an extremely risk-averse industry that won’t change until it has to, especially when hyperscale data centers have spent millions on their current infrastructure. It’s possible that changes to the cooling equation will come from other vectors.

“Chip manufacturers might be forced to do something innovative, rather than continuing to build these gigantic GPUs at high power,” Dr. Ortega says. “Back when Intel had a similar power cooling crisis, they took their chips and split them up, sectorizing their architecture. They learned how to run smart software that would throttle the chips, turning them down when not in use. They used a lot of tricks in software and architecture to reduce the amount of power, and by doing that, they were able to buy a lot of life in air cooling — extending it by probably a decade.”

Another looming issue is sustainability. Data centers are huge power consumers, and engineers are working aggressively on waste heat harvesting: basically, doing something useful with the heat that’s ejected from the data center. That heat could even be used to produce cooling, through what’s called absorption refrigeration.
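
To get a feel for the scale, nearly all of the electrical power a data center draws ends up as low-grade heat. The sketch below runs that arithmetic for an assumed facility size and applies a typical single-effect absorption chiller coefficient of performance; every input is an assumption for illustration, and the heat would have to be delivered at a temperature the chiller can actually use.

```python
# Rough scale of data center waste heat, and what an absorption chiller might recover
# (all inputs are illustrative assumptions).

it_load_mw = 50.0        # assumed IT load of a facility, MW
heat_fraction = 0.97     # nearly all electrical input ends up as heat
hours_per_year = 8760.0
absorption_cop = 0.7     # typical single-effect absorption chiller COP (thermal)

waste_heat_mw = it_load_mw * heat_fraction
waste_heat_mwh_per_year = waste_heat_mw * hours_per_year
potential_cooling_mw = waste_heat_mw * absorption_cop

print(f"Waste heat:        {waste_heat_mw:.1f} MW continuous "
      f"(~{waste_heat_mwh_per_year:,.0f} MWh per year)")
print(f"Potential cooling: {potential_cooling_mw:.1f} MW, if the heat can be supplied "
      f"at the temperature the chiller needs")
```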

There’s also the question of where a data center’s power should come from. Data centers draw power from the grid, with diesel generators providing backup during grid blackouts or brownouts. But the power demands from new and planned data center construction are unimaginably high, and the design and construction of new power plants to meet increased grid needs will not keep pace with demand. The emerging solution is to locate power generation on-site using natural gas turbines and fuel cells, and even small modular nuclear reactors.

“We have opportunities to run the onsite power generation and the data center in a synergistic way that makes the entire enterprise sustainable, for example, by recovering the waste heat from the co-located power generation plant,” Dr. Ortega says. “It’s a big deal, and I very much am a proponent of sustainable engineering. Major companies haven’t seen the business case for it until recently. But they’re taking a serious, serious look at this.”

Engineers will determine what the future of data centers looks like, but they’ll need to work together across disciplines. There’s a need for T-shaped systems engineers: specialists with deep domain expertise and enough cross-functional literacy to collaborate with colleagues in other fields.

“It’s important that you’re really good at one thing, your domain of expertise, and that you have the curiosity and willingness to learn whatever you need to overlap with people outside your field,” Dr. Ortega says. “We’ve got to work together on these very interdisciplinary problems. We can’t work in our individual silos. These are very coupled systems problems.”
