Design both power and cooling together to optimize AI infrastructure.
Power is at a premium. Eliminate stranded power by aligning AI clusters to data center capacity blocks.
Handle AI workload surges through system-level controls including power and cooling buffers.
Balance cost, redundancy, and risk in AI design.
Design for a mix of liquid and air cooling.
Design for the future.
Power into a data center is segmented into capacity blocks, commonly 1-3 MW, determined by industry-standard sizing of breakers or generators. AI is deployed in clusters, soon commonly at 100+ kW per rack and rising from there. Aligning clusters to capacity blocks helps ensure that every available kW is utilized.
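The alignment arithmetic can be sketched in a few lines. The block and rack sizes below are illustrative values drawn from the ranges above; the function names are hypothetical.

```python
# Sketch: align AI cluster sizes to data center capacity blocks.
# Block and rack sizes are illustrative (1-3 MW blocks, 100+ kW racks).

def racks_per_block(block_kw: float, rack_kw: float) -> int:
    """Whole racks that fit in one capacity block."""
    return int(block_kw // rack_kw)

def stranded_kw(block_kw: float, rack_kw: float) -> float:
    """Power left unusable after filling the block with whole racks."""
    return block_kw - racks_per_block(block_kw, rack_kw) * rack_kw

# A 1 MW block divides evenly into 100 kW racks, stranding nothing;
# 132 kW racks leave 76 kW of the block stranded.
print(racks_per_block(1000, 100), stranded_kw(1000, 100))
print(racks_per_block(1000, 132), stranded_kw(1000, 132))
```

Choosing rack densities (or block sizes) so the remainder is near zero is what eliminates stranded power.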
Power, cooling, and AI hardware compete for limited space and energy.
A holistic power and cooling design approach is required to maximize the share of space and energy dedicated to AI processing.
Consider the total cost of ownership, redundancy, and blast radius in AI power and cooling designs and the tradeoffs among them.
The value of AI hardware, at $1-4M+ per rack, and the processing it supports are driving increased consideration of redundancy in power and cooling designs, especially for inference applications.
Designs that limit the blast radius (the impact of losing a single capacity segment: a server, rack, or row) tend to use higher counts of smaller components, potentially at the expense of total cost of ownership.
Designs that favor total cost of ownership tend to use fewer larger components, often with redundancy, to reduce the possibility of a capacity segment's loss.
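The tradeoff between the two design philosophies can be made concrete with a back-of-the-envelope comparison. All unit sizes and costs below are hypothetical; real component pricing and capacities vary widely.

```python
# Sketch of the blast-radius vs. total-cost-of-ownership tradeoff.
# Unit capacities and costs are hypothetical placeholders.

def design(total_kw: float, unit_kw: float, unit_cost: float) -> dict:
    """Size a power/cooling system from identical units."""
    units = int(-(-total_kw // unit_kw))  # ceiling division
    return {
        "units": units,
        "cost": units * unit_cost,
        "blast_radius_kw": unit_kw,  # capacity lost if one unit fails
    }

few_large = design(3000, 1500, 900_000)   # 2 large units
many_small = design(3000, 250, 200_000)   # 12 small units
# few_large has the lower total cost, but one failure drops 1500 kW;
# many_small limits any single failure to 250 kW at a higher total cost.
```

The same arithmetic, with real vendor figures, is how the tradeoff between segment loss and total cost of ownership gets quantified.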
AI training tends to drive large numbers of processors to act in unison, creating massive power consumption surges that can repeat and degrade the performance and lifespan of power and cooling infrastructure.
Mitigation designs include system-level controls with rapid response and immediately accessible buffers in power and cooling capacity.
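One way to picture such a buffer is a fast local energy store (e.g., capacitors or batteries) that serves draw above a sustained limit and recharges during lulls. The control loop below is a minimal sketch, not a real controller; the limit, buffer size, and load profile are assumptions.

```python
# Sketch: smooth repeating training surges with a fast energy buffer.
# Draw above limit_kw is served from the buffer; lulls recharge it.

def smooth(draw_kw: list, limit_kw: float, cap_kwh: float,
           step_h: float = 1.0 / 3600) -> list:
    """Return grid draw per time step (default step: one second)."""
    soc = cap_kwh  # state of charge; start fully charged
    grid = []
    for d in draw_kw:
        if d > limit_kw:
            # Surge: discharge the buffer to cover the excess.
            discharge = min(d - limit_kw, soc / step_h)
            soc -= discharge * step_h
            grid.append(d - discharge)
        else:
            # Lull: recharge, but never exceed the sustained limit.
            recharge = min(limit_kw - d, (cap_kwh - soc) / step_h)
            soc += recharge * step_h
            grid.append(d + recharge)
    return grid

# Alternating 150 kW surge / 50 kW lull against a 100 kW sustained limit:
grid = smooth([150, 50] * 4, limit_kw=100, cap_kwh=1.0)
# with a charged buffer, upstream draw stays flat near 100 kW
```

Real mitigation couples this electrical buffering with equivalent thermal buffers (e.g., chilled-water storage) so cooling capacity tracks the same surges.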
The combination of liquid and air cooling has an interdependent impact on the ability to remove heat.
Power into the data center equals the heat rejected.
Air and liquid cooling temperatures and flows must stay within the operating envelope of the AI servers and the data center heat rejection equipment.
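The heat balance can be sketched with the standard relation Q = m·cp·ΔT. The liquid/air split, temperature rises, and rack power below are assumptions for illustration; the fluid properties are standard constants.

```python
# Sketch: size coolant and airflow so all rack power is rejected as heat
# within assumed temperature rises. Split and delta-Ts are illustrative.

WATER_CP = 4.186   # kJ/(kg*K), specific heat of water
WATER_RHO = 997.0  # kg/m^3
AIR_CP = 1.006     # kJ/(kg*K), specific heat of air
AIR_RHO = 1.2      # kg/m^3, near sea level

def coolant_flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    """Liquid loop flow (liters/minute) to carry heat_kw at delta_t_k rise."""
    kg_per_s = heat_kw / (WATER_CP * delta_t_k)
    return kg_per_s / WATER_RHO * 1000 * 60

def airflow_m3h(heat_kw: float, delta_t_k: float) -> float:
    """Airflow (cubic meters/hour) to carry heat_kw at delta_t_k rise."""
    kg_per_s = heat_kw / (AIR_CP * delta_t_k)
    return kg_per_s / AIR_RHO * 3600

# A 100 kW rack, 80% liquid-cooled at a 10 K coolant rise,
# 20% air-cooled at a 12 K air rise:
liquid = coolant_flow_lpm(80, 10)  # ~115 L/min
air = airflow_m3h(20, 12)          # ~5000 m^3/h
```

Because the liquid and air loops share the same heat-rejection plant, raising one loop's temperature or flow shifts the envelope available to the other, which is why the two must be designed together.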
AI power density will rapidly increase to 500 kW per rack. The typical lifespan of a data center is almost two decades, while the AI chip design cycle is less than two years.