Running Large Models Today: A Practical Checklist for Liquid-Cooled Colocation

2026-04-08
8 min read

A hands-on checklist for devs and DevOps evaluating liquid cooling (DLC & RDHx) in colo: power, cabling, contracts, monitoring, and migration traps.


If your team needs to deploy GPU-dense clusters for LLM training or inference now — not in some roadmap future — liquid cooling in colocation is no longer a nice-to-have: it is often mandatory. This hands-on checklist walks developers and DevOps teams through the practical, contract-level, and operational items you must validate when evaluating direct-to-chip (DLC) and rear door heat exchanger (RDHx) solutions in a colocation environment.

Why liquid cooling and colocation matter now

High-density compute has outpaced traditional air-cooled data centers. Modern GPUs and multi-GPU nodes pack so much heat into a rack that air cooling either becomes impossibly loud, prohibitively power-hungry, or simply ineffective. Liquid cooling options — DLC (direct-to-chip cold plates) and RDHx (rear door heat exchangers) — enable much higher watt-per-rack densities, which translates into more model training throughput per square foot. But to get the benefits, you must align technical requirements, contracts, and operational practices before you move metal.

Top-level checklist (quick view)

  1. Confirm facility power capacity: per-rack kW guarantees and breaker specifics.
  2. Define liquid-cooling tech: DLC vs RDHx suitability for your hardware and ops model.
  3. Verify chilled water / coolant specs: temperature, flow, water quality, and redundancy.
  4. Lock down cabling and power distribution: busbars, PDUs, and connector types.
  5. Negotiate contracts: SLA for power, coolant, remote hands, and exit terms.
  6. Plan thermal and electrical monitoring: sensors, alerts, and telemetry pipelines.
  7. Prepare migration strategy: lift, install, rollback, and failure modes.

Power: the non-negotiable foundation

Power is the primary gating factor for high-density GPU racks. Don’t assume “enough” means “available.” Dig into specifics:

  • Per-rack kW: Get a guaranteed per-rack continuous power value (e.g., 30–60 kW). Ask for measurement history, not just nameplate capacity.
  • Phasing and voltage: Confirm whether the site provides 208V 3-phase, 400V 3-phase, or custom busbar feeds and what PDUs you’ll need.
  • Distribution: Prefer busbar-fed cabinet strips or high-amp switched PDUs over dangling extension cords. Verify breaker sizes and available breaker slots.
  • Redundancy: Decide between N, N+1, and 2N for UPS and generator — for training clusters you may want 2N or at least automatic failover.
  • Metering and billing: Ensure per-rack metering and transparent billing for kW usage, demand charges, and power factor adjustments.

Action items

  • Request detailed one-line diagrams and breaker schedules from the colo.
  • Plan for ~20–30% headroom above sustained load to cover safety margins and peak usage.
  • Confirm allowed power cord types (IEC C19/C20, large pin connectors) and procurement lead times.
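The headroom rule above can be expressed as a quick sanity check. This is an illustrative sketch: the function names and the 25% default are placeholders, not from any provider's spec.

```python
def required_rack_kw(sustained_kw: float, headroom: float = 0.25) -> float:
    """Per-rack power to provision: sustained load plus safety headroom
    (default 25%, the middle of the ~20-30% range suggested above)."""
    return sustained_kw * (1 + headroom)

def fits_guarantee(sustained_kw: float, guaranteed_kw: float,
                   headroom: float = 0.25) -> bool:
    """True if the colo's guaranteed per-rack kW covers load plus headroom."""
    return required_rack_kw(sustained_kw, headroom) <= guaranteed_kw

# A rack sustaining 40 kW needs 50 kW provisioned at 25% headroom,
# so a 45 kW per-rack guarantee is not enough.
```

Run this against measured sustained draw from a pilot, not nameplate TDP sums, to avoid over-provisioning.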

Choosing between DLC and RDHx

Both options remove heat from the rack, but they do so at different densities and operational complexity:

  • RDHx (rear door heat exchanger): A passive or low-power rear-door unit that transfers rack exhaust heat to facility chilled water. Easier to deploy, less invasive to servers, and works well at moderately high densities. Typically good for 10–30 kW/rack depending on blower capacity and water temperature.
  • DLC (direct-to-chip): Cold plates directly on CPUs/GPUs with a closed coolant loop. Higher density (30–100+ kW/rack), better thermal control, but requires leak detection, fluid handling, and more invasive service procedures.

Action items

  • Map your hardware: identify models and TDPs per GPU, and estimate total sustained rack wattage.
  • Run heat-budget scenarios for both DLC and RDHx using your worst-case sustained utilization (e.g., full-speed LLM training).
  • Require facility proof-of-concept (POC) runs if possible: staged load tests or thermal modeling results.
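A back-of-the-envelope heat budget for the mapping step can be sketched as follows. The per-node host overhead and the per-rack envelopes are assumptions drawn from the ranges quoted above, not vendor numbers; validate them against your actual hardware.

```python
def rack_watts(nodes: int, gpus_per_node: int, gpu_tdp_w: float,
               host_overhead_w: float = 1500.0) -> float:
    """Sustained rack wattage: GPU TDPs plus an assumed per-node
    overhead for CPUs, NICs, and fans (1.5 kW here is a guess)."""
    return nodes * (gpus_per_node * gpu_tdp_w + host_overhead_w)

def cooling_fit(rack_kw: float) -> str:
    """Crude mapping onto the envelopes above: RDHx up to ~30 kW/rack,
    DLC for ~30-100+ kW/rack."""
    if rack_kw <= 30:
        return "RDHx or DLC"
    if rack_kw <= 100:
        return "DLC"
    return "DLC (verify limits with the facility)"

# Four 8-GPU nodes at 700 W TDP each land around 28.4 kW sustained,
# near the top of the RDHx envelope.
```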

Cooling liquid, piping, and water quality

Liquid systems introduce new failure modes and contract line items. Get these specs in writing:

  • Coolant type: chilled water, glycol mix, or dielectric fluids — each has implications for freeze protection, conductivity, and maintenance.
  • Temperatures and delta-T: Supply and return temperatures and expected delta (e.g., 12°C supply at X L/min giving 6–8°C delta per rack).
  • Flow and pressure: Minimum flow rates and pressure ranges per rack; understand pump head limits when racks are populated across aisles.
  • Water quality and treatment: Hardness, conductivity, corrosion inhibitors, and scheduled treatment windows.
  • Leak detection, isolation, and emergency shutdown procedures.
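The flow/delta-T relationship above is simple thermodynamics: heat removed equals mass flow times specific heat times delta-T. A quick helper for sanity-checking quoted flow rates, assuming a water-like coolant (glycol mixes have lower specific heat, so they need more flow for the same heat load):

```python
def required_flow_lpm(heat_kw: float, delta_t_c: float,
                      cp_kj_per_kg_k: float = 4.186,
                      density_kg_per_l: float = 1.0) -> float:
    """Coolant flow (L/min) needed to carry heat_kw at a given delta-T.
    Q = m_dot * cp * dT  =>  m_dot = Q / (cp * dT). Defaults are for water."""
    kg_per_s = heat_kw / (cp_kj_per_kg_k * delta_t_c)
    return kg_per_s / density_kg_per_l * 60.0

# A 30 kW rack at a 7 degree C delta needs roughly 61 L/min of water.
```

If a provider's quoted per-rack flow falls well below this number for your heat load and delta-T, something in the spec does not add up.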

Action items

  • Include water/coolant specs and maintenance windows in your contract. Ask for historical water quality logs.
  • Require visible leak-detection outputs to your monitoring stack and automated shutdown behavior on high-severity events.
  • Verify whether the colo provides supply pumps or you must supply rack-level pumps/heat exchangers.

Cabling, rack mechanics, and logistics

High-density racks are heavy, wired, and cramped. Cable management and mechanical constraints are frequently overlooked migration traps.

  • Weight and floor loading: Confirm raised-floor and slab loading capacity for fully populated racks and moving gear in/out.
  • Hoists and elevator access: Check elevator dimensions, door openings, and service windows for delivering heavy chassis.
  • Cable paths: Plan for NVLink, power feeder, and network fiber entry points. Avoid shared overhead trays with conflicting service windows.
  • PDUs and busbars: Ensure enough PDU ports and consider hot-swap PDU modules for maintenance without downtime.

Action items

  • Obtain rack elevation drawings and confirm physical clearances for cold plates and rear doors.
  • Reserve service windows for large installs and plan conservative timelines for initial rack population.

Contracts and SLAs

Contracts must go beyond standard colocation text when liquid cooling and high densities are involved. Key items:

  • Power SLA: guaranteed kW per rack, uptime percentage, and credits for missed SLAs. Clarify how demand spikes are billed.
  • Cooling SLA: guaranteed coolant temperatures, flow, and redundancy. Include penalties for missed thermal targets.
  • Remote hands and response times: 24/7 vs business hours, parts retention, and escalation pathways.
  • Change control: who authorizes installing DLC plates, making coolant changes, or swapping RDHx doors.
  • Exit and migration terms: decommission, floor restoration, and potential capacity release fees.
  • Insurance and liability for leaks or water damage: define limits and responsibilities for hardware vs facility damage.

Action items

  • Add explicit acceptance tests (thermal and electrical) to the contract and tie them to payment milestones.
  • Negotiate transparency around PUE reporting, energy cost pass-through, and metering access.

Monitoring and observability

Operationalizing liquid-cooled clusters needs a telemetry-first approach. Your monitoring stack should include both IT and facilities data.

  • Thermal metrics: inlet temps, outlet temps, coolant delta-T, per-GPU die temps, and ambient aisle temps.
  • Fluid metrics: flow rate, pressure, leak sensors and conductivity sensors integrated into alerts.
  • Electrical metrics: per-rack kW, current by phase, breaker status, power factor, and UPS runtime.
  • Network and GPU telemetry: PCIe/InfiniBand link health, NVLink errors, GPU load and memory usage.
  • Long-term trend data: thermal maps and power usage over weeks to detect creeping hotspots.
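A minimal threshold check over the thermal and fluid metrics above might look like this. The thresholds here are placeholders; pull real values from your facility contract and hardware specs.

```python
from dataclasses import dataclass

@dataclass
class RackTelemetry:
    inlet_c: float    # coolant/air inlet temperature
    outlet_c: float   # coolant/air outlet temperature
    flow_lpm: float   # coolant flow rate
    kw: float         # per-rack power draw

def check_rack(t: RackTelemetry, max_inlet_c: float = 27.0,
               delta_t_band: tuple = (4.0, 10.0),
               min_flow_lpm: float = 20.0) -> list:
    """Return alert strings for out-of-band facility readings."""
    alerts = []
    if t.inlet_c > max_inlet_c:
        alerts.append("inlet temp high")
    dt = t.outlet_c - t.inlet_c
    if not (delta_t_band[0] <= dt <= delta_t_band[1]):
        alerts.append("delta-T out of band")
    if t.flow_lpm < min_flow_lpm:
        alerts.append("coolant flow low")
    return alerts
```

In practice these checks belong in your alerting system as rules over streamed facility telemetry, not in ad-hoc scripts; the point is that every threshold should be an explicit, contract-backed number.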

Action items

  • Require APIs or streaming telemetry access from the colo for facility sensors and metering.
  • Integrate facility metrics into your incident/alerting system and run playbook drills for coolant events.
  • Instrument capacity planning dashboards combining job scheduling, power draw, and inlet temps.

Migration traps and rollback planning

Many teams fail during the migration phase because they try to retrofit assumptions from air-cooled operations. Watch out for these traps:

  • Underestimating minimum water temperature requirements for stable DLC operation.
  • Assuming RDHx will fix hot spots without verifying exhaust flow patterns and blower capacity.
  • Not planning for firmware/BIOS settings that change node power draw — small BIOS changes can push a rack over its power budget.
  • Ignoring emergency rollback processes if a DLC leak forces immediate node removal.

Action items

  • Create a stepwise migration playbook: pilot a single rack at full sustained workload, analyze telemetry for 72 hours, then scale to 3–5 racks before large rollouts.
  • Keep a hot spare air-cooled rack footprint to receive servers temporarily in case of coolant incidents or extended facility outages.
  • Test your onboarding/offboarding checklist: mounting cold plates, torque specs, and leak-test procedures.
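One way to put a number on creeping hotspots during the 72-hour pilot analysis is a least-squares slope over inlet temperature samples. This is a simple sketch; a real pipeline would query your metrics store rather than take a Python list.

```python
def thermal_drift_per_day(samples_c: list, samples_per_day: int) -> float:
    """Least-squares slope of evenly spaced temperature samples,
    scaled to degrees C per day. Sustained positive drift during a
    pilot run is a red flag before scaling out."""
    n = len(samples_c)
    mean_x = (n - 1) / 2
    mean_y = sum(samples_c) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_c))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return (num / den) * samples_per_day
```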

Operational best practices

  1. Run continuous thermal POCs — don’t accept theoretical calculations alone.
  2. Automate alerts for anomalies (fast-rising inlet temp, sudden flow loss, delta-T out of band).
  3. Keep a small testbed for firmware and BIOS changes to measure real power implications before cluster-wide updates.
  4. Document all facility interactions and get as much monitoring access as possible into your observability stack.

Further reading and next steps

Evaluating liquid-cooled colocation for immediate high-density compute is a multidisciplinary exercise — power engineering, facilities management, procurement, and DevOps must be aligned. If you’re interested in how AI infrastructure is evolving or how DevOps practices are changing to accommodate next-gen hardware, check out our analysis in The Future of AI in DevOps. For teams thinking about supply chain and model safety around hardware moves, our write-up on Model-Safe Supply Chains is a useful companion.

Use this checklist as a conversation starter with colo providers and your internal stakeholders. Insist on measurable guarantees, clear telemetry access, and dry-run acceptance tests. With those in place, you can run GPU-dense LLM training and inference clusters today with confidence — and without turning future planning assumptions into last-minute crises.
