Data Center Operator Troubleshooting: Method and Story

In data center operations, one of the most valuable skills to have is TROUBLESHOOTING. While many data center power and cooling systems are automated, things do sometimes fail. Or there could be an efficiency issue that doesn’t show up as a fault or alarm, but could still cause real problems in the building.

With logical troubleshooting skills, operators can diagnose and solve issues in equipment and systems, while also learning and applying systems knowledge.

The commonly referenced “Six Step Troubleshooting Method” comes from a 1965 US Navy manual, “Troubleshooting Communications Equipment/NAVPERS 93500”. It is widely used because it follows logical steps, rather than relying on guessing and checking, or on searching the tech manual and not finding the answer.

The six steps:
1. Symptom recognition
2. Symptom elaboration
3. List the probable faulty functions
4. Localize the faulty function
5. Localize the faulty circuit/component
6. Post failure analysis

An explanation of the steps with a real world example:

Initial conditions:

It is the spring season at a data center using CRAHs with evaporative media for cooling when temperature conditions exceed the BMS outside air setpoint. All zones in the data hall and electric rooms have transitioned to the adiabatic/evap cooling mode. After around 15 minutes, an operator is reviewing the supply air temperatures in BMS to ensure all units have transitioned to the cooling mode, and notices a CRAH with very high supply air temperature (SAT), but not high enough for the alarm…

1. Symptom Recognition:
An operator sees a CRAH in this zone has a much higher supply air temperature than the surrounding CRAH units. There is no alarm triggered for “High SAT” from this unit, but the higher than usual temperature is a cause for concern. The operator has just RECOGNIZED A SYMPTOM of an issue in this cooling system.

2. Symptom Elaboration:
The operator gathers more data from BMS for the zone, the surrounding CRAHs, and the affected unit. The zone is being cooled properly, and the surrounding CRAHs all have the expected SAT and are in the correct mode. The affected CRAH is in the correct mode, all fan speeds are fine, all dampers match the positions on the other CRAHs, and there are no alarms. The operator is ELABORATING THE SYMPTOMS by gathering more information that could narrow down the possible faulty functions of the CRAH.
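The peer comparison in this step can be sketched in a few lines of Python. This is only an illustration: the unit names, the temperatures, and the 3 °C deviation threshold are made-up values, not readings or setpoints from the actual site.

```python
from statistics import median

# Hypothetical supply air temperatures (deg C) for one zone, as read from BMS.
zone_sat_c = {
    "CRAH-01": 21.0,
    "CRAH-02": 21.4,
    "CRAH-03": 27.8,
    "CRAH-04": 20.9,
}

def find_sat_outliers(sat_by_unit, max_deviation_c=3.0):
    """Return units whose SAT sits more than max_deviation_c above the zone median."""
    zone_median = median(sat_by_unit.values())
    return {
        unit: round(sat - zone_median, 1)
        for unit, sat in sat_by_unit.items()
        if sat - zone_median > max_deviation_c
    }

# Maps each outlier unit to how far (deg C) it is above the zone median.
outliers = find_sat_outliers(zone_sat_c)
print(outliers)
```

Comparing against the zone median instead of a fixed alarm limit is what lets a unit stand out even when it is still below the “High SAT” alarm.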

3. List the probable faulty functions:
The operator knows these types of CRAHs well, and begins to list the functions and cross-check them against the symptoms he has.
-Fans: Running at normal speeds, no alarms
-Dampers: Responding to BMS inputs and matching the adjacent units
-Electrical power: Fans are on and no alarms
-Water: Unknown. There are no BMS indications for water in the unit, but this is the likely faulty function given the cooling mode change and the still-high SAT.
Listing these out has helped match the symptoms found earlier and establish that there might be a water issue with this CRAH.
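The cross-check above is a process of elimination, and it can be written down as a small Python sketch. The dictionary below is just one hypothetical way to record the evidence; it is not a BMS export format.

```python
# For each CRAH function, what the BMS data says about it:
# True  = evidence the function is healthy
# False = evidence of a fault
# None  = BMS gives no indication either way
bms_evidence = {
    "fans": True,              # normal speeds, no alarms
    "dampers": True,           # responding, matching adjacent units
    "electrical power": True,  # fans are on, no alarms
    "water": None,             # no BMS points for water in the unit
}

def probable_faulty_functions(evidence):
    """Keep every function that the available data could not clear."""
    return [name for name, healthy in evidence.items() if healthy is not True]

suspects = probable_faulty_functions(bms_evidence)
print(suspects)
```

Anything that cannot be confirmed healthy stays on the list, which is why “unknown” water ends up as the suspect rather than being ignored.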

4. Localize the faulty function:

The operator has gathered this data from BMS and listed a few functions, determining there may be a water issue for this CRAH. Before heading to the data hall to check this CRAH locally, the operator lists a few items that can cause trouble in the water system, using drawings, systems knowledge, and experience.

-Sump pump: Possible issue, but no alarms
-Sump fill valve: Maybe the sump is not filling?
-Isolation water valve: Not likely, the operator and team did the valve lineups in spring prep
-Drain valve: Maybe the actuator stuck open, never allowing the sump to fill?
-Level probes: Maybe they are misaligned to the normal levels?
-Balancing valve: Maybe this is stuck, not allowing water to the media?
-Clogged showerhead over media: Possible?

The operator has listed these to identify suspected failed functions, LOCALIZING THE FAULTY FUNCTION. Listing these in order of likelihood will help the troubleshooting and prioritize which possibly faulted component to check first.
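Ordering the candidate list can be as simple as a sort. The sketch below is an assumption about how an operator might record it: the “possible” and “unlikely” labels paraphrase the notes in the list above, and the ranking scheme itself is hypothetical.

```python
# Candidate water-system components with the operator's rough judgment.
candidates = [
    ("sump pump", "possible"),
    ("sump fill valve", "possible"),
    ("isolation water valve", "unlikely"),  # valve lineups were done in spring prep
    ("drain valve", "possible"),
    ("level probes", "possible"),
    ("balancing valve", "possible"),
    ("clogged showerhead over media", "possible"),
]

# Lower number = check sooner. "likely" is included for completeness.
RANK = {"likely": 0, "possible": 1, "unlikely": 2}

# sorted() is stable, so items with the same rank keep their listed order.
check_order = sorted(candidates, key=lambda item: RANK[item[1]])

for component, judgment in check_order:
    print(f"{component}: {judgment}")
```

Here the only effect is that the isolation water valve drops to the bottom of the list, which matches the operator's reasoning that it was already verified during spring prep.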

5. Localize the faulty circuit/component:

The operator takes his list and tools to the data hall, where the CRAH with the possible water issue is located. After safely setting up the work area, shutting down the unit, and opening the CRAH, the operator gets to work with the list:
-Sump is full of water, eliminating valving issues in the sump
-Level probes are installed properly and in the water
To check the other possibly faulted functions, the operator turns on the CRAH with the maintenance doors open, and sees that the sump pump doesn't start. The breaker for the sump pump is closed, so the pump has power available but is not starting. That points to the level probe that “tells” the sump pump it is safe to start without cavitation issues. The operator shines a flashlight into the sump and sees the level probe, but it is missing its magnetic float. A rusted cotter pin is in the bottom of the sump…
The operator has properly LOCALIZED THE FAULTY COMPONENT with visual inspection and checking the previously made list.

6. Failure Analysis:

The operator finds the magnetic float and reinstalls it on the level probe with a new cotter pin, and the sump pump starts immediately. After checking all other systems and BMS, the operator confirms the CRAH is back in operation and cooling properly in the correct mode. The cotter pin had rusted off, allowing the magnetic float to fall off when the sump was emptied. The level probe that the magnetic float was on provided the starting signal for the sump pump, so the pump would not start. With the pump not running, the media never became wetted, and without wetted media there is no evaporative cooling, meaning the SAT for this CRAH does not change when in adiabatic/evap mode. The operator shares the info with his operations team and engineering, suggests better quality materials for water-exposed components, and even thinks of a new alarm that could detect this in the future.

The operator has accurately repaired the unit and conducted FAILURE ANALYSIS to conclude the troubleshooting. The Navy technical manual says of this step: “When the faulty part has been identified, it should not be replaced until you can substantiate that it is causing the actual trouble.” This is a very important key to troubleshooting, and it can prevent operators from replacing components that may not be the problem. In the example, the operator could’ve assumed the pump had failed and replaced it, but the issue would still have been there in the probe. The cotter pin was the actual failure, and also a lot less complicated to replace!
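As for the new alarm idea from the failure analysis, one possible shape for it is sketched below in Python. Everything here is an assumption for illustration: the function name, the 2 °C minimum drop, and the temperatures are hypothetical, and the 15 minute settle time is borrowed from how long the operator waited in the story. A real BMS alarm would need engineering review.

```python
def evap_ineffective(sat_before_c, sat_now_c, minutes_in_evap,
                     min_drop_c=2.0, settle_minutes=15):
    """Flag a unit whose SAT has not dropped after switching to evap mode.

    sat_before_c:    SAT just before the mode change (deg C)
    sat_now_c:       current SAT (deg C)
    minutes_in_evap: time since the unit entered adiabatic/evap mode
    """
    if minutes_in_evap < settle_minutes:
        return False  # too early to judge, media may still be wetting
    return (sat_before_c - sat_now_c) < min_drop_c

# A unit with dry media: 20 minutes in evap mode, SAT barely moved.
print(evap_ineffective(28.0, 27.8, minutes_in_evap=20))
# A healthy unit: same time in mode, SAT dropped by several degrees.
print(evap_ineffective(28.0, 21.0, minutes_in_evap=20))
```

The point of checking the SAT change rather than the pump is that, in this story, BMS had no indication for water in the unit at all, so the alarm has to infer the problem from its effect.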

In conclusion, the six step troubleshooting method is quite effective at narrowing down the function and component that failed, and at promoting proper repair and replacement only once the suspected component is established as the cause of the issue. This is much more efficient than just guessing and checking, which can get expensive when replacing major data center components. This troubleshooting method can also address issues that are not written in the technical manual troubleshooting guides. Like anything, this method is a skill that should be sharpened with experience and practice, along with some healthy knowledge of the equipment being worked on. Always be safe and use the proper equipment vendors when needed, but following along and doing a paper exercise of the troubleshooting is always valuable and proactive.

All my articles are handwritten/typed by me. By reading and sharing these, you support this work and you support real human authorship.
