Breaking down. Non-functional. Not working. Busted. Totalled. Not currently in operation. 404 – resource not found. These are the states of a business that is not making money, improving its reputation, or satisfying customers, partners or employees. Any way you look at it, if it’s breaking, it’s bad. And (spoiler alert) it didn’t turn out so well for Walter White, did it?
Thankfully, the voodoo days of early computing and web technologies are behind organisations. An actual bug, of the insect kind, is exceptionally unlikely to damage a component. Software bugs, however, are a much bigger threat given the complexity of the IT estate running within a data centre. Then there are little causes, often driven by people, which result in power faults like the one that brought down BA’s services in such a headline-catching way earlier this year.
At the other end of the scale, there are sometimes very big causes too, such as the SAN failures and extreme weather that have affected Fujitsu. UPS failure, in one form or another, is often cited as a contributing cause in data centre problems, as it was at BA and Microsoft.
And the bigger the data centre and the more important the solutions it hosts, the more focus this naturally gets. The idea of AWS falling down gets many very important business and governmental leaders hot under the collar. AWS runs behind the scenes of many tech and financial giants that entire countries rely on.
So, if the steady drip of data centre stories hitting the international news offers anything, it’s a quick insight into the likely ways a data centre manager or operator can improve their facilities, processes, and assets in order to avoid joining their peers in the news cycle. Because when it comes to data centre operation, no news is generally good news.
So here are some simple things to check off the list to ensure that your pride and joy doesn’t end up in the data doldrums.
Key learning – ensure all stakeholders, partners, contractors, and employees work together
Your IT personnel, facility managers, and third parties must work together and share information. At times of change, this is supremely important. In fact, IT as a service has to be fully brought into the facilities side.
This will help to avoid situations like power overloads where new equipment was brought in and facility managers were unaware of it and unable to properly support it.
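As an illustration of the kind of check IT and facilities could agree on before new equipment arrives, here is a minimal sketch. The Circuit class, the example figures and the 80% headroom factor are all assumptions made for the example, not values from any particular site or standard.

```python
from dataclasses import dataclass


@dataclass
class Circuit:
    """A hypothetical power circuit with a rated capacity and its current load."""
    name: str
    rated_watts: float
    load_watts: float


def can_add_load(circuit: Circuit, new_watts: float, headroom: float = 0.8) -> bool:
    """Return True if the new load keeps the circuit within its derated capacity.

    The 0.8 headroom factor is an illustrative assumption; use whatever
    derating your facilities team has actually agreed.
    """
    return circuit.load_watts + new_watts <= circuit.rated_watts * headroom


# Example values are invented for illustration.
rack_feed = Circuit(name="rack-12-feed-A", rated_watts=5000, load_watts=3200)
print(can_add_load(rack_feed, 600))   # 3800 W against a 4000 W derated limit
print(can_add_load(rack_feed, 1000))  # 4200 W against a 4000 W derated limit
```

The point is less the arithmetic than the shared record: the check only works if facilities publishes the circuit ratings and IT declares the new load before installation.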
Key learning – documentation and processes are your friend
Documented processes are key to ensuring that consistent information is shared with all parties that need it. Good information sharing allows everyone to look back on what’s been done and improve procedures to avoid future disruptions. Always be learning and breaking bad habits.
Key learning – understand the resilience
The ability to perform power failure simulations by “virtually” switching devices off, without affecting the production environment, allows for a well thought out action plan to recover services. News headlines about data centre operators who assumed their power chain and back-up systems were foolproof, without ever testing the failsafe, are common.
Power failure simulation enables you to locate where redundancy is lacking and uncover single points of failure. Needless to say, it’s imperative to build and document your recovery plan in advance of a catastrophic power failure.
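To make the idea concrete, the following is a minimal sketch of a “virtual” switch-off over a documented power chain. The device names and topology are hypothetical, and the model assumes the chain contains no loops; a real DCIM tool would work from live inventory data rather than a hand-written dictionary.

```python
from typing import Dict, List, Set

# Hypothetical power chain: each device maps to the devices that feed it.
# An empty list marks a source (for example a utility feed).
FEEDS: Dict[str, List[str]] = {
    "utility": [],
    "ups-a": ["utility"],
    "ups-b": ["utility"],
    "pdu-1": ["ups-a", "ups-b"],  # dual-fed
    "pdu-2": ["ups-a"],           # single-fed
    "rack-1": ["pdu-1"],
    "rack-2": ["pdu-2"],
}


def is_powered(device: str, off: Set[str]) -> bool:
    """A device has power if it is on and is a source or has a powered feed."""
    if device in off:
        return False
    upstream = FEEDS[device]
    if not upstream:
        return True
    return any(is_powered(u, off) for u in upstream)


def single_points_of_failure(loads: List[str]) -> Dict[str, List[str]]:
    """Virtually switch off each non-load device in turn and record which loads drop."""
    result: Dict[str, List[str]] = {}
    for device in FEEDS:
        if device in loads:
            continue
        dropped = [load for load in loads if not is_powered(load, {device})]
        if dropped:
            result[device] = dropped
    return result


print(single_points_of_failure(["rack-1", "rack-2"]))
```

In this made-up layout, ups-b does not appear in the result because everything it feeds has an alternative path, while ups-a does, since pdu-2 depends on it alone. That is exactly the sort of gap in redundancy a simulation is meant to expose.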
Key learning – know the whole power chain
It is vital that the power chain is documented all the way through, from where the power enters the building, through the UPSs and PDUs, to all rack-mounted equipment. Know what is connected to what, as well as the devices’ respective interdependencies.
With this knowledge, you can understand the potential impact if a certain piece of equipment fails or is taken offline for maintenance. Additionally, be aware of the maintenance status of each power chain device: for example, where is it in its useful lifecycle?
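One way such documentation might be kept in a form that can answer both questions is sketched below. The PowerDevice record, the device names, the installation dates and the service-life figures are all invented for illustration; real values would come from your asset register and the manufacturers’ guidance.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class PowerDevice:
    """One documented entry in a hypothetical power chain inventory."""
    name: str
    installed: date
    service_life_years: int  # assumed figure, for illustration only
    feeds: List[str] = field(default_factory=list)  # devices directly downstream


def downstream(inventory: Dict[str, PowerDevice], name: str) -> List[str]:
    """List every device that sits below `name` in the documented chain."""
    found: List[str] = []
    stack = list(inventory[name].feeds)
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(inventory[current].feeds)
    return sorted(found)


def past_service_life(device: PowerDevice, today: date) -> bool:
    """True once the device has been installed longer than its assumed life."""
    age_years = (today - device.installed).days / 365.25
    return age_years > device.service_life_years


inventory = {d.name: d for d in [
    PowerDevice("ups-a", date(2012, 3, 1), 10, feeds=["pdu-1", "pdu-2"]),
    PowerDevice("pdu-1", date(2019, 6, 1), 15, feeds=["rack-1"]),
    PowerDevice("pdu-2", date(2019, 6, 1), 15, feeds=["rack-2"]),
    PowerDevice("rack-1", date(2020, 1, 1), 15),
    PowerDevice("rack-2", date(2020, 1, 1), 15),
]}

# What is affected if ups-a is taken offline, and is it overdue for replacement?
print(downstream(inventory, "ups-a"))
print(past_service_life(inventory["ups-a"], date(2024, 1, 1)))
```

Keeping connections and lifecycle data in the same record means a maintenance planner can see, in one query, both what a device supports and how urgently it needs attention.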
Key learning – monitor operations in real-time
It’s immensely helpful to know, at any given time, what energy is being used, where and by which devices. Building management systems (BMS) are very useful, but they are also very specific and tend to keep data siloed.
You need to ask yourself, “Do I have the capability to look at all the information, all the infrastructure components in the facility, and see the entire power management system in one view?” A holistic view brings real-time monitoring and alarming that enables data centre operators to mitigate risks and make changes faster, and so avoid disaster.
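As a rough sketch of what “one view” could mean in practice, the example below merges readings from two separate sources and raises an alarm for any device approaching its limit. The source names, device names, limits and the 90% warning ratio are hypothetical; real feeds would arrive through whatever interfaces your BMS and PDUs expose.

```python
from typing import Dict, List, NamedTuple


class Reading(NamedTuple):
    source: str   # which system reported it, e.g. "bms" or "pdu-poller"
    device: str
    watts: float


def alarms(readings: List[Reading], limits: Dict[str, float],
           warn_ratio: float = 0.9) -> List[str]:
    """Return a message for each device drawing at least warn_ratio of its limit.

    The 0.9 default is an illustrative assumption, not a recommended setting.
    """
    raised = []
    for r in readings:
        limit = limits.get(r.device)
        if limit is not None and r.watts >= limit * warn_ratio:
            raised.append(
                f"{r.device}: {r.watts:.0f} W of {limit:.0f} W (via {r.source})"
            )
    return raised


# Two otherwise siloed feeds, with invented values.
bms_feed = [Reading("bms", "crac-1", 7400.0), Reading("bms", "ups-a", 36000.0)]
pdu_feed = [Reading("pdu-poller", "pdu-1", 4700.0),
            Reading("pdu-poller", "pdu-2", 2100.0)]
limits = {"crac-1": 8000.0, "ups-a": 50000.0, "pdu-1": 5000.0, "pdu-2": 5000.0}

for message in alarms(bms_feed + pdu_feed, limits):
    print(message)
```

Because every reading carries its source, an operator can see which system reported the figure without having to log in to each one separately.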
Key learning – identify changing trends and respond promptly to warning signs
This is the flip side of real-time monitoring. As critical as up-to-the-minute information is, it’s also vital to be able to analyse data centre performance over a long period of time, so trends and patterns can be pinned down for easier long-term forecasting. You can then plan for change and fluctuations, balance load, predict future capacity needs, plan workflow, and schedule service.
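For a sense of how historical data turns into a forecast, here is a minimal sketch that fits a straight line to monthly peak load and estimates when it would reach capacity. The sample figures and the 60 kW capacity are invented, and a straight-line fit is a simplifying assumption; real load often grows in steps or seasonally.

```python
from typing import List, Optional, Tuple


def linear_fit(values: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for values sampled at 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values: List[float], capacity: float) -> Optional[float]:
    """Periods after the latest sample at which the fitted line hits capacity.

    Returns None when the trend is flat or falling, since the line never
    reaches the capacity in that case.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - (len(values) - 1)


# Hypothetical monthly peak load in kW for one hall.
monthly_peak_kw = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]
print(periods_until(monthly_peak_kw, capacity=60.0))
```

With these made-up numbers the load grows by 2 kW a month, so the fitted line reaches 60 kW five months after the last sample, which is the kind of lead time that makes scheduling an upgrade possible.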
Key learning – understand the overall vulnerability of the power chain
There are many more data centre devices that connect to a network besides what’s contained in racks; there are terminals and points of access everywhere. You should ask when your passwords were last changed, and whether outside contractors can access devices that could shut down your whole building or transfer load.
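Those two questions can be turned into a simple recurring audit. The sketch below is one hypothetical way to do it: the DeviceAccount record, the account names and the 90-day password age are assumptions for the example, not a policy recommendation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass
class DeviceAccount:
    """A hypothetical record of who can log in to a power chain device."""
    device: str
    user: str
    is_contractor: bool
    can_switch_load: bool  # able to shut down the device or transfer load
    password_changed: date


def audit(accounts: List[DeviceAccount], today: date,
          max_age_days: int = 90) -> List[Tuple[str, str, str]]:
    """Flag stale passwords and contractors with load-switching rights.

    The 90-day default is an illustrative assumption.
    """
    findings = []
    for a in accounts:
        if today - a.password_changed > timedelta(days=max_age_days):
            findings.append((a.device, a.user, "stale password"))
        if a.is_contractor and a.can_switch_load:
            findings.append((a.device, a.user, "contractor can switch load"))
    return findings


# Invented accounts for illustration.
accounts = [
    DeviceAccount("ups-a", "admin", False, True, date(2023, 1, 15)),
    DeviceAccount("ats-1", "hvac-vendor", True, True, date(2024, 5, 20)),
    DeviceAccount("pdu-1", "ops", False, False, date(2024, 4, 10)),
]

for finding in audit(accounts, today=date(2024, 6, 1)):
    print(finding)
```

A finding is not necessarily a fault, since a contractor may legitimately need that access, but each one should be a decision somebody has made rather than an accident nobody has noticed.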
A proven approach to power management is a Data Centre Infrastructure Management (DCIM) solution. DCIM enables IT and facility personnel to run the data centre at peak efficiency, and allows all stakeholders to improve overall operations while identifying vulnerabilities to keep the power chain safe.
With a DCIM solution deployed, full visibility of data centre operations helps eliminate the communication silos between IT and facilities by sharing real-time data via easy-to-understand charts and graphs.
Preventing catastrophic power outages and other failures can be made a lot easier if the proper solutions, processes and understanding of the real situation are in place. It pays to break the bad cycles and start afresh.
Sourced by Mark Gaydos, chief marketing officer for Nlyte Software