The Dallas servers had an approximately 30-minute outage on the night of 8/4/11. Rather than restate the details, here is the datacenter's explanation and planned long-term resolution. If you have any questions, please feel free to email support -at- lagniappeinternet.com. We will try to provide as much notice as possible, and to keep the impact as small as possible, for the circuit breaker replacement.
Regarding the unexpected shutdown of your equipment on 8/4/2011, here
are the results of our investigation as well as the proposed remedy.
Forgive the length of the report. I feel it's best to give a full
picture of what occurred so that you can see the context. As is often
the case, the issue was really the result of several small, seemingly
unrelated issues that would otherwise be quite harmless, until they come
together to create a "perfect storm".
Beginning on 8/1, we have had extensive work performed all week on the
cooling systems in the datacenter in order to increase cooling capacity.
With our continued growth and the relentless extreme heatwave that Texas
has been experiencing for the past month, our cooling systems have been
working non-stop. We made the decision in early July to expedite the
planned installation of additional cooling, which had originally been
scheduled for Fall 2011. We ordered the equipment and worked with our
contractor to begin work ASAP, even in their peak season.
The new equipment was commissioned into service yesterday afternoon
around 1:30pm and we immediately began reconfiguring the overall cooling
strategy in the datacenter to make efficient use of the additional
capacity.
At 3:30pm, we were alerted by ERCOT, the power grid operator for
the State of Texas, that they would begin shedding load on the Texas
grid because it had become overloaded. We "changed gears" from the
cooling project to the more immediate electric issue. At 3:43pm, we
transferred the datacenter to generator power. (That event ended at
6:16pm.)
Around 9:00pm, the new cooling system went offline. We did not process
this as a critical alarm since, without the new unit, we were still within
adequate cooling capacity for continued operations. However, we had not
fully completed our changes to the facility's cooling strategy when we
received the ERCOT notification.
At 10:08pm, the NOC processed 2 alarms: a power room temperature alarm
and an inactive switch alarm. An on-site tech was immediately
dispatched to investigate and remedy. Also, as is policy, a senior tech
was called to the datacenter and I was also called out.
We found that a sub-panel distribution breaker had been tripped. This
was not due to over-current (overloading) of the circuit, because we are
reading load at just over 60% of rated capacity. It was also not due to
any line fault. We have determined that the breaker tripped
due to a faulty thermal protection mechanism. In other words, the
circuit breaker appears to be faulty.
Circuit breakers work on the theory of thermal-magnetic protection,
whereby they trip and disconnect the circuit once a thermal or magnetic
threshold is reached. This hasn't typically been an issue since we keep the
datacenter and the power room well-regulated with regard to temperature.
With the increase in ambient temperature, the weakness in the breaker
was exposed.
Here is the 3-part remedy:
1) Work on backup cooling arrangements for the power room, which was put
off due to the ERCOT event, has been completed.
2) Our air conditioning contractor is on-site to diagnose and remedy the
cause of failure for the new capacity.
3) We will make plans to change the weak circuit breaker. If it is
necessary to institute a planned shutdown for this to happen, then you
will be notified in advance so that you will be aware, and may
participate by issuing a graceful shutdown to your equipment as needed.
Of course, my team and I very much regret that your operations have been
impacted by this issue. We take a fair amount of pride in our work and
always strive to provide the best level of service possible. Of course,
we are human, and humans are often, sadly, imperfect. I'm sorry that
our imperfection has caused an issue for you.
Please let me know if I can answer any questions, or if there is
something further that I can do to assist on this. Thanks for your
business and patience. I appreciate the continued opportunity to serve you.
Robert Porter is the founder and managing member of Lagniappe Internet LLC. Robert holds multiple certifications including Oracle Certified Professional-Java 6 SE (OCP-J6), MCSE, A+, Net+, Project+, Security+, and multiple CIW certifications. He has been in the hosting industry for more than a decade and founded Lagniappe Internet LLC as a privately owned, completely debt-free hosting company based out of New Orleans. Robert's background includes 25+ years in programming, databases, networking and systems administration.