Yesterday evening, around 16:30 (luckily we were still around), the lights went off. We still do not have the complete picture, but it seems that somebody mistakenly pushed the red "switch off the building" button.
The cooling of the machine room stopped, but the machines at PIC kept working, powered by the UPS. After 10 minutes in this situation, we decided to start stopping services. Not much later, the (yet to be understood) glitch arrived: all of the racks lost power for less than 1 second. After this, to prevent servers from restarting on their own after a dirty stop, we simply switched off all racks at the main electrical board.
Today, at 8:00, we started switching PIC back on. The good news is that it looks as if we did not have many hardware incidents after the dirty stop. The lesson learnt (the hard way) is that we are still too far away from a controlled and efficient complete shutdown. We will have to repeat this on Monday, due to the yearly electrical maintenance. So overall, it will be a good week to debug all these procedures.