Thursday, 24 April 2008

March reliability, or is the glass half empty or half full?

The March reliabilities were published last week, and unfortunately this time the numbers are not good for PIC: 86% reliability. It is our worst result since last June, and the first time since then that we have not reached the Tier-1 target. Anyhow, we can still try to take something positive from it: it shows how much events that look "external" to us can affect our site reliability.
The March unreliability came mainly from three events. First, there was the unexpected power cut that affected the whole building in the afternoon of March 13 (I have not seen the latest report about it, but there was talk of somebody pressing the red button without noticing). Second, there was an outage in the OPN dark fibre between Geneva and Madrid on the 25th that lasted almost three hours.
The third source of SAM unreliability was of a slightly different nature: the OPS VO disk pools filled up due to massive DTEAM test transfers. So this last one was within our domain, although it did not actually affect the service to the LHC experiments, only the monitoring. Anyhow, we have to accept SAM for better and for worse.
Last month we also had our yearly electrical shutdown during Easter week. The impact of that scheduled downtime shows up in the availability figure, which is 12 percentage points lower than the reliability, at 74%.
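For readers wondering how the two figures relate, here is a minimal sketch. It assumes the usual definitions, which are not spelled out in this post: availability is uptime divided by total time, and reliability is uptime divided by total time minus scheduled downtime. The function name is made up for illustration, and the real SAM calculation may treat other cases (for example periods with unknown status) differently.

```python
def scheduled_downtime_fraction(availability: float, reliability: float) -> float:
    """Fraction of the period spent in scheduled downtime.

    ASSUMPTION (not stated in the post):
        availability = up / total
        reliability  = up / (total - scheduled)
    which gives scheduled / total = 1 - availability / reliability.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    if not 0.0 <= availability <= reliability:
        raise ValueError("availability must be in [0, reliability]")
    return 1.0 - availability / reliability


# Rounded March figures quoted in the post.
frac = scheduled_downtime_fraction(availability=0.74, reliability=0.86)
hours = frac * 31 * 24  # March has 31 days
print(f"implied scheduled downtime: {frac:.1%} of the month, about {hours:.0f} hours")
```

Since the inputs are rounded percentages, the result is only a rough cross-check of the published numbers, not a measurement of how long the shutdown actually lasted.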
So, it was a tough month in terms of the management metrics we have to report (we will see these low points in graphs and tables many times in the coming months, that's life). Anyhow, the scheduled intervention went well and the LHC experiments were not much affected, so I really believe that our customers are still satisfied. Let's keep them that way.

