Friday, 27 February 2009

January availability, hit by CMS and cooling

So, let's try and have a look at the Tier-1 availability and reliability for the past month of January... at least before February is over!
The reliability result was 98%, just above the target for the Tier-1s (which is now 97%). This looks quite OK, but it is worth noting that our colleagues at the other Tier-1s are doing a very good job at their sites, so with 98% we got 8th position in the ranking, tied with TRIUMF. Four centres got 100%, and three more got 99%. Quite impressive, isn't it?
Our 2% unreliability was mostly caused by the incident in the SRM service we had on January the 21st. On that day we saw our dCache SRM server going through the roof, with loads up to 300. The cause was traced to a bug in the CMS job software, which made the jobs issue recursive srmls queries to the SRM. Once again, we saw how easy it is to suffer a DoS from an innocent user and how little we can do to protect ourselves against it.
For availability, however, we scored pretty low in January: 92%, well below the 97% target. This was caused by the cooling incident we suffered on Saturday 24th January. After shutting down PIC on Saturday at noon, we did not bring it back up until Monday morning. Here we see how fast the availability goes down on weekends :-)
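To get a feeling for how much a single weekend costs, here is a rough back-of-the-envelope sketch. It assumes that "Monday morning" means 08:00 (the exact restart time is not recorded here) and that availability is simply uptime divided by the total hours in the month; the function and variable names are made up for illustration.

```python
# Rough estimate of the availability lost to one weekend outage.
# Assumption: shutdown on Saturday at 12:00, restart on Monday at 08:00.

HOURS_IN_JANUARY = 31 * 24  # 744 hours


def availability(downtime_hours, total_hours=HOURS_IN_JANUARY):
    """Fraction of the month during which the site was up."""
    return 1.0 - downtime_hours / total_hours


# 12 h left on Saturday + 24 h on Sunday + 8 h on Monday (assumed)
weekend_outage_hours = 12 + 24 + 8

print(f"Outage length: {weekend_outage_hours} h")
print(f"Availability from this outage alone: {availability(weekend_outage_hours):.1%}")
# prints "Availability from this outage alone: 94.1%"
```

Under these assumptions the weekend shutdown alone would account for roughly six of the eight points we are missing, which fits with the rest coming from the SRM incident and other smaller interruptions.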
With the LHC start around the corner, we should definitely be operating in full 24x7 mode by now. As we see, we are not quite there yet, so now it is time to think about the last step in the 24x7 journey (one could call it MoD-Phase3), which implements the required coverage for the critical services. First thing to do: ensure we have a proper definition of criticality.
