Tuesday 19 February 2008

January Availability

Last week we received the availability data for the LCG Tiers for
January 2008. This time, at PIC we were just on top of the target for Tier-0/Tier-1: 93%.
The positive read of it is that we are still one of the only three Tiers that reached the reliability target for every month since last July. The other two sites are CERN and TRIUMF.
The negative read is that 93% looks like a too low figure, when we were getting used to score over 95% in the last quarter of 2007.
The 7% unreliability of PIC in January 2008 is fully due to one single incidence that we had in the Storage system the weekend of the 26-27 January. The Monday before (21/01/2008) had been a black-monday in the global markets - European and Asian exchanges plummeted 4 to 7% - so, we still do not discard that our failure might be correlated to that fact.
However, Gerard's investigations point to the fact that the most probable cause of our incident was a less-glamourous problem in the system disk of the dCache core server. The funny symptom is that all the GET transfers from PIC were working fine, but the PUT transfers to PIC were failing. The problem could only be solved by manual intervention of the MoD, who came on Sunday to "press the button".
So, the "moraleja" as we call it in Spanish, could read: a) we need to implement remote reboot at least in the critical servers, b) a little sensor that checks that the system disk is alive would be very useful.
Now, back to work and let's see if next month we reach the super-cool super-green 100% monthly reliability that up to now only CERN is able to reach with apparently no much effort.

1 comment:

Unknown said...

Good work, Gonzalo. My congratulations for all the PIC personnel. I also hope the green 100% but not only for February