Wednesday 20 August 2008

Reliability under control... ready for data?

It has been quite a long time since we last commented on the monthly reliability results of the Tier-1 in this blog. This is not because we have stopped monitoring it, no! Actually, the opposite is true. We are now looking at all sorts of monitoring daily. The SAM critical alerts now generate an SMS to the Manager on Duty's mobile phone. All of this is to try to notice any possible problem as soon as it appears. It would be nice to catch them BEFORE they appear, but we are not there yet :-)

The reliabilities scored by PIC in the last three months (May to July) have been flat at 99%. This is OK, and I think it tells us that the Manager on Duty shifts are maturing. On the one hand, the tools available to the shifters are improving: a cleaner Nagios and better documentation and procedures (thanks to all of the service managers around). On the other hand, of course, the shifters are doing a great job.

The availabilities for these last three months have moved up: 92%, 96% and 97% for May, June and July, respectively. During these three months we have started implementing a regular Scheduled Downtime (SD) on the second (sometimes first) Tuesday of the month. Knowing the SD date in advance makes the planning and user notification smoother. The availability in May is somewhat lower because we had an extra downtime on the 14th: the Regional Network Provider had to upgrade some equipment, so PIC had to be disconnected for some hours.
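
As a rough illustration of why a scheduled downtime pulls the availability down while leaving the reliability almost untouched, here is a minimal Python sketch. It assumes the usual definitions (availability = uptime over total time; reliability = uptime over total time minus scheduled downtime). The function names and the hours used below are made up for illustration and are not PIC's actual numbers.

```python
def availability(up_hours, total_hours):
    """Fraction of the whole period in which the site was up."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Like availability, but scheduled downtime is not counted against the site."""
    expected_hours = total_hours - scheduled_down_hours
    if expected_hours <= 0:
        raise ValueError("scheduled downtime covers the whole period")
    return up_hours / expected_hours


# Hypothetical 31-day month (illustrative values only).
total = 31 * 24        # hours in the month
scheduled = 20.0       # announced Scheduled Downtime, in hours
unscheduled = 7.0      # unexpected outages, in hours
up = total - scheduled - unscheduled

avail = availability(up, total)
rel = reliability(up, total, scheduled)
print(f"availability: {avail:.1%}  reliability: {rel:.1%}")
```

With these invented hours the scheduled intervention costs a few points of availability, while the reliability only reflects the unexpected outages.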

I always say that the SAM monitoring has proven to be a very useful tool for sites to improve their stability. The experiments have long complained that the SAM tests sometimes do not reflect reality, especially because they run under the "fake" OPS VO. The solution is of course that the experiments implement their own VO-specific tests in the SAM framework. This work has been ongoing for several months, but it is still not completely stable.

So, it looks like we will get the first LHC data with our VO-specific SAM glasses a bit dirty... Anyhow, I am sure that the "real data pressure" will help all of this converge, so that we will have even better tools for our shifters to know what's going on.