Monday, 8 June 2009

Tier-1 reliability in March and April

Now that everybody is talking about hot topics such as STEP09, let us take a break to review some rather outdated material. We will certainly post a brief summary of STEP09, but reviewing the monthly reliability scores and, most importantly, highlighting the causes of failures as homework for improving the service has always proven to be a very useful exercise. Let's go then :-)
In March PIC scored 99% in reliability and 98% in availability. Pretty good, and above target. The missing reliability was caused by jobs from a MAGIC user that filled up the local disk of the WNs, turning them into black holes. Interesting lessons to learn here: protect ourselves from users filling up the disk (they will always do it, even if not deliberately) and minimise the impact that a single user can have on the rest of the PIC user community.
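As a rough illustration of the first lesson, a worker node could check its own local disk and ask to be drained before it starts failing every job it receives. This is only a minimal sketch, not what runs at PIC: the function names, the 5% threshold and the idea of a periodic health check are all assumptions made for the example.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total


def should_drain(free_frac, min_free=0.05):
    """Hypothetical policy: stop accepting jobs below `min_free` free space.

    The 5% default is an arbitrary value chosen for illustration only.
    """
    return free_frac < min_free


if __name__ == "__main__":
    frac = free_fraction("/")
    if should_drain(frac):
        print(f"Only {frac:.1%} free: node should stop accepting jobs")
    else:
        print(f"{frac:.1%} free: node is healthy")
```

A check like this only limits the damage once a disk is nearly full. Limiting how much one user can write in the first place would need per-job scratch limits or quotas, which are outside the scope of this sketch.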
April was not such a good month. Our reliability was above target (98%) but our availability was not (92%). The cause of the latter was the yearly electrical maintenance of our building. We scheduled a two-day downtime, and this brought us below the availability target for the Tier-1s. During this scheduled downtime (SD) we tested a reduced downtime for the LHCb-DIRAC service. We managed to limit the interruption of this service to only about 8 hours, so next time we should apply the same approach to the other critical Tier-1 services.
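To see why a two-day scheduled downtime hurts availability but not reliability, here is a small back-of-the-envelope calculation. It assumes simplified definitions (availability counts all hours in the month, reliability excludes scheduled downtime from the denominator) and a hypothetical amount of unscheduled downtime chosen to match the 98% reliability figure. It is not the actual measurement data.

```python
def availability(up_hours, total_hours):
    """Simplified: fraction of the whole month the site was up."""
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Simplified: same as availability, but scheduled downtime is not counted against the site."""
    return up_hours / (total_hours - scheduled_down_hours)


total_hours = 30 * 24          # April has 30 days
scheduled_down_hours = 2 * 24  # the two-day electrical maintenance

# Assumption: unscheduled downtime sized so that reliability comes out at 98%.
unscheduled_down_hours = 0.02 * (total_hours - scheduled_down_hours)

up_hours = total_hours - scheduled_down_hours - unscheduled_down_hours

rel = reliability(up_hours, total_hours, scheduled_down_hours)
avail = availability(up_hours, total_hours)

print(f"reliability  = {rel:.1%}")
print(f"availability = {avail:.1%}")
```

With these inputs the availability comes out at about 91.5%, in the same range as the reported 92%; the small gap is expected given the rounded inputs and simplified definitions. The scheduled downtime alone caps availability at 672/720, roughly 93.3%, which is why shortening the downtime of critical services is the only way to do better in a maintenance month.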
The missing reliability in April was caused by a pretty bizarre problem in the SRM server. For some days we suffered huge overloads. The cause was finally traced to the configuration of the dCache PostgreSQL DB, in particular to the scheduling of the background "vacuum" procedures. Using the "false" flag as recommended in the documentation was the origin of all our problems. After this incident, we have learnt quite a lot about Postgres vacuum configuration and are pretty sure we are safe now. We have also learnt that trusting the documentation is not always a wise thing to do :-)
