Friday 10 October 2008

September Availability: LHCb, the SRM-killer

The reliability of the Tier-1 at PIC last month was right on target: 95%. Unfortunately, once we add in our now-regular monthly scheduled intervention, the availability drops slightly below target: 93%.
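For readers wondering why the two numbers differ, the arithmetic is simple: availability counts all downtime, while reliability excludes the scheduled interventions. A minimal sketch in Python, assuming the standard WLCG-style definitions, and with downtime hours picked purely for illustration (not PIC's real accounting):

# Assumed WLCG-style definitions; the downtime figures are illustrative only.
def availability(total, unscheduled_down, scheduled_down):
    return (total - unscheduled_down - scheduled_down) / total

def reliability(total, unscheduled_down, scheduled_down):
    return (total - unscheduled_down - scheduled_down) / (total - scheduled_down)

total = 30 * 24        # hours in September
scheduled = 15         # monthly scheduled intervention (made-up number)
unscheduled = 35       # unscheduled downtime (made-up number)

print(f"availability = {availability(total, unscheduled, scheduled):.1%}")  # ~93%
print(f"reliability  = {reliability(total, unscheduled, scheduled):.1%}")   # ~95%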
The unavailability at PIC in September was mainly due to two issues. The first was an overload of the SRM server, caused by LHCb production jobs submitting about 13,000 srm-get requests in one shot, which affected the SRM service on three days during the month. The incident exposed what is, from my point of view, one of the biggest problems of dCache: there is no way to run separate SRM servers, each dedicated to one experiment. We are forced to share a single SRM server, so when LHCb breaks it, ATLAS and CMS suffer the consequences. This is clearly bad.
One can of course debate whether issuing 13,000 srm-gets amounts to a DoS or is a reasonable activity from our users. I really do think that, as a Tier-1, we should withstand this load with no problems. As I write this, the storage team at PIC and the LHCb data management experts are in contact to work out exactly what went wrong and how to fix it.
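To give an idea of the kind of client-side mitigation we are talking about: instead of firing everything at the shared SRM front-end in one shot, the bulk request can be throttled into batches. The sketch below is only an illustration of that idea; submit_srm_get is a hypothetical placeholder (not a real dCache or LHCb client call), the SURLs are made up, and the batch size and pause are arbitrary numbers.

import time

def submit_srm_get(surl):
    # Hypothetical placeholder for whatever actually issues a single
    # srm-get / bring-online request; not a real SRM client API.
    print(f"requesting {surl}")

def throttled_bulk_get(surls, batch_size=200, pause_seconds=30):
    """Submit the requests in small batches with a pause in between,
    instead of hitting the shared SRM front-end with thousands at once."""
    for i in range(0, len(surls), batch_size):
        for surl in surls[i:i + batch_size]:
            submit_srm_get(surl)
        time.sleep(pause_seconds)   # let the SRM server catch up

# e.g. the 13,000 requests would become 65 batches of 200
surls = [f"srm://srm.example.org/lhcb/data/file{i}.dst" for i in range(13000)]
# throttled_bulk_get(surls)

Of course, we probably also need some protection on the server side, since we cannot rely on every client being this polite.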
Following the saying "better late than never", ATLAS started seriously testing the pre-stage procedure for the reprocessing at the Tier-1s just a few days after LHCb. This is good news: it is the only way for us to learn how to configure our system so that it can deliver the required performance. Sure, our SRM will die several times during this testing, but I hope it will converge to a reliable configuration... best before spring 2009.
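For completeness, the pre-stage exercise itself boils down to a submit-then-poll pattern: ask the SRM to recall the files from tape to disk, then poll until everything is online before launching the reprocessing jobs. A rough sketch, where bring_online and is_online merely stand in for the real SRM bring-online request and its status query, and are simulated here so the example runs:

import random
import time

def bring_online(surls):
    # Stand-in for the real bulk bring-online request; here it does nothing.
    print(f"asked the SRM to stage {len(surls)} files from tape")

def is_online(surl):
    # Stand-in for the real status query; tape recalls "finish" at random
    # so this toy example terminates.
    return random.random() < 0.3

def prestage(surls, poll_interval=1):
    """Submit a bulk pre-stage request, then poll until all files are on disk."""
    bring_online(surls)
    pending = set(surls)
    while pending:
        time.sleep(poll_interval)
        pending = {s for s in pending if not is_online(s)}
        print(f"{len(pending)} files still waiting on tape")

prestage([f"srm://srm.example.org/atlas/raw/file{i}.root" for i in range(20)])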
The second contribution to PIC's unreliability last month came from the network. On 23 September the Spanish NREN suffered a major outage due to electrical and cooling problems in the TELVENT data centre that hosts the NREN equipment. This resulted in a complete network outage at PIC of about 10 hours. Once again, electrical and cooling issues sit at the very top of the list of LCG service risks. In the end, it looks like one of the trickiest bits of building such a complex computing infrastructure is simply plugging it in and cooling it down.
