Friday, 10 October 2008

September Availability: LHCb, the SRM-killer

The reliability of the Tier-1 at PIC last month was right on target: 95%. Unfortunately, once our now-regular monthly scheduled intervention is factored in, the availability dropped slightly below target, to 93%.
The unavailability at PIC in September was mainly due to two issues. The first was an overload of the SRM server, caused by the submission of about 13,000 srm-get requests in one shot from LHCb production jobs; this affected the SRM service on three days in September. The incident made clear what is, from my point of view, one of the biggest problems of dCache: there is no way to run separate SRM servers, each dedicated to one experiment. We are forced to share a single SRM server, so when LHCb breaks it, ATLAS and CMS suffer the consequences. This is clearly bad.
Then one can discuss whether issuing 13,000 srm-gets should be considered a DoS, or reasonable activity from our users. I really do think that, as a Tier-1, we should withstand this load with no problems. As we post this, the storage team at PIC and the LHCb data management experts are in contact to work out exactly what went wrong and how to fix it.
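One obvious client-side mitigation, assuming the experiment's production framework can group its requests, would be to submit bulk staging in limited, paced batches rather than firing 13,000 requests at the SRM front-end in one burst. A minimal sketch of the idea (the `fake_srm_bring_online` stub and all parameter names here are hypothetical illustrations, not a real dCache or LHCb API):

```python
import time

def submit_in_batches(surls, submit_fn, batch_size=200, delay_s=1.0):
    """Submit staging requests in limited batches instead of one
    massive burst, to avoid overloading a shared SRM front-end."""
    results = []
    for i in range(0, len(surls), batch_size):
        batch = surls[i:i + batch_size]
        # One bulk call per batch, e.g. an srm-bring-online request
        results.extend(submit_fn(batch))
        if i + batch_size < len(surls):
            time.sleep(delay_s)  # pacing between batches
    return results

# Stub standing in for a real SRM client call (hypothetical):
def fake_srm_bring_online(batch):
    return [(surl, "QUEUED") for surl in batch]

statuses = submit_in_batches(
    ["srm://example.org/file%04d" % n for n in range(1000)],
    fake_srm_bring_online,
    batch_size=200,
    delay_s=0.0,  # no pacing needed for the stub
)
```

The same throttling can of course also be enforced server-side, but as long as the SRM server is shared between experiments, pacing at the client keeps one experiment's burst from taking the others down with it.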
Following the saying "better late than never", ATLAS started seriously testing the pre-stage procedure for reprocessing at the Tier-1s just a few days after LHCb. This is good news: it is the only way for us to learn how to configure our system so that it can deliver the required performance. Our SRM will surely die several times during this testing, but I hope it will converge to a reliable configuration... best before spring 2009.
The second contribution to PIC's unreliability last month came from the network. On 23rd September the Spanish NREN suffered a major outage due to electrical and cooling problems in the TELVENT datacenter hosting the NREN equipment. This resulted in a complete network outage at PIC of about 10 hours. Again, we see electrical and cooling issues at the very top of the LCG service risk list. In the end, it looks like one of the trickiest bits of building such a complex computing infrastructure is just plugging it in and cooling it down.

LHC up&down

So, it's been quite a long time since we last posted to the blog. This is not because we went away, no. We have just been a bit busy here in September (and not only because of the LHC). Anyway, we are back in the blogosphere, and will keep reporting regularly about the LHC activities at PIC.
It is quite funny that the most silent month on our blog was probably the most visible one for the LHC around the world. Well, we can always say "we did not talk about the LHC here since you could read about it in any newspaper" :-)
So, the two big LHC events of September, as all of you know, are that the LHC circulated its first beams on 10 September and then suffered a major fault on the 19th. You can read the details of both events anywhere on the web. I will just mention that those were quite special days in our community: the big excitement on the 10th, and then the "bucket of cold water" a few days later, could be felt everywhere. Even the daily operations meeting was less crowded than usual, since in those first days it was difficult not to feel a bit of "what do we do now?"
I think it is now quite clear to everyone that life goes on. We at the LHC Computing Grid continue operations exactly as we have been doing for months. True, we are not receiving p-p collision data, but the data flow has not stopped: both cosmics data taking and Monte Carlo generation continue.
We have said many times in recent years that the LHC is an extremely complex machine and that it might take a long time to put it into operation. Well, now we can see this complexity in front of us. There it is. Life goes on.