Friday, 28 August 2009
Scary, isn't it? PIC's availability has been solid red for the last 24 hours but, sadly, there was not much we could do.
We have always said, and still strongly believe, that SAM tests are a very good thing. I firmly believe they have been one of the key ingredients of WLCG's success. Success here meaning the evolution from the "the Grid does not work" situation, with the 60% job success rates we had a few years ago, to the routine >97% availabilities we are used to seeing these days. But yes, not even SAM tests are perfect. There has always been a dark corner inside them: the much-questioned "SE test inside the CE", or lcg-rm test. And inside this controversial test there is another, smaller corner which is still a bit darker: the file-replication-to-CERN test. This is the one that started flickering at PIC on Tuesday, and it has now been failing consistently for more than 24 hours. The test tries to copy a file sitting at PIC to one specific DPM server at CERN. That one precise connection was timing out for us, while transfers to any other site, and even to any other CERN storage server, were working. This was strange enough that we asked our CERN colleagues for help. Today they came back with good news: problem found, a problematic router.
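To give a flavour of how an endpoint-specific failure like this shows up, here is a minimal probe sketch that checks several storage endpoints with the same timeout. The hostnames and port below are purely hypothetical placeholders, and a plain TCP connect is of course much cruder than the actual lcg-rm replication test — it is just a first-line way to see "one server times out, the rest answer":

```python
import socket

# Hypothetical endpoints: the one CERN DPM server that was timing out for us,
# plus a couple of others that worked fine. Hostnames and the port number are
# illustrative placeholders, not the real test configuration.
ENDPOINTS = [
    ("dpm-server.example.cern.ch", 8446),
    ("other-se.example.cern.ch", 8446),
    ("se.example.pic.es", 8446),
]

def probe(host, port, timeout=5.0):
    """Return True if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, DNS failures
        return False

for host, port in ENDPOINTS:
    status = "ok" if probe(host, port) else "TIMEOUT/unreachable"
    print(f"{host}:{port} -> {status}")
```

If exactly one endpoint fails while its neighbours answer, the storage service itself is probably fine — which is what pointed the finger at the network path here.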
Got a really puzzling error? Bet on the network...
Friday, 21 August 2009
Most people these days are either just back from holidays (me), still away (lucky them), or neither of those (also lucky, since they will leave later...). Anyhow, as we all know, life at a Tier-1 is 24x7, so it doesn't matter that we are in the middle of August and it is nearly 40 degrees Celsius out there: it is time to report on service performance over the past month.
We just got the WLCG reports for July, and the results for PIC are pretty positive: 99% availability and reliability. The little 1% that keeps us away from our beloved 100% happened on 29 July, around noon. For four hours, all of the Computing Elements at PIC were failing the SAM job submission tests, so the Tier-1 service was definitely affected during that period. The culprit turned out to be a pretty mysterious one: the switch connecting the servers hosting Virtual Machines to the PIC LAN (VMs are always a bit mysterious, aren't they?). Actually, the root cause was never found; the problem simply disappeared when that switch was replaced by a new one (different brand, but no names here to avoid anti-propaganda :)
Regarding the availability of PIC as seen from the experiment-specific monitoring, we also got very good results for ATLAS and LHCb, close to 100%. However, the result for CMS was not that good: a mere 90%. This funny asymmetry was due to a bug we introduced by mistake in the Torque ACL queue configuration, which blocked CMS submissions to our short queue for 3 days (6-9 July). Somebody could ask why we did not notice this for 3 days... We should make it a priority to deploy WLCG-Nagios in production; it will surely help reduce these unavailability periods.
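For the curious, the figures above are easy to reproduce with back-of-the-envelope arithmetic: July has 31 × 24 = 744 hours. The helper below is just a sketch of that arithmetic, not how WLCG computes its official availability numbers:

```python
# Back-of-the-envelope availability arithmetic for a 31-day month,
# using the incident durations quoted in the text.

HOURS_IN_JULY = 31 * 24  # 744

def availability(downtime_hours, total_hours=HOURS_IN_JULY):
    """Fraction of the period the service was up, as a percentage."""
    return 100.0 * (1 - downtime_hours / total_hours)

# The 4-hour CE incident on 29 July:
print(round(availability(4), 1))       # -> 99.5, consistent with the 99% report

# The 3-day CMS queue-blocking bug (6-9 July):
print(round(availability(3 * 24), 1))  # -> 90.3, matching CMS's "mere 90%"
```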
So, besides this unfortunate CMS-blocking bug, we can say July was a pretty good month for the Tier-1 in terms of availability. Thanks to this, and also to the job-submission hyperactivity seen from ATLAS and LHCb, we delivered a record amount of CPU cycles during July: around 70,000 ksi2k·days, which is very close to keeping 100% of our resources busy for the whole month.
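A quick sanity check of that "very close to 100% busy" claim: divide the CPU delivered by capacity times days. The installed-capacity figure below is a made-up round number for illustration, not PIC's actual pledge:

```python
# Rough utilisation check: delivered CPU vs. what a fully busy farm would give.
DAYS_IN_JULY = 31
DELIVERED = 70_000  # ksi2k·days of CPU delivered (figure from the report)
CAPACITY = 2_300    # ksi2k installed capacity -- assumed, for illustration only

utilisation = DELIVERED / (CAPACITY * DAYS_IN_JULY)
print(f"Average farm utilisation: {utilisation:.0%}")  # -> 98%
```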
Things look quiet now, with many people still away. PIC is up and running, cooling is OK (even with the heat wave out there)... but watch out for the VM-networking ghosts: they could come back at any time.