Friday, 21 August 2009
July: good availability comes with CPU delivery record
Most people these days are either just back from holidays (me), still out (lucky them), or neither (also lucky, since they will leave later...). Anyhow, as we all know, life at the Tier-1 is 24x7, so it doesn't matter that we are in the middle of August and it is nearly 40 degrees Celsius out there: it is time to report on service performance in the past month.
We just got the WLCG reports for July and the results for PIC are pretty positive: 99% availability and reliability. The little 1% that keeps us from our beloved 100% happened on 29 July, around noon. For four hours all of the Computing Elements at PIC were failing the SAM Job Submission tests, so the Tier-1 service was definitely affected during that period. The source of the problem was found to be a pretty mysterious one: the switch connecting the servers hosting Virtual Machines to the PIC LAN (VMs are always a bit mysterious, aren't they?). Actually, the root cause was never found; the problem just disappeared when that switch was replaced by a new one (different brand, no names here to avoid anti-propaganda :)
Regarding the availability of PIC as seen from the experiment-specific monitoring, we also got very good results for ATLAS and LHCb, close to 100%. However, the result for CMS was not that good: a mere 90%. This funny asymmetry was due to a bug we introduced by mistake in the Torque queue ACL configuration, which blocked CMS submissions to our short queue for 3 days (6-9 July). Somebody could ask why we did not notice this for 3 days... We should give priority to deploying the WLCG-Nagios in production. It will surely help reduce these periods of unavailability.
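As a rough sanity check on those percentages, here is a back-of-the-envelope sketch in Python. It assumes availability is simply the fraction of the month not lost to an outage, which is a simplification of my own: the real WLCG and experiment figures are derived from test results and may treat downtime differently. The function name and constants are just illustrative.

```python
# Simplified availability: share of the month not lost to an outage.
# Assumption: July has 31 days and each outage counts as fully unavailable.
HOURS_IN_JULY = 31 * 24  # 744 hours


def availability(outage_hours, period_hours=HOURS_IN_JULY):
    """Return availability as a percentage of the period."""
    if not 0 <= outage_hours <= period_hours:
        raise ValueError("outage must fit inside the period")
    return 100.0 * (period_hours - outage_hours) / period_hours


site = availability(4)      # CE failures on 29 July, about four hours
cms = availability(3 * 24)  # CMS short queue blocked for 3 days

print(f"site-wide: {site:.1f}%")
print(f"CMS:       {cms:.1f}%")
```

This gives roughly 99.5% for the four-hour incident and 90.3% for the three-day CMS blockage, which is in line with the 99% and 90% reported.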
So, besides this unfortunate CMS-blocking bug, we can say July was a pretty good month for the Tier-1 in terms of availability. Thanks to this, and also to the job submission hyperactivity seen from ATLAS and LHCb, we delivered a record amount of CPU cycles during July: around 70,000 ksi2k·days, which is very close to keeping 100% of our resources busy for the whole month.
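To put that number in perspective, the delivered total can be turned into an average amount of CPU power kept busy throughout the month. This is only a sketch: the 70,000 figure is the approximate one quoted above, and the variable and function names are made up for illustration. The installed capacity is not computed here since it is not given in this post.

```python
DAYS_IN_JULY = 31
delivered_ksi2k_days = 70_000  # approximate figure quoted in the post


def average_busy_capacity(delivered, days):
    """Average CPU power (ksi2k) kept busy over the period."""
    return delivered / days


avg = average_busy_capacity(delivered_ksi2k_days, DAYS_IN_JULY)
print(f"average busy capacity: {avg:.0f} ksi2k")
```

That works out to about 2,258 ksi2k busy on average, day and night, for the whole of July.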
Now things look quiet, with many people still away. PIC is up and running, cooling is OK (even with the heat wave out there)... but watch out for the VM-networking ghosts: they could come at any time.