Thursday 26 March 2009

February reliability, VOs joining the party

We should report our availability and reliability scores for the past month of February... preferably before March is over!
Last month we got pretty good results for the OPS SAM tests: 98% availability (the missing 2% being basically the Scheduled Downtime on the 10th Feb) and... (drums in the background)... 100% reliability!
Yes. There we are, at the top of the ranking. Well, actually CERN, FNAL, BNL and RAL also scored maximum reliability last month, together with us.
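For anyone wondering how 98% availability and 100% reliability can live together, here is a minimal sketch. It assumes the usual convention that availability counts all the hours in the month, while reliability leaves scheduled downtime out of the denominator. The function names and the 13.5-hour downtime figure are hypothetical, chosen only so that the result lands near our February numbers; the real length of the intervention on the 10th is not stated here.

```python
HOURS_IN_FEB_2009 = 28 * 24  # February 2009 had 28 days


def availability(up_hours, total_hours):
    """Fraction of the whole period during which the site passed its tests."""
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Same as availability, but scheduled downtime is not held against the site."""
    expected_hours = total_hours - scheduled_down_hours
    if expected_hours <= 0:
        raise ValueError("the whole period was scheduled downtime")
    return up_hours / expected_hours


# Hypothetical figure: one scheduled intervention and no unscheduled failures.
scheduled_down = 13.5
up = HOURS_IN_FEB_2009 - scheduled_down

feb_availability = availability(up, HOURS_IN_FEB_2009)
feb_reliability = reliability(up, HOURS_IN_FEB_2009, scheduled_down)

print(f"availability: {feb_availability:.1%}")
print(f"reliability:  {feb_reliability:.1%}")
```

Under these assumptions the only lost hours are the announced ones, so reliability stays at its maximum while availability drops by about two points.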
Somebody might argue (the experiments, actually) that these are OPS VO tests, and so do not reflect reality. Actually, the opposite was true until not long ago: the experiments' SAM tests were still being put together, and they were showing lots of false negatives.
Anyway, the results for the VO-specific SAM tests in February were also pretty good at PIC: 98% reliability for ATLAS, 99% for CMS and 96% for LHCb. The unreliability detected by ATLAS and CMS is completely real, we must admit. It came from an incident we suffered on the 22nd of February (a Sunday, by the way) with our "Sword of Damocles", also known as the SGM-dedicated WN. Some jobs hung there, completely blocking further execution of the SGM jobs from the VOs. Luckily, Arnau reads his e-mail during weekends, and the problem was detected and solved quite fast (thanks!).
The LHCb reliability number is not really reliable, since their SAM test framework had some hiccups during the first 5 days of the month.
We recently had the WLCG Workshop in Prague, where we heard big complaints from CMS saying that the reliability of the Tier-1s is not good at all. The numbers they showed were indeed quite bad, and this is mostly because they are adding extra ingredients to their reliability calculation. In particular, they use the results of routine dummy job submissions (JobRobot), and it happened that, even if our CMS SAM tests were solid green, the JobRobot was more reddish.
I think it is good that the experiments make their site monitoring more sophisticated. However, for this to be a useful tool in improving reliability... they first need to tell us!
Now we are
