Thursday 26 March 2009

February reliability, VOs joining the party

We should try to report on our availability and reliability scores for the month of February... better to do it before March is over!
Last month we got pretty good results for the OPS SAM tests: 98% availability (the missing 2% being basically the Scheduled Downtime on the 10th Feb) and... (drums in the background)... 100% reliability!
Yes. There we are, at the top of the ranking. Well, actually CERN, FNAL, BNL and RAL also scored maximum reliability last month, together with us.
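For the curious, the arithmetic behind those two numbers is simple: reliability is basically availability with the scheduled downtime forgiven. A minimal sketch of the WLCG-style calculation (with hypothetical round numbers, not the actual test-by-test SAM computation) would be:

    # Rough WLCG-style availability/reliability arithmetic (illustration only;
    # the real numbers come from the individual SAM test results).
    total_hours      = 28 * 24                 # February 2009
    scheduled_down   = 0.02 * total_hours      # the downtime of 10th Feb, about 2% of the month
    unscheduled_down = 0.0                     # no unscheduled failures seen by the OPS tests

    up = total_hours - scheduled_down - unscheduled_down

    availability = up / total_hours                     # ~0.98
    reliability  = up / (total_hours - scheduled_down)  # 1.00: scheduled downtime is forgiven

    print("availability = %.0f%%, reliability = %.0f%%" % (100 * availability, 100 * reliability))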
Somebody might argue (the experiments, actually) that these are OPS VO tests, so they do not reflect reality. Until not long ago, though, the opposite was true: the experiments' SAM tests were still being put together, and they showed lots of false negatives.
Anyway, the results for the VO-specific SAM tests in February were also pretty good at PIC: 98% reliability for ATLAS, 99% for CMS and 96% for LHCb. The unreliability detected for ATLAS and CMS is completely real, we must admit. It came from an incident we suffered on 22nd February (a Sunday, by the way) with our "sword of Damocles", also known as the SGM-dedicated WN. Some jobs hung there, completely blocking further execution of the SGM jobs from the VOs. Luckily, Arnau reads his e-mail during weekends, and the problem was detected and solved quite fast (thanks!).
The LHCb reliability number is not really reliable, since their SAM test framework had some hiccups during the first 5 days of the month.
We recently had the WLCG Workshop in Prague, where we heard big complaints from CMS saying that the reliability of the Tier-1s is not good at all. The numbers they showed were indeed quite bad, and this is mostly because they add extra ingredients to their reliability calculation. In particular, they use the results of routine dummy job submissions (JobRobot), and it happened that, even if our CMS SAM tests were solid green, the JobRobot was rather more reddish.
I think it is good that the experiments make their site monitoring more sophisticated. However, for this to be a useful tool for improving reliability... they first need to tell us!
Now we are

Monday 16 March 2009

CMS cpu consumer, back to business

It looks like someone in CMS was reading this blog, since a few hours after the last post, on Saturday evening, CMS jobs started arriving at PIC. We have seen a constant load of about 300 CMS jobs since then. Not bad.
Apparently these are the so-called "backfill" jobs. All the Tier-1s but us (and Taiwan, down these days due to a serious fire incident) started running these backfill jobs in early March. After a bit of asking around, we found out that PIC was not getting its share of the workload because the old 32-bit batch queue names were hardcoded somewhere in the CMS system (we deprecated the 32-bit queues more than one month ago!), plus they had a bug in the setup script that got the available TMPDIR space wrong.
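Just to illustrate the kind of check involved (a hypothetical sketch, not the actual CMS setup script), measuring the space really available under TMPDIR is a couple of lines in Python:

    # Hypothetical example: report the free space under TMPDIR, i.e. the quantity
    # that the buggy setup script was supposed to compute correctly.
    import os
    import shutil

    tmpdir = os.environ.get("TMPDIR", "/tmp")
    free_gb = shutil.disk_usage(tmpdir).free / 1e9
    print("TMPDIR=%s free=%.1f GB" % (tmpdir, free_gb))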
Good that we found these problems and that they were promptly solved. Now CMS is back to the CPU-burning business at PIC. ATLAS is still debugging the memory-exploding problem that stopped their jobs being sent to PIC about one week ago. It looks like we are close to a solution (missing packages), and we will soon see both experiments competing again for the CPU cycles at PIC.

Saturday 14 March 2009

CPU delivery efficiency

We are collecting the accounting data for February 2009 these days. It looks like we reached a record figure for CPU delivery efficiency last month. The three LHC experiments used up to 80% of the total CPU days available at PIC: almost 37,000 ksi2k·days. This was largely thanks to ATLAS, which consumed around 80% of those CPU days. LHCb used just over 15% and CMS a mere 5%.
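In round numbers, and as a back-of-the-envelope sketch only (the real figures come from the accounting system; the percentages above are approximate), the split of those CPU days looks like this:

    # Back-of-the-envelope split of February's ~37,000 ksi2k·days among the LHC VOs,
    # using the approximate shares quoted above (not accounting data).
    used_cpu_days = 37000.0  # ksi2k·days used by the three experiments
    shares = {"ATLAS": 0.80, "LHCb": 0.15, "CMS": 0.05}

    for vo, share in shares.items():
        print("%-5s ~ %6.0f ksi2k·days" % (vo, share * used_cpu_days))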
So, well done ATLAS. It is true that most of that load was not "Tier-1 type jobs", but rather a contribution to the experiment's Monte Carlo production. Anyway, it is better that Tier-1 resources are used for simulation than that they stay idle consuming electricity, heating the computing room and watching their 3-year lifetime pass by (at a rate of about 6 kEur/month).
From our point of view, the Panda system used by ATLAS, which implements the now so-loved pull model for computing (or pilot jobs), is definitely doing a good job at consuming all available CPU resources.
Unfortunately, not everything is so nice. This last week we have seen the CPU utilisation at PIC decrease quite a lot. The ATLAS Panda system was not sending jobs to PIC, and we discovered this was due to a problem with the Athena software running on a 64-bit OS. Suddenly the production jobs running on PIC's SL4/64-bit WNs exploded in memory usage and were eventually killed by the system. The experts are now working to understand and fix this; let's hope they find a patch soon.
Let's see when CMS and LHCb implement CPU-consuming systems as efficient as ATLAS's, so that we can really benefit from being a multi-experiment Tier-1.
Meanwhile, at PIC, idle CPUs are transforming electricity into heat, waiting for ATLAS to cure their 64-bit indigestion.