Thursday, 20 May 2010

ATLAS torrent

It is true that this starts to be quite routine, but still I can not avoid to open my eyes wide when I see ATLAS moving data at almost 10 GB/s.
The plot shows the last 24h as shown in the DDM dashboard right now. Incoming traffic to PIC is shown in the 2nd plot. Almost half Gig sustained, not bad. Half to DATADISK and half to MCDISK.
Last but not least, the 3rd plot shows the traffic we are exporting to the Tier2s, also about half Gig sustained overall.
There is a nice feature to observe in the 2 last plots: the dip around last midnight. This is due to an incident we had with one of the DDN controllers. For some still unknown reason, the second controller did not take over transparently. Something to understand with the vendor support in the next days. Stay tuned.
Having into account the severity of the incident, it is nice to see that the service was only affected for few hours. Manager on Duty fire brigade took corrective action in a very efficient manner (ok Gerard!).
Now, let the vendors explain us why the super-whooper HA mechanisms are only there when you test them but not when you need them.

Welcome home lhcb pilot!

There is not much to say besides we are happy to see that the sometimes shy LHCb pilot jobs are back running at PIC since last midnight.
There was quite a while since we did not see these guys consuming CPU cycles at PIC, so they were starting with their full Fair Share budget. Interesting to see that in these conditions they were able to peak to 400 jobs quite fast and in about 6 hours they had already crossed their Fair Share red line.
I hear that ATLAS is about to launch another reprocessing campaign, so they will be asking for their Fair Share in a short time... I hope to see the LHCb load stabilizing at their share at some point, otherwise I will start suspecting they have some problem with us :-)

Thursday, 6 May 2010

DDN nightmare and muscle

We got a notable fright last weekend. Barça match against Inter in the Champions league semifinal was about to start when suddenly... crash. One of our flashy dCache pools serving a DDN fatty partition (125 TB, almost full of ATLAS data) got bananas.
The ghost of "data loss" was there, coming to us. Luckily, after a somewhat "hero mode" weekend for our MoD and experts (thanks Marc and Gerard!) following the indications of Sun-Support the problem could be solved with zero data loss (uf!). The recipe looks quite innocent from the distance: upgrade the OS to the last version, Solaris 10u8.
We find quite often that a solution comes with a new problem. This time was not an exception. The updated OS rapidly solved the unmountable ZFS partition problem, but it completely screwed up the network of the server.
We have not been able to solve this second problem yet, and this is why the 125TB of data of the upgraded server (dc012) were reconfigured to be served by its "twin" server (dc004). This is a nice configuration that the DDN SAN deployment enables. So this is I think the first time we try this feature in production, and there we have the picture: dc004 serving 250 TB of ATLAS data with a peak up to 600 MB/s... and no problem.
Looks like, besides OS version issues, the DDN hardware is delivering.

Pilot has decided to kill looping job... strikes back!

Some days ago we noticed in our dashboards a somewhat curious pattern. Here we go again: yesterday and today we can see the same behavior. A bunch of jobs in the batch showing near zero cpu efficiency (red in the upper plot). Looking for the smoking gun... we easily find a correlation with "atpilot" jobs (blue in the bottom plots). These atpilot jobs are nothing more than ATLAS user analysis jobs submitted through the Panda framework.
For various reasons, which we are still in the process of elucidating, these atpilot jobs tend to get "stack" reading input files, and they stay idle in the WN slot unntil the Panda pilot wrapper kills them. Luckily, it implements a 12 hours timeout for jobs detected as stalled.
So, this is the picture of today's 12h x 200 cores going to the bin. Hope we will find the ultimate reason why these atpilots are so reluctant to swallow our data... eventually.