Thursday, 6 May 2010

Pilot has decided to kill looping job... strikes back!

Some days ago we noticed in our dashboards a somewhat curious pattern. Here we go again: yesterday and today we can see the same behavior. A bunch of jobs in the batch showing near zero cpu efficiency (red in the upper plot). Looking for the smoking gun... we easily find a correlation with "atpilot" jobs (blue in the bottom plots). These atpilot jobs are nothing more than ATLAS user analysis jobs submitted through the Panda framework.
For various reasons, which we are still in the process of elucidating, these atpilot jobs tend to get "stack" reading input files, and they stay idle in the WN slot unntil the Panda pilot wrapper kills them. Luckily, it implements a 12 hours timeout for jobs detected as stalled.
So, this is the picture of today's 12h x 200 cores going to the bin. Hope we will find the ultimate reason why these atpilots are so reluctant to swallow our data... eventually.

No comments: