 We have seen a quite puzzling effect in the last week. After several weeks of low CMS activity, around one week ago we happily saw how reprocessing jobs started arriving to PIC in the hundreds.
We have seen a quite puzzling effect in the last week. After several weeks of low CMS activity, around one week ago we happily saw how reprocessing jobs started arriving to PIC in the hundreds.Few days later, our happiness turned into ... what's going on?
As days passed, we saw that the cpu efficiency of CMS reconstruction jobs at PIC was consistently very low (30-40%!!)... with no apparent reason for that! There was no cpu iowait in the WNs, nor the disk servers showed contention effects.
We still do not understand the origin of this problem, but have identified two possible sources:
1) The jobs themselves. We observed that most of the jobs with lower cpu efficiency were spitting a "fast copy disabled"message at the start of their output logfile. The CMSSW experts told us that this means that
"for some reason the input file has events which are
not  ordered as the framework wants, and thus the framework will read from  the input 
out-of-order (which indeed can wreck the I/O  performance and result in low cpu/wall
times)".Interesting, indeed. We still need to confirm if the 40% cpu efficiency was caused by this out-of-order input events...
2) Due to our "default configuration", plus the CMSSW one, those jobs were writing the output files to dCache using the gridftpv1 protocol. This means a) the traffic was passing through the gridftp doors, and b) it was using the "wan" mover queues in the dCache pools which eventually reached the "max active" limit (at 100 up to now) so movers were queued. This is always bad.
So, we still do not have a clue of what was the actual problem but looks as an interesting investigation so I felt like posting it here :-)
 
No comments:
Post a Comment