A few days later, our happiness turned into... what's going on?
As days passed, we saw that the CPU efficiency of CMS reconstruction jobs at PIC was consistently very low (30-40%!!)... with no apparent reason for that! There was no CPU iowait in the WNs, nor did the disk servers show contention effects.
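For reference, the CPU efficiency we are talking about is simply the CPU time a job consumed divided by its wall-clock time. A minimal sketch follows; the function name and the numbers are made up for illustration and are not taken from our accounting tools:

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float) -> float:
    """Return CPU time divided by wall-clock time for a single job."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


# Hypothetical job: 21 minutes of CPU time over one hour of wall-clock time.
example = cpu_efficiency(cpu_seconds=1260.0, wall_seconds=3600.0)
print(f"{example:.0%}")  # prints 35%
```

A job at 35% spends roughly two thirds of its wall-clock time not computing, which is why the absence of iowait on the WNs was so puzzling.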
We still do not understand the origin of this problem, but have identified two possible sources:
1) The jobs themselves. We observed that most of the jobs with lower CPU efficiency were spitting out a "fast copy disabled" message at the start of their output logfile. The CMSSW experts told us that this means that
"for some reason the input file has events which are
Interesting, indeed. We still need to confirm whether the 40% CPU efficiency was caused by these out-of-order input events...
2) Due to our "default configuration", plus the CMSSW one, those jobs were writing their output files to dCache using the gridftpv1 protocol. This means that a) the traffic was passing through the gridftp doors, and b) it was using the "wan" mover queues in the dCache pools, which eventually reached the "max active" limit (set to 100 so far), so movers were queued. This is always bad.
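To make the queueing effect concrete, here is a toy model of a mover queue with a "max active" limit. It is only an illustration of the arithmetic, not dCache code; the function name and the request counts are hypothetical, and only the limit of 100 comes from our setup:

```python
from typing import Tuple


def mover_queue_state(requested: int, max_active: int = 100) -> Tuple[int, int]:
    """Split transfer requests into (active, queued) for a given limit."""
    if requested < 0 or max_active < 0:
        raise ValueError("counts must be non-negative")
    active = min(requested, max_active)       # movers that run right away
    queued = max(0, requested - max_active)   # movers that have to wait
    return active, queued


# Hypothetical load: 130 simultaneous writes against a limit of 100.
print(mover_queue_state(130))  # prints (100, 30)
```

Every queued mover is a job sitting idle on a WN while it waits for a slot, so its wall-clock time keeps growing while its CPU time does not.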
So, we still do not have a clue what the actual problem was, but it looks like an interesting investigation, so I felt like posting it here :-)