Tuesday 27 April 2010

Oops! A would-be transparent operation

If you look now at the ATLAS data transfers dashboard, you will easily find PIC, since our efficiency over the last 24 hours barely reaches 50%. The reason is the peak of transfer failures (orange in the plot) that we experienced yesterday between 10h and 14h. Up to 4000 transfers to PIC were failing per hour for a couple of hours.
These transfers were failing with "permission denied" errors at the PIC destination. The cause was our attempt to implement an improved configuration for ATLAS in dCache: different uid/gid mappings for the "user" and "production" roles so that, for instance, one cannot delete the other's files by mistake.
The recursive chown and chmod commands on the full ATLAS name space were more expensive than we expected, so in the end the operation was not transparent. It took around 11 hours for these recursive commands to finish (we hope this will get better with Chimera), but thanks to our storage expert MoD manually helping in the background, most of the errors were only visible for 4 hours.
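To give an idea of why such a recursive pass is expensive, here is a minimal Python sketch of a recursive ownership and permission change. It is only an illustration, not the commands we actually ran: the function name, the default modes and the dry-run switch are all assumptions. The point is that every file and directory in the name space has to be touched once, so the cost grows with the number of entries.

```python
import os


def recursive_chown_chmod(root, uid, gid, dir_mode=0o755, file_mode=0o644,
                          dry_run=True):
    """Apply uid/gid and a mode to every entry under root.

    Returns the number of entries visited. With dry_run=True nothing is
    changed, so the call only counts how much work the real run would do.
    The default modes are hypothetical, chosen for illustration only.
    """
    visited = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # One entry for the directory itself, then one per file inside it.
        entries = [(dirpath, dir_mode)]
        entries += [(os.path.join(dirpath, name), file_mode) for name in filenames]
        for path, mode in entries:
            if os.path.islink(path):
                continue  # leave symbolic links alone in this sketch
            visited += 1
            if not dry_run:
                os.chown(path, uid, gid)
                os.chmod(path, mode)
    return visited
```

Running it first with dry_run=True returns the number of entries that would be touched, which gives a rough idea of the size of the job before starting it for real.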

Monday 26 April 2010

Scheduled intervention, in sync with LHC technical stop

We are right now draining PIC in preparation for a scheduled intervention tomorrow. This is the first time we have tried to schedule an intervention in sync with the LHC operational schedule. Let's see how the experience works out. In principle, it should be good for sites to synchronize their stops with the accelerator, but on the other hand we should make sure we do not all stop at the same time! Communication challenges... our favorite :-)
One of our main interventions tomorrow will be the firmware upgrade of a bunch of 3Com switches that we use to interconnect many of our disk and CPU servers. In recent days we have had quite a number of issues (tickets 57623, 57617, 57177), reported mainly by ATLAS. We believe these are caused by the old firmware in these switches. However, this is just a theory, of course... we will see after this intervention whether these network failures disappear.
We always think that, having dozens of disk servers as we do for ATLAS, the temporary failure of one of them would not be much of an issue. But this is not quite so. The attached plot shows how, during the night of 23rd to 24th April, transfers from PIC to Tier2s failed at a rate of up to 800 failed transfers per hour. The problematic disk pool was in fact detected by ATLAS before we detected it ourselves.