Monday 26 April 2010

Scheduled intervention, in sync with LHC technical stop

We are currently draining PIC in preparation for a scheduled intervention tomorrow. This is the first time we have tried to schedule an intervention in sync with the LHC operational schedule. Let's see how the experience goes. In principle, it should be good for sites to synchronize their stops with the accelerator, but on the other hand we should make sure that we do not all stop at the same time! A communication challenge... our favorite :-)
One of our main interventions tomorrow will be a firmware upgrade on a number of 3Com switches that we use to interconnect many of our disk and CPU servers. In the last few days we have had quite a number of issues (tickets 57623, 57617, 57177), reported mainly by ATLAS. We believe these are caused by the old firmware in these switches. However, this is just a theory, of course... we will see after this intervention whether these network failures disappear.
We always think that, with dozens of disk servers as we have for ATLAS, the temporary failure of one of them would not be much of an issue. But this is not quite so. The attached plot shows how, in the night of 23rd to 24th April, transfers from PIC to Tier2s failed at a rate of up to 800 failed transfers per hour. The problematic disk pool was in fact detected by ATLAS before we detected it ourselves.
