Thursday, 15 May 2008

In April, a thousand rains

Following the Spanish proverb "en abril, aguas mil", the rain finally fell on Catalunya in April after months of severe drought. For the PIC reliability metric, the rainy season started a bit earlier, in March, and it looks as if it kept raining in April. In that month, our reliability result was a yellowish (not green, not red) 90%. Slightly better than March's (...positive slope, which is always good news) but still below the 93% WLCG goal for Tier-1s. The main contribution to last month's unreliability was, believe it or not, the network. The networking conspiracy we suffered on 28-29 April is responsible for at least 8 of the 10 reliability percentage points we lost last month. The non-dedicated backup described by Gerard hours ago will help to lower our exposure to network outages, but we should keep pushing for a dedicated one in the future.
We also had other operational issues last month, which contributed to the overall unreliability. Every service had its grey day: in the Storage service, a gridftp door (dcgftp07 by name) mysteriously hung in a funny way such that the clever dCache could not detect it and kept trying to use it. As far as I know, this is still in the Poltergeist domain. I hope it will just go away with the next dCache upgrade. For the Computing service there have also been some hiccups... it is not nice when an automated configuration system decides to erase all of the local users on the farm nodes at 18:00.
We even had a nice example of collaborative destruction among Services last month: a supposedly routine and harmless pool replication operation in dCache ended up saturating a network switch which happened to have, among others, the PBS master connected to it, and the master immediately lost connectivity to all of its Workers. Was it sending the jobs to the data, or bringing the data to the CPUs? Anyhow, a nice example of Storage - Computing love.

