Monday, 1 March 2010
January availability report
We started 2010 with a number of issues affecting our two main Tier1 services: Computing and Storage. They were not that bad to make us failing the availability/reliability target (we still scored 98%) but sure there are lessons to learn.
The first issue affected ATLAS and it showed up on Jan 2nd in the evening, when the ATLASMCDISK token completely filled up: no free space! This is a disk-only token, so the experiment should manage it. ATLAS acknowledged it had had some issue with its data distribution during Christmas. Apparently they were sending to this disk-only token some data that should have gone to tape. Anyway, it was still quite impressive to see how ATLAS was storing 80 TB of data in just about 3 days. Quite busy Christmas days!
The second issue appeared on the 25th Jan and was more worrisome. The symptom was an overload of the dCache SRM service. After some investigation, the cause was traced to be the hammering of the PNFS carried out simultaneously by some MAGIC inefficient jobs plus also inefficient ATLAS bulk deletions. This issue puzzle our storage experts for 2 or 3 days. I hope we have now the monitoring in place that helps us next time we see something similar. One might try and patch the PNFS, but I believe that we can suffer from its non-scalability until we migrate to Chimera.
The last issue of the month affected the Computing Service and sadly had a quite usual cause: a badly configured WN acting as blackhole. This time it was apparently a corrupted /dev/null in this box (never quite understood how it appeared). We made our blackhole detection tools stronger after this incident, so that it will not happen again.