Friday 23 May 2008

LAN is burning, dial 99999

This has been an intense CCRC08 week. We have kept PIC up and running, and the performance of services has been mostly good. However, we have had some interesting issues from which I think we should try and learn some lesson.
CMS has been basically the only user of the PIC farm this week, due to the lack of competition from other VOs. It was running about 600 jobs in parallel for some days. Rapidly, we saw how these jobs started to read input data from the dCache pools at a huge rate. By tuesday the WNs were reading at about 600MB/s sustained. Both the disk and the WNs switches were saturating. On thursday noon Gerard raised some of the Thumpers network uplink from 5 to 8 Gbps with 3 temporary cables crossing the room (yes, we will tidy them up once the so long awaited 10Gbps uplinks arrive) and we immediately saw how the extra bandwidth was immediately eaten up.
On thursday afternoon the WNs spent some hours reading at the record rate of 1200MB/s from disks. Then we tried to limit the number of parallel jobs CMS was running at PIC, and we saw that with only 200 parallel jobs it was already filling up 1000MB/s.
Homework for next week is to understand the characteristics of these CMS jobs (seems that are the "fake-skimming" ones). Which is their MB/s/job figure? (been asking the same question for years, now once again).
The second ccrc08-hickup this week arrived yesterday evening. ATLAS transfers to PIC disk started to fail. Among the various T0D1 atlas pools, there were two with plenty of free space while the others were 100% full. For some reason dCache was assigning transfers to the full pools. We have sent an S.O.S. to dCache-support and they answered immediately with the magic (configuration) recipe to solve this (thanks Patrick!).
Now looks like thinks are quiet (apart from some blade switches burning) and green... ready for the weekend.

No comments: