Friday 16 May 2008

Intensive test of data replication during the CCRC08 run2

During the last three days, as part of run 2 of the CCRC08 (Common Computing Readiness Challenge), PIC performed impressively when importing data. The aim of the test was to replicate data among all nine ATLAS Tier-1s: each one held a 3TB subset of data, which was replicated to every other T1 according to some shares. Overall, more than 16TB were replicated from all the other T1s to PIC. We started the test on Tuesday morning, and after 14 hours PIC had imported more than 90% of the data with impressive sustained transfer rates. Moreover, we reached 100% efficiency from 7 T1s and 80% from the other two. During the first burst, when all the dataset subscriptions were placed in bulk, we reached approximately 0.5GB/s of throughput with 99% efficiency (see plot number 1):



After 4 hours the import efficiency was still at 99% and the accumulated mean throughput was above 300MB/s. This led to the correct placement of 16TB of data in 14 hours.
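As a quick sanity check on these figures, here is a minimal sketch of the arithmetic behind the mean rate. It assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and uses the rounded 16TB and 14 hours quoted above, so the result is only a rough estimate; the function name is purely illustrative.

```python
def mean_throughput_mb_per_s(terabytes, hours):
    """Average transfer rate in MB/s, assuming decimal units (assumption)."""
    total_bytes = terabytes * 1e12   # 1 TB taken as 10^12 bytes
    seconds = hours * 3600           # 1 hour = 3600 seconds
    return total_bytes / seconds / 1e6

# Rounded figures from the post: 16TB placed in 14 hours.
rate = mean_throughput_mb_per_s(16, 14)
print(f"{rate:.0f} MB/s")
```

The result lands a little above 300MB/s, which is consistent with the accumulated mean throughput seen in the monitoring.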

On the other hand, we observed different behavior while exporting data to other T1s: the efficiencies were not as good as for importing. Note that while exporting we rely on external FTS services, so some of the errors are not related to our capabilities, but we did observe some timeouts from our storage system. The network was indeed a limit, but we observed saturation at pool level while importing and not when others were reading, which means that the network between the pools and the data receivers was not to blame for the efficiency drop while exporting. Our current understanding is that our dCache has problems returning the TURLs (transfer URLs) when it is under heavy load (a well-known dCache bug), which could cause the kind of errors observed during the exercise. Besides that, almost all the T1s got more than 95% of the data from PIC, though sometimes not at the first attempt, and the exercise can be considered more than successful from both the PIC side and the ATLAS side.

In my opinion this was one of the most successful tests run so far within the ATLAS VO, as it involved all T1s under a heavy load of data cross-replication. The majority of sites performed very well, and a couple of them experienced problems. Let me stress again the complexity of the system, which needs a bunch of services ready and stable in order to complete every single transfer: ATLAS data management tools, SE, FTS, LFC, network, etc.

We have to keep working hard to achieve robustness of the current system, which has improved enormously in the last months: dCache is performing very well, and the new database back-end for the FTS ironed out some of the overload problems found in past exercises. For that reason I want to deeply thank all the people involved in maintaining the computing infrastructure at PIC. These results show we are on the right track. Congrats!
