Monday, 15 June 2009

WLCG proposed a test period where all the VOs could exercise their computing models together in a realistic scenario before data taking. This exercise was named STEP09. The challenge ended on Friday at 23:00 CET. A lot of activity happened during these last two weeks and the ATLAS Computing Model was fully exercised (finally!):

a) Huge load of MC production, exercising the farms and the data aggregation from the T2s to the T1. The job types were basically G4 simulation jobs and AOD merging.

b) Huge load of Analysis jobs that were racing against the MC ones. I'd like to emphasize that this activity was quite *disorganized* overall... that was good, as it could simulate quite well the possible impact of the uncontrolled user analysis that everyone will get very soon. There were general problems with Athena build jobs (compatibility libraries), the PanDA brokerage was bypassing the ATLAS release check at the sites (spotted last Thursday), and the protocols to get data were basically lcg-cp for the File Stager and dcopen/file for remote connection... there's a lot of room for improvement in the Pilot Jobs based User Analysis, and hopefully some tests will be performed in the coming months (use "natural SE" protocols for the FS, tune the Read Ahead Buffer, which was 32 KB during STEP, etc.)

c) Non-Stop Reprocessing: this was the primary activity at the Tier-1s and was focused on performing multi-VO reprocessing at the same time for those Tier-1s serving more than one experiment. That was achieved: ATLAS, CMS and LHCb managed to use the robots at the same time for several days. Six ATLAS sites managed to pass the metric: reprocess at five times the data taking speed assuming 40% LHC machine efficiency. And three sites managed to get gold stars, for which the metric was: reprocess at five times the data taking speed assuming 100% LHC machine efficiency. PIC obtained a gold star.

d) Data distribution: this was broadly tested on top of the processing activity. Merged AODs were pre-placed at the sites (T1-T1-T2s), and Functional Tests kept running all the time during the two weeks. There were no major incidents, although some sites had some instabilities during the STEP: disk space running out, gridftp doors overloaded (intensive LAN and WAN activity). Overall, though, the data flow has shown its robustness and a major improvement with respect to two years ago.
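The reprocessing metrics in item c) can be made concrete with a bit of arithmetic. What follows is only a minimal sketch of how I read them: the target rate is the nominal data taking rate scaled by the assumed LHC efficiency and by the factor of five. The function names and the nominal rate of 100 (in arbitrary units) are assumptions made up for the illustration, they are not official STEP09 numbers.

```python
def required_rate(nominal_rate, lhc_efficiency, speedup=5):
    """Rate a site must sustain to reprocess `speedup` times faster
    than data taking, for a given LHC machine efficiency (0..1)."""
    return speedup * nominal_rate * lhc_efficiency


def classify(achieved_rate, nominal_rate):
    """Return 'gold', 'pass' or 'fail' following the two metrics in item c)."""
    if achieved_rate >= required_rate(nominal_rate, 1.0):
        return "gold"  # five times data taking speed at 100% efficiency
    if achieved_rate >= required_rate(nominal_rate, 0.4):
        return "pass"  # five times data taking speed at 40% efficiency
    return "fail"


# Hypothetical nominal rate for one site's share, in arbitrary units.
NOMINAL = 100.0

for achieved in (150.0, 250.0, 600.0):
    print(achieved, classify(achieved, NOMINAL))
```

Read this way, the gold star target is 2.5 times higher than the passing one, which gives an idea of the margin shown by the three sites that got it.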

The STEP showed that manual work is still required, but we managed to spot and learn which things will be the most demanding ones. The target now is to work in this direction and deploy a scalable operations model before data taking starts.

Concerning our cloud, I'd say the exercise was a real success! I want to thank every party involved. The sites have been stable (no major issues), which allowed STEP to pass through in a quite intensive way. I've barely seen an idle CPU anywhere in the cloud, leaving the schedulers to battle with the different roles and jobs. The data transfer rate has also been pretty high, and the output from PIC to the Tier-2s has been quite stable at around 150-200 MB/s.

Monte Carlo production ran between 1k and 1.5k jobs during STEP, and the efficiencies reached were well above 90%.

On the other side, User Analysis jobs were also filling the remaining slots; see the snapshot for the Pilot-based jobs (the ones directly submitted through WMS are not accounted for). Efficiencies ranged between 70% and 80% depending on the sites playing the game. There were some issues (all understood!) that prevented pilot user analysis jobs from finishing correctly.

PIC reprocessed the data twice; job efficiencies and the data flow from the robot to the buffers, and finally to the WNs, showed very good performance. Pre-staging managed to keep up with 500 concurrent jobs, which is beyond the required reprocessing speed for a small Tier-1. The interesting point is that the reprocessing outputs were written to tape as well, which is what the computing model says, and we found no major problems with the simultaneous usage of drives for reading and writing.

What we learned during the STEP is that, once again, storage services are the most sensitive layer of the infrastructure. A small outage on the SE induces a general failure of every single activity. Storage services will have to be properly dimensioned at those sites that had instabilities under constant load. We also learned that disk space is as volatile as it is in our laptops: a good forecast of the storage versus the desired data is mandatory. It is true that ATLAS has to be more aggressive in data deletion, but this is in the hands of the physics groups. I'd say that we should reinforce the federated structure of our Tier-2s, take advantage of it for the data distribution, and always preserve some disk space in case of crisis. In the near future ATLAS will provide an interface for modifying the shares at the sites, so that they can be dynamic and not human-based. DDM will also cross-check that the space token has enough free space before the data is shipped, hence preventing replication in case of shortages.
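To illustrate the kind of cross-check I have in mind, here is a minimal sketch of a free-space test done before shipping a dataset, including a reserve kept aside for a crisis. This is not the DDM implementation or its API: the `SpaceToken` class, the `can_ship` function, the token name and the 10% reserve are all hypothetical and only there to show the idea.

```python
from dataclasses import dataclass

TB = 10**12  # bytes in a (decimal) terabyte


@dataclass
class SpaceToken:
    """Hypothetical view of a space token at a site."""
    name: str
    total_bytes: int
    used_bytes: int

    @property
    def free_bytes(self) -> int:
        return self.total_bytes - self.used_bytes


def can_ship(token: SpaceToken, dataset_bytes: int, reserve_percent: int = 10) -> bool:
    """True if the dataset fits while keeping `reserve_percent` of the
    token's total capacity untouched as a safety margin."""
    reserve = token.total_bytes * reserve_percent // 100
    return token.free_bytes - reserve >= dataset_bytes


# Hypothetical token: 100 TB in total, 85 TB already used.
token = SpaceToken("HYPOTHETICAL_DATADISK", 100 * TB, 85 * TB)

print(can_ship(token, 4 * TB))  # fits: 15 TB free, 10 TB kept in reserve
print(can_ship(token, 6 * TB))  # would eat into the reserve
```

The reserve is the important design choice here: refusing a transfer a bit early is much cheaper than recovering a site whose disk has filled up in the middle of an exercise.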

Once again my feeling and understanding is that our cloud is ready for data taking, let's hope data start to flow soon...
