The Universidad Técnica Federico Santa Maria (UTFSM) of Valparaiso (Chile) has joined ATLAS Computing and has been associated with PIC as its Tier-1 center. It is a small center (Tier-3) for now, but it will surely grow soon. The first transfers using the full chain of the ATLAS Data Management System have been successfully tested. We welcome UTFSM to PIC and to the distributed computing world!
Thursday, 18 June 2009
Monday, 15 June 2009
WLCG proposed a test period in which all the VOs could exercise their computing models together in a realistic scenario before data taking. This exercise was named STEP09. The challenge ended on Friday at 23:00 CET. A lot of activity took place during these last two weeks, and the ATLAS Computing Model was fully exercised (finally!):
a) A huge load of MC production, exercising the farms and the data aggregation from the T2s to the T1. The job types were basically G4 simulation jobs and AOD merging.
b) A huge load of analysis jobs that were racing against the MC ones. I'd like to emphasize that this activity was, overall, quite *disorganized*... which was good, as it probably simulated quite well the impact of the uncontrolled user analysis that everyone will get very soon. There were general problems with Athena build jobs (compatibility libraries), the PanDA brokerage was bypassing the ATLAS release check at the sites (spotted last Thursday), and the protocols used to get data were basically lcg-cp for the File Stager and dcopen/file for remote connection... There is a lot of room for improvement in the pilot-job-based user analysis. Hopefully some tests will be performed in the coming months (use "natural SE" protocols for the FS, tune the Read Ahead Buffer, which was 32Kb during STEP, etc.)
c) Non-stop reprocessing. This was the primary activity at the Tier-1s, and it focused on performing multi-VO reprocessing at the same time for those Tier-1s serving more than one experiment. That was achieved: ATLAS, CMS and LHCb managed to use the robots at the same time for several days. Six ATLAS sites managed to pass the metric: reprocess at five times the data-taking speed, assuming 40% LHC machine efficiency. Three sites managed to get gold stars, for which the metric was: reprocess at five times the data-taking speed, assuming 100% LHC machine efficiency. PIC obtained a gold star.
d) Data distribution. This was broadly tested on top of the processing activity. Merged AODs were pre-placed at the sites (T1-T1-T2s), and the Functional Tests kept running all the time during the two weeks. There were no major incidents, although some sites had instabilities during STEP: disk space running out and gridftp doors being overloaded (intensive LAN and WAN activity). Overall, the data flow has shown its robustness and a major improvement with respect to two years ago.
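The reprocessing metrics in item c) come down to simple arithmetic, sketched below. This is only an illustration: the function name and the nominal rate of 100 (in arbitrary units) are placeholders and not figures from STEP09. What it shows is that the gold star target (100% LHC efficiency) is 2.5 times more demanding than the baseline target (40%).

```python
def required_reprocessing_rate(nominal_rate, speedup=5.0, lhc_efficiency=0.4):
    """Rate a site must sustain to pass a STEP09-style reprocessing metric.

    nominal_rate   -- data-taking rate while the LHC delivers beam
                      (placeholder value and units, not from the post)
    speedup        -- factor over data-taking speed (five in the post)
    lhc_efficiency -- assumed fraction of time the LHC delivers beam
    """
    return nominal_rate * speedup * lhc_efficiency


# Placeholder nominal rate, arbitrary units (assumption for illustration).
nominal = 100.0

baseline = required_reprocessing_rate(nominal, lhc_efficiency=0.4)
gold_star = required_reprocessing_rate(nominal, lhc_efficiency=1.0)

print(f"baseline target:  {baseline:.0f}")
print(f"gold star target: {gold_star:.0f}")
print(f"gold star / baseline: {gold_star / baseline:.1f}")
```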
STEP showed that manual work is still required, but we managed to spot and learn which things will be the most demanding. The target now is to work in this direction and deploy a scalable operations model before the start of data taking.
Concerning our cloud, I'd say the exercise was a real success! I want to thank every party involved. The sites have been stable (no major issues), which allowed STEP to run in a quite intensive way. I've barely seen an idle CPU anywhere in the cloud, leaving the schedulers to battle with the different roles and jobs. The data transfer rate has also been pretty high, and the output from PIC to the Tier-2s has been quite stable at around 150-200MB/s.
Monte Carlo production ran between 1k and 1.5k jobs during STEP, and the efficiencies reached were well above 90%.
On the other side, user analysis jobs were also filling the remaining slots; see the snapshot for the pilot-based jobs (the ones submitted directly through WMS are not accounted for). Efficiencies ranged between 70% and 80% depending on the sites playing the game. There were some issues (all understood!) that prevented pilot user analysis jobs from finishing correctly.
PIC reprocessed the data twice. Job efficiencies and the data flow from the robot to the buffers, and finally to the WNs, showed very good performance. Pre-staging managed to keep up with 500 concurrent jobs, which is beyond the required reprocessing speed for a small Tier-1. The interesting point is that reprocessing outputs were written to tape as well, which is what the computing model says, and we found no major problems with the simultaneous use of the drives for reading and writing.
What we learned during STEP is that, once again, storage services are the most sensitive layer of the infrastructure. A small outage on the SE induces a general failure of every single activity. Storage services will have to be properly dimensioned at those sites that had instabilities under constant load. We also learned that disk space is as volatile as it is on our laptops: a good forecast of the storage available versus the data desired is mandatory. It is true that ATLAS has to be more aggressive in data deletion, but this is in the hands of the physics groups. I'd say that we should reinforce the federative structure of our Tier-2s, take advantage of it for the data distribution, and always keep some disk space in reserve in case of crisis. In the near future ATLAS will provide an interface for modifying the shares at the sites, so that they can be dynamic rather than human-driven. DDM will also cross-check whether the space token has enough free space before the data is shipped, hence preventing replication in case of shortages.
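As a rough illustration of the two ideas above (check the free space before shipping data, and always keep some disk in reserve), here is a minimal sketch. It is not the DDM implementation or its API: the `SpaceToken` class, the `can_replicate` function, the token name and the 10% reserve are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpaceToken:
    """Hypothetical view of a space token at a site (sizes in TB)."""
    name: str
    total_tb: float
    used_tb: float

    @property
    def free_tb(self) -> float:
        return self.total_tb - self.used_tb


def can_replicate(token: SpaceToken, dataset_tb: float,
                  reserve_fraction: float = 0.1) -> bool:
    """Accept a replica only if the reserve survives after the transfer.

    reserve_fraction is the share of the total space kept back
    "in case of crisis" (10% is an arbitrary illustrative value).
    """
    reserve_tb = token.total_tb * reserve_fraction
    return token.free_tb - dataset_tb >= reserve_tb


# Hypothetical token: 100 TB in total, 80 TB already used.
token = SpaceToken(name="HYPOTHETICAL_DATADISK", total_tb=100, used_tb=80)

for size_tb in (5, 15):
    verdict = "ship it" if can_replicate(token, size_tb) else "skip, not enough space"
    print(f"{size_tb} TB dataset -> {verdict}")
```

With these numbers the 5 TB dataset is accepted and the 15 TB one is refused, because it would eat into the reserved 10 TB.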
Once again, my feeling and understanding is that our cloud is ready for data taking. Let's hope data starts to flow soon...
Monday, 8 June 2009
Tier-1 reliability in March and April
Now that everybody is talking about hot topics such as STEP09, let us take a break to review some pretty outdated stuff. We will surely be posting about STEP09 shortly, but reviewing the monthly reliability scores and, most importantly, highlighting the causes of failures as homework to improve the service in the future has always proven to be a very useful exercise. Let's go then :-)
In March PIC scored 99% in reliability and 98% in availability. Pretty good, and above target. The missing reliability was caused by the jobs of a MAGIC user that filled up the local disks of the WNs, turning them into black holes. Interesting lessons to learn here: protect ourselves from users filling up the disk (they will always do it, even if not deliberately) and minimise the impact that one user can have on the rest of the PIC user community.
April was not such a good month. Our reliability was above target (98%) but our availability was not (92%). The cause of the latter was the yearly electrical maintenance of our building. We scheduled a two-day downtime, and this brought us below the availability target for the Tier-1s. During this scheduled downtime we tested a reduced downtime for the LHCb-DIRAC service. We managed to stop this service for only about 8 hours, so next time we should apply this to the other Tier-1 critical services.
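To see why a scheduled downtime hurts availability but not reliability, here is a small sketch. It assumes the usual definitions, which are not spelled out in this post: availability counts all downtime against the whole month, while reliability leaves scheduled downtime out of the denominator. The function names are just illustrative.

```python
def availability(total_h, scheduled_down_h, unscheduled_down_h):
    """Fraction of the whole period in which the site was up."""
    up_h = total_h - scheduled_down_h - unscheduled_down_h
    return up_h / total_h


def reliability(total_h, scheduled_down_h, unscheduled_down_h):
    """Same as availability, but scheduled downtime is not counted against us."""
    up_h = total_h - scheduled_down_h - unscheduled_down_h
    return up_h / (total_h - scheduled_down_h)


# April: 30 days, with a two-day scheduled downtime and (for this
# illustration only) no unscheduled downtime at all.
april_h = 30 * 24
scheduled_h = 2 * 24

print(f"availability: {availability(april_h, scheduled_h, 0):.1%}")
print(f"reliability:  {reliability(april_h, scheduled_h, 0):.1%}")
```

Even with no unscheduled problems at all, two days of scheduled downtime in a 30-day month cap availability at about 93%, while reliability stays at 100%.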
The missing reliability in April was caused by a pretty bizarre problem in the SRM server. For some days we suffered huge overloads. The cause was finally found to be in the configuration of the dCache PostgreSQL DB, in particular the scheduling of the "vacuum" procedures in the background. Using the "false" flag, as recommended in the documentation, was the origin of all our problems. After this incident, we have learnt quite a lot about PostgreSQL vacuum configuration and are pretty sure we are safe now. We have also learnt that trusting the documentation is not always a wise thing to do :-)