Tuesday, 27 July 2010
Now it has been CMS that has carried out this consistency check on the Storage at PIC. Fortunately, they also have quite automated machinery for this, so we got the results pretty fast.
Out of the almost 1 PB they have at PIC, CMS found a mere 15 TB of "dark data", i.e. files that were not present in their catalog. Most of them came from pretty recent (Jan 2010) productions that were known to have failed.
So, for the moment the CMS data seems to be around one order of magnitude "brighter" than the ATLAS data... another significant difference for two quite similar detectors.
Friday, 23 July 2010
The first two plots on the left show the last 24h of monitoring of the number of jobs in the farm and the total bandwidth in the Storage system, respectively. We see two nice peaks, around 17h and 22h, which actually got very close to 4 GB/s of total bandwidth being read from dCache. As far as I remember, we had never seen this before at PIC, so we have another record for our picture album.
Looking at the pools that got the load, we can deduce that it was ATLAS generating it. The good news is that the Storage and LAN systems at PIC coped with the load with no problems. Unfortunately, there is not much more we can learn from this: were these bytes actually generating useful information, or were they just an artifact of some suboptimal ROOT cache configuration?
Monday, 5 July 2010
This is what happened last 23rd June. The MC-M-DST space token of the LHCb experiment at PIC got full and, according to the monitoring, we have been stuck since then.
PIC is probably the smallest LHCb Tier1. Smaller than the average, and this probably creates some issues for the LHCb data distribution model. To first order, they consider all Tier1s the same size, so essentially all DST data should go everywhere.
PIC cannot pledge 16% of the LHCb needs for various reasons, which is why some months ago we agreed with the experiment that, in order to still make efficient use of the space we could provide, the data stored should be somehow "managed". In particular, we agreed that we could keep just the "two last versions" of the reprocessed data at PIC instead of keeping a longer history. It looked like a fair compromise.
Now our token is full and it looks like we are stuck. It is time to check whether that nice idea of "keeping only the two most recent versions" can actually be implemented.
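In its minimal form, the "keep only the two most recent versions" policy is just a grouping-and-sorting exercise. A sketch of the idea (the dataset names and the simple (dataset, version) pairing are hypothetical; the real LHCb bookkeeping is of course richer):

```python
from collections import defaultdict

def versions_to_keep(files, keep=2):
    """Given (dataset, version) pairs, return the versions to retain per
    dataset: the `keep` most recent ones. Everything else is a deletion
    candidate."""
    by_dataset = defaultdict(set)
    for dataset, version in files:
        by_dataset[dataset].add(version)
    return {d: sorted(v)[-keep:] for d, v in by_dataset.items()}

catalog = [("DST/Stripping", 1), ("DST/Stripping", 2),
           ("DST/Stripping", 3), ("DST/MC", 1)]
# Keeps versions 2 and 3 of DST/Stripping, and version 1 of DST/MC.
print(versions_to_keep(catalog))
```

The hard part in practice is not this logic but mapping files on the space token back to a well-defined "version" in the first place.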
Tuesday, 22 June 2010
Yesterday around 14:30 there was an interesting configuration change in the WNs at PIC. It looks like just an innocent environment variable, but setting GLOBUS_FTP_CLIENT_GRIDFTP2 to true does the business of telling the applications to use version 2 of the gridftp protocol instead of the old version 1. One of the most interesting features of the new version is that data streams are opened directly against the disk pools, so the traffic does not flow through the gridftp doors. This effect can be clearly seen in the left plot, where the graphic at the bottom shows the aggregated network traffic through the gridftp doors at PIC: it essentially went to zero after the change.
So, good news for the gridftp doors at PIC: we have less risk of a bottleneck there, and we can also plan for having fewer of them to do the job.
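Just as a sketch of the kind of sanity check one could drop into a WN job wrapper to confirm the switch is in place (the environment variable is the real Globus one mentioned above; the helper function itself is hypothetical):

```python
import os

def gridftp2_enabled(env=os.environ):
    """Return True if the environment selects gridftp v2, i.e. if
    GLOBUS_FTP_CLIENT_GRIDFTP2 is set to 'true' (case-insensitive)."""
    return env.get("GLOBUS_FTP_CLIENT_GRIDFTP2", "").lower() == "true"

# With the variable pushed to the WNs, the check passes:
print(gridftp2_enabled({"GLOBUS_FTP_CLIENT_GRIDFTP2": "true"}))  # True
# On an unconfigured node it does not:
print(gridftp2_enabled({}))  # False
```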
Friday, 18 June 2010
Few days later, our happiness turned into ... what's going on?
As days passed, we saw that the CPU efficiency of CMS reconstruction jobs at PIC was consistently very low (30-40%!!)... with no apparent reason for it! There was no CPU iowait in the WNs, nor did the disk servers show contention effects.
We still do not understand the origin of this problem, but have identified two possible sources:
1) The jobs themselves. We observed that most of the jobs with lower CPU efficiency were spitting out a "fast copy disabled" message at the start of their output logfile. The CMSSW experts told us that this means that "for some reason the input file has events which are out of order".
Interesting, indeed. We still need to confirm whether the 40% CPU efficiency was caused by these out-of-order input events...
2) Due to our "default configuration", plus the CMSSW one, those jobs were writing their output files to dCache using the gridftp v1 protocol. This means a) the traffic was passing through the gridftp doors, and b) it was using the "wan" mover queues in the dCache pools, which eventually reached the "max active" limit (set at 100 up to now), so movers were queued. This is always bad.
So, we still do not have a clue what the actual problem was, but it looks like an interesting investigation so I felt like posting it here :-)
Tuesday, 8 June 2010
In the process we found that we have almost 4 million ATLAS files at PIC, and about 10% of them appeared to be "dark", i.e. sitting on the disk but not registered in the LFC Catalog. Another 3.5% were also darkish, but of another kind: they were registered in our local Catalog but not in the central DDM one.
The plots on the left show the effect of this cleaning campaign. Now the blue (what ATLAS thinks there is at PIC) and red (what actually we have on disk) lines are matching better.
So, this would go into the "inefficiency" of experiments using the disks. We have quantified the usable fraction to be of the order of 90%, substantially higher than the 70% which is generally used for WLCG capacity planning.
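The consistency check itself boils down to set arithmetic on three file lists: what is on disk, what the local LFC knows, and what central DDM knows. A minimal sketch with made-up file names (the real machinery of course works on namespace dumps with millions of entries):

```python
def classify(files_on_disk, lfc_catalog, ddm_catalog):
    """Split the files found on disk into consistent files, 'dark' files
    (unknown to the local LFC) and 'darkish' files (in the LFC but not
    in central DDM)."""
    dark = files_on_disk - lfc_catalog
    darkish = (files_on_disk & lfc_catalog) - ddm_catalog
    ok = files_on_disk & lfc_catalog & ddm_catalog
    return ok, dark, darkish

disk = {"a", "b", "c", "d"}
lfc = {"a", "b", "c"}
ddm = {"a", "b"}
# Here 'd' is dark (on disk only) and 'c' is darkish (LFC but not DDM).
print(classify(disk, lfc, ddm))
```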
Thursday, 20 May 2010
It is true that this is starting to become quite routine, but I still cannot help opening my eyes wide when I see ATLAS moving data at almost 10 GB/s.
The plot shows the last 24h as shown in the DDM dashboard right now. Incoming traffic to PIC is shown in the 2nd plot. Almost half Gig sustained, not bad. Half to DATADISK and half to MCDISK.
Last but not least, the 3rd plot shows the traffic we are exporting to the Tier2s, also about half Gig sustained overall.
There is a nice feature to observe in the 2 last plots: the dip around last midnight. This is due to an incident we had with one of the DDN controllers. For some still unknown reason, the second controller did not take over transparently. Something to understand with the vendor support in the next days. Stay tuned.
Taking into account the severity of the incident, it is nice to see that the service was only affected for a few hours. The Manager on Duty fire brigade took corrective action in a very efficient manner (well done, Gerard!).
Now, let the vendors explain to us why the super-whooper HA mechanisms are only there when you test them, but not when you need them.
It had been quite a while since we last saw these guys consuming CPU cycles at PIC, so they were starting with their full Fair Share budget. It is interesting to see that, under these conditions, they were able to peak at 400 jobs quite fast, and in about 6 hours they had already crossed their Fair Share red line.
I hear that ATLAS is about to launch another reprocessing campaign, so they will be asking for their Fair Share shortly... I hope to see the LHCb load stabilising at their share at some point; otherwise I will start suspecting they have some problem with us :-)
Thursday, 6 May 2010
The ghost of "data loss" was there, coming for us. Luckily, after a somewhat "hero mode" weekend for our MoD and experts (thanks Marc and Gerard!), following the indications of Sun Support the problem could be solved with zero data loss (phew!). The recipe looks quite innocent from a distance: upgrade the OS to the latest version, Solaris 10u8.
We quite often find that a solution comes with a new problem, and this time was no exception. The updated OS rapidly solved the unmountable ZFS partition problem, but it completely screwed up the networking of the server.
We have not been able to solve this second problem yet, which is why the 125 TB of data of the upgraded server (dc012) were reconfigured to be served by its "twin" server (dc004). This is a nice configuration that the DDN SAN deployment enables. I think this is the first time we have tried this feature in production, and there we have the picture: dc004 serving 250 TB of ATLAS data with peaks up to 600 MB/s... and no problem.
Looks like, besides OS version issues, the DDN hardware is delivering.
For various reasons, which we are still in the process of elucidating, these atpilot jobs tend to get "stuck" reading input files, and they stay idle in the WN slot until the Panda pilot wrapper kills them. Luckily, it implements a 12-hour timeout for jobs detected as stalled.
So, this is the picture of today: 12h x 200 cores (2400 core-hours) going into the bin. I hope we will eventually find the ultimate reason why these atpilots are so reluctant to swallow our data.
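The stalled-job detection can be pictured as a simple heartbeat timeout: a job that has produced no output for longer than the limit gets killed. This is only an illustrative sketch, not the actual Panda pilot code:

```python
import time

STALL_TIMEOUT = 12 * 3600  # the 12-hour limit mentioned above, in seconds

def is_stalled(last_output_time, now=None, timeout=STALL_TIMEOUT):
    """A job is considered stalled when it has produced no output
    (no heartbeat on stdout/stderr) for longer than the timeout."""
    now = time.time() if now is None else now
    return (now - last_output_time) > timeout

# A job silent for 13 hours would be flagged; one silent for 1 hour not.
print(is_stalled(0, now=13 * 3600))  # True
print(is_stalled(0, now=1 * 3600))   # False
```

The flip side, which is exactly what hit us, is that the timeout bounds but does not remove the waste: a stuck job still burns a slot for the full 12 hours before the wrapper gives up on it.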
Tuesday, 27 April 2010
These were transfers failing with "permission denied" errors at the PIC destination, and the reason was us trying to implement an improved configuration for ATLAS in dCache: different uid/gid mappings for the "user" and "production" roles so that, for instance, one cannot delete the other's files by mistake.
The recursive chown and chmod commands on the full ATLAS namespace were more expensive operations than we expected, so the operation was in the end not transparent. It took around 11 hours for these recursive commands to finish (hopefully this will get better with Chimera), but thanks to our storage expert MoD manually helping in the background, most of the errors were only visible for 4 hours.
Monday, 26 April 2010
One of our main interventions tomorrow will be the upgrade of the firmware of a bunch of 3Com switches we use to interconnect many of our disk and CPU servers. In the last few days we have had quite a number of issues (tickets 57623, 57617, 57177), reported mainly by ATLAS, which we believe are caused by the old firmware in these switches. However, this is just a theory, of course... we will see after the intervention whether these network failures disappear.
We always assumed that, with the dozens of disk servers we have for ATLAS, the temporary failure of one of them would not be that much of an issue. But this is not quite so. The attached plot shows how, in the night from the 23rd to the 24th of April, transfers from PIC to the Tier2s failed at rates of up to 800 failed transfers per hour. The problematic disk pool was in fact detected by ATLAS before we detected it ourselves.
Tuesday, 30 March 2010
At 12:58:34 the LHC Control Centre declared stable colliding beams: the collisions were immediately detected in CMS. Moments later the full processing power of the detector had analysed the data and produced the first images of particles created in the 7 TeV collisions traversing the CMS detector.
CMS was fully operational and observed around 200000 collisions in the first hour. The data were quickly stored and processed by a huge farm of computers at CERN before being transported to collaborating particle physicists all over the world for further detailed analysis.
The first step for CMS was to measure precisely the position of the collisions in order to fine-tune the settings of both the collider and the experiment. This calculation was performed in real-time and showed that the collisions were occurring within 3 millimetres of the exact centre of the 15m diameter CMS detector. This measurement already demonstrates the impressive accuracy of the 27 km long LHC machine and the operational readiness of the CMS detector. Indeed all parts of CMS are functioning excellently – from the detector itself, through the trigger and data acquisition systems that select and record the most interesting collisions, to the software and computing Grids that process and distribute the data.
“This is the moment for which we have been waiting and preparing for many years. We are standing at the threshold of a new, unexplored territory that could contain the answer to some of the major questions of modern physics” said CMS Spokesperson Guido Tonelli. “Why does the Universe have any substance at all? What, in fact, is 95% of our Universe actually made of? Can the known forces be explained by a single Grand-Unified force?” Answers may rely on the production and detection in the laboratory of particles that have so far eluded physicists. “We’ll soon start a systematic search for the Higgs boson, as well as particles predicted by new theories such as ‘Supersymmetry’, that could explain the presence of abundant dark matter in our universe. If they exist, and the LHC will produce them, we are confident that CMS will be able to detect them.” But prior to these searches it is imperative to understand fully the complex CMS detector. “We are already starting to study the known particles of the Standard Model in great detail, to perform a precise evaluation of our detector’s response and to measure accurately all possible backgrounds to new physics. Exciting times are definitely ahead”.
Images and animations of some of the first collisions in CMS can be found on the CMS public web site http://cms.cern.ch
CMS is one of two general-purpose experiments at the LHC that have been built to search for new physics. It is designed to detect a wide range of particles and phenomena produced in the LHC’s high-energy proton-proton collisions and will help to answer questions such as: What is the Universe really made of and what forces act within it? And what gives everything substance? It will also measure the properties of well known particles with unprecedented precision and be on the lookout for completely new, unpredicted phenomena. Such research not only increases our understanding of the way the Universe works, but may eventually spark new technologies that change the world in which we live. The current run of the LHC is expected to last eighteen months. This should enable the LHC experiments to accumulate enough data to explore new territory in all areas where new physics can be expected.
The conceptual design of the CMS experiment dates back to 1992. The construction of the gigantic detector (15 m diameter by 21m long with a weight of 12500 tonnes) took 16 years of effort from one of the largest international scientific collaborations ever assembled: more than 3600 scientists and engineers from 182 Institutions and research laboratories distributed in 39 countries all over the world.
Monday, 29 March 2010
Last week we finally started receiving ATLAS TAG data through Oracle Streams, so we are now keeping an eye on how the users are going to consume such a "fancy" service. Selecting events by directly querying an Oracle DB sounds fancy... at least to me :-)
I think in the end we allocated around 4 TB of space for this DB, so it will also be the largest DB at PIC.
All in all, an interesting exercise for sure. I hope now users will come in herds to query the TAGs like mad... there we go.
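Just to illustrate what "selecting events by querying a DB" means, here is a toy version using Python's sqlite3 instead of Oracle. The table layout and column names (run_number, event_number, n_muons, missing_et) are invented for the example and bear no relation to the real ATLAS TAG schema:

```python
import sqlite3

# Build a tiny in-memory "TAG" table: one row of summary quantities per event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (run_number INT, event_number INT,"
             " n_muons INT, missing_et REAL)")
conn.executemany("INSERT INTO tag VALUES (?,?,?,?)",
                 [(152166, 1, 2, 45.0),
                  (152166, 2, 0, 12.3),
                  (152166, 3, 1, 80.5)])

# An "event selection" is then just a WHERE clause; the query returns the
# identifiers of the events passing the cuts, to be fetched for analysis.
rows = conn.execute("SELECT run_number, event_number FROM tag"
                    " WHERE n_muons >= 1 AND missing_et > 40").fetchall()
print(rows)  # [(152166, 1), (152166, 3)]
```

The appeal of the real service is the same: a user can skim millions of events down to an interesting handful without opening a single data file.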
Thursday, 18 March 2010
Anyway, there is this funny parameter in Enstore called "check_written_file" which tells Enstore whether to check that files were correctly written to tape... by reading them back! So, quite an expensive check, indeed.
Bottom line: we had it set to 10 without really realising it. On average, one out of every 10 files written was read back for checking. A bit too much, isn't it?
Last Tuesday the 16th, in the evening, this parameter was increased by a factor of at least 50.
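The cost of the check is easy to estimate: assuming a read-back costs roughly as much drive time as the original write, checking 1 out of every N files adds about 1/N of extra drive occupancy. A back-of-the-envelope sketch:

```python
def readback_overhead(check_every_n):
    """Fraction of extra tape-drive work due to reading back 1 out of
    every N written files, assuming reads cost about as much as writes."""
    return 1.0 / check_every_n

# Before: 1 in 10 files re-read -> ~10% extra drive time.
print(readback_overhead(10))   # 0.1
# After raising the parameter by a factor of 50 (1 in 500) -> 0.2%.
print(readback_overhead(500))  # 0.002
```

That ~10% matches nicely the ~30% improvement one would hope for only if mounts and seeks are also saved, which is why the effect is most visible for an experiment writing many files per mount.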
The good news is that the ATLAS performance we report to SLS clearly shows a 30% improvement at the expected moment (top plot). Good!
The not so good news is that the same plot for CMS (bottom plot) does not show any hint of improvement... one could even see a degradation! We believe (hope!) this is because CMS is not writing many files in one go these days, so it is dominated by tape mounts.
We will keep an eye on this, but it looks to me like we saved some Euros in tape drive throughput this week ;-)
Friday, 5 March 2010
The ATLAS Panda web page is quite cool, indeed, but not extremely useful for a layman to dig into.
It took us quite some time to realise that the source of these extremely inefficient jobs was just at the end of the corridor: our ATLAS Tier2 colleagues were submitting HammerCloud tests, checking that very low READ_AHEAD parameters for dCache remote access can be very inefficient. Next time we will ask them to keep the wave a bit smaller.
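The effect of a small read-ahead is easy to see with a back-of-the-envelope count of remote requests. This is a crude model that ignores caching and the actual dcap access pattern, but it shows why a tiny buffer multiplies the load on the pools:

```python
def remote_reads(file_bytes, read_ahead_bytes):
    """Rough number of network round trips needed to stream a file with
    a given read-ahead buffer (assuming sequential access and no caching
    beyond the buffer itself)."""
    return -(-file_bytes // read_ahead_bytes)  # ceiling division

one_gib = 1 << 30
print(remote_reads(one_gib, 4 * 1024))  # 4 KiB read-ahead: 262144 requests
print(remote_reads(one_gib, 1 << 20))   # 1 MiB read-ahead: 1024 requests
```

Two orders of magnitude more requests per file is exactly the kind of load that looks mysterious on the server side until someone mentions the client-side buffer setting.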
Monday, 1 March 2010
This paper is under publication in JHEP and others will follow. CMS went into a major water-leak repair during the Winter shutdown, and now we are ready for more data. In fact, the LHC has restarted operations this weekend, and a few splash events have been already recorded by CMS.
After twenty years of design, tests, construction and commissioning, now is time for CMS collaborators to enjoy the long LHC run. LHC, we are prepared for the beams!
We started 2010 with a number of issues affecting our two main Tier1 services: Computing and Storage. They were not bad enough to make us miss the availability/reliability target (we still scored 98%), but for sure there are lessons to learn.
The first issue affected ATLAS and showed up on the evening of Jan 2nd, when the ATLASMCDISK token completely filled up: no free space! This is a disk-only token, so the experiment is supposed to manage it. ATLAS acknowledged it had had some issues with its data distribution during Christmas: apparently they were sending to this disk-only token some data that should have gone to tape. Anyway, it was still quite impressive to see ATLAS storing 80 TB of data in just about 3 days. Quite busy Christmas days!
The second issue appeared on the 25th of January and was more worrisome. The symptom was an overload of the dCache SRM service. After some investigation, the cause was traced to the hammering of the PNFS carried out simultaneously by some inefficient MAGIC jobs plus some equally inefficient ATLAS bulk deletions. This issue puzzled our storage experts for 2 or 3 days. I hope we now have the monitoring in place to help us next time we see something similar. One might try and patch PNFS, but I believe we can put up with its non-scalability until we migrate to Chimera.
The last issue of the month affected the Computing Service and, sadly, had a quite usual cause: a badly configured WN acting as a black hole. This time it was apparently a corrupted /dev/null in this box (we never quite understood how that happened). We made our black-hole detection tools stronger after this incident, so that it will not happen again.
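A check for exactly this failure mode is cheap: /dev/null must be a character device, not a regular file left behind by some buggy process. A sketch of the kind of test a black-hole detector could run (the helper name is ours, not from any real tool):

```python
import os
import stat

def dev_null_ok(path="/dev/null"):
    """Return True if path exists and is a character device, as a
    healthy /dev/null must be; False if it is missing or has been
    replaced by an ordinary file."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return stat.S_ISCHR(st.st_mode)

print(dev_null_ok())  # True on a healthy node
```

Run at WN boot or from the job wrapper, a failed check can take the node offline before it silently eats an entire queue of jobs.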
Thursday, 18 February 2010
Last Tuesday, the 16th of February, Dr. Josep Flix went to IES Egara to give an overview of CERN, the LHC and the Grid to final-year high school students. Getting there was not an easy task: it was raining, I was carrying around 100 CERN brochures with me, some other PIC brochures, the laptop, and all this... riding my bike! After getting lost in the town and asking several locals, I finally arrived at the high school. "Wet", but on time. The students were really surprised to hear what we do at CERN: science and technology. In the end, I was lucky enough not to electrocute myself during the talk (remember the rain and me being "wet"), and the students were able to pose very interesting questions, indeed, well after the talk... Yes, the creation of black holes was also raised there, which seems to be quite a general and widespread concern. From here, I want to congratulate Physics teacher Juan Luis Rubio for keeping his students interested in Physics and with a very good knowledge of Particle Physics. After the talk, we also spent a good time in a nice restaurant in the town. By then, the rain was gone...
Friday, 29 January 2010
After the talks we paid a visit to the PIC installations, so they could see how a Computing Centre is built and managed. In groups of 15 people we first showed them the real-time views of what is actually happening on the Grid: the nice visualisation of WLCG grid activity on Google Earth, the ATLAS concurrent jobs running at all their Tiers, the CMS overall data transfer volumes, the LHCb job monitor display, and a few local monitoring plots, like the batch system and LAN/WAN usage.
Then the visit to the Computing Area itself started: we showed them the different kinds of disk pools we have installed, covering the SUN X4500 (we opened one so they could see how disks are installed and can be easily replaced) and the new powerful DDN system that offers 2 PB of disk space; our computational power, based on brand new HP Blade systems; plus the two tape robots we have at PIC (around 3 PB of data stored), which tapes are available on the market and how we use them. The students were impressed as well by the WAN and LAN capabilities, the latter recently improved with the acquisition of two new 10 Gbps switches.
All in all, the morning was extremely fruitful. From PIC we want to thank the teachers (Gregorio, Fernando, Lino, Alberto, Alexandra) for the dedication and motivation they offer their students. They enjoyed the visit and want to repeat it with other students from the school two months from now. We will be happy to receive them again! ;)