Tuesday, 27 July 2010

CMS Dark Data

Last month it was ATLAS who was checking the consistency of their catalogs and the actual contents in our Storage. The ultimate goal is to get rid of what has been called as "dark" or uncatalogued data, which fills up the disks with unusable data. Let us recall that at that time ATLAS found that 10% of their data at PIC was dark...
Now it has been CMS that has carried out this consistency check on the Storage at PIC. Fortunately, they have also quite automatized machinery for this so we have got the results pretty fast.
Out of almost 1PB they have at PIC, CMS has found a mere 15TB of "dark data", or files that were not present in their catalog. Most of them from pretty recent (Jan 2010) productions that were known to have failed.
So, for the moment the CMS data seems to be around one order of magnitude "brighter" than the ATLAS one... another significant difference for a two quite similar detectors.

Friday, 23 July 2010

ATLAS pilot analysis stressing LAN

These days a big physics conference is starting in Paris. May be this is the reason behind the ATLAS "I/O storm" analysis jobs we saw yesterday running at PIC... if this is so, I hope the guy sending them got a nice plot to show to the audience.
The two first plots on the left show the last 24h monitoring of the number of jobs in the farm and the total bandwidth in the Storage system, respectively. We see two nice peaks around 17h and 22h which got actually very near to a 4Gbytes/second total bandwidth being read from dCache. As far as I remember we had never seen this before at PIC, so we got another record for our picture album.
Looking at the pools that got the load, we can deduce that it was ATLAS who was generating this load. The good news is that the Storage and LAN systems at PIC coped with the load with no problems. Unfortunately, there is not much more we can learn from this: were these bytes actually generating useful information or were they just the artifact of some suboptimal ROOT caches configuration?

Monday, 5 July 2010

LHCb token full: game over, insert coin?

This is what happened las 23rd June. The MC-M-DST space token of the LHCb experiment at PIC got full and, according to the monitoring, we are stuck since then.
PIC is probably the smallest LHCb Tier1. Smallest than the average, and this probably creates some issues for the LHCb data distribution model. At first order, they consider all Tier1 the same size so essentially all DST data should go everywhere.
PIC can not pledge 16% of the LHCb needs for various reasons, so this is why some months ago we agreed with the experiment that, in order to still make an efficient use of the space we could provide, the data stored should be somehow "managed". In particular, we agreed that we could just keep the "two last versions" of the reprocessed data at PIC instead of keeping a longer history. Looked like a fair compromise.
Now we have our token full and looks we are stuck. It is time to check if that nice idea of "keeping only the two most recent versions" can actually be implemented.

Tuesday, 22 June 2010

Gridftpv2, the doors relief

Yesterday around 14:30 there was an interesting configuration change in the WNs at PIC. It looks just as an innocent environment variable, but setting GLOBUS_FTP_CLIENT_GRIDFTP2 to true it just does the business of telling the applications to use the version 2 of the gridftp protocol instead of the old version 1. One of the most interesting features of the new version is that data streams are opened directly against disk pools, so the traffic does not flow through the gridftp doors. This effect can be clearly seen in the left plot, where the graphic at the bottom shows the aggregated network traffic through the gridftp doors at PIC. It essentially went to zero after the change.
So, good news for the gridftp doors at PIC. We have less risk of a bottleneck there, and also can plan for having less of them to do the job.

Friday, 18 June 2010

CMS reprocessing on 1st gear at PIC

We have seen a quite puzzling effect in the last week. After several weeks of low CMS activity, around one week ago we happily saw how reprocessing jobs started arriving to PIC in the hundreds.
Few days later, our happiness turned into ... what's going on?
As days passed, we saw that the cpu efficiency of CMS reconstruction jobs at PIC was consistently very low (30-40%!!)... with no apparent reason for that! There was no cpu iowait in the WNs, nor the disk servers showed contention effects.
We still do not understand the origin of this problem, but have identified two possible sources:

The jobs themselves. We observed that most of the jobs with lower cpu efficiency were spitting a "fast copy disabled"message at the start of their output logfile. The CMSSW experts told us that this means that

"for some reason the input file has events which are
not ordered as the framework wants, and thus the framework will read from the input
out-of-order (which indeed can wreck the I/O performance and result in low cpu/wall

Interesting, indeed. We still need to confirm if the 40% cpu efficiency was caused by this out-of-order input events...

2) Due to our "default configuration", plus the CMSSW one, those jobs were writing the output files to dCache using the gridftpv1 protocol. This means a) the traffic was passing through the gridftp doors, and b) it was using the "wan" mover queues in the dCache pools which eventually reached the "max active" limit (at 100 up to now) so movers were queued. This is always bad.

So, we still do not have a clue of what was the actual problem but looks as an interesting investigation so I felt like posting it here :-)

Tuesday, 8 June 2010

ATLAS dark data

It was quite a while ago since we did not take the broom and did a bit of cleaning in our disks. One week ago we performed a Storage consistency check for the ATLAS data at PIC. Luckily, the tools and scripts to automatise this task have evolved quite a lot since we tried this last time so the whole procedure is now quite smooth.
In the process we have almost 4 million ATLAS files at PIC, and about 10% of them appeared to be "dark", i.e. sitting on the disk but not registered in the LFC Catalog. Another 3,5% were also darkish but of another kind: they were registered in our local Catalog but not in the DDM central one.
The plots on the left show the effect of this cleaning campaign. Now the blue (what ATLAS thinks there is at PIC) and red (what actually we have on disk) lines are matching better.
So, this would go into the "inefficiency" of experiments using the disks. We have quantified this to be of the order of 90%. Substantially higher than the 70% which is in general used for WLCG capacity planning.

Thursday, 20 May 2010

ATLAS torrent

It is true that this starts to be quite routine, but still I can not avoid to open my eyes wide when I see ATLAS moving data at almost 10 GB/s.
The plot shows the last 24h as shown in the DDM dashboard right now. Incoming traffic to PIC is shown in the 2nd plot. Almost half Gig sustained, not bad. Half to DATADISK and half to MCDISK.
Last but not least, the 3rd plot shows the traffic we are exporting to the Tier2s, also about half Gig sustained overall.
There is a nice feature to observe in the 2 last plots: the dip around last midnight. This is due to an incident we had with one of the DDN controllers. For some still unknown reason, the second controller did not take over transparently. Something to understand with the vendor support in the next days. Stay tuned.
Having into account the severity of the incident, it is nice to see that the service was only affected for few hours. Manager on Duty fire brigade took corrective action in a very efficient manner (ok Gerard!).
Now, let the vendors explain us why the super-whooper HA mechanisms are only there when you test them but not when you need them.

Welcome home lhcb pilot!

There is not much to say besides we are happy to see that the sometimes shy LHCb pilot jobs are back running at PIC since last midnight.
There was quite a while since we did not see these guys consuming CPU cycles at PIC, so they were starting with their full Fair Share budget. Interesting to see that in these conditions they were able to peak to 400 jobs quite fast and in about 6 hours they had already crossed their Fair Share red line.
I hear that ATLAS is about to launch another reprocessing campaign, so they will be asking for their Fair Share in a short time... I hope to see the LHCb load stabilizing at their share at some point, otherwise I will start suspecting they have some problem with us :-)

Thursday, 6 May 2010

DDN nightmare and muscle

We got a notable fright last weekend. Barça match against Inter in the Champions league semifinal was about to start when suddenly... crash. One of our flashy dCache pools serving a DDN fatty partition (125 TB, almost full of ATLAS data) got bananas.
The ghost of "data loss" was there, coming to us. Luckily, after a somewhat "hero mode" weekend for our MoD and experts (thanks Marc and Gerard!) following the indications of Sun-Support the problem could be solved with zero data loss (uf!). The recipe looks quite innocent from the distance: upgrade the OS to the last version, Solaris 10u8.
We find quite often that a solution comes with a new problem. This time was not an exception. The updated OS rapidly solved the unmountable ZFS partition problem, but it completely screwed up the network of the server.
We have not been able to solve this second problem yet, and this is why the 125TB of data of the upgraded server (dc012) were reconfigured to be served by its "twin" server (dc004). This is a nice configuration that the DDN SAN deployment enables. So this is I think the first time we try this feature in production, and there we have the picture: dc004 serving 250 TB of ATLAS data with a peak up to 600 MB/s... and no problem.
Looks like, besides OS version issues, the DDN hardware is delivering.

Pilot has decided to kill looping job... strikes back!

Some days ago we noticed in our dashboards a somewhat curious pattern. Here we go again: yesterday and today we can see the same behavior. A bunch of jobs in the batch showing near zero cpu efficiency (red in the upper plot). Looking for the smoking gun... we easily find a correlation with "atpilot" jobs (blue in the bottom plots). These atpilot jobs are nothing more than ATLAS user analysis jobs submitted through the Panda framework.
For various reasons, which we are still in the process of elucidating, these atpilot jobs tend to get "stack" reading input files, and they stay idle in the WN slot unntil the Panda pilot wrapper kills them. Luckily, it implements a 12 hours timeout for jobs detected as stalled.
So, this is the picture of today's 12h x 200 cores going to the bin. Hope we will find the ultimate reason why these atpilots are so reluctant to swallow our data... eventually.

Tuesday, 27 April 2010

Uops! a would be transparent operation

If you look now at the ATLAS data transfers dashboard, you will easily find PIC since our efficiency in the last 24hrs hardly arrives to 50%. The reason for this are the transfer failure peak (orange in the plot) that we experienced yesterday between 10h and 14h. Up to 4000 transfers to PIC were failing per hour during a couple of hours.
These were transfer failing with "permission denied" errors at PIC destination, and the reason was us trying to implement an improved configuration for ATLAS in dCache: different uid/gid mappings for "user" and "production" roles so that, for instance, one can not delete the other's files by mistake.
The recursive chown and chmod commands on the full ATLAS name space were more expensive operations than we expected, so the operation was in the end not transparent. It took around 11 hours for these recursive commands to finish (hope this will get better with Chimera) but thanks to our storage expert MoD manually helping in the background, most of the errors were only visible for 4 hours.

Monday, 26 April 2010

Scheduled intervention, in sync with LHC technical stop

We are right now draining PIC in preparation for a Scheduled intervention tomorrow. This is the first time we try and schedule an intervention in sync with the LHC operational schedule. Let's see how the experience works. In principle, it should be good that sites synchronize stops with the accelerator, but on the other hand we should make sure we do not stop all together! Communication challenge... our favorites :-)
One of our main interventions tomorrow will be the upgrade of the firmware of a bunch of 3Com switches we use to interconnect many of our disk and cpu servers. In the last days we have had quite a number of issues (tickets 57623, 57617, 57177) reported mainly by ATLAS. We believe these are caused by the old firmware in these switches. However, this is just a theory of course... will see after this intervention if these network failures disappear.
We always think that, having dozens of disk servers as we do have for ATLAS, the temporary failure of one of them would not be that much of an issue. But this is not quite so. The attached plot shows how in the night from 23rd to 24th April the transfers from PIC to Tier2s failed with up to 800 failed transfers per hour. The problematic disk pool was indeed first detected by ATLAS than by us.

Tuesday, 30 March 2010

CMS Statement for the 7 TeV collisions

Today the Large Hadron Collider (LHC) at CERN has, for the first time, collided two beams of 3.5 TeV protons – a new world record energy. The CMS experiment successfully detected these collisions, signifying the beginning of the “First Physics” at the LHC.

t 12:58:34 the LHC Control Centre declared stable colliding beams: the collisions were immediately detected in CMS. Moments later the full processing power of the detector had analysed the data and produced the first images of particles created in the 7 TeV collisions traversing the CMS detector.

CMS was fully operational and observed around 200000 collisions in the first hour. The data were quickly stored and processed by a huge farm of computers at CERN before being transported to collaborating particle physicists all over the world for further detailed analysis.

The first step for CMS was to measure precisely the position of the collisions in order to fine-tune the settings of both the collider and the experiment. This calculation was performed in real-time and showed that the collisions were occurring within 3 millimetres of the exact centre of the 15m diameter CMS detector. This measurement already demonstrates the impressive accuracy of the 27 km long LHC machine and the operational readiness of the CMS detector. Indeed all parts of CMS are functioning excellently – from the detector itself, through the trigger and data acquisition systems that select and record the most interesting collisions, to the software and computing Grids that process and distribute the data.

This is the moment for which we have been waiting and preparing for many years. We are standing at the threshold of a new, unexplored territory that could contain the answer to some of the major questions of modern physics” said CMS Spokesperson Guido Tonelli. “Why does the Universe have any substance at all? What, in fact, is 95% of our Universe actually made of? Can the known forces be explained by a single Grand-Unified force”. Answers may rely on the production and detection in laboratory of particles that have so far eluded physicists. “We’ll soon start a systematic search for the Higgs boson, as well as particles predicted by new theories such as ‘Supersymmetry’, that could explain the presence of abundant dark matter in our universe. If they exist, and LHC will produce them, we are confident that CMS will be able to detect them.” But prior to these searches it is imperative to understand fully the complex CMS detector. “We are already starting to study the known particles of the Standard Model in great detail, to perform a precise evaluation of our detector’s response and to measure accurately all possible backgrounds to new physics. Exciting times are definitely ahead”.

Images and animations of some of the first collisions in CMS can be found on the CMS public web site http://cms.cern.ch

CMS is one of two general-purpose experiments at the LHC that have been built to search for new physics. It is designed to detect a wide range of particles and phenomena produced in the LHC’s high-energy proton-proton collisions and will help to answer questions such as: What is the Universe really made of and what forces act within it? And what gives everything substance? It will also measure the properties of well known particles with unprecedented precision and be on the lookout for completely new, unpredicted phenomena. Such research not only increases our understanding of the way the Universe works, but may eventually spark new technologies that change the world in which we live. The current run of the LHC is expected to last eighteen months. This should enable the LHC experiments to accumulate enough data to explore new territory in all areas where new physics can be expected.

The conceptual design of the CMS experiment dates back to 1992. The construction of the gigantic detector (15 m diameter by 21m long with a weight of 12500 tonnes) took 16 years of effort from one of the largest international scientific collaborations ever assembled: more than 3600 scientists and engineers from 182 Institutions and research laboratories distributed in 39 countries all over the world.

Monday, 29 March 2010

Last week we finally started receiving ATLAS TAG data through the Oracle Streams, so we are now keeping an eye on how the users are going to consume such a "fancy" service. Selecting events directly querying an Oracle DB sounds fancy... at least to me :-)
I think in the end we allocated around 4 TB of space for this DB, so it will also be the largest DB at PIC.
All in all, an interesting exercise for sure. I hope now users will come in herds to query the TAGs like mad... there we go.

Friday, 5 March 2010


Fridays are normally interesting days, aren't they? No interventions or new actions should be scheduled for Fridays, to allow people enjoying a quiet weekend. But quite often Fridays come with a surprise. This morning surprise was this monitoring plot in the Ganglia PBS page. The CPU farm at PIC was being invaded by a growing red blob of very cpu inefficient jobs. The plot at the bottom pointed us to the originator: atlas pilot jobs.
The ATLAS Panda web page is quite cool, indeed, but not extremely useful for a profane to dig into it.
It took us quite some time to realise that the source of these extremely inefficient jobs was just at the end of the corridor: our ATLAS Tier2 colleagues submitting Hammercloud tests and checking that very low READ_AHEAD parameters for dCache remote access can be very inefficient. Next time we will ask them to keep the wave a big smaller.

Monday, 1 March 2010

LHC is back!

On February, the 7th, the CMS collaboration received the final positive referee report and publication acceptance on their very first Physics Results publication. The paper reports on first measurements of hadron production in proton-proton collisions occurred during the LHC commissioning December 2009 period. The successful operation and fast fata analysis impressed the editors and the entire collaboration was congratulated... and a party followed afterwards at CERN! ;)

This paper is under publication in JHEP and others will follow. CMS went into a major water-leak repair during the Winter shutdown, and now we are ready for more data. In fact, the LHC has restarted operations this weekend, and a few splash events have been already recorded by CMS.

After twenty years of design, tests, construction and commissioning, now is time for CMS collaborators to enjoy the long LHC run. LHC, we are prepared for the beams!

January availability report

We started 2010 with a number of issues affecting our two main Tier1 services: Computing and Storage. They were not that bad to make us failing the availability/reliability target (we still scored 98%) but sure there are lessons to learn.
The first issue affected ATLAS and it showed up on Jan 2nd in the evening, when the ATLASMCDISK token completely filled up: no free space! This is a disk-only token, so the experiment should manage it. ATLAS acknowledged it had had some issue with its data distribution during Christmas. Apparently they were sending to this disk-only token some data that should have gone to tape. Anyway, it was still quite impressive to see how ATLAS was storing 80 TB of data in just about 3 days. Quite busy Christmas days!
The second issue appeared on the 25th Jan and was more worrisome. The symptom was an overload of the dCache SRM service. After some investigation, the cause was traced to be the hammering of the PNFS carried out simultaneously by some MAGIC inefficient jobs plus also inefficient ATLAS bulk deletions. This issue puzzle our storage experts for 2 or 3 days. I hope we have now the monitoring in place that helps us next time we see something similar. One might try and patch the PNFS, but I believe that we can suffer from its non-scalability until we migrate to Chimera.
The last issue of the month affected the Computing Service and sadly had a quite usual cause: a badly configured WN acting as blackhole. This time it was apparently a corrupted /dev/null in this box (never quite understood how it appeared). We made our blackhole detection tools stronger after this incident, so that it will not happen again.

Thursday, 18 February 2010

PIC goes to IES Egara (outreach activity)

Last Tuesday, the 16th-February, Dr. Josep Flix went to IES Egara to give an overview of CERN, the LHC and the Grid to latest course high school students. It was not an easy task arriving there: it was raining, I was carrying around 100 CERN brochures with me, some other PIC brochures, the laptop, and all this... driving my bike! After getting lost on the town and asking several locals, I finally arrived to the high school. "Wet", but in time. The students were really surprised hearing what we do at CERN: science and technology. In the end, I was lucky enough not to electrocute myself during the talk (remember the rain and me "wet") and then students were able to place very interesting questions, indeed, well after the talk... Yes, the dark holes creation also was raised there, which seems a quite general and spread issue. From here, I want to congratulate Physics professor Juan Luis Rubio, to keep his students interested in Physics and with a very good knowledge of Particle Physics. After the talk, we spent also a good time in a nice restaurant on the town. At that time rain was gone...

Friday, 29 January 2010

IES Sabadell visits PIC (outreach activity)

Yesterday we had the visit of around 80 students and 5 professors from IES Sabadell to PIC installations. Their academic field based in Informatics ("Cicles formatius de grau mig/superior") made the tour to be exciting and full of questions. The visit, conducted by Dr. Josep Flix, started with two talks held in the IFAE Seminar Room (next to PIC). The first talk was entitled "The LHC and its 4 experiments: a data stream to understand the Big Bang" and was presented by Dra. Elisa Lanciotti, who is the LHCb contact at PIC. The students placed very interesting questions related to Physics and the techonology used on the LHC, during and after the talk. The level of curiosity was amazing! Maybe, in part related to the preparation sessions prior the visit the professors made and the comprenhesive Elisa's talk. Well after, Dr. Josep Flix presented "The use of Grid Computing by the LHC". He is currently the CMS contact at PIC and the CMS Facilities/Integration coordinator. The talk also raised questions from the attentive audience.

After the talks we made a visit to PIC installations, so they could see how a Computing Center is built and managed. In groups of 15 people we showed them first the real-time views of what's actually occurring on the Grid: the nice visualization of the WLCG grid activity on Google Earth, the ATLAS concurrent jobs running at all their Tiers, the CMS overall data transfer volumes, the LHCb job monitor display, and a few local monitoring plots, like the batch system and LAN/WAN usages.

Then, the visit to the Computing Area itself started: we showed them the different kind of disk pools we have installed, which covered the SUN X4500 (we opened one, so they could see how disks are installed and can be easily replaced) and the new powerful DDN system that offers 2 PBs of disk space; our computational power based on brand new HP Blade systems; plus the two tape robots we have at PIC (around 3 PBs of data stored) and which are the tapes available on the market and how we use them. The students were impressed as well on the WAN and LAN capabilities, the latest improved with the acquisition of two new 10 Gbps switches.

So far, the morning was extremely fruiful. From PIC we want to thank the Professors (Gregorio, Fernando, Lino, Alberto, Alexandra) for their dedication and motivation they offer to their students. They enjoyed the visit and want to repeat it with other students from the school in two months from now. We are happy to receive them again! ;)