Tuesday 16 December 2008

October/November reliability and the SRM nightmare

Here again, to comment on our latest reliability scores: 97% for October (good, above the 95% WLCG target) and 93% for November (not so good, the first time below target since March this year; do you remember the unscheduled "lights off"?).
It is not yet clear what happened at the end of October (maybe some services did not like the end of summer time on the 28th? :-) but something happened. On the 31st of that month we started seeing the SRM server failing with timeouts: the start of the nightmare. It was not such a terrible nightmare though, since a restart of the service did cure the problem. So that was the story until the scheduled intervention on 18 November: SRM timing out, MoDs restarting the service... and Paco chasing the problem. On the 18th, two SRM interventions were carried out: first, the SRM server was moved to a new machine with a 64-bit OS and the latest Java VM; second, the PinManager was once again taken out of the SRM server's Java virtual machine. The good news was that this cured the SRM timeout problem. The bad news was that a second SRM problem appeared: now the SRM-get requests were the only ones timing out (SRM-puts were working happily).
The solution came on 24 November, when we were made aware of the existence of different queues in the SRM for put, bringonline and get requests (good to know!). Once we had a look at them, we realised that the SRM-get queue had grown so large that it was hitting its internal limit. This happened because the experiments are issuing srm-get requests but not releasing them. Now we know we have to watch the srm-get queue closely: more monitoring, more alarms. Back to business.
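Just to give an idea of the kind of alarm we have in mind, here is a minimal sketch of an srm-get queue watchdog (the queue-length source, the limit and the alarm hook are all placeholders, not our actual monitoring code):

    # Hypothetical srm-get queue watchdog, meant to run from cron every few minutes.
    import subprocess

    QUEUE_LIMIT   = 1000    # assumed internal SRM limit on pending get requests
    WARN_FRACTION = 0.8     # raise the alarm at 80% of the limit

    def srm_get_queue_length():
        # Placeholder: in reality this number would be parsed out of the dCache admin/info interface.
        with open("/var/run/srm_get_queue_length") as f:
            return int(f.read().strip())

    length = srm_get_queue_length()
    if length >= WARN_FRACTION * QUEUE_LIMIT:
        # Placeholder alarm hook; in practice this would feed Nagios or page the Manager on Duty.
        subprocess.call(["send-alarm", "srm-get queue at %d of %d entries" % (length, QUEUE_LIMIT)])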

Friday 10 October 2008

September Availability: LHCb, the SRM-killer

The reliability of the Tier-1 at PIC last month was right on target: 95%. Unfortunately, once we add in our now-regular monthly scheduled intervention, the availability dropped slightly below target, to 93%.
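For those not fluent in WLCG metric-speak, the difference between the two numbers is just the scheduled downtime: reliability forgives it, availability does not. A back-of-the-envelope illustration (the hours below are made up for the example, not our actual September accounting):

    # Illustrative only: hypothetical downtime figures, not the real September log.
    hours_in_month   = 30 * 24        # 720 h
    scheduled_down   = 15             # e.g. one monthly scheduled intervention
    unscheduled_down = 34             # failures, network cuts, ...
    up               = hours_in_month - scheduled_down - unscheduled_down

    availability = up / float(hours_in_month)                    # scheduled downtime counts against you
    reliability  = up / float(hours_in_month - scheduled_down)   # scheduled downtime is "forgiven"
    print("availability = %.0f%%, reliability = %.0f%%" % (100 * availability, 100 * reliability))
    # -> availability ~ 93%, reliability ~ 95%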
The unavailability at PIC in September was mainly due to two issues. First, the overload of the SRM server caused by the submission of about 13,000 srm-get requests in one shot from LHCb production jobs. This affected the SRM service on three days in September. The first thing this incident made clear was one of the biggest problems of dCache, from my point of view: there is no way to have separate SRM servers, each dedicated to one experiment. We are forced to share the SRM server, so when LHCb breaks it, ATLAS and CMS suffer the consequences. This is clearly bad.
One can then discuss whether issuing 13,000 srm-gets should be considered a DoS or a reasonable activity from our users. I really do think that as a Tier-1 we should withstand this load with no problems. As we post this, the storage team at PIC and the LHCb data management experts are in contact to try to learn what exactly went wrong and how to fix it.
Following the saying "better late than never", ATLAS started seriously testing the pre-stage procedure for reprocessing at the Tier-1s just a few days after LHCb. This is good news: it is the only way for us to learn how to configure our system so that it can deliver the required performance. Surely our SRM will die several times during this testing, but I hope it will converge to a reliable configuration... ideally before spring 2009.
The second contribution to PIC's unreliability last month came from the network. On 23 September the Spanish NREN suffered a major outage due to electrical and cooling problems in the TELVENT data centre which hosts the NREN equipment. This resulted in a complete network outage at PIC of about 10 hours. Again, we see electrical and cooling issues at the very top of the LCG service risk list. In the end, it looks like one of the trickiest bits of building such a complex computing infrastructure is just plugging it in and cooling it down.

LHC up&down

So, it's been quite a long time since we last posted to the blog. This is not because we went away, no. We have just been a bit busy here in September (and not only because of the LHC). Anyway, we are back in the blogosphere and will keep reporting about the LHC activities at PIC regularly.
It is quite funny that the quietest month on our blog was probably the most visible one for the LHC all around the world. Well, we can always say "we did not talk about the LHC here since you could read about it in any newspaper" :-)
So, the two big LHC things that happened in September, as all of you know, are that the LHC started circulating beams on 10 September and that it then had a major fault on the 19th. You can read the details of both events anywhere on the web. I will just mention that those were quite special days in our community: the big excitement on the 10th, and then the "bucket of cold water" a few days later, could be felt everywhere. Even the daily operations meeting was less crowded than usual, since in those first days it was difficult not to feel a bit of "what do we do now?"
I think it is now quite clear to everyone that life goes on. We at the LHC Computing Grid continue operations exactly as we have been doing for months. We are not receiving p-p collision data, true, but the data flow did not stop: neither cosmics data taking nor Monte Carlo generation has stopped.
We have said many times in the last years that the LHC is an extremely complex machine and that it might take a long time to put it into operation. Well, now we can see this complexity right in front of us. There it is. Life goes on.

Monday 8 September 2008

CMS bends Cosmic Muons...

The CMS Superconducting Magnet is back on the scene. The cooling down to the nominal temperature of 4.5 K was achieved at the beginning of August. On August 25th, at 8pm, the final commissioning of the Magnet started, working at night to leave the day free for the forward region detector assembly.

Last Friday night, September 5th, the current was set to 14500 amps (3 Tesla central field) for almost two hours to allow an extensive run of all sub-detectors. Cosmic muons bend in the presence of a magnetic field. The data from this "3 Tesla" magnet commissioning test were distributed to all CMS computing centres, and the bent tracks were then reconstructed.

With all the parameters within their reference values, the first phase of the Magnet commissioning underground can be considered complete. The test plan was fully carried out within the time allowed. The next step will be the test at full current... and bending the products of proton-proton collisions!

Wednesday 20 August 2008

Reliability under control... ready for data?

It has been quite a long time since we last commented on the monthly reliability results of the Tier-1 in this blog. This is not because we have stopped monitoring them, no! Actually, the opposite is true: we are now looking at all sorts of monitoring daily. The SAM critical alerts now generate an SMS to the Manager on Duty mobile phone. All of this to try to notice any possible problem as soon as it appears. It would be nice to catch them BEFORE they appear, but we are not there yet :-)
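The SAM-to-SMS chain itself is conceptually trivial; a stripped-down sketch of the idea (the status source and the email-to-SMS gateway address are invented placeholders, not our production setup):

    # Hypothetical SAM-critical-to-SMS forwarder, run periodically from cron.
    import smtplib
    from email.mime.text import MIMEText

    SMS_GATEWAY = "34600000000@sms-gateway.example.org"   # placeholder email-to-SMS address

    def failing_critical_tests():
        # Placeholder: a dump of the latest SAM critical-test failures produced by our monitoring.
        with open("/var/run/sam_critical_failures") as f:
            return [line.strip() for line in f if line.strip()]

    failures = failing_critical_tests()
    if failures:
        msg = MIMEText("SAM CRITICAL at PIC: " + ", ".join(failures))
        msg["Subject"] = "SAM alarm"
        msg["From"] = "monitor@example.org"
        msg["To"] = SMS_GATEWAY
        server = smtplib.SMTP("localhost")
        server.sendmail(msg["From"], [SMS_GATEWAY], msg.as_string())
        server.quit()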

The reliabilities scored by PIC in the last three months (May to July) have been flat at 99%. This is OK, and I think it tells us that the Manager on Duty shifts are maturing. On the one hand, the tools available for the shifters are improving: a cleaner Nagios and better documentation and procedures (thanks to all of the service managers around); on the other, the shifters are doing a great job.

The availabilities for these last three months have moved up: 92%, 96% and 97% for May, June and July, respectively. In these three months we started implementing a regular scheduled downtime on the second (sometimes first) Tuesday of the month. Knowing the SD date in advance makes planning and user notification smoother. The availability in May is somewhat lower because we had an extra downtime on the 14th: the regional network provider had to upgrade some equipment, so PIC had to be disconnected for a few hours.

I always say that SAM monitoring has proven to be a very useful tool for sites to improve their stability. The experiments have long complained that the results sometimes do not reflect reality, especially because the tests run under the "fake" OPS VO. The solution is of course for them to implement their VO-specific tests in the SAM framework. This has been ongoing work for several months, but it is still not completely stable.

So, it looks like we will get the first LHC data with our VO-specific SAM glasses a bit dirty... anyhow, I am sure that the "real data pressure" will help all of this converge, so that our shifters will have even better tools to know what's going on.

Monday 28 July 2008

Euro Science Open Forum in Barcelona


One week ago we had the ESOF08 conference in Barcelona. This was a BIG event devoted to science communication. More than 3000 people attended the "scientific programme" during last weekend. Not bad, taking into account that the weather in Barcelona was just lovely and the beach only a few metro stations away...

There were several presentations about the LHC. On Saturday morning the physics motivation and experiment status presentations were given by several people from CERN. Especially interesting and funny, as usual, was the presentation by Álvaro de Rújula. Unfortunately for the people who were not there, he still uses "analogue transparencies" (made by hand), so there is no way to download a copy to your PC.

We organised a session on the LHC data processing and analysis challenge on Sunday, and invited Pere Mato and Tony Cass from CERN as speakers. Pere first gave a talk on the challenge of the TDAQ systems at the LHC: filtering out and reducing the number of events from the 40 MHz collision rate down to the 100 Hz that can be permanently stored. Then Tony Cass presented the main challenges that the CERN computing centre is facing as the Tier-0 of the LHC Grid. Finally, I presented the LHC Computing Grid and the key role of this huge distributed infrastructure in making the LHC data analysis feasible.

There were quite a number of questions at the end of the session (not bad for a Sunday-after-lunch one). Besides the most repeated one, "when exactly will the LHC start, and how many days later will you discover new physics?", there was an interesting question about the similarities and differences between our LHC Grid and the now-so-famous Cloud Computing. We answered that, as of today, the LHC Grid and the clouds available out there (like Amazon's) are quite different. The LHC data processing, besides huge computing and storage capacities, needs very high bandwidth between the two. Tier-1s are data centres specialised in storing petabytes of data and mining through all of it with thousands of processors in a very efficient way. Trying to use the commercial clouds to do this today, besides being too expensive, would most probably not meet the performance targets.

That said, we should all keep an eye on this new hype word, "the Cloud", as it will surely evolve in the next years and I am afraid our paths are poised to meet at some point. The LHC is today not a target customer for these clouds, but what these giant companies are doing in order to be able to sell "resources as a service" is indeed very interesting and, as Wladawsky-Berger notes, is driving an "industrialisation" of IT data centres in a similar way as, 25 years ago, companies like Toyota industrialised the manufacturing process.

So, more productive, efficient and high-quality computing centres are coming out of the clouds. We will definitely have to look up at the sky every now and then, just to be prepared.

Thursday 10 July 2008

Cosmic Rays Illuminate CMS II !

Some displays of selected CMS events containing global muon tracks are now available. 5% of the processed events do contain a global track, often also with calorimetric hits nearby. For rendering reasons, only part of the tracker is shown, even if most of the layers are hit. That's good news overall!


Wednesday 9 July 2008

Cosmic Rays Illuminate CMS!

The third phase of the Cosmic Run at Zero Tesla (CRUZET3) is keeping all CMS collaborators busy this week... From 7 to 14 July this global commissioning activity is expected to yield millions of detector triggers, and ~100 TB of data coming from the detector will be transferred to the Tier-1 sites. CMS is located 100 metres underground; even so, highly energetic cosmic muons are capable of reaching and completely crossing the CMS detector. These muons are very useful for studying and commissioning different parts of the detector.
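As a rough feel for what ~100 TB in a one-week run means in terms of sustained bandwidth, a back-of-the-envelope calculation (illustrative numbers only, decimal units):

    # Back-of-the-envelope: sustained rate needed to move ~100 TB in a one-week run.
    volume_tb = 100.0                      # ~100 TB expected over the run
    seconds   = 7 * 24 * 3600.0            # one-week run (7-14 July)
    rate_mb_s = volume_tb * 1e6 / seconds  # 1 TB taken as 1e6 MB
    print("average aggregate rate to the Tier-1s ~ %.0f MB/s" % rate_mb_s)   # ~165 MB/s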

The fraction of CMS sub-detectors participating in CRUZET3 has steadily increased and for the first time includes all its components: the DT muon system, the RPC barrel, the CSC endcap, the HCAL and barrel ECAL calorimeters, and the recently installed silicon strip tracker (the biggest tracker detector ever built!).

Last night (9.7.2008), over 1 million cosmic ray events were reconstructed in the tracker system. This is the first time we see triggered cosmic ray tracks in both the TIB and the TOB at the tracker level:

Primary datasets are created using the new Tier-0 "repacker" in almost real time and transferred to the CAF and the Tier-1 sites for prompt analysis. The IN2P3 Tier-1 has custodial responsibility for the CRUZET3 data, although all Tier-1 sites are constantly receiving the cosmic data. During these first two days, ~3 TB of data have landed at PIC, the best Tier-1 site from the rate and quality point of view (curiously, yesterday we spent the whole day in scheduled downtime!).

Data volumes are expected to grow later this week as systems become better integrated into this commissioning exercise... So, stay tuned! ;)

Tuesday 10 June 2008

May Data Transfers from CERN: CMS @ CCRC'08

During the May CCRC'08 tests, CMS tested the reliability and robustness of data transfers from CERN to all Tier-1 centres. The main metric was to be able to export data from CERN above the nominal rate (600 MB/s) for more than 3 days. The individual metric was satisfied at all Tier-1 centres except FZK and FNAL.

For PIC, the share of the import target was about 57 MB/s. PIC was above the metric every day, on some days importing a factor of two more than requested.

It is important that data flows from CERN in a smooth way. From the reliability and robustness point of view, being above the metric for only 3-4 days would be really bad for data taking, as data could get stuck at CERN and overrun the CMS buffers. PIC was above the metric during the whole month, and a comparison with the other Tier-1s can be seen in this table.


Last, but not least, the mean May '08 rate from CERN to PIC was 70 MB/s, with an impressive data transfer quality of 90%. The test was successful and quite impressive for PIC indeed!

Tuesday 3 June 2008

A picture is worth a thousand words...

During run 2 of CCRC08 (Common Computing Readiness Challenge), ATLAS tested the full chain of distributed computing activities as if the detector were running: CERN=>T1 data export, T1 cross-transfers, T1=>T2 data replication, data reprocessing and simulated event production. The overall exercise was a success, in spite of small failures and outages that were not serious enough to spoil the tests. We tested the full chain and pushed the distribution over the limits (200% of nominal rates), and now we are more confident while waiting for the protons to collide (end of August!).

PIC demonstrated its reliability during the whole month, achieving an efficiency of 91% (best Tier-1) acquiring data. The highly demanding reprocessing jobs pushed the worker nodes and disk pools to the limit, and the exercise was very useful for the collaboration. The Tier-2s also received data without major problems. I want to thank the whole PIC team, who made such a good performance possible in every single activity. And, as the title says... a picture is worth a thousand words:

Friday 30 May 2008

Last chance to test, gone

So here we are, consuming the final hours of the last scheduled test of the WLCG service. In the next weeks we should just be making final preparations, waiting for the real data to arrive.
Overall, the Tier-1 services have been up and running basically 100% of the time. As the two most relevant issues, I would mention first the brutal I/O of the CMS skimming jobs. After they were limited (to 100, according to CMS) they were still running until mid-Thursday, sustaining quite a high load on the LAN (around 500 MB/s). On Wednesday 28 May, in the evening, ATLAS launched a battery of reprocessing jobs which very quickly filled up more than 400 job slots. Apparently all of these jobs read the detector conditions data from a big (4 GB) file sitting in dCache. This file of course quickly became super-hot, since all of these jobs were trying to access it simultaneously. This caused the second issue of the week: the output traffic of the pool in which this file was sitting immediately grew to 100 MB/s, saturating the 1 Gbps switch that (sadly, and until we manage to deploy the definitive 2x10 Gbps 3Coms) links it to the central router. This network saturation caused the dCache pool-server control connection to lose packets, which eventually hung the pool.
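To get a feel for why a single hot file hurts so much, a back-of-the-envelope sketch (illustrative numbers, assuming all ~400 jobs stream the whole file through that single 1 Gbps uplink):

    # Illustrative: ~400 jobs sharing the one 1 Gbps uplink of the pool holding the hot file.
    link_mb_s = 1000 / 8.0          # 1 Gbps is roughly 125 MB/s
    jobs      = 400
    file_gb   = 4.0
    per_job_mb_s  = link_mb_s / jobs                        # ~0.3 MB/s per job
    hours_to_read = file_gb * 1000 / per_job_mb_s / 3600    # time for each job just to read the file once
    print("~%.2f MB/s per job, ~%.1f h to read the 4 GB conditions file" % (per_job_mb_s, hours_to_read))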
At first sight it seems that the Local Area Network has been the big issue at PIC this CCRC08. Let's see what the more detailed post-mortem analysis teaches us.

Friday 23 May 2008

LAN is burning, dial 99999

This has been an intense CCRC08 week. We have kept PIC up and running, and the performance of the services has been mostly good. However, we have had some interesting issues from which I think we should try to learn some lessons.
CMS has been basically the only user of the PIC farm this week, due to the lack of competition from other VOs. It was running about 600 jobs in parallel for some days. Rapidly, we saw these jobs start to read input data from the dCache pools at a huge rate. By Tuesday the WNs were reading at about 600 MB/s sustained, and both the disk and the WN switches were saturating. On Thursday at noon, Gerard raised some of the Thumpers' network uplinks from 5 to 8 Gbps with three temporary cables crossing the room (yes, we will tidy them up once the long-awaited 10 Gbps uplinks arrive), and the extra bandwidth was immediately eaten up.
On Thursday afternoon the WNs spent some hours reading from disk at a record rate of 1200 MB/s. We then tried to limit the number of parallel jobs CMS was running at PIC, and saw that with only 200 parallel jobs they were already pulling 1000 MB/s.
Homework for next week is to understand the characteristics of these CMS jobs (it seems they are the "fake-skimming" ones). What is their MB/s-per-job figure? (I have been asking the same question for years; here it is once again.)
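A first rough answer can actually be read off this week's numbers (taking the observations above at face value):

    # Rough MB/s-per-job estimate from this week's observations.
    print("600 jobs reading 600 MB/s  -> %.1f MB/s per job" % (600.0 / 600))
    print("200 jobs reading 1000 MB/s -> %.1f MB/s per job" % (1000.0 / 200))
    # The jump suggests the jobs were I/O-limited at 600 slots: with fewer competitors
    # each job reads faster, so the per-job figure is not a constant of the job itself.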
The second CCRC08 hiccup of the week arrived yesterday evening: ATLAS transfers to PIC disk started to fail. Among the various T0D1 ATLAS pools, there were two with plenty of free space while the others were 100% full, and for some reason dCache was assigning transfers to the full pools. We sent an S.O.S. to dCache support and they answered immediately with the magic configuration recipe to solve it (thanks, Patrick!).
Now it looks like things are quiet (apart from some blade switches burning) and green... ready for the weekend.

Friday 16 May 2008

Intensive test of data replication during the CCRC08 run2

During the last three days, within CCRC08 (Common Computing Readiness Challenge) run 2, PIC's performance importing data was impressive. The scope of the test was to replicate data among all nine ATLAS Tier-1s: each one had a 3 TB subset of data which was replicated to every other T1 according to certain shares, and overall more than 16 TB were replicated from all the other T1s to PIC. We started the test on Tuesday morning, and after 14 hours PIC had imported more than 90% of the data at impressive sustained transfer rates; moreover, we reached 100% efficiency from seven T1s and 80% from the other two. During the first kick, when all the dataset subscriptions were placed in bulk, we reached approximately 0.5 GB/s of throughput with 99% efficiency (see plot number 1):

After 4 hours the import efficiency was still at 99% and the accumulated mean throughput was over 300 MB/s. This led to the correct placement of 16 TB of data in 14 hours.
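A quick sanity check that those two numbers are consistent with each other (decimal units assumed):

    # Sanity check: 16 TB placed in 14 hours implies a mean rate of roughly 320 MB/s.
    volume_tb = 16.0
    hours     = 14.0
    mean_mb_s = volume_tb * 1e6 / (hours * 3600)
    print("mean throughput ~ %.0f MB/s" % mean_mb_s)   # ~317 MB/s, consistent with ">300 MB/s"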

On the other hand, we observed a different behaviour while exporting data to the other T1s: the efficiencies were not as brilliant as for importing. Note that when exporting we rely on external FTS services, so some of the errors are not related to our own capabilities; still, we did observe some timeouts from our storage system. Network capacity was indeed a limit, but we saw saturation at the pool level while importing and not when others were reading, which means the network between the pools and the data receivers was not to blame for the efficiency drop while exporting. Our current insight is that our dCache has problems returning the TURLs (transfer URLs) when it is under heavy load (a well-known dCache bug), which could cause the kind of errors observed during the exercise. Besides that, almost all the T1s got more than 95% of the data from PIC, although sometimes not at the first attempt, so the exercise can be considered more than successful both from the PIC side and from ATLAS's.

In my opinion this was one of the most successful tests run so far within the ATLAS VO, as it involved all the T1s under a heavy load of data cross-replication. The majority of sites performed very well and only a couple experienced problems. Let me recall once more the complexity of the system: a whole chain of services has to be ready and stable in order to complete every single transfer (the ATLAS data management tools, SE, FTS, LFC, network, etc.).

We have to keep working hard on the robustness of the current system, which has improved enormously in the last months: dCache is performing very well, and the new database back-end for the FTS ironed out some of the overload problems found in past exercises. For that reason I want to deeply thank all the people involved in maintaining the computing infrastructure at PIC; these results show we are on the right track. Congrats!

Thursday 15 May 2008

In April, a thousand rains

Following the Spanish proverb "en abril, aguas mil" ("in April, a thousand rains"), the rain finally fell on Catalonia in April after months of severe drought. For the PIC reliability metric, the rainy season started a bit earlier, in March, and it looks as if it kept raining in April: that month our reliability was a yellowish (not green, not red) 90%. Slightly better than March's (...positive slope, which is always good news) but still below the 93% WLCG goal for Tier-1s. The main contribution to last month's unreliability was, believe it or not, the network. The networking conspiracy we suffered on 28-29 April is responsible for at least 8 of the 10 reliability percentage points we lost last month. The non-dedicated backup described by Gerard a few hours ago will help lower our exposure to network outages, but we should keep pushing for a dedicated one in the future.
We also had other operational issues last month which contributed to the overall unreliability. Every service had its grey day: in the Storage service, a gridftp door (dcgftp07 by name) mysteriously hung in such a funny way that the clever dCache could not detect it and kept trying to use it. As far as I know, this is still in the poltergeist domain; I hope it will just go away with the next dCache upgrade. The Computing service also had some hiccups... it is not nice when an automated configuration system decides to erase all of the local users on the farm nodes at 18:00.
We even had a nice example of collaborative destruction between services last month: a supposedly routine and harmless pool replication operation in dCache ended up saturating a network switch which happened to have, among other things, the PBS master connected to it, which immediately lost connectivity to all of its workers. Was it sending the jobs to the data, or bringing the data to the CPUs? Anyhow, a nice example of Storage-Computing love.

First LHCOPN backup use at PIC

This is the first time PIC is online while fiber rerouting works are being carried out.

Tonight we had a pair of scheduled network downtimes. The first one (red in the image) affected the PIC-RREN connection, which is a single 10 Gbps fibre link from PIC to Barcelona; therefore, as scheduled, we had no connectivity from 21:35 to 23:20.



The second scheduled downtime was due to rerouting work on the French part of the PIC-CERN LHCOPN fibre. This used to be critical for us, since we were isolated from the LHCOPN while such work was taking place, but not anymore: our NREN now routes us through GÉANT independently of where the data is going, so we stay connected! As you can see in the orange-coloured areas of the image, while reaching the OPN through our NREN (Anella+RedIRIS) we are limited to 2 Gbps (our RREN uplink), but we still reach it, so we can finally say we have a [non-dedicated] backup link.

Waiting for the LHCOPN dedicated backup link won't be so hard now.

Monday 5 May 2008

Networking conspiration

Last Thursday and Friday (1st and 2nd May) we had a scheduled downtime for the yearly electrical maintenance of the building. Actually, this was the "Easter intervention" that was moved at the last minute due to the unscheduled "lights off" that we had on 13 March. Anyway, following what now seems to be a tradition, this time we also had a quite serious unscheduled problem just before our so-nicely-scheduled downtime. This time it was the network that caught us by surprise. On Monday 28 April, our regional NREN had a scheduled intervention to deploy a new router to separate the switching and routing functionalities. We had been notified about this intervention and told we might see 5-10 minute outages within a window of 4 hours.
In the end, the reality was that the intervention completely cut our network connectivity at 23:30 on 28 April. The next morning, at 6:00, we recovered part of the service (the link to the OPN), but the non-OPN connectivity was not recovered until 17:30 on 29 April.
When one thinks about the network, one tends to assume "it's always there". On that N-day we got to challenge this popular belief, as we had not one network incident but two. Just four hours after we recovered the OPN link, somewhere near Lyon a bulldozer destroyed part of the optical fibre that links us to CERN. This kept our OPN link completely down from 10:00 on 29 April until 01:00 on 30 April.

Not bad as an aperitif, hours before a two-day scheduled intervention...

Throttling up

After the last power stop, it is time for PIC to throttle up and catch up with the regular activity demanded by the LHC experiments. This is also a good test of the ability of the services, both at the site and at CERN, to get back to steady running after some days of outage. Concerning the two most important and critical services, computing power and data I/O, nominal activity was reached extremely fast: only 30 minutes passed between the start of the PIC pilot factory and more than 500 jobs running successfully (pic. 1):

For the data transfers the delay was even shorter: after restarting the site services at CERN, it only took about 5 minutes to start moving data in and out of PIC (pic. 2, files done, and pic. 3, throughput in MB/s):

Given the complexity of the system and the number of cross-dependencies of each of these services that were successfully recovered, one can conclude that the "re-start" was extremely successful :)... but of course everything can be improved!

Thursday 24 April 2008

March reliability, or is the glass half empty or half full?

March reliabilities were published last week, and unfortunately this time the numbers are not good for PIC: 86% reliability. It is our worst result since last June and the first time since then that we have not reached the target for Tier-1s. Anyhow, we can still try to get something positive out of it: a measure of how much events that seem "external" to us can affect our site reliability.
The March unreliability came mainly from three events. First, the unexpected power cut that affected the whole building on the afternoon of 13 March (I have not seen the latest report about it, but they were talking about somebody pressing the red button without noticing). Second, there was an outage in the OPN dark fibre between Geneva and Madrid on the 25th that lasted almost three hours.
The last source of SAM unreliability was of a slightly different nature: the OPS VO disk pools filled up due to massive DTEAM test transfers. So this last one was within our domain, but it actually did not affect the LHC experiments' service, only the monitoring. Anyhow, we have to take SAM as it comes, for the good and for the bad.
Last month we also had our yearly electrical shutdown during the Easter week. The impact of that scheduled downtime appears in the availability figure, which drops by 12 percentage points, down to 74%.
So, it was a tough month in terms of management metrics to be reported (we will see these low points in graphs and tables many times in the following months, that's life). Anyhow, the scheduled intervention went well, and the LHC experiments were not that much affected, so I really believe that our customers are still satisfied. Let's keep them like this.

Friday 18 April 2008

Pic farm kicking

Two days ago the new HP blades were deployed at PIC, after the required upgrade of the power lines at the rack level. The new CPUs are amazingly fast, and the ATLAS production system is feeding our nodes with a huge number of jobs which are being devoured by the blades. We reached a peak of more than 500 jobs running in parallel, and almost 1000 jobs finished in one day counting only the ATLAS VO.
This is clearly visible in the following figures: an impressive ramp-up in walltime and in jobs finished per day:

Notice there is some red in the figures, as there were some configuration errors at the very beginning, quickly resolved by the people maintaining the batch system (new things always bring new issues!).

The contribution of the Spanish sites to ATLAS Monte Carlo production has been throttled up; although we are far from the gigantic Tier-1s, we are growing steadily and showing robustness (figure below: Spanish sites are tagged as "ES" and shown in blue):

We keep seeing the advantage of the pilot job scheme: the new nodes were rapidly spotted by these "little investigators", and a few hours after the deployment all the blades were happily fizzing.

Monday 14 April 2008

Network bottleneck for the Tier-2s


Last week we reached a new record at PIC: the export transfer rate to the Tier-2 centres. On Wednesday 9 April, around noon, we were transferring data to the Tier-2s at 2 Gbps. CMS started very strong on Monday: Pepe was so happy with the resurrected FTS that he started to commission channels to the Tier-2s like hell. Around Thursday CMS lost a bit of steam, but ATLAS kicked in, exporting data to UAM at quite a serious rate, so the weekly plot ends up looking quite fancy (attached picture).
The not-so-good news is that this 2 Gbps is not only a record but also a bottleneck. At CESCA, in Barcelona, the Catalan RREN routes our traffic to RedIRIS (non-OPN sites) through a couple of Gigabit cables. Last October they confirmed this fact to us (and now we have measured it ourselves) and also told us that they were planning to migrate this infrastructure to 10 Gbps. So far so good. Now let's see if, with the coming kick-off of the Spanish Networking Group for WLCG, this plan becomes reality.

Monday 7 April 2008

FTS collapse!

Last week was the week of the FTS collapse at PIC. Our local FTS instance had been getting slower and slower for quite a while. The cause seemed to be the high load on the Oracle backend DB: the Oracle host had a constant load of around 30, and we could see a clear bottleneck in the I/O to disk. In the end, three weeks ago, we more or less concluded that the cause was that the tables of the FTS DB contained ALL the transfers done since we started the service. One of the main tables had more than 2 million rows! Any SELECT query on it was killing the server with IOPS (I/O requests per second, at the level of 600 according to Luis, our DBA). Apparently, an "FTS history package" that does precisely this needed cleanup had existed for almost a year. However, it seems it had some problem, so it was not really working until a new version was released in mid-March this year. Unfortunately, that was too late for us: the history job was archiving old rows too slowly, and after starting it the load on our DB backend did not change at all. We were stuck.
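For the record, what the history package is supposed to do is conceptually simple: move or delete finished transfers older than some cutoff, in small batches so that the cleanup itself does not hammer the DB. A minimal sketch of that idea, with hypothetical table and column names (this is not the actual FTS history package):

    # Sketch of batched archiving of old transfer rows (hypothetical schema, cx_Oracle module).
    import cx_Oracle

    BATCH = 5000   # small batches keep the extra I/O load (and undo) bounded

    conn = cx_Oracle.connect("fts_owner/secret@ftsdb")   # placeholder credentials/DSN
    cur = conn.cursor()
    while True:
        # t_transfer, finish_time and job_state are illustrative names, not the real FTS schema.
        cur.execute("""
            DELETE FROM t_transfer
             WHERE finish_time < ADD_MONTHS(SYSDATE, -3)
               AND job_state IN ('Done', 'Failed')
               AND ROWNUM <= :batch""", batch=BATCH)
        deleted = cur.rowcount
        conn.commit()          # commit per batch
        if deleted < BATCH:
            break
    cur.close()
    conn.close()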
The DDT transfers for CMS were so degraded that most of the PIC channels had been decommissioned in the last days (see CMS talk). On Thursday 3 April, we decided to solve this with a radical recipe: restart the FTS with a completely new DB. We lost the history rows, but at least the service was up and running again.
Now, let's try to recommission all those FTS channels ASAP... and get off the CMS blacklist!

Thursday 20 March 2008

Easter downtime, and February reliability

We have just had our yearly shutdown. PIC was completely stopped for more than 24 hours... how cool is a silent machine room!
The main intervention was the upgrade of 5 racks to 32A power lines. Now we can plug in our HP blade centres. We will see how PBS behaves when we scale up the number of workers by a factor of three.
This week we also got the results for the February reliability from the WLCG office. Our colleagues from Taiwan got the gold medal (100% reliability) breaking CERN's monopoly on this figure. PIC's reliability was very good as well. Actually, we reached our record: 99% reliability. And we got the silver medal for February.
The small 1% we missed this time to reach 100% was due to a few hours of problems caused by a log file not rotating on the pnfs server, and by a not-transparent-enough intervention in the dCache information system, which is still quite patchy for SRMv2.2.
Most probably next month's result will not be so green, due to the unscheduled power cut we had last week and the scheduled yearly shutdown this week. So, let's enjoy our silver medal until the next results come.

Friday 14 March 2008

Power Cut!

Yesterday evening, around 16:30 (luckily we were still around), the lights went off. We still do not have the complete picture, but it seems that somebody pushed the "switch off the building" red button by mistake.
The cooling of the machine room stopped, but the machines at PIC kept working, powered by the UPS. After 10 minutes in this situation, we decided to start stopping services. Not much later, the (yet to be understood) glitch arrived: all of the racks lost power for less than one second. After this, and to prevent servers from restarting after a dirty stop, we simply switched off all the racks at the main electrical board.
Today, at 8:00, we started switching PIC back on. The good news is that it looks as if we did not have many hardware incidents after the dirty stop. The lesson learnt (the hard way) is that we are still far from a controlled and efficient complete shutdown. We will have to repeat this on Monday, due to the yearly electrical maintenance, so overall it will be a good week to debug all these procedures.
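A first step towards that controlled shutdown could be as simple as scripting the stop order so that nobody has to remember it under pressure. A minimal sketch of the idea (hypothetical host names, assuming passwordless root SSH and that the batch queues have already been drained):

    # Hypothetical staged shutdown: stop service groups in dependency order.
    import subprocess

    # Compute layer first, storage pools next, core services last.
    SHUTDOWN_ORDER = [
        ("worker nodes",  ["wn001.example.org", "wn002.example.org"]),
        ("dCache pools",  ["pool01.example.org", "pool02.example.org"]),
        ("core services", ["srm.example.org", "pnfs.example.org", "pbs-master.example.org"]),
    ]

    for group, hosts in SHUTDOWN_ORDER:
        print("Stopping group: %s" % group)
        for host in hosts:
            # A real script would check the return code and wait for each group to go down.
            subprocess.call(["ssh", "root@" + host, "shutdown", "-h", "now"])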

Tuesday 11 March 2008

Crossing the highway

Last week we had in Madrid the first meeting of the "Board for the follow-up of GRID Spain activities". I think this is the first time such a board has been created to follow the progress of the projects funded by the Particle Physics programme of the Spanish Ministry of Education. The classic yearly written report has been upgraded to a meeting with oral presentations and an evaluation board.
The meeting started with three presentations, one for each of the experiments, aiming to report on the state of the LCG from the point of view of each of them. It was quite interesting to see how the three talks were presenting completely different views about the same issue.
Both ATLAS and CMS mentioned the need for some sort of Tier-3s for users to do their final analysis, and there was some general concern that such infrastructures are currently not being funded. The LHCb presentation was, from my point of view, the one that most directly presented the "view from the LCG users". The small number of physicists actually using the Grid was mentioned, and the most common problems they encounter were described. There was the usual "30% of the jobs sent to the Grid fail", and "sometimes the sites do not work, sometimes the experiment framework does not work". The result is always the same: the user just leaves, saying "the Grid does not work". After some years of working to make "a Grid site work", I really do think that many of the problems remaining today are due to the experiment frameworks not working, or not properly managing the complexity of the Grid.
I presented the status of the Tier-1 at PIC, focusing on the latest results obtained in the recent CCRC08 test. Most of the results were actually quite positive, so I am quite confident that the board got the message "the Tier-1 at PIC is working". It was also quite helpful to see direct references to the good performance of PIC in some of the Tier-2 presentations, like the LHCb one (thanks, Ricardo).
There were two points I would like to highlight from the PIC presentation. The first one arose when showing two plots where the actual cost of equipment was compared to two CERN estimates: the one used in the proposal (Oct 2006) and the update received three months ago. The results suggest that the hardware cost is lower than the estimate in the proposal. There are plenty of moving parameters in this project; the hardware cost estimate is one, but we should not forget that the event size, the CPU time and memory needed to generate a Monte Carlo event, the overall experiment requirements, etc. are also parameters with uncertainties of the order of 30 to 100%. If we eventually get to 2010 and the computing market has been such that prices have decreased faster than expected, good news: we will report it (as we already did with the previous project and the delay of the LHC) and propose to use the "saved" money for the 2011 purchases.
The second issue arose from a question by Les Robertson, who was a member of the board: "when do you expect PIC to run out of power?". As the CPU and storage capacity of the equipment is (luckily for us!) growing exponentially, the power consumption of these wonderful machines is also heading skyward. Soon the total input power at PIC will be raised from 200 kVA to 300 kVA. Though it is not an easy estimate, we believe this should be enough for the current phase of the Tier-1, up to 2010. Beyond that date, we will most probably have to think about a major upgrade of the PIC site. Next to the UAB campus, on the other side of the highway, a peculiar machine is being built: a synchrotron ring. That kind of machine normally comes with a BIG plug... should we try and cross the highway to get closer to it?

Tuesday 19 February 2008

January Availability

Last week we received the availability data for the LCG Tiers for January 2008. This time, at PIC we were just on top of the target for Tier-0/Tier-1: 93%.
The positive reading is that we are still one of only three Tiers that have reached the reliability target every month since last July; the other two sites are CERN and TRIUMF.
The negative reading is that 93% looks like too low a figure, when we had got used to scoring over 95% in the last quarter of 2007.
The 7% unreliability of PIC in January 2008 is entirely due to a single incident in the Storage system on the weekend of 26-27 January. The Monday before (21/01/2008) had been a black Monday in the global markets (European and Asian exchanges plummeted 4 to 7%), so we still do not rule out that our failure might be correlated with that fact.
However, Gerard's investigations point to a less glamorous cause: a problem with the system disk of the dCache core server. The funny symptom was that all the GET transfers from PIC were working fine, while the PUT transfers to PIC were failing. The problem could only be solved by a manual intervention of the MoD, who came in on Sunday to "press the button".
So, the "moraleja" as we call it in Spanish, could read: a) we need to implement remote reboot at least in the critical servers, b) a little sensor that checks that the system disk is alive would be very useful.
Now, back to work, and let's see if next month we reach the super-cool super-green 100% monthly reliability that up to now only CERN is able to reach, with apparently not much effort.

Monday 18 February 2008

2008, the LHC and PIC

So, this is 2008: the year the LHC will (finally) start colliding protons at CERN. At PIC we are deploying one of the so-called Tier-1 centres: large computing centres that will receive and process the data coming from the detectors online. There will be eleven such Tier-1s worldwide. Together with CERN (the Tier-0) and almost 200 more sites (the Tier-2s), they will form one of the largest distributed computing infrastructures in the world for scientific purposes: the LHC Computing Grid.
So, handling the many petabytes of data from the LHC is the challenge, and the LCG must be the tool.