Thursday, 24 April 2008

March reliability, or is the glass half empty or half full?

March reliabilities were published last week, and unfortunately this time the numbers are not good for PIC: 86% reliability. It is our worst result since last June, and the first time since then that we miss the target for Tier-1s. Anyhow, we can still draw a positive reading from it: it shows how much events that seem "external" to us can affect our site reliability.
The March unreliability came mainly from three events. First, the unexpected power cut that affected the whole building on the afternoon of 13 March (I haven't seen the final report, but the talk was that somebody pressed the red button without noticing). Second, an outage in the OPN dark fibre between Geneva and Madrid on the 25th that lasted almost three hours.
The last source of SAM unreliability was of a slightly different nature: the OPS VO disk pools filled up due to massive DTEAM test transfers. This one was within our domain, but it actually did not affect the LHC experiments' service, only the monitoring. Anyhow, we have to take SAM as it is, for better and for worse.
Last month we also had our yearly electrical shutdown during the Easter week. The impact of that scheduled downtime shows up in the availability figure, which drops 12 percentage points, down to 74%.
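To see how a scheduled shutdown can pull availability well below reliability, here is a minimal sketch using the usual WLCG-style definitions (reliability discounts scheduled downtime, availability does not). The downtime hours below are illustrative guesses chosen to roughly reproduce the 86%/74% pair, not the real March logs:

```python
# Sketch of the two metrics, assuming the WLCG-style convention:
# availability counts all downtime, reliability excludes scheduled downtime.

HOURS_IN_MARCH = 31 * 24  # 744 h

def availability(total_h, sched_down_h, unsched_down_h):
    """Fraction of ALL hours the site was up."""
    up = total_h - sched_down_h - unsched_down_h
    return up / total_h

def reliability(total_h, sched_down_h, unsched_down_h):
    """Fraction of the non-scheduled hours the site was up."""
    up = total_h - sched_down_h - unsched_down_h
    return up / (total_h - sched_down_h)

sched = 104    # ~4.3 days of Easter electrical shutdown (assumed figure)
unsched = 90   # power cut + fibre cut + OPS pool incident (assumed figure)

print(f"reliability  = {reliability(HOURS_IN_MARCH, sched, unsched):.0%}")   # 86%
print(f"availability = {availability(HOURS_IN_MARCH, sched, unsched):.0%}")  # 74%
```

The same uptime divided by a smaller denominator is what keeps reliability at 86% while availability sinks to 74%.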
So, it was a tough month in terms of the management metrics to be reported (we will see these low points in graphs and tables many times in the coming months; that's life). Anyhow, the scheduled intervention went well and the LHC experiments were not much affected, so I really believe our customers are still satisfied. Let's keep them that way.

Friday, 18 April 2008

PIC farm kicking

Two days ago the new HP blades were deployed at PIC, after the required electrical current upgrade at the rack level. The new CPUs are amazingly fast, and the ATLAS production system is feeding our nodes with a huge amount of jobs, which are being devoured by the blades. We reached a peak of more than 500 jobs running in parallel, and almost 1000 jobs finished in one day, counting only the ATLAS VO.
This is clearly visible in the following figures: an impressive ramp-up in walltime and in jobs finished per day:

Notice there is some red in the figures, as there were some configuration errors at the very beginning, quickly resolved by the people maintaining the batch system (new things always bring new issues!).

The contribution of the Spanish sites to ATLAS Monte Carlo production has ramped up; although we are far from the gigantic Tier-1s, we are steadily growing and showing robustness (figure below: Spanish sites are tagged as "ES" and shown in blue):

We keep seeing the advantage of using the pilot jobs scheme: the new nodes were rapidly spotted by these "little investigators", and a few hours after the deployment all the blades were happily fizzing.

Monday, 14 April 2008

Network bottleneck for the Tier-2s

Last week we reached a new record at PIC for the export transfer rate to the Tier-2 centres. On Wednesday 9 April, around noon, we were transferring data to the Tier-2s at 2 Gbps. CMS started very strong on Monday: Pepe was so happy with the resurrected FTS that he started commissioning channels to the Tier-2s like hell. Around Thursday CMS lost a bit of steam, but it looks like ATLAS kicked in, exporting data to the UAM at quite a serious rate, so the weekly plot ends up looking quite fancy (attached picture).
The not-so-good news is that this 2 Gbps is not only a record, but also a bottleneck. At CESCA, in Barcelona, the Catalan regional network routes our traffic to RedIRIS (for non-OPN sites) through a couple of Gigabit cables. Last October they confirmed this fact to us (and now we have measured it ourselves), and also told us they were planning to migrate this infrastructure to 10 Gbps. So far so good. Now let's see if, with the coming kick-off of the Spanish Networking Group for WLCG, this plan becomes reality.

Monday, 7 April 2008

FTS collapse!

Last week was the week of the FTS collapse at PIC. Our local FTS instance had been getting slower and slower for quite a while. The cause seemed to be the high load on the Oracle backend DB: the Oracle host had a constant load around 30, and we could see a clear bottleneck in the I/O to disk. Three weeks ago we finally concluded that the cause was that the tables of the FTS DB contained ALL the transfers done since we started the service. One of the main tables had more than 2 million rows! Any SELECT query on it was killing the server with IOPS (I/O requests per second), which were at the level of 600 according to Luis, our DBA.
Apparently, an "FTS history package" that did precisely this needed cleanup had existed for almost a year. However, it seems it had some problem, so it was not really working until a new version was released in mid-March this year. Unfortunately, that was too late for us: the history job was archiving old rows too slowly, and after starting it the load on our DB backend did not change at all. We were stuck.
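As a rough illustration of the kind of batched archiving such a history job has to do (this is not the real FTS internals; sqlite3 stands in for the Oracle backend, and all table and column names are invented), moving old rows a bounded batch at a time keeps each transaction small instead of hammering the DB with one huge operation:

```python
# Toy sketch of batched archiving: move finished transfer rows older than
# a cutoff into a history table, one bounded batch per transaction.
# Hypothetical schema (t_file, t_file_history, finish_time) for illustration.
import sqlite3

def archive_old_transfers(conn, cutoff, batch_size=1000):
    """Archive rows with finish_time < cutoff; return how many were moved."""
    moved = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM t_file WHERE finish_time < ? LIMIT ?",
            (cutoff, batch_size))]
        if not ids:
            break
        marks = ",".join("?" * len(ids))
        conn.execute(f"INSERT INTO t_file_history "
                     f"SELECT * FROM t_file WHERE id IN ({marks})", ids)
        conn.execute(f"DELETE FROM t_file WHERE id IN ({marks})", ids)
        conn.commit()  # commit per batch: short transactions, steady I/O
        moved += len(ids)
    return moved

# Demo with fake data: 5000 transfers, archive everything "older" than 4000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_file (id INTEGER PRIMARY KEY, finish_time INTEGER)")
conn.execute("CREATE TABLE t_file_history (id INTEGER PRIMARY KEY, finish_time INTEGER)")
conn.executemany("INSERT INTO t_file VALUES (?, ?)", [(i, i) for i in range(5000)])
print(archive_old_transfers(conn, cutoff=4000, batch_size=500))  # 4000
```

A cleanup like this only helps if it can drain rows faster than new ones arrive, which is exactly where our history job fell short.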
The DDT transfers for CMS were so degraded that most of the PIC channels had been decommissioned in the last days (see CMS talk). On Thursday 3 April we decided to solve this with a radical recipe: restart the FTS with a completely new DB. We lost the history rows, but at least the service was up and running again.
Now, let's try to recommission all those FTS channels asap... and get off the CMS blacklist!