Friday 27 February 2009

January availability, hit by CMS and cooling

So, let's try to have a look at the Tier-1 availability and reliability figures for the past month of January... at least before February is over!
The result for reliability was 98%, just above the target for the Tier-1s (which is now 97%). This looks quite OK, but it is worth noting that our colleagues at the other Tier-1s are doing a very good job at their sites, so with 98% we only got 8th position in the ranking, tied with TRIUMF. Four centres got 100%, and three more got 99%. Quite impressive, isn't it?
Our 2% unreliability was mostly caused by the incident in the SRM service we had on the 21st of January. On that day we saw our dCache SRM server flying in the sky, with loads up to 300. The cause was traced to a bug in the CMS job software, which made the jobs issue recursive srmls queries against the SRM. Once again, we saw how easy it is to suffer a DoS from an innocent user, and how little we can do to protect ourselves against it.
For availability, however, we scored pretty low in January: 92%, well below the 97% target. This was caused by the cooling incident we suffered on Saturday 24th January. After shutting down PIC on Saturday at noon, we did not bring it back up until Monday morning. Here we see how fast availability goes down on weekends :-)
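Just to put a rough number on that, here is a back-of-the-envelope sketch (the ~44 hours is my own estimate of the Saturday-noon-to-Monday-morning window, not the official bookkeeping):

    # Rough estimate of the availability hit from a single weekend shutdown.
    hours_in_january = 31 * 24            # 744 h in the month
    downtime_hours = 44.0                 # assumed: Saturday ~12:00 to Monday ~08:00
    availability_loss = downtime_hours / hours_in_january
    print(f"{availability_loss:.1%}")     # ~5.9%: one weekend alone costs about 6 points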
With the LHC start around the corner, we should definitely be operating in full 24x7 mode by now. As we can see, we are not quite there yet, so it is time to think about the last step of the 24x7 journey (one could call it MoD-Phase3), which implements the required coverage for the critical services. First thing to do: make sure we have a proper definition of criticality.

Friday 6 February 2009

When the user becomes the enemy

Our poor SRM service has been the victim of a couple of user attacks in the last days. The user is always an innocent scientist somewhere trying to do some HEP research, who at some point starts hammering our SRM with requests that overload the system. It happened to us on the 21st of January with CMS, whose jobs suddenly started issuing recursive srmls queries due to a bug. This overloaded our SRM service so much that it could not handle other requests properly. Another event happened at the beginning of this week, when an ATLAS user from Germany requested a single file at PIC thousands of times. This was also traced to a bug, this time in the ATLAS Grid job framework. Even if these are innocent victims, we still need to protect ourselves against such events, and as of today there is no clear way to do it. We will need to work on splitting the SRM servers among VOs, as well as on being able to limit requests to the server in some way.
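To give a flavour of the kind of throttling we have in mind, here is a toy sketch of a per-user token bucket. This is only an illustration, not something our dCache SRM supports out of the box; the per-DN key and the limits are made up:

    import time
    from collections import defaultdict

    class TokenBucket:
        """Per-client token bucket: allow at most 'rate' requests per second,
        with bursts of up to 'burst' requests."""
        def __init__(self, rate=10.0, burst=50):
            self.rate = rate
            self.burst = burst
            self.tokens = defaultdict(lambda: burst)
            self.last = defaultdict(time.time)

        def allow(self, client_dn):
            now = time.time()
            # Refill tokens in proportion to the time elapsed since the last request
            self.tokens[client_dn] = min(
                self.burst,
                self.tokens[client_dn] + (now - self.last[client_dn]) * self.rate)
            self.last[client_dn] = now
            if self.tokens[client_dn] >= 1:
                self.tokens[client_dn] -= 1
                return True
            return False   # client over quota: reject or queue the request

    # Hypothetical use in front of the SRM request handler:
    limiter = TokenBucket(rate=5.0, burst=20)
    if not limiter.allow("/DC=es/DC=irisgrid/O=pic/CN=some-user"):
        pass  # e.g. return a "server busy" error instead of melting the SRM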

NFS, the batch killer, and the windy datacenter

Again, it has been a long time since we last posted in this blog. This is not because nothing is going on here at PIC. On the contrary: too many things are going on, leaving little time to write them down here.
Anyway, I will now try to briefly review the major issues we had in the last weeks. We had a couple of remarkable service problems. The first one, and the most severe, affected the Computing service during Christmas. The problem started on the 19th of December 2008, and it could not be fixed until the 12th of January. The origin of the problem was LHCb and CMS jobs which were accessing SQLite databases over NFS. This is known to be a bad practice, since it can effectively hang the processes accessing NFS, and that is exactly what happened at PIC. The batch farm quickly filled up with hung, unkillable jobs which in a few days completely blocked the service: the batch master saw all of the WNs with high load, so it could not deliver new jobs to run. We contacted the experiments and asked them to stop using SQLite over NFS, but we also learnt a useful lesson: we are missing some very important monitoring.
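One check that would have caught this early is counting, on each WN, the processes stuck in uninterruptible sleep (the 'D' state typical of jobs blocked on NFS I/O). A minimal sketch of such a check follows; the threshold and the way it would feed our monitoring are assumptions:

    import os

    def count_d_state_processes():
        """Count processes in uninterruptible sleep ('D' in /proc/<pid>/stat),
        the usual signature of jobs hung on NFS I/O."""
        count = 0
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/stat") as f:
                    # field after the last ')' is the process state
                    state = f.read().split(")")[-1].split()[0]
                if state == "D":
                    count += 1
            except OSError:
                continue  # process exited while we were scanning
        return count

    THRESHOLD = 5   # assumed: more than a handful of D-state jobs per WN is suspicious
    if count_d_state_processes() > THRESHOLD:
        print("WARNING: many processes blocked on I/O on this WN (NFS hang?)")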
The second problem arrived on Saturday the 24th of January, around noon. There was a huge wind storm affecting the whole of Spain and the south of France. Among other incidents, it caused disruptions of the electricity supply at the PIC building. The UPS system properly dealt with these short power cuts, but unfortunately the cooling system didn't: it stopped, and did not start back up automatically. The consequence was a fast temperature increase in the room. Fortunately, we could stop most of our servers gracefully, so the restart on Monday was quite smooth. In any case, more lessons learnt: we need more monitoring (a proper high-temperature alarm for the room) and operational procedures in place, both for stopping the services as soon as possible and for being able to start them back up as soon as conditions are re-established. We should not forget that we have to meet the MoU reliability metrics.
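As a first step towards that temperature alarm, even something as simple as the following cron check would help. This is only a sketch: the sensor path, the threshold and the e-mail addresses are assumptions, and in practice it should plug into whatever alerting we end up putting in place:

    import smtplib
    from email.message import EmailMessage

    SENSOR_FILE = "/sys/class/hwmon/hwmon0/temp1_input"  # assumed sensor path (millidegrees C)
    THRESHOLD_C = 30.0                                    # assumed alarm threshold for the room

    def read_temperature_celsius(path=SENSOR_FILE):
        with open(path) as f:
            return int(f.read().strip()) / 1000.0

    def send_alarm(temp):
        msg = EmailMessage()
        msg["Subject"] = f"ROOM TEMPERATURE ALARM: {temp:.1f} C"
        msg["From"] = "monitoring@example.org"   # placeholder addresses
        msg["To"] = "on-call@example.org"
        msg.set_content("Cooling may have stopped: check the machine room immediately.")
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)

    # Run from cron every few minutes on a node wired to the room sensors.
    temp = read_temperature_celsius()
    if temp > THRESHOLD_C:
        send_alarm(temp)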