NFS, the batch killer, and the windy datacenter

Once again, it has been a long time since our last post in this blog. This is not because nothing is going on here at PIC. On the contrary, too many things are happening, which leaves little time to write them down here.
Anyway, I will now try to briefly review the major issues we had in the last few weeks. We had a couple of notable service problems. The first one, and the most severe, affected the Computing service over Christmas. The problem started on the 19th December 2008, and it could not be fixed until the 12th January 2009. The origin of the problem was LHCb and CMS jobs accessing SQLite through NFS. This is known to be bad practice, since it can effectively hang the processes accessing NFS. This is what happened at PIC. The batch system quickly filled up with hung, unkillable jobs, which within a few days completely blocked the service. The batch master saw high load on all of the WNs, so it could not dispatch new jobs to them. We contacted the experiments and asked them to stop using SQLite through NFS, but we also learnt a useful lesson: we are missing some very important monitoring.
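As an illustration of the kind of monitoring we were missing, here is a minimal sketch of a check that could run on each WN. It assumes a Linux node, and it assumes that the hung jobs show up in uninterruptible sleep (state "D"), which is the usual symptom of a process blocked on NFS. The function names parse_stat_line and find_uninterruptible are hypothetical, and this is not the check actually deployed at PIC.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    # The process state is the first field after the command name.
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep ("D")."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or entry unreadable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

A process can sit in state "D" for a moment during normal disk I/O, so a real alarm would probably fire only when the same PIDs stay in that state across several consecutive checks.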
The second problem arrived on Saturday the 24th January around noon. A huge wind storm was affecting the whole of Spain and the south of France. Among other incidents, it caused disruptions to the electricity supply at the PIC building. The UPS system dealt properly with these short power cuts, but unfortunately the cooling system did not. It stopped and did not restart automatically. The consequence was a fast temperature increase in the room. Fortunately, we were able to stop most of our servers gracefully, so the restart on Monday was quite smooth. In any case, more lessons learnt: we need more monitoring (a proper high-temperature alarm) and operational procedures in place, both for stopping the service as soon as possible and for bringing it back as soon as conditions are re-established. We should not forget that we have to meet the MoU reliability metrics.

