Tuesday, 16 December 2008

October/November reliability and the SRM nightmare

Here again, to comment on our last reliability scores: 97% for october (good, above the 95% WLCG target) and 93% for November (not so good, first time below target since March this year, do you remember uncheduled "lights off"?).
Not yet clear what happened by the end of October (may be some services did not like the end of the summer time on the 28th? :-) but something happened. On the 31st that month we started seeing the SRM server failing with timeouts: start of the nightmare. It was not such a terrible nightmare though, since a restart of the service did cure the problems. So, that was the story until the scheduled intervention on the 18th Nov: SRM timing out, MoDs restarting the service... and Paco chasing the problem. On the 18th, two SRM interventions were carried out: first a new SRM server with 64bit OS and latest Java VM, and second the PinManager was again taken out of the SRM server process virtual machine. The good news were that these cured the SRM timeout problem. The bad news is that a second SRM problem appeared: now the SRM-get requests were the only ones timing out (SRM-put's were happily working).
The solution came on wednesday 24th of November, when we were made aware of the existence of different queues in the SRM for put, bringonline and get requests (good to know!). Once we had a look to them, we realised that the SRM-get queue was so large that it was touching its internal limit. This problem appeared because experiments are issuing srm-get requests, but not releasing them. Now we know we have to watch closely to the srm-get queue: more monitoring, more alarms. Back to business.