Monday, 7 April 2008

FTS collapse!

Last week was the week of the FTS collapse at PIC. Our local FTS instance had been getting slower and slower for quite a while. The cause seemed to be the high load on the Oracle backend DB. The Oracle host had a constant load of around 30, and we could see a clear bottleneck in the disk I/O. In the end, three weeks ago, we more or less concluded that the cause was that the tables of the FTS DB contained ALL the transfers done since we started the service. One of the main tables had more than 2 million rows! Any SELECT query on it was killing the server with disk I/O (around 600 IOPS, that is, I/O requests per second, according to Luis, our DBA).
Apparently, an "fts history package" that does precisely this cleanup had existed for almost a year. However, it seems it had some problem and was not really working until a new version was released in mid-March this year. Unfortunately, that was too late for us. The history job was archiving old rows too slowly: after we started it, the load on our DB backend did not change at all. We were stuck.
The DDT transfers for CMS were so degraded that most of the PIC channels had been decommissioned in the last few days (see CMS talk). On Thursday, 3 April, we decided to solve this with a radical recipe: restart the FTS with a completely new DB. We lost the history rows, but at least the service was up and running again.
Now, let's try to recommission all those FTS channels asap... and get off the CMS blacklist!
