Friday, 28 August 2009

Believe us, PIC is OK

Scary, isn't it? PIC availability has been fully red for the last 24h but, sadly, there was not much we could do.
We have always said, and still strongly believe, that SAM tests are a very good thing. Actually, I firmly believe that they have been one of the key ingredients of the WLCG success. Success here means the evolution from the "the Grid does not work" situation, with the 60% job success rate we had a few years ago, to the routine >97% availabilities we are used to seeing these days. But yes, not even SAM tests are perfect. There has always been a dark corner inside them: the much-questioned "SE test inside the CE", or lcg-rm test. And inside this controversial test there is a smaller corner that is darker still: the file replication to CERN test. This was the one that started flickering on Tuesday at PIC, and it has been failing consistently for more than 24h.
This test tries to copy a file sitting at PIC to one very specific DPM server at CERN. This precise connection was timing out for us, while every other transfer to any other site, even to any other CERN storage server, was working. This was strange enough that we asked our CERN colleagues for help. Today they came back with good news: problem found, a problematic router.
Got a really puzzling error? Bet on the network...

