Thursday, 20 March 2008

Easter downtime, and February reliability

We have just had our yearly shutdown. PIC was completely stopped for more than 24 hours... how cool is a silent machine room!
The main intervention was the upgrade of 5 racks to 32A power lines. Now we can plug in our HP blade centers. We will see how PBS behaves when we scale up the number of workers by a factor of three.
This week we also got the results for the February reliability from the WLCG office. Our colleagues from Taiwan got the gold medal (100% reliability), breaking CERN's monopoly on this figure. PIC's reliability was very good as well: we reached our record, 99%, which earned us the silver medal for February.
The small 1% that we missed this time was due to a few hours of problems: a log file not rotating in pnfs, and a not-so-transparent intervention in the dCache Information System, which is still quite patchy for SRMv2.2.
Most probably next month's result will not be so green, given the unscheduled power cut we had last week and the scheduled yearly shutdown this week. So let's enjoy our silver medal until the next results come out.
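As a quick back-of-the-envelope check of these percentages, the missing 1% in February 2008 (a leap year) does indeed correspond to "a few hours", and a March with a >24-hour shutdown plus the power cut will cost several points. The March downtime figures below (24 h shutdown plus an assumed 12 h for the power cut) are illustrative guesses, not measured values:

```python
# Back-of-the-envelope check of the monthly reliability figures.
# March downtime hours are ASSUMED for illustration, not measured.

HOURS_FEB_2008 = 29 * 24   # leap year: 696 hours in February 2008
HOURS_MAR_2008 = 31 * 24   # 744 hours in March 2008

def reliability(hours_down, hours_total):
    """Fraction of the month the site was up, as a percentage."""
    return 100.0 * (1 - hours_down / hours_total)

# The missing 1% in February corresponds to roughly 7 hours of trouble,
# consistent with the "few hours" of pnfs / Info System problems.
print(round(0.01 * HOURS_FEB_2008, 1))                 # 7.0 hours

# March: the >24h scheduled shutdown plus, say, 12h for the power cut
# (assumed) already pushes reliability well below the green zone.
print(round(reliability(24 + 12, HOURS_MAR_2008), 1))  # 95.2
```

Even under these rough assumptions, a single full-day intervention in a month caps reliability at about 96-97%, which is why scheduled shutdowns dominate the monthly figure.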

Friday, 14 March 2008

Power Cut!

Yesterday evening, around 16:30 (luckily we were still around), the lights went off. We still do not have the complete picture, but it seems that somebody pushed the red "switch off the building" button by mistake.
The cooling of the machine room stopped, but the machines at PIC kept running on the UPS. After 10 minutes in this situation, we decided to start stopping services. Not much later, the (yet to be understood) glitch arrived: all of the racks lost power for less than one second. After that, to prevent servers from restarting after a dirty stop, we simply switched off all the racks at the main electrical board.
Today, at 8:00, we started switching PIC back on. The good news is that it looks as if we did not have many hardware incidents after the dirty stop. The lesson learnt (the hard way) is that we are still too far away from a controlled and efficient complete shutdown. We will have to repeat this on Monday for the yearly electrical maintenance, so overall it will be a good week to debug all these procedures.

Tuesday, 11 March 2008

Crossing the highway

Last week, in Madrid, we had the first meeting of the “Board for the follow-up of GRID Spain activities”. I think this is the first time such a board has been created to follow the progress of the projects funded by the Particle Physics program of the Spanish Ministry of Education. The classic yearly written report has been upgraded to a meeting with oral presentations and an evaluation board.
The meeting started with three presentations, one per experiment, each reporting on the state of the LCG from its own point of view. It was quite interesting to see how the three talks presented completely different views of the same issue.
Both ATLAS and CMS mentioned the need for some sort of Tier-3 for users to do their final analysis, and there was general concern that such infrastructures are currently not being funded. The LHCb presentation was, from my point of view, the one that most directly presented the view from the LCG users. It mentioned the small number of physicists actually using the Grid and described the most common problems they run into: the usual "30% of the jobs sent to the Grid fail", and "sometimes the sites do not work, sometimes the experiment framework does not work". The result is always the same: the user just leaves, saying "the Grid does not work". After some years of work trying to make a Grid site work, I really do think now that many of the problems remaining today are due to the experiment frameworks not working, or not properly managing the complexity of the Grid.
I presented the status of the Tier-1 at PIC, focusing on the latest results from the recent CCRC08 test. Most of the results were actually quite positive, so I am quite confident that the board got the message: "the Tier-1 at PIC is working". It was also quite helpful to see direct references to the good performance of PIC in some of the Tier-2 presentations, like the LHCb one (thanks, Ricardo).
There were two points I would like to highlight from the PIC presentation. The first one arose when I showed two plots comparing the actual cost of equipment to two CERN estimations: the one used in the proposal (Oct-2006) and the update received three months ago. The results suggest that the hardware cost is lower than the estimation in the proposal. There are plenty of moving parameters in this project; hardware cost estimations are one, but we should not forget that the event size, the CPU time or memory needed to generate a Monte Carlo event, the overall experiment requirements, etc. are also parameters with uncertainties of the order of 30 to 100%. If we eventually get to 2010 and the computing market has driven prices down faster than expected, good news: we will report it (as we already did with the past project and the delay of the LHC) and propose to use the "saved" money for the 2011 purchases.
The second issue arose from a question by Les Robertson, who was a member of the board: "when do you expect that PIC will run out of power?". As the CPU and storage capacity of the equipment is (luckily for us!) growing exponentially, the power consumption of these wonderful machines is also going through the roof. Soon the total input power at PIC will be raised from 200 kVA to 300 kVA. Though it is not an easy estimation, we believe this should be enough for the current phase of the Tier-1, up to 2010. Beyond that date, we should most probably be thinking about a major upgrade of the PIC site. Next to the UAB campus, on the other side of the highway, a peculiar machine is being built: a Synchrotron Ring. This kind of machine normally comes with a BIG plug... should we try to cross the highway and get closer to it?