Tuesday, 16 December 2008
October/November reliability and the SRM nightmare
It is not yet clear what happened at the end of October (maybe some services did not like the end of summer time on the 28th? :-) but something happened. On the 31st of that month we started seeing the SRM server failing with timeouts: the start of the nightmare. It was not such a terrible nightmare though, since a restart of the service did cure the problem. So, that was the story until the scheduled intervention on the 18th Nov: SRM timing out, MoDs restarting the service... and Paco chasing the problem. On the 18th, two SRM interventions were carried out: first, a new SRM server with a 64-bit OS and the latest Java VM, and second, the PinManager was again taken out of the SRM server process virtual machine. The good news was that this cured the SRM timeout problem. The bad news is that a second SRM problem appeared: now the SRM-get requests were the only ones timing out (SRM-puts were happily working).
The solution came on Wednesday 24th of November, when we were made aware of the existence of different queues in the SRM for put, bringonline and get requests (good to know!). Once we had a look at them, we realised that the SRM-get queue was so large that it was touching its internal limit. This problem appeared because the experiments were issuing srm-get requests but not releasing them. Now we know we have to watch the srm-get queue closely: more monitoring, more alarms. Back to business.
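Just to make "more monitoring, more alarms" a bit more concrete, here is a minimal sketch of the kind of watermark check we have in mind, in Python. The get_queue_lengths() helper and the thresholds are hypothetical placeholders: a real version would query the SRM database or admin interface for the actual queue sizes.

```python
# Minimal sketch of an SRM request-queue watermark alarm.
# get_queue_lengths() is a hypothetical helper: a real version would query the
# SRM database or admin interface for the current put / bringonline / get queue sizes.

import sys

# Illustrative alarm thresholds, set well below the SRM internal queue limit.
THRESHOLDS = {"put": 5000, "bringonline": 5000, "get": 5000}

def get_queue_lengths():
    """Return {'put': N, 'bringonline': N, 'get': N}. Placeholder values here."""
    return {"put": 120, "bringonline": 40, "get": 4800}

def main():
    lengths = get_queue_lengths()
    over = [q for q, n in lengths.items() if n > THRESHOLDS[q]]
    if over:
        print("WARNING: SRM queues over threshold: %s - %s" % (", ".join(over), lengths))
        return 1          # Nagios convention: 1 = WARNING
    print("OK: SRM queues %s" % lengths)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run from cron or as a Nagios plugin, the non-zero exit code should be enough to wake up the MoD well before the queue touches its internal limit.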
Friday, 10 October 2008
September Availability: LHCb, the SRM-killer
The unavailability at PIC in September was mainly due to two issues. First, the overload of the SRM server caused by the submission of about 13,000 srm-get requests in one shot from LHCb production jobs. This affected the SRM service on three days in September. The first thing this incident made clear was one of the biggest problems of dCache, from my point of view: there is no way to have different SRM servers, each dedicated to one experiment. We are forced to share the SRM server, so when LHCb breaks it, ATLAS and CMS suffer the consequences. This is clearly bad.
Then one can discuss whether issuing 13,000 srm-gets should be considered a DoS or reasonable activity from our users. I really do think that as a Tier-1 we should withstand this load with no problems. As we post this, the storage team at PIC and the LHCb data management experts are in contact to try and learn what exactly went wrong and how to fix it.
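On the client side, by the way, this kind of bulk pre-stage can be made much gentler on a shared SRM simply by chunking and pacing the requests. A rough sketch of the idea, where submit_bringonline() is a hypothetical stand-in for whatever SRM client call the experiment framework actually uses (the example SURLs are made up too):

```python
# Rough sketch: pacing a bulk pre-stage instead of firing 13,000 srm-gets in one shot.
# submit_bringonline() is a hypothetical stand-in for the experiment's real SRM client call.

import time

CHUNK_SIZE = 200       # requests per batch (illustrative)
PAUSE_SECONDS = 60     # breathing room for the shared SRM between batches

def submit_bringonline(surls):
    """Placeholder for the real bring-online / srm-get call of the framework."""
    print("submitting %d SURLs" % len(surls))

def paced_prestage(all_surls):
    for i in range(0, len(all_surls), CHUNK_SIZE):
        submit_bringonline(all_surls[i:i + CHUNK_SIZE])
        time.sleep(PAUSE_SECONDS)

# Example with made-up SURLs: 13,000 files arrive in ~65 gentle batches instead of a single burst.
# paced_prestage(["srm://srm.example.org/lhcb/file%05d" % n for n in range(13000)])
```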
Following the saying "better late than never", ATLAS started seriously testing the pre-stage procedure for the reprocessing at the Tier-1s just a few days after LHCb. This is good news. It is the only way for us to learn how to configure our system so that it can deliver the required performance. Sure, our SRM will die several times during this testing, but I hope it will converge to a reliable configuration... best before spring 2009.
The second contribution to PIC's unreliability last month came from the network. On 23rd September the Spanish NREN suffered a major outage due to electrical and cooling problems in a TELVENT datacenter which hosts the NREN equipment. This resulted in a complete network outage at PIC of about 10 hours. Again, we see electrical and cooling issues at the very top of the LCG service risk list. In the end, it looks like one of the trickiest bits of building such a complex computing infrastructure is just plugging it in and cooling it down.
LHC up&down
It is quite funny that the quietest month on our blog was probably the most visible one for the LHC all around the world. Well, we can always say "we did not talk about the LHC here since you could read about it in any newspaper" :-)
So, the two big LHC things that happened in September, as all of you know, are that the LHC started circulating beams on Sep 10th and that it then had a major fault on the 19th. You can read the details of both events anywhere on the web. I will just mention that those were quite special days in our community: the big excitement on the 10th, and then the "bucket of cold water" a few days later, could be felt everywhere. Even the daily operations meeting was less crowded than usual, since in those first days it was difficult not to feel a bit of "what do we do now?".
I think it is now quite clear for everyone that life goes on. We at the LHC Computing Grid continue operations exactly as we have been doing for months. We are not receiving p-p collision data, true, but the data flow did not stop. Neither cosmics data taking nor Monte Carlo generation has stopped.
We have said many times in the last years that the LHC is an extremely complex machine and that it might take a long time to put it into operation. Well, now we can see this complexity in front of us. There it is. Life goes on.
Monday, 8 September 2008
CMS bends Cosmic Muons...
Last Friday night, September 5th, the current was set to 14500 Amps (3 Tesla central field) for almost two hours to allow an extensive run of all sub-detectors. Cosmic muons bend in the presence of a magnetic field. The data from this "3 Tesla" magnet commissioning test were distributed to all CMS computing centres and the bent tracks were then reconstructed.
With all the parameters within their reference values, the first phase of the magnet commissioning underground can be considered achieved. The test plans were fully completed within the time allowed. The next step will be the test at full current... and bending the products of proton-proton collisions!
Wednesday, 20 August 2008
Reliability under control... ready for data?
The reliabilities scored by PIC in the last three months (May to July) have been flat at 99%. This is OK and I think it tells us that the Manager on Duty shifts are getting mature. On the one hand, the tools available for the shifters are improving: a cleaner Nagios and better documentation and procedures (thanks to all the service managers around); on the other, of course, the shifters are doing a great job.
The availabilities for these last three months have moved up: 92%, 96% and 97% for May, June and July, respectively. For these three months we have started implementing a regular Scheduled Downtime on the second (sometimes first) Tuesday of the month. Knowing the SD date in advance makes the planning and user notification smoother. The availability in May is somewhat lower because we had an extra downtime on the 14th: the Regional Network Provider had to upgrade some equipment, so PIC had to be disconnected for some hours.
I always say that the SAM monitoring has proven to be a very useful tool for sites to improve their stability. The experiments have long complained that sometimes the results do not reflect reality, especially because the tests run under the "fake" OPS VO. The solution is of course for them to implement their VO-specific tests in the SAM framework. This has been ongoing work for several months, but it is still not completely stable.
So, it looks like we will get the first LHC data with our VO-specific SAM glasses a bit dirty... anyhow, I am sure that the "real data pressure" will help all of this converge, so we will have even better tools for our shifters to know what's going on.
Monday, 28 July 2008
Euro Science Open Forum in Barcelona
One week ago we had the ESOF08 conference in Barcelona. This was a BIG event devoted to science communication. More than 3000 people attended the "scientific programme" during that weekend. Not bad, taking into account that the weather in Barcelona was just lovely, and the beach only a few metro stations away...
There were several presentations about the LHC. On Saturday morning the physics motivation and experiment status presentations were given by several people from CERN. Especially interesting and funny, as usual, was the presentation by Álvaro de Rújula. Unfortunately for those who were not there, he still uses "analogue transparencies" (made by hand), so there is no way to download a copy to your PC.
We organised a session on the LHC data processing and analysis challenge on Sunday, and invited Pere Mato and Tony Cass from CERN as speakers. Pere first gave a talk on the challenge of the TDAQ systems at the LHC: filtering out and reducing the number of events from the 40MHz collision rate down to the 100Hz that can be permanently stored. Then Tony Cass presented the main challenges that the CERN computing centre is facing as the Tier-0 of the LHC Grid. Finally, I presented the LHC Computing Grid and the key role of this huge distributed infrastructure in making the LHC data analysis feasible.
There were quite a number of questions at the end of the session (not bad for a Sunday-after-lunch one). Besides the most repeated one, "when exactly will the LHC start and how many days later will you discover new physics?", there was an interesting question about the similarities and differences between our LHC Grid and the now-so-famous Cloud Computing. We answered that, as of today, the LHC Grid and the Clouds available out there (like Amazon's) are quite different. The LHC data processing needs, besides huge computing and storage capacities, a very big bandwidth between the two. Tier-1s are data centres specialised in storing Petabytes of data and mining through all of it using thousands of processors in a very efficient way. Trying to use the commercial Clouds to do this today, besides being too expensive, would most probably not meet the performance targets.
That said, we should all keep an eye on this new hype-word "the Cloud", as it will surely evolve in the next years and I am afraid our paths are poised to meet at some point. The LHC is not a target customer for these Clouds today, but what these giant companies are doing in order to be able to sell "resources as a service" is indeed very interesting and, as Wladawsky-Berger notes, is driving an "industrialisation" of IT data centres in a similar way as, 25 years ago, companies like Toyota industrialised the manufacturing process.
So, more productive, efficient and high-quality computing centres are coming out of the Clouds. We will definitely have to look up at the sky every now and then, just to be prepared.
Thursday, 10 July 2008
Cosmic Rays Illuminate CMS II!
Wednesday, 9 July 2008
Cosmic Rays Illuminate CMS!
The fraction of CMS sub-detectors participating in CRUZET3 has steadily increased and for the first time includes all its components: the DT muon system, the RPC barrel, the CSC endcap, the HCAL and barrel ECAL calorimeters, and the recently installed silicon strip tracker (the biggest tracker detector ever built!).
Last night (9.7.2008), over 1 million cosmic ray events were reconstructed in the tracker system. This is the first time we have seen triggered cosmic ray tracks in both the TIB and the TOB at the Tracker level:
Primary datasets are created using the new Tier-0 “repacker” in almost 'real time' and transferred to the CAF and Tier-1 sites for prompt analyses. The IN2P3 Tier-1 has the custodial responsibility for the CRUZET3 data, although all Tier-1 sites are constantly receiving the cosmic data. During these first two days, ~3 TB of data has landed at PIC, the best Tier-1 site from the rate and quality point of view (curiously, yesterday we spent the whole day in scheduled downtime!).
Data volumes are expected to grow later this week as systems get better integrated in this commissioning exercise... So, we need to stay tuned! ;)
Tuesday, 10 June 2008
May Data Transfers from CERN: CMS @ CCRC'08
For PIC, the share of the import target rate was about 57 MB/s. PIC was above the metric every day, on some days importing a factor of 2 more than requested.
It is important that data flows from CERN in a smooth way. From the reliability and robustness point of view, being over the metric for only 3-4 days would be really bad for data taking, as data could get stuck at CERN, overrunning the CMS buffers. PIC was over the metric during the whole month, and a comparison with the other Tier-1s can be seen in this table.
Last, but not least, the mean May '08 rate CERN->PIC was 70 MB/s, with an impressive Data Transfer Quality of 90%. The test was successful and quite impressive for PIC indeed!
Tuesday, 3 June 2008
A picture is worth a thousand words...
PIC demonstrated its reliability during the whole month, achieving an efficiency of 91% (best Tier-1) acquiring data. The highly demanding reprocessing jobs pushed the Worker Nodes and pools to the limit and were very useful to the collaboration. The Tier-2s also received data without major problems. I want to mention the whole PIC team, who made such a good performance possible in every single activity. And as I said in the title... a picture is worth a thousand words:
Friday, 30 May 2008
Last chance to test, gone
Overall, the Tier-1 services have been up and running basically 100% of the time. As the two most relevant issues, I would mention first the brutal I/O of the CMS skimming jobs. After they were limited (to 100, according to CMS) they were still running until mid-Thursday and sustaining quite a high load on the LAN (around 500MB/s). On Wednesday 28-May, in the evening, ATLAS launched a battery of reprocessing jobs which very quickly filled up more than 400 job slots. Apparently all of these jobs read the detector Conditions data from a big (4GB) file sitting in dCache. This file of course quickly became super-hot, since all of these jobs were trying to access it simultaneously. This caused the second issue of the week. The output traffic of the pool on which this file was sitting immediately grew to 100MB/s, saturating the 1Gbps switch that (sadly, and until we manage to deploy the definitive 2x10Gbps 3Coms) links it to the central router. This network saturation caused the dCache pool-server control connection to lose packets, which eventually hung the pool.
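A back-of-the-envelope calculation shows why a single popular pool is enough to choke its 1Gbps uplink. The 400 jobs and the 1Gbps link are the figures quoted above; the per-job read rate is an assumption:

```python
# Back-of-the-envelope: why one hot 4GB conditions file saturates a 1Gbps pool uplink.
# The 400 jobs and the 1Gbps link come from the incident above; the per-job rate is assumed.

jobs = 400                        # ATLAS reprocessing jobs reading the same file
per_job_rate_mb_s = 2.0           # assumed modest read rate per job, in MB/s
link_capacity_mb_s = 1000 / 8.0   # 1 Gbps ~ 125 MB/s

demand = jobs * per_job_rate_mb_s
print("Aggregate demand: %.0f MB/s vs. link capacity: %.0f MB/s" % (demand, link_capacity_mb_s))
print("Oversubscription factor: %.1fx" % (demand / link_capacity_mb_s))
# Even at a modest 2 MB/s per job the pool would need ~800 MB/s, more than 6x what
# the 1Gbps link can serve, consistent with the uplink flat-lining around 100 MB/s.
```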
At first sight it seems that the Local Area Network has been the big issue at PIC this CCRC08. Let's see what the more detailed post-mortem analysis teaches us.
Friday, 23 May 2008
LAN is burning, dial 99999
CMS has been basically the only user of the PIC farm this week, due to the lack of competition from other VOs. It was running about 600 jobs in parallel for some days. Rapidly, we saw how these jobs started to read input data from the dCache pools at a huge rate. By Tuesday the WNs were reading at about 600MB/s sustained. Both the disk and the WN switches were saturating. On Thursday at noon Gerard raised some of the Thumpers' network uplinks from 5 to 8 Gbps with 3 temporary cables crossing the room (yes, we will tidy them up once the long-awaited 10Gbps uplinks arrive) and we saw how the extra bandwidth was immediately eaten up.
On Thursday afternoon the WNs spent some hours reading from the disks at the record rate of 1200MB/s. Then we tried to limit the number of parallel jobs CMS was running at PIC, and we saw that with only 200 parallel jobs it was already filling up 1000MB/s.
Homework for next week is to understand the characteristics of these CMS jobs (it seems they are the "fake-skimming" ones). What is their MB/s-per-job figure? (I have been asking the same question for years; now once again.)
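Actually, the aggregate numbers quoted above already give a first, rough answer to that question (treating them as read off the monitoring plots):

```python
# First estimate of the MB/s-per-job figure of the CMS "fake-skimming" jobs,
# using the aggregate rates quoted above.

observations = [
    (600, 600.0),    # ~600 parallel jobs reading ~600 MB/s earlier in the week
    (200, 1000.0),   # only 200 parallel jobs already filling ~1000 MB/s
]

for jobs, aggregate_mb_s in observations:
    print("%4d jobs at %6.0f MB/s  ->  %.1f MB/s per job" %
          (jobs, aggregate_mb_s, aggregate_mb_s / jobs))

# The second observation (~5 MB/s per job) suggests the earlier 600-job run was
# already capped by the available bandwidth rather than by the jobs themselves.
```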
The second CCRC08 hiccup this week arrived yesterday evening. ATLAS transfers to PIC disk started to fail. Among the various ATLAS T0D1 pools, there were two with plenty of free space while the others were 100% full. For some reason dCache was assigning transfers to the full pools. We sent an S.O.S. to dCache support and they answered immediately with the magic (configuration) recipe to solve it (thanks Patrick!).
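While the proper fix is in the dCache configuration, this failure mode is also easy to watch for: alarm whenever some pools of a group are full while their siblings still have plenty of room. A small sketch, where get_pool_usage() and the pool names are hypothetical placeholders for the real query to dCache:

```python
# Sketch of a pool-imbalance alarm for a pool group (e.g. the ATLAS T0D1 pools):
# complain when some pools are nearly full while siblings still have lots of room.
# get_pool_usage() and the pool names are hypothetical; a real version would ask dCache.

def get_pool_usage():
    """Return {pool_name: fraction_used}. Placeholder values here."""
    return {"atlas_d1_01": 1.00, "atlas_d1_02": 0.99, "atlas_d1_03": 0.35, "atlas_d1_04": 0.30}

FULL = 0.98      # consider a pool effectively full above this fraction
ROOMY = 0.80     # consider a sibling to have plenty of room below this fraction

def check_imbalance(usage):
    full = [p for p, f in usage.items() if f >= FULL]
    roomy = [p for p, f in usage.items() if f <= ROOMY]
    if full and roomy:
        print("WARNING: pools %s are full while %s still have space - "
              "check the pool selection configuration" % (full, roomy))
        return 1
    print("OK: no suspicious imbalance (%s)" % usage)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(check_imbalance(get_pool_usage()))
```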
Now it looks like things are quiet (apart from some blade switches burning) and green... ready for the weekend.
Friday, 16 May 2008
Intensive test of data replication during the CCRC08 run2
And after 4 hours the import efficiency was still at 99% and the accumulated mean throughput was over 300MB/s. This led to the correct placement of 16TB of data in 14 hours.
On the other hand, we observed a different behaviour while exporting data to the other Tier-1s: the efficiencies were not as brilliant as for importing. Notice that when exporting we rely on external FTS services, so sometimes the errors are not related to our capabilities, but we did observe some timeouts from our storage system. The network was indeed a limit, but we saw saturation at the pool level while importing and not when others were reading, which means the network between the pools and the data receivers was not to blame for the efficiency drop while exporting. Our insight is that our dCache has problems returning the TURLs (transfer URLs) when it is under heavy load (a well-known dCache bug), and this could cause the kind of errors observed during the exercise. Besides that, almost all Tier-1s got more than 95% of the data from PIC, although sometimes not at the first attempt, so the exercise can be considered more than successful both from the PIC side and from ATLAS'.
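As a quick sanity check, the quoted mean throughput and the total volume placed are nicely consistent:

```python
# Quick consistency check of the import figures quoted above.

mean_rate_mb_s = 300.0           # accumulated mean throughput, MB/s
hours = 14.0                     # duration of the exercise
volume_tb = mean_rate_mb_s * hours * 3600 / 1e6   # MB -> TB (10^6 MB per TB)

print("%.0f MB/s sustained for %.0f hours ~ %.1f TB" % (mean_rate_mb_s, hours, volume_tb))
# ~15 TB, in good agreement with the ~16 TB of data correctly placed.
```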
In my opinion this was one of the most successful tests run so far within the ATLAS VO, as it involved all Tier-1s under a heavy load of data cross-replication. The majority of sites performed very well, with only a couple experiencing problems. Let me stress again the complexity of the system: a whole set of services must be ready and stable in order to complete every single transfer: ATLAS data management tools, SE, FTS, LFC, network, etc.
We have to keep working hard on the robustness of the current system, which has improved enormously in the last months: dCache is performing very well and the new database back-end for the FTS ironed out some of the overload problems found in past exercises. For that reason I want to deeply thank all the people involved in maintaining the computing infrastructure at PIC; these results show we are on the right track. Congrats!
Thursday, 15 May 2008
In April, a thousand rains
We also had other operational issues last month which contributed to the overall unreliability. Every service had its grey day: in the Storage service, a gridftp door (dcgftp07 by name) mysteriously hung in such a funny way that the clever dCache could not detect it and kept trying to use it. As far as I know, this is still in the Poltergeist domain; I hope it will just go away with the next dCache upgrade. The Computing service also had some hiccups... it is not nice when an automated configuration system decides to erase all of the local users on the farm nodes at 18:00.
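For this kind of Poltergeist, a dumb external probe often catches what the internal bookkeeping misses: just check from outside that the door still answers on its control port. A minimal sketch, assuming the standard GridFTP control port 2811 (a real check would also parse the greeting and cross-check the dCache admin view):

```python
# Minimal liveness probe for a gridftp door: check that the control port accepts a
# TCP connection and sends a greeting. 2811 is the standard GridFTP control port;
# the host name is the door mentioned above.

import socket
import sys

HOST = "dcgftp07"    # the door in question
PORT = 2811
TIMEOUT = 10         # seconds; a healthy door should greet us well before this

def door_alive(host, port):
    try:
        s = socket.create_connection((host, port), TIMEOUT)
        s.settimeout(TIMEOUT)
        banner = s.recv(128)     # a working FTP-like door sends a greeting line
        s.close()
        return len(banner) > 0
    except (socket.error, socket.timeout):
        return False

if door_alive(HOST, PORT):
    print("OK: %s:%d answered" % (HOST, PORT))
    sys.exit(0)
print("CRITICAL: %s:%d did not answer within %ds" % (HOST, PORT, TIMEOUT))
sys.exit(2)   # Nagios convention: 2 = CRITICAL
```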
We even had a nice example of collaborative destruction between Services last month: a supposedly routine and harmless pool replication operation in dCache ended up saturating a network switch which happened to have, among others, the PBS master connected to it, and the latter immediately lost connectivity to all of its Workers. Was it sending the jobs to the data, or bringing the data to the CPUs? Anyhow, a nice example of Storage - Computing love.
First LHCOPN backup use at PIC
Last night we had a pair of scheduled network downtimes. The first one (red in the image) affected the PIC-RREN connection, which is a single 10Gbps fibre connection from PIC to Barcelona; therefore, as scheduled, we had no connectivity from 21:35 to 23:20.
The second scheduled downtime was due to rerouting work on the French part of the PIC-CERN LHCOPN fibre. This used to be critical for us, since we were isolated from the LHCOPN while such work was taking place, but it is not like that anymore. Now our NREN routes us through GÉANT regardless of where the data is going, so we stay connected! As you can see in the orange-coloured areas of the image, while reaching the OPN through our NREN (Anella+RedIRIS) we are limited to 2Gbps (our RREN uplink), but we still reach it, so finally we can say we have a [non-dedicated] backup link.
Waiting for the LHCOPN dedicated backup link won't be so hard now.
Monday, 5 May 2008
Networking conspiracy
In the end, the reality was that the intervention completely cut our network connectivity at 23:30 on the 28-Apr. The next morning, at 6:00, we recovered part of the service (the link to the OPN), but the non-OPN connectivity was not recovered until 17:30 on the 29-Apr.
When one thinks of the network one tends to assume "it's always there". On that N-day we apparently decided to challenge this popular belief, so we did not have just one network incident but two. Just four hours after we recovered the OPN link, somewhere near Lyon a bulldozer destroyed part of the optical fibre that links us to CERN. This kept our OPN link completely down from 10:00 on 29-Apr until 01:00 on 30-Apr.
Not bad as an aperitif, just hours before a two-day scheduled intervention...
Throttling up
On the other hand, for the data transfers the delay was even shorter: after restarting the site services at CERN, it only took about 5 minutes to start triggering data in and out of PIC (pic. 2, files done, and pic. 3, throughput in MB/s):
Given the complexity of the system and the number of cross-dependencies for each of the single services that were successfully recovered, one can conclude that the "re-start" was extremely successful :)... but of course everything can be improved!
Thursday, 24 April 2008
March reliability, or is the glass half empty or half full?
The March unreliability came mainly from three events. First, the unexpected power cut that affected the whole building in the afternoon of March 13 (I don't know the latest report about it, but they were talking about somebody pressing the red button without noticing). Second, there was an outage in the OPN dark fibre between Geneva and Madrid on the 25th that lasted almost three hours.
The last source of SAM unreliability was of a slightly different nature: the OPS VO disk pools filled up due to massive DTEAM test transfers. So, this last one was within our domain, but it actually did not affect the LHC experiments' service, only the monitoring. Anyhow, we have to take and understand SAM for the good and for the bad.
Last month we also had our yearly electrical shutdown during the Easter week. The impact of that scheduled downtime appears in the availability figure, which decreases 12 percentage points down to 74%.
So, it was a tough month in terms of management metrics to be reported (we will see these low points in graphs and tables many times in the following months, that's life). Anyhow, the scheduled intervention went well, and the LHC experiments were not that much affected, so I really believe that our customers are still satisfied. Let's keep them like this.
Friday, 18 April 2008
PIC farm kicking
This is clearly visible in the following figures: an impressive ramp-up in walltime and in jobs finished per day:
Notice there are some reds in the figures, as there were some configuration errors at the very beginning, quickly resolved by the people maintaining the batch system (new things always bring new issues!).
The contribution of the Spanish sites to ATLAS Monte Carlo production has been throttling up; although we are far from the gigantic Tier-1s, we are firmly growing and showing robustness (figure below: Spanish sites are tagged as "ES" and shown in blue):
We keep seeing the advantage of using the pilot job scheme, as the new nodes were rapidly spotted by these "little investigators" and, some hours after the deployment, all the blades were happily fizzing.
Monday, 14 April 2008
Network bottleneck for the Tier-2s
Last week we reached a new record at PIC: the export transfer rate to the Tier-2 centres. On Wednesday 9th April, around noon, we were transferring data to the Tier-2s at 2Gbps. CMS started very strong on Monday. Pepe was so happy with the resurrected FTS that he started to commission channels to the Tier-2s like hell. Around Thursday, CMS lost a bit of steam, but it looks like ATLAS kicked in, exporting data to UAM at quite a serious rate, so the weekly plot ends up quite fancy (attached picture).
The not-so-good news is that this 2Gbps is not only a record but also a bottleneck. At CESCA, in Barcelona, the Catalan RREN routes our traffic to RedIRIS (non-OPN sites) through a couple of Gigabit cables. Last October they confirmed this fact to us (and now we have measured it ourselves) and also told us that they were planning to migrate this infrastructure to 10Gbps. So far so good. Now let's see if, with the coming kick-off of the Spanish Networking Group for WLCG, this plan becomes reality.
Monday, 7 April 2008
FTS collapse!
The DDT transfers for CMS were so degraded that most of the PIC channels had been decommissioned in the last days (see the CMS talk). On Thursday 3rd April we decided to solve this with a radical recipe: restart the FTS with a completely new DB. We lost the history rows, but at least the service was up and running again.
Now, let's try to recommission all those FTS channels asap... and get off the CMS blacklist!
Thursday, 20 March 2008
Easter downtime, and February reliability
The main intervention was the upgrade of 5 racks to 32A power lines. Now we can plug in our HP blade centres. We will see how the PBS behaves when we scale up the number of Workers by a factor of three.
This week we also got the results for the February reliability from the WLCG office. Our colleagues from Taiwan got the gold medal (100% reliability) breaking CERN's monopoly on this figure. PIC's reliability was very good as well. Actually, we reached our record: 99% reliability. And we got the silver medal for February.
The small 1% that we missed this time to reach 100% was due to a few hours of problems caused by a log file not rotating in the pnfs server, plus a "not transparent enough" intervention in the Info System for dCache, which is still quite patchy for SRMv2.2.
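Since one oversized log was enough to cost us the gold, a trivial sentinel on log size is cheap insurance. A sketch of the idea; the path and the limit are illustrative, not our actual pnfs layout:

```python
# Trivial sentinel against "a log file not rotating": warn when a log grows
# beyond a threshold. The path and the limit below are illustrative only.

import os
import sys

LOG_FILE = "/var/log/pnfsd.log"      # hypothetical pnfs log location
LIMIT_MB = 500                       # warn well before the partition fills up

def check(path, limit_mb):
    try:
        size_mb = os.path.getsize(path) / (1024.0 * 1024.0)
    except OSError:
        print("WARNING: cannot stat %s" % path)
        return 1
    if size_mb > limit_mb:
        print("WARNING: %s is %.0f MB (limit %d MB) - is logrotate running?" %
              (path, size_mb, limit_mb))
        return 1
    print("OK: %s is %.0f MB" % (path, size_mb))
    return 0

if __name__ == "__main__":
    sys.exit(check(LOG_FILE, LIMIT_MB))
```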
Most probably next month's result will not be so green, due to the unscheduled power cut we had last week and the scheduled yearly shutdown this week. So, let's enjoy our silver medal until the next results come.
Friday, 14 March 2008
Power Cut!
The cooling of the machine room stopped, but the machines at PIC were still working, powered by the UPS. After 10 minutes in this situation, we decided to start stopping services. Not much later, the (yet to be understood) glitch arrived: all of the racks lost power for less than 1 second. After this, and to avoid servers restarting after a dirty stop, we just switched off all racks at the electrical main board.
Today, at 8:00, we started switching on PIC. The good news is that it looks as if we did not have many hardware incidents after the dirty stop. The lesson learnt (the hard way) is that we are still far from a controlled and efficient complete shutdown. We will have to repeat this on Monday, due to the yearly electrical maintenance. So overall, it will be a good week to debug all these procedures.
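One idea for debugging those procedures is to write the ordering down as something executable rather than as a wiki page. A very rough sketch of a staged shutdown; the host groups and the shutdown command are placeholders, not our actual procedure:

```python
# Very rough sketch of a staged, ordered shutdown: drain the consumers first,
# then the storage, then the core services. The host groups and the shutdown
# command are placeholders, not the actual PIC procedure.

import subprocess

SHUTDOWN_ORDER = [
    ("worker nodes",  ["wn001", "wn002"]),      # stop batch work first
    ("disk pools",    ["pool01", "pool02"]),    # then the dCache pools
    ("core services", ["srm01", "pnfs01"]),     # finally SRM, pnfs, databases
]

def shutdown_host(host):
    # Placeholder: a real procedure would stop services cleanly before halting.
    return subprocess.call(["ssh", host, "shutdown", "-h", "now"]) == 0

def staged_shutdown():
    for group, hosts in SHUTDOWN_ORDER:
        print("Shutting down %s..." % group)
        failed = [h for h in hosts if not shutdown_host(h)]
        if failed:
            print("Manual intervention needed for: %s" % ", ".join(failed))

if __name__ == "__main__":
    staged_shutdown()
```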
Tuesday, 11 March 2008
Crossing the highway
The meeting started with three presentations, one for each of the experiments, aiming to report on the state of the LCG from each of their points of view. It was quite interesting to see how the three talks presented completely different views of the same issue.
Both ATLAS and CMS mentioned the need for some sort of Tier-3s for the users to carry out their final analysis. There was some general concern about the fact that such infrastructures are currently not being funded. The LHCb presentation was, from my point of view, the one that most directly presented the "view from the LCG users". The small number of physicists actually using the Grid was mentioned, and the most common problems they encounter were described. There was the usual "30% of the jobs sent to the Grid fail", and "sometimes the sites do not work, sometimes the experiment framework does not work". The result is always the same: the user just leaves saying "the Grid does not work". After some years of work trying to make "a Grid site work", I really do think now that many of the problems remaining today are due to the experiment frameworks not working, or not properly managing the complexity of the Grid.
I presented the status of the Tier-1 at PIC, focusing on the latest results obtained in the recent CCRC08 test. Most of the results were actually quite positive, so I am quite confident that the board got the message "the Tier-1 at PIC is working". It was also quite helpful to see direct references to the good performance of PIC in some of the Tier-2 presentations, like the LHCb one (thanks Ricardo).
There were two points I would like to highlight from the PIC presentation. The first one arose when showing two plots where the actual cost of equipment was compared to two CERN estimates: the one used in the proposal (Oct 2006), and the update received three months ago. The results suggest that the hardware cost is lower than the estimate in the proposal. I think there are plenty of moving parameters in this project. One is the hardware cost estimate, but we should not forget that the event size, the CPU time or memory needed to generate a Monte Carlo event, the overall experiment requirements, etc. are also parameters with uncertainties of the order of 30 to 100%. If we eventually get to 2010 and the computing market has been such that prices have decreased faster than expected, good news: we will report it (as we already did with the past project and the delay of the LHC) and will propose to use the "saved" money for the 2011 purchases.
The second issue arose from a question from Les Robertson, who was a member of the board: "when do you expect PIC to run out of power?". As the CPU and storage power of the equipment is (luckily for us!) growing exponentially, the power consumption of these wonderful machines is also going through the roof. Soon the total input power at PIC will be raised from 200 kVA to 300 kVA. Though it is not an easy estimate, we believe this should be enough for the current phase of the Tier-1, up to 2010. Beyond that date, we should most probably be thinking about a major upgrade of the PIC site. Next to the UAB campus, on the other side of the highway, a peculiar machine is being built: a Synchrotron Ring. This kind of thing normally comes with a BIG plug... should we try and cross the highway to get closer to it?
Tuesday, 19 February 2008
January Availability
January 2008. This time, at PIC we were just on top of the target for Tier-0/Tier-1: 93%.
The positive reading is that we are still one of only three sites that have reached the reliability target every month since last July. The other two are CERN and TRIUMF.
The negative reading is that 93% looks like too low a figure, when we had got used to scoring over 95% in the last quarter of 2007.
The 7% unreliability of PIC in January 2008 is entirely due to one single incident in the Storage system over the weekend of 26-27 January. The Monday before (21/01/2008) had been a black Monday in the global markets - European and Asian exchanges plummeted 4 to 7% - so we still do not rule out that our failure might be correlated with that fact.
However, Gerard's investigations point to a less glamorous most probable cause for our incident: a problem in the system disk of the dCache core server. The funny symptom was that all the GET transfers from PIC were working fine, but the PUT transfers to PIC were failing. The problem could only be solved by a manual intervention of the MoD, who came in on Sunday to "press the button".
So, the "moraleja" as we call it in Spanish, could read: a) we need to implement remote reboot at least in the critical servers, b) a little sensor that checks that the system disk is alive would be very useful.
Now, back to work, and let's see if next month we reach the super-cool, super-green 100% monthly reliability that up to now only CERN has been able to reach, with apparently not much effort.
Monday, 18 February 2008
2008, the LHC and PIC
protons at CERN. At PIC we are deploying one of the so-called Tier-1 centres: large computing centres that will receive and process the data from the detectors online. There will be eleven such Tier-1s worldwide. Together with CERN (the Tier-0) and almost 200 more sites (the Tier-2s), these will form one of the largest distributed computing infrastructures in the world for scientific purposes: the LHC Computing Grid.
So, handling the many Petabytes of data from the LHC is the challenge, and the LCG must be the tool.