Wednesday, 16 December 2009

Last day of LHC running this year

After so much celebration of first days of LHC running, it is time today to celebrate the last day of LHC running... this year. In a few hours the LHC will be switched off and the accelerated protons will go on holiday until next year.
It has been a very nice and long-awaited time since the experiments started taking collision data on 23rd November. Today the LHC goes on holiday, but the WLCG does not. This piece of distributed infrastructure we have been building over the last six years should stay up and running 24x7 so that the precious data taken can be processed, re-processed, re-re-processed and so on. Somebody said that "data can be equated with money that has value only if it is used and circulated". So this is what we will be doing in the next weeks: giving value to the LHC data. This will not yet be hunting the Higgs, but less sexy minimum bias soft QCD events... but still, LHC physics after all.
At the PIC Tier-1 we will keep a careful eye on the services to ensure maximum availability and efficiency.
For the moment, what can we say about PIC's performance during "the month in which the LHC started" (aka November 2009)? We just received this Christmas gift from the official WLCG availability reports:
  • PIC availability and reliability for OPS VO = 100%
  • For ATLAS VO: 98% availability and 100% reliability (only ATLAS Tier-1 with max score)
  • For CMS VO: 100% availability and reliability (FZK also got max score for CMS)
  • For LHCb: 98% availability and 99% reliability (only CERN got 100% for LHCb)

Sunday, 22 November 2009

Outreach in the school

Last week was the "week of science" in Spain. This happens every year around mid-November and consists of one week packed with activities aimed at explaining science to the public. In Catalonia, one of the organised activities is a series of talks by scientists in schools. Last Wednesday there were 100 simultaneous talks in different schools all around Catalonia. I visited a secondary school in Badalona where I had a great time talking about the LHC and the origin of the Universe to around 70 students. I see now that they even posted an entry in the blog of the school!
Nice to see that Catalan schools are in the blogosphere... and that they had a nice time listening to my LHC stories.

Real data flowing through PIC

ATLAS has provided a nice monitoring page where we can follow the progress of data distribution in these exciting moments of first circulating beam in the LHC. These are not collisions yet, but it is real data indeed. After so many years of simulations, we are happy to see the first Megabytes of real stuff. In the picture, I have just captured the current status of the dataset distribution to the Tier-1s and from there to the associated Tier-2s. The overall picture looks pretty green, which is good news. PIC received the subscribed data with no problems and promptly redistributed it to the Tier-2s. It looks like the data movement went mostly smoothly. Let's keep an eye on this. We will see the rates growing in the next days.

Circulating beam in the LHC (take two)

So, there we go. Last Friday, 20th November, beams circulated again inside the LHC, after one long year of repairs. Everyone is happy and bottles of champagne (or cava) are being opened in the control rooms. In the picture you can see, besides the party atmosphere at the LHC control room, the first event displays from ATLAS, CMS and LHCb. The hundreds of tracks coming from the collimators where the beams are splashed can be clearly seen in all of them. We are watching the first LHC data.
Commencing countdown, engines on...

Tuesday, 10 November 2009

LHC beam approaching CMS!

Last Saturday evening, the 7th of November 2009, at around 8 p.m., after passing through the LHCb detector, for the first time since last year's incident, protons arrived at the doorstep of the CMS experiment, thus completing half the journey around the LHC's circumference.

Low energy protons from the LHC were dumped in a collimator just upstream of the CMS cavern. The calorimeters and the muon chambers of the experiment saw the tracks left by particles coming from the dumping point (a so-called 'splash event', see images). During the rest of the weekend, bunches of protons were also sent in the clockwise direction passing through the ALICE detector and were dumped at point 3.

All detectors saw 'splash' events on their monitoring pages. Castor and the Preshower detectors saw particles for the first time! Some beautiful pictures from the events seen:

Monday, 19 October 2009

Data loss

These days we are hearing more often about data loss events at the WLCG sites. Today it was the NL-T1 site that reported some data loss in the daily Operations meeting. Apparently, the tape drive loaded the tape and, instead of reading it, just destroyed it. A similar event happened to us at PIC at the end of September, when we lost a tape containing 214 files from CMS. Nothing could be done with that piece of hardware... not even rewinding it! Luckily for us, all of those files were replicated at some other Tier-1 or at CERN, so we could fix the problem quite quickly.
We used to think of tapes as a safe medium for data... but these episodes of tape destruction show that this is not always the case. A bit scary.
Anyhow, even if it does not solve anything, it is nice to see that we WLCG sites are not the only ones losing data. Even Microsoft loses some data occasionally!

Friday, 25 September 2009

August availability: on target, but...

Last August, as one might expect after a busy July with the CPU occupied at almost 100%, many of the LHC experiment production managers and computing guys went on (deserved) holidays. Accounting data is being collected these days, and the results for PIC show that only about 40% of the installed CPU was used. Curiously, the experiment share of this workload was highly non-nominal: LHCb, which represents only about 10% of PIC pledges, was the biggest consumer: up to 60% of the delivered CPU cycles were for them. ATLAS got essentially all the remaining 30%, while CMS did essentially zero. So, for both ATLAS and CMS, August was the month with the lowest CPU consumption of the year, while for LHCb it was a record-breaking month.
PIC Tier-1 availability during August was right on the target: 97%. About 1% of the unavailability was due to the usual monthly Scheduled Downtime, which took place on the 25th August. Most of the remaining 2% unavailability was also spent on that same day, which suggests that there is still room for improvement in SD coordination. One of the A-critical services, the site-bdii, was off for almost 4h after its scheduled intervention... and none of us noticed! There was also an issue with the Computing Service (B-critical), which had its queues closed for 2h longer than planned. We should now feed this experience back into our operations system and make sure the relevant procedures are improved.
Besides that, on the 19th of August instabilities appeared in the OPN link which affected the SRM service, especially for outgoing transfers. The problem disappeared in about 24h, but we never found out what had really happened. The Spanish NREN did not answer our query for information. At first we thought this was an August effect, but later we realised the problem was that our e-mail contact for operational issues in the network was wrong. We have corrected this, and the e-mail address we have now should even trigger a ticket opening automatically.
The good news for August was that, despite it being one of the hottest in several years, the cooling system of the PIC machine room coped perfectly. It seems that the new maintenance team did a good job in preparing the system for the summer campaign.
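The availability arithmetic in this post can be sketched in a few lines. This is only an illustration: it assumes the simplified definitions usually quoted for WLCG reports (availability counts all downtime, reliability excludes scheduled downtime), and the hour values are derived from the percentages quoted above for a 31-day month, not taken from the official report.

```python
# Simplified availability/reliability arithmetic (assumed definitions,
# not the official WLCG calculation, which also handles "unknown" periods).

def availability(total_h, scheduled_h, unscheduled_h):
    """Fraction of the whole period in which the site was up."""
    up_h = total_h - scheduled_h - unscheduled_h
    return up_h / total_h

def reliability(total_h, scheduled_h, unscheduled_h):
    """Same as availability, but scheduled downtime is not counted against the site."""
    up_h = total_h - scheduled_h - unscheduled_h
    return up_h / (total_h - scheduled_h)

# Illustrative numbers: August has 744 hours; 1% scheduled and 2% unscheduled
# downtime correspond to about 7.4 h and 14.9 h respectively.
august_h = 31 * 24
print(f"availability: {availability(august_h, 7.44, 14.88):.1%}")
print(f"reliability:  {reliability(august_h, 7.44, 14.88):.1%}")
```

Seen this way, a four-hour unnoticed outage of a single critical service costs roughly half a percentage point of the monthly availability.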

Friday, 18 September 2009

One Petabyte knocking at the door

This is just a micro-post, to show you who I found two days ago knocking at PIC's door just when I was leaving the office.
Yes! the long-awaited Petabyte that will push our capacity up to the 2009 MoU pledges was there.
The path we have to walk from these nice wooden boxes to the WLCG SRM service will be a tricky one (the first obstacle seems to be that the big disks are reluctant to get to Barcelona).
Busy weeks ahead... but the goal is clear: fill up those disks with LHC data.

Friday, 28 August 2009

Believe us, PIC is Ok

Scary, isn't it? PIC availability has been bully red in the last 24h but, sadly, there was not much we could do.
We have always said, and still strongly believe, that SAM tests are a very good thing. Actually, I firmly believe that they have been one of the key ingredients of the WLCG success. Success here means the evolution from the "the Grid does not work" situation with 60% job success rate we had a few years ago to the routine >97% availabilities we are used to seeing these days. But yes, not even SAM tests are perfect. There has always been a dark corner inside them: the much-questioned "SE test inside the CE", or lcg-rm test. And inside this controversial test there is another, smaller corner which is a bit darker still: the file replication to CERN test. This was the one that started flickering on Tuesday at PIC, and it has been failing consistently for more than 24h. This test tries to copy a file sitting at PIC into one very specific DPM server at CERN. This particular connection was timing out for us, while any other transfer to any other site, even to any other CERN storage server, was working. This was strange enough that we asked our CERN colleagues for help. Today they came back with the good news: problem found, a problematic router.
Got a really puzzling error? Bet on the network...

Friday, 21 August 2009

July: good availability comes with CPU delivery record

These days most people are either just back from holidays (me), still out (lucky them) or neither (also lucky, since they will leave later...). Anyhow, as we all know, life at the Tier-1 is 24x7, so it doesn't matter that we are in the middle of August and it is near 40 degrees Celsius out there: it is time to report on service performance in the past month.
We just got the WLCG reports for July and the results for PIC are pretty positive: 99% availability and reliability. The little 1% that keeps us away from our beloved 100% happened on the 29th July, around noon. For four hours all of the Computing Elements at PIC were failing the SAM Job Submission tests, so the Tier-1 service was definitely affected during that period. The source of the problem was found to be a pretty mysterious one: the switch connecting the servers hosting Virtual Machines to the PIC LAN (VMs are always a bit mysterious, aren't they?). Actually, the source of the problem was not found; it just disappeared when that switch was replaced by a new one (different brand, no names here to avoid anti-propaganda :)
Regarding the availability of PIC as seen from the experiment-specific monitoring, we also got very good results for ATLAS and LHCb, close to 100%. However, the result for CMS was not that good: a mere 90%. This funny asymmetry was due to a bug we introduced by mistake in the Torque queue ACL configuration, which blocked CMS submissions to our short queue for 3 days (6-9 July). Somebody could ask why we did not notice this in 3 days... We should prioritise deploying the WLCG-Nagios in production. It will surely help to reduce these periods of unavailability.
So, besides this unfortunate CMS-blocking bug, we can say July was a pretty good month for the Tier-1 in terms of availability. Thanks to this, and also to the job submission hyperactivity seen from ATLAS and LHCb, we delivered a record amount of CPU cycles during July: around 70,000 ksi2k·days, which is very close to keeping 100% of our resources busy for the whole month.
Now things look quiet, with many people still away. PIC is up and running, cooling is ok (even with the heat wave out there)... but watch out for the VM-networking ghosts, they could come at any time.

Thursday, 18 June 2009

Welcome Valparaiso!

The Universidad Técnica Federico Santa Maria (UTFSM) of Valparaiso (Chile) has joined ATLAS Computing and has been associated with PIC as its Tier-1 center. They are a small center for now (Tier-3) but will surely grow soon. The first transfers using the full chain of the ATLAS Data Management System have been successfully tested. We welcome UTFSM to PIC and to the distributed computing world!

Monday, 15 June 2009

WLCG proposed a test period where all the VOs could exercise their computing models all together in a realistic scenario before data taking. This exercise was named STEP09. The challenge ended on Friday at 23:00 CET. A lot of activity happened during these last two weeks and the ATLAS Computing Model was fully exercised (finally!):

a) Huge load of MC production, exercising the farms and the data aggregation from the T2s to the T1. The job types were basically G4 simulation jobs and AOD merging.

b) Huge load of Analysis jobs that were racing against the MC ones. I'd like to emphasize that this activity was, overall, quite well *disorganized*... which was good, as it probably simulated quite well the impact of the uncontrolled user analysis that everyone will get very soon. There were general problems with Athena build jobs (compatibility libraries), the PanDA brokerage was bypassing the ATLAS release check at the sites (spotted last Thursday), and the protocols used to get data were basically lcg-cp for the File Stager and dcopen/file for remote connection... There is plenty of room for improvement in the Pilot Jobs based User Analysis; hopefully some tests will be performed in the coming months (use "natural SE" protocols for the FS, tune the Read Ahead Buffer, which was set to 32Kb during STEP, etc.)

c) Non-stop reprocessing: this was the primary activity at the Tier-1s, and was focused on performing multi-VO reprocessing at the same time for those Tier-1s serving more than one experiment. That was achieved: ATLAS, CMS and LHCb managed to use the robots at the same time for several days. Six ATLAS sites managed to pass the metric: reprocess at five times the data taking speed assuming 40% LHC machine efficiency. And three sites managed to get gold stars, for which the metric was: reprocess at five times the data taking speed assuming 100% LHC machine efficiency. PIC obtained a gold star.

d) Data distribution: this was broadly tested on top of the processing activity. Merged AODs were pre-placed at the sites (T1-T1-T2s), and Functional Tests kept running all the time during the two weeks. There were no major incidents, although some sites had some instabilities during STEP: disk space running out, gridftp doors overloaded (intensive LAN and WAN activity). Overall, though, the data flow has shown its robustness and a major improvement with respect to two years ago.

STEP showed that manual work is still required, but we managed to spot and learn which things would be the most demanding ones. The target now is to work in this direction and deploy a scalable operations model before the start.

Concerning our cloud, I'd say the exercise was a real success! I want to thank every party involved. The sites have been stable (no major issues), which allowed STEP to run in a quite intensive way. I've barely seen an idle CPU all over the cloud, leaving the schedulers to battle with the different roles and jobs. The data transfer rate has also been pretty high, and the output from PIC to the Tier-2s has been quite stable at around 150-200MB/s.

MonteCarlo production ran between 1k and 1.5k jobs during STEP, and the efficiencies reached were well above 90%.

On the other side, User Analysis jobs were also filling the remaining slots; see the snapshot for the Pilot-based jobs (the ones directly submitted through WMS are not accounted for). Efficiencies ranged between 70% and 80% depending on the sites playing the game. There were some issues (all understood!) that prevented pilot user analysis jobs from finishing correctly.

PIC reprocessed the data twice; job efficiencies and the data flow from the robot to the buffers, and finally to the WNs, showed very good performance. Pre-staging managed to keep up with 500 concurrent jobs, which is beyond the required reprocessing speed for a small Tier-1. The interesting point is that reprocessing outputs were written to tape as well, as the computing model says, and we found no major problems with the simultaneous usage of drives for reading and writing.

What we learned during STEP is that, once again, storage services are the most sensitive layer of the infrastructure. A small outage on the SE induces a general failure of every single activity. Storage services will have to be better dimensioned at those sites that had instabilities under constant load. We also learned that disk space is as volatile as it is in our laptops, so a good forecast of the storage needed versus the desired data is mandatory. It is true that ATLAS has to be more aggressive in data deletion, but this is in the hands of the physics groups. I'd say that we should reinforce the federated structure of our Tier-2s, take advantage of it for the data distribution, and always keep some disk in reserve in case of crisis. In the near future ATLAS will provide an interface for modifying the shares at the sites, so that they can be dynamic and not human-based. DDM will also cross-check whether the space token has enough free space before the data is shipped, hence preventing replication in case of shortages.

Once again my feeling and understanding is that our cloud is ready for data taking, let's hope data start to flow soon...

Monday, 8 June 2009

Tier-1 reliability in March and April

Now that everybody is talking about hot topics such as STEP09, let us take a break here to review some pretty outdated stuff. Sure, we will be posting about STEP09 shortly, but reviewing the monthly reliability scores and, most importantly, highlighting the causes of failures as homework to improve the service in the future has always proven to be a very useful exercise. Let's go then :-)
In March PIC scored 99% in reliability and 98% in availability. Pretty good, and above target. The missing reliability was caused by the jobs from a MAGIC user that filled up the local disk of the WNs, turning them into black holes. Interesting lessons to learn here: protect ourselves from users filling up the disk (they will always do it, even if not deliberately) and minimise the impact that one user can have on the rest of the PIC user community.
April was not such a good month. Our reliability was above target (98%) but our availability was not (92%). The cause of the latter was the yearly electrical maintenance of our building. We scheduled a two-day downtime, and this brought us below the availability target for the Tier-1s. During this SD we tested a reduced downtime for the LHCb-DIRAC service. We managed to stop this service for just about 8 hours, so next time we should apply this to the other Tier-1 critical services.
The missing reliability in April was caused by a pretty bizarre problem in the SRM server. For some days we suffered huge overloads. The cause was finally found to be in the configuration of the dCache postgresql DB, in particular the scheduling of the "vacuum" procedures in the background. Using the "false" flag as recommended in the documentation was the origin of all our problems. After this incident, we have learnt quite a lot about postgres vacuum configuration and are pretty sure we are safe now. We have also learnt that trusting the documentation is not always a wise thing to do :-)

Monday, 25 May 2009

Important upgrades in last week SD: Improving the Computing Service

Last Tuesday, 19th May, we had a Scheduled Downtime in which quite a lot of important interventions were performed, aiming to improve the performance and reliability of some of the PIC services.
One of these interventions was the connection of the HP c7000 bladecenters to two stacked 10GE switches, using a configuration already in place and originally designed for the dCache disk servers. The resulting bandwidth for the Computing LAN will be an average of 1.78 MB/s/core in the switch-router uplink and 3.9 MB/s/core in the bladecenter-switch uplink (after connecting each bladecenter with 4x1GE). One of the good things about this LAN infrastructure is its scalability, so we will keep an eye on the cacti monitoring of these links to anticipate whether we need to scale up.
Another important intervention, also affecting the Computing Service, was the migration of the NFS shared software area to new and much more robust hardware: a FAS2020 cabinet with SAS disks. This will not solve all the inherent problems that an NFS shared area brings to our lives, but at least it will let us sleep a bit more relaxed until a more scalable solution for VO software access from the WNs arrives.
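The per-core bandwidth figures quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below rests on assumptions that are not stated in this post: 128 cores per bladecenter (16 blades with 8 cores each), a switch-router uplink of 2x10GE, and roughly 1400 cores in total (the figure given in the CPU ramp-up post). With those inputs the numbers come out close to the ones quoted.

```python
# Back-of-the-envelope bandwidth per core. All capacities and core counts
# passed in below are assumptions used only to illustrate the arithmetic.

def mb_per_s_per_core(n_links, gbit_per_link, n_cores):
    """Aggregate link capacity in MB/s (decimal units) shared evenly among cores."""
    total_mb_per_s = n_links * gbit_per_link * 1000 / 8
    return total_mb_per_s / n_cores

# Bladecenter to switch: 4x1GE shared by an assumed 128 cores.
blade_uplink = mb_per_s_per_core(n_links=4, gbit_per_link=1, n_cores=128)

# Switch to router: an assumed 2x10GE shared by about 1400 cores.
router_uplink = mb_per_s_per_core(n_links=2, gbit_per_link=10, n_cores=1400)

print(f"bladecenter-switch: {blade_uplink:.2f} MB/s/core")
print(f"switch-router:      {router_uplink:.2f} MB/s/core")
```

The same function gives a quick estimate of how the per-core share shrinks when more cores are added behind an unchanged uplink, which is the kind of scaling question the cacti monitoring is meant to answer.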

Friday, 8 May 2009

CPU ramp up

This week, on Monday 4th May, the capacity of the Computing Service at PIC saw a substantial increase. The number of available cores almost doubled in one go, so now we have a total of about 1400 cores. This corresponds to the deployment of 90 new (blade) servers, the MoU-2009 purchase of the Tier-1.
These new Worker Nodes have L5420 Intel Xeon processors, which should give us a better ratio of computing power to power consumption. This figure is important these days, when input power issues appear everywhere you go.
The first thing we wanted to check when powering on this new capacity was how stable the temperature inside the machine room was, and it looks like this has been ok. The other interesting issue is to see how well the Torque and Maui servers scale when doubling the number of nodes. We will also need to keep an eye on the scaling of the CEs...

Thursday, 26 March 2009

February reliability, VOs joining the party

We should try and report about our scored availability and reliability for the past month of February... better before March is over!
Last month we got pretty good results for the OPS SAM tests: 98% availability (the missing 2% being basically the Scheduled Downtime on the 10th Feb) and... (drums in the background)... 100% reliability!
Yes. There we are, at the top of the ranking. Well, actually CERN, FNAL, BNL and RAL did also score max reliability last month, together with us.
Somebody might argue (the experiments, actually) that these are OPS VO tests, and so do not reflect reality. Actually, the opposite was true until not long ago. The experiments' SAM tests were still being put together, and they were showing lots of false negatives.
Anyway, the results for the VO-specific SAM tests in February were also pretty good at PIC: 98% reliability for ATLAS, 99% for CMS and 96% for LHCb. The unreliability detected by ATLAS and CMS is completely real, we must admit. This was an incident we suffered on the 22nd February (Sunday, by the way) with our "sword of Damocles", also known as the SGM-dedicated WN. Some jobs hung there, completely blocking further execution of the SGM jobs from the VOs. Luckily, Arnau reads his e-mail during weekends, and the problem was detected and solved quite fast (thanks!).
The LHCb reliability number is not really reliable, since their SAM test framework had some hiccups in the first 5 days of the month.
We recently had the WLCG Workshop in Prague, where we heard big complaints from CMS saying that the reliability of the Tier-1s is not good at all. The numbers they showed were indeed quite bad, and this is mostly due to the fact that they are adding extra ingredients to their reliability calculation. In particular, they use the results of routine dummy job submissions (JobRobot), and it happened that, even if our CMS SAMs were solid green, the JobRobot was more reddish.
I think it is good that the experiments make their site monitoring more sophisticated. However, for this to be a useful tool in improving reliability... they first need to tell us!
Now we are

Monday, 16 March 2009

CMS cpu consumer, back to business

Looks like someone in CMS was reading this blog, since a few hours after the last post, on Saturday evening, CMS jobs started arriving at PIC. We have seen a constant load of about 300 jobs from CMS since then. Not bad.
Apparently these are the so-called "backfill" jobs. All the Tier-1s but us (and Taiwan, down these days due to a serious fire incident) started running these backfill jobs in early March. After a bit of asking around, we found out that PIC was not getting its workload share because the old 32bit batch queue names were hardcoded somewhere in the CMS system (we deprecated the 32bit queues more than one month ago!), plus they had a bug in the setup script that got the available TMPDIR space wrong.
Good that we found these problems and that they were promptly solved. Now CMS is back to the CPU-burning business at PIC. ATLAS is still debugging the memory-exploding problem that stopped jobs being sent to PIC about one week ago. It looks like we are close to the solution (missing packages) and we will soon see both experiments competing again for the CPU cycles at PIC.

Saturday, 14 March 2009

CPU delivery efficiency

These days we are collecting the accounting data for February 2009. Looks like we reached a record figure for CPU delivery efficiency last month. The 3 LHC experiments used up to 80% of the total CPU days available at PIC: almost 37,000 ksi2k·days. This was largely thanks to ATLAS, who consumed around 80% of those CPU days. LHCb used just above 15% and CMS a mere 5%.
So, well done to ATLAS. It is true that most of that load is not "Tier-1 type jobs", but just a contribution to the experiment MonteCarlo production. Anyway, it is better that Tier-1 resources are used for simulation rather than staying idle consuming electricity, heating the computing room and watching their 3-year lifetime pass by (at a rate of about 6 kEur/month).
From our point of view the Panda system used in ATLAS, which implements the now so-loved pull model for computing (or pilot jobs), is definitely doing a good job of consuming all available CPU resources.
Unfortunately, not everything is so nice. This last week we have seen the CPU utilisation at PIC decrease quite a lot. The ATLAS Panda system was not sending jobs to PIC, and we discovered this was due to a problem with the Athena software running on a 64bit OS. Suddenly the production jobs running on PIC's SL4/64bit WNs exploded in memory utilisation and were eventually killed by the system. The experts are working now to understand and fix this; we hope they find a patch soon.
Let's see when CMS and LHCb implement efficient CPU-consuming systems similar to that of ATLAS, so that we can benefit from being a multi-experiment Tier-1.
Meanwhile, at PIC, idle CPUs are transforming electricity into heat. Waiting for ATLAS to cure their 64bit indigestion.
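The accounting arithmetic behind these figures is simple enough to sketch. The example below only uses the rounded numbers quoted in this post (37,000 ksi2k·days delivered, 80% utilisation, and the approximate 80/15/5 split), so its output is indicative; the implied installed capacity in particular is a derived estimate for a 28-day month, not an official figure.

```python
# CPU accounting sketch based on the rounded figures quoted in the post.

def cpu_accounting(delivered, utilisation, shares, days_in_month):
    """Derive available CPU, implied capacity and per-experiment usage.

    delivered     -- CPU delivered in the month (ksi2k·days)
    utilisation   -- fraction of the available CPU that was actually used
    shares        -- fraction of the delivered CPU used by each experiment
    days_in_month -- length of the accounting period in days
    """
    available = delivered / utilisation
    return {
        "available": available,
        "capacity": available / days_in_month,
        "per_vo": {vo: delivered * share for vo, share in shares.items()},
    }

february = cpu_accounting(
    delivered=37000,
    utilisation=0.80,
    shares={"ATLAS": 0.80, "LHCb": 0.15, "CMS": 0.05},
    days_in_month=28,
)

print(f"available: {february['available']:.0f} ksi2k·days")
print(f"implied capacity: {february['capacity']:.0f} ksi2k")
for vo, used in february["per_vo"].items():
    print(f"{vo}: {used:.0f} ksi2k·days")
```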

Friday, 27 February 2009

January availability, hit by CMS and cooling

So, let's try and have a look at the Tier-1 availability and reliability during the past month of January... at least before February is over!
The result for the reliability was 98%, just above the target for the Tier-1s (which is now 97%). This looks quite ok, but it is worth noting that our colleagues at the other Tier-1s are doing a very good job at their sites, so with 98% we got the 8th position in the ranking, tied with TRIUMF. Four centres got 100%, and three more got 99%. Quite impressive, isn't it?
Our 2% unreliability was mostly caused by the incident in the SRM service we had on January the 21st. On that day we saw our dCache SRM server flying in the sky with loads up to 300. The cause was traced to a bug in the CMS job software that made the jobs issue recursive srmls queries to the SRM. Once again, we saw how easy it is to suffer a DoS from an innocent user and how little we can do to protect ourselves against it.
For the availability, however, we scored pretty low in January: 92%, well below the 97% target. This was caused by the cooling incident we suffered on Saturday 24th January. After shutting down PIC on Saturday at noon, we did not bring it back up again until Monday morning. Here we see how fast the availability goes down on weekends :-)
With the LHC start around the corner, we should definitely be operating our full 24x7 now. As we see, we are not quite there yet, so now it is time to think about the last step in the 24x7 journey (one could call it MoD-Phase3) that implements the required coverage for the critical services. First thing to do: ensure we have a proper definition of criticality.

Friday, 6 February 2009

When the user becomes the enemy

Our poor SRM service has been the victim of a couple of user attacks in the last days. The user is always an innocent scientist somewhere trying to do some HEP research, who at some point starts hammering our SRM with requests that overload the system. It happened to us on the 21st January with CMS, whose jobs suddenly started issuing recursive srmls queries due to a bug. This overloaded our SRM service so that it could not handle other requests properly. Another event happened at the beginning of this week, when an ATLAS user from Germany started requesting a single file at PIC thousands of times. This was also traced to a bug in the ATLAS Grid job framework. Even if the culprits are innocent, we still need to protect ourselves against these events. And as of today there is no clear way to do it. We will need to work on splitting the SRM servers among VOs as well as on being able to limit requests to the server in some way.
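To make the idea of "limiting requests to the server in some way" a bit more concrete, here is a minimal sketch of a per-user token bucket. It is a generic technique, not a feature of dCache or the SRM; the class names, the rate and the burst size are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
# Generic per-user token-bucket limiter (illustrative only, not an SRM feature).

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerUserLimiter:
    """Keeps one bucket per user so a single user cannot starve the others."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, user, now):
        bucket = self.buckets.setdefault(user, TokenBucket(self.rate, self.burst))
        return bucket.allow(now)


limiter = PerUserLimiter(rate=1.0, burst=3)
# A user hammering the service: only the first three requests pass.
hammering = [limiter.allow("hammering-user", now=0.0) for _ in range(5)]
# Another user arriving at the same moment is unaffected.
other_user_ok = limiter.allow("other-user", now=0.0)
```

In a real service the timestamps would come from a monotonic clock, and rejected requests would be answered with a "try again later" style error rather than silently dropped.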

NFS, the batch killer, and the windy datacenter

Again, it has been a long time since we last posted in this blog. This is not because nothing is going on here at PIC. On the contrary, too many things are going on, leaving little time to write them down here.
Anyway, I will try now to briefly review the major issues we had in the last weeks. We had a couple of remarkable service problems. The first one, and the most severe, affected the Computing service during Christmas. The problem started on the 19th December 2008, and it could not be fixed until the 12th January 2009. The origin of the problem was LHCb and CMS jobs which were accessing SQLite through NFS. This is known to be a bad practice, since it can effectively hang the processes accessing NFS. This is what happened at PIC. The batch system quickly filled up with hung, unkillable jobs, which in a few days completely blocked the service. The batch master saw all of the WNs with high load, so it could not deliver new jobs to run. We contacted the experiments and asked them to stop using SQLite through NFS, but we also learnt a useful lesson: we are missing very important monitoring.
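As an illustration of the kind of monitoring that was missing, the sketch below lists processes in uninterruptible sleep (state "D"), which is where processes stuck on NFS typically end up on Linux. It reads the standard /proc/<pid>/stat files; the function names are made up for this example, and a real probe would also need a threshold and an alarm channel.

```python
# Minimal sketch: find processes in uninterruptible sleep ("D" state) on Linux.
import os

def parse_state(stat_line):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat.

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so we split on the first "(" and the last ")".
    """
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    pid = int(stat_line[:lpar].strip())
    comm = stat_line[lpar + 1:rpar]
    state = stat_line[rpar + 1:].split()[0]
    return pid, comm, state

def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, command) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_state(fh.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(f"{pid}\t{comm}")
```

Run periodically on every WN, a growing count of such processes would have flagged the problem long before the batch system filled up.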
The second problem arrived on Saturday the 24th January around noon. There was a huge wind storm affecting the whole of Spain and the south of France. Among other incidents, this caused disruptions of the electric supply at the PIC building. The UPS system properly dealt with these short power cuts, but unfortunately the cooling system didn't. It stopped, and did not start again automatically. The consequence was a fast temperature increase in the room. Fortunately, we could stop most of our servers gracefully, so the restart on Monday was quite smooth. In any case, more lessons learnt: more monitoring is needed (a proper high-level temperature alarm), and operational procedures must be in place, both for stopping the service asap and for being able to start it again as soon as conditions are re-established. We should not forget we have to meet the MoU reliability metrics.