European Grid Infrastructure

towards a sustainable infrastructure

The grid is under attack!

Sven Gabriel reports on the outcome of the 2011 Service Security Challenge

Testing IT-security incident response is like running a fire drill. In both cases you want to make sure that the response system works as expected. The only way to be sure is to try out your response plans and procedures and see whether they actually work in these special circumstances.

To do this, the EGI Computer Security Incident Response Team (EGI-CSIRT) has run a series of security drills against a large fraction of the infrastructure. Over the years these drills have evolved from exercises that tested whether a grid job could be traced back to its sender into realistic simulations of a security incident affecting multiple sites. These runs have resulted in a number of operational improvements developed by the participating teams, such as communication templates, general incident response procedures, a simple forensics how-to and specialised tools to facilitate operations.

The 2011 drill

For the 2011 Service Security Challenge (SSC5) we created a story in which the credentials of a legitimate ATLAS user were stolen and the password was cracked. The ‘compromised’ credentials were then used to deploy the ‘malware’, in this case a trojan named WOPR, using the ATLAS job submission framework PANDA. The ATLAS VO helped us by creating a parallel PANDA factory with a specially created certificate DN under which the pilot jobs were running. This ensured that incident response operations would not affect the ATLAS work.

We invited 40 resource centres from 20 countries, including ARC sites and one ATLAS site from the Open Science Grid, to participate in our drill. We then deployed the ‘malware’, creating a botnet across these resource centres to simulate a security incident on a global scale.

After the ‘malware’ was successfully deployed at all participating sites, we sent an alert to one site in the Asia-Pacific region (on 25 May, 9:00 local time) and recorded how long it took for the alarm to propagate around the globe. We also wanted to record how long it took to contain the incident (the time elapsed between detection of the incident and the moment all processes were stopped) and how long the ‘compromised’ credentials remained usable on the infrastructure.
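The three measurements described above can be expressed as simple differences between timestamps. The following is a minimal Python sketch of that arithmetic, not the SSC-Monitor itself: the function names, site names and all timestamps are hypothetical and chosen only for illustration, and every timestamp is assumed to have been normalised to a single time zone.

```python
from datetime import datetime, timedelta


def propagation_time(alert_sent, first_seen_by_site):
    """Time from the initial alert until the last site learned of the incident."""
    return max(first_seen_by_site.values()) - alert_sent


def containment_time(detected, processes_stopped_by_site):
    """Time from detection until malicious processes were stopped at every site."""
    return max(processes_stopped_by_site.values()) - detected


def credential_exposure(detected, last_accepted_job):
    """Time after detection during which the credentials were still accepted."""
    return max(last_accepted_job - detected, timedelta(0))


# Hypothetical example data (not results from SSC5).
alert_sent = datetime(2011, 5, 25, 9, 0)
first_seen = {
    "site-a": alert_sent,
    "site-b": alert_sent + timedelta(hours=2),
}
stopped = {
    "site-a": alert_sent + timedelta(hours=1),
    "site-b": alert_sent + timedelta(hours=6),
}
last_job = alert_sent + timedelta(hours=30)

print("propagation:", propagation_time(alert_sent, first_seen))
print("containment:", containment_time(alert_sent, stopped))
print("credential exposure:", credential_exposure(alert_sent, last_job))
```

Taking the maximum over all sites reflects the definitions in the text: the incident counts as contained only once the slowest site has stopped its processes.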

Once the initial alarm was received, each site was expected to inform the EGI-CSIRT, which is responsible for coordinating the response to the incident with the help of the NGIs.

As in a real case, the participating sites had very little initial information about the incident. To find out the details of the attack, the teams had to use various sources, such as analysis of the submitted jobs or network logs.

They had to estimate the extent of the incident (how many sites were affected), identify the attack vector (the methods used to propagate the malware) and determine which VO and user credentials were involved. The teams also had to contain the incident, i.e. stop all malicious processes and suspend the potentially compromised credentials at all participating sites.
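As an illustration of the first of these tasks, the extent of an incident can be estimated by grouping job records by the certificate DN that submitted them. The sketch below assumes a hypothetical record format (dictionaries with "site", "dn" and "job_id" keys) and an invented DN; real sites would extract this information from their own accounting or batch system logs.

```python
from collections import defaultdict


def affected_sites(job_records, suspect_dn):
    """Map each site to the job IDs that were submitted under the suspect DN."""
    jobs_by_site = defaultdict(list)
    for record in job_records:
        if record["dn"] == suspect_dn:
            jobs_by_site[record["site"]].append(record["job_id"])
    return dict(jobs_by_site)


# Hypothetical records and DN, for illustration only.
SUSPECT_DN = "/DC=example/CN=drill pilot"
records = [
    {"site": "site-a", "dn": SUSPECT_DN, "job_id": 101},
    {"site": "site-a", "dn": "/DC=example/CN=regular user", "job_id": 102},
    {"site": "site-b", "dn": SUSPECT_DN, "job_id": 201},
    {"site": "site-b", "dn": SUSPECT_DN, "job_id": 202},
]

hits = affected_sites(records, SUSPECT_DN)
print(len(hits), "sites affected:", sorted(hits))
```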

From Russia with speed

The Russian team led by Eygene Ryabankin, a security expert based at the Kurchatov Institute, detected the drill activity at an early stage, in fact even before receiving the official alarm. They were able to track us down, identify the EGI-CSIRT team as the source of the attack, and stop our ‘malicious’ activity by shutting down the Command & Control centre. All in less than four hours.

Eygene’s quick response is great news if we ever face a real incident. For the purposes of the drill, however, his team was just too fast, and we had to explain to them that this was part of the SSC5 exercise and that the system was needed to continue the security drill.

In general, a key step in these situations is for all sites to suspend the DN of the ‘attacker’ on their systems, so that the credentials cannot be used on the infrastructure. This was recognised as an area for improvement: some sites did not manage to ban the attacker, and we were still able to submit jobs to them via regular gLite job submission. Other mechanisms to control user access were also used. These have the advantage that they can be applied centrally, but they need some time to take effect. For example, suspending the user at the VO has a latency of up to 24 hours, and certificate revocation information should reach the sites within six hours. Nevertheless, due to local settings, the update of the certificate revocation information might take up to 48 hours; this is currently being investigated.
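The latency figures above give a rough upper bound on how long a credential might remain usable after a central action has been taken. The following minimal Python sketch uses only the figures quoted in this article; the mechanism labels, function names and example timestamp are illustrative assumptions and do not correspond to any EGI tool.

```python
from datetime import datetime, timedelta

# Maximum latencies as quoted in the article; the keys are illustrative labels.
MAX_LATENCY = {
    "vo_suspension": timedelta(hours=24),
    "crl_nominal": timedelta(hours=6),
    "crl_worst_case": timedelta(hours=48),
}


def usable_until(action_time, mechanism):
    """Latest time at which the credential may still work under one mechanism."""
    return action_time + MAX_LATENCY[mechanism]


def blocked_by(action_time, mechanisms):
    """Upper bound on when the credential is blocked if all mechanisms start together.

    The credential stops working as soon as any one mechanism takes effect,
    so the bound is the earliest of the individual bounds.
    """
    return min(usable_until(action_time, m) for m in mechanisms)


# Hypothetical time at which the central actions were triggered.
action_time = datetime(2011, 5, 25, 9, 0)
for name in MAX_LATENCY:
    print(name, "-> usable until", usable_until(action_time, name))
```

A ban applied locally at each site does not appear in the table because the article gives no latency for it; the comparison only shows why the central mechanisms alone leave a window of hours to days.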

One of the nice things about the 2011 security drill was the community involvement. Almost all 40 resource centres participated actively in the drill and sites are already volunteering for the next exercise scheduled for 2012.

Given the positive feedback, it seems we managed to put some fun into these drills.

SSC5 - the movie

All activities were recorded by the SSC-Monitor and summarised in a movie called "48h of incident response in 5 minutes".

 

Security drills, step-by-step

Security drills may be organised in many ways, but they all address the same key steps of incident response:

 
