Building the grid: pretending it was easy

Sara Coelho and Jeff Templon look back at 15 years of milestones

The distributed computing infrastructure we now call EGI is a success. The installed High-Throughput Compute capacity surpassed one million cores earlier this year, and as of March we have about 690 petabytes of disk and tape storage available. Our federated e-infrastructure processes more than half a billion compute jobs per year and serves research communities across all fields of science.

The scale and range of these achievements are impressive but, like Rome, none of it was built in a day. Getting the EGI federation off the ground involved groundbreaking work by dozens of teams across Europe.

So what were the major steps in our story?

The testbed

The European DataGrid project started with the new century, on January 1, 2001, with the aim of developing and implementing a computational grid for data-intensive scientific computing models, to be demonstrated on a worldwide testbed.

The testbed was first presented at a project review held on March 1, 2002. During the live demonstration, a grand total of 15 jobs were submitted by scientists from the LHCb experiment and the Earth Observation and Computational Biology research communities. The 15 jobs were distributed across five computer centres: CERN, CNAF (Bologna), CC-IN2P3 (Lyon), RAL (Didcot), and NIKHEF (Amsterdam). The overall experiment was a success: the jobs consumed less than one CPU-hour in total, but the demo showed that the grid was capable of parallelising computational tasks.

The distributed e-infrastructure was born.

Scaling up from prototype

This first release of the testbed was a successful proof of concept, but the e-infrastructure needed to be scaled up by several orders of magnitude before the LHC started operation. The DataGrid project turned its attention to this challenging task.

As the testbed increased in size (more sites and more machines), problems began to appear with various components on which it was built. One example was the Information System, the list of all available services and information on how to use them. The original Globus component (the Grid Information Index Service, or GIIS) became unusable once the grid grew past seven sites.

During a lunch discussion at NIKHEF one day, somebody asked the question “how hard could it be?” (to make a stable information system). David Groep answered that question the next day with the first release of the “fake II” (later renamed the BDII). After a few years of use and a factor-of-ten increase in the scale of the grid, the BDII was re-engineered at CERN in the EGEE and LCG projects and is still in use today. The Compute Element was another example: the original could not scale past 100 running jobs per site and had to be redeveloped. It was replaced first by the LCG-CE, later by the CREAM-CE, which is itself about to be replaced as of this writing.
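The core idea of such an information index can be illustrated with a toy sketch. This is only a conceptual illustration, not the real BDII (which is LDAP-based); the site names, record fields and functions below are all illustrative assumptions. Each site publishes a small list of its services, and a top-level index merges them so that clients query one endpoint instead of polling every site.

```python
# Toy sketch of a grid information index: per-site service listings are
# merged into one flat index that clients (or a resource broker) can query.
# All names and fields here are hypothetical, for illustration only.

site_data = {
    "NIKHEF":   [{"service": "ComputeElement", "endpoint": "ce.nikhef.example"}],
    "CNAF":     [{"service": "StorageElement", "endpoint": "se.cnaf.example"}],
    "CC-IN2P3": [{"service": "ComputeElement", "endpoint": "ce.in2p3.example"}],
}

def build_index(sites):
    """Merge per-site records into one flat index, tagging each with its site."""
    index = []
    for site, records in sites.items():
        for rec in records:
            index.append({"site": site, **rec})
    return index

def query(index, service_type):
    """Return all records of a given service type, as a broker might."""
    return [r for r in index if r["service"] == service_type]

index = build_index(site_data)
ces = query(index, "ComputeElement")
print([r["site"] for r in ces])  # the sites offering a Compute Element
```

The point of the hierarchy is that sites remain authoritative for their own data, while the index only aggregates; this is what lets the system scale past the handful of sites that broke the original GIIS.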

The Service Challenges

As the e-infrastructure grew, the LHC Computing Grid (LCG) project was concerned with making sure that the grid would cope with the volumes of data expected with the start of LHC operations.

To do this, the LCG designed a series of Service Challenges: scenarios to test real production capabilities in preparation for the start of LCG operations in April 2007. The Service Challenges started in 2004, focusing the development work on specific goals, for example:

  • Demonstrating sustained data transfers between grid sites, and
  • Providing a reliable base service from all the Tier-1s.

The Service Challenges progressed with increasing complexity (not unlike levelling up in a computer game) and showed that coordination was key for everything to work smoothly.

A production infrastructure at your service

Through the LCG Project, the EGEE series and now the EGI era, the grid continued to grow in almost every dimension: number of sites, countries, users, cores, petabytes and communities. For the entire e-infrastructure to work reliably and professionally, we needed to develop the support tools necessary for smooth operations. These included, for example, the GGUS helpdesk, configuration management and the monitoring system; here we will focus on accounting.

The original accounting “system” consisted of sysadmins manually emailing the numbers recorded at their site. Such a procedure is not scalable, desirable or useful in the long term. A central database was therefore developed, along with a messaging service: sites could publish their numbers into the messaging service, to be collected into the database. Many sites used the APEL tool, developed by STFC, to generate records automatically and publish them; others published records directly from their own site-level repositories. The central database is then used to generate the summaries available through the Accounting Portal. The jobs recorded in the central accounting database go back to January 2004: about 4,000 jobs submitted by NIKHEF.
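The publish-then-collect pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the in-process queue stands in for the production messaging service, and the record fields and summarisation are invented for the example, not taken from the real APEL schema.

```python
# Minimal sketch of the accounting flow: sites publish usage records to a
# message broker; a central collector drains the broker and summarises.
# The broker, record fields and functions are illustrative assumptions.
from collections import defaultdict
from queue import Queue

broker = Queue()  # stand-in for the production messaging service

def publish(site, user, cpu_hours):
    """A site publishes one usage record to the broker."""
    broker.put({"site": site, "user": user, "cpu_hours": cpu_hours})

def collect(broker):
    """The central collector drains the broker and sums usage per (site, user)."""
    summary = defaultdict(float)
    while not broker.empty():
        rec = broker.get()
        summary[(rec["site"], rec["user"])] += rec["cpu_hours"]
    return dict(summary)

publish("NIKHEF", "alice", 2.5)
publish("NIKHEF", "alice", 1.5)
publish("CNAF", "bob", 4.0)
summary = collect(broker)
print(summary)  # total CPU-hours per (site, user)
```

The key design point is the decoupling: sites only need to publish records, while collection and summarisation happen centrally, which is what made the scheme scale where emailed spreadsheets could not.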

The start of the accounting records 15 years ago was a major milestone in establishing the grid as a mature, professional system, where sites could demonstrate precisely how much resource was used and by whom. This, in turn, is crucial for showing funding agencies that their investments are justified.

15 years later…

Now that we are in 2019, with an e-infrastructure in production supporting wonderful scientific discoveries, it is easy to be amazed at the technical achievement of putting it all together.

But we would argue that it’s not just that. Building the grid was also a great sociological experiment: getting so many different teams to sit down together and work towards one common goal was perhaps our greatest accomplishment.

More information

Sara Coelho is the EGI Foundation Communications Manager and did not know what middleware was until 2010.

Jeff Templon leads the Physics Data Processing programme at NIKHEF and has worked on grid computing since July 2001.

Issue 34
