
Enabling the big data pipeline lifecycle on the computing continuum

DataCloud is a Horizon 2020 project that develops methods for the entire lifecycle of Big Data pipelines on diverse infrastructures for efficient processing and monitoring.



The EU-funded H2020 project DataCloud introduces a new paradigm that manages the complete lifecycle of Big Data pipelines, covering discovery, design, simulation, provisioning, deployment and adaptation across the computing continuum. It allows Big Data pipelines to interconnect end-to-end industrial operations, from collecting and preprocessing data to realising a business target, using heterogeneous infrastructures as part of a common compute continuum.

DataCloud develops novel methods to support the complete lifecycle of Big Data pipeline processing, enabling their discovery, definition, model-based analysis and optimization, simulation, deployment, adaptive run-time and monitoring on top of decentralized heterogeneous infrastructures on the Computing Continuum.


The challenge

In DataCloud, we focus on the cloud-edge continuum and aim to provide benefits by limiting data transfer. It was therefore important for us to host the deployed applications close to the source data, which led us to EGI cloud providers in Italy, Spain and Portugal. In addition, the free cloud resources available for our tests were an important benefit: they saved us valuable resources compared with running the evaluation tests on AWS or Azure.

The solution

In DataCloud, we used the EGI Cloud Compute and the cloud-based EGI Online Storage to distribute the computational task to a scalable compute platform and to store intermediate results from the user jobs.

In addition, we used both the VMOps Dashboard and the Infrastructure Manager Dashboard to configure our resources and perform our tests. The EGI Check-In was used for authorized access to both portals and to the underlying distributed compute infrastructure. DataCloud used the EGI Applications Database to configure and deploy underlying services that we needed as prerequisites.

We then set up a federated Kubernetes cluster using EGI resources from multiple cloud providers across Europe. For the federated cluster, we used Submariner to enable direct networking between Pods and Services in different Kubernetes clusters. This provides the compute continuum that the DataCloud Toolbox uses to deploy and manage Big Data pipelines.
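As an illustration of how a Service in one cluster can be made reachable from the others in a Submariner-connected cluster set, the minimal sketch below builds a ServiceExport manifest (the resource defined by the Kubernetes Multi-Cluster Services API, which Submariner's service discovery uses) and the cluster-set DNS name under which the exported Service is resolved. The service name "pipeline-step" and namespace "datacloud-demo" are hypothetical placeholders, not DataCloud's actual configuration.

```python
import json


def service_export(name: str, namespace: str) -> dict:
    """Build a ServiceExport manifest for an existing Service.

    The manifest must carry the same name and namespace as the
    Service it exports to the other clusters in the cluster set.
    """
    return {
        "apiVersion": "multicluster.x-k8s.io/v1alpha1",
        "kind": "ServiceExport",
        "metadata": {"name": name, "namespace": namespace},
    }


def clusterset_dns(name: str, namespace: str) -> str:
    """Return the DNS name used to reach an exported Service
    from any cluster in the cluster set."""
    return f"{name}.{namespace}.svc.clusterset.local"


if __name__ == "__main__":
    # Hypothetical names, chosen only for this example.
    manifest = service_export("pipeline-step", "datacloud-demo")
    print(json.dumps(manifest, indent=2))
    print(clusterset_dns("pipeline-step", "datacloud-demo"))
```

The printed JSON could be applied to the cluster hosting the Service (kubectl accepts JSON manifests); Pods in the other clusters would then address the Service through the printed DNS name, assuming service discovery is enabled in the Submariner deployment.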

For our testing we used the virtual organization. We tested our setup with cloud providers mainly in France, Italy and Spain, but also with resources in other European countries, such as Slovakia, Poland and Portugal.


Services provided by EGI

Store, share and access your files and their metadata on a global scale

Login with your own credentials

Run virtual machines on-demand with complete control over computing resources

The IM Dashboard is a graphical interface for the IM Server specially developed for EOSC users to access EGI Cloud Compute resources.

Dedicated computing and storage for training and education


Through EGI-ACE, the DataCloud team got access to European Cloud resources that were used to create distributed cloud-edge continuum testbeds for the execution of realistic scenarios from five use cases across different domains.



To test the first release of its solution, DataCloud ran an initial proof of concept (PoC) that deployed one of the scenarios in Italy, based on a 3-step pipeline and sample data, over a period of one week.


As DataCloud approaches the final release of the integrated toolbox and the final implementation of the five business cases, the team expects even more datasets to be processed in the tests by the end of 2023.

Supporting projects


EGI-ACE is a 30-month project with a mission to empower researchers from all disciplines to...