Pangeo EOSC Infrastructure for Big Data Geoscience with EGI-ACE
Many scientific domains are increasingly data-driven disciplines. Research projects crucially depend on the exchange and processing of data and have developed platforms tailored to their needs. Pangeo is a world-wide community of scientists and developers. Among other things, it provides a platform, initially developed for Geoscience, that has a huge potential to become a common open science gateway able to leverage a wide variety of infrastructures and data providers for various science fields.
All public Pangeo deployments that provide fast access to large amounts of data and compute resources are based in the USA, and members of the Pangeo community in Europe do not have a shared platform where scientists or technologists can exchange know-how.
The main objective of the Pangeo use case is to demonstrate how to deploy and use Pangeo on EOSC and underline the benefits for the European community.
The Pangeo use case has two main computing objectives:
- Creating a common platform for Pangeo European users,
- On-boarding European researchers on this Pangeo EOSC infrastructure.
Pangeo configuration for the European Open Science Cloud (EOSC)
In this section, we briefly describe how the Pangeo Europe Community, together with EGI, deployed DaskHub (composed of Dask Gateway and JupyterHub) with a Kubernetes cluster backend on EOSC, using the infrastructure of the EGI Federation.
The Pangeo EOSC JupyterHub was deployed through the Infrastructure Manager (IM) Dashboard so that future Pangeo deployments can easily be set up on top of a wide range of cloud providers (AWS, Google Cloud, Microsoft Azure, EGI Cloud Computing, OpenNebula, OpenStack, and more).
The current Pangeo EOSC infrastructure has been deployed on a Kubernetes cluster on top of OpenStack. Both the deployment of the Kubernetes cluster and the installation of the DaskHub Helm chart were done through the IM Dashboard. All the computing and storage resources are provided by CESNET in the frame of the EGI-ACE project. The figure below shows the current deployment and more information can be found at https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md.
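From a user's point of view, a DaskHub deployment of this kind is typically used by requesting a Dask cluster from Dask Gateway inside a JupyterHub session. The following is a minimal sketch of that step, not the configuration of the Pangeo EOSC deployment itself: the helper name `start_gateway_cluster` and the worker limits are hypothetical, and the sketch assumes the hub pre-configures the gateway address and authentication so that `Gateway()` needs no arguments.

```python
def start_gateway_cluster(min_workers=1, max_workers=4):
    """Request a Dask cluster from Dask Gateway and return (cluster, client).

    Hypothetical helper for illustration; the worker limits are arbitrary.
    """
    # Imported here because dask-gateway is only expected to be installed
    # inside the hub environment.
    from dask_gateway import Gateway

    # Assumption: gateway address and credentials come from the hub configuration.
    gateway = Gateway()

    # Ask the gateway to start a new cluster on the Kubernetes backend.
    cluster = gateway.new_cluster()

    # Let the cluster grow and shrink between the given bounds.
    cluster.adapt(minimum=min_workers, maximum=max_workers)

    # A client connected to this cluster; Dask computations then run on it.
    client = cluster.get_client()
    return cluster, client
```

In a notebook one would call `cluster, client = start_gateway_cluster()` and release the resources with `cluster.shutdown()` when finished.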
Pangeo 101 workshop at FOSS4G conference
The newly deployed infrastructure has been successfully used for onboarding users during a course provided at the FOSS4G conference. The training material is open source (CC-BY-4 license) and has been collaboratively developed.
The FOSS4G Pangeo 101 workshop was held on Tuesday 23rd August 2022 from 14:00 to 18:00 (Europe/Rome). 25 people from at least 4 different countries joined the workshop and used the Pangeo infrastructure. The course prompted several interesting discussions and a growing interest in using such infrastructure for analysing data.
During the workshop, attendees learned about the Pangeo ecosystem and how to do open, reproducible, and scalable Earth science on the EOSC infrastructure. Fully reproducible Jupyter Notebooks were prepared and used to teach how to:
- Access local and remote data;
- Load and analyse data with Xarray;
- Visualise data with the hvPlot interactive visualisation package;
- Understand how to scale computation with Dask.
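The Xarray and Dask steps in the list above can be sketched as follows. This is an illustrative example rather than the workshop material: it uses a small synthetic array as a stand-in for NDVI data, and the variable names, grid, and chunk size are assumptions chosen for the sketch.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for an NDVI time series (NOT the workshop dataset).
rng = np.random.default_rng(seed=0)
time = pd.date_range("2022-01-01", periods=12, freq="MS")
lat = np.linspace(45.0, 46.0, 50)
lon = np.linspace(9.0, 10.0, 50)
ndvi = xr.DataArray(
    rng.uniform(-0.1, 0.9, size=(time.size, lat.size, lon.size)),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="ndvi",
)

# Split the array into Dask chunks of four time steps each.
chunked = ndvi.chunk({"time": 4})

# Build the computation lazily: nothing is evaluated yet.
lazy_mean = chunked.mean(dim=("lat", "lon"))

# Trigger the computation; with a Dask cluster attached, the chunks
# would be processed in parallel by the workers.
monthly_mean = lazy_mean.compute()
print(monthly_mean.sizes)
```

The same code runs unchanged on a laptop or on a Dask cluster; only the scheduler that executes the chunks differs.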
Sentinel-3 NDVI Analysis Ready Data (ARD) provided by the Copernicus Global Land Service was used during the workshop. All the Python packages used during this training are open source.
Lessons learned, next steps and get engaged!
The collaboration with EGI-ACE is a success for Pangeo. It allowed us to deliver an efficient and scalable Pangeo JupyterHub that was successfully used to onboard 25 users. The feedback from users is very positive, even though many struggled with EGI Check-in. In the future, we will send information on how to join the EOSC Pangeo JupyterHub and connect with EGI Check-in well in advance (ideally 2 weeks before the workshop) so that any issues can be resolved beforehand.
Two more training events will take place by the end of 2022:
- Arctic processes in CMIP6 Bootcamp will take place in Søminestationen (Denmark) from October 11 to October 21, 2022 (37 users from 9 different countries);
- eScience Tools in Climate Science: Linking Observations with Modelling will be held from 31 October to 11 November 2022 at Tjärnö Marine Laboratory (Sweden) with remote presentations of final results on 28-30 November (30 users from 4 different countries).
In addition to training events, the Pangeo EOSC infrastructure will be used to showcase Open Science practices with a range of use cases.
In parallel with on-boarding events, the Pangeo community continues its collaboration with EGI to improve the Pangeo deployment and facilitate Open Science practices, for instance by deploying a Binder instance with a Dask Gateway and by providing a common approach to spatial data analysis, independently of data and infrastructure providers.