The status of the data transfer pilot in the PaNOSC cluster project

Giuseppe La Rocca and Jean-François Perrin update us on the PaNOSC data transfer pilot

In the context of the PaNOSC cluster project, the EGI Foundation was invited to contribute cloud and storage resources, together with high-level data management solutions, to help members of the project perform remote data analyses using the EGI Notebooks service in the EGI Cloud. The overall objective of this data transfer pilot was to help scientists working at one of the PaNOSC facilities to retrieve data, analyse the datasets obtained during an experiment, and store the results back at the same facility. Datasets used during the analysis may be under embargo, which means that the data at the facility can only be accessed by users with sufficient authorization (e.g. the user has to be part of the proposal team).

After more than one year of activities led by ILL in the “WP6 – EOSC integration” working group of the project, this article presents the current status of the pilot, including the services and resources allocated by the EGI Foundation and its partners. To support the data transfer pilot, the EGI DataHub service, based on the Onedata software stack, was adopted to federate datasets and enable data replication across different PaNOSC facilities. For this particular pilot, the datasets were provided by the CERIC-ERIC facility. Implementing the data federation required the installation of three Oneprovider instances (two at CERIC-ERIC and one at CESNET-MCC) and one Onezone instance at CESNET-MCC. A total of 50 vCPU cores, 112 GB of RAM and 800 TB of block storage were allocated by the CESNET-MCC cloud provider to support the pilot. To secure access to services and datasets, user credentials issued by UmbrellaID were adopted.

The adoption of UmbrellaID as the Community Proxy for the project helped to implement the authentication and authorization mechanism and to ensure that only authorized users from the PaNOSC facilities can access datasets under embargo. For the analysis of the datasets, EGI contributed the EGI Notebooks service. For this pilot, the service was tailored to support UmbrellaID authentication and to allow scientists to perform agile analysis of the relevant datasets provided by the project and exposed through the EGI DataHub service. The PaNOSC Data Transfer Pilot high-level architecture is shown below.

For demonstration purposes, a Jupyter notebook was also prepared to read HDF5 datasets made available by the PaNOSC facility through the EGI DataHub volume space, analyse the datasets using the h5glance library developed by European XFEL, and produce a plot as a result. The resulting plot, generated in the user’s space of the Jupyter notebook, was also sent back to the facility through the Oneprovider service instance enabled for write access. Upon request, system administrators can sync the results of the processing to the Central Archive service of the PaNOSC facility. The Jupyter notebook used for testing this pilot use case is available in this GitHub repository.
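
To give a flavour of the workflow described above, here is a minimal sketch of the read, inspect and plot steps. It is not the pilot notebook itself. It assumes the h5py, NumPy and Matplotlib packages, writes a small synthetic file to a temporary directory in place of a real facility dataset, and uses hypothetical names throughout (`entry/data/signal`, `sample.h5`, `signal.png`, and the three helper functions). In the pilot, the input path would point into the EGI DataHub volume space and the output path into the space enabled for write access.

```python
import os
import tempfile

import h5py
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np


def make_sample_file(path):
    """Write a small synthetic HDF5 file standing in for a facility dataset."""
    x = np.linspace(0.0, 1.0, 50)
    with h5py.File(path, "w") as f:
        # Intermediate groups ("entry", "entry/data") are created automatically.
        f.create_dataset("entry/data/signal", data=x ** 2)


def list_datasets(path):
    """Return the sorted names of all datasets found in the file."""
    names = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            names.append(name)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sorted(names)


def plot_dataset(path, dataset, out_png):
    """Read one dataset into memory, plot it, and save the figure as a PNG."""
    with h5py.File(path, "r") as f:
        values = f[dataset][()]
    fig, ax = plt.subplots()
    ax.plot(values)
    ax.set_xlabel("index")
    ax.set_ylabel(dataset)
    fig.savefig(out_png)
    plt.close(fig)
    return values


# Placeholder locations; a temporary directory is used so the sketch is self-contained.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "sample.h5")
result_png = os.path.join(workdir, "signal.png")

make_sample_file(sample)
datasets = list_datasets(sample)
values = plot_dataset(sample, datasets[0], result_png)

# In a notebook with h5glance installed, the file structure can also be browsed
# interactively with: from h5glance import H5Glance; H5Glance(sample)
```

The sketch uses plain h5py to list the file contents so that it also runs as a script. In a notebook, h5glance offers an interactive view of the same structure.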

Giuseppe La Rocca is Community Support Lead at the EGI Foundation.

Jean-François Perrin is WP6 Leader on EOSC integration.

Contributors: Enol Fernandez (EGI Foundation), Marco De Simone (CERIC-ERIC), Łukasz Opioła, Michal Orzechowski and Bartosz Kryza (CYFRONET), Miroslav Ruda and Andrei Kirushchanka (CESNET-MCC), and Christos Kanellopoulos (GEANT)