Reference architectures provided by SZTAKI on the ELKH Cloud

SZTAKI, together with the Wigner Data Centre (WDC), established the Hungarian academic cloud system, called MTA Cloud, in 2016. Its main goal was to serve the e-infrastructure needs of all the Hungarian scientists working in the research institutes of the Hungarian Academy of Sciences. The cloud was established as an alliance of two cloud sites located in different areas of Budapest: one on the Pest side of the city at SZTAKI and the other on the Buda side at WDC. It was decided from the very beginning that both clouds would be built on the same cloud technology (OpenStack was selected) and would keep their technological similarities as far as possible. Nevertheless, some differences were tolerated in areas such as the applied security concept, the authentication mechanism and the size of the resources. We developed a common web portal for the two sites, and the policies for accessing the resources and supporting the user communities were managed by the two sites in the same way.

This two-site cloud federation has worked very reliably, to the full satisfaction of the academic user communities. However, by the end of 2019 it was clear that the existing cloud capacity was no longer enough to serve the ever-growing needs of the Hungarian scientists. By that time more than 120 projects were running on the cloud and the cloud resources were fully used up. From September 2019 the research institutes were managed by a newly formed organization called the Eötvös Loránd Research Network (ELKH), and its leadership, recognizing this situation, decided to further develop the cloud and substantially extend its capacity towards new directions such as big data and AI support. As a result, on 1 July 2020 a one-year project started to realize this planned enhancement of the cloud infrastructure. In October 2020, the cloud was renamed ELKH Cloud. The planned capacity of ELKH Cloud compared with that of the former MTA Cloud is shown in Table 1. The table clearly shows that every resource type will be at least three times larger than it was in MTA Cloud, and several of them will grow by one or even two orders of magnitude, especially those needed for big data, AI, data transfer and storage support.

Table 1. Current (former MTA Cloud) and planned ELKH Cloud resource capacity

Resource                                    Current ELKH Cloud (former MTA Cloud)   Planned ELKH Cloud
vCPU (max)                                  1368                                    4000
GPU core                                    12                                      76
vGPU (max)                                  12                                      2060
RAM (TB)                                    3.25                                    11
SSD storage (TB)                            0                                       153
HDD storage (TB)                            527                                     1500
Tensor GPU performance (PFLOPS)             0                                       7.16
Floating-point GPU performance (PFLOPS)     0                                       0.89
Network bandwidth (Gbps)                    10                                      100


It should be very clear that ELKH Cloud is neither a batch-oriented cloud nor a cloud providing multi-user software services. Rather, it is an e-infrastructure framework in which every user project can build the e-infrastructure it requires, and the members of the project can then use this e-infrastructure according to their needs. As a consequence, the cloud is project-oriented rather than oriented towards individual users. When a new project registers to use the cloud, its leader fills in a request form in which, among other things, the type and size of the required resources and the foreseen duration of their use are specified. Based on this request and the available capacity, a resource quota is granted to the project, which can then use those resources according to its needs. Project members can deploy any kind of infrastructure within the quota limit and use it for the duration specified in the request form.
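Since ELKH Cloud is OpenStack-based, project members can use standard OpenStack tooling to deploy VMs within their quota. The following is a minimal sketch, not an official ELKH Cloud recipe, using the openstacksdk Python library; the cloud entry, image, flavor and network names are hypothetical placeholders.

```python
# Minimal sketch: launch a VM within a project's quota on an OpenStack cloud.
# The cloud entry "elkh-cloud" and the image, flavor and network names below
# are hypothetical placeholders, not actual ELKH Cloud identifiers.
import openstack

# Credentials for the "elkh-cloud" entry are read from a local clouds.yaml file
conn = openstack.connect(cloud="elkh-cloud")

# Inspect the compute quota (vCPUs, RAM, instances) granted to the project
print(conn.get_compute_limits())

# Boot a VM from one of the basic images provided for the projects
server = conn.create_server(
    name="my-project-vm",
    image="ubuntu-20.04",        # placeholder image name
    flavor="m1.medium",          # placeholder flavor name
    network="project-network",   # placeholder project network name
    wait=True,
)
print("Server", server.id, "is", server.status)
```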

Of course, deploying the required e-infrastructure in the cloud is not easy. Therefore, it was also part of our cloud project to provide ready-to-use e-infrastructure reference architectures (RAs) for the scientists, together with a set of basic images from which they can build the required types of VMs. The RAs cover a wide range of potential e-infrastructure types, from relatively simple ones up to very complex big data and AI-oriented reference architectures. The simplest one provides the very popular Jupyter Notebook development environment. A somewhat more complex one provides a WS-PGRADE science gateway and parallel program development environment in which job submission and workflow management are also available. To enable the transfer of large volumes of data, the Data Avenue tool developed by SZTAKI is available as an RA. Recognizing the importance of processing very large scientific data sets, we have developed the Flowbster workflow system and made it available as an RA, with which a temporary pipeline e-infrastructure specialized for data transfer and data processing can be built and deployed in the cloud. Further important RAs provide various types of clusters devoted to different application areas: Docker Swarm and Kubernetes clusters for running container-based parallel applications, and Hadoop and Spark clusters for big data and AI applications. Furthermore, a TensorFlow environment can be deployed in ELKH Cloud as an RA. To run applications that scale automatically at both the VM and the container level, we have created a MiCADO RA; MiCADO was developed in the EU H2020 COLA project and is under industrial exploitation in the recently launched EU H2020 DigitBrain project. Further RAs will be developed as part of future user support activities, according to the direct needs of the various projects running on ELKH Cloud.
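As an illustrative sketch only, assuming a GPU-backed flavor and a working TensorFlow installation of the kind the TensorFlow RA is expected to provide, a user could verify on a freshly deployed VM that the framework actually sees the virtual GPU:

```python
# Quick sanity check on a VM deployed from the TensorFlow RA (assumed setup):
# list the GPUs visible to TensorFlow and run a small matrix multiplication.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    with tf.device("/GPU:0"):
        a = tf.random.normal((1024, 1024))
        b = tf.random.normal((1024, 1024))
        c = tf.matmul(a, b)   # executed on the first GPU
    print("Matrix product computed on GPU, shape:", c.shape)
```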

Since ELKH Cloud is a project-oriented rather than an individual user-oriented cloud, we do not provide multi-user services that are constantly available whether users need them or not. Rather, we trust that if a project needs a service for its members, it will build and deploy that service in the cloud using the resources available within its quota. In this way, only those services are deployed in the cloud that are really needed, and the resources used are bounded by the resource quota of the given project. As a result, resource utilization is much more economical than in the case of generic multi-user services.

A good example of our approach and of the impact of our cloud was a successful COVID-19 related project, in which virologists, bioinformaticians and cloud experts formed a project during the first wave of the pandemic in order to significantly speed up the analysis of epidemic spreading, delivering results within 24 hours, and to create the first genetic epidemiological model for Hungary within a few days.

In the EGI-ACE project we would like to combine and share our experiences with the knowledge accumulated in the EGI Federated Cloud in order to take mutual advantage of the lessons learnt in both systems. Best practices applied in the EGI Federated Cloud should be adapted for ELKH Cloud and, vice versa, we are ready to make the RAs we have developed for ELKH Cloud openly available to the whole EGI-ACE community. We believe that these mutual benefits will contribute to the further improvement and optimization of both cloud systems.

More information

SZTAKI website

Peter Kacsuk is Head of the Laboratory of Parallel and Distributed Systems at SZTAKI.