
Empowering pandemic preparedness with LUMI and the European COVID-19 data portal

The virus that causes COVID-19 is still circulating among us. While population immunity has increased as vaccinations take effect, the virus is under evolutionary pressure to persist, and SARS-CoV-2 therefore continues to evolve. Just a couple of weeks ago, one of the scientific authorities in the field, Eric Topol, reported on social media about a hyper-mutated virus that is diverging from the known coronavirus variants. This variant is expected to reach Europe in the coming weeks.

What makes this new mutation noteworthy is its location. The gene that has changed under evolutionary pressure alters the structure of the coronavirus spike protein. As we have come to learn during the pandemic, the spike protein is central to the virus's rapid spread through the air, because it allows the virus to attach to cells in the human respiratory tract.

To understand the threat the virus poses and to be prepared, we need to turn existing viral data into knowledge. This is where the EuroHPC Joint Undertaking’s supercomputers enter the scene.


Figure. The recently detected SARS-CoV-2 mutations affect the spike protein region around residue 455. Source: Worldwide Protein Data Bank

Analysing big data with supercomputing

Big viral data is a technical challenge for the present EuroHPC supercomputers. The SARS-CoV-2 data analysis pipelines used to update the European COVID-19 data portal have been deployed and tested on LUMI. The portal, offered globally by EMBL-EBI, an ELIXIR node, in line with the FAIR data principles, provides access to data on more than 8 million viral sequences.

The computing done on LUMI helps to organise this data, and the work is carried out in collaboration between CSC and EMBL-EBI experts. The project has overcome many technical challenges associated with data management. Some of these challenges arise from the demands of large-scale data access in bioinformatics, which stretch and develop the data handling capability of the supercomputer ecosystem. Big data, and moving it, is an important part of the computational workflows.

Analysing each coronavirus sample requires approximately 0.9 CPU hours of supercomputer processor time. A dataset of 1 million samples occupies about 100 terabytes of disk space, and at present there are a total of 8 million virus samples. Millions of samples remain unanalysed.
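
As a rough back-of-the-envelope illustration, the scale of the task can be estimated from the figures quoted above (a sketch only, not measured project numbers):

```python
# Back-of-the-envelope estimate using the figures quoted in the text:
# ~0.9 CPU hours per sample and ~100 TB of disk per 1 million samples.
CPU_HOURS_PER_SAMPLE = 0.9
TB_PER_MILLION_SAMPLES = 100

def estimate(samples: int) -> tuple[float, float]:
    """Return (CPU hours, terabytes of raw data) for a given sample count."""
    cpu_hours = samples * CPU_HOURS_PER_SAMPLE
    terabytes = samples / 1_000_000 * TB_PER_MILLION_SAMPLES
    return cpu_hours, terabytes

for n in (1_000_000, 8_000_000):
    cpu, tb = estimate(n)
    print(f"{n:>9,} samples: ~{cpu:,.0f} CPU hours, ~{tb:,.0f} TB of raw data")
# 1,000,000 samples: ~900,000 CPU hours, ~100 TB of raw data
# 8,000,000 samples: ~7,200,000 CPU hours, ~800 TB of raw data
```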

Consequently, supercomputer capacity is needed to analyse all of this data so that the knowledge derived from it can be published online. At some point, possibly after a few years, we anticipate that the influx of new samples will stabilise, but a substantial backlog will likely remain to process while we prepare to analyse new virus strains quickly.

Technical challenges with data transfer

The primary functional challenge in the supercomputing environment revolves around the incoming raw data from the EMBL-EBI FTP server, which is an integral part of the computing workflow. The data transfer needs to occur ‘on the fly’, which deviates from the traditional high-performance computing usage pattern. To address this challenge, we conducted pilot experiments in which European Nucleotide Archive (ENA) raw data was staged locally from EMBL-EBI to CSC, which over time would amount to tens of petabytes transferred and made available in the compute environment. The intention is to keep the data in close proximity to the LUMI computing resource. Progress made in collaboration between the expert teams has shown great promise.
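
To make the ‘on the fly’ pattern concrete, here is a minimal sketch of staging raw reads for a single ENA run accession. It uses ENA’s public portal API file report endpoint; the endpoint, field names, and the hypothetical accession are assumptions for illustration, and this is not the production staging tool used in the project:

```python
# Minimal sketch of on-the-fly staging of ENA raw reads via the public
# ENA portal API file report endpoint (assumed for illustration; not the
# project's production workflow on LUMI).
import csv
import io
import urllib.request

ENA_FILEREPORT = (
    "https://www.ebi.ac.uk/ena/portal/api/filereport"
    "?accession={acc}&result=read_run&fields=run_accession,fastq_ftp&format=tsv"
)

def stage_run(run_accession: str, dest_dir: str = ".") -> list[str]:
    """Download the FASTQ files that ENA lists for one run accession."""
    with urllib.request.urlopen(ENA_FILEREPORT.format(acc=run_accession)) as resp:
        report = resp.read().decode()
    staged = []
    for row in csv.DictReader(io.StringIO(report), delimiter="\t"):
        for ftp_path in filter(None, row.get("fastq_ftp", "").split(";")):
            local = f"{dest_dir}/{ftp_path.rsplit('/', 1)[-1]}"
            urllib.request.urlretrieve("ftp://" + ftp_path, local)  # blocking transfer
            staged.append(local)
    return staged

# Example with a hypothetical accession:
# stage_run("ERR0000001", "/scratch/staging")
```

In practice the same fetch has to run for millions of samples inside the compute workflow, which is exactly what stretches the traditional HPC usage pattern described above.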

Nevertheless, the data management needs extend the project timeline beyond that of a traditional HPC computing project. The requirements are mostly associated with data transfer, which delays completion of the computing tasks. Addressing this data management challenge is crucial for enabling large-scale computing on the data, and we are jointly heading towards dedicated solutions on the EuroHPC architecture side.

One outstanding issue concerns data quality control. Data is sent to EMBL-EBI in batches (data sets) that originate from various organisations and vary in size. It has become evident that some of these batches contain higher-quality data, while others contain lower-quality data. This discrepancy is reflected in the failure rates of the computing pipeline, with some batches experiencing a 1% failure rate and others seeing up to 5% of samples affected. Failed samples necessitate costly reruns, which can potentially double the time required to analyse a dataset, representing a significant computational burden and complicating our estimation of the resources needed to process everything. CSC and EMBL-EBI are investigating the issue to address this computing challenge.
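
As a hedged illustration of how failure rates feed into resource estimates, the sketch below uses the 0.9 CPU hours per sample and the 1–5% failure figures quoted above, and assumes, purely for illustration, that each failed sample is rerun once at the same cost; the real wall-clock impact described above can be considerably larger:

```python
# Sketch: extra compute caused by failed samples in a batch. Assumes each
# failed sample is rerun once at the same cost; this is an illustrative lower
# bound, since real reruns also add staging, queueing, and triage time.
CPU_HOURS_PER_SAMPLE = 0.9  # figure quoted in the text

def rerun_overhead(samples: int, failure_rate: float) -> tuple[int, float]:
    """Return (number of failed samples, extra CPU hours spent on reruns)."""
    failed = round(samples * failure_rate)
    return failed, failed * CPU_HOURS_PER_SAMPLE

for rate in (0.01, 0.05):
    failed, extra = rerun_overhead(1_000_000, rate)
    print(f"{rate:.0%} failure rate: {failed:,} failed samples, "
          f"~{extra:,.0f} extra CPU hours")
# 1% failure rate: 10,000 failed samples, ~9,000 extra CPU hours
# 5% failure rate: 50,000 failed samples, ~45,000 extra CPU hours
```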

Forward together – combining supercomputing and big data

At present, we are proposing an experiment covering roughly 1.5 million samples that have not yet been included in the COVID-19 data portal. Transferring this data over the Internet is a technical challenge in its own right.
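
To give a sense of that transfer challenge, the following minimal sketch scales the 100 TB per million samples figure from earlier to 1.5 million samples and assumes, purely for illustration, a sustained throughput of 10 Gbit/s; the actual link capacity and achievable throughput between EMBL-EBI and LUMI are not stated here:

```python
# Sketch of the transfer volume and time for the proposed 1.5-million-sample
# experiment. 100 TB per million samples comes from the text; the sustained
# 10 Gbit/s throughput is an assumed value for illustration only.
TB_PER_MILLION_SAMPLES = 100
ASSUMED_GBIT_PER_S = 10

samples = 1_500_000
terabytes = samples / 1_000_000 * TB_PER_MILLION_SAMPLES   # -> 150 TB
gigabits = terabytes * 8 * 1000                             # TB -> Gbit
hours = gigabits / ASSUMED_GBIT_PER_S / 3600
print(f"~{terabytes:.0f} TB to move; ~{hours:.0f} hours at a sustained "
      f"{ASSUMED_GBIT_PER_S} Gbit/s, ignoring protocol and storage overheads")
# ~150 TB to move; ~33 hours at a sustained 10 Gbit/s, ignoring overheads
```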

The proposed experiment combines extensive data processing with supercomputing platforms, and its success would hold significant implications for European pandemic preparedness. There is therefore a strong rationale for continuing this work and for bridging the various skills needed to address the remaining technical challenges.

This endeavour, a computing collaboration between life-science data infrastructures and EuroHPC, remains a relevant task for European infectious disease preparedness until we have successfully established a computational process for periodically updating the European COVID-19 data portal.


Author: Tommi Nyrönen, ELIXIR Finland, CSC – IT Center for Science, written on 31 October 2023
