Authors V.E. Velikhov, A.A. Klimentov, R.Yu. Mashinistov, A.A. Poyda, E.A. Ryabinkin
Month, Year 11, 2016
Index UDC 004.75
DOI 10.18522/2311-3103-2016-11-88100
Abstract Modern experiments face unprecedented computing challenges. Heterogeneous computing resources are distributed worldwide, thousands of scientists analyse the data remotely, the volume of processed data is beyond the exabyte scale, and data processing requires more than a few billion hours of computing per year. The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited with the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in science, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment relies on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses the PanDA (Production and Distributed Analysis) Workload Management System to manage the workflow for all data processing at over 140 data centers. Through PanDA, ATLAS physicists see a single computing facility that enables rapid scientific breakthroughs for the experiment, even though the data centers are physically scattered all over the world. Modern biology also uses complex algorithms and sophisticated software, which are impossible to run without access to significant computing resources. Recent advances in Next Generation Genome Sequencing (NGS) technology have led to increasing streams of sequencing data that need to be processed, analysed and made available to bioinformaticians worldwide. Analysis of ancient genome sequencing data with the popular PALEOMIX software pipeline can take a month even on a powerful computing resource. PALEOMIX includes a typical set of software used to process NGS data, including adapter trimming, read filtering, sequence alignment, genotyping and phylogenetic or metagenomic analysis.
Sophisticated workload management system (WMS) software and efficient use of supercomputers can greatly speed up this process. In 2014 the authors started to develop a large-scale data and task management system for federated heterogeneous resources, based on the PanDA workload management system as the underlying technology, for the ATLAS experiment at the Large Hadron Collider and for bioinformatics applications. As part of this work, we have designed, developed and deployed a portal for submitting scientific payloads to a heterogeneous computing infrastructure. The portal combines a Tier-1 Grid center, a supercomputer, and an academic cloud at the Kurchatov Institute. The portal is used not only for ATLAS tasks, but also for genome sequencing analysis. In this paper we describe the adaptation of the PALEOMIX pipeline to run in a distributed computing environment powered by PanDA. We used PanDA to manage computational tasks on a multi-node parallel supercomputer. To run the pipeline, we split the input files into chunks, which are processed on different nodes as separate inputs for PALEOMIX, and finally merge the output files. This is very similar to what ATLAS does to process and simulate data. We dramatically decreased the total walltime thanks to (re)submission automation and brokering within PanDA, as was earlier demonstrated for ATLAS applications on the Grid. Software tools developed initially for HEP and the Grid can reduce payload execution time for mammoth DNA samples from weeks to days.
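
The split-and-merge step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the input is unwrapped FASTQ (four lines per read), and the names `split_fastq` and `merge_outputs` are hypothetical. A real PALEOMIX run would merge alignment files with a dedicated tool, not by plain concatenation.

```python
from typing import Iterable, Iterator, List

# Assumption: unwrapped FASTQ, where each read occupies exactly four lines
# (header, sequence, '+' separator, quality string).
FASTQ_LINES_PER_READ = 4


def split_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of FASTQ lines, each holding at most reads_per_chunk reads.

    Hypothetical helper: each chunk would become the input of one
    independent job on a separate node.
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be positive")
    limit = reads_per_chunk * FASTQ_LINES_PER_READ
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == limit:
            yield chunk
            chunk = []
    if chunk:  # last, possibly smaller, chunk
        yield chunk


def merge_outputs(parts: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate per-chunk outputs in chunk order.

    Hypothetical stand-in for the final merge step; it only shows that
    the per-chunk results are recombined into one output.
    """
    merged: List[str] = []
    for part in parts:
        merged.extend(part)
    return merged
```

Because chunk boundaries fall on whole reads, every chunk is itself a valid input, which is what allows the chunks to be processed independently and resubmitted individually on failure.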

Keywords Distributed computing; supercomputers; big data; workflow management systems.
