Authors V.E. Velikhov, A.A. Klimentov, R.Yu. Mashinistov, A.A. Poyda, E.A. Ryabinkin
Month, Year 11, 2016
Index UDC 004.75
DOI 10.18522/2311-3103-2016-11-88100
Abstract Modern experiments face unprecedented computing challenges. Heterogeneous computing resources are distributed worldwide, thousands of scientists analyse the data remotely, the volume of processed data is beyond the exabyte scale, and data processing requires more than a few billion hours of computing per year. The Large Hadron Collider (LHC), operating at the international CERN laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited with the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in science, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment relies on a heterogeneous distributed computational infrastructure. ATLAS uses the PanDA (Production and Data Analysis) Workload Management System to manage the workflow for all data processing at over 140 data centers. Through PanDA, ATLAS physicists see a single computing facility that enables rapid scientific breakthroughs for the experiment, even though the data centers are physically scattered all over the world. Modern biology also uses complex algorithms and sophisticated software that cannot be run without access to significant computing resources. Recent advances in Next Generation Genome Sequencing (NGS) technology have led to increasing streams of sequencing data that need to be processed, analysed and made available to bioinformaticians worldwide. Analysis of ancient genome sequencing data with the popular PALEOMIX software pipeline can take a month even on a powerful computing resource. PALEOMIX includes a typical set of software used to process NGS data, covering adapter trimming, read filtering, sequence alignment, genotyping, and phylogenetic or metagenomic analysis.
A sophisticated workload management system (WMS) and efficient use of supercomputers can greatly speed up this process. In 2014 the authors started to develop a large-scale data and task management system for federated heterogeneous resources, based on the PanDA workload management system as the underlying technology, for the ATLAS experiment at the Large Hadron Collider and for bioinformatics applications. As part of this work, we designed, developed and deployed a portal for submitting scientific payloads to heterogeneous computing infrastructure. The portal combines a Tier-1 Grid center, a supercomputer, and an academic cloud at the Kurchatov Institute. It is used not only for ATLAS tasks but also for genome sequencing analysis. In this paper we describe the adaptation of the PALEOMIX pipeline to run in a distributed computing environment powered by PanDA. We used PanDA to manage computational tasks on a multi-node parallel supercomputer. To run the pipeline, we split the input files into chunks, process each chunk on a different node as a separate PALEOMIX input, and finally merge the outputs; this is very similar to how ATLAS processes and simulates data. The total walltime decreased dramatically thanks to automated (re)submission and brokering within PanDA, as was demonstrated earlier for ATLAS applications on the Grid. Software tools initially developed for HEP and the Grid can reduce the payload execution time for mammoth DNA samples from weeks to days.
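The split, process, and merge pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not call PanDA or PALEOMIX: it assumes FASTQ-style input with four lines per read, and the names `split_fastq`, `filter_short_reads`, and `scatter_gather` are hypothetical. The per-chunk payload is a toy read-length filter standing in for a real pipeline job, and the chunks are processed sequentially here, whereas in the paper each chunk is dispatched to a separate node.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

RECORD_LINES = 4  # assumption: FASTQ input, one read = four text lines


def split_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks that each hold at most reads_per_chunk complete reads."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, reads_per_chunk * RECORD_LINES))
        if not chunk:
            return
        if len(chunk) % RECORD_LINES:
            raise ValueError("input ends with a truncated FASTQ record")
        yield chunk


def filter_short_reads(chunk: List[str], min_length: int = 3) -> List[str]:
    """Toy payload (hypothetical): keep reads whose sequence has min_length bases or more."""
    kept: List[str] = []
    for start in range(0, len(chunk), RECORD_LINES):
        record = chunk[start:start + RECORD_LINES]
        if len(record[1].strip()) >= min_length:
            kept.extend(record)
    return kept


def scatter_gather(
    lines: Iterable[str],
    reads_per_chunk: int,
    payload: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run payload on every chunk and concatenate the results in chunk order."""
    merged: List[str] = []
    for chunk in split_fastq(lines, reads_per_chunk):
        # In a real deployment each chunk would become one independent job.
        merged.extend(payload(chunk))
    return merged
```

Because every chunk contains only whole reads and the payload treats reads independently, the merged result does not depend on the chunk size. That property is what makes this kind of workload easy to distribute and to resubmit chunk by chunk after a failure.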

Keywords Distributed computing; supercomputers; big data; workflow management systems.
References 1. Aad G. et al. The ATLAS Collaboration. The ATLAS Experiment at the CERN Large Hadron Collider, Journal of Instrumentation, 2008, Vol. 3, S08003.
2. Evans L., Bryant P. LHC machine, Journal of Instrumentation, 2008, Vol. 3, S08001.
3. Klimentov A.A., Mashinistov R.Yu., Novikov A.M., Poĭda A.A., Ryabinkin E.A., Tertychnyĭ I.S. Integratsiya superkomp'yutera NITs «Kurchatovskiy institut» s tsentrom Grid pervogo urovnya [Integration of the NRC "Kurchatov Institute" supercomputer with a Tier-1 Grid center], Superkomp'yuternye dni v Rossii: Trudy mezhdunarodnoy konferentsii (28-29 sentyabrya 2015 g., g. Moskva) [Supercomputer days in Russia: Proceedings of the international conference (28-29 September 2015, Moscow)]. Moscow: Izd-vo MGU, 2015, pp. 700-705.
4. Klimentov A.A., Mashinistov R.Yu., Novikov A.M., Poĭda A.A., Tertychnyĭ I.S. Kompleksnaya sistema upravleniya dannymi i zadachami v geterogennoy komp'yuternoy srede [Integrated data and task management system for a heterogeneous computing environment], Trudy mezhdunarodnoy konferentsii «Analitika i upravlenie dannymi v oblastyakh s intensivnym ispol'zovaniem dannykh» (DAMDID/RCDL'2015) (13-16 oktyabrya 2015 g., g. Obninsk) v evropeyskom repozitorii trudov konferentsiy CEUR Workshop Proceedings (DAMDID/RCDL) [Proceedings of the international conference "Data analytics and management in data intensive domains" (DAMDID/RCDL'2015) (13-16 October 2015, Obninsk) in the European repository of conference proceedings CEUR Workshop Proceedings (DAMDID/RCDL)], 2015, Vol. 1536, pp. 165-172. ISSN: 1613-0073.
5. Maeno T. On behalf of PANDA team and ATLAS collaboration. PanDA: distributed production and distributed analysis system for ATLAS, Journal of Physics: Conference Series. IOP Publishing, 2008, Vol. 119, No. 6.
6. Vanyashin A.V., Klimentov A.A., Koren'kov V.V. Za bol'shimi dannymi sledit PANDA [PANDA keeps watch over big data], Superkomp'yutery [Supercomputers], 2013, No. 3 (15), pp. 56-61.
7. Schubert M. et al. Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX, Nat Protoc., 2014, No. 9 (5), pp. 1056-82. Doi: 10.1038/nprot.2014.063. Epub 2014 Apr 10. PubMed PMID: 24722405.
8. Klimentov A., Koren'kov V. Raspredelennye vychislitel'nye sistemy i ikh rol' v otkrytii novoy chastitsy [Distributed computing systems and their role in the discovery of new particles], Superkomp'yutery [Supercomputers], 2012, No. 3 (11), pp. 7-11.
9. Grid-infrastruktura WLCG [Grid infrastructure WLCG]. Available at:
10. Skryabin K.G., Prokhortchouk E.B., Mazur A.M., Boulygina E.S., Tsygankova S.V., Nedoluzhko A.V., Rastorguev S.M., Matveev V.B., Chekanov N.N., Goranskaya D.A., Teslyuk A.B., Gruzdeva N.M., Velikhov V.E., Zaridze D.G., Kovalchuk M.V. Combining two technologies for full genome sequencing of human, Acta Nat., 2009, Vol. 1, No. 3, pp. 102-107.
11. Kawalia A., Motameny S., Wonczak S., Thiele H., Nieroda L., Jabbari K., Borowski S., Sinha V., Gunia W., Lang U., Achter V., Nurnberg P. Leveraging the Power of High Performance Computing for Next Generation Sequencing Data Analysis: Tricks and Twists from a High Throughput Exome Workflow, PLoS One, 2015, No. 10 (5). Article No e0126321. Doi: 10.1371/journal.pone.0126321.
12. Bao R., Huang L., Andrade J., Tan W., Kibbe W.A., Jiang H., Feng G. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing, Cancer Inform., 2014, No. 13 (2), pp. 67-82.
13. Miller W., Drautz D.I., Ratan A., Pusey B., Qi J., Lesk A.M., Tomsho L.P., Packard M.D., Zhao F., Sher A., Tikhonov A., Raney B., Patterson N., Linblad-Toh K., Lander E.S., Knight J.R., Irzyk G.P., Fredrikson K.M., Harkins T.T., Sheridan S., Pringle T., Schuster S.C. Sequencing the nuclear genome of the extinct woolly mammoth, Nature, 2008, Vol. 456, pp. 387-390. Doi: 10.1038/nature07446.
14. Rasmussen M., Li Y., Lindgreen S., Pedersen J.S., Albrechtsen A., Moltke I., Metspalu M., Metspalu E., Kivisild T., Gupta R., et al. Ancient human genome sequence of an extinct Palaeo-Eskimo, Nature, 2009, Vol. 463, pp. 757-762. Doi: 10.1038/nature08835.
15. Keller A., Graefen A., Ball M., Matzas M., Boisguerin V., Maixner F., Leidinger P., Backes C., Khairat R., Forster M., et al. New insights into the Tyrolean Iceman's origin and phenotype as inferred by whole-genome sequencing, Nature Communications, 2011, No. 3.
16. Allentoft M.E., Collins M., Harker D., Haile J., Oskam C.L., Hale M.L., Campos P.F., Samaniego J.A., Gilbert M.T., Willerslev E., et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils, Proc Biol Sci., 2012, Vol. 279, pp. 4724-4733. Doi: 10.1098/rspb.2012.1745.
17. Nedoluzhko A.V., Boulygina E.S., Sokolov A.S., Tsygankova S.V., Gruzdeva N.M., Rezepkin A.D., Prokhortchouk E.B. Analysis of the Mitochondrial Genome of a Novosvobodnaya Culture Representative using Next-Generation Sequencing and Its Relation to the Funnel Beaker Culture, Acta Naturae, 2014, No. 6, pp. 31-35.
18. Sokolov A.S., Nedoluzhko A.V., Boulygina E.S., Tsygankova S.V., Gruzdeva N.M., Shishlov A.V., Kolpakova A., Rezepkin A.D., Skryabin K.G., Prokhortchouk E.B. Six complete mitochondrial genomes from Early Bronze Age humans in the North Caucasus, Journal of Archaeological Sciences, 2016, No. 73, pp. 138-144. Doi: 10.1016/j.jas.2016.07.017.
19. Martin M.D., Cappellini E., Samaniego J.A., Zepeda M.L., Campos P.F., Seguin-Orlando A., Wales N., Orlando L., Ho S.Y., Dietrich F.S., et al. Reconstructing genome evolution in historic samples of the Irish potato famine pathogen, Nature Communications, 2013, No. 4. Doi: 10.1038/ncomms3172.
20. Yoshida K., Schuenemann V.J., Cano L.M., Pais M., Mishra B., Sharma R., Lanz C., Martin F.N., Kamoun S., Krause J., et al. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine, eLife, 2013, No. 2.
