|Article title||MULTI-AGENT ALGORITHM FOR CREATING A RESIDUAL PROBLEM-SOLVING SCHEME IN DISTRIBUTED APPLIED SOFTWARE PACKAGES|
|Authors||A. G. Feoktistov, R. O. Kostromin, I. A. Sidorov, S. A. Gorsky|
|Section||SECTION II. DISTRIBUTED AND CLOUD COMPUTING|
|Month, Year||08, 2018|
|Abstract||Nowadays, basic software tools that implement technologies for organizing computations in high-performance computing systems provide a potential basis for the mass creation and use of parallel and distributed applications. Tools for creating applied software packages and workflow support systems are being actively developed and applied in practice. However, an analysis of their practical application allows us to conclude that it is necessary to increase the fault-tolerance of problem-solving processes in distributed applied software packages for problems that include sets of interrelated subproblems. In particular, this problem becomes urgent when we solve problems in a heterogeneous distributed computing environment. Clusters, including hybrid clusters with heterogeneous nodes, are the main components of such an environment. High-performance servers, storage systems, personal computers, and other computing elements complement the infrastructure of the environment. The paper presents an adaptive multi-agent algorithm, which is intended for the redistribution of jobs on the resources of such an environment. The algorithm is used when restarting the problem-solving process in distributed applied software packages after the failure of software and hardware. In contrast to the well-known algorithms for maintaining fault-tolerance of distributed computing that are used in workflow management systems, the work of this algorithm is based on the use of program specialization methods for creating and executing a residual problem-solving scheme. It also actively applies meta-monitoring of computational resources. 
A comparative analysis of experimental results from semi-natural modeling of fault-tolerance support for the scheme-execution process in distributed applied software packages, obtained with various meta-schedulers, demonstrated the advantage of the proposed approach to multi-agent management in a heterogeneous distributed computing environment.|
|Keywords||Distributed applied software package; problem-solving scheme; multi-agent management; fault-tolerance.|
|References||1. Bondarenko A.A., Yakobovski M.V. Obespechenie otkazoustojchivosti vysokoproizvoditel'nyh vychislenij s pomoshch'yu lokal'nyh kontrol'nyh tochek [Fault tolerance for HPC by using local checkpoints], Vestnik Yuzhno-Ural'skogo gosudarstvennogo universiteta. Seriya: Vychislitel'naya matematika i informatika [Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering], 2014, Vol. 3, No. 3, pp. 20-36.
2. Feoktistov A.G., Sidorov I.A., Gorsky S.A. Avtomatizatsiya razrabotki i primeneniya raspredelennykh paketov prikladnykh programm [Automation of development and application of distributed applied software packages], Problemy informatiki [Problems of Informatics], 2017, No. 4, pp. 61-78.
3. Banti A., Kacsuk P., Kozlovszky M. Classification of scientific workflows based on reproducibility analysis, Proceedings of the 39th International Convention on information and communication technology, electronics and microelectronics (MIPRO-2016), Rijeka: IEEE, 2016,
4. Mhashilkar P., Miller Z., Kettimuthu R., Garzoglio G., Holzman B., Weiss C., Duan X., Lacinski L. End-To-End Solution for Integrated Workload and Data Management using GlideinWMS and Globus Online, Journal of Physics: Conference Series, 2012, Vol. 396, No. 3, pp. 2076-2085.
5. Talia D. Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, 2013, Vol. 2013, pp. 1-15.
6. Deelman E., Peterka T., Altintas I., Carothers C.D., van Dam K.K., Moreland K., Parashar M., Ramakrishnan L., Taufer M., Vetter J. The future of scientific workflows, The International Journal of High Performance Computing Applications, 2017, Vol. 32, No. 1, pp. 159-175.
7. Ostermann S., Plankensteiner K., Prodan R., Fahringer T., Iosup A. Workflow monitoring and analysis tool for ASKALON, Proceedings of 3rd CoreGRID Workshop on Grid Middleware, Spain: Springer, 2009, pp. 73-86.
8. Zhao Y., Raicu I., Foster I. Scientific Workflow Systems for 21st Century, New Bottle or New Wine?, IEEE Congress on Services - Part I, Honolulu, HI: IEEE, 2008, pp. 467-471.
9. Rodriguez M.A., Buyya R. Deadline Based Resource Provisioning and Scheduling Algorithm for Scientific Workflows on Clouds, IEEE Transactions on Cloud Computing, 2014, Vol. 2, No. 2, pp. 222-235.
10. Anwar N., Deng H. Elastic Scheduling of Scientific Workflows under Deadline Constraints in Cloud Computing Environments, Future Internet, 2018, Vol. 10, No. 1, pp. 1-23.
11. Feoktistov A., Sidorov I., Sergeev V., Kostromin R., Bogdanova V. Virtualization of Heterogeneous HPC-clusters Based on OpenStack Platform, Vestnik Yuzhno-Ural'skogo gosudarstvennogo universiteta. Seriya: Vychislitel'naya matematika i informatika [Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering], 2017, Vol. 6, No. 2, pp. 37-48.
12. Ershov A.P. Nauchnye osnovy dokazatel'nogo programmirovaniya [Scientific basis of evidence-based programming], Vestnik AN SSSR [Herald of the USSR Academy of Sciences], 1984, No. 10, pp. 9-19.
13. Ershov A.P. On Mixed Computation: Informal Account of the Strict and Polyvariant Computation Schemes, Control Flow and Data Flow: Concepts of Distributed Programming, Berlin A.O.: Springer-Verlag, 1985, pp. 107-120.
14. Sidorov I.A. Methods and Tools to Increase Fault Tolerance of High-Performance Computing Systems, Proceedings of the 39th International Convention on information and communication technology, electronics and microelectronics (MIPRO-2016), Rijeka: IEEE, 2016, pp. 242-246.
15. Feoktistov A.G., Sidorov I.A. Logical-Probabilistic Analysis of Distributed Computing Reliability, Proceedings of the 39th International Convention on information and communication technology, electronics and microelectronics (MIPRO-2016), Rijeka: IEEE, 2016, pp. 247-252.
16. Feoktistov A.G., Kostromin R.O., Dyadkin Y.A. Upravlenie zadaniyami v geterogennoy raspredelennoy vychislitel'noy srede na osnove znaniy [Knowledge Based Management of Jobs in Heterogeneous Distributed Computing Environment], Vestnik komp'iuternykh i informatsionnykh tekhnologii [Herald of computer and information technologies], 2018, No. 2, pp. 10-17.
17. Bychkov I., Feoktistov A., Kostromin R., Sidorov I., Edelev A., Gorsky S. Machine Learning in a Multi-Agent System for Distributed Computing Management, Data Science. Information Technology and Nanotechnology 2018, CEUR-WS Proceedings, 2018, Vol. 2212, pp. 89-97.
18. Tel G. Introduction to Distributed Algorithms: Solutions and Suggestions, Cambridge University Press, 2000, 596 p.
19. Balaji P., Buntinas D., Kimpe D. Fault Tolerance Techniques for Scalable Computing, Scalable Computing and Communications: Theory and Practice, Hoboken: Wiley-IEEE Press, 2013, pp. 212-245.
20. Irkutsk Supercomputer Centre of SB RAS. Available at: http://hpc.icc.ru/ (accessed 3 November 2018).