Article

Article title CALCULATION OF FUNCTION OF REALIZABILITY OF PROBLEM SOLUTION ON DISTRIBUTED COMPUTATION SYSTEMS IN CASE OF FAILURES AND RESTORATIONS
Authors K.V. Pavsky, V.A. Pavsky
Section SECTION III. DISTRIBUTED COMPUTING AND SYSTEMS
Month, Year 12, 2016
Index UDC 004.272:[519.87:519.248]
DOI 10.18522/2311-3103-2016-12-8491
Abstract Modern distributed computer systems (CS) are large-scale and are intended to solve problems of varying complexity. The number of nodes in such systems can reach hundreds of thousands. Experience shows that the time between failures of various types in such systems can be measured in hours. Such systems therefore face heightened requirements for reliability and robustness, and the development of effective tools for analyzing their functioning becomes urgent. The quality of CS functioning is evaluated using a set of indices: reliability, robustness, realizability of problem solving, etc. Realizability indices characterize the process of solving problems on computer systems that are not absolutely reliable. The realizability function is the conditional probability that a complex problem, represented by a parallel program, will be solved in a given time on a CS functioning with a given number of working elementary machines (EMs) and using all the working EMs for the solution. The paper proposes a stochastic model of CS functioning during the solution of complex problems. It also gives formulas for calculating the realizability function of problem solving in distributed computer systems. The derivation of the equations for the efficiency indices is based on the assumption that the problem solution time on the CS is a function of the problem solution time on one elementary machine, and that this function has a finite number of discontinuities. The discontinuities are probabilistic in nature and correspond to CS failures that require reconfiguration of the CS (structural readjustment with respect to working machines only). The resulting expression is evaluated by approximate calculation. An example of calculating the probability of solving a problem in a given time on a computer system is presented.

Keywords Distributed computer systems; failures; renewal; stochastic model; realizability function of solving problems.
