In-place solver with fault resilience for linear systems

The invention is a solver for large systems of linear equations, and for matrix inversion, that runs on processors for parallel or distributed computation and is equipped with techniques that let the computation survive the failure of one or more processors.

Patent title Solutore in loco resiliente ai guasti per sistemi lineari
Thematic area Industry, Digital and Security
Ownership ALMA MATER STUDIORUM - UNIVERSITA' DI BOLOGNA
Inventors Daniela Loreti, Marcello Artioli
Protection Italy, with the possibility of international extension
Licensing status Available for development agreements, option, license and other exploitation agreements
Keywords Exact solver, In-place execution, Fault resilience, Linear systems, High Performance Computing Systems, Parallel computing
Filed on 13 October 2022

In recent years, advances in hardware technology for high-performance computing (HPC) have produced supercomputers with impressive computing power. These architectures are typically used to run complex computations, which would take too long on traditional, smaller-scale hardware, by executing them in parallel on multiple nodes. One example is the solution of systems of linear equations, a linear algebra building block for many scientific calculations. However, the development of HPC systems is still hindered by various problems. In particular, the large number of computing nodes dramatically increases the occurrence of faults, which threaten the execution of long-running parallel algorithms.

The proposed method is an algorithm-based fault tolerance (ABFT) solution that provides fault resilience while solving systems of linear equations. Thanks to its structure, the technique minimizes memory occupation through in-place execution and avoids the drawbacks of periodic checkpointing and rollback when a fault occurs.
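The general idea behind checksum-based ABFT can be illustrated with a minimal sketch. It is not the patented algorithm, only a textbook-style example under stated assumptions: each hypothetical compute node holds one column of a small matrix, one extra node holds the element-wise sum of those columns, and all function names are invented for illustration. Because an elimination step is a linear operation, applying it to the checksum column keeps the checksum consistent, so a lost column can be rebuilt without any checkpoint.

```python
from fractions import Fraction

def checksum_column(columns):
    """Element-wise sum of all data columns (held by the extra checksum node)."""
    return [sum(vals) for vals in zip(*columns)]

def eliminate(column, pivot, multipliers):
    """Apply one elimination step to a single node's column:
    entry i becomes column[i] - multipliers[i] * column[pivot]."""
    p = column[pivot]
    return [v - m * p for v, m in zip(column, multipliers)]

def recover(surviving, checksum):
    """Rebuild the one missing column as checksum minus the surviving columns."""
    return [c - sum(vals) for c, vals in zip(checksum, zip(*surviving))]

# Columns of a small 3x3 matrix, one per hypothetical compute node.
columns = [[Fraction(x) for x in col] for col in ([2, 4, -2], [1, 3, 1], [1, 5, 4])]
checksum = checksum_column(columns)

# Multipliers that zero out column 0 below pivot row 0 (pivot row is unchanged).
pivot = 0
multipliers = [Fraction(0)] + [columns[0][i] / columns[0][pivot] for i in (1, 2)]

# Every node, the checksum node included, applies the same linear update locally.
updated = [eliminate(c, pivot, multipliers) for c in columns]
checksum = eliminate(checksum, pivot, multipliers)

# Simulate the loss of node 1 and rebuild its column from the other nodes.
lost = 1
survivors = [c for j, c in enumerate(updated) if j != lost]
rebuilt = recover(survivors, checksum)
```

In this sketch the rebuilt column matches the one that was lost, and the only extra storage is the single checksum column.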

Thanks to its fault tolerance and limited memory footprint, the proposed method is particularly suitable for HPC workloads: it can be included in system libraries (provided together with HPC hardware) or in scientific software.

Indeed, it adds fault tolerance to the process of solving a linear system without affecting the memory footprint and without the drawbacks of traditional checkpoint/restart methods.

Furthermore, the method makes it simpler to obtain the desired degree of resilience: while checkpoint/restart requires knowing the duration of the application in advance in order to compute a suitable checkpoint frequency, the proposed method provides resilience to up to a given number of hard faults simply by supplying the same number of checksum nodes.
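The link between the number of checksum nodes and the number of tolerated faults can be sketched with weighted checksums. This is a generic illustration, not the scheme used in the patent: the Vandermonde-style weights `(j + 1) ** k`, the one-value-per-node layout, and the names `encode` and `recover_two` are all assumptions made for the example. With two checksums, any two lost values can be recovered by solving a 2x2 linear system.

```python
from fractions import Fraction

def encode(values, n_checksums):
    """Checksum k is the sum over nodes j of (j + 1)**k * values[j]."""
    return [sum(Fraction(j + 1) ** k * v for j, v in enumerate(values))
            for k in range(n_checksums)]

def recover_two(values, checksums, a, b):
    """Recover the values at lost positions a and b (entries there are ignored)."""
    r0, r1 = checksums[0], checksums[1]
    # Remove the contribution of every surviving node from both checksums.
    for j, v in enumerate(values):
        if j not in (a, b):
            r0 -= v
            r1 -= (j + 1) * v
    # Solve: x_a + x_b = r0 and (a + 1) * x_a + (b + 1) * x_b = r1.
    wa, wb = a + 1, b + 1
    xb = (r1 - wa * r0) / (wb - wa)
    xa = r0 - xb
    return xa, xb

values = [3, -1, 4, 7]            # one value per hypothetical data node
checksums = encode(values, 2)     # two checksum nodes tolerate two losses

damaged = [3, None, 4, None]      # nodes 1 and 3 have failed
xa, xb = recover_two(damaged, checksums, 1, 3)
```

Tolerating more simultaneous faults in this sketch would mean adding further weighted checksums and solving a correspondingly larger system.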

Page published on: 17 October 2022