Evaluation of Checkpointing Mechanism on SCore Cluster System

Masaaki Kondo, Takuro Hayashida, Masashi Imai, Hiroshi Nakamura, Takashi Nanya, Atsushi Hori

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well studied so far. As the number of commodity components in a cluster system increases, it becomes indispensable for system software to support reliability. The SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides checkpointing and rollback-recovery mechanisms for high availability. In this paper, we analyze and quantitatively evaluate the checkpointing and rollback-recovery mechanisms of SCore. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long because of the low effective network bandwidth. Based on these results, we modify SCore and make checkpointing 1.8 to 2.8 times faster and recovery 3.7 to 5.0 times faster. This helps cluster systems achieve both high performance and high availability.

Original language: English
Pages (from-to): 2553-2562
Number of pages: 10
Journal: IEICE Transactions on Information and Systems
Volume: E86-D
Issue number: 12
Publication status: Published - 2003 Dec
Externally published: Yes

Keywords

  • Checkpointing
  • Cluster system
  • High availability
  • Rollback-recovery

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence
