TY - GEN
T1 - Applying Pwrake Workflow System and Gfarm File System to Telescope Data Processing
AU - Tanaka, Masahiro
AU - Tatebe, Osamu
AU - Kawashima, Hideyuki
N1 - Funding Information:
This work is supported by JST CREST “Statistical Computational Cosmology with Big Astronomical Imaging Data” and JST CREST “Extreme Big Data (EBD) Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year.”
Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/29
Y1 - 2018/10/29
AB - In this paper, we describe a use case in which a scientific workflow system and a distributed file system are applied to improve the performance of telescope data processing. The application is pipeline processing of data generated by Hyper Suprime-Cam (HSC), a focal-plane camera mounted on the Subaru Telescope. We focus on the scalability of parallel I/O and on core utilization. IBM Spectrum Scale (GPFS), used in actual operation, has limited scalability due to its configuration based on dedicated storage servers. We therefore introduce the Gfarm file system, which uses worker-node storage to improve parallel I/O performance. To improve core utilization, we introduce the Pwrake workflow system in place of the parallel processing framework developed for the HSC pipeline. Descriptions of task dependencies are necessary to further improve core utilization by overlapping different types of tasks. We discuss the usefulness of a workflow description language with scripting-language capabilities for defining such complex task dependencies. In the experiment, the performance of the pipeline is evaluated using a quarter of one night's observation data (input files: 80 GB; output files: 1.2 TB). Strong-scaling measurements from 48 to 576 cores show that processing with the Gfarm file system scales better than with GPFS. At 576 cores, our method improves the processing speed of the pipeline by a factor of 2.2 compared with the method used in actual operation.
KW - Distributed file system
KW - Scientific workflow system
UR - http://www.scopus.com/inward/record.url?scp=85057265540&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057265540&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2018.00024
DO - 10.1109/CLUSTER.2018.00024
M3 - Conference contribution
AN - SCOPUS:85057265540
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 124
EP - 133
BT - Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Y2 - 10 September 2018 through 13 September 2018
ER -