TY - GEN
T1 - Low-reliable low-latency networks optimized for HPC parallel applications
AU - Nguyen, Truong Thao
AU - Matsutani, Hiroki
AU - Koibuchi, Michihiro
N1 - Funding Information:
This work was supported by JST CREST and JSPS KAKENHI Grant Numbers JP#16H02816. Part of this work was conducted as a research activity of the AIST - Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).
Publisher Copyright:
© 2018 IEEE.
PY - 2018/11/26
Y1 - 2018/11/26
AB - High-end network standards, such as 400GbE, have introduced Forward Error Correction (FEC) to maintain the same bit error rate (BER) as traditional low-bandwidth interconnection networks. However, the latency overhead of the FEC operation is, surprisingly, higher than the sum of all the other switch operation overheads, e.g., routing computation and switch allocation, and it significantly degrades the performance of parallel applications in HPC systems. In this study, we instead explore a low-latency network design using a Hamming code that does not provide strictly error-free communication. Because this design is compatible with the existing frame format, which is based on standard Reed-Solomon RS(544,514) with DC(64b/66b) direct line code and TC(256b/257b) transcode, its influence on the design of the other network layers is limited. Interestingly, a large number of parallel applications can tolerate the BER of such a Hamming code. Because relaxing the BER requirement in this way reduces switch operation latency, the proposed network using the Hamming code improves the execution time of the NAS Parallel Benchmarks by 56% on average compared to the counterpart RS-FEC networks.
KW - Forward Error Correction (FEC)
KW - High Performance Computing (HPC)
KW - Interconnection networks
UR - http://www.scopus.com/inward/record.url?scp=85059980998&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85059980998&partnerID=8YFLogxK
U2 - 10.1109/NCA.2018.8548063
DO - 10.1109/NCA.2018.8548063
M3 - Conference contribution
AN - SCOPUS:85059980998
T3 - NCA 2018 - 2018 IEEE 17th International Symposium on Network Computing and Applications
BT - NCA 2018 - 2018 IEEE 17th International Symposium on Network Computing and Applications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Symposium on Network Computing and Applications, NCA 2018
Y2 - 1 November 2018 through 3 November 2018
ER -