TY - JOUR
T1 - An FPGA-based optimizer design for distributed deep learning with multiple GPUs
AU - Itsubo, Tomoya
AU - Koibuchi, Michihiro
AU - Amano, Hideharu
AU - Matsutani, Hiroki
N1 - Funding Information:
This work was partially supported by JSPS KAKENHI Grant Number JP19H01106.
Publisher Copyright:
Copyright © 2021 The Institute of Electronics, Information and Communication Engineers
PY - 2021
Y1 - 2021
AB - Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are especially efficient for the training phase. A cluster of computers, each equipped with multiple GPUs, can significantly accelerate deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for training. Although gradient computation is still a major bottleneck of training, gradient aggregation and optimization impose additional communication and computation overheads, which should also be reduced to further shorten the training time. To address this issue, in this paper, multiple GPUs are interconnected using a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected via network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches placed between the remote GPUs; thus, gradient aggregation and parameter optimization are completed within the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on a NetFPGA-SUME board. Their resource utilization is increased by the processing elements (PEs) for the optimizers, and they consume up to 56% of the FPGA resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. In addition, the gradient aggregation throughput of the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
KW - Deep learning
KW - FPGA switch
KW - Remote GPU
UR - http://www.scopus.com/inward/record.url?scp=85121002597&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121002597&partnerID=8YFLogxK
U2 - 10.1587/transinf.2021PAP0008
DO - 10.1587/transinf.2021PAP0008
M3 - Article
AN - SCOPUS:85121002597
SN - 0916-8532
VL - E104D
SP - 2057
EP - 2067
JO - IEICE Transactions on Information and Systems
JF - IEICE Transactions on Information and Systems
IS - 12
ER -
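
For readers of this record, a minimal sketch of the standard update rules for three of the optimizers named in the abstract (SGD, AdaGrad, and Adam), together with the aggregate-then-optimize flow the switch performs in-network. This is an illustrative, textbook formulation only, not the paper's FPGA pipeline; the learning rates, epsilon, and the averaging-based aggregation below are assumptions made for the example.

# Illustrative sketch (assumed textbook formulations, not the paper's implementation).
import numpy as np

def sgd_step(param, grad, lr=0.01):
    # Plain stochastic gradient descent: step against the gradient.
    return param - lr * grad

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    # AdaGrad: accumulate squared gradients and scale each element's step.
    accum = accum + grad * grad
    return param - lr * grad / (np.sqrt(accum) + eps), accum

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient and its square,
    # with bias correction for early steps.
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad * grad
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

if __name__ == "__main__":
    # Toy example: average gradients from four workers (the aggregation step),
    # then apply one Adam update, mirroring the aggregate-then-optimize flow
    # described in the abstract.
    rng = np.random.default_rng(0)
    param = rng.standard_normal(8)
    worker_grads = [rng.standard_normal(8) for _ in range(4)]
    agg_grad = np.mean(worker_grads, axis=0)
    m = np.zeros_like(param)
    v = np.zeros_like(param)
    param, m, v = adam_step(param, agg_grad, m, v, t=1)
    print(param)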