TY - GEN
T1 - An In-Network Parameter Aggregation using DPDK for Multi-GPU Deep Learning
AU - Furukawa, Masaki
AU - Itsubo, Tomoya
AU - Matsutani, Hiroki
N1 - Publisher Copyright:
© 2020 IEEE
PY - 2020/11
Y1 - 2020/11
N2 - In distributed deep neural network training using remote GPU nodes, communication for gradient aggregation occurs iteratively between the nodes. This communication latency limits the benefit of distributed training with faster GPUs. In distributed deep learning using remote GPUs, the workload of gradient aggregation is imposed on the host machine. In this paper, we therefore propose to offload the gradient aggregation to a DPDK (Data Plane Development Kit) based network switch placed between the host machine and the remote GPUs. In this approach, the aggregation process is completed in the network using extra computation resources in the network switch. We evaluate the proposed switch in two settings: when the GPUs and the host communicate via standard IP communication, and when they communicate via a PCI Express (PCIe) over 40Gbit Ethernet (40GbE) product. The evaluation results using standard IP communication show that the aggregation is accelerated by 2.2-2.5x compared to aggregation executed by the host machine. The results using the PCIe over 40GbE product show that the proposed switch outperforms host-based aggregation by 1.16x. This approach is thus useful for distributed training with multiple GPUs.
AB - In distributed deep neural network training using remote GPU nodes, communication for gradient aggregation occurs iteratively between the nodes. This communication latency limits the benefit of distributed training with faster GPUs. In distributed deep learning using remote GPUs, the workload of gradient aggregation is imposed on the host machine. In this paper, we therefore propose to offload the gradient aggregation to a DPDK (Data Plane Development Kit) based network switch placed between the host machine and the remote GPUs. In this approach, the aggregation process is completed in the network using extra computation resources in the network switch. We evaluate the proposed switch in two settings: when the GPUs and the host communicate via standard IP communication, and when they communicate via a PCI Express (PCIe) over 40Gbit Ethernet (40GbE) product. The evaluation results using standard IP communication show that the aggregation is accelerated by 2.2-2.5x compared to aggregation executed by the host machine. The results using the PCIe over 40GbE product show that the proposed switch outperforms host-based aggregation by 1.16x. This approach is thus useful for distributed training with multiple GPUs.
UR - http://www.scopus.com/inward/record.url?scp=85104633421&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104633421&partnerID=8YFLogxK
U2 - 10.1109/CANDAR51075.2020.00021
DO - 10.1109/CANDAR51075.2020.00021
M3 - Conference contribution
AN - SCOPUS:85104633421
T3 - Proceedings - 2020 8th International Symposium on Computing and Networking, CANDAR 2020
SP - 108
EP - 114
BT - Proceedings - 2020 8th International Symposium on Computing and Networking, CANDAR 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th International Symposium on Computing and Networking, CANDAR 2020
Y2 - 24 November 2020 through 27 November 2020
ER -