TY - JOUR
T1 - 4K Real Time Image to Image Translation Network With Transformers
AU - Shibasaki, Kei
AU - Fukuzaki, Shota
AU - Ikehara, Masaaki
N1 - Funding Information:
This work was supported by Keio University.
Publisher Copyright:
© 2013 IEEE.
PY - 2022
Y1 - 2022
N2 - CNNs have traditionally been applied in computer vision. Recently, applying Transformer networks, originally a technique in natural language processing, to computer vision has received much attention and produced superior results. However, Transformers and their derivatives have the drawback that computational cost and memory usage increase rapidly with image resolution. In this paper, we propose the Laplacian Pyramid Translation Transformer (LPTT) for image-to-image translation. The Laplacian Pyramid Translation Network, the predecessor of this work, creates a Laplacian pyramid of the input images and processes every component with CNNs. In contrast, LPTT transforms the high-frequency components with CNNs and the low-frequency components with Axial Transformer blocks. LPTT thus retains the Transformer's expressive power while reducing computational cost and memory usage. LPTT significantly improves the quality of generated images and the inference speed for high-resolution images over conventional methods. LPTT is the first network with a Transformer that can perform practical inference in real time on 4K resolution images. LPTT can also process 8K images in real time depending on the model configuration and the performance of the GPU. The ablation study in this paper suggests that, even when processing high-resolution images, computing the low-frequency component with a Transformer improves performance while maintaining inference speed. LPTT improves the PSNR value by 0.41 dB on the MIT-Adobe FiveK dataset. The greater the number of layers in the Laplacian pyramid, the greater the improvement of LPTT over the Laplacian Pyramid Translation Network.
AB - CNNs have traditionally been applied in computer vision. Recently, applying Transformer networks, originally a technique in natural language processing, to computer vision has received much attention and produced superior results. However, Transformers and their derivatives have the drawback that computational cost and memory usage increase rapidly with image resolution. In this paper, we propose the Laplacian Pyramid Translation Transformer (LPTT) for image-to-image translation. The Laplacian Pyramid Translation Network, the predecessor of this work, creates a Laplacian pyramid of the input images and processes every component with CNNs. In contrast, LPTT transforms the high-frequency components with CNNs and the low-frequency components with Axial Transformer blocks. LPTT thus retains the Transformer's expressive power while reducing computational cost and memory usage. LPTT significantly improves the quality of generated images and the inference speed for high-resolution images over conventional methods. LPTT is the first network with a Transformer that can perform practical inference in real time on 4K resolution images. LPTT can also process 8K images in real time depending on the model configuration and the performance of the GPU. The ablation study in this paper suggests that, even when processing high-resolution images, computing the low-frequency component with a Transformer improves performance while maintaining inference speed. LPTT improves the PSNR value by 0.41 dB on the MIT-Adobe FiveK dataset. The greater the number of layers in the Laplacian pyramid, the greater the improvement of LPTT over the Laplacian Pyramid Translation Network.
KW - Deep learning
KW - Laplacian pyramid
KW - image to image translation
KW - photo retouching
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85134235369&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134235369&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2022.3189649
DO - 10.1109/ACCESS.2022.3189649
M3 - Article
AN - SCOPUS:85134235369
SN - 2169-3536
VL - 10
SP - 73057
EP - 73067
JO - IEEE Access
JF - IEEE Access
ER -