Video super-resolution aims to generate high-resolution video frames from multiple adjacent low-resolution frames. An important aspect of video super-resolution is the alignment of neighboring frames to the reference frame. Previous methods align the frames directly, using either optical flow or deformable convolution. However, directly estimating motion from low-resolution inputs is difficult, since such inputs often contain blur and noise that hinder accurate estimation. To address this problem, we propose to conduct feature alignment across multiple stages so that frames are aligned more accurately. Furthermore, to fuse the aligned features, we introduce a novel Attentional Feature Fusion Block that applies a spatial attention mechanism to suppress areas with occlusion or misalignment. Experimental results show that the proposed method achieves performance competitive with state-of-the-art super-resolution methods while using fewer network parameters.
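To illustrate the idea of attention-weighted fusion of aligned features, here is a minimal NumPy sketch. This is an assumption-laden toy, not the paper's actual Attentional Feature Fusion Block: the per-pixel attention here is just a sigmoid of the channel-wise correlation between the reference and each aligned neighbor, so poorly aligned or occluded pixels (low correlation) contribute less to the fused result.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention_fuse(ref, aligned_neighbors):
    """Fuse aligned neighbor feature maps into the reference feature map.

    ref:               array of shape (C, H, W), reference-frame features
    aligned_neighbors: list of (C, H, W) arrays, neighbors already aligned
                       to the reference frame

    Illustrative sketch only -- the paper's fusion block is not specified
    at this level of detail.
    """
    fused = ref.copy()
    for feat in aligned_neighbors:
        # Channel-wise correlation between reference and neighbor,
        # one scalar per spatial location: shape (1, H, W).
        corr = np.sum(ref * feat, axis=0, keepdims=True)
        # Per-pixel attention in (0, 1); low correlation (likely
        # occlusion or misalignment) yields a small weight.
        attn = sigmoid(corr)
        fused += attn * feat
    return fused
```

In a real network the attention map would be produced by learned convolutions rather than a fixed correlation, but the weighting pattern, i.e. down-weighting unreliable spatial regions before summation, is the same.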