Without additional training, this new method can generate images at arbitrary sizes and resolutions.

2024.04.08

Recently, diffusion models have surpassed GANs and autoregressive models to become the mainstream choice for generative modeling, owing to their excellent performance. Text-to-image generation models based on diffusion (such as SD, SDXL, Midjourney, and Imagen) have demonstrated an impressive ability to generate high-quality images. Typically, these models are trained at a specific resolution to ensure efficient processing and stable training on existing hardware.

Figure 1: Comparison of 2048×2048 images generated by different methods with SDXL 1.0. [1]

However, when these pre-trained diffusion models generate images beyond their training resolution, they often suffer from pattern repetition and severe artifacts, as shown at the far left of Figure 1.

To address this problem, researchers from the Chinese University of Hong Kong–SenseTime Joint Laboratory and other institutions conducted an in-depth, frequency-domain analysis of the convolutional layers in the U-Net architecture commonly used in diffusion models, and proposed FouriScale, shown in Figure 2.

Figure 2: Schematic of the FouriScale pipeline (orange path), which aims to ensure consistency across resolutions.

FouriScale replaces the original convolutional layers in a pre-trained diffusion model with atrous (dilated) convolution and low-pass filtering operations, aiming to achieve structural and scale consistency across resolutions. Combined with a "padding-then-cropping" strategy, the method can flexibly generate images of different sizes and aspect ratios. Furthermore, using FouriScale as guidance, the method preserves complete image structure and excellent image quality when generating high-resolution images of any size. FouriScale requires no offline pre-computation and offers good compatibility and scalability.

Quantitative and qualitative experiments show that FouriScale achieves significant improvements in generating high-resolution images with pre-trained diffusion models.


  • Paper address: https://arxiv.org/abs/2403.12963
  • Open source code: https://github.com/LeonHLJ/FouriScale
  • Paper title: FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Method introduction

1. Atrous convolution ensures structural consistency across resolutions

The denoising network of a diffusion model is usually trained on images or latents at a specific resolution, and it typically adopts a U-Net architecture. The authors aim to reuse the denoising network's parameters at inference time to generate higher-resolution images without retraining. To avoid structural distortion at the inference resolution, they seek to establish structural consistency between the default resolution and the higher one. For a convolutional layer in the U-Net, structural consistency can be expressed as:

[Formula: structural consistency condition]

Here, k is the original convolution kernel and k' is a new kernel tailored to the larger resolution. The frequency-domain representation of spatial downsampling is as follows:

[Formula: frequency-domain representation of spatial downsampling]

Formula (3) can be written as:

[Formula: Fourier-spectrum relation between k' and k]

This formula shows that the Fourier spectrum of the ideal convolution kernel k' should be an s×s tiling of the Fourier spectrum of k. In other words, the spectrum of k' should repeat periodically, and the repeating pattern is the spectrum of k.

The widely used atrous (dilated) convolution meets exactly this requirement. Its frequency-domain periodicity can be expressed as:

[Formula: frequency-domain periodicity of atrous convolution]

When using a pre-trained diffusion model (trained at resolution (h, w)) to generate a high-resolution image of size (H, W), the atrous convolution reuses the original kernel with a dilation factor of (H/h, W/w), which yields the ideal kernel k'.
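Both claims above can be checked numerically with a small NumPy sketch (the helpers `dilate_kernel` and `circ_conv` are illustrative, not code from the paper): dilating a kernel tiles its Fourier spectrum, and, under circular convolution, convolving at high resolution with the dilated kernel then downsampling matches downsampling then convolving with the original kernel.

```python
import numpy as np

def dilate_kernel(k, s):
    """Insert s-1 zeros between taps: the atrous (dilated) version of k."""
    kd = np.zeros((s * k.shape[0] - (s - 1), s * k.shape[1] - (s - 1)))
    kd[::s, ::s] = k
    return kd

def circ_conv(F, k):
    """Circular 2-D convolution via FFT (kernel zero-padded to F's size)."""
    return np.real(np.fft.ifft2(np.fft.fft2(F) * np.fft.fft2(k, s=F.shape)))

rng = np.random.default_rng(0)
k = rng.standard_normal((3, 3))      # "training-resolution" kernel
kd = dilate_kernel(k, 2)             # candidate k' for 2x resolution

# (a) The spectrum of k' is the spectrum of k tiled 2x2
K = np.fft.fft2(k, s=(16, 16))
Kd = np.fft.fft2(kd, s=(32, 32))
print(np.allclose(Kd, np.tile(K, (2, 2))))   # True

# (b) Structural consistency: conv at high resolution with k', then
#     downsample, equals downsample first, then conv with k
F = rng.standard_normal((16, 16))    # toy high-resolution feature map
lhs = circ_conv(F, kd)[::2, ::2]
rhs = circ_conv(F[::2, ::2], k)
print(np.allclose(lhs, rhs))         # True
```

Real U-Net convolutions use zero padding rather than circular boundaries, so this identity holds exactly only in the toy circular setting; it nonetheless captures why dilation is the natural choice for k'.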

2. Low-pass filtering ensures scale consistency across resolutions

However, atrous convolution alone does not solve the problem completely. As shown in the upper-left of Figure 3, pattern repetition still appears in fine details. The authors attribute this to frequency aliasing introduced by spatial downsampling, which alters the frequency components and causes the frequency distributions to differ across resolutions. To ensure scale consistency across resolutions, they introduce low-pass filtering to remove the high-frequency components responsible for aliasing after downsampling. The comparison curves on the right of Figure 3 show that with low-pass filtering the frequency distributions at high and low resolutions become closer, ensuring scale consistency. As the lower-left of Figure 3 shows, low-pass filtering significantly reduces pattern repetition in the details.

Figure 3: (a) Visual comparison with and without low-pass filtering. (b) Fourier relative log-amplitude curves without low-pass filtering. (c) Fourier relative log-amplitude curves with low-pass filtering.
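The aliasing argument can be illustrated with a toy NumPy example (the `ideal_lowpass` helper is illustrative; the paper's actual filters are designed per U-Net layer): a high-frequency pattern aliases into a spurious low-frequency pattern after 2× downsampling, but vanishes if low-pass filtered first.

```python
import numpy as np

def ideal_lowpass(x, cutoff):
    """Zero out frequencies above `cutoff` (fraction of Nyquist) via FFT."""
    H, W = x.shape
    X = np.fft.fftshift(np.fft.fft2(x))
    uu = np.abs(np.fft.fftshift(np.fft.fftfreq(H)))  # cycles/sample, 0..0.5
    vv = np.abs(np.fft.fftshift(np.fft.fftfreq(W)))
    mask = (uu[:, None] <= cutoff * 0.5) & (vv[None, :] <= cutoff * 0.5)
    return np.real(np.fft.ifft2(np.fft.ifftshift(X * mask)))

# A pure high-frequency pattern at 0.375 cycles/sample
n = np.arange(64)
hi = np.cos(2 * np.pi * 0.375 * n)
sig = np.outer(hi, hi)

# Naive 2x downsampling: 0.375 c/s aliases to a 0.25 c/s pattern
aliased = sig[::2, ::2]

# Low-pass filtering first (cutoff = half of Nyquist, i.e. 0.25 c/s)
# removes the component entirely, so nothing aliases
clean = ideal_lowpass(sig, 0.5)[::2, ::2]
print(np.abs(aliased).max(), np.abs(clean).max())
```

The first value is on the order of 1 (a strong spurious pattern survives), while the second is numerically zero: the filter discards exactly the frequencies that would fold back into the low band.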

3. Adapting to image generation at any size

The above method only applies when the aspect ratio of the target resolution matches the default inference resolution. To adapt FouriScale to image generation at any size, the authors adopt a "padding-then-cropping" strategy; pseudocode for FouriScale incorporating this strategy is given in Method 1.

[Method 1: FouriScale pseudocode with padding-then-cropping]
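A minimal shape-level sketch of the padding-then-cropping idea (names are illustrative, not from the released code): pad the input so each spatial dimension is an integer multiple of the training size, apply the operation with the resulting integer dilation factors, then crop back to the target size.

```python
import numpy as np

def pad_then_crop(feat, op, base_hw, target_hw):
    """Hypothetical sketch: pad `feat` up to an integer multiple of the
    training size (base_hw), apply `op`, and crop back to target_hw."""
    h, w = base_hw
    H, W = target_hw
    sh, sw = -(-H // h), -(-W // w)          # ceil division -> dilation factors
    pad_h, pad_w = sh * h - H, sw * w - W
    padded = np.pad(feat, ((0, pad_h), (0, pad_w)))
    out = op(padded)                          # e.g. the dilated conv + filtering
    return out[:H, :W]

# Identity "operation" just to show the shape bookkeeping:
# a 100x160 target with a 64x64 training size pads to 128x192, then crops back
out = pad_then_crop(np.ones((100, 160)), lambda x: x, (64, 64), (100, 160))
print(out.shape)  # (100, 160)
```

This is only the bookkeeping layer; in FouriScale the padded size determines the integer dilation factors used by the atrous convolution.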

4. FouriScale guidance

Because of the frequency-domain operations in FouriScale, the generated images inevitably suffer some loss of detail and undesirable artifacts. To address this, as shown in Figure 4, the authors propose using FouriScale as guidance. Specifically, in addition to the original conditional and unconditional estimates, they introduce an extra conditional estimate. This extra estimate is also generated with atrous convolution, but with a gentler low-pass filter so that details are preserved. Meanwhile, the attention scores in this extra estimate are replaced with the attention scores from the FouriScale conditional estimate. Since attention scores carry the structural information of the generated image, this operation injects the correct image structure while preserving image quality.

Figure 4: (a) Diagram of FouriScale guidance. (b) Without FouriScale guidance, the generated image shows obvious artifacts and detail errors. (c) Generated image with FouriScale guidance.
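The attention-score swap at the heart of this guidance can be sketched as follows (a simplified single-head NumPy illustration under stated assumptions, not the paper's implementation): the mild-filtered branch keeps its own value vectors, which carry detail, but its attention scores are overwritten with those from the FouriScale branch, which carry structure.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_swapped_scores(q, k, v, scores_fs):
    """Sketch of the swap: the output would normally be
    softmax(q @ k.T / sqrt(d)) @ v, but the mild branch's scores are
    replaced by the FouriScale branch's scores (scores_fs)."""
    scores_own = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # would normally be used
    return scores_fs @ v                                   # swapped-in scores

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
scores_fs = softmax(rng.standard_normal((4, 4)))  # scores from FouriScale branch
out = attention_with_swapped_scores(q, k, v, scores_fs)
print(out.shape)  # (4, 8)
```

The resulting estimate then takes the place of the usual conditional term in a classifier-free-guidance-style combination with the unconditional estimate.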

Experiments

1. Quantitative test results

The authors followed the protocol of [1] and tested three text-to-image models (SD 1.5, SD 2.1, and SDXL 1.0), generating images at four higher resolutions: 4×, 6.25×, 8×, and 16× the pixel count of each model's training resolution. The results on 30,000/10,000 image–text pairs randomly sampled from LAION-5B are shown in Table 1:

Table 1: Quantitative comparison of different training-free methods

Their method achieves the best results across the pre-trained models and resolutions tested.

2. Qualitative test results

As shown in Figure 5, their method maintains generation quality and structural consistency across pre-trained models and resolutions.

Figure 5: Comparison of images generated by different training-free methods

Conclusion

This paper proposes FouriScale to enhance the ability of pre-trained diffusion models to generate high-resolution images. Analyzing the problem from a frequency-domain perspective, FouriScale improves structural and scale consistency across resolutions through atrous convolution and low-pass filtering, addressing key challenges such as pattern repetition and structural distortion. The padding-then-cropping strategy and the use of FouriScale as guidance enhance the flexibility and quality of text-to-image generation while supporting different aspect ratios. Quantitative and qualitative comparisons show that FouriScale delivers higher image generation quality across different pre-trained models and resolutions.