[Paper Review] SinGAN: Learning a Generative Model from a Single Natural Image 논문 리뷰

업데이트: March 09, 2022

Paper: SinGAN: Learning a Generative Model from a Single Natural Image (ICCV 2019): arxiv, code, video
GAN-Zoos! (GAN 포스팅 모음집)

SinGAN: 단 하나의 이미지로 학습할 수 있는 unconditional generative moel

Single natural image 내 여러 patch들의 분포를 학습함으로써 다양한 형태의 새로운 이미지들을 만들어 낼 수 있게 됨

fine detail / texture generation 뿐만 아니라 복잡한 image의 structure들도 학습을 해야하기 때문에 challenging

이미지의 구도라던지(하늘은 위에, 땅은 바닥에), 큰 물체의 shape 등등

multi-scale patch 구조여서 global structure 뿐만 아니라 fine texture까지 학습이 가능

single image로 한번 학습을 했을 뿐인데, 다양한 고퀄리티의 이미지들이 생성된다.

SinGAN

hierachy of patch-GANs (Markovian discriminator)
- 이미지 x 에서 다양한 scale의 patch distribution을 학습할 수 있도록 patch-GAN의 구조를 차용
- receptive field 영역를 작게 잡아서 local 정보에 대한 Inductive bias를 주고, training image를 기억하지 못하게 함
- 비슷한 접근: Swapping AutoEncoder

Multi-scale architecture

Training Progression: coarsest scale에서 시작하여 finest scale까지 학습하며, 매 scale마다 Gaussian noise를 추가하여 이미지를 생성
- 처음에만 gaussian noise으로부터 이미지를 생성하고
  \[\tilde{x}_{N}=G_{N}(z_{N})\]
- 그 다음부터는 이전에 생성한 low-resolution 이미지와 gaussian noise를 활용하여 이미지 생성
  \[\tilde{x}_{n}=G_{n}(z_{n},(\tilde{x}_{n+1}) \uparrow^{r}), \quad n<N\]
모든 Generator와 Discriminator의 receptive field는 같음
- coarse한 이미지가 생성되는 초기에는 이 receptive field가 이미지 크기에 비해 상대적으로 크므로 global structure를 생성하고,
- fine 이미지가 생성되는 후반기에는 이 receptive field가 이미지 크기에 비해 상대적으로 작으므로 texture나 detail을 생성하게 됨
- Fig 4의 Effective Patch Size 참고
Generator
- Generator는 Fig 5처럼 비교적 간단한 구조를 가진다. (Generator가 복잡하여 모델의 capacity가 커지면 training image를 외워버리는 문제가 있음)
- 3x3 Conv-BatchNorm-LeakyReLU의 conv-block을 5번 반복한 구조

Training

Loss function은 adversarial loss와 reconstruction loss으로 구성된다.

\[\min _{G_{n}} \max _{D_{n}} \mathcal{L}_{\text {adv }}(G_{n}, D_{n})+\alpha \mathcal{L}_{\text {rec }}(G_{n})\]

The adversarial loss
- real image $x_n$의 patch distribution과 생성된 이미지 $\tilde{x}_{n}$의 patch distribution이 비슷하도록 학습
- WGAN-GP loss 사용
Reconstruction Loss
- 이미지가 original image를 잘 따라가도록 학습하는 loss
  \[\mathcal{L}_{\mathrm{rec}}=\|G_{n}(0,(\tilde{x}_{n+1}^{\mathrm{rec}}) \uparrow^{r})-x_{n}\|^{2}\]
  - 첫 noise $z^*$는 이미지 생성에 필요하니까 고정해서 주고, 그 다음부터는 noise를 안줬다고 한다.
  - ${z_{N}^{\mathrm{rec}}, z_{N-1}^{\mathrm{rec}} \ldots, z_{0}^{\mathrm{rec}}}={z^{*}, 0, \ldots, 0}$

Results

Global structure 뿐만 아니라 fine detail까지 잘 학습
제한된 작은 receptive field를 사용했어서 다양한 조합의 patch들이 새롭게 생성되었다.

Effect of scales at test time

한번 훈련시켜놓으면 test는 간단하나 Fig 8처럼 약간의 문제가 생길 수 있음
보통 global structure는 가장 coarest한 level에서 결정되는데,

\[\tilde{x}_{N}=G_{N}(z_{N})\]

이때 이미지가 비정상적으로 생성이 된다면 Fig 8처럼 다리가 6개인 얼룩말이나 기둥이 2개인 나무가 생성될 수 있음

⇒ 첫번째 이미지 $\tilde{x}_{N}$ 만 우리가 주고, 나머지는 자유롭게 생성되도록 한다면 괜찮게 생성이 됨
- 추가적으로 semantic-to-image 와 같은 추가적인 task도 가능해짐

Effect of scales during training

scale이 작으면 texture와 같은 정보들만 capture
scale이 클수록 global structure들이 같이 학습됨

Application

Super-Resolution

Paint-to-Image

Coarse scale에 (N-1이나 N-2) downsampling한 paint image를 넣어 global structure에 대한 정보를 주면 역시나 이미지가 잘 생성됨

Harmonization

Editing

ICCV 2019 Best Paper 답게 흥미로웠던 논문! 학습 시간이 장당 30분 정도인게 약간 아쉽지만, 모델 구조도 신선하고 이를 활용한 application도 다양해서 재밌게 읽었다.

SinGAN과 StyleGAN2가 왜 잘 동작했는지를 분석한 다음 논문도 읽어보시는 걸 추천합니다 :)

Positional Encoding as Spatial Inductive Bias in GANs (CVPR2021)

Twitter Facebook LinkedIn

Jihye Back

[Paper Review] SinGAN: Learning a Generative Model from a Single Natural Image 논문 리뷰

SinGAN

Multi-scale architecture

Training

Results

Effect of scales at test time

Effect of scales during training

Application

Super-Resolution

Paint-to-Image

Harmonization

Editing

공유하기

댓글남기기

참고

[Paper Review] RAFT: Adapting Language Model to Domain Specific RAG 논문 리뷰

[4] RAG (Retriever Augumented Generation)

[3-2] A Survey of Large Language Models - Adaptation of LLMs

[3-1] A Survey of Large Language Models - 다양한 LLMs부터, Data, Architecture, Training 까지