[Paper Review] Fewshot-SMIS: Few-shot Semantic Image Synthesis Using StyleGAN Prior 논문 리뷰

업데이트: February 20, 2022

Paper: Fewshot-SMIS: Few-shot Semantic Image Synthesis Using StyleGAN Prior (CoRR 2021): arxiv, code
GAN-Zoos! (GAN 포스팅 모음집)

Fewshot-SMIS

few-shot semantic image synthesis: few shot의 이미지 만으로 semantic to real image 모델을 훈련할 수 있다.

training StyleGAN encoder: **본 논문은 semantic mask를 latent space로 embedding 시켜주는 encoder를 학습하며, pre-trained StyleGAN generator로 realistic image를 합성한다.

dense한 semantic mask 뿐만 아니라 landmark나 scribble과 같은 spare inputs 도 encoding이 가능하다.

Semantic layouts to Photographs = Challenging ⭐️:

Domain gap이 존재
few-shot 밖에 없는 semantic images들을 StyleGAN의 latent space로 inversion 시키는 것은 어려움

Few-shot Semantic Image Synthesis

unlabeled images ( $N_u = 1 or 5$ ): $\mathcal{D}_{u}=\left\{\mathbf{y}_{i}\right\}_{i=1}^{N_{u}}$
labeled images $N_l$: $\mathcal{D}_{l}=\left\{\mathbf{x}_{i}, \mathbf{y}_{i}\right\}_{i=1}^{N_{l}}$
- one-hot semantic mask $\mathbf{x} \in{0,1}^{C \times W \times H}$
  - Fig3 참고: y 이미지에 pixel-align이 되어있는 dense map pixel image 이거나 scribble이나 landmark와 같은 sparse map
- GT RGB image $\mathbf{y} \in \mathbb{R}^{3 \times W \times H}$

Pseudo labeling

⭐️ Goal: few labeled pair $\mathcal{D}_l$ 로 되어있는 semantics과 unlabeled dataset인 $\mathcal{D}_u$ 로 구성되어있는 StyleGAN의 latent space를 mapping하는 것

How?
1. semantic class들을 대표하는 feature representative vector를 추출
2. k-nearest neighbor search로 위 vector를 StyleGAN의 feature map과 matching
  - 2의 방식으로 semantic label와 StyleGAN의 feature map을 matching하기 때문에 pseudo labeling이 가능하다
  - StyleGAN의 latent space의 random noise로부터 pseudo semtantic mask를 얻을 수 있다.
  - 이 pseudo labeling은 Fig3 처럼 noisy하기는 하지만 이는 spatial global information을 잘 encoding한 정보이기 때문에 StyleGAN의 Generator에 넣었을 때 high-quality의 image를 합성할 수 있다.

Dense psudo labeling

optimization 방식으로 GT RGB images $y$ 를 StyleGAN의 latent space로 inversion
forward propagation으로 feature map을 추출
(1, 2를 통해 뽑힌 feature map과 semantic masks $x_i$) pair를 이용하여 각 semantic class $c$ 에 대한 representative vector $v_c$ 를 추출
- 이때 feature map $\mathbf{F}_{i} \in \mathbb{R}^{Z \times W^{\prime} \times H^{\prime}}$ 은 masked average pooling을 적용하며
- semantic mask $\mathbf{x}_{i}^{\prime} \in \mathbb{R}^{C} \times W^{\prime} \times H^{\prime}$ 의 size를 resize 한다.
- 과정 3은 PANet 논문의 Prototype learning 의 방식을 따름
\[\mathbf{v}_{c}=\frac{1}{N_{l}} \sum_{i=1}^{N_{l}} \frac{\sum_{x, y} \mathbf{F}_{i}^{x, y} \mathbb{1}\left[\mathbf{x}_{i}^{(c, x, y)}=1\right]}{\sum_{x, y} \mathbb{1}\left[\mathbf{x}_{i}^{(c, x, y)}=1\right]}\]
pSp 기반의 Encoder 학습
- Fig 6 참고
- encoder를 학습시키려면 pseudo semantic masks를 생성하는 과정이 필요하다
  - 우리가 가지고 있는 데이터는 few-shot의 paired dataset
  - 하지만 encoder를 학습시키려면 많은 데이터셋이 필요하기 때문에 random noise $z$ 와 쌍을 이루는 semantic mask 가 있으면 좋음
  - 본 논문은 이 paired dataset을 Fig 4 의 과정을 통해 구하고자 하며, semantic mask를 구하기 위해서는 representative vector가 필요하다. (이 vector는 1,2,3의 과정을 통해 구함)
- How to train?
  1. random noise 를 pretrained StyleGAN에 넣어 image를 합성
  2. feature map $F’$을 추출
    - 이때 feature map은 StyleGAN Generator의 최상단 layer말고 64 x 64 resolution을 갖도록 하는 layer에서 뽑는다
  3. representative vectors와 $F’$의 pixel-wise vector 사이의 nearest-neighbor matching을 통해 semantic masks를 계산
    - cosine similarity를 통해 계산함으로써 semantic mask를 추출
    \[c^{(x, y)}=\operatorname{argmax}_{c \in C} \cos \left(\mathbf{v}_{c}, \mathbf{F}^{\prime(x, y)}\right)\]
    - 마지막으로는 dense semantic mask를 본래 사이즈(a의 synthesis image’s size)로 resize
  4. 4-1,2,3의 방식으로 synthesised image $G(z)$ 에 대응되는 pseudo semantic masks를 구할 수 있다.
  5. 이 semantic mask를 pSp Encoder에 넣은 후 latent code $\left{\hat{\mathbf{w}}{i}\right}{i=1}^{L}$를 추출
  6. e에서 추출한 latent code와 본래의 latent code가 비슷해지도록 encoder를 optimize
  \[\mathcal{C}=\mathbb{E}_{\mathbf{w} \sim f(\mathbf{z})}\|\hat{\mathbf{w}}-\mathbf{w}\|_{2}^{2}\]

Sparse pseudo labeling

이미지를 sparse하게 labeling하는 방식은 dense labeling과 약간 다르다.

기존에 dense pseudo labeling에서는 feature map을 masked average pooling한 후 real semantic mask와의 prototype learning을 통해 representative vector를 구했다면, sparse pseudo labeling의 경우에는 pixel-wise하게 feature를 뽑아서 representative vector를 구한다.

이후 좀 더 범용적인 pseudo-labeled image를 뽑기 위해 annotated pixel들과 pixel-wise vector간의 one-to-one matching을 계산하지 않고, 다양한 annotated pixels들이 identitcal pixel-wise vector와 matching되도록 many-to-one으로 mapping을 했다.

→ one-nearest-neighbor 대신 top-k의 nearest neightbor를 계산