[Paper Review] Swapping Autoencoder for Deep Image Manipulation 논문 리뷰

업데이트: March 09, 2022

Paper: Swapping Autoencoder for Deep Image Manipulation (NeurIPS 2020): arxiv, code1, code2
GAN-Zoos! (GAN 포스팅 모음집)

Swapping Autoencoder

Image를 structure와 texture로 나눠서 encoding (두 latent code는 disentangle)

서로 다른 이미지에서 나온 structure code와 texture code를 바탕으로 새로운 이미지를 생성할 수 있도록 학습

patch discriminator를 도입하여 이미지의 texture를 보다 잘 학습하도록 함

unsupervised training

비슷한 논문 : [MUNIT] Multi-Modal Unsupervised Image-to-Image Translation (ECCV 2018 / Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz)

Model

Reconstruction
- 두 개의 이미지를 따로 swapping 한다거나 하지 않고, recon이 잘되도록 하는 loss
\[\mathcal{L}_{\mathrm{rec}}(E, G)=\mathbb{E}_{\mathbf{x} \sim \mathbf{x}}\left[\|\mathbf{x}-G(E(\mathbf{x}))\|_{1}\right]\]
- non-saturating adversarial loss를 추가
  \[\mathcal{L}_{\mathrm{GAN}, \mathrm{rec}}(E, G, D)=\mathbb{E}_{\mathbf{x} \sim \mathbf{x}}[-\log (D(G(E(\mathbf{x}))))]\]
Decompose latent codes: $z =(z_s, z_t)$
- structure code: $z_s$
  - location specific information
  - spatial한 정보를 잘 담고 있도록 2-d로 구성
  - local information에 대한 inductive bias를 제공할 수 있도록 receptive field를 제한하여 structure code를 뽑음 [링크]
- texture code: $z_t$
  - texture distribution, 1-d
  - average pooling을 때려서 이미지의 spatial한 정보들을 날려버리고, style에 대한 정보를 담고 있도록 만들었다.
- structure code와 texture code를 joint하게 학습하여 나중에 random texture code에 대해서도 이미지가 잘 생성되도록 학습
- GAN loss
  \[\mathcal{L}_{\mathrm{GAN}, \mathrm{swap}}(E, G, D)=\mathbb{E}_{\mathbf{x}^{1}, \mathbf{x}^{2} \sim \mathbf{X}, \mathbf{x}^{1} \neq \mathbf{x}^{2}}\left[-\log \left(D\left(G\left(\mathbf{z}_{s}^{1}, \mathbf{z}_{t}^{2}\right)\right)\right)\right]\]
Co-occurrent patch statistics
- texture와 structure가 disentangle하게 encoding되도록 학습하는 term
- 이미지의 texture는 texture code에 의해 학습되도록 hybrid image의 patch를 discriminator가 판별하도록 함 (Patch Discriminator)
\[\mathcal{L}_{\text {CooccurGAN }}\left(E, G, D_{\text {patch }}\right)=\mathbb{E}_{\mathbf{x}^{1}, \mathbf{x}^{2} \sim \mathbf{X}}\left[-\log \left(D_{\text {patch }}\left(\operatorname{crop}\left(G\left(\mathbf{z}_{s}^{1}, \mathbf{z}_{t}^{2}\right)\right), \operatorname{crops}\left(\mathbf{x}^{2}\right)\right)\right)\right]\]

Total Loss

\[\mathcal{L}_{\text {total }}=\mathcal{L}_{\text {rec }}+0.5 \mathcal{L}_{\mathrm{GAN}, \mathrm{rec}}+0.5 \mathcal{L}_{\mathrm{GAN}, \mathrm{swap}}+\mathcal{L}_{\text {CooccurGAN }}\]

Model Architecture

Generator
- StyleGAN2의 G가 base
Discriminator
- StyleGAN2의 D와 구조가 동일
- co-occurrence path discriminator: 처음에 각 patch에 대한 feature를 뽑은 후 이들을 concate해서 final classification layer로 보냄
Encoder
- 4 downsampling ResNet blocks → conv → $z_s$ 생성
- conv + average pooling layer → $z_t$ 생성

Result

Twitter Facebook LinkedIn

Jihye Back

[Paper Review] Swapping Autoencoder for Deep Image Manipulation 논문 리뷰

Model

Total Loss

Model Architecture

Result

공유하기

댓글남기기

참고

[Paper Review] RAFT: Adapting Language Model to Domain Specific RAG 논문 리뷰

[4] RAG (Retriever Augumented Generation)

[3-2] A Survey of Large Language Models - Adaptation of LLMs

[3-1] A Survey of Large Language Models - 다양한 LLMs부터, Data, Architecture, Training 까지