[Paper Review] ScreenStyle: Manga Filling Style Conversion with Screentone Variational Autoencoder 논문 리뷰

업데이트: March 20, 2022

Paper: ScreenStyle: Manga Filling Style Conversion with Screentone Variational Autoencoder (SIGGRAPH Asia 2020): paper, project, code
GAN-Zoos! (GAN 포스팅 모음집)

We propose a novel variational model, ScreenVAE, to characterize the local texture property at a single point, without the interference of boundary and overlapping fine details, as an interpolative ScreenVAE feature map.

With the ScreenVAE unifying the property of screentone and color, we propose to learn and convert between screening and color-filling styles.

The proposed ScreenVAE effectively simplifies the complex patterns in manga, and assists manga inpainting effectively.

ScreenStyle

ScreenVAE
- ScreenEncoder(SE): screened manga image를 Input $I_m$ 으로 받아 intermediate ScreenVAE map $I_s$ 을 생성
  - intermediate Screen VAE map $I_s$ : 이미지의 texture 정보를 담고 있는 feature map, 특정 픽셀의 주변부에 있는 local neighborhood의 texture 특징까지를 encoding
  - 이미지의 content나 region semantics을 추출해내는 part
- ScreenDecoder(DE): ScreenVAE map $I_s$ 을 다시 screened manga $I’_m$ 로 decoding
  - 이미지에 screentone을 입히는 부분
  - dataset에 있는 다양한 screentone을 표현할 수 있음
- multi-scale design
  - ScreenEncoder, ScreenDecoder 모두 multi-scale로 디자인하여 이미지의 content region을 잘 뽑아낼 뿐만 아니라 특정 region의 screentone이 일정할 수 있도록 함
Bidirectional Translation Model
- screen manga domain과 color comic domain간의 자유로운 translation이 가능하도록 하는 모델
- Screen2Color G: ScreenVAE map $I_s$ 을 color comic $I_{s arrow c }$ 으로 변환
- Color2Screen G: color comic $I_{s arrow c }$ 을 ScreenVAE map $I_s$ 으로 변환
- cycle-consistency하게 두 도메인간의 변환이 잘되도록 저자들은 adv-loss 기반의 tailored bidirectional style translation model를 제안하였음
ScreenEncoder와 ScreenDecoder를 함께 사용하면 흑백의 manga image가 생성되고
ScreenEncoder와 Screen2Color를 함께 사용하면 color comics image가 생성됨

ScreenVAE

Network Architecture

The ScreenVAE map has the same resolution as the input manga.
Each pixel in the ScreenVAE map summarizes the texture characteristics of a local neighborhood in the input manga within the receptive field
- 실험을 해보니 ScreenVAE maps은 4 channel이면 충분하다고 함
ScreenEncoder
- downscaling-upscaling network with 6 residual blocks
ScreenDecoder
- 5-level U-net structure with strided de-convolutional operations
- structure를 잘 유지하면서도 다양한 scale의 screentone 생성이 가능해짐

Objectives

\[\mathcal{L}_{\text {scr }}=\lambda_{\text {rec }} \mathcal{L}_{\text {rec }}+\lambda_{\text {spp }} \mathcal{L}_{\text {spp }}+\lambda_{z} \mathcal{L}_{z}+\lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}\]

Reconstruction Loss
- Input Manga $I_m$과 recon manga $R_m$ 이 비슷하도록 강제하는 loss
- decoder SD 가 잘 학습되도록 도와주는 term
- pixel-wise mean square error (MSE) loss
  \[\mathcal{L}_{\text {rec }}=\mathbb{E}_{I_{m} \sim \mathcal{I}_{m}}\{\|R_{m}-I_{m}\|_{2}\}\]
Superpixel Loss
- ScreenVAE map $I_s$ 이 input manga의 texture를 잘 encoding하도록 도와주는 term, 이때 region 별로 tone이 일정해야함
- To extract constant-tone regions
  1. total-variation based smoothing → input manga의 tonal intensity map $I_t$ 를 얻음
  2. tonal intensity map $I_t$ 에 simple linear iterative clustering (SLIC)를 적용하여 super-pixel map $I_{spp}$ 를 얻음
    - super-pixel map $I_{spp}$ 를 얻기 위해 superpixel pooling network(SPN)을 사용했으며, 이를 통해 region representation이 uniform하게 만들었다.
  3. regional texture feature variances를 추정해서 super-pixel map $I_{spp}$ 에서 다양한 tone의 region이 나오는 것을 제거
    - 만약 두 영역간의 tone은 같지만 texture가 다르다면, super-pixel map $I_{spp}$ 에서 두 superpixel을 분리
- Loss
  \[\mathcal{L}_{\mathrm{spp}}=\mathbb{E}_{I_{m} \sim I_{m}}\{w_{l}\|I_{s}-\operatorname{Superpixel}(I_{s}, I_{\mathrm{spp}})\|^{2}\}\]

KL Regularization Loss
- ScreenVAE map이 normally distribute하도록 정규화해주는 term
\[\begin{gathered}\mathcal{L}_{z}=\mathbb{E}_{I_{m} \sim \mathcal{I}_{m}}\{K L(\mathcal{N}(\mu, \sigma) \mid \mathcal{N}(\mathbf{0}, \mathrm{I}))\} \\K L(\mathcal{N}(\mu, \sigma), \mathcal{N}(\mathbf{0}, \mathrm{I}))=\frac{1}{2} \sum(\sigma^{2}+\mu^{2}-\log (\sigma^{2})-1)\end{gathered}\]
Adversarial Loss
- 위의 세 loss만 사용하면 이미지가 blurry하게 생성됨
- clear하고 screentone이 잘 뽑히는 이미지를 생성하기 위해 adv-loss를 사용
- Discrimator $D_{sr}(R_m)$ with 4 strided downscaling blocks
- WGAN-gp을 채택하여 훈련의 안정성을 높임
\[\begin{aligned}\mathcal{L}_{\mathrm{adv}}=& \mathbb{E}_{I_{m} \sim \mathcal{I}_{m}}\{D_{s r}(I_{m})-D_{s r}(R_{m})\} \\&+\mathbb{E}_{\hat{I}_{m} \sim \hat{I}_{m}}\{(\|\nabla_{\hat{I}_{m}} D_{s r}(\hat{I}_{m})\|_{2}-1)^{2}\}\end{aligned}\]
- $\hat{I}_{m}$ : image $I_m$와 $R_m$ 을 linearly interpolate

ScreenVAE는 manga image를 dense한 pixel-wise ScreenVAE map으로 잘 translate

이 ScreenVAE map는 local neighborhood를 고려하여 이미지의 texture를 encoding하며,

interpolation이 가능하기 때문에 dataset에서 보지 못했던 high-quality의 screentone 생성이 가능

Bidirectional Style Translation

Network Architecture

paired dataset 없이 두 도메인간의 translation을 자유롭게 하기 위해 unsupervised learning
CycleGAN과 비슷하게 bidirectional translation model을 도입
- 2개의 generator: Screen2Color G, Color2Screen G
- 7-level U-net structure
  - 각 level은 2개의 conv block(conv + normalization + ReLU)로 구성되어있음
- style extractor $E_{st}$
  - 로 style vector $v_r$ 를 추출한 후 이를 hint로 사용하기 위해 Screen2Color G에 주입 (by AdaIN)
  - reference image $I_r$ 와 output color comic image $G_{m2c}(I_s,v_r)$ 가 비슷한 color composition을 가지도록 강제
  - style vector를 따로 사용하지 않고 random vector를 Screen2Color G에 넣어줘도 colorful comics image가 생성됨
  - 5 strided downscaling blocks and FC layer

Objectives

\[\begin{aligned}\mathcal{L}_{\mathrm{bi}}=& \alpha_{\mathrm{cyc}}(\mathcal{L}_{\mathrm{cyc}}^{c}+\mathcal{L}_{\mathrm{cyc}}^{s})+\alpha_{\mathrm{GAN}}(\mathcal{L}_{\mathrm{GAN}}^{c}+\mathcal{L}_{\mathrm{GAN}}^{s})+\alpha_{\mathrm{sty}} \mathcal{L}_{\mathrm{sty}}+\alpha_{\mathrm{kl}} \mathcal{L}_{\mathrm{kl}}\end{aligned}\]

Bidirectional Cycle-Consistency Loss
- color → screen → color
- screen → color → screen
\[\begin{aligned}&\mathcal{L}_{\mathrm{cyc}}^{c}=\mathbb{E}_{I_{c} \sim I_{c}}\|G_{s arrow c}(G_{c arrow s}(I_{c}))-I_{c}\|_{1} \\&\mathcal{L}_{\mathrm{cyc}}^{s}=\mathbb{E}_{I_{s} \sim I_{s}}\|G_{c arrow s}(G_{s arrow c}(I_{S}))-I_{s}\|_{1}\end{aligned}\]
Adversarial Loss
- high-quality의 color comics & screen manga image가 생성되도록 도와주는 term
\[\begin{aligned}\mathcal{L}_{\mathrm{GAN}}^{c}=& \mathbb{E}_{I_{c} \sim \mathcal{I}_{c}}\{D_{c}(I_{c})\}-\mathbb{E}_{I_{s} \sim I_{s}}\{D_{c}(G_{s arrow c}(I_{s}))\} \\&+\mathbb{E}_{\hat{I}_{c} \sim \hat{I}_{c}}\{(|\nabla \hat{I}_{c} D_{c}(\hat{I}_{c})|_{2}-1)^{2}\} \\\mathcal{L}_{\mathrm{GAN}}^{s}=& \mathbb{E}_{I_{s} \sim \mathcal{I}_{s}}\{D_{s}(I_{s})\}-\mathbb{E}_{I_{c} \sim I_{c}}\{D_{s}(G_{c arrow s}(I_{c}))\} \\&+\mathbb{E}_{\hat{I}_{s} \sim \hat{I}_{s}}\{(|\nabla \hat{I}_{s} D_{s}(\hat{I}_{s})|_{2}-1)^{2}\}\end{aligned}\]
Style Loss
- 생성된 color comic 이미지가 reference image의 style과 비슷하도록 style loss 사용
- style feature는 illustration2vec network $\phi$ 로 추출했다고 함
\[\begin{aligned}\mathcal{L}_{\mathrm{sty}}=& \mathbb{E}_{(I_{s}, I_{c}) \sim(I_{f}, I_{\rfloor})}(\sum_{l} \| \operatorname{mean}(\phi^{l}(I_{c})).\\&-\operatorname{mean}(\phi^{l}(G_{s arrow c}(I_{s}, E_{\mathrm{st}}(z \mid I_{c})))) \|_{2} \\&+\sum_{l} \| \operatorname{std}(\phi^{l}(I_{c})-\operatorname{std}(\phi^{l}(G_{s arrow c}(I_{s}, E_{\mathrm{st}}(z \mid I_{c})))) \|_{2})\end{aligned}\]
Style Regularization Loss
- style vector를 normalize
\[\mathcal{L}_{\mathrm{kl}}=\mathbb{E}_{I_{c} \sim I_{c}} K L(E_{s t}(z, I_{c}) \mid \mathcal{N}(\mathbf{0}, \mathrm{I}))\]

Results

Twitter Facebook LinkedIn

Jihye Back

[Paper Review] ScreenStyle: Manga Filling Style Conversion with Screentone Variational Autoencoder 논문 리뷰

ScreenStyle

ScreenVAE

Network Architecture

Objectives

Bidirectional Style Translation

Network Architecture

Objectives

Results

공유하기

댓글남기기

참고

[Paper Review] LLaMA: Open and Efficient Foundation Language Models 논문 리뷰

[Paper Review] I-DDPM: Improved Denoising Diffusion Probabilistic Models 논문 리뷰

[Paper Review] DDIM: Denoising Diffusion Implicit Models 논문 리뷰

[Paper Review] DDPM: Denoising Diffusion Probabilistic Models 논문 리뷰