[Paper Review] Learning to Cartoonize Using White-box Cartoon Representations 논문 리뷰

업데이트: March 15, 2022

Paper: Learning to Cartoonize Using White-box Cartoon Representations (CVPR 2020): paper, project, code
GAN-Zoos! (GAN 포스팅 모음집)

기존의 연구들(ex. CartoonGAN)은 특정 training data에 fit되게 학습되는 black-box model

→ general하고 style이 잘 담겨져있는 이미지를 생성하는 능력이 떨어짐

White Box Cartoonization

white box cartoonization 논문은 작가들이 어떻게 그림을 그리는지에 착안하여 이 방법론을 모델에 적용하였다. 작가들은 보통 이미지의 윤곽을 먼저 그려놓은 후(surface) 이미지의 detail들을 추가하곤 하는데(texture), 본 논문에서는 이런 그림 프로세스 자체가 모델에도 녹아들도록 하였다.

Contribution (1) 그림을 그릴 때는 크게 세가지가 중요 ⭐️

based on our observation of cartoon painting behavior

surface representations
- image의 smooth surface를 뽑아내는 파트
- 이미지의 weighted low-frequency component $\boldsymbol{I}_{s f} \in \mathbb{R}^{W \times H \times 3}$를 추출
  - color composition이나 edge와 같은 surface texture는 보존되지만 detail한 특징이나 texture는 무시
- 보통 작가들이 composition draft를 먼저 그리고 차후에 detail을 추가하곤 하는데, 이 smoothed surface를 찾는 파트가 이 프로세스라고 생각하면 됨
structure representation
- global structural information이랑 만화 스타일의 sparse color block을 찾는 부분
- 먼저 input image에서 segmentation map을 뽑은 다음에, 각 seg region에 adaptive coloring algorithm을 적용시켜서 structure representation $\boldsymbol{I}_{s t} \in \mathbb{R}^{W \times H \times 3}$ 을 생성
texture representation
- 이미지의 painted detail이나 edge 정보를 포함하는 부분
- input image $\boldsymbol{I} \in \mathbb{R}^{W \times H \times 3}$ 를 single-channel intensity map $\boldsymbol{I}_{t} \in \mathbb{R}^{W \times H \times 1}$ 으로 변환
  - 이때 이 intensity map은 color나 luminance에 대한 정보는 가지고 있지 않으며 relative pixel intensity 만 보존된다
- 이 파트는 network가 high-frequency인 textural detail을 학습하도록 돕는다

Contribution (2) GAN Framework

(1) 의 3 representations을 바탕으로 GAN 모델을 end-to-end로 optimize

GAN framework
- Generator G
- 두 개의 Discriminator, D
  - $D_S$ : model의 output에서 뽑아낸 surface rep와 실제 cartoon에서 뽑아낸 surface rep를 판별
  - $D_t$ : model의 output과 cartoon에서 뽑아낸 texture rep를 판별
- pre-trained VGG
  - output과 추출된 structure rep 사이의 perceptual similarity를 구해 global content에 대한 constraint를 검
  - 또한, input과 output사이에 대해서도 같은 loss를 계산

Contribution (3) high-quality의 이미지 생성 & SoTA !

Proposed Approach

(1) Learning From the Surface Representation

The surface representation: 이미지를 smooth하게 만드는 파트

global semantic structure는 보존되지만, texture나 detail들을 사라진다.

edge가 보존된채 이미지를 filtering하기 위해 differentiable guided filter $\mathcal{F}_{d g f}(\boldsymbol{I}, \boldsymbol{I})$ 를 도입

$\mathcal{F}_{d g f}(\boldsymbol{I}, \boldsymbol{I})$ 는 image $\boldsymbol{I}$ 를 input과 guide map으로써 사용
texture나 detail을 사라진채 surface representation만 추출된다.

\[\begin{aligned}\mathcal{L}_{\text {surface }}\left(G, D_{s}\right)=\log D_{s}\left(\mathcal{F}_{d g f}\left(\boldsymbol{I}_{c}, \boldsymbol{I}_{c}\right)\right) \\+\log \left(1-D_{s}\left(\mathcal{F}_{d g f}\left(G\left(\boldsymbol{I}_{p}\right), G\left(\boldsymbol{I}_{p}\right)\right)\right)\right)\end{aligned}\]

Discriminator: model의 output과 reference cartoon image $I_c$ 가 비슷한 surface를 가지고 있는지 판별

(2) Learning From the Structure representation

The structure representation: global content, sparse color blocks, celluloid style cartoon workflow내의 깔끔한 boundaries를 학습

이미지의 자그마한 chunk/block 들이 비슷한 color, content를 가지도록 강제하는 term

felzenszwalb algorithm를 통해 image를 segmentation
추가로, selective search를 도입하여 segmented region을 병합하고 sparse segmentation map을 추출
- Felzenszwalb 알고리즘과 같은 superpixel algorithm은 pixel들간의 similarity만을 고려하고 의미있는 정보들은 무시하기 때문에 이 과정이 필요하다고 함
색 추출 과정에서 segmentation 내의 average color를 구하지 않고, adaptive coloring algorithm을 도입
- Standard superpixel algorithms은 각각의 segmented region에서 pixel들 간의 average color를 구하곤 함
  - 이렇게 했더니 global contrast가 낮아지고, 이미지가 어두워져 최종 이미지의 quality가 떨어짐 (Fig 5 (a, c))
- Fig 5 (a, c)에서의 문제를 해결하기 위해 저자들은 adaptive coloring algorithm을 도입
  \[\begin{aligned}\boldsymbol{S}_{i, j} &=\left(\theta_{1} * \overline{\boldsymbol{S}}+\theta_{2} * \tilde{\boldsymbol{S}}\right)^{\mu} \\\left(\theta_{1}, \theta_{2}\right)=& \begin{cases}(0,1) & \sigma(\boldsymbol{S})<\gamma_{1} \\(0.5,0.5) & \gamma_{1}<\sigma(\boldsymbol{S})<\gamma_{2} \\(1,0) & \gamma_{2}<\sigma(\boldsymbol{S})\end{cases}\end{aligned}\]
  - parameter는 다음의 값이 가장 좋은 결과를 냈다고 보고
    \[\gamma_{1}=20, \gamma 2=40 \text { and } \mu=1.2\]
pre-trained VGG19 network로 이런 structure representation이 잘 추출되도록 학습

\[\mathcal{L}_{s t r u c t u r e}=\left\|V G G_{n}\left(G\left(\boldsymbol{I}_{p}\right)\right)-V G G_{n}\left(\mathcal{F}_{s t}\left(G\left(\boldsymbol{I}_{p}\right)\right)\right)\right\|\]

(3) Learning From the Textural Representation

the texture representation: cartoon image의 high-frequency feature인 texture에 대한 정보를 담고 있는 파트

이미지의 밝기와 색상 정보는 실제 이미지와 cartoon image를 쉽게 판별할 수 있게 한다.
따라서 저자들은 random color shift algorithm $\mathcal{F}_{r c s}$ 를 적용한 color image 에서 single-channel texture representation 을 뽑아냈다고 한다.
\[\mathcal{F}_{r c s}\left(\boldsymbol{I}_{r g b}\right)=(1-\alpha)\left(\beta_{1} * \boldsymbol{I}_{r}+\beta 2 * \boldsymbol{I}_{g}+\beta_{3} * \boldsymbol{I}_{b}\right)+\alpha * \boldsymbol{Y}\]
- $\boldsymbol{I}_{r g b}$ : 이미지의 RGB, 세 color channels
- $\boldsymbol{Y}$ : RGB image를 standard grayscale로 변환한 이미지
- $\text { We set } \alpha=0.8, \beta_{1}, \beta_{2} \text { and } \beta_{3} \sim U(-1,1)$
random color shift를 하기 때문에 random intensity map이 생성되고, 이미지의 밝기나 색상 정보가 제거된다.

\[\begin{aligned}\mathcal{L}_{\text {texture }}\left(G, D_{t}\right) &=\log D_{t}\left(\mathcal{F}_{r c s}\left(\boldsymbol{I}_{c}\right)\right) \\&+\log \left(1-D_{t}\left(\mathcal{F}_{r c s}\left(G\left(\boldsymbol{I}_{p}\right)\right)\right)\right)\end{aligned}\]

Discriminator: model의 output과 reference cartoon image $I_c$ 가 비슷한 texture를 가지고 있는지 판별

(4) Full model

하나의 Generator와 2개의 Discriminator를 사용하여 세 representation에서 학습된 feature들이 joint하게 optimize되도록 학습

\[\begin{aligned}\mathcal{L}_{\text {total }} &=\lambda_{1} * \mathcal{L}_{\text {surface }}+\lambda_{2} * \mathcal{L}_{\text {texture }} \\&+\lambda_{3} * \mathcal{L}_{\text {structure }}+\lambda_{4} * \mathcal{L}_{\text {content }}+\lambda_{5} * \mathcal{L}_{t v}\end{aligned}\]

total-variation loss $\mathcal{L}_{t v}$
- 생성된 이미지가 spatial smoothness하도록 학습 + high frequency noise를 줄여줌
\[\mathcal{L}_{t v}=\frac{1}{H * W * C}\left\|\nabla_{x}\left(G\left(\boldsymbol{I}_{p}\right)\right)+\nabla_{y}\left(G\left(\boldsymbol{I}_{p}\right)\right)\right\|\]
content loss $\mathcal{L}_{\text {content }}$
- input image와 결과 이미지가 비슷한 semantic 정보를 가지도록 강제해주는 term
- structure loss와 비슷하게 pre-trained VGG 16 loss를 사용
\[\mathcal{L}_{\text {content }}=\left\|V G G_{n}\left(G\left(\boldsymbol{I}_{p}\right)\right)-V G G_{n}\left(\boldsymbol{I}_{p}\right)\right\|\]
interpolation loss
- output image를 sharp하게 만들기 위해 loss를 추가
\[\boldsymbol{I}_{\text {interp }}=\delta * \mathcal{F}_{\text {dgf }}\left(\boldsymbol{I}_{i n}, G\left(\boldsymbol{I}_{i n}\right)\right)+(1-\delta) * G\left(\boldsymbol{I}_{i n}\right)\]

Experiments

Implementation

Training
- 우선 content loss만을 사용하여 generator를 50k iter 동안 pre-train을 한 후
- GAN 으로 jointly optimize
- 총 100k iter동안 학습을 했다고 함 (이후부터는 거의 수렴)

Generator
- fully convolutional U-Net 구조
- stride2 conv-layer로 down-sampling
- bilinear interpolation layer로 upsampling → checker-board artifact를 없앰
Discriminator
- Patch discriminator → detail들이 더 잘 학습되며, 훈련 속도가 빨라짐

Datasets

real world photos
- FFHQ: 10000 images
- landscape: 5000 images
Cartoon images
- human face: 10000 images
- landscape: 10000 images
- Style: Kyoto animation, P.A.Works, Shinkai Makoto, Hosoda Mamoru, and Miyazaki Hayao

Result

Twitter Facebook LinkedIn

Jihye Back

[Paper Review] Learning to Cartoonize Using White-box Cartoon Representations 논문 리뷰

White Box Cartoonization

Proposed Approach

(1) Learning From the Surface Representation

(2) Learning From the Structure representation

(3) Learning From the Textural Representation

(4) Full model

Experiments

Implementation

Datasets

Result

공유하기

댓글남기기

참고

[Paper Review] RAFT: Adapting Language Model to Domain Specific RAG 논문 리뷰

[4] RAG (Retriever Augumented Generation)

[3-2] A Survey of Large Language Models - Adaptation of LLMs

[3-1] A Survey of Large Language Models - 다양한 LLMs부터, Data, Architecture, Training 까지