[Paper Review] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models 논문 읽기

업데이트: July 13, 2021

✍🏻 이번 포스팅에서는 적은 이미지로 학습가능한 Talking head model에 대해 살펴본다.

Paper : Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (arxiv 2019 /Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky)
GAN-Zoos! (GAN 포스팅 모음집)

1. Introduction

⭐ Goal : Few-shot의 이미지로 학습하여 진짜같은 Talking head model 만들기

Realistic Neural Talking Head를 합성하는 건 어려움

사람의 얼굴은 너무 복잡 : 얼굴뿐만 아니라 머리, 옷 등을 모델링하기 어려움
Uncanny valley effect : 사람같이는 생겼는데 애매하게 닮으면 거부감이 심함 ➡ 정말 진짜 사람같은 이미지를 만들어야함

In this work,

Few-shot learning 으로 talking head model 생성이 가능. (one-shot learning으로도 학습이 가능하지만, few-shot일 때가 더 성능이 좋음)
- meta-learning : 아주 큰 데이터로 talking head video로 학습시킨 pre-trained model을 fine-tunning 하는 방식으로 학습하기 때문에 Few-shot learning 가능
Warping이 아니라 Direct synthesis 의 방식을 차용.
- Warping-based system : 적은 수의 이미지로도 talking head sequences를 생성할 수 있지만, 다양한 동작이나 움직임 등을 합성하긴 어려움
- Direct synthesis (warping-free) : Deep ConvNet을 훈련시키는 방식. 다만, large corpus와 많은 resource 필요

2. Methods

2.1 Architecture and Meta-Learning stage

Generator

Generator는 reference video(source)의 몇몇 frame들을 Embedding한 결과 $\hat{e_i}$와 Target Image의 Landmark $y_i(t)$를 입력으로 받아 새로운 이미지 $\hat{x_i}(t)$를 합성
- Embedder : source video에서 random으로 추출한 image와 그 이미지의 landmark를 활용하여 N-dim의 vector로 임베딩. 각 이미지의 임베딩 값들간의 평균이 $\hat{e_i}$
Generator Loss : content loss + adversarial loss + match loss

\[\mathcal{L}\left(\phi, \psi, \mathbf{P}, \theta, \mathbf{W}, \mathbf{w}_{0}, b\right)=\mathcal{L}_{\mathrm{CNT}}(\phi, \psi, \mathbf{P})+\mathcal{L}_{\mathrm{ADV}}\left(\phi, \psi, \mathbf{P}, \theta, \mathbf{W}, \mathbf{w}_{0}, b\right)+\mathcal{L}_{\mathrm{MCH}}(\phi, \mathbf{W})\]

Discriminator

Discriminator : 합성된 이미지 $\hat{x_i}(t)$가 landmark $y_i(t)$를 잘 반영하고 있는지 판단 (realism score)
Discriminator Loss

\[\mathcal{L}_{\mathrm{DSC}}\left(\phi, \psi, \mathbf{P}, \theta, \mathbf{W}, \mathbf{w}_{0}, b\right)=\max \left(0,1+D\left(\hat{\mathbf{x}}_{i}(t), \mathbf{y}_{i}(t), i ; \phi, \psi, \theta, \mathbf{W}, \mathbf{w}_{0}, b\right)\right)+\max \left(0,1-D\left(\mathbf{x}_{i}(t), \mathbf{y}_{i}(t), i ; \theta, \mathbf{W}, \mathbf{w}_{0}, b\right)\right)\]

2.2 Few-shot learning by fine-tuning

(1) Target image ➡ Landmark image : source video의 $T$ 개의 frame에 대하여 training image와 landmark image를 모두 구해야함

\[x(1), x(2), ..., x(T) / y(1), y(2), ..., y(T)\]

(2) meta-learned Embedder를 통해 embedding $\hat{\mathbf{e}}_{\mathrm{NEW}}$ 계산

\[\hat{\mathbf{e}}_{\mathrm{NEW}}=\frac{1}{T} \sum_{t=1}^{T} E(\mathbf{x}(t), \mathbf{y}(t) ; \phi)\]

위와 같은 방식으로 pretrained model을 활용하여 새로운 이미지를 합성. 다만, Fine-tuning 과정이 필요

Generator

$G\left(\mathbf{y}(t), \hat{\mathbf{e}}_{\mathrm{NEW}} ; \psi, \mathbf{P}\right)$ ➡ $G^{\prime}\left(\mathbf{y}(t) ; \psi, \psi^{\prime}\right)$
$\psi^{\prime}=\mathbf{P} \hat{\mathbf{e}}_{\mathrm{NEW}}$

새로 계산된 embedding $\hat{\mathbf{e}}_{\mathrm{NEW}}$을 활용하여 $\psi^{\prime}$ 로 초기화

Discriminator

Fine-Tuning 과정에서도 meta-learning stage와 비슷한 방식으로 D는 realism score를 계산
\[D^{\prime}\left(\hat{\mathbf{x}}(t), \mathbf{y}(t) ; \theta, \mathbf{w}^{\prime}, b\right)= V(\hat{\mathbf{x}}(t), \mathbf{y}(t) ; \theta)^{T} \mathbf{w}^{\prime}+b\]

Loss Function

Generator Loss

\[\mathcal{L}^{\prime}\left(\psi, \psi^{\prime}, \theta, \mathbf{w}^{\prime}, b\right)=\mathcal{L}_{\mathrm{CNT}}^{\prime}\left(\psi, \psi^{\prime}\right)+\mathcal{L}_{\mathrm{ADV}}^{\prime}\left(\psi, \psi^{\prime}, \theta, \mathbf{w}^{\prime}, b\right)\]

Discriminator Loss

\[\mathcal{L}_{\mathrm{DSC}}^{\prime}\left(\psi, \psi^{\prime}, \theta, \mathbf{w}^{\prime}, b\right)= \max \left(0,1+D\left(\hat{\mathbf{x}}(t), \mathbf{y}(t) ; \psi, \psi^{\prime}, \theta, \mathbf{w}^{\prime}, b\right)\right)+ \max \left(0,1-D\left(\mathbf{x}(t), \mathbf{y}(t) ; \theta, \mathbf{w}^{\prime}, b\right)\right)\]

3. Experiments

Dataset : VoxCeleb1, VoxCeleb2
Metrics : FID, SSIM. CSIM, USER
Method : FF(no fine-tune), FT

다른 모델들보다 score가 낮기도 하지만, target 이미지를 가장 잘 반영 + user평가도 높음

4. Conclusions

✍🏻 본 논문은 Deep Generator network로 진짜같은 talking head 이미지를 생성할 수 있는 모델을 제안했다. 한장의 이미지만으로도 괜찮은 이미지가 합성되며, few-shot image로 학습하면 진짜같은 이미지가 생성된다.

다만, 아직 시선처리와 같이 mimic representation은 잘 안되며, landmark adaptation이 필요하다는 한계가 있다.

Twitter Facebook LinkedIn

Jihye Back

[Paper Review] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models 논문 읽기

1. Introduction

2. Methods

2.1 Architecture and Meta-Learning stage

2.2 Few-shot learning by fine-tuning

3. Experiments

4. Conclusions

공유하기

댓글남기기

참고

[Paper Review] RAFT: Adapting Language Model to Domain Specific RAG 논문 리뷰

[4] RAG (Retriever Augumented Generation)

[3-2] A Survey of Large Language Models - Adaptation of LLMs

[3-1] A Survey of Large Language Models - 다양한 LLMs부터, Data, Architecture, Training 까지