[Paper Review] Phi-1: Textbooks Are All You Need

Phi-1 (1.3B): Python coding

Summary

  • Transformer-based model, trained for 4 days on 8 A100s
  • [1] pretrain on textbook-quality data (~7B tokens × ~8 epochs ≈ a little over 50B tokens)
  • [2] finetune on textbook-exercise-like data (200M tokens)

  • Despite the small model and the small amount of training data, its HumanEval and MBPP scores are very high

2. Training details and the importance of high-quality data

  • As the title (Textbooks Are All You Need) suggests, the authors train on textbook-quality data
    • Many previous works simply used large code corpora for code generation. The authors argue this is a poor way to teach a model reasoning, meaningful computation, and algorithmic logic (a lot of that code is low-value, and many snippets lack context or are not self-contained)
  • By using textbook-quality data (clear, self-contained, instructive, and balanced), they report SoTA-level results even with a small model and a small dataset
  • Datasets
    • [1] CodeTextbook (~7B tokens)
      • Used in the pretraining phase → phi-1-base (29% on HumanEval)
      • A filtered code-language dataset, which is a subset of The Stack and StackOverflow, obtained by using a language model-based classifier (consisting of about 6B tokens).
      • A synthetic textbook dataset consisting of <1B tokens of GPT-3.5 generated Python textbooks.
    • [2] CodeExercises
      • Used in the finetuning phase → phi-1
      • A small synthetic exercises dataset consisting of ∼180M tokens of Python exercises and solutions.
  • How the datasets were created
    • Filtering of existing code datasets using a transformer-based classifier
      • This method is used to obtain [1-1], the filtered code-language dataset
      • About 100k samples (out of the original ~35M samples / 35B tokens) are annotated with GPT-4, a random forest classifier is trained on those annotations → the classifier is then used to filter the full dataset (see the sketch after this block)
      • Our filtering methodology boosts our model performance significantly even without the synthetic datasets discussed below: for 350M parameter models trained on unfiltered Stack (deduplicated python) and StackOverflow, the HumanEval performance saturates at 12.19% even after training for 96k steps (∼ 200B tokens), while training on the filtered subset achieves 17.68% on HumanEval after 36k steps. We further improve this to 20.12% (reported in Figure 2.1) by training on a combination of the filtered dataset and the synthetic textbooks dataset discussed below.
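A minimal sketch of this filtering step, under assumptions: the embedding model, prompt wording for GPT-4 labels, and tiny sample data below are placeholders (the paper trains a random forest on output embeddings of a pretrained codegen model).

```python
# Hedged sketch: fit a small classifier on a few GPT-4-annotated examples,
# then use it to filter the rest of the corpus for "educational value".
from sentence_transformers import SentenceTransformer  # stand-in embedder
from sklearn.ensemble import RandomForestClassifier

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; the paper uses codegen output embeddings

# Tiny stand-in for the ~100k GPT-4-annotated samples
annotated_snippets = [
    'def add(a, b):\n    """Return the sum of two numbers."""\n    return a + b',
    "x=1;y=2;z=3;print(x,y,z)  # no context, no explanation",
]
labels = [1, 0]  # 1 = textbook quality, 0 = low value

clf = RandomForestClassifier(n_estimators=100)
clf.fit(embedder.encode(annotated_snippets), labels)

def keep(snippet: str) -> bool:
    """Predict whether a code snippet is worth keeping in CodeTextbook."""
    return bool(clf.predict(embedder.encode([snippet]))[0])

# The trained classifier is then applied to the remaining ~35M samples
corpus = ["import os\nprint(os.getcwd())"]
filtered = [s for s in corpus if keep(s)]
```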
    • Creation of synthetic textbook-quality datasets
      • To keep the generated data creative and diverse, it is important to maintain diversity in the dataset
      • Like TinyStories, the authors inject randomness into the prompts
        • TinyStories: a diverse set of short stories were created by including a random subset of words chosen from some fixed vocabulary in the prompt and requiring that they would be somehow combined in the generated text
      • [1-2] A synthetic textbook dataset
        • Generated with GPT-3.5 (~1B tokens); high-quality natural language is given to GPT so that it produces high-quality data (see the sketch below)
        • Topics are varied to build reasoning and basic algorithmic skills
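A minimal sketch of this prompt-randomization idea, assuming the current OpenAI Python SDK; the topic list and prompt wording are illustrative, not the paper's actual prompts.

```python
# TinyStories-style trick: sample a random subset of topics per call so that
# GPT-3.5 does not keep producing near-duplicate textbook sections.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["recursion", "string slicing", "dictionaries", "sorting",
          "list comprehensions", "exception handling", "file I/O"]

def make_prompt() -> str:
    picked = random.sample(TOPICS, k=2)  # random subset -> diverse outputs
    return ("Write a short Python textbook section with clear explanations and "
            f"a worked example that combines {picked[0]} and {picked[1]}.")

def generate_textbook_section() -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": make_prompt()}],
    )
    return resp.choices[0].message.content
```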

      • [2] CodeExercises: A small synthetic exercises dataset
        • Generated with GPT-3.5 (~180M tokens).
        • The goal of this dataset is to align the model to perform function completion tasks based on natural language instructions.
        • This data is used in the finetuning phase; the authors also pruned it to some extent so that it does not overlap with the HumanEval benchmark (an illustrative decontamination sketch follows below)
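The review does not spell out how the pruning was done, so the following is only an illustrative sketch of benchmark decontamination: drop generated exercises that look too similar to any HumanEval prompt. `SequenceMatcher` and the threshold are stand-ins for whatever similarity measure the authors actually used.

```python
# Illustrative decontamination: filter out exercises that closely resemble
# a benchmark prompt, using a simple string-similarity stand-in.
from difflib import SequenceMatcher

def too_similar(exercise: str, benchmark_prompts: list[str], thr: float = 0.8) -> bool:
    return any(SequenceMatcher(None, exercise, p).ratio() >= thr
               for p in benchmark_prompts)

humaneval_prompts = ["def incr_list(l: list):\n    ..."]  # placeholder entries
exercises = ["def add_one_to_each(nums):\n    ..."]
decontaminated = [e for e in exercises if not too_similar(e, humaneval_prompts)]
```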

  • Model architecture and training
    • We use a decoder only transformer [VSP+ 17] model using the FlashAttention implementation of multihead attention (MHA) [DFE+ 22]. We also use MHA and MLP layers in parallel configuration following some recent models like CodeGen [NPH+ 22], PaLM [CND+ 22], and GPT-NeoX [BBH+ 22].
    • The architecture for our 1.3B parameter phi-1 model consists of 24 layers, hidden dimension of 2048, MLP-inner dimension of 8192, and 32 attention heads of dimension 64 each. The smaller 350M parameter phi-1-small model consists of 20 layers, hidden dimension of 1024, MLP-inner dimension of 4096, and 16 attention heads of dimension 64 each. We also use a rotary position embedding [SLP+ 21] with rotary dimension 32. These architectural choices were adopted from [NPH+ 22]. We also use the same tokenizer as codegen-350M-mono [NPH+ 22]. Aside from FlashAttention, our models do not use other techniques like Fill-In-the-Middle (FIM) [BJT+ 22], or Multi-Query-Attention (MQA) [RSR+ 20] that could further boost performance and efficiency [LAZ+ 23].
    • For both pretraining and finetuning, we concatenate our respective datasets into a single dimensional array with “<|endoftext|>” token used for separating the files. We train our models on sequence length of 2048 sliced from our dataset array with next-token prediction loss. We use fp16 training with AdamW optimizer, linear-warmup-linear-decay learning rate schedule, and attention and residual dropout of 0.1.
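A rough sketch of the packing step just described, assuming the codegen-350M-mono tokenizer mentioned above; how remainders and file boundaries are handled is not specified in the paper excerpt, so this is only an approximation.

```python
# Tokenize each file, join everything into one long id stream separated by the
# end-of-text token, then slice fixed 2048-token windows for next-token prediction.
from transformers import AutoTokenizer

SEQ_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

def pack(files: list[str]) -> list[list[int]]:
    stream: list[int] = []
    for text in files:
        stream.extend(tokenizer.encode(text))
        stream.append(tokenizer.eos_token_id)  # "<|endoftext|>" separator
    # slice the single stream into training sequences of length SEQ_LEN
    return [stream[i:i + SEQ_LEN]
            for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]

sequences = pack(["def add(a, b):\n    return a + b\n"])
```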
    • We train on 8 Nvidia-A100 GPUs using deepspeed. Our pretrained base model phi-1-base was obtained in under 4 days of training. Finetuning to obtain phi-1 used an additional 7 hours on the same hardware.
    • Pretraining.
      • phi-1-base was trained on the CodeTextbook dataset (filtered code-language corpus and synthetic textbooks). We use effective batch size 1024 (including data parallelism and gradient accumulation), maximum learning rate 1e-3 with warmup over 750 steps, and weight decay 0.1, for a total of 36,000 steps. We use the checkpoint at 24,000 steps as our phi-1-base – this is equivalent to ∼ 8 epochs on our CodeTextbook dataset for a total of little over 50B total training tokens. Despite the small size and computation, this model already achieves a 29% accuracy on HumanEval.
    • Finetuning.
      • phi-1 is obtained by finetuning phi-1-base on the CodeExercises dataset. For finetuning, we use the same setup as pretraining, but different hyperparameters: we use effective batchsize of 256, maximum learning rate 1e-4 with 50 steps of warmup, and weight decay 0.01. We train for total of 6,000 steps and pick the best checkpoint (saved every 1000 steps).
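To consolidate the numbers above, here is a hedged PyTorch sketch of the stated hyperparameters and the linear-warmup / linear-decay schedule; the model object is a toy stand-in, and the exact end point of the decay is an assumption.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# architecture figures quoted above (phi-1, 1.3B)
PHI1_ARCH = dict(n_layers=24, d_model=2048, d_mlp=8192,
                 n_heads=32, head_dim=64, rotary_dim=32)

# training hyperparameters quoted above
PRETRAIN = dict(batch_size=1024, max_lr=1e-3, warmup=750, steps=36_000, weight_decay=0.1)
FINETUNE = dict(batch_size=256, max_lr=1e-4, warmup=50, steps=6_000, weight_decay=0.01)

def make_optimizer(model: torch.nn.Module, cfg: dict):
    """AdamW with a linear-warmup / linear-decay schedule (decay end point assumed)."""
    opt = AdamW(model.parameters(), lr=cfg["max_lr"], weight_decay=cfg["weight_decay"])
    def lr_lambda(step: int) -> float:
        if step < cfg["warmup"]:                   # linear warmup to max_lr
            return step / max(1, cfg["warmup"])
        remaining = cfg["steps"] - cfg["warmup"]   # linear decay to 0 at the final step
        return max(0.0, (cfg["steps"] - step) / max(1, remaining))
    return opt, LambdaLR(opt, lr_lambda)

# usage with a toy stand-in model
opt, sched = make_optimizer(torch.nn.Linear(8, 8), PRETRAIN)
```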

3. Spikes of model capability after finetuning on CodeExercises

  • After finetuning, the model also improves on tasks that were not in the finetuning dataset
  • In other words, the finetuning process helps the model reorganize and consolidate the knowledge it acquired during pretraining!

  • our finetuning not only improves the tasks we targeted, but also makes unrelated tasks easier to distill from pretraining
