[Paper Review] Phi Series Comparison (TinyStories ~ Phi-3) and Review

Update:

| model | Phi-1 | Phi-1.5 | Phi-2 | Phi-3 Mini | Phi-3 Medium |
|---|---|---|---|---|---|
| released | 2023.06 | 2023.09 | 2023.12 | 2024.05 | 2024.05 |
| pretraining data | 54B tokens (7B unique) | 150B tokens (30B unique) | 1.4T tokens (250B unique) | 3.3T tokens | 3.3T tokens |
| context length | 2k (2048) | 2k (2048) | 2k (2048) | 128K | 128K |
| model size | 1.3B | 1.3B | 2.7B | 3.8B | 14B |

[1] TinyStories: speaking fluent English

  • Earlier small language models (SLMs, ~125M parameters) such as GPT-Neo and GPT-2 were trained on large datasets but still performed poorly (barely fluent text, weak reasoning)
  • TinyStories: use GPT-3.5/4 to generate a high-quality dataset of short stories that even a 3~4-year-old could understand → train SLMs on it
    • Despite the small dataset and small model size, the resulting models produce coherent text and show strong reasoning and instruction-following
  • the effect of high quality data extends well past this: improving data quality can dramatically change the shape of the scaling laws, potentially allowing to match the performance of large-scale models with much leaner training/models.
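
As a rough illustration of the TinyStories recipe, here is a minimal data-generation sketch. It assumes the openai Python client (v1+) and an illustrative prompt; the exact prompts, seed-word lists, and filtering used in the paper are not reproduced here.

```python
# Minimal sketch of TinyStories-style synthetic data generation (illustrative, not the paper's exact prompts).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a short story (3-5 paragraphs) that a 3-4 year old child could understand. "
    "Use only very simple words. The story must use the verb '{verb}', "
    "the noun '{noun}', and the adjective '{adj}'."
)

def generate_story(verb: str, noun: str, adj: str) -> str:
    """Ask GPT-3.5/4 for one simple story, constrained by three seed words for diversity."""
    resp = client.chat.completions.create(
        model="gpt-4",  # the paper generated data with GPT-3.5 and GPT-4
        messages=[{"role": "user", "content": PROMPT.format(verb=verb, noun=noun, adj=adj)}],
        temperature=1.0,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(generate_story("jump", "puppy", "happy"))
```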

[2] Phi-1 (1.3B): Python coding

See the following post for a detailed review.

Summary

  • Transformer-based model, trained for 4 days on 8 A100s
  • [1] pretrain on textbook quality data (7B tokens, 8 epochs = total 50B tokens)
  • [2] finetune on textbook-exercise-like data (200M tokens); a minimal finetuning sketch follows this list
    • our finetuning not only improves the tasks we targeted, but also makes unrelated tasks easier to distill from pretraining

  • Despite the small model and the small amount of training data, its HumanEval and MBPP scores are very high
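
To make stage [2] concrete, below is a minimal finetuning sketch with the Hugging Face Trainer. It assumes the released microsoft/phi-1 checkpoint and a hypothetical code_exercises.jsonl file with a "text" field; the hyperparameters are placeholders, not the authors' actual pipeline.

```python
# Hedged sketch of phi-1 stage [2]: finetune a pretrained checkpoint on a small
# "textbook-exercise-like" corpus. Dataset path and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")  # codegen-mono tokenizer
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torch.bfloat16)

# Hypothetical stand-in for the ~200M-token exercise dataset.
exercises = load_dataset("json", data_files="code_exercises.jsonl", split="train")
exercises = exercises.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=exercises.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi1-finetuned", per_device_train_batch_size=4,
                           learning_rate=1e-4, num_train_epochs=1, bf16=True),
    train_dataset=exercises,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```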

[3] Phi-1.5 (1.3B): common sense reasoning

  • Both phi-1.5 and phi-1.5-web are base models pre-trained on large natural language corpora. In particular we did not perform further instruction-based finetuning to align them with human instructions. Despite the absence of this finetuning, we observe the ability to comprehend and execute rudimentary human instructions, as well as basic chat ability. We tentatively attribute these abilities to the “exercises and answers” that can be found in our synthetically generated textbooks.

3.1 Technical specifications

  • Architecture
    • The architecture for phi-1.5 (and its variants) is exactly the same as our previous model phi-1 in [GZA+ 23]. It is a Transformer [VSP+ 17] with 24 layers, 32 heads, and each head has dimension 64. We use rotary embedding with rotary dimension 32, and context length 2048. We also use flash-attention [DFE+ 22, Dao23] for training speed up, and we use the tokenizer of codegen-mono [NPH+ 22].
  • Training data
    • What the phi authors keep emphasizing: data quality over quantity!
    • They reuse (1) phi-1's training data (7B tokens) as-is and add (2) newly generated “textbook-like” data (~20B tokens)
      • for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.)
      • (1) phi-1's training data (7B tokens)
        • A filtered code-language dataset (~6B): not synthetic; only high-quality examples filtered from existing code data
        • A synthetic textbook dataset (~1B): generated with GPT-3.5
      • (2) Newly generated “textbook-like” data (~20B tokens)
        • For diversity, 20k topics were sampled from web data and used as seeds for generation
  • Training details
    • We train phi-1.5 starting from random initialization with constant learning rate 2e-4 (no warm up), weight decay 0.1. We use Adam optimizer with momentum 0.9, 0.98, and epsilon 1e-7. We use fp16 with DeepSpeed ZeRO Stage 2 [RRRH20]. We use batch size 2048, and train for 150B tokens, with 80% from the newly created synthetic data and 20% from phi-1's training data. (A config/optimizer sketch of these settings follows this list.)
  • Filtered web data
    • To compare against web-filtered data (rather than synthetic data), two additional models are trained
    • filtered web data (95B tokens = 88 + 7)
      • 88B tokens of NLP data filtered from the Falcon RefinedWeb dataset
      • 7B tokens of code data filtered from The Stack and StackOverflow (no synthetic data)
    • [a] phi-1.5-web-only model
      • trained only on the filtered web data (95B tokens = 88 + 7)
    • [b] phi-1.5-web model
      • trained on a mix of filtered web data (40%), phi-1's code data (20%), and newly generated synthetic NLP data (40%)
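
The quoted architecture and optimizer settings map fairly directly onto the PhiConfig class in transformers. Below is a hedged sketch with the values taken from the spec above and everything else left at library defaults; it is not the authors' training code, and the AdamW call only approximates the quoted "Adam with weight decay".

```python
# Hedged sketch: expressing the quoted phi-1.5 spec with transformers' PhiConfig.
import torch
from transformers import PhiConfig, PhiForCausalLM

config = PhiConfig(
    vocab_size=51200,              # padded vocab of the codegen-mono tokenizer (as in the released checkpoint)
    hidden_size=2048,              # 32 heads x 64 dims per head
    num_hidden_layers=24,
    num_attention_heads=32,
    max_position_embeddings=2048,  # context length 2k
    partial_rotary_factor=0.5,     # rotary dimension 32 out of head dimension 64
)
model = PhiForCausalLM(config)     # random initialization, as described in the paper

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4, betas=(0.9, 0.98), eps=1e-7, weight_decay=0.1,  # quoted training details
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```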

3.2 Benchmark results

(1) Common Sense Reasoning Benchmarks

(2) Language understanding and Knowledge Benchmarks

  • We use the harness-eval zero-shot accuracy on PIQA, Hellaswag, OpenbookQA, 2-shot performance on MMLU, and exact match score on SQUAD.
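
Below is a hedged sketch of this kind of zero-shot evaluation on the released phi-1.5 checkpoint, assuming a recent version of EleutherAI's lm-evaluation-harness that exposes simple_evaluate; the task selection and batch size are illustrative.

```python
# Hedged sketch: zero-shot evaluation of microsoft/phi-1_5 with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-1_5,dtype=bfloat16",
    tasks=["piqa", "hellaswag", "openbookqa"],  # zero-shot, as in the benchmark table
    num_fewshot=0,
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```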

(3) Multi-Step Reasoning Benchmarks

  • we evaluate reasoning abilities, through mathematics and coding. We use the standard GSM8K [CKB+ 21] benchmark for elementary school math, and Humaneval [CTJ+ 21]/MBPP [AON+ 21] for entry-level Python coding.
  • phi-1.5-web outperforms phi-1.5: web data does help training
  • phi-1.5's coding ability is roughly on par with phi-1 (which was trained only for coding); a HumanEval scoring sketch follows this list
    • → This highlights another potential advantage of using high-quality, textbook-like data for training: the model seems to store and access the knowledge more efficiently compared to training with web data
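
For the coding benchmarks, here is a hedged sketch of scoring a phi-style checkpoint on HumanEval with OpenAI's human-eval package; greedy decoding and the 256-token budget are illustrative choices, not the papers' exact setup.

```python
# Hedged sketch: generate HumanEval completions with a phi checkpoint, then score them
# with the human-eval CLI. Generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

tok = AutoTokenizer.from_pretrained("microsoft/phi-1")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1",
                                             torch_dtype=torch.bfloat16, device_map="auto")

samples = []
for task_id, problem in read_problems().items():
    inputs = tok(problem["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Then run `evaluate_functional_correctness samples.jsonl` (human-eval CLI) to get pass@1.
```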

[4] Phi-2 (2.7B)

  • Phi-2 is a Transformer with 2.7B parameters.
  • data: it was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value).
  • SoTA among models under 13B parameters
  • Like phi-1.5, no finetuning and no RLHF
  • Because it was trained on textbook-like data, it can handle basic exercises and answers even without fine-tuning or instruction-based alignment (same as Phi-1.5); accordingly, the Phi-2 model is best suited for prompts using the QA format, the chat format, and the code format (a QA-format example follows this list).
  • Model
    • Architecture: a Transformer-based model with next-word prediction objective
    • Context length: 2048 tokens
  • Data
    • Dataset size: 250B tokens, combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4.
    • Training tokens: 1.4T tokens
      • (For comparison, phi-1.5-web was 1.3B params, ~100B-token dataset, 300B training tokens..!)
    • GPUs: 96xA100-80G
    • Training time: 14 days
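
A minimal inference sketch in the QA ("Instruct: ... / Output:") style recommended above, using the released microsoft/phi-2 checkpoint; the prompt wording and generation settings are illustrative.

```python
# Hedged sketch: prompting microsoft/phi-2 in the QA ("Instruct/Output") format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2",
                                             torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Instruct: Explain why the sky is blue in two sentences.\nOutput:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```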

[5] Phi-3

  • Models
    • Architecture
      • Phi-3 Mini-128K-Instruct has 3.8B parameters and is a dense decoder-only Transformer model.
      • The model is fine-tuned with Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to ensure alignment with human preferences and safety guidelines.
    • Inputs: Text. It is best suited for prompts using chat format (a chat-format inference sketch follows this list).
    • Context length: 128K tokens
      • default context length: 4K
      • phi-3-mini-128K is the variant whose context length is extended to 128K with LongRoPE
    • GPUs: 512 H100-80G
    • Training time: 7 days
    • Training data: 3.3T tokens
    • Outputs: Generated text in response to the input
    • Dates: Our models were trained between February and April 2024
    • Status: This is a static model trained on an offline dataset with cutoff date October 2023. Future versions of the tuned models may be released as we improve models.
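
A minimal chat-format inference sketch with the released Phi-3 Mini 128K instruct checkpoint, using the tokenizer's built-in chat template; the message content and generation settings are illustrative.

```python
# Hedged sketch: chat-format inference with microsoft/Phi-3-mini-128k-instruct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto", trust_remote_code=True)

messages = [{"role": "user",
             "content": "Summarize the idea behind 'textbook-quality' training data."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=200, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```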

Datasets (3.3T)

  1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code;
  2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.);
  3. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness.

Benchmarks

  • phi-3-mini: performance on par with Mixtral 8x7B and GPT-3.5 (69% on MMLU, 8.38 on MT-bench).

Technical Specifications

  • Tokenizers
    • For the benefit of the open-source community, it reportedly uses a block structure similar to Llama-2 and the same tokenizer.
    • BPE SentencePiece tokenizer (total vocab: 32K; a loading sketch that prints this vocab size follows this list)
    • As I recall, Phi-2 used a byte-level BPE tokenizer with a vocab size around 50k, so the tokenizer change in phi-3 makes it convenient for experiments. But recently, switching to the llama3 128k-vocab tokenizer..
  • phi-3-small model (7B)
    • tiktoken tokenizer 사용
      • better for multilingual text; vocab size ~100k (100352)
      • the standard decoder architecture of a 7B model class, having 32 layers and a hidden size of 4096.
      • Uses GQA like Llama 2/3 (with 4 queries sharing 1 key)
  • phi3-mini (3.8B)
    • can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second.
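
One way to approximate that footprint on a GPU is 4-bit (NF4) loading via bitsandbytes; the report's iPhone deployment used its own on-device stack, not this code. The sketch below also prints the ~32K vocab discussed in the tokenizer notes above.

```python
# Hedged sketch: load phi-3-mini with 4-bit NF4 weights and check memory use / vocab size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained(model_id)
print("vocab size:", len(tok))  # ~32K Llama-2 style SentencePiece vocab

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto", trust_remote_code=True)
print("approx. memory footprint (GB):", model.get_memory_footprint() / 1e9)
```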
