MS Phi-2 small language model – outperforms many LLMs but fits on your laptop

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with fewer than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploring mechanistic interpretability, safety improvements, or fine-tuning experiments on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.

[..]

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding. Training took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 and is due to our tailored data curation technique; see our previous tech report for more details. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
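To put the training figures quoted above into perspective, here is a back-of-the-envelope sketch in Python. It uses only the numbers from the announcement (14 days, 96 GPUs, 1.4T tokens, 2.7B parameters) and assumes, purely for illustration, that all 96 GPUs ran continuously for the full 14 days; the announcement does not state utilisation, so the throughput figure is a rough lower bound rather than a measured value.

```python
# Figures quoted in the announcement.
DAYS = 14
GPUS = 96
TOKENS_SEEN = 1.4e12   # includes repeated passes over the data
PARAMETERS = 2.7e9

# Assumption: every GPU was busy 24 hours a day for the whole run.
gpu_hours = DAYS * 24 * GPUS

# Training tokens processed per model parameter.
tokens_per_parameter = TOKENS_SEEN / PARAMETERS

# Average tokens processed per GPU per second under the assumption above.
tokens_per_gpu_second = TOKENS_SEEN / (gpu_hours * 3600)

print(f"GPU-hours:             {gpu_hours:,}")
print(f"Tokens per parameter:  {tokens_per_parameter:,.0f}")
print(f"Tokens per GPU-second: {tokens_per_gpu_second:,.0f}")
```

Under these assumptions the run comes to roughly 32 thousand A100 GPU-hours and about 500 training tokens per parameter.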

Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6541 sentences is selected and scored from 0 to 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.
[…]

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it outperforms the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.
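The "25x larger" figure and the "fits on your laptop" claim in the title can both be checked with simple arithmetic. The sketch below is an estimate only: the bytes-per-parameter values for each numeric format are general assumptions (the announcement does not say which precision is used), and the result covers the weights alone, not activations or any other runtime memory.

```python
PHI_2_PARAMS = 2.7e9
LLAMA_2_70B_PARAMS = 70e9

# Assumed storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GB."""
    return params * bytes_per_param / 1e9


size_ratio = LLAMA_2_70B_PARAMS / PHI_2_PARAMS
print(f"Llama-2 70B is about {size_ratio:.1f}x the size of Phi-2")

for fmt, nbytes in BYTES_PER_PARAM.items():
    phi = weight_memory_gb(PHI_2_PARAMS, nbytes)
    llama = weight_memory_gb(LLAMA_2_70B_PARAMS, nbytes)
    print(f"{fmt}: Phi-2 ~{phi:.2f} GB, Llama-2 70B ~{llama:.0f} GB")
```

At 16-bit precision the Phi-2 weights come to about 5.4 GB, which is within reach of an ordinary laptop, while a 70B model at the same precision needs about 140 GB.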
[…]

Model	Size	BBH	Commonsense Reasoning	Language Understanding	Math	Coding
Llama-2	7B	40.0	62.2	56.7	16.5	21.0
Llama-2	13B	47.8	65.0	61.9	34.2	25.4
Llama-2	70B	66.5	69.2	67.6	64.1	38.3
Mistral	7B	57.2	66.4	63.7	46.4	39.4
Phi-2	2.7B	59.2	68.8	62.0	61.1	53.7

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
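One way to read Table 1 is to list, for each competing model, the benchmark groups in which Phi-2 scores higher. The sketch below does this in Python using the scores exactly as they appear in the table; the helper name `phi2_leads` is made up for this illustration.

```python
CATEGORIES = ["BBH", "Commonsense Reasoning", "Language Understanding", "Math", "Coding"]

# Scores copied from Table 1, in the order given by CATEGORIES.
TABLE_1 = {
    "Llama-2 7B": [40.0, 62.2, 56.7, 16.5, 21.0],
    "Llama-2 13B": [47.8, 65.0, 61.9, 34.2, 25.4],
    "Llama-2 70B": [66.5, 69.2, 67.6, 64.1, 38.3],
    "Mistral 7B": [57.2, 66.4, 63.7, 46.4, 39.4],
    "Phi-2 2.7B": [59.2, 68.8, 62.0, 61.1, 53.7],
}


def phi2_leads(table, baseline="Phi-2 2.7B"):
    """Map each other model to the categories where the baseline scores higher."""
    ours = table[baseline]
    return {
        name: [cat for cat, a, b in zip(CATEGORIES, ours, scores) if a > b]
        for name, scores in table.items()
        if name != baseline
    }


for model, wins in phi2_leads(TABLE_1).items():
    print(f"vs {model}: Phi-2 leads in {len(wins)}/{len(CATEGORIES)}: {', '.join(wins)}")
```

On these numbers Phi-2 leads both smaller Llama-2 models in all five groups and Mistral 7B in four of five, while against Llama-2 70B it leads in Coding only.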

Model	Size	BBH	BoolQ	MBPP	MMLU
Gemini Nano 2	3.2B	42.4	79.3	27.2	55.8
Phi-2	2.7B	59.3	83.3	59.1	56.7

Table 2. Comparison between Phi-2 and the Gemini Nano 2 model on Gemini’s reported benchmarks.

Source: Phi-2: The surprising power of small language models – Microsoft Research

Robin Edgar

Organisational Structures | Technology and Science | Military, IT and Lifestyle consultancy | Social, Broadcast & Cross Media | Flying aircraft

robin@edgarbv.com https://www.edgarbv.com