MS Phi-2 small language model – outperforms many LLMs but fits on your laptop

We are now releasing Phi-2 (opens in new tab), a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 (opens in new tab) available in the Azure AI Studio model catalog to foster research and development on language models.


Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of Synthetic and Web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 due to our tailored data curation technique, see our previous tech report (opens in new tab) for more details on this. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio (opens in new tab).

A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.
Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6541 sentences are selected and scored between 0 to 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance compared to 25x larger Llama-2-70B model on muti-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently-announced Google Gemini Nano 2, despite being smaller in size.

Model Size BBH Commonsense
Math Coding
Llama-2 7B 40.0 62.2 56.7 16.5 21.0
13B 47.8 65.0 61.9 34.2 25.4
70B 66.5 69.2 67.6 64.1 38.3
Mistral 7B 57.2 66.4 63.7 46.4 39.4
Phi-2 2.7B 59.2 68.8 62.0 61.1 53.7
Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
Model Size BBH BoolQ MBPP MMLU
Gemini Nano 2 3.2B 42.4 79.3 27.2 55.8
Phi-2 2.7B 59.3 83.3 59.1 56.7
Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.

Source: Phi-2: The surprising power of small language models – Microsoft Research

Robin Edgar

Organisational Structures | Technology and Science | Military, IT and Lifestyle consultancy | Social, Broadcast & Cross Media | Flying aircraft