Inside Phi 2: Microsoft’s small language model

2023 was very much the year of the large language model. OpenAI’s GPT models, Meta’s Llama, Google’s PaLM, and Anthropic’s Claude 2 are all large language models, or LLMs, with many billions of parameters, trained on content from the internet, and used to generate text and code.

But they’re not the only technologies being developed. Case in point: Microsoft Research has been exploring an alternative way of building generative AI models that delivers consistent results with a much smaller number of parameters. Enter the small language model, or SLM.

Why small language models?

A small language model is far easier to make portable. We can’t always be connected to the cloud. At the same time, we might not want to train a model on public data. It takes months to train a GPT-class LLM, using a supercomputer. By building a language model on a smaller set of private or domain-specific data (for example, a bank’s internal codebase), we could deliver a model that is both smaller and more specialized (such as a code generator that benefits from years of internal knowledge and coding standards of the bank’s development teams).

There’s a lot of work being put into SLMs at the moment, with surprisingly good results. One of the more interesting families of models is Microsoft Research’s Phi series, which recently switched from a research-only license to a more permissive MIT license.

Microsoft Research has used an approach it calls “textbooks are all you need” to train its Phi series of SLMs. The idea is to strategically train the model using authoritative sources, in order to deliver responses in a clear and concise fashion. For the latest release, Phi 2, Microsoft’s training data mixed synthetic content and web-crawled information.

Synthetic data is used to give the model foundational knowledge to support basic reasoning as well as a grounding in general knowledge, so outputs aren’t limited to textbook-grade data and can respond to a user’s context more effectively. The results speak for themselves. Phi 2 has benchmarked as well as, and sometimes better than, models that are larger and considerably more complex.

Training an SLM with curated data

Microsoft Research notes that the quality of the training data used is key to delivering good results and exhibiting the type of behavior seen in much larger models. Instead of training the model on a large corpus of web data, which is inherently random, the team building the Phi models curates its training data, focusing on content quality. The team has also used existing knowledge from earlier Phi models to kickstart Phi 2, speeding up training.

Unlike larger-scale transformers, the Phi models receive no reinforcement learning from human feedback (RLHF). The curation of the training data makes this reinforcement learning unnecessary. It also makes the model less likely to deliver toxic or biased outputs. However, garbage in, garbage out applies: it would be possible to train a version of Phi that was deliberately biased by choosing a biased set of training data. As a result, you should test any SLM in advance of use, to ensure that it will behave as expected.

The synthetic data used as part of Phi’s training set was itself generated by AI, so it needed to be vetted carefully to ensure it didn’t include inaccuracies. The first version of Phi was designed to work as a code generator, and was trained on existing codebases with permissive licenses; these were then selected further to filter out code that wasn’t suitable for teaching purposes. Phi may not have all the power of OpenAI’s Codex, but it can deliver useful tips and ideas for working with code—especially when paired with a code-focused search index.

Textbooks Are All You Need

It’s worth reading the original Textbooks Are All You Need paper and its follow-up, as they go into detail regarding how the model team developed their synthetic training data sets, using GPT 3.5 to build both sample code and textbooks. One interesting takeaway was how they were able to keep generated documents from being too similar, by adding randomness into the prompts used to create content. Once a base model had been generated, the team fine-tuned it with more detailed data, for example producing different tunings for different tasks.
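The prompt-randomization trick described in the paper is easy to sketch. In the snippet below, the topic and style lists and the diversified_prompt helper are invented for illustration, not taken from the paper; the idea is simply that injecting random variation into generation prompts keeps synthetic documents from converging on near-duplicates.

```python
import random

# Illustrative pools of variation; a real pipeline would draw from much
# larger, curated lists of topics, audiences, and styles.
topics = ["linked lists", "recursion", "file I/O", "error handling"]
styles = ["with a worked example", "as a short quiz", "for a beginner"]

def diversified_prompt(base: str, rng: random.Random) -> str:
    """Append randomly chosen topic and style cues to a base generation prompt."""
    return f"{base} Focus on {rng.choice(topics)}, written {rng.choice(styles)}."

rng = random.Random(0)  # seeded for reproducibility
prompts = {diversified_prompt("Write a short textbook section on Python.", rng)
           for _ in range(8)}
```

Each prompt sent to the generating model (GPT 3.5, in the paper’s case) then steers it toward a different slice of the space of possible documents.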

Even though Phi 2 has significantly fewer parameters than, say, GPT 3.5, it still needs a dedicated training environment. The SLM used a 1.4 trillion token data set, with 2.7 billion parameters, and took 14 days to train. While it needed 96 Nvidia A100 GPUs, training took a lot less time and a lot fewer resources than go into training an LLM like GPT. Training an SLM is conceivably within the reach of most organizations, especially if you’re using pay-as-you-go capacity in a public cloud.

It’s possible to imagine alternative formulations of Phi built on different synthetic data sets, for example a library of synthetic contracts or other common document types. Once trained, fine-tuning with actual documents in the target domain reduces the risk of error and helps deliver a grounded model.

Building or tuning your own variant isn’t necessary, of course. For basic chat functionality you can use Phi 2 as is, or more likely, use it as part of a RAG (retrieval-augmented generation)-based application, working with LangChain or a similar approach. As Phi is part of Azure AI Studio (and soon Windows AI Studio), it can be used both in the cloud and on premises.
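A minimal sketch of that RAG pattern follows. The keyword retriever and the generate() stub are stand-ins invented for this example; in practice, generate() would call Phi 2 through Hugging Face Transformers or a LangChain pipeline, and retrieval would use a proper search or vector index.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retriever; a real app would use a vector index."""
    q_terms = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    # Stand-in for the SLM call, e.g. (hypothetical usage):
    #   pipe = transformers.pipeline("text-generation", model="microsoft/phi-2")
    #   return pipe(prompt)[0]["generated_text"]
    return f"[model answer grounded in {prompt.count('Context:')} context block(s)]"

def rag_answer(query: str, docs: list[str]) -> str:
    """Prepend retrieved passages to the prompt so the SLM answers from them."""
    context = "\n".join(f"Context: {d}" for d in retrieve(query, docs))
    return generate(f"{context}\nQuestion: {query}\nAnswer:")
```

The key point is that the SLM only has to phrase an answer from the retrieved context, not memorize the corpus, which suits a small model’s limited capacity.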

Using SLMs in your applications

A quantized build of Phi 2 weighs in at under 1.9GB, small enough to be delivered as part of a web application. (You’ll find a Rust/WebAssembly demo application in the Hugging Face repo.) It’s slow to make an initial response while loading, but once the SLM is cached, it’s reasonably responsive. And that’s without using a GPU or NPU: accelerators should allow an SLM to work well alongside traditional code.
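A back-of-the-envelope check shows why a quantized build lands in that size range. Assuming roughly 4-bit quantization (a common level; actual builds vary), 2.7 billion parameters come in well under 2GB before format overhead:

```python
# Rough size estimate for a quantized 2.7B-parameter model.
params = 2.7e9
bits_per_param = 4  # assumed quantization width; real builds vary
size_gb = params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB
print(f"{size_gb:.2f} GB")  # about 1.35 GB before embeddings/metadata overhead
```

The same model at 16-bit precision would be roughly four times that size, which is why quantization matters so much for in-browser delivery.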

It’s important to note that SLMs like Phi 2 do have their limitations, especially around the token length of prompts. You shouldn’t expect to use complex prompts. However, if you carefully sanitize inputs and apply hard limits to string length, you should find that an SLM will handle most queries, for example in a Q&A application.
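Those guards are straightforward to implement. The sketch below strips control characters and enforces a hard character budget before a prompt reaches the model; the MAX_PROMPT_CHARS value is illustrative, since the real limit depends on the model’s tokenizer and context window.

```python
import re

MAX_PROMPT_CHARS = 1500  # illustrative budget for a short-context SLM

def sanitize_prompt(text: str) -> str:
    """Strip control characters, collapse whitespace, and cap prompt length."""
    cleaned = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)  # drop control chars
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    return cleaned[:MAX_PROMPT_CHARS]  # hard cap before tokenization
```

Truncating by characters is a blunt instrument; a production app would count tokens with the model’s own tokenizer, but the principle of a hard upstream limit is the same.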

Having a lightweight local SLM fine-tuned on custom data or used as part of a local RAG application, where the SLM provides the natural language interface to a search, is an intriguing prospect. One key point is that the size and resource requirements of SLMs make them economically attractive for tasks that would be too costly to perform with LLMs.

Putting an SLM like Phi into common workflows, such as to quickly deliver readable and comprehensible summaries of key data, could prove quite useful. The result would be an intriguing alternative to aging UI paradigms, especially when working with unstructured data.

One interesting option takes us back to the early 1990s and research into the idea of “intelligent agents.” A team of SLMs like Phi, each powering an intelligent agent and providing an interface between us and a sea of unstructured data, could be one way of delivering the context-based, adaptive computing environment envisioned by early ubiquitous computing researchers.

Copyright © 2024 IDG Communications, Inc.

