
TinyLlama breaks the Chinchilla scaling law

Meta’s LLaMA and Llama 2 changed the game for LLMs. The prevailing assumption was that models could not get much smaller and still match the capabilities of larger ones. Now people are tweaking the small Llama and shrinking it further, to the point where others are asking how that is even possible. The latest effort, TinyLlama, is the most striking of all, setting out to break the rules of scaling.

Peiyuan Zhang, a research assistant at the Singapore University of Technology and Design, has begun training a 1.1 billion parameter model called TinyLlama. Built on the Llama 2 architecture, the ambitious part of the project is that Peiyuan aims to pre-train it on 3 trillion tokens. The plan is to finish within 90 days using just 16 A100-40G GPUs at a throughput of about 24k tokens per second per GPU. For comparison, the estimated cost of an equivalent training run on an AWS server would be around $40,000.
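A quick back-of-the-envelope check, sketched below in Python using only the figures quoted above, shows that the stated hardware and throughput do roughly add up to 3 trillion tokens in 90 days:

```python
# Back-of-the-envelope check: does 16 x A100-40G at ~24k tokens/s per GPU
# reach 3 trillion tokens within 90 days?
gpus = 16
tokens_per_sec_per_gpu = 24_000          # throughput quoted for the project
seconds_per_day = 24 * 60 * 60

tokens_per_day = gpus * tokens_per_sec_per_gpu * seconds_per_day
days_needed = 3e12 / tokens_per_day

print(f"tokens per day: {tokens_per_day:.2e}")         # ~3.3e10
print(f"days to reach 3T tokens: {days_needed:.0f}")   # ~90
```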

If it works, the model will set a new standard and serve applications with limited computational resources, since its 1.1 billion weights take up only about 550MB of RAM. But people are a bit skeptical about the project.
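The 550MB figure works out if the weights are stored at roughly 4 bits each; the article does not state the precision, so treat that quantization level as an assumption. The arithmetic is simple:

```python
# Rough memory footprint of 1.1 billion parameters at different precisions
# (decimal MB). The quoted ~550MB matches 4-bit weights.
params = 1.1e9
for bits in (16, 8, 4):
    megabytes = params * bits / 8 / 1e6
    print(f"{bits:>2}-bit weights: {megabytes:,.0f} MB")
# 16-bit: 2,200 MB | 8-bit: 1,100 MB | 4-bit: 550 MB
```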

Enter Chinchilla

The 3-trillion-token dataset is a mix of 70% SlimPajama and 30% StarCoderData. “What does pre-training a 1.1 billion parameter model for so long achieve?” asked one user on HackerNews. “Doesn’t it contradict the Chinchilla scaling laws?”

The Chinchilla scaling law essentially says that, to train a transformer-based language model compute-optimally, the number of parameters and the number of training tokens should be scaled up in roughly equal proportion.
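Under the widely cited rule of thumb from Hoffmann et al. (2022), a compute-optimal run uses roughly 20 tokens per parameter, which puts a 1.1 billion parameter model at only a few tens of billions of tokens, orders of magnitude less than TinyLlama's target. A minimal sketch of the arithmetic:

```python
# Chinchilla rule of thumb (Hoffmann et al., 2022): compute-optimal
# training uses roughly 20 tokens per model parameter.
params = 1.1e9
chinchilla_optimal_tokens = 20 * params     # ~2.2e10 tokens
planned_tokens = 3e12                       # TinyLlama's target

print(f"Chinchilla-optimal tokens for 1.1B params: {chinchilla_optimal_tokens:.1e}")
print(f"TinyLlama's planned tokens:                {planned_tokens:.1e}")
print(f"over-training factor: {planned_tokens / chinchilla_optimal_tokens:.0f}x")  # ~136x
```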

For larger models like GPT or PaLM, the saturation point can come much later, because they can keep absorbing data over longer training runs and thus pull ahead of smaller ones. According to OpenAI, “We expect that larger models will always perform better than smaller models.” The belief is that a model of fixed size has a fixed capacity.

In other words, because smaller models involve fewer multiplications, they run and train faster. But according to this theory, they eventually hit the limit of how much knowledge they can absorb, and the rate of improvement slows. For example, training a 7 billion parameter model on 2 trillion tokens may still beat training a 1 billion parameter model on 3 trillion tokens, as the sketch below suggests.
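To put rough numbers on that comparison, here is the parametric loss fit reported in the Chinchilla paper (Hoffmann et al., 2022) evaluated at both configurations. The constants are the paper's published fit, and extrapolating it this far beyond its data is only indicative, not a firm prediction:

```python
# Parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the published fit; using them this far outside the fitted
# range is a rough extrapolation.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

print(predicted_loss(1.1e9, 3e12))  # 1.1B params on 3T tokens -> ~2.17
print(predicted_loss(7e9, 2e12))    # 7B params on 2T tokens   -> ~2.02
```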

That is exactly the question facing TinyLlama. Does it make sense to pre-train a model on 3 trillion tokens if there is a saturation point? By the Chinchilla reckoning, 3 trillion tokens is far too many for a 1.1 billion parameter model. But that is the whole point of the experiment.

But Llama disagrees

The debate about whether bigger models are always better has been ongoing, and Meta’s Llama models have repeatedly tried to prove the claim wrong. According to the Llama 2 paper, “We observed that after pre-training on 2 trillion tokens, the models still did not show any signs of saturation.” That observation may have suggested to Peiyuan that training a model on 3 trillion tokens is still a reasonable idea.

This begs the question: if Meta believes the Chinchilla scaling law is becoming somewhat redundant, why doesn’t the company keep training Llama 2 past 2 trillion tokens and ship an updated model in a few weeks? The only reason may be that the expected gain would be too small for the company to get anything meaningful out of it.

Or maybe the next Llama will be even smaller and trained the same way on an even larger number of tokens. Meta is letting its open-source community test the limits, while it may well be doing the same thing behind closed doors.

There must be a limit to how much information we can pack into smaller models; this project aims to prove otherwise. While we wait and watch the training progress, it will be interesting to see whether TinyLlama really does defy the Chinchilla scaling law. In early benchmark scores, TinyLlama was competitive with StableLM-Alpha-3B and Pythia-1B.

If it succeeds, this will be a major feat on the way to AI models that run on single devices. If not, Chinchilla emerges the winner. According to Peiyuan, “I don’t know. This is an open trial that makes no promises or goals. The only target is ‘1.1B on 3T’.”

