LLaMA is a large language model introduced by Meta to push the boundaries of what smaller language models can do. It is based on the standard transformer architecture and incorporates several recent training advances: pre-normalization (as in GPT-3), the SwiGLU activation function (used in PaLM), and rotary positional embeddings (as in GPTNeo). The model comes in four sizes: 7B, 13B, 33B, and 65B parameters.
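To make those architectural pieces concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward layer inside a pre-normalized transformer block. The module names and sizes are illustrative rather than Meta's actual code: rotary embeddings are omitted for brevity, and plain LayerNorm stands in for the RMSNorm that LLaMA uses for pre-normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward layer: SiLU(x W) * (x V), projected back to the model dimension."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.value = nn.Linear(dim, hidden_dim, bias=False)
        self.out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.gate(x)) * self.value(x))

class PreNormBlock(nn.Module):
    """Pre-normalization: each sub-layer normalizes its *input* instead of its output."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)  # LLaMA uses RMSNorm here
        self.ffn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Hidden size is illustrative; the paper scales it as 2/3 of 4*dim for SwiGLU.
        self.ffn = SwiGLU(dim, hidden_dim=4 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))
```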
All sizes perform extremely well compared to the current state of the art while having far fewer parameters. For example, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being more than 10× smaller, while LLaMA-65B is competitive with some of the best-performing models, such as Chinchilla-70B and PaLM-540B.
Zero-shot performance on Common Sense Reasoning tasks. Higher scores are better.
Why bigger isn't always better
Achieving state-of-the-art results with an order of magnitude fewer parameters is a huge accomplishment, and it benefits both research and industry use cases. By reducing the computational resources required for training and inference, smaller models become accessible to researchers and practitioners with limited resources. That is what it takes for language models to become part of our daily workflows. Do you want ChatGPT integrated into your home assistant? This is what we need to make that happen.
Moreover, smaller models are less prone to overfitting and generalize better to new data, making them more dependable and robust in real-world settings. They are also more energy efficient, which reduces the environmental impact of training and deploying them.
Larger models still outperform smaller ones, as the stronger results of the biggest LLaMA size (65B) in the first table show. However, practicality is a key consideration: smaller models are often more useful for retraining on recent data or fine-tuning for specific tasks. These adjustments can yield greater improvements than simply increasing model size, and smaller models are far easier to work with. In fact, training a 7B LLaMA costs only about a tenth of the resources needed for a 65B one, as shown in the table.
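To see where that "roughly a tenth" figure comes from, a common rule of thumb estimates training compute as roughly 6 x parameters x training tokens. Plugging in the token counts reported in the paper (about 1.0T tokens for the 7B model and 1.4T for the 65B) gives a back-of-the-envelope sketch, not Meta's own accounting:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the common C ~ 6 * N * D rule of thumb."""
    return 6 * params * tokens

# Parameter and training-token counts as reported in the LLaMA paper.
flops_7b = train_flops(7e9, 1.0e12)    # ~4.2e22 FLOPs
flops_65b = train_flops(65e9, 1.4e12)  # ~5.5e23 FLOPs

print(f"7B / 65B compute ratio: {flops_7b / flops_65b:.2f}")  # ~0.08, i.e. roughly a tenth
```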
Bias evaluation
The potential for biases in AI language models is a serious concern, and it is why companies are cautious about adopting them in production systems. Take Google, for instance. Despite impressive academic work on large language models, it was slow to productionize AI models until OpenAI's ChatGPT came along. Google clearly had the AI innovations, infrastructure, talent, and distribution to release Bard earlier, but likely because of these bias risks, the gamble wasn't worth it until a competitor like ChatGPT appeared.
The Meta AI team put LLaMA to the test to see whether it exhibits biases related to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, or socio-economic status. They also measured how toxic the model's responses are using Perspective API, an open API for measuring toxicity.
To measure bias, the researchers took pairs of stereotypical and anti-stereotypical sentences on each topic and measured which one the model prefers using perplexity in a zero-shot setting. Higher scores indicate stronger bias. LLaMA had the lowest average score of the compared models, 66.6 across all categories, but the results varied by category: it scored lowest (57) on race/color, which is excellent, yet highest (81) on sexual orientation, which is not so good.
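The sketch below shows how such a perplexity-based preference test can be implemented with Hugging Face Transformers. GPT-2 and the sentence pair are stand-ins so the example runs anywhere; the paper scores LLaMA itself on the CrowS-Pairs benchmark and aggregates the preference over many pairs per category, so a score of 50 would mean no systematic preference either way.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 is only a stand-in so the sketch runs anywhere; the paper evaluates LLaMA itself.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Perplexity = exp(average negative log-likelihood the model assigns to the tokens)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

# Illustrative stereotype / anti-stereotype pair (not taken from the benchmark).
stereotype = "Women are bad at math."
anti_stereotype = "Men are bad at math."

# The model "prefers" whichever sentence it finds less surprising (lower perplexity).
print("prefers stereotype:", perplexity(stereotype) < perplexity(anti_stereotype))
```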
Dataset
In contrast to other big language models that rely on private data to expand their datasets, LLaMA is trained only on publicly available data that is compatible with open-sourcing.
Datasets used for training.
The datasets include data in 20 different languages, but since the majority of the training data is English, the model is expected to perform better in English than in other languages. The FAIR team also found that the model's performance may vary across dialects.
Open-source access
The model is available on request via the following form. If you are granted access, you will need to clone the provided facebookresearch/llama repository and follow its instructions.
Conclusion
The LLaMA model represents a significant breakthrough for natural language processing, with exciting implications for both research and industry. By reducing model complexity and footprint, we can make this technology accessible to a wider range of industries. Much recent research, however, has focused on increasing performance by growing model size, as with the 540B PaLM model.
Thankfully, the 13B LLaMA model has shown that smaller models can outperform much larger counterparts like GPT-3, effectively flipping the script on the size-to-performance ratio. Not only that, LLaMA also shows lower measured bias than other language models. This demonstrates that it is possible to achieve impressive results with smaller models while reducing the risk of perpetuating harmful biases, paving the way for natural language processing applications that are more accessible, practical, inclusive, and trustworthy.