Size Isn't Everything - How LLaMA democratizes access to Large-Language-Models

Recently, Meta announced the release of a new AI language model called LLaMA. While tech enthusiasts have focused primarily on language models developed by Microsoft, Google, and OpenAI, LLaMA is a research tool designed to help researchers advance their work in this subfield of AI. In this blog post, we explain how LLaMA is helping to democratize large language models.

LLaMA is a large language model introduced by Meta to push the boundaries of what smaller language models can do. It is based on the standard transformer architecture and includes some recent training advances such as pre-normalization (as seen in GPT-3), the SwiGLU activation function (used in PaLM), and rotary embeddings (applied in GPT-Neo). The model comes in four sizes: 7B, 13B, 33B, and 65B parameters.
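
To make the SwiGLU idea concrete, here is a minimal pure-Python sketch of a feed-forward block gated by a Swish activation. It is an illustration under stated assumptions, not LLaMA's actual implementation: the function names, the toy weight matrices, and the omission of bias terms are all choices made for this example.

```python
import math


def swish(x: float) -> float:
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block with a SwiGLU gate.

    Computes w_down @ (swish(w_gate @ x) * (w_up @ x)), where * is
    element-wise. All weight names are hypothetical.
    """
    gate = [swish(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy input: model dimension 2, hidden dimension 3 (made-up values).
x = [1.0, -2.0]
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_down = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]]

y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y)
```

The gate lets one projection of the input scale another, element by element, which is the part that distinguishes SwiGLU from a plain activation applied to a single projection.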

All sizes perform extremely well compared to the current state of the art while having fewer parameters. For example, LLaMA-13B outperformed GPT-3 (175B) on most benchmarks despite being more than 10× smaller. LLaMA-65B, in turn, is comparable to some of the best-performing models, such as Chinchilla-70B and PaLM-540B.

Model      Size  BoolQ  PIQA  SIQA  HellaSwag  WinoGrande  ARC-e  ARC-c  OBQA
GPT-3      175B  60.5   81.0  -     78.9       70.2        68.8   51.4   57.6
Gopher     280B  79.3   81.8  50.6  79.2       70.1        -      -      -
Chinchilla 70B   83.7   81.8  51.3  80.8       74.9        -      -      -
PaLM       62B   84.8   80.5  -     79.7       77.0        75.2   52.5   50.4
PaLM-cont  62B   83.9   81.4  -     80.6       77.0        -      -      -
PaLM       540B  88.0   82.3  -     83.4       81.1        76.6   53.0   53.4
LLaMA      7B    76.5   79.8  48.9  76.1       70.1        72.8   47.6   57.2
LLaMA      13B   78.1   80.1  50.4  79.2       73.0        74.8   52.7   56.4
LLaMA      33B   83.1   82.3  50.4  82.8       76.0        80.0   57.8   58.6
LLaMA      65B   85.3   82.8  52.3  84.2       77.0        78.9   56.0   60.2

Zero-shot performance on Common Sense Reasoning tasks. Higher scores are better.

Why is bigger not better?

Achieving state-of-the-art results with an order of magnitude fewer parameters is a huge accomplishment, and it benefits both research and industry use cases. Because they need fewer computational resources for training and inference, smaller models are more accessible to researchers and practitioners with limited resources. This means that language models can become part of our daily workflows with ease. Do you want ChatGPT integrated into your home assistant? This is what we need to make that happen.

Moreover, smaller models can be less prone to overfitting and better at generalizing to new data, which makes them more dependable and robust in real-world settings. They also consume less energy, which reduces the environmental impact of training and deploying them.

Larger models still outperform smaller ones, as the results for the largest LLaMA (65B) in the first table show. However, practicality is a key consideration, and smaller models are often more useful for retraining with recent data or fine-tuning for specific tasks. These adjustments can yield greater improvements than simply increasing model size, and smaller models are easier to work with than larger ones. In fact, training the 7B LLaMA took less than a tenth of the GPU-hours needed for the 65B model, as shown in the table.

Model       GPU-hours  Total power consumption
OPT-175B    809,472    356 MWh
BLOOM-175B  1,082,880  475 MWh
LLaMA-7B    82,432     36 MWh
LLaMA-13B   135,168    59 MWh
LLaMA-33B   530,432    233 MWh
LLaMA-65B   1,022,362  449 MWh
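
As a quick check of the "tenth of the resources" claim, the short sketch below recomputes the ratios from the GPU-hours and MWh figures in the table above. The dictionary and variable names are ours, and the average power draw per GPU is simply what the two table columns imply when divided, not a number quoted from the source.

```python
# GPU-hours and total energy (MWh) copied from the table above.
training_cost = {
    "LLaMA-7B": (82_432, 36),
    "LLaMA-13B": (135_168, 59),
    "LLaMA-33B": (530_432, 233),
    "LLaMA-65B": (1_022_362, 449),
}

baseline_hours = training_cost["LLaMA-65B"][0]

# Fraction of the 65B model's GPU-hours used by each size.
hours_ratio = {
    name: hours / baseline_hours for name, (hours, _) in training_cost.items()
}

# Average power per GPU implied by the table: energy / time, in watts.
implied_watts = {
    name: mwh * 1_000_000 / hours for name, (hours, mwh) in training_cost.items()
}

for name in training_cost:
    print(
        f"{name}: {hours_ratio[name]:.1%} of LLaMA-65B GPU-hours, "
        f"about {implied_watts[name]:.0f} W per GPU"
    )
```

The 7B model comes out at roughly 8% of the 65B model's GPU-hours, so "a tenth" is, if anything, slightly conservative.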

Bias evaluation

The potential for bias in AI language models is a serious concern, and it is one reason companies are cautious about adopting them in production systems. Take Google, for instance. Despite impressive academic work on large language models, it was slow to productionize them until OpenAI’s ChatGPT came along. Google clearly had the AI innovations, infrastructure, talent, and distribution to release Bard earlier, but, possibly because of bias issues, the risk wasn’t worth it until a competitor like ChatGPT appeared.

The Meta AI team put LLaMA to the test to see if it exhibited any biases related to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, or socioeconomic status. They also measured how toxic the model's responses were using the Perspective API, an open API for measuring toxicity.

To measure biases, the researchers used pairs of stereotypical and anti-stereotypical sentences from the CrowS-Pairs dataset and measured which sentence the model preferred, using the perplexity of each sentence in a zero-shot setting. Higher scores indicate greater bias. LLaMA had the lowest average score of the three models compared, 66.6, but its score varied by category. Its lowest score was 57.0 for race/color, which is also better than GPT-3 and OPT. Its highest was 81.0 for sexual orientation, where it scored worse than both.

Category              LLaMA  GPT3  OPT
Gender                70.6   62.6  65.7
Religion              79.0   73.3  68.6
Race/Color            57.0   64.7  68.6
Sexual orientation    81.0   76.2  78.6
Age                   70.1   64.4  67.8
Nationality           64.2   61.6  62.9
Disability            66.7   76.7  76.7
Physical appearance   77.8   74.6  76.2
Socioeconomic status  71.5   73.8  76.2
Average               66.6   67.2  69.5
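
The perplexity-based comparison can be sketched as follows. This is one plausible reading of the procedure, not the researchers' code. We assume the score is the percentage of sentence pairs for which the model assigns lower perplexity to the stereotypical sentence, so 50 would mean no preference either way. The per-token log-probabilities are made up for illustration; in practice they would come from the language model being evaluated.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sentence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bias_score(pairs):
    """Percentage of pairs where the stereotypical sentence is preferred.

    Each pair is (stereotypical_logprobs, anti_stereotypical_logprobs).
    'Preferred' is assumed to mean 'has lower perplexity'.
    """
    preferred = sum(
        1 for stereo, anti in pairs if perplexity(stereo) < perplexity(anti)
    )
    return 100.0 * preferred / len(pairs)


# Hypothetical log-probabilities for four sentence pairs.
pairs = [
    ([-1.0, -2.0, -1.5], [-1.2, -2.5, -1.9]),  # stereotypical preferred
    ([-2.0, -2.2], [-1.0, -1.1]),              # anti-stereotypical preferred
    ([-0.5, -0.7, -0.9], [-0.6, -1.0, -1.2]),  # stereotypical preferred
    ([-3.0, -1.0], [-2.0, -1.5]),              # anti-stereotypical preferred
]

score = bias_score(pairs)
print(f"bias score: {score:.1f}")
```

Under this reading, a score of 66.6 means the model favored the stereotypical sentence in about two out of three pairs.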

Dataset

In contrast to other large language models that use private data to expand their datasets, LLaMA is trained only on publicly available data, which makes it compatible with open-sourcing. The datasets used are listed below.

Dataset        Sampling prop.  Epochs  Disk size
CommonCrawl    67.00%          1.1     3.3 TB
C4             15.00%          1.06    783 GB
Github         4.50%           0.64    328 GB
Wikipedia      4.50%           2.45    83 GB
Books          4.50%           2.23    85 GB
ArXiv          2.50%           1.06    92 GB
StackExchange  2.00%           1.03    78 GB

The datasets include text in 20 different languages, but because most of the training data is English, the model is expected to perform better in English than in other languages. The FAIR team also found that the model's performance may vary across dialects.
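
To illustrate what a sampling proportion means in practice, the sketch below draws training sources at random with the weights from the dataset table above. It is a hypothetical illustration only: the function name, the seed, and the use of `random.choices` are our assumptions, and the actual data loader used to train LLaMA is not described here.

```python
import random
from collections import Counter

# Sampling proportions (percent) copied from the dataset table above.
sampling_pct = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "Github": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "StackExchange": 2.0,
}


def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to sampling_pct."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(sampling_pct)
    weights = list(sampling_pct.values())
    return Counter(rng.choices(names, weights=weights, k=n))


total_pct = sum(sampling_pct.values())
counts = sample_sources(100_000)

print(f"proportions sum to {total_pct}%")
for name, count in counts.most_common():
    print(f"{name}: {count / 100_000:.1%} of draws")
```

Sources with a high proportion but a small size on disk, such as Wikipedia and Books, are revisited more often, which is consistent with their epoch counts above 2 in the table.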

Open-source access

The model is available on request through Meta's access form. If you are granted access, clone the provided repository, facebookresearch/llama, and follow its instructions.

Conclusion

The LLaMA model represents a significant breakthrough for natural language processing, with exciting implications for both research and industry. By reducing model complexity and footprint, we can make this technology accessible to a wider range of industries. Much recent research, however, has pursued performance by growing model size, as with the 540B PaLM model.

Thankfully, the 13B LLaMA model has shown that smaller models can outperform much larger counterparts like GPT-3, effectively flipping the script on the size-to-performance ratio. LLaMA also has a slightly lower average bias score than GPT-3 and OPT, although it scores worse in several individual categories. Together, these results show that it's possible to achieve impressive performance with smaller models without increasing the risk of perpetuating harmful biases. This paves the way for more accessible and practical applications of natural language processing that are more inclusive and trustworthy.
