LLaMA is a large language model introduced by Meta to push the boundaries of what smaller language models can do. It is based on the standard transformer architecture and incorporates several recent training advances: pre-normalization (as in GPT-3), the SwiGLU activation function (used in PaLM), and rotary positional embeddings (as in GPTNeo). The model comes in four sizes: 7B, 13B, 33B, and 65B parameters.
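To make those architectural pieces concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward layer inside a pre-normalized transformer block. The module names and sizes are illustrative rather than Meta's actual code: rotary embeddings are omitted for brevity, and plain LayerNorm stands in for the RMSNorm that LLaMA uses for pre-normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward layer: SiLU(x W) * (x V), projected back to the model dimension."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.value = nn.Linear(dim, hidden_dim, bias=False)
        self.out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.gate(x)) * self.value(x))

class PreNormBlock(nn.Module):
    """Pre-normalization: each sub-layer normalizes its *input* instead of its output."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)  # LLaMA uses RMSNorm here
        self.ffn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Hidden size is illustrative; the paper scales it as 2/3 of 4*dim for SwiGLU.
        self.ffn = SwiGLU(dim, hidden_dim=4 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))
```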
All sizes perform extremely well compared to the current state of the art while having far fewer parameters. For example, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being more than 10× smaller, while LLaMA-65B is competitive with some of the best-performing models, such as Chinchilla-70B and PaLM-540B.
Zero-shot performance on Common Sense Reasoning tasks. Higher scores are better.
Why bigger isn't always better
Achieving state-of-the-art results with an order of magnitude fewer parameters is a huge accomplishment, and it benefits both research and industry use cases. By reducing the computational resources required for training and inference, smaller models become accessible to researchers and practitioners with limited resources. That is what it takes for language models to become part of our daily workflows. Do you want ChatGPT integrated into your home assistant? This is what we need to make that happen.
Moreover, smaller models are less prone to overfitting and generalize better to new data, making them more dependable and robust in real-world settings. They are also more energy efficient, which reduces the environmental impact of training and deploying them.
Larger models still outperform smaller ones, as the stronger results of the biggest LLaMA size (65B) in the first table show. However, practicality is a key consideration: smaller models are often more useful for retraining on recent data or fine-tuning for specific tasks. These adjustments can yield greater improvements than simply increasing model size, and smaller models are far easier to work with. In fact, training a 7B LLaMA costs only about a tenth of the resources needed for a 65B one, as shown in the table.
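To see where that "roughly a tenth" figure comes from, a common rule of thumb estimates training compute as roughly 6 x parameters x training tokens. Plugging in the token counts reported in the paper (about 1.0T tokens for the 7B model and 1.4T for the 65B) gives a back-of-the-envelope sketch, not Meta's own accounting:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the common C ~ 6 * N * D rule of thumb."""
    return 6 * params * tokens

# Parameter and training-token counts as reported in the LLaMA paper.
flops_7b = train_flops(7e9, 1.0e12)    # ~4.2e22 FLOPs
flops_65b = train_flops(65e9, 1.4e12)  # ~5.5e23 FLOPs

print(f"7B / 65B compute ratio: {flops_7b / flops_65b:.2f}")  # ~0.08, i.e. roughly a tenth
```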
Bias evaluation
The potential for biases in AI language models is a serious concern, and it is why companies are cautious about adopting them in production systems. Take Google, for instance. Despite impressive academic work on large language models, it was slow to productionize AI models until OpenAI's ChatGPT came along. Google clearly had the AI innovations, infrastructure, talent, and distribution to release Bard earlier, but likely because of these bias risks, the gamble wasn't worth it until a competitor like ChatGPT appeared.
The Meta AI team put LLaMA to the test to see whether it exhibits biases related to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, or socio-economic status. They also measured how toxic the model's responses are using Perspective API, an open API for measuring toxicity.
To measure bias, the researchers took pairs of stereotypical and anti-stereotypical sentences on each topic and measured which one the model prefers using perplexity in a zero-shot setting. Higher scores indicate stronger bias. LLaMA had the lowest average score of the compared models, 66.6 across all categories, but the results varied by category: it scored lowest (57) on race/color, which is excellent, yet highest (81) on sexual orientation, which is not so good.
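The sketch below shows how such a perplexity-based preference test can be implemented with Hugging Face Transformers. GPT-2 and the sentence pair are stand-ins so the example runs anywhere; the paper scores LLaMA itself on the CrowS-Pairs benchmark and aggregates the preference over many pairs per category, so a score of 50 would mean no systematic preference either way.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 is only a stand-in so the sketch runs anywhere; the paper evaluates LLaMA itself.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Perplexity = exp(average negative log-likelihood the model assigns to the tokens)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

# Illustrative stereotype / anti-stereotype pair (not taken from the benchmark).
stereotype = "Women are bad at math."
anti_stereotype = "Men are bad at math."

# The model "prefers" whichever sentence it finds less surprising (lower perplexity).
print("prefers stereotype:", perplexity(stereotype) < perplexity(anti_stereotype))
```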
Dataset
In contrast to other big language models that rely on private data to expand their datasets, LLaMA is trained only on publicly available data that is compatible with open-sourcing.
Datasets used for training.
The datasets include data in 20 different languages, but since the majority of the training data is English, the model is expected to perform better in English than in other languages. The FAIR team also found that the model's performance may vary across dialects.
Open-source access
The model is available on request via the following form. If you are granted access, you will need to clone the provided facebookresearch/llama repository and follow its instructions.
Conclusion
The LLaMA model represents a significant breakthrough for natural language processing, with exciting implications for both research and industry. By reducing model complexity and footprint, we can make this technology accessible to a wider range of industries. Much recent research, however, has focused on increasing performance by growing model size, as with the 540B PaLM model.
Thankfully, the 13B LLaMA model has shown that smaller models can outperform much larger counterparts like GPT-3, effectively flipping the script on the size-to-performance ratio. Not only that, LLaMA also shows lower measured bias than other language models. This demonstrates that it is possible to achieve impressive results with smaller models while reducing the risk of perpetuating harmful biases, paving the way for natural language processing applications that are more accessible, practical, inclusive, and trustworthy.