What is an open source LLM?

An open source large language model (LLM) has publicly accessible code and architecture, allowing for free use, modification, and distribution. Because of the complexities in building and serving LLMs, deciding whether an LLM is truly open source can be challenging.

Generally, open source refers to complete access to a product’s design. In the case of open source software, it refers to the release of a computer program through a specific kind of license in which the source code is available for general public use or modification. Typically, this means software can be considered open source if:

  • It’s available in source code form without additional cost.
  • The source code can be repurposed into other new software.

When it comes to LLMs, open source values play a critical role in lowering barriers to understanding and contributing to technological innovation. 

Experts in the field disagree about what it takes for an LLM to be considered legitimately open source. This is because the traditional definition of open source code can’t be easily applied to artificial intelligence (AI) technologies.

Unlike conventional open source code, which consists mainly of programming instructions, LLMs are created using:

  • Lots (and lots) of training data. This training data may contain copyrighted works or private data, which creates a legal issue when it comes to sharing.
  • Numerical parameters known as weights. These parameters determine how the input data is processed into a meaningful output and are key in shaping the model’s understanding of language. Think of weights as the building blocks that create a model's “brain” and determine how it prioritizes topics as it processes information.
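
To make the idea of weights more concrete, here is a toy sketch in Python. It is not how any production LLM is implemented: the three-word vocabulary, the two input features, and every number are invented for illustration. It only shows the general principle that weights turn an input into scores, and scores into output probabilities.

```python
import math

# Toy vocabulary and weights. All values are made up for illustration.
VOCAB = ["cat", "dog", "car"]

# One row of weights per vocabulary word, one column per input feature.
WEIGHTS = [
    [2.0, 0.1],  # "cat"
    [1.5, 0.2],  # "dog"
    [0.1, 2.5],  # "car"
]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probabilities(features, weights=WEIGHTS):
    """Score each word as a weighted sum of the input features."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return dict(zip(VOCAB, softmax(scores)))


# An input that leans entirely toward the first (hypothetical) feature.
probs = next_word_probabilities([1.0, 0.0])
best = max(probs, key=probs.get)
print(best, probs)
```

Changing the numbers in WEIGHTS changes which word the toy model prefers for the same input, which is why releasing weights matters: they encode what the model learned during training. Real LLMs apply the same idea with billions of weights.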

In other words, it’s not just about code anymore. Building an LLM takes a model architecture, weights, and data sets in addition to code. While “open” LLMs may disclose model weights and starter code, they don’t necessarily share every data source used to create the model. An open source LLM, on the other hand, would share each step and data source along with a permissive license that allows others to use, build upon, and further distribute the model.

When the recipes for LLMs are distributed for use without charge, individuals and organizations get the opportunity to build upon the work of others. This leads to many benefits, such as:

Collaborative improvement: Fostering collaboration from diverse sources is arguably the biggest benefit of open source LLMs. Broader access to generative AI (gen AI) technologies allows for more experimentation and learning, which can help reduce biases, increase accuracy, and improve performance.

Transparency: If we don’t know how a model was trained, how can we trust the output? An open source LLM provides full transparency into how it was trained. This helps users understand how features work and gives them the information they need to decide how (or if) they’ll use the technology.

Less environmental impact: When models are transparent, we can see what work has already been done. This reduces redundant training and evaluation runs, which would otherwise consume additional computation and produce additional emissions.

Financial accessibility: Training an LLM from scratch is expensive and resource intensive. If you use a proprietary LLM, you may also have to pay licensing fees. The ability to build upon someone else’s finished work for free lowers the barrier to entry for organizations that otherwise couldn’t afford to develop an LLM.

Open source principles are responsible for many foundational aspects of the internet as we know it. The open source development model has led to some of the most important applications and cloud platforms in use today.

This spirit of freedom continues on a spectrum when it comes to large language models and how “open” or “closed” they are to the public. Let’s take a look at some of the most popular LLMs:

Closed models
The models behind ChatGPT from OpenAI and Claude from Anthropic are closed models. They’re tightly controlled and made available to users under restrictive terms, typically through hosted applications and paid API services.

Open models
The term “open source” has been used colloquially to refer to any LLM that’s downloadable free of charge on platforms like Hugging Face. This is the case with Meta’s Llama 2 model. However, the terms for Llama 2 don’t fit the common definition of open source software, because the license agreement includes conditions and restrictions the user must accept. First, Meta has put in place certain legal and moral restrictions, such as what constitutes “acceptable use.” Second, the license agreement requires any organization whose products exceed a set threshold of monthly active users (more than 700 million) to request an additional license from Meta.

Open source-licensed models
The Granite family of models from IBM Research and several of the Mistral AI models are examples of LLMs available under an Apache 2.0 license. This means the models can be used, modified, and redistributed, including for commercial purposes, subject only to the license’s attribution and notice conditions. However, even these models don’t make all their training data available for inspection, in some cases due to licensing restrictions.
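
The three categories above can be summarized as a small decision rule. The Python sketch below is only an illustration of that reasoning, not an official or legal classification: the ModelRelease fields and the openness and shares_full_recipe functions are hypothetical names invented for this example, and the sample entries simply mirror the descriptions given in this article.

```python
from dataclasses import dataclass


@dataclass
class ModelRelease:
    """Hypothetical summary of how a model is released."""
    name: str
    weights_downloadable: bool
    open_source_license: bool  # for example, Apache 2.0
    training_data_published: bool = False


def openness(release: ModelRelease) -> str:
    """Place a release on the closed-to-open spectrum described above."""
    if not release.weights_downloadable:
        return "closed"
    if not release.open_source_license:
        return "open"
    return "open source-licensed"


def shares_full_recipe(release: ModelRelease) -> bool:
    """True only if the license is open source AND the training data is shared."""
    return release.open_source_license and release.training_data_published


# Sample entries that follow this article's own descriptions.
examples = [
    ModelRelease("Claude", weights_downloadable=False, open_source_license=False),
    ModelRelease("Llama 2", weights_downloadable=True, open_source_license=False),
    ModelRelease("Granite", weights_downloadable=True, open_source_license=True),
]

for release in examples:
    print(f"{release.name}: {openness(release)}, "
          f"full recipe shared: {shares_full_recipe(release)}")
```

Note that none of the sample entries passes the stricter shares_full_recipe check, which reflects the point made above: even open source-licensed models may not publish all of their training data.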

Red Hat envisions a future where anyone can contribute, review, and build upon code from an open, trustworthy foundation. We believe using an open development model helps create more stable, secure, and innovative technologies. As AI continues to grow, our open source platforms can help you build, deploy, and monitor AI models and applications for your own needs with your own data.

Red Hat® Enterprise Linux® AI is a foundation model platform for developing, testing, and running Granite family LLMs for enterprise applications. With the technological foundation of Linux, containers, and automation, Red Hat’s open hybrid cloud strategy gives you the flexibility to run your AI applications anywhere you need them.

Created by IBM and Red Hat, InstructLab is an open source project and community for enhancing LLMs following open source principles. The InstructLab project gathers a set of training data curated by humans, generates synthetic data based on the seed training data, then uses the synthetic data to retrain the base model. Community contributions can lead to regular iterative builds of enhanced LLMs. InstructLab is a cost-effective solution for improving the alignment of LLMs and opens the doors for those with minimal machine learning experience to contribute.

Built using open source technologies, Red Hat OpenShift® AI is an enterprise-ready AI application platform that helps teams build, operate, and scale with confidence. OpenShift AI supports data acquisition and preparation, model training and fine-tuning, model serving and monitoring, and hardware acceleration.