While many AI model developers publicly release research papers describing their training approaches, IBM has gone a step further with its Granite models and released the specific training data itself. So, if you would like to know exactly what the Granite family of large language models (LLMs) is trained on, this article provides a detailed breakdown of the datasets used in the initial training phase of IBM's granite.13b.v1 model, the original Granite model from which other variants were fine-tuned for downstream tasks.
What are the IBM Granite models?
As we begin to see the impact of AI in our lives and organizations, principles such as trust are as important for AI/ML models as they are for our software. That's why IBM Research built and trained the Granite family of models with transparency, under an Apache 2.0 license for broad, unencumbered commercial use. "The Granite family of models provides enterprise users with some of the most robust and transparent insights into the underlying training data, important for efficiently refining model behavior for specific use cases and domains, and for protecting enterprises from risk from any unlicensed content in the training data," as reported by The Forrester Wave™: AI Foundation Models For Language, Q2 2024.
What data was used to train the Granite models?
Granite.13b.v1 was trained on a massive dataset consisting of 1 trillion tokens drawn from 14 distinct datasets across various domains. Thanks to this transparency in training data, we can detail the data sources used to teach the model to handle sentiment classification, named entity recognition, question answering, and summarization. These are considered enterprise-safe data sources, and Granite models are among the most transparent according to Stanford University's Foundation Model Transparency Index 2024. Let's break these down into several categories.
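To make the idea of a multi-source training mix concrete, here is a minimal sketch of weighted sampling across several corpora. The source names echo datasets listed below, but the weights are invented for illustration; the real mixing proportions come from IBM's published data card, not from this example.

```python
import random

# Hypothetical sampling weights for a few of the 14 sources.
# These values are made up for illustration only.
SOURCE_WEIGHTS = {
    "arxiv": 0.05,
    "common_crawl": 0.50,
    "github_clean": 0.10,
    "pubmed_central": 0.05,
    "sec_filings": 0.30,
}

def sample_source(weights, rng=random):
    """Pick a data source in proportion to its weight."""
    sources = list(weights)
    probs = [weights[s] for s in sources]
    return rng.choices(sources, weights=probs, k=1)[0]

# Draw a small mixed batch: each element names the corpus the next
# training document would be pulled from.
batch = [sample_source(SOURCE_WEIGHTS) for _ in range(8)]
```

In a real pretraining run the sampler would stream documents from each corpus rather than just naming it, but the proportional mixing shown here is the core idea behind combining many datasets into one token budget.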
Academia and science
- arXiv: This dataset includes over 1.8 million scientific preprints
- DeepMind Mathematics: This dataset contains pairs of mathematical questions and their corresponding answers
- PubMed Central: This dataset comprises biomedical and life sciences research papers
Legal and financial
- Free Law: This dataset encompasses public-domain legal opinions from both US federal and state courts
- SEC Filings: This dataset contains 10-K/Q filings from the US Securities and Exchange Commission (SEC) spanning from 1934 to 2022
- United States Patent and Trademark Office: This dataset includes US patents granted between 1975 and May 2023, excluding design patents
Code and technology
- GitHub Clean: This dataset features code from CodeParrot in various programming languages
- Hacker News: This dataset comprises news articles focused on computer science and entrepreneurship, collected between 2007 and 2018
General web and literature
- Common Crawl: This dataset is an open repository of web crawl data
- OpenWebText: This is an open source reproduction of OpenAI's WebText corpus containing web pages up to 2019
- Project Gutenberg (PG-19): This dataset includes free e-books, primarily older works with expired US copyrights
Other
- Stack Exchange: This dataset features anonymized user-contributed content from the Stack Exchange network, a collection of websites focused on questions and answers
- Webhose: This dataset includes unstructured web content transformed into machine-readable data feeds, acquired by IBM
- Wikimedia: This dataset contains extracted plain text from pages and articles across eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary)
The Granite 13b model is the base model from which all other Granite variants were fine-tuned for specific tasks. Version 2 of the model, granite.13b.v2, underwent additional pretraining on 1.5 trillion new tokens that were deemed usable after passing through IBM's data processing pipeline. Added to the 1 trillion tokens from version 1, that brings the total to 2.5 trillion tokens used to train version 2. Version 2 still contains the same 14 datasets as version 1, plus six new datasets.
V2 additional pre-training data:
- Earnings Call Transcripts: This dataset includes transcripts from the quarterly earnings calls companies hold with investors
- EDGAR Filings: This dataset contains annual reports from all publicly traded companies in the US, spanning more than 25 years
- FDIC: This dataset comes from the annual submissions of the Federal Deposit Insurance Corporation (FDIC)
- Finance Text Books: A corpus from the University of Minnesota's Open Textbook Library, including all textbooks tagged as finance
- Financial Research Papers: Publicly available financial research paper corpus
- IBM Documentation: IBM Redbooks and product documents
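The data processing pipeline mentioned above decides which raw tokens are "deemed usable." IBM's actual pipeline involves more stages than can be shown here (language identification, hate/abuse/profanity filtering, and so on); the sketch below illustrates just two representative steps, exact deduplication and a simple quality gate, with all thresholds invented for illustration.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())

def process(docs, min_words=5):
    """Toy two-stage pipeline: exact deduplication, then a length-based quality gate."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        if len(doc.split()) >= min_words:
            kept.append(doc)  # keep documents that pass the quality filter
    return kept

corpus = [
    "The FDIC publishes annual submissions from insured banks.",
    "The  FDIC publishes annual submissions from insured banks.",  # near-identical copy
    "Too short.",
]
clean = process(corpus)  # only one unique, sufficiently long document survives
```

Real pipelines use fuzzy deduplication (for example, MinHash) and learned quality classifiers rather than a word count, but the shape is the same: each stage removes documents, and only the tokens that survive every stage count toward the training budget.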
As with any form of software, having trust and confidence in our workloads is critical to enterprise readiness. As AI is another tool being used to enhance our applications and streamline business processes, we should treat it as such and work to apply the same open source principles and transparency that have been tested over the years to AI itself.
Red Hat’s history as a leader in the open source community has led to RHEL AI, a supported platform for training and deploying Granite models for enterprise applications. However, as this industry continues to advance, we should strive for openness as a whole: from research papers detailing architecture advancements, to permissive licensing that encourages widespread adoption, to transparency about the training data itself. What history has demonstrated is that when work and collaboration happen in the open, everybody benefits.
About the authors
Legare Kerrison is an intern on the developer advocacy team, focusing on providing developers with resources for Red Hat products, with an emphasis on Podman and InstructLab.
Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source and works to make developers' lives easier! He is based out of New York.