While many AI model developers publicly release research papers describing their training approaches, we’ll focus on one model family in particular: IBM’s Granite models, for which IBM has gone a step further and released the specific training data. So, if you would like specifics on what the Granite family of large language models (LLMs) is trained on, this article provides a detailed breakdown of the datasets used in the initial training phase of IBM’s popular granite.13b.v1 model, the original Granite model from which other variants were fine-tuned to target downstream tasks.
What are the IBM Granite models?
As we begin to see the impact of AI in our lives and organizations, principles such as trust are as important to our software as they are to AI/ML models. Thus, IBM Research built and trained the Granite family of models with transparency under an Apache 2.0 license for broad, unencumbered commercial use. “The Granite family of models provides enterprise users with some of the most robust and transparent insights into the underlying training data, important for efficiently refining model behavior for specific use cases and domains, and for protecting enterprises from risk from any unlicensed content in the training data”, as reported by The Forrester Wave™: AI Foundation Models For Language Q2 2024.
What data was used to train the Granite models?
Granite.13b.v1 was trained on a massive dataset consisting of 1 trillion tokens drawn from 14 distinct datasets across various domains. Thanks to this transparency in training data, we’re able to detail the data sources used to teach the model to handle sentiment classification, named entity recognition, question answering, and summarization. These are considered enterprise-safe data sources, and Granite models are among the most transparent according to Stanford University’s Foundation Model Transparency Index 2024. Let’s break these down into several categories.
Academia and science
- arXiv: This dataset includes over 1.8 million scientific pre-prints
- DeepMind Mathematics: This dataset contains pairs of mathematical questions and their corresponding answers
- PubMed Central: This dataset comprises biomedical and life sciences research papers
Legal and financial
- Free Law: This dataset encompasses public-domain legal opinions from both US federal and state courts
- SEC Filings: This dataset contains 10-K/Q filings from the US Securities and Exchange Commission (SEC) spanning from 1934 to 2022
- United States Patent and Trademark Office: This dataset includes US patents granted between 1975 and May 2023, excluding design patents
Code and technology
- GitHub Clean: This dataset features code from CodeParrot in various programming languages
- Hacker News: This dataset comprises news articles focused on computer science and entrepreneurship, collected between 2007 and 2018
General web and literature
- Common Crawl: This dataset is an open repository of web crawl data
- OpenWebText: This is an open source version of OpenAI's WebText corpus containing web pages up to 2019
- Project Gutenberg (PG-19): This dataset includes free e-books, primarily older works with expired US copyrights
Other
- Stack Exchange: This dataset features anonymized user-contributed content from the Stack Exchange network, a collection of websites focused on questions and answers
- Webhose: This dataset includes unstructured web content transformed into machine-readable data feeds acquired by IBM
- Wikimedia: This dataset contains extracted plain text from pages and articles across eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary)
The Granite 13b model is the base model from which all other Granite variants were fine-tuned for specific tasks. However, version 2 of the 13b model, granite.13b.v2, underwent additional pretraining on 1.5 trillion new tokens that were deemed usable after passing through IBM's data processing pipeline. Adding these to the 1 trillion tokens from version 1 brings the total used to train version 2 to 2.5 trillion tokens. Version 2 still contains the same 14 datasets as version 1, plus six new datasets.
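To give a feel for why only a fraction of raw tokens are "deemed usable," here is a minimal, illustrative sketch of two stages that pretraining data pipelines commonly apply: exact deduplication via content hashing and a simple length filter. This is an assumption-laden toy example, not IBM's actual pipeline, and the sample corpus and thresholds are invented for illustration.

```python
import hashlib


def exact_dedup(documents):
    """Drop duplicate documents by hashing their normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def length_filter(documents, min_words=5):
    """Discard fragments too short to be useful training text."""
    return [d for d in documents if len(d.split()) >= min_words]


# Hypothetical raw corpus: one duplicate (differs only in casing) and one
# fragment that is too short survive neither stage.
corpus = [
    "Granite models are trained on curated enterprise data sources.",
    "granite models are trained on curated enterprise data sources.",
    "Too short.",
]
cleaned = length_filter(exact_dedup(corpus), min_words=5)
```

Real pipelines layer many more stages on top of these (language identification, quality scoring, near-duplicate detection, and filtering for hate, abuse, and profanity), but the principle is the same: each stage shrinks the raw token count toward the usable total.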
V2 additional pre-training data:
- Earnings Call Transcripts: This dataset includes transcripts of the quarterly earnings calls companies hold with investors
- EDGAR Filings: Annual reports from all the publicly traded companies in the US spanning a period of more than 25 years
- FDIC: The data is from the annual submissions of the Federal Deposit Insurance Corporation (FDIC)
- Finance Text Books: A corpus from the University of Minnesota's Open Textbook Library, including all textbooks tagged as finance
- Financial Research Papers: Publicly available financial research paper corpus
- IBM Documentation: IBM Redbooks and product documents
As with any form of software, having trust and confidence in our workloads is critical to enterprise readiness. As AI is another tool being used to enhance our applications and streamline business processes, we should treat it as such and work to apply the same open source principles and transparency that have been tested over the years to AI itself.
Red Hat’s history as a leader in the open source community has led to RHEL AI, a supported platform for training and deploying Granite models for enterprise applications. However, as this industry continues to advance, we should strive for openness as a whole, from research papers detailing architecture advancements, to permissive licensing for encouraging widespread adoption, and finally the transparency behind training data itself. What history has demonstrated is that when work and collaboration is done in the open, everybody benefits.
About the authors
Legare Kerrison is an intern on the developer advocacy team, focusing on providing developers with resources for Red Hat products, with an emphasis on Podman and InstructLab.
Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source and works to make developers' lives easier! He is based in New York.