While many AI model developers publicly release research papers describing their training approaches, we'll focus on one model in particular: IBM's Granite, where IBM has gone a step further and released its specific training data. So, if you want to know exactly what the Granite family of large language models (LLMs) is trained on, this article provides a detailed breakdown of the datasets used in the initial training phase of IBM's popular granite.13b.v1 model, the original Granite model from which other model variants were fine-tuned to target downstream tasks.

What are the IBM Granite models?

As we begin to see the impact of AI in our lives and organizations, principles such as trust are as important to our software as they are to AI/ML models. Thus, IBM Research built and trained the Granite family of models with transparency under an Apache 2.0 license for broad, unencumbered commercial use. “The Granite family of models provides enterprise users with some of the most robust and transparent insights into the underlying training data, important for efficiently refining model behavior for specific use cases and domains, and for protecting enterprises from risk from any unlicensed content in the training data”, as reported by The Forrester Wave™: AI Foundation Models For Language Q2 2024.

What data was used to train the Granite models?

Granite.13b.v1 was trained on a massive dataset consisting of 1 trillion tokens derived from 14 distinct datasets across various domains. Thanks to this transparency in training data, we're able to detail the data sources used to teach the model to handle sentiment classification, named entity recognition, question answering, and summarization. These are considered to be enterprise-safe data sources, and Granite models are among the most transparent according to Stanford University's Foundation Model Transparency Index 2024. Let's break these down into several categories.
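Dataset sizes for LLM pretraining are measured in tokens rather than raw documents or bytes. As a rough illustration of what that means (using a naive whitespace tokenizer as a stand-in, not IBM's actual subword tokenizer, which isn't specified here), a minimal sketch of tallying token counts across several corpora:

```python
# Illustrative only: counts "tokens" with a naive whitespace split.
# Real pretraining pipelines use a subword tokenizer (e.g. BPE),
# so actual counts would differ.

def count_tokens(text: str) -> int:
    """Approximate token count via whitespace splitting."""
    return len(text.split())

def total_tokens(corpora: dict[str, str]) -> dict[str, int]:
    """Per-corpus token tallies, plus an overall total."""
    counts = {name: count_tokens(text) for name, text in corpora.items()}
    counts["TOTAL"] = sum(counts.values())
    return counts

# Hypothetical snippets standing in for two of the 14 corpora.
corpora = {
    "arXiv": "We study the asymptotic behavior of the estimator",
    "Free Law": "The court holds that the statute applies here",
}
print(total_tokens(corpora))
```

Summed across all 14 datasets, the equivalent tally for granite.13b.v1 reaches 1 trillion tokens.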

Academia and science

  • arXiv: This dataset includes over 1.8 million scientific pre-prints
  • DeepMind Mathematics: This dataset contains pairs of mathematical questions and their corresponding answers
  • Pubmed Central: This dataset comprises biomedical and life sciences research papers

Legal and financial

  • Free Law: This dataset encompasses public-domain legal opinions from both US federal and state courts
  • SEC Filings: This dataset contains 10-K/Q filings from the US Securities and Exchange Commission (SEC) spanning from 1934 to 2022
  • United States Patent and Trademark Office: This dataset includes US patents granted between 1975 and May 2023, excluding design patents

Code and technology

  • GitHub Clean: This dataset features code from CodeParrot in various programming languages
  • Hacker News: This dataset comprises news articles focused on computer science and entrepreneurship, collected between 2007 and 2018

General web and literature

  • Common Crawl: This dataset is an open repository of web crawl data
  • OpenWebText: This is an open source version of OpenAI's WebText corpus containing web pages up to 2019
  • Project Gutenberg (PG-19): This dataset includes free e-books, primarily older works with expired US copyrights

Other

  • Stack Exchange: This dataset features anonymized user-contributed content from the Stack Exchange network, a collection of websites focused on questions and answers
  • Webhose: This dataset includes unstructured web content transformed into machine-readable data feeds, acquired by IBM
  • Wikimedia: This dataset contains extracted plain text from pages and articles across eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary)

The Granite 13b model is the base model from which all other Granite variants were fine-tuned for specific tasks. Version 2 of the 13b model, granite.13b.v2, however, underwent additional pretraining on 1.5T new tokens that were deemed usable after passing through the data processing pipeline shown below. Adding these to version 1's 1T tokens brings the total to 2.5T tokens used to train version 2. Version 2 still contains the same 14 datasets as version 1, plus six new datasets.

[Figure: A funnel illustrating the filtering of extracted data, beginning with 28.7 terabytes of raw data and finishing with 2.5 trillion tokens usable for training.]
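The exact filters IBM applied aren't enumerated in this article, but pretraining funnels of this kind typically chain stages such as language identification, quality heuristics, and deduplication, with each stage discarding documents so that far fewer tokens come out than raw bytes went in. A hedged sketch of that general pattern (the specific filters and thresholds below are illustrative assumptions, not IBM's actual pipeline):

```python
# Illustrative data-processing funnel. Each stage discards documents;
# the filters are generic stand-ins, not IBM's actual pipeline.
import re

def looks_english(doc: str) -> bool:
    """Crude language check: the document is mostly ASCII letters."""
    letters = sum(c.isascii() and c.isalpha() for c in doc)
    return letters > 0.5 * max(len(doc), 1)

def passes_quality(doc: str) -> bool:
    """Heuristic quality gate: long enough, no long runs of symbols."""
    return len(doc.split()) >= 5 and not re.search(r"[^\w\s]{5,}", doc)

def deduplicate(docs: list[str]) -> list[str]:
    """Exact-match dedup; real pipelines also do fuzzy/near-dedup."""
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc.split()).lower()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def run_funnel(docs: list[str]) -> list[str]:
    """Chain the stages: language filter -> quality filter -> dedup."""
    docs = [d for d in docs if looks_english(d)]
    docs = [d for d in docs if passes_quality(d)]
    return deduplicate(docs)
```

Run over a raw crawl, a pipeline shaped like this is how 28.7 TB of extracted data shrinks to the 2.5T usable tokens cited above.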

V2 additional pre-training data:

  • Earnings Call Transcripts: This dataset includes transcripts from the quarterly earnings calls that companies hold with investors
  • EDGAR Filings: Annual reports from all the publicly traded companies in the US spanning a period of more than 25 years
  • FDIC: The data is from the annual submissions of the Federal Deposit Insurance Corporation (FDIC)
  • Finance Text Books: A corpus from University of Minnesota's Open Textbook Library, including all textbooks tagged as finance
  • Financial Research Papers: Publicly available financial research paper corpus
  • IBM Documentation: IBM redbooks and product documents

As with any form of software, having trust and confidence in our workloads is critical to enterprise readiness. As AI is another tool being used to enhance our applications and streamline business processes, we should treat it as such and work to apply the same open source principles and transparency that have been tested over the years to AI itself.

Red Hat’s history as a leader in the open source community has led to RHEL AI, a supported platform for training and deploying Granite models for enterprise applications. However, as this industry continues to advance, we should strive for openness as a whole, from research papers detailing architecture advancements, to permissive licensing for encouraging widespread adoption, and finally the transparency behind training data itself. What history has demonstrated is that when work and collaboration is done in the open, everybody benefits.

Learn more about RHEL AI


About the authors

Legare Kerrison is an intern on the developer advocacy team, focusing on providing developers with resources for Red Hat products, with an emphasis on Podman and Instructlab.


Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source, and works to make developers' lives easier! Based out of New York.

