Training a large language model (LLM) probably sounds like a specialized and highly technical task. Until recently, that was true, but the InstructLab project has been putting in the work required to make it easier for anyone to train an LLM. That means you can contribute to the development of artificial intelligence (AI), either because your organization needs domain-specific knowledge for its AI solution, or just because you want to help improve open source AI. All you need to know is how to type text in a simple format called YAML, and this article is going to teach you exactly how to do that.
Open a text editor
An LLM establishes probable responses to questions by analyzing existing content about a specific topic. The easiest way to contribute to an LLM is to contribute knowledge about a topic, in the form of questions and answers. All you need for that is a text editor.
A text editor is like a simplified word processor. There are lots of them out there, including Notepad++, Notepadqq, Pulsar, and VSCodium, so just choose the one that works for you.
Download the template
The InstructLab project provides a template file for knowledge content so you don’t have to start from a blank file. To download the template, open github.com/instructlab/taxonomy in a web browser, click the green Code button in the top right corner, and select Download ZIP.
Once the files have downloaded, find the ZIP archive in your Downloads folder and unzip it. This produces a new folder called taxonomy.
Write simplified YAML
In the docs folder within the taxonomy directory, open the file named template_qna.yaml in your text editor. This file contains a blank question-and-answer session you can use as a template for the knowledge you want to provide training for.
YAML is designed to be simple, but the amount of YAML you need for this is even simpler. Mostly, YAML is a collection of labels (also called “keys”) and descriptions (also called “values”). A set of these key-and-value pairs is called a “mapping,” and it’s how a lot of data on the internet is structured. When you go to your favorite online store, you probably shop by clicking on a label (the name of an item) and then you read its description. When you write a report for work or school, you probably write a subheading, and then you write a paragraph explaining more about that subheading. InstructLab’s use of YAML is based on the exact same concept.
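To make the label-and-description idea concrete, here is a minimal, generic YAML snippet. The labels are made up for illustration only and are not part of the InstructLab template.

```yaml
# Each line is a label (key), then a colon, then a description (value).
# These labels are hypothetical and exist only to show the pattern.
title: Tarzan of the Apes
author: Edgar Rice Burroughs
```

Reading it is the same as reading a product listing: the word before the colon names the thing, and the text after the colon describes it.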
Here’s an abbreviated sample of the blank YAML template:
version: 3
domain: <The knowledge domain>
created_by: <Your name>
seed_examples:
  - context: |
      <Context from the document associated with this set of sample q&a pairs.>
    questions_and_answers:
      - question: |
          <A relevant question used for synthetic data generation.>
        answer: |
          <The desired response for the question.>
The data at the top of the document establishes the knowledge domain you’re writing about, and who you are. If you’re contributing to the InstructLab project, then you must use your GitHub user name as the description of the created_by label. If you’re contributing to a private LLM, then you can use your name or whatever description the project manager has requested.
The seed_examples label is the main label for the knowledge section you’re about to create. It doesn’t require a description of its own, because it contains yet more labels.
The context label is essentially a subheading, and it requires a statement from you that describes the kind of conversation that might lead to the questions and answers you’re about to enter. For example, to add a question and answer session about some aspect of the ancient Ptolemaic empire, you might describe its context as “The kings and queens of the Ptolemaic empire.” To enter questions and answers about the works of Edgar Rice Burroughs, you might write “The literature of Edgar Rice Burroughs.” Just imagine you’re writing a report for school. It’s the same logic.
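Placed in the template, the Edgar Rice Burroughs context might look something like this. It's a sketch of the context portion only, with the rest of the file left out.

```yaml
seed_examples:
  - context: |
      The literature of Edgar Rice Burroughs.
```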
Indentation is important
YAML is a sequence of label after label, so it relies on indentation to represent the flow of logic. In a word processor, a heading is often displayed as large and bold text compared to the text in a paragraph. Instead of using font size and style, YAML uses indentation.
When you write a description in InstructLab’s YAML file, you write it on the line under the label, and you add two spaces to the level of indentation. This is how the template is structured, so it’s a pretty easy pattern to fall into.
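Here is a small sketch of that pattern, shown outside the full template so the indentation is easy to see. The question and answer text is just sample content.

```yaml
# The label ends with a colon and a pipe character (|), which tells YAML
# that the description continues on the indented lines that follow.
question: |
  How many Ptolemaic queens were named Cleopatra?
answer: |
  There were 7 Ptolemaic queens named Cleopatra.
```

Each description sits two spaces deeper than its label. If a description lines up with its label instead, YAML no longer treats it as belonging to that label.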
Questions and answers
Next is the actual question and answer section. Under each question heading, you write exactly one question that you might anticipate in a conversation about your chosen topic. Under each answer heading, you write a simple answer to that question.
It’s best to keep both the questions and answers short and concise, because that ensures that they’re modular and distinct. Don’t try to sneak two questions into one, especially when the answer to one question has no bearing on the answer to the second. It’s misleading to ask “Did Edgar Rice Burroughs write the Tarzan book and movie?” as one question, because Edgar Rice Burroughs wrote the Tarzan books whether or not he wrote a Tarzan screenplay.
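One way to handle the Tarzan example is to split it into two separate pairs, sketched below. The second answer is left as a placeholder on purpose, because it needs to be researched and written on its own rather than bundled with the first.

```yaml
questions_and_answers:
  - question: |
      Did Edgar Rice Burroughs write the Tarzan books?
    answer: |
      Yes, Edgar Rice Burroughs wrote the Tarzan books.
  - question: |
      Did Edgar Rice Burroughs write a Tarzan screenplay?
    answer: |
      <A short, separately verified answer about the screenplay.>
```

Each pair now stands alone, so the answer to one question doesn't depend on the answer to the other.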
Write a distinct question and a focused answer so that the LLM can use your knowledge to extrapolate correct data. Here’s an example:
version: 3
domain: Ptolemaic empire
created_by: Tux
seed_examples:
  - context: |
      Discussion of Cleopatra.
    questions_and_answers:
      - question: |
          How many Ptolemaic queens were named Cleopatra?
        answer: |
          There were 7 Ptolemaic queens named Cleopatra.
YAML for InstructLab
YAML is a way of writing text so that it has predictable structure, which makes it easy for computers to process. Follow the InstructLab template, add your knowledge to the LLM of your choice, and help improve AI. If you need reinforcement for what you’ve learned from this article, check out this video introduction on how to get started!
About the author
Seth Kenlon is a Linux geek, open source enthusiast, free culture advocate, and tabletop gamer. Between gigs in the film industry and the tech industry (not necessarily exclusive of one another), he likes to design games and hack on code (also not necessarily exclusive of one another).