How to get started with InstructLab today

17 juin 2024Seth Kenlon

When people talk about artificial intelligence (AI), they’re usually talking about the combination of a chat bot, providing input and output, and a large language model (LLM), providing data that the chat bot can use to form sentences. AI without LLM isn’t very useful, and that’s why much of the conversation around the legalities and ethics of AI are concerned with what’s being used to build the “knowledge” used by generative AI (gen AI). How can you be sure that the data a gen AI uses to formulate its answers is reliable, trustworthy, and unencumbered by copyright? The best way to either audit or specialize the knowledge base of AI is to use open source, and that’s what the InstructLab project makes possible.

What is InstructLab?

InstructLab is an open source AI project that promotes universal modeling with open contribution. Its stated goal is to enable anyone to shape gen AI, whether you need an open source LLM due to concerns over intellectual property and copyright, privacy, reliability, subject matter expertise, accessibility or anything else. Designing a complete LLM is a big task, so the best way to build an open LLM is to build it in the open. Because InstructLab is open source, you can contribute to it and help make open source language models the best choice for gen AI. Here are three ways you can get started with InstructLab today.

Share your expertise

AI uses probability to construct its responses and it bases each answer on factual information serving as a model. The collection of facts used by AI is part of a LLM. For InstructLab to be the best basis of AI-powered content, it must provide an exhaustive LLM. Building an LLM requires the construction of a data bank of reliable content. In InstructLab terminology, this is called a taxonomy, which includes the two primary categories of skill and knowledge.

A skill in InstructLab is performative. When you create a skill for InstructLab, you teach it how to do something specific, like rearranging words in a sentence while maintaining the same meaning, finding two words that rhyme or converting a string to camel case.

Knowledge is a collection of facts, with citation of a reliable source. When you create knowledge for a language model, you provide the model data it can use to answer direct questions.

Both skill and knowledge are stored as yet another markup language (YAML), a minimalist file format consisting of key and value pairs (a “mapping”) and lists (a “sequence”). Here’s a simple example of knowledge expressed in YAML:

---
version: 2
created_by: tux
domain: flowers
seed_examples:
 - answer: 'A carnation is a herbaceous perennial plant.'
   question: 'What kind of plant is a carnation?'
 - answer: 'Dianthus caryophyllus'
   question: 'What is the scientific name for a carnation?'
task_description: 'teach a language model about carnations'
document:
 repo: https://github.com/juliadenham/Summit_knowledge
 commit: 195fc4d83a40d8a1b60062e66e06cfc0bc9c8d35
 patterns:
   - dianthus_caryophyllus.md

Here’s a simple example of a skill expressed as YAML:

---
version: 2
task_description: 'Teach the model how to rhyme.'
created_by: juliadenham
seed_examples:
 - question: What are 5 words that rhyme with horn?
   answer: warn, torn, born, thorn, and corn.
 - question: What are 5 words that rhyme with cat?
   answer: bat, gnat, rat, vat, and mat.
 - question: What are 5 words that rhyme with poor?
   answer: door, shore, core, bore, and tore.
 - question: What are 5 words that rhyme with bank?
   answer: tank, rank, prank, sank, and drank.
 - question: What are 5 words that rhyme with bake?
   answer: wake, lake, steak, make, and quake.

Compare the YAML examples of knowledge and skill. Knowledge contains verifiable data on a specific topic. A skill contains examples of a specific task.

After reading the contribution guide, you can create a qna.yamlfile of your own, and submit it to InstructLab for inclusion in the LLM. You may have to revise your work to ensure it can be processed and integrated into the project, and getting familiar with tools like yamllint is useful, but with just a little effort, you can make a meaningful contribution to open source AI.

Run an AI locally with the ilab command

Setting up an AI is a fairly complex and manual process, but with InstructLab it’s easier than you might expect. You need to be familiar with Python tools like virtual environments and pip, and you must be comfortable in a terminal environment such as Bash. You also must have CUDA (or a similar parallel computing framework) set up on your system, and plenty of drive space (the LLM is 5 GB, and growing).

Follow the install guide on the InstructLab repository, and then interact with AI and the InstructLab model, and then report on bugs and feature requests.

Contribute code

At the moment, the InstructLab project consists of 12 repositories. There’s the command-line interface ilab, a Python library for synthetic data generation, design documents, taxonomy files and the JSON schema for the taxonomy YAML and more. If you’re a programmer, then you might find issues or feature requests in unclosed bug reports that you could help resolve.

For your first contribution, it often makes sense to solve a minor issue in anticipation that you’ll use the bulk of your time understanding the development team’s process. Bugs requiring only a simple fix are tagged with good first issue, so use is:open is:issue label:"good first issue" as a filter when looking for a good entry point. There’s also a guide for first-time contributors that explains in detail how to set up your dev environment and, just as importantly, how to test your new code before requesting a merge.

Open source AI is within reach, and as with any form of open source it stands to place the control and terms of AI into the hands of users. If you deal in a specialized domain, general AI may not have the knowledge or skill required to be useful to your users. If you deal with sensitive data, then general AI may not even have access to the information your users need. With InstructLab, you can help build a universal and open LLM, or even build your own. Whatever your goal, get started with InstructLab today!

À propos de l'auteur

Seth Kenlon

Linux geek

Seth Kenlon is a Linux geek, open source enthusiast, free culture advocate, and tabletop gamer. Between gigs in the film industry and the tech industry (not necessarily exclusive of one another), he likes to design games and hack on code (also not necessarily exclusive of one another).

Read full bio