
Artificial intelligence (AI), and especially generative AI (gen AI), offers immense opportunities for open research and innovation. Since its inception, however, AI's commercialization has raised concerns about transparency, reproducibility and, most importantly, security, privacy and safety.

There have been many debates about the risks and benefits of open sourcing AI models, familiar territory for the open source community, where initial doubt and skepticism often evolve into acceptance. However, there are significant differences between open source code and open source AI models.

What is open source AI?

The definition of an "open source AI model" is still evolving as researchers and industry experts continue to delineate its framework. We don’t intend to dive into that debate or define its parameters in this article. Instead, our focus is on showing how the IBM Granite model is open source and why an open source model is inherently more trustworthy.

Open source licenses

A fundamental part of the open source movement involves publishing software code under licenses that grant users independence and control, giving them the right to inspect, modify and redistribute the code with minimal restrictions. Licenses approved by the Open Source Initiative (OSI), such as Apache 2.0 and MIT, have been key to enabling worldwide collaborative development, freedom of choice and accelerated progress.

Several models, such as the IBM Granite model and its variants, are released under the permissive Apache 2.0 license. However, models released under permissive licenses still face a number of challenges, which we discuss below.

How does an open license help security and safety?

This relates to the core principles of open source: a permissive license allows more users to use and experiment with the model, which means more security and safety issues can be discovered, reported and, in most cases, fixed.

Open data

The term "large" in "large language model" (LLM) refers both to the many parameters that constitute the model and to the large amount of data required to train it. Model efficacy is often measured by the number of input tokens—often trillions for a good model—used to train the model.
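For a sense of scale, a common rule of thumb (an assumption here, not an exact measure; real tokenizers differ) is roughly four characters of English text per token, so a trillion-token corpus corresponds to several terabytes of raw text:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic.

    Real BPE tokenizers (including the one shipped with any given model)
    will produce different counts; this is only an order-of-magnitude aid.
    """
    return round(len(text) / chars_per_token)

# 400 characters of text is on the order of 100 tokens under this heuristic.
print(estimate_tokens("a" * 400))  # -> 100
```

The heuristic is only useful for back-of-the-envelope sizing; measuring an actual training corpus requires running the model's own tokenizer over it.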

For most closed models, the data sources used to pre-train and fine-tune the model are secret and form the very basis of differentiation from similar products created by other companies. We believe that for an AI model to be truly open source, it is important to reveal the data used to pre-train and fine-tune that model.

The corpus of data used to train Granite foundation models is documented in detail, along with the governance and safety workflows applied to the data before it is sent to the training pipeline.

How does open data help security and safety?

The kind of data a large language model generates during inference depends on the data with which it was trained. Open data allows community members to examine the data used to train the model and verify that no hazardous data enters the pipeline. Open governance practices also help reduce model bias, because biases can be identified and removed as early as the pre-training phase.

Freedom to modify and share

This brings us to the challenges of models released with permissive licenses:

  • Because of the way these models are created and distributed, it’s not possible to contribute directly to the models themselves, so community contributions show up as forks of the original model. This forces consumers to choose a "best-fit" model that isn’t easily extensible, and these forks are expensive for model creators to maintain.
  • Most people find it difficult to fork, train and refine models because they lack expertise in AI and machine learning (ML) technologies.
  • There is a lack of community governance and best practices around the review, curation and distribution of forked models.

Red Hat and IBM introduced InstructLab, a model-agnostic open source AI project that simplifies the process of contributing to LLMs. The technology gives model upstreams with sufficient infrastructure resources the ability to create regular builds of their open source licensed models.

These resources are not used to rebuild and retrain the entire model, but to refine it through the addition of new skills and knowledge. These projects can then accept pull requests for these refinements and include them in the next build.

In short, InstructLab allows the community to contribute to AI models without forking them. These contributions can be sent "upstream," which allows the developers to rebuild the original model with the new taxonomy, which can be further shared with other users and contributors.
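As an illustration, an InstructLab contribution typically takes the form of a small YAML file (a `qna.yaml` with seed question-and-answer examples) added to the community taxonomy tree. The file below is a hypothetical sketch; the exact schema and field names depend on the taxonomy version in use.

```yaml
# Hypothetical skill contribution (e.g. compositional_skills/.../qna.yaml).
# Field names follow InstructLab taxonomy conventions, but the exact
# schema varies between taxonomy versions -- treat this as a sketch.
version: 2
created_by: your-github-username
task_description: Teach the model to answer questions about open source licenses.
seed_examples:
  - question: What rights does the Apache 2.0 license grant?
    answer: >
      It grants users the right to use, modify and redistribute the
      software, subject to conditions such as preserving notices.
  - question: Is Apache 2.0 an OSI-approved license?
    answer: Yes, Apache 2.0 is approved by the Open Source Initiative.
```

A pull request adding such a file is reviewed by taxonomy maintainers; accepted examples are used to generate synthetic training data for the next model build.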

How does the freedom to modify and share help security and safety?

This allows community members to add their own data to the base model in a trustworthy way. They can also fine-tune the model's safety behavior through the taxonomy, which adds safety guardrails. The community can thus improve the security and safety posture of the model without repeating pre-training, which is expensive and time consuming.
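The workflow this enables can be sketched with the InstructLab CLI. The command names below are hedged assumptions: they vary across `ilab` releases, so check `ilab --help` for your installed version.

```shell
# Illustrative InstructLab workflow -- command names vary by release.
ilab config init        # set up a local InstructLab environment
ilab taxonomy diff      # validate new or changed qna.yaml files
ilab data generate      # create synthetic training data from them
ilab model train        # fine-tune the base model on that data
ilab model chat         # interactively test the refined model
```

The key point is that each step operates on the existing base model plus the taxonomy contribution, rather than repeating the full pre-training run.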

IBM and Red Hat are part of the AI Alliance, which is seeking to define what open source AI means at an industry scale in terms of governance, process and practice.

Open, transparent and responsible AI will help advance AI safety, giving the open community of developers and researchers the ability to address the significant risks of AI and mitigate them with the most appropriate solutions.

Learn more about InstructLab


About the author

Huzaifa Sidhpurwala is a Senior Principal Product Security Engineer for AI security, safety and trustworthiness on the Red Hat Product Security team.

