Artificial intelligence (AI), especially generative AI (gen AI), offers immense opportunities for open research and innovation. Since its inception, however, AI's commercialization has raised concerns about transparency, reproducibility and, most importantly, security, privacy and safety.
There have been many debates about the risks and benefits of open sourcing AI models, familiar territory for the open source community, where initial doubt and skepticism often evolve into acceptance. However, there are significant differences between open source code and open source AI models.
What is open source AI?
The definition of an "open source AI model" is still evolving as researchers and industry experts continue to delineate its framework. We don't intend to dive into that debate or define its parameters in this article. Instead, our focus is on showing how the IBM Granite model is open source and why an open source model is inherently more trustworthy.
Open source licenses
A fundamental part of the open source movement involves publishing software code under licenses that grant users independence and control, giving them the right to inspect, modify and redistribute the code without restrictions. OSI-approved licenses like Apache 2.0 and MIT have been key to enabling worldwide collaborative development, freedom of choice and accelerated progress.
Several models, such as the IBM Granite model and its variants, are released under the permissive Apache 2.0 license. While several AI models are being released under permissive licenses, they all face a number of challenges, which we discuss below.
How does an open license help security and safety?
This relates to the core principles of open source. A permissive license allows more users to use and experiment with the model, which means more security and safety issues can be discovered, reported and, in most cases, fixed.
Open data
The term "large" in "large language model" (LLM) refers to the large amount of data required to train the model, as well as the many parameters that constitute it. A model's efficacy is often measured by the number of input tokens (often trillions for a good model) used to train the model.
For most closed models, the data sources used to pre-train and fine-tune the model are secret and form the very basis of differentiation from similar products created by other companies. We believe that for an AI model to be truly open source, it is important to reveal the data used to pre-train and fine-tune that model.
The corpus of data used to train Granite foundation models is documented in detail, along with the governance and safety workflows applied to the data before it is sent to the training pipeline.
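To make such a workflow concrete, here is a minimal, purely illustrative sketch of the kind of filtering a data governance pipeline might apply before documents reach training. It is not IBM's actual pipeline; the blocklist terms and function name are invented for the example:

```python
# Toy pre-training data filter: drops documents that match a hazardous-content
# blocklist and removes exact duplicates by content hash. Illustrative only;
# real governance pipelines use far more sophisticated classifiers.
import hashlib

BLOCKLIST = {"password:", "ssn:"}  # hypothetical hazardous-content markers

def filter_corpus(docs):
    seen = set()
    kept = []
    for doc in docs:
        text = doc.lower()
        if any(term in text for term in BLOCKLIST):
            continue  # drop documents flagged as hazardous
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        kept.append(doc)
    return kept
```

Because the data and the workflow are open, anyone in the community can inspect, reproduce and improve steps like these.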
How does open data help security and safety?
The kind of data a large language model generates during inference depends on the data with which the model was trained. Open data provides a way for community members to examine the data used to train the model and verify that no hazardous data enters the pipeline. Open governance practices also help reduce model bias, since biases can be identified and removed as early as the pre-training phase.
Freedom to modify and share
This brings us to the challenges of models released with permissive licenses:
- Because of the way these models are created and distributed, it's not possible to contribute directly to the models themselves, so community contributions show up as forks of the original model. This forces consumers to choose a "best-fit" model that isn't easily extensible, and these forks are expensive for model creators to maintain.
- Most people find it difficult to fork, train and refine models, because doing so requires deep knowledge of AI and machine learning (ML) technologies.
- There is a lack of community governance or best practices around review, curation and distribution of forked models.
Red Hat and IBM introduced InstructLab, a model-agnostic open source AI project that simplifies the process of contributing to LLMs. The technology gives model upstreams with sufficient infrastructure resources the ability to create regular builds of their open source licensed models.
These resources are not used to rebuild and retrain the entire model, but rather to refine it through the addition of new skills and knowledge. These projects can then accept pull requests for those refinements and include them in the next build.
In short, InstructLab allows the community to contribute to AI models without forking them. These contributions can be sent "upstream," which allows the developers to rebuild the original model with the new taxonomy, which can be further shared with other users and contributors.
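As an illustration, an InstructLab contribution typically takes the form of a qna.yaml file of seed question-and-answer examples placed in the taxonomy tree. The file below is a sketch; the field names follow the general shape of the schema, but the values are invented and the exact schema may differ between InstructLab versions:

```yaml
# Illustrative InstructLab skill contribution (qna.yaml in the taxonomy tree).
# Values are examples only; consult the InstructLab taxonomy docs for the
# authoritative schema.
version: 2
task_description: Answer questions about permissive open source licenses.
created_by: example-contributor
seed_examples:
  - question: Which license is the IBM Granite model released under?
    answer: The IBM Granite models are released under the Apache 2.0 license.
  - question: What rights does the Apache 2.0 license grant?
    answer: It grants the right to use, modify and redistribute the work,
      subject to the conditions of the license.
```

A contribution like this is reviewed as an ordinary pull request, then folded into the next build of the model rather than spawning a fork.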
How does the freedom to modify and share help security and safety?
This allows community members to add their own data to the base model in a trustworthy way. They can also fine-tune the safety parameters of the model through the taxonomy, which adds safety guardrails. The community can likewise improve the security and safety posture of the model without repeating pre-training, which is expensive and time consuming.
IBM and Red Hat are part of the AI Alliance, which is seeking to define what open source AI means at an industry scale in terms of governance, process and practice.
Open, transparent and responsible AI will help advance AI safety, giving the open community of developers and researchers the ability to address the significant risks of AI and mitigate them with the most appropriate solutions.
About the author
Huzaifa Sidhpurwala is a Senior Principal Product Security Engineer - AI security, safety and trustworthiness, working on the Red Hat Product Security team.