Who owns the stuff that generative AI models create? Who owns the data that went into it, and what are the rights associated with those inputs and outputs? Whether you're building something out of these community building blocks yourself, hosting a model locally, or using one of these third-party services, you have to think about the issue of ownership of the output generated by this technology. That is one way it is different from traditional software technology. If you've been using AI and haven't considered these questions of ownership and rights, it might be a good idea to pay attention. Because even though AI might seem like the Wild West, copyright laws are pretty well-established. This episode's all about understanding when copyright could be an issue when working with AI, and why. This is Compiler, an original podcast from Red Hat. I'm Johan Philippine. I'm Kim Huang. And I'm Angela Andrews. On this show, we go beyond the buzzwords and jargon and simplify tech topics. We're figuring out how people are working artificial intelligence into their lives. This episode scratches the surface of copyright in the world of AI. Alright, let's preface this by clearly stating that even though our guest for this episode is a lawyer, we are not. Please do not take what we say as legal advice. Alright, that's out of the way. Let's dig in here. There's been a lot of talk about copyright issues with AI models. Companies are getting sued left and right over alleged infringements. Angela, Kim, is this anything that either of you have been keeping track of, or at least seeing in the news? I'll let Angela go first because I'm thinking about it. Yes, I've been hearing about it, and it has been all the rage at this point. People are filing lawsuits about finding their copyrighted information, their IP, inside of these models. It is basically the Wild West right now. How are we going to build the precedents?
I don't know if there have been any court cases that have gone through trial and have had developments, and now we have precedent, but it just feels like this is all so new right now. This is only the beginning. Yeah. I want to get some clarity on this, Johan, because for this episode, we're talking about copyright materials being used to train AI models, right? LLMs. That's right. So there's a distinction between a model, a generative AI tool, for example, generating something that is copyrighted or it having some data set that it's pulling from copyrighted material that it's essentially just copying, pasting, and representing as an original creation of that model, right? Right. Okay. Yeah. There is a big difference in that. And we're going to get into that very specific issue. But because I'm not an expert, and I also wanted a lot of clarity on this, I spoke to a lawyer. The right person to speak to. Right. That's right. I spoke to Richard Fontana, who we heard from in the intro. He is a principal commercial counsel on the product privacy innovation team here at Red Hat, and he loves to talk about copyright issues. So we spoke to him in the past for another episode. He wasn't on the episode, but he helped kind of give us a little bit of background on it. Right. And he really wanted to start us out on this topic by explaining the first instance in which copyright law can come into play with AI models. This is technology that solves problems through a process of what's called training. So you have these things called neural networks that are trained on what is typically a vast amount of data, training data. For large language models, which is one of the types of AI models that has attracted the most attention in the past few years, that is a huge amount of data collected from all over the internet. These neural networks are designed to learn from this data, to sort of extract features in a sense, and detect patterns in this data. 
In the process, a kind of laborious engineering process of training a model, you end up with a model that is tuned to solving a particular type of problem. Right. So data goes in, and the AI learns from all of it. Once the AI is trained on that data, you ask a question or pose a problem, and the AI will provide a solution based on the data it's trained on, right? We went over this a little bit in the last couple of episodes. Some of these models are trained on texts, and they provide written answers. Others are trained on images, and they generate pictures. Right? You get the idea. Exactly. Now, the first thorny question is, did the model builder have the appropriate rights over the potentially copyrighted material used in the training? Are they allowed to use whatever they can find on the internet to train their model? This seems really gargantuan. How do you know what information out on the internet is available for fair use? If you're using the internet to train your models, it's vast! It's copyrighted, uncopyrighted, free use, open source... It's everything. But how do we navigate to find only the stuff that we are actually legally entitled to use? And if we're not going to do it as humans, can the AI do it? And the answer right now, I'm willing to bet, is a no. Yeah. So we've been talking about how AI is kind of a Wild West right now. But copyright law itself is very well-established. Right? So you're not allowed to just take whatever you want without checking to see if it's copyrighted, even if it's just for training your model. Because even the act of training your model, you're exercising that copyright. When you train a model, you are exercising the rights that are exclusive to the copyright owner necessarily because you have to make copies when you're training a model. You have to make copies of stuff. And so that's why there is possibly copyright infringement at all. It's in the word. Yeah. It's right there. Mhm. Yeah. 
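Richard's point above, that training necessarily means making copies even though the finished model keeps only statistics and patterns, can be sketched with a toy next-word model. This is a purely illustrative Python sketch, nowhere near a real neural network; the corpus strings are hypothetical stand-ins for potentially copyrighted works:

```python
from collections import defaultdict

# Toy "training": build next-word frequencies from a corpus.
# The copyright-relevant point: training code has to read the
# works into memory, i.e., it makes copies of them, even though
# the resulting model stores only counts, not the original text.
corpus = [
    "the cat sat on the mat",  # stand-ins for copyrighted works
    "the dog sat on the rug",
]

def train(documents):
    model = defaultdict(lambda: defaultdict(int))
    for doc in documents:  # each document is copied into memory here
        words = doc.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1  # keep only pattern statistics
    return model

def predict(model, word):
    # Return the most frequently seen next word, or None if unknown.
    followers = model.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

model = train(corpus)
print(predict(model, "sat"))  # prints: on
```

The model "extracts features and detects patterns," as Richard puts it, but the training step itself still had to copy every document, which is exactly why copyright can be implicated at all.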
Now, there's the saying that it's easier to ask for forgiveness than for permission. But that doesn't make it right. Right? We have a copyright system for a reason. Even open source, which is all about providing free information and free use of software, works within that system to facilitate that free sharing of intellectual property. In order to train a model on those unfathomable amounts of data, you need to have the right to use the data in the first place. That means getting permission from the copyright holder if there is one. In some cases, maybe no one owns it. In some cases, maybe someone has some sort of property claim, you know, intellectual property claim of some sort on it. Let's just limit it to copyright. Someone might have copyright on an item of training data, and if you have a huge amount of training data, much of it copyrighted, that is used to create this thing. So at the start of this series, we heard about the debate about whether quality of data or quantity of data is more important when training a model. Right? Christopher Nuland told us about that paper that came out in 2017 from the University of Toronto and Google working together. They definitely showed that quantity is absolutely necessary for large language models. And that quantity means a lot of data, right? Now, hopefully, these models can show what data they used and prove that they have the right to use it in their models. But given the number of lawsuits out there that have made the news, it could be that some of these models were trained on copyrighted information that they didn't have the right to use. So, if you're going to train a model, no matter how big, you've got to make sure that you have the right to use the data that you want to use. So this is probably an easier issue when we're talking about building models from your own data.
If I am insurance company A and I have all of this customer data, PII, I have all of these tests, I have all of this data in my insurance fiefdom, that is my data. But is it? You know, like, these are human beings. But we always sign away. They redact and make sure that PII and PHI isn't accessible. But again, maybe another company example would be better where you're saying, "Oh, I have all this data, I can use all of my data, customers," whatever. These are the easy cases. It's maybe for the layperson or this new application or this new startup that wants to do something amazing with AI, but they just don't have the access. So it's... you said it. Copy is in the word. I'm curious to find out how we find out if the data we're using has a copyright on it or not. And if not, will that become more prevalent as AI becomes more of this everyday word? And you hear about every new model that's dropping every week? How do we decide? I don't know, Angela. I don't know if I agree with that because even with PII, even with a kind of a domain-specific set of data, the people that are providing that data still have to kind of opt in. They have to kind of give their permission to share, even if it's personal data like their name or their address. They still have to kind of give the consent to share that data with the company. So if that... I feel like that agreement is kind of up in the air again. We're still in that Wild West territory with AI. Does that extend then to using an AI tool? Does that extend into an LLM? Right now, I'm thinking it might, but maybe not. In some cases, we might have to start putting in consent forms and permission forms to share for the explicit use in AI or explicit use in LLMs beyond what we've already kind of gotten into our culture where it's like you give us your name, you give us your blood type, you give us your address so we can serve you as a customer, say, for a healthcare company. Well, there's a couple of things there.
The first one, which we are going to get into in a couple of minutes, is that there are limits on copyright. You might be able to use the data to train your AI, but that doesn't also mean that you can share it with other people or things like that. And the other thing I wanted to mention is that opt-in: a lot of companies are starting to do that already in their end-user license agreements, right? A lot of people don't read those, and they'll be updated to say, "We will use this data to train our AI," or some of them are being a little bit more upfront about it and making it an opt-in service, saying like, "Hey, you know, if it's okay with you, you can opt in, you can check a little box, and we'll use this to make our own AI models better." And there are some other companies who are just kind of trying to sneak it in... ...and kind of doing it. There's a little bit of that back and forth that's going between some companies doing it a certain way and other companies kind of just sliding it under the radar. Right. You know how you get those emails saying we've changed our terms and conditions. I want everyone to check their email and search for that "We've updated our privacy policy." Exactly. Go back and read it and see. The ones you got to watch out for. Yeah. That's your homework. You're gonna look at it. You'll be surprised at how many you've already agreed to. Yes. Alright. That brings me to thorny question number two. When an AI model produces an output, who owns that output? Or more simply, who owns what comes out of an AI model? Yikes. For this part of the discussion, we focused on an example model for writing code. And so what's unique about that, I would say, is that you have this artifact, this technical artifact, we can call it software. So this piece of software that is produced in part, in large part, through a process that involves using data that is not owned by the person creating the model. That is different from...
in a certain sense, that is different from the way a kind of typical software application is created. Alright. Let's dissect that a little bit because it is kind of dense. But there's a lot of really useful information in there. If you're using a large language model to write code, it is very likely that the person who created the model trained that model on large amounts of data and large amounts of code that they do not own. They might have the right to use it in a model for training purposes, but that might not extend to actually sharing the code or granting the model's users access to that code. Oh boy. So I'll give you a little example. Let's say someone's building a large language model to write novels. Let's imagine this person, in order to train their model, got the rights to use all the books ever written. The extent of what they can do is often limited to training the model. This person doesn't now own these books or have absolute rights over their use and distribution. Right? This is what I mentioned just a minute ago, where copyright rights have limitations on how you can use them. Yes. Licenses. Exactly. You know, you can use it to train. You don't necessarily own that data. You don't have the right to then publish that information on the internet. Yeah. There's so many different variations of use that like, it's like use, you can use it, but you can't alter it or you can use it, you can alter it, but you can't share it, or you can use it and alter it and share it. Or sometimes you can just share, but only with attribution. It's so many different levels of it. 
And so that's different from how most software is made these days, where a coder either writes it from scratch, which is becoming more and more rare, or borrows it from open source software. That open source code is usually acquired directly from the rights holder on, say, GitHub or another repository, where the person or group who put that code up owns the code and has applied an open-source license so that other people can use it within certain restrictions. Right. So what are we saying here? Well, I'm sure there are a lot of our listeners out there right now who are writing software, and instead of banging their heads up against something that is not working, they're going to use some generative AI to get some code. They're going to copy and paste it, see if it works, if it does what they need to do, and, you know, ship it. I just found an answer to a problem. And is that what we're talking about here? Because if so, there are a lot of us that are very, very guilty at this point. We are so guilty. We're just gonna... Right. But I guess the difference here is that, like before the proliferation of generative AI tools specifically for people who code, people were encouraged, to your point, Angela, to go out on the internet and find, you know, code that they could use, code snippets they could use. But typically where those code snippets are housed and where that code is housed, they have some kind of user agreement or some kind of license, like derivatives or, you know, I always think of Creative Commons licenses because that's what I know, because I'm a creative person. But they have something there that's on like a web page or on a repository that says, here is how you use this thing. What this is doing is removing that part. Right?
So if you're putting a prompt into a generative AI tool to get a certain type of code, and that code has a license that is associated with it, that tool may or may not provide that license agreement for you to be able to understand how you can use it. It's just giving it to you without the context. I think that's the problem that Richard's talking about. It's slightly different from this. But let's listen to it real quick and then we'll elaborate and go into it. Now it is true that a typical software application nowadays, much of it, most of it is going to be using, reusing building blocks. You know, these days mostly open source, building blocks that do useful things or you can just kind of like plug into whatever you're developing. But this is a little bit different. This is the sort of the fundamental behavior of the AI model is sort of shaped and determined by this training process that involves vast quantities of data that is not owned by the person creating the model. And so, who does own it? That is the question. I think he's saying the same thing that I was saying. It's just, it's removing that extra kind of context of like, this is who owns the thing and this is how you can use it. It's removing that. It's taking that out of the context. Well, here's the thing. Some of these models will, and some of these very large models, they will explicitly have that at the bottom or in the license agreement, they will say any output generated by this model is owned by the person who asked the question. Ooooh. Right. So if you are generating, if you're putting prompts into an AI model, anything that the model produces they're allowing you to use and they're saying, we don't have any ownership rights over that. Phew! Right. So it can get pretty messy with all these layers of ownership.
But in terms of who owns what, the data, what data, or what output is coming out from that model, some of these models very clearly state that the outputs that are created are owned by the person who asked the question and not the company that trained the model. So they're saying if you use our model to write code, we do not have any copyright claim over that code. We have absolved ourselves. Exactly. Now, whether or not that code looks like some of the code that was in the training data to begin with, that's a whole different story that we're about to get into. But before we get to that, I just want to reiterate that if you are using these models to generate code, generate texts, whatever it is, and if you want to use it for yourself or for your business, make sure that you check that specific model's license agreement and see if they retain the rights over the outputs or if they give those away. It should be clearly stated. And if it is not, then I would err on the side of not using it. Not using it, exactly. Because typically with copyright law, it'll revert to the originator of that content, not to whatever else you might... The middle person. They just provided it. You decided to use it and didn't do your due diligence. And here comes the copyright lawyers knocking at your door. Exactly. Right. And I'm going to add another wrinkle to this, a little asterisk because Rich says mostly open source. And I'm gonna put a little asterisk mostly is very important because some of that code... ...it might be proprietary, it might be copyrighted. So it's like mostly it might be proprietary code that's in there. Now that's a whole other kind of hot button issue. Right. But, again, AI models remove the context. What we're talking about here with copyright and ownership is context that we have as humans; machines don't understand the context. And we have a machine that's essentially removing that context from the conversation. 
So that's what's really important here to understand. Yeah. It gets very messy with the different layers of copyright ownership and permissions and rights holding. You know, for a lot of these models that have just so much data that it would be impossible to go through all of it, we can't know if they have all the rights to use the data or the code that they used to train their model. Hopefully, if you're using a model to do that, they can kind of clearly document, even if they're not necessarily able to show the code, that they have the copyright to use the data that they used to train the model. But that's not going to be the case for a lot of these large language models. Now, after the break, we're going to get into that thorny thing about what happens when the output of a model looks too much like the input, right? If it's just spitting out what it's been fed... Oh boy. Not good. Oh boy. Another aspect of this is that it is a feature of this technology that, in some circumstances, the output may resemble some of the training data that was the input during the training process. That is not how these models are designed. They're not designed deliberately to have that behavior. We've been talking about this a little bit. Right. We don't want the models to be just spitting out the same thing that's put into them, because that's just going to go into a whole host of copyright infringement issues. And so they are designed to avoid doing that. But are they perfect at that behavior? Despite the precautions, there's still so much we don't understand about how these AI models actually work. And sometimes that black box does the thing it's not supposed to. That's an unintended consequence. Yes, absolutely. I think an unavoidable technical feature of this technology is that under some circumstances, you are going to see some similarities between the output and the input. It may be rare.
It does seem to be rare for the commonly used models that I'm familiar with, but there are examples where it does seem to happen. And so that tendency creates another issue, because even if you have the right to use third-party copyrighted material to train a model—and that's itself a question—but you may have that right, that doesn't mean you have the right to have a model that emits that training data under some circumstances. We've been talking about this point several times now: you have limited rights over what data you can use and how you can use it to train your models. You can have permission to use copyrighted material to train, but that doesn't give you the right to redistribute that material. Right? So say someone asks for a recipe, chocolate chip cookies. It's not supposed to give you a word-for-word reproduction of one of the recipes that was used to train the model. Do not trust the recipe. Don't trust it. Uh oh. Wait, what? Andrews, I kid! I mean, you know. Okay. Yeah. What if it's like the perfect recipe that was used to train the model, and the model's like, "I'm just going to take a little license here on this yummy chocolate chip recipe." And then you go and make it, and you're like, "Eww, what in the world?" I'm gonna go ahead and show you. Exactly. Yeah. Prompter beware. That's all we can say. Yeah. For recipes, I would be a little... A lot skeptical. ...a lot skeptical. Because if you're combining different recipes, how do you know that the amounts are going to be correct? Yeah. You just got to do a trial and error on that one. Might as well get that straight from the source. I don't know if generative AI's going to be the best for recipes at this point. Maybe it will be in the future. Who knows? But for now, you know, just find something on the internet rather than asking a large language model. But that's the idea, right? On the one hand, you don't want it to spit out someone else's recipe word for word because that's not okay.
And on the other hand, you know, you don't know if it's going to be the right thing. So, my next question for Richard was, how does all of this discussion affect me as a user? Right. I don't necessarily know if the code or the recipe I'm getting from an AI model is the original or if it's a copy. And if I'm trying to use it for something other than personal reasons, you know, is there a surefire way to know if it's okay or not, if it's a copy, or if it's alright for me to use it? How do you know? You don't... you don't know for sure now. Now one, because there is awareness of this problem, at least in the code-generating tool area, some providers of services have provided tools that are designed to help you figure this out. So that might be... I'm not sure if that's the same as the corpus of training data or if it's something else. But it tries to identify certain matches to existing code, third-party code in public repositories, you know, basically like code that might be public on GitHub. So you can't know for sure, right? Unless you do a little bit of research by yourself. You know, if you're given some pieces of code, you can kind of see, do a search, see if it can be found somewhere else. But you need to at least do a little bit of due diligence yourself to find out if the tools, the AI tools that you're using, are producing code that is copyrighted or not. There's sleuthing to be done and not just changing variables and removing comments and changing function names. Yeah, there's going to be more to it than that. Yep, yep, yeah. There's a certain degree of intuition a coder should have when it comes to using code from the internet, and that's especially true now. Richard had a really great example to share. It's sort of like someone just walked over to your workstation and dropped some code into your editor, and you don't know where they got it from. And so you have to think about, you know, is this like a pretty trivial piece of code or is it extensive? 
And if it's extensive, does it feel like something that looks like it might have been copied? There might be something in the suggestion that makes you a little bit suspicious. And you can then maybe search around. But like we were just saying, look around to see if you can find that code anywhere. It pays to be thorough because even if you find it in another place, that doesn't necessarily mean that's what the model was trained on. They might both have been derived from another piece of code entirely. There's another fundamental aspect of copyright law that could be applied to generated code. Copyright protects expression, but it doesn't protect ideas. And it doesn't protect mere functionality. So in software, sometimes it is hard to tell whether something is expressive or is merely functional. So that is... that's an issue that comes up quite a bit in traditional sort of software copyright analysis. And it could come up in AI. So, like, Johan, can you tell me what that means? I don't really know what that means. So expressive versus functional. Right. Let's do a quick example. Say you want to do a Hello World kind of program in a language that you're learning about. There are only so many ways in which you can write that Hello World program. Right, exactly. So the function of that program is, you know, Hello World. You're printing out Hello World to your screen, right? That's the function of it. The expression of it is exactly how you put that, how you wrote that program. Okay. So the exact terms you use, I mean, you don't really use a lot of variables for a Hello World, but for this example, right? It's the exact way in which you code that program, that's the expression of it. Okay. So let's take another example. Yeah. Let's please do another example because I feel like my difficulty with Python has been well documented on the show. So let's do something else that's a little bit more in my wheelhouse. Little higher level.
Let's say you want to write a camera application for a phone. Right. You write a program that implements a camera feature like taking a picture, zooming in and out, applying filters, those things. The code you use to write the application is copyrightable. That's your expression, that actual code. But the idea of a camera application is not. That's the functional side. So someone else could write their own camera application and not infringe on your copyright if they don't use your code. Okay. Much better. So that's the difference between expressive and functional, right? Okay. Great examples. Thank you. So copyright covers the expressive part, not the functional part. I got it now. Alright. We've covered a lot today. We haven't covered everything, let alone the consequences of copyright infringement. And I asked Richard what to do if you're planning to use AI-generated outputs for something other than personal use. First of all, if you have a lawyer you can talk to, you talk to your lawyer. But if you don't, if you don't have a lawyer to talk to, you'll have to kind of try to figure this out on your own. So think about how extensive is the thing that you want to use? So maybe that's a piece of code that a generative AI tool has produced in response to a prompt. Or maybe it's, you know, I keep using the example of code because I think it's so interesting, but it could be an ordinary natural language response to a prompt. You know, the more extensive that output is, the more on your guard you should be, I think. I would not exaggerate or overemphasize this type of risk. It is a real risk. If you have a lawyer to talk to, talk to your lawyer. They'll be much better at kind of parsing the situation and figuring out, "Hey, can I use this? Is it okay? Is it not okay?" Sound advice. Yeah, exactly. You know, don't take our advice. Right. Just don't do that. And don't use generative AI as your de facto lawyer either. Don't ask it those questions. Don't do it. Yeah. 
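Going back to the camera-app example for a moment, the expressive-versus-functional split can be sketched in code. A small hypothetical Python sketch: two independently written functions with identical behavior (the functional side, which copyright does not protect) but different code (the expression, which it can protect):

```python
# Same functionality, written two ways: clamp a camera zoom level
# to a valid range. Copyright would cover a particular expression
# (the exact code), not the underlying idea of "clamp a zoom level",
# which is functional. (Hypothetical example, not legal advice.)

def clamp_zoom_v1(zoom, lo=1.0, hi=10.0):
    # Expression A: explicit branching.
    if zoom < lo:
        return lo
    if zoom > hi:
        return hi
    return zoom

def clamp_zoom_v2(zoom, lo=1.0, hi=10.0):
    # Expression B: the same behavior via min/max.
    return max(lo, min(zoom, hi))

# Identical behavior, independently expressed:
for z in (0.5, 5.0, 42.0):
    assert clamp_zoom_v1(z) == clamp_zoom_v2(z)
print("same function, different expression")
```

When the behavior is this constrained, like Hello World, the expressions converge and there is little room for copyrightable expression; when the behavior is richer, like a whole camera application, the expressions diverge and copyright has more to attach to.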
In the absence of those two options, if you are using generative AI and you don't have a lawyer to talk to, you got to be really careful about how you're using it. You have to judge if it's a rather large piece of code that the generative model has produced, or if it's a smaller one, right? You have to consider whether there are only a few ways of doing what the model is doing, or not. Again, this is the expressive versus functional. There are a lot of things that you have to go through and really kind of dot your i's and cross your t's, right. However, as much as we've been talking about copyright issues today, the intention is not to send everyone into a panic. Was it though? But everything we've seen so far suggests that this is rare, and I think maybe it's rare in part because of the fact that what makes these models powerful is that they're trained on such large quantities of data that you're not going to... it's unlikely that any particular output of a generative model is going to actually resemble any particular input in a non-trivial sense that would rise to the level of copyright infringement. It's a numbers game. It's so much data that it would be hard to extrapolate from all this data a carbon copy of something that was used to train because of the size of the dataset. Hard but not impossible. Yeah. Yeah. The bigger the model, the less likely you are going to be to run into copyright infringement issues or getting an output from the model that really closely resembles an input. Okay. Right. I mean, this is something that we've been talking about. It goes back to that paper that we keep going back to about the quantity versus the quality of the data. The larger the quantity of data, the more things got kind of mixed up and reshuffled and used in ways that are somewhat new. Derivative. Derivative, but not carbon copies of what's gone in there. But it is a real risk, right? I mean, these things do happen.
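As for checking whether an output "really closely resembles an input," one crude form of due diligence is a word n-gram overlap check against text you already have, say a suspect source you turned up through a search. A minimal Python sketch, assuming you have the candidate reference text locally; the provenance-checking tools Richard mentioned are far more sophisticated than this:

```python
def ngrams(text, n=5):
    """Set of word n-grams from the text, normalized to lowercase."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output, reference, n=5):
    """Fraction of the output's n-grams that also appear in the reference.
    A crude proxy for 'this output reproduces that known text verbatim'."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(reference, n)) / len(out)

# Hypothetical example texts:
reference = "cream the butter and sugar then fold in the chocolate chips"
verbatim = "cream the butter and sugar then fold in the chocolate chips"
fresh = "whisk the eggs with honey and stir in chopped walnuts gently"

print(overlap_ratio(verbatim, reference))  # prints: 1.0
print(overlap_ratio(fresh, reference))     # prints: 0.0
```

A high ratio is a reason to be suspicious and dig further, not proof of infringement; as noted above, two texts can also both derive from a third source, and simple renaming or reformatting defeats a naive check like this one.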
You know, again, the larger the model, the less it's going to happen. But the corollary here is that the smaller models might be more likely to run afoul of copyright protections. Right. And with what we've been talking about in our previous episodes, smaller models look like they're getting more and more popular, right? They're cheaper to run, they're more specialized, they're more domain-specific. Right. So if you are building a smaller model and you are using material that might be copyrighted for training, you're going to want to be extra careful of any potential outputs looking like the inputs, right? Right. At the very least, get legal representation and make sure that you're covered. 100%. Yep. Sounds like sound advice. Alright, so that does it for today. Yeah. If you use AI for more than personal use, be aware of the copyright issues we've been talking about. Do even more research because this is, again, just scratching the surface. And make sure you've got all your legal ducks in a row. Research, research, research! Don't just take things at face value. Definitely do your due diligence as you would with everything else. Exactly. And to all of our listeners, we know this topic is hot on your minds. We want to hear about it. How are you using it? Are you bumping up against these types of issues in your work, and how have you remedied them? Or if not, are you curious about, well, what does this look like? Where are the lawsuits? Like, where do I go to find out more? You have to share this all with us. So hit us up on our socials at Red Hat. Always use the #compilerpodcast, and we would love to hear what you have to say about this. And that does it for this episode of Compiler. This episode was written by Johan Philippine. Victoria Lawton is keenly aware of copyright law. Thank you to our guest, Richard Fontana. Thank you so much for listening. Compiler is produced by the team at Red Hat with technical support from Dialect.
Our theme song was composed by Mary Ancheta. If you liked today's episode, please follow the show, rate the show, and leave a review. Share it with someone you know. It really helps us out. Thank you so much for listening. Can't wait to hear from you again. Until next time. See you.