
Episode 71

Chasing Its Own Tail

AI 101


Show Notes

With the massive flow of AI-generated content onto the internet, it was only a matter of time until all of those bits of data found their way back into AI models. But what do you get when generative AI models start getting their answers from that content?

The Compiler team digs into AI feedback loops, and the unique challenges they present for technologists...and everyone else.

Transcript

You've probably seen it somewhere before in fiction, in school, or just floating around on the internet: an image of a snake eating its own tail, uroboros. It's usually a symbol of a never-ending cycle. But when it comes to artificial intelligence and data, a never-ending cycle can be problematic. With human guidance, large language models can be fine-tuned or corrected when issues arise. But with the massive flow of AI-generated content onto the internet, it was only a matter of time until all those bits of data found their way back into algorithms. AI content informing new AI content. An AI uroboros. How does this happen, and what are the potential consequences of generative AI getting its answers not from authoritative data sets, but from itself? This is Compiler, an original podcast from Red Hat. I'm Kim Huang. I'm Johan Philippine. And I'm Angela Andrews. On this show, we go beyond the buzzwords and jargon and simplify tech topics. We're figuring out how people are working artificial intelligence into their lives. Today, we're tackling AI feedback loops. Let's dive in. Alright, everyone, I'll be honest. We have a lot of ground to cover, so I'm going to get right into it. Let's get into it. Let's go. For the purpose of this topic, I spoke with someone who I feel has a lot of... well, let's just say interest and also expertise on this topic. She has that cred? Yes, definitely. My name is Emily Fox. I am the new portfolio security architect in Product Security at Red Hat. Ooh, that was a mouthful. She is that girl. Okay? Yes, yes. The proliferation of AI tools means security and AI are colliding fast. That's where Emily sits. There's so many moving parts and pieces associated with the artificial intelligence space when it comes to security. It's an interesting kind of nexus in that traditional software engineering and technical conversations around software vulnerability management have to shift slightly to cover more data privacy, bias, influence, and impact kind of concerns that we typically don't get in traditional software security conversations. I want to call attention to the part where she says "traditional software security," because we're going to get back to that, but it has a lot to do with the subject matter here. But first, I asked Emily to break down how generative AI feedback loops became a thing in the first place. And she says it goes back to the birth of the internet itself. Originally, internet access and network connection systems were only available to academia, researchers, military, governments, things of that nature. So there was an expectation in the rigor and the quality of the content that was being produced and shared being from an authoritative source. So if you're reaching out to a university online to know what their latest article is, they're connected. They're on the internet. They're automatically validated and authorized to be there because they have the connectivity and the funds to be able to do it; therefore, the content that they're producing is considered trustworthy. So that's pretty much how the internet started. For those of you who may be too young to remember or just kind of lost the context of it, I want to level set here. We're talking mostly about large language models that are pulling data points from the internet, right? These large models, these tools that people are using to make AI-generated content, are pulling data points from the internet itself.
There's a lot of models out there that do this, but I just want to say that to start with. But the context here is important. Emily's talking about how the internet kind of came about for those of you who are maybe too young to remember. As soon as we made that more commercially available and more open to the public, now you have new personas coming online and sharing their knowledge, their information, their perspectives, and their biases around what information exists and how something could be done. This was great because it allowed us to start getting historical information brought back into the mainstream for discussion, but it also allowed people to start spreading opinions and beliefs that may not be founded in fact. Ye olde internet. Yes. It's what we've become. You know, for me, the internet was huge as a kid, and it created, just like Emily's saying, a kind of democratization of information. You didn't have to rely on other people telling you things. There were websites for everything and everyone. Every interest that I had, whether it was comic books or video games. But not everyone is an academic, right? So you got content that was circulating in chat rooms, online boards, websites, and a lot of it wasn't true. And as more and more of these sites and contents and posts and chat servers kind of stood up and were available online, people needed to understand how to navigate what is still considered a source of valid and authentic information versus what is an opinion or belief. And in the face of more and more information being generated online, it gets harder and harder to understand whether or not the content that you're reading is actually from an authorized source. Okay, so we have all the parts here. We have the early version of the internet that's built on authority and authoritative sources. We have the evolution of what essentially has become the internet today. And then somewhere in between there and now, we have the birth and proliferation of social media, which kind of changed and evolved the way people interacted with the internet and with each other. And then we got to today where we have these large language models, these AI models that people can use to generate content, and they're using it a lot. There's a lot of people using AI tools to generate everything from images to text to even video. And where are they sharing this AI-generated content? Where else do you share anything? The internet. Of course. We're not mailing letters. No, we're putting it on the internet. That's right. Mail's too slow for the memes. Well yes. By the time it gets there, they're not going to get the joke. We've produced all of this material content online that models can go and self-train on or be tuned against. And as long as there's still those beliefs and that data set that's not validated, it hasn't been verified against a fact, it can continue to consume and build and train itself. There it is. That's where this comes from, the feedback loop, a churn of AI consuming and then redistributing content made via an algorithm via AI. The AI uroboros. This is even worse than when you'd look at your phone and someone would say, "Text this to 15 people." And just put it all out. This is so much worse than that. Oh. Way worse. Yes. So I mean, for me, it's... like I said, the internet was a huge thing for me when I was a kid. It changed the way I interacted with the world. I grew up in a very small town. It changed the way I looked at everything. So I was just like, really like this. 
Kind of, like, good and bad. Good angel, bad angel situation on my shoulders where I love the internet. I don't really love what it's become over time for many reasons I won't go into here. But then with the advent of generative AI, where you can type a line of text, about anything, into a tool, and it can give you whatever you want, it can produce whatever you want on the other side. Based on this huge, huge kind of database of information that it has to interpret what you're asking it, it's so tempting and fascinating. But at the same time, that content that it's creating goes back into it... it's that eating its own tail. And when you have that, it introduces a whole host of problems. Well, it takes all the work out of making that content in the first place. Right? It does. It does. Like I remember trying to figure out how to do everything from DIY projects around the house to learning how to draw. And now you can just type in, you know, "draw a classical still life painting of fruit and a vase," and it just does it. Me trying to figure out how to do it on my own in analog, you know, those days are gone. Well, yeah. But then again... and the picture will look pretty good. Pretty close to what you want. But a lot of the images that I see coming out of AI, you know, it has maybe a little trouble with drawing hands and fingers and the appropriate number of appendages. And there's always something that's a little off. Right? And that can be a little bit of a problem too when it keeps training itself on the stuff that it's putting out. Right? Yes. It's funny you should mention that because Emily is going to give us a really good example of what this could look like in practice. Johan, if you remember our conversation with her? Is this the corn on the cob thing? It is the corn on the cob thing. Oh, man. Yeah. So Emily gave us an example of something that happened when someone was asking an AI tool or kind of an AI model to produce an image of what the human digestive tract looks like when it digests corn. You start to see these images pop up online of, like, corn going through the digestive tract. And then eventually, corn is the digestive tract, and it keeps going down this path. And if you've seen images of fingers and hands online from generated AI models, you kind of get what I'm getting at: it just keeps producing more garbage content out there. And then we have like this false set of information around whether the human digestive tract is actually made out of corn. Maybe. Maybe not. But if you don't know and you don't have that medical background or experience, or just genuinely don't understand how humans consume food, you're going to actually potentially believe that that's true. We went from academics and the military and, you know, authorities putting information out on our brand new internet. And then we came along and we start putting our two cents in it and, you know, maybe regurgitating what we think something is, stating it as fact. And let's fast forward 30 years and just look and see what we've done. I don't know what to call it. Like, the corn on the cob visual in and of itself was enough for me. It's a tough one, right? And this, I think, stems from this little factoid that I've heard repeated a lot on the internet even before AI became prominent. It was that you can't digest corn or, you know, it doesn't get digested, which just doesn't seem true. Yeah. It's not true.
Like, you eat corn, you get some of the nutrients, and maybe not as much as from other foods, I don't know. We'd have to do some research. But people ask the question and, you know, the AI goes out there and finds all this false information and then kind of just self-reinforces and goes in this really weird direction. And again, we don't keep this content to ourselves. Right? We share it. Maybe we do it for the lols, you know? I mean, when Emily said that corn becomes like that digestive tract, I was like, what? That image is very funny to me. I don't know why. But in less extreme examples, maybe we share it because we do think it's true. Not this way. No, not this way. Okay, let's set that record straight. No, but this is something I've said before on Compiler: computers don't understand why we like something or why we share it. They just know that we are. That could come across as positive confirmation from an algorithm's perspective. And that's when things get worse. Because then this erroneous content starts to show up when people are asking genuine questions around the topic. When you have your first five search results that are just crap content, you're going to end up spending more time looking for something that is valid. And it gets even harder the more material content that's being produced. So instead of those five being invalid, it's going to be ten the next time, and then it will be twenty the time after that. This is already super frustrating for me personally because I'll look for something like, how do you do this? Or what does this mean? And then you'll see the links, and it's like, I don't know what these sources are. I'll check them out just to see if they're legit. And you get this bunch of gobbledygook on the page that is just barely intelligible. And then you're like, oh, now I've got to go back and keep looking. And it's just a huge waste of time. It really is. So the old adage of trust but verify is going to get harder and harder. That's what we're saying. Because when you're searching for something on the internet, you're trying to confirm something or learn something, and then the results are just coming back gibberish, it gets worse and worse. Are we going to lose our quote-unquote trust in the internet? Is that where we're going? Because we can't believe anything that we read or see? Right. And you have to think about the different kinds of parties at play when you're thinking about the internet as a data set. We're looking at it like this: the LLMs are pulling information, scraping data from the internet, and then presenting answers based on whatever the person asks, which is an uncontrollable factor because people can ask anything. But you have to think about the different parties at play. There are websites made for companies like Red Hat, websites made for private businesses, websites made for actual scholarly pursuits and nonprofits. There are websites for everything, every interest under the sun. So Johan's point about looking something up and it being a bunch of gobbledygook on the page—it's black hat SEO, it's SEO dumping, it's keyword dumping and keyword stuffing. It's something that happens a lot in web content where you're trying to get as many views on your page as possible. And so you're writing a bunch of gobbledygook. Like if you ask a question like, is random celebrity married? And then you'll get a page of like the question being asked over and over, and then you'll get a page of keywords that are just gibberish.
And then you'll get maybe an answer or maybe just a paragraph that's just a bunch again of keywords like marriage, celebrity name, other person's name. And it's all just put there. That's called keyword stuffing. It's something that has existed since kind of like the internet started becoming a place where people could sell things. And because it attracts attention, it's kind of gaming the system of search engines to make it so that your page is at the top of the results. So the people that are doing that, all they care about is getting as many people to their page as possible. They're not thinking that one day an AI model will scrape this page and use it to present answers to a random person about a random topic. But that's exactly what's happening. So when you have all these different parties—private citizens, regular people, technologists, companies, people trying to make money, anyone that's using the internet—they're using it for their own purpose and their own reason. But that reason may not be related to what a person is asking to find out about or what a person is searching for, in your case, Johan. But it's interesting, Angela, because you mentioned trust, and trust in the internet. I want to take this in a different direction. We're talking about trust. It's interesting that the word trust has come up, and the root of the problem here might be too much trust in AI itself. We're going to talk about that. And the one fact that scares me the most about artificial intelligence when we come back. Where we left off was with the question of trust. Angela asked if the issue is trust in the internet, but I don't know if that's the issue here, insomuch as it's trust maybe in AI itself. Too much trust, maybe. And then there's that one thing that sets generative AI in particular apart from traditional software. It's programmed to answer no matter what. Emily has an excellent analysis of the difference between artificial intelligence models generating material versus traditional software engineering. In software engineering, we program for an expected and intended use, and then all other cases are considered errors. It's explicit: if you perform this operation, these are the corresponding configuration options and responses that you're going to get. With artificial intelligence systems, we don't have that. It's kind of anything because humans are creative, and they're going to ask all sorts of interesting and fascinating questions to interact with the model. You should see my history. I ask some of the dumbest questions, and I don't regret it. Some of the answers are always like, hmm, okay. But if I'm doing something where I feel that the content and its truth matters, you really have to sort that out. That's why I say trust but verify, because you really can't trust everything that you get back out of AI. I don't think that's the intention. Not for generative AI. Like, think about the fingers, right? The fact that that happens kind of makes sense. Sometimes, especially when you're asking programming questions... you know, there's one in particular, it's the Ansible one, and all of you Ansible heads out here probably know which one I'm talking about. It'll return a result and say, sure, you can do that. And people who are learning it are like, oh, I can do this. And then you realize, oh no, this doesn't work. You really have to trust but verify. It's going to become more difficult if the models are going to be trained, like you said, on this information that we're putting in.
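To make the loop the hosts are describing a little more concrete, here is a toy sketch in Python. It is not from the episode: the "model" is just a Gaussian fit, and the sample sizes and generation counts are arbitrary. But it shows the mechanics of a model being retrained, generation after generation, on nothing but its own published output.

import random
import statistics

def fit(data):
    # "Train" the toy model: estimate a mean and standard deviation from the data.
    return statistics.mean(data), statistics.stdev(data)

def generate(mean, stdev, n):
    # "Publish" n synthetic samples drawn from the fitted model.
    return [random.gauss(mean, stdev) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # generation 0: human-made data

for generation in range(30):
    mean, stdev = fit(data)               # train on whatever is currently "online"
    print(f"gen {generation:2d}  mean={mean:+.3f}  stdev={stdev:.3f}")
    data = generate(mean, stdev, 100)     # the model's own output becomes the next training set

With no fresh human-made data mixed back in, nothing anchors the estimates to the original distribution: they wander from generation to generation, and run long enough, the spread collapses and rare "tail" examples stop being reproduced. That is the statistical version of the corn-becomes-the-digestive-tract degradation described above.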
You put garbage in, garbage out. That's where it's kind of heading, it seems. Yes. I really want to zoom in on this because this is what caught my attention the most when I was doing research for this story. So the idea of an error is very important. So, Angela, I'll ask you: what is the purpose of an error in just basic software development? An error tells you that you did something incorrectly. Exactly. It's not going to work the way that you intended it because there's something in your logic that isn't... the computer can't follow. Right. So I'll use an example, and you're going to be shocked. It's video games. I'm not shocked. So imagine you're playing a platformer, and you jump, and your character's jump is too short, and it falls into a pit, and you get a game over. Usually, the game over screen comes up, and then you go back to wherever you started, like wherever that one pit or that one obstacle was right before the point of failure. Right? And then from there, as a player, you have to think, well, maybe there's a double jump, or maybe there's some kind of hidden platform I need to jump on to in order to get over this pit. Or maybe I'm doing something else that's wrong. Maybe I'm not holding the button long enough. But those errors, like game over, are meant to teach you as a player what you did wrong. Debugging. So you can go back and correct it. Exactly. So... but when you're in a situation—and this is kind of what's happening here—where there is no clear error in the information that you're presented with... your Ansible example, Angela, I think was genius, because for a person like me who doesn't know Ansible, I would take that and be like, okay, yeah, I can do this thing that I'm just learning how to do, and it's fine. I don't have to verify it because why would the AI lie to me? You know, that's the situation that you get when you're trusting too much in AI and you're not verifying the information. And then from there, the actual error itself is not presenting itself. The AI is not saying, well, I don't know how to do this. It's saying, well, here's a way that I found based on information, and I have no idea where it came from, and you have no way of verifying it, because with AI-generated content it's really hard to find a source of origin. You've lost the thread, so to speak. So you have a situation where you have software for the first time in our human history that gives you an answer no matter what. That answer might be right, or it might be wrong. And even if it is wrong, you don't know what's wrong, and you have no way of checking it because you don't know where the information came from. Well, you can't even go to the source. It doesn't know the source. It can't cite the source, or it won't cite the source. It can... maybe there is a source you can look at, but it would take your entire human life. And then, say, you take this wrong information and you don't try it out yourself and you publish it. That's, again, it's starting to self-reinforce. So it then gets published, it gets shared. And then the AI later scrapes again and says, like, oh, here's the thing where I see this answer. I don't know if it's right or wrong, but it's something that's posted on the internet. So I'm going to go ahead and, you know, note it as an option. We've lost a level of scholarship where we would assume if you're going to write something as some sort of source or a resource for someone, you'd want whatever you put out there to be based in fact and truth.
The fact that we're using generative AI as the source of information, not having any idea where the information came from, we can't cite it, we don't know anything, but we're taking it at its face value and putting it back out on the internet as some sort of truth? It's getting harder and harder to believe what you read. Is that where we're going, where you can't trust anything you see on the internet? Because it's all fake and it's all generated. And maybe we'll go back to reading non-AI-generated books. I don't know. I don't, I don't want to see where this story goes. It's a big problem. It is. Kim, did you talk to Emily a little bit about what we can do as technologists to handle this problem? I did. You're not going to like the answer. Uh oh. I don't think the industry has a solution because there's a lot of "It's not my problem to solve." I just want to get stuff out there and get it out the door. And then there is kind of this should it be on the model manufacturers and the models themselves that are producing content to self-identify if this is generative, or is it on humans to do that? Why is the burden of care on a human? Yeah. Good question. Who's holding the bag? And also, where do users of generative AI fit into all of this? A lot of people in this day and age, we want to move fast. Not because we're lazy, but it's because as a species, we are designed to be efficient. If something doesn't need to occur, or if something is seen as a hindrance to us getting to the outcome that we want, we will bypass it unless it is an actual required step in the process, and even then people will try to skip it. So with models, though, this becomes a little bit more challenging because when you're presenting generated content to a consumer or to another model, it's just going to take it in and process it the way that we do as humans in decision-making and then produce more decisions and more material and more content that somebody else can go through and repeat the process for. Until you're forced to stop and evaluate and actually kind of fact-check that content as being valid from an original source, we're still going to kind of keep spiraling in this derivative content that eventually turns into garbage on the internet. Sorry, I wish I had something better to give you. She just basically drug us all. We should be ashamed. I mean, you can hear her frustrations as a security technical expert coming through that answer there about, you know, we want to move fast. And if there's anything that's quote unquote in the way, people are going to try and skip it. Definitely. So what to do in lieu of an answer? Emily says it's simple. And I think that we've all been saying it over and over: verify, verify, verify! It's not as simple as saying AI is evil. It's also not proactive when you're trying to grapple with these potential negative effects. Generative AI certainly has its own applications and usefulness, and it's not to say that the technology is bad; it's more that we should be slightly more responsible about its creation and development than we currently are. If you are trying to be responsible in interacting with these models and systems, you have more time out of your day to go through and validate and spot-check. And, good news here. That's becoming easier with newer models in the market. The teams behind them are building with human-centered design in mind. Some models are really cool; they are providing those references and links. 
So if you're interacting with it, it will give you a summary, it'll give you content, and then you can double-click on it and say, "Oh yeah, I got it from this section of this... I don't know, government policy is where it's derived from." And here's the material; go look at it yourself if you don't believe me. And those are really great. It gives you the means as a consumer of that material to independently validate and audit that content side by side and match it up, instead of just taking it at face value. That's the way to do it. That is the way to do it. I mean, you think about it, if you're telling people that this is a great tool, we want you to use it. We want you to consume it. There has to be this really big caveat: you have to make sure that what it returns to you is actually valid and accurate. Again, trust but verify. Yes, it's just returned what seems like a bunch of great information, but you have to verify said information, and that does take work. But depending on what we're doing with the information that the AI is returning to us, we should want to make sure that it's valid and accurate, depending, of course, on its use case. Because who wants to put more crap out there? Nobody. Well, I mean, not nobody, but... you know what I mean. Mhm. Yeah. Yeah. It really sounds like the way we as a society should be looking at these AI tools is not as just providing answers for us, but as a starting point to get the information that we want, to produce the content that we want. We start, maybe, with what the AI produces, and then we again verify, expand on, and keep doing the work to make sure that the stuff that we're going to use is accurate and believable. Something that you can actually put out there. And instead of reinforcing this mechanism of garbage on the internet, maybe we can kind of crack it a little bit. Yeah. That's really important for people who are adopting AI, especially generative AI tools. And the people behind those tools, I feel like it's really important for them to understand the tendencies of human behavior, to share information and to bypass things. People are not necessarily bad or lazy; they're just trying to work in the ways that are natural to their brains. And people who use AI tools are no different. So these tools need to be designed with the human experience in mind. The tendency to latch on to different ideas and different information that may not be founded in fact, that may be controversial more so than it is authoritative—these are things that are pretty much, you know, natural to us as humans. We want harmony, and we also want things to be more efficient. So sometimes those two things work against us. And technologists would be well-served to understand that when they're making these things. Where do we go from here? I wish I knew, but Emily's thoughts about, you know, in particular the loss of deep knowledge of things over time are something that I feel like we really need to visit, because as usage of generative AI becomes more commonplace, that's going to become more and more of an issue. It is! It's hard to convert that short-term memory into long-term memory, especially when you didn't do anything to get there. So when you're learning something deeply, that process that you go through to actually learn it converts it into long-term memory. When we're just doing this really quick, I need to do this... now I don't know everything behind it, but it gives me what I want. It gives me that little rush. I have what I need. I'm just going to go on about my day.
It's really... we're going to sacrifice a part of ourselves and will not know things deeply the way we should. And I know I'm guilty of it as well. I'm speaking about myself, where you just want to get something done, and you will take this shortcut, and you'll know enough to again make sure you're not doing anything dangerous or regurgitating anything that's nonsense. But you've skipped the process. There's a process to understand and know, and we've just short-circuited it with generative AI. As long as we're continuing to abstract for our own benefit, we're going to go down that path. But what happens when we abstract that information, those processes away from the humans that were originally responsible for it? We lose the skills to be able to do it ourselves. There are some interesting and fascinating historical studies around power plants and their operation, where through automation that skill set's been lost. So if those systems stop working, who actually knows how to drive out to the site and reset the entire power grid? Those kinds of questions, we lose all that. So if you are really good at one thing, get really good at it real fast because you may be the only person left that actually knows that material. Yeah, it is. It's the second scariest thing I've heard in this episode. I'm going to play the devil's advocate on this one. I mean, I hear where y'all are coming from, and I agree to a certain extent. However, this is an argument that I've heard before, and it's one that actually goes back to ancient times when people started writing stuff down. There was a big uproar of like, "Oh, people aren't learning the things the way that they used to." Now they can just reference the book to be able to find that information instead of memorizing it. The same thing happened with the advent of the internet, and there was a big outcry about not being able to do research and go to the library and, you know, find a book and read it that way. Everything's going to be just instantly available at your fingertips through the internet. And those arguments aren't completely wrong, right? I mean, we have lost a little bit of things along the way. But I think we might have also gained a lot in the meantime. However, the scale at which this is happening seems to be much more dangerous with the advent of AI actually replacing these skills that we're supposed to be learning. Agreed. So what you're saying is... humans always find a way. Yeah. Life finds a way. It always. Is that what you're saying? Get out of here. Sounds that way. Okay. Please. Please go. In the end, right, there is no easy, neat, wrapped-up-in-a-little-bow answer to the problem of AI feedback loops. The AI uroboros persists. Humans are human, and it would be best for technologists from all points of view on AI, whether they love it or hate it, and everyone from people who are adopting it to people who are trying their best to stay away from it, to build whatever they're building with the human experience and the future in mind. Well, this was a story. Emily really gave us so much context and information that some of us, myself included, really haven't given a lot of thought to. And I'm feeling much more empowered now. And I really want to thank you for taking the time and bringing this episode together. I want to hear what our listeners think. Did Emily scare you too a little bit? I mean, you have to let us know what you thought of this episode. Hit us up on our socials at Red Hat. Always use the #compilerpodcast.
We want to hear what you think about this. I, for one, am very curious. I'm not the only one afraid, am I? Oh, no. No. And that does it for this episode of Compiler. This episode was written by Kim Huang. Victoria Lawton knows how corn is digested, and we trust in her... always. Thank you to our guest, Emily Fox. Take care, everybody. Until next time. Compiler is produced by the team at Red Hat with technical support from Dialect. Our theme song was composed by Mary Ancheta. If you like today's episode, please follow the show, rate the show, and leave a review. Share it with someone you know. It really helps us out. All right. See you later, everybody. Take care. Bye.
Compiler

Featured guests

Emily Fox
 
