Exploring the Copyright Clearance Problem in Generative AI
Courts are exploring and deciding whether generative AI infringes copyright. In this article, let’s talk about what this means.
Copyright law in the United States is a complicated matter. It's understandable that those of us who aren't lawyers have a hard time figuring out what it really means and what it does and doesn't protect. Data scientists don’t spend a lot of time thinking about copyright issues unless we choose a license for an open source project. Even so, sometimes we just skip over this and don't really deal with it, even though we know we should.
But the legal community is starting to look closely at the copyright implications of generative AI, which could have real consequences for our work. Before we discuss how copyright affects generative AI specifically, let’s first review the basic facts of copyright.
Copyright
- U.S. copyright law applies to what are called "original works of authorship." These include literary works; musical works; dramatic works; pantomimes and choreographic works; pictorial, graphic, and sculptural works; audiovisual works; sound recordings; derivative works; compilations; and architectural works.
- Content must be written down or otherwise recorded to be copyrightable. As the Electronic Frontier Foundation explains, ideas are not copyrightable; only tangible forms of expression (such as books, plays, paintings, films, or photographs) are. Once you express your idea in a fixed form, such as a digital painting, a recorded song, or even a doodle on a napkin, it is automatically protected by copyright if it is an original work (https://www.eff.org/teachingcopyright/handouts#copyrightFAQ).
- Being protected means that only the copyright holder (the author or creator, a descendant who inherits the rights, or someone who purchased the rights) can do certain things, such as make and sell copies of the work, create derivative works from it, and publicly perform or display it.
- Copyright is not eternal; it ends after a certain period of time. Typically, this is 70 years after the author's death or 95 years after the content was published. (Anything published before 1929 is generally in the "public domain" in the United States, meaning it is no longer protected by copyright.)
Why does copyright exist? Recent interpretations of the law hold that the point is not just to make creators rich, but to encourage creation so that we have a society rich in artistic and cultural creativity. Basically, we let creators earn money from their work so they have an incentive to create great work for us. This means that many courts, when hearing copyright cases, ask, “Does this reproduction contribute to a creative, artistic, and innovative society?” and take the answer into account when making decisions.
Fair use
Furthermore, "fair use" is not a free pass to ignore copyright. There are four tests that determine whether a use of content is "fair use":
- Purpose and character of the second use: Are you doing something innovative and different, or are you just copying the original? Is your new work itself innovative? If so, it is more likely to be fair use. If your use is meant to make money, it is less likely to be fair use.
- Nature of the original work: If the original is creative, it is harder to claim fair use. If it is just facts, fair use is more likely to apply (for example, quoting a research article or an encyclopedia).
- Amount used: Are you copying the entire work, or just a paragraph or a short excerpt? Using as little as necessary matters for fair use, although sometimes a derivative work needs to use a lot.
- Effect: Are you stealing customers from the original? Will people buy or use your copy instead of buying the original? Will the creator lose money or market share because of your copying? If so, this is probably not fair use. (This matters even if you don’t make any money yourself.)
A use is judged against all of these tests together; passing just one or two is not enough. Of course, all of this is subject to legal interpretation. (Obviously, this article is not legal advice!) But now, with these facts in mind, let’s think about what generative AI does and why the concepts above bear on it.
A review of generative artificial intelligence
Readers familiar with my columns will have a very clear understanding of how generative AI is trained. That being said, let's do a quick refresher first.
- Large amounts of data are collected, and the model learns by analyzing the patterns present in the data. (As I’ve written before: “Some reports suggest that there are approximately 1 trillion words in GPT-4’s training data. Each of these words was written by a person in their own creative capacity. For context, the first book in the Game of Thrones series is about 292,727 words, so the training data for GPT-4 is about 3,416,152 copies of the book.")
- Once the model has learned the patterns in the data (in the case of an LLM, it learns all about language semantics, syntax, vocabulary, and idioms), it is fine-tuned by humans to exhibit the desired behavior when people interact with it. These patterns can be so specific that some scholars argue the model can "memorize" the training data.
- The model is then able to answer the user's prompts, reflecting the patterns it has learned (in the case of an LLM, answering questions in convincing-sounding human language).
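The book comparison in the first point is simple division, and it can be checked in a few lines. This is only a sketch of that arithmetic: the 1 trillion figure is the reported estimate quoted above, not a confirmed number, and the function name is made up for illustration.

```python
# Figures quoted in the text. The training-set size is a reported estimate.
TRAINING_WORDS = 1_000_000_000_000  # roughly 1 trillion words
BOOK_WORDS = 292_727                # word count quoted for the first book


def book_equivalents(total_words: int, words_per_book: int) -> int:
    """Return how many whole books of the given length fit in total_words."""
    return total_words // words_per_book


copies = book_equivalents(TRAINING_WORDS, BOOK_WORDS)
print(f"{copies:,}")  # prints 3,416,152
```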
Both the input (training data) and the output of these models have important implications for copyright law; so let’s take a closer look.
Training data and model output
Training data is critical to creating generative AI models. The goal is to teach a model to replicate human creativity, so the model needs to see a lot of human creative work to understand what it looks or sounds like. But, as we learned earlier, works produced by humans belong to those who created them (even if they were jotted down on a napkin). Given the amount of data needed to train even a small generative AI model, paying every creator for the rights to their work is not financially feasible. So, is it fair for us to feed other people’s work into training datasets and create generative AI models? Let’s go through the fair use tests again to see where we stand.
1. Purpose and character of the second use
We could argue that using data to train a model doesn't really count as creating a derivative work. Is this different from teaching a child with books or music, for example? The counterargument is that, first, teaching a child is not the same as using millions of books to generate a product for profit; second, generative AI can replicate what it was trained on so closely that it is essentially a great tool for copying a work almost word for word. Are the results of generative AI sometimes innovative and completely different from the input? If so, that's probably because of very creative prompt engineering, but does that mean the underlying tool is legitimate?
However, philosophically speaking, machine learning is trying to reproduce the patterns it learned from the training data as accurately as possible. Are the patterns it learned from the original work the same as the "core" of the original work?
2. Nature of the original work
This aspect varies widely among the different types of generative AI that exist, but since training any model requires large amounts of data, it seems that at least some of that data meets the legal criteria for creativity. In many cases, the whole reason for using human-made content as training data is to get innovative (highly diverse) inputs into the model. Unless someone is going to go through all 1 trillion words of GPT-4's training data and decide which are creative and which are not, I don't think this test is met for fair use.
3. Amount used
This is a similar question to #2. Almost by definition, generative AI training datasets use everything they can get their hands on, and they need to be large and comprehensive, so there really is no "minimum necessary" amount of content.
4. Effect
Finally, the effect test is a sticking point for generative AI. I think we all know someone who from time to time uses ChatGPT or a similar tool instead of looking up the answer to a question in an encyclopedia or newspaper. There is strong evidence that people use services like Dall-E to request visual works "in the style of [name of artist here]" despite some apparent efforts by these services to prevent this. If the question is whether people will use generative AI rather than paying the original creators, it seems certain that this will happen in some areas. We can also see that companies like Microsoft, Google, Meta, and OpenAI are making billions in valuation and revenue from generative AI, so they are certainly not going to pass this test easily.
Replication as a concept in computing
I want to pause for a moment to talk about a tangential but important issue. Copyright law does not deal well with computing in general, and with software and digital artifacts in particular. Copyright law was largely developed in an earlier world, when copying a vinyl record or republishing a book was a specialized and expensive undertaking. But today, when basically anything on any computer can be copied in seconds with the click of a mouse, the whole idea of copying things is different than it used to be.
Also, remember that installing any software counts as copying. Digital copies mean something different in our culture than copies did before computers. There are a lot of questions about how copyright should work in the digital age, because a lot of it doesn't seem to be that important anymore. Have you ever copied some code from GitHub or StackOverflow? Of course I have! Did you carefully review the content license to make sure it could be used in your scenario? You should, but do you?
The New York Times’ case against OpenAI
Now that we have a general sense of the shape of the copyright dilemma for generative AI, how are creators and the law dealing with these problems? I think the most interesting such case (and there are many) is the one brought by the New York Times, because part of it deals exactly with what copying means, and other cases may not have done that.
As I mentioned above, copying digital files is so common and normal that it's hard to imagine enforcing a rule that copying a digital file is itself infringement (at least when there is no intent to distribute that exact file to the global public in violation of the other fair use tests). I think this is where we need to focus on the issue of generative AI: not just replication, but the impact on culture and markets.
Is generative AI really copying content? In other words, training data in, training data out? The New York Times shows in its filings that you can get the verbatim text of New York Times articles out of ChatGPT with very specific prompts. Because the New York Times has a paywall (which blocks non-paying readers from viewing its content), if this is true it looks like a clear violation of the effect test for fair use. So far, OpenAI's response has been, in effect, that such verbatim results appear only because a lot of elaborate prompting was used with ChatGPT. This surprised me: their argument concedes that generative AI sometimes produces verbatim copies of what it was trained on. But isn't this illegal? (Universal Music Group has filed a similar music-related case, arguing that the generative AI model Claude could copy the lyrics of copyrighted songs almost word for word.)
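To make "verbatim" concrete for data scientists, here is a minimal sketch of one way to measure how much of a model's output repeats a source text word for word. It is not the method used in any of the lawsuits; the function name and the sample strings are invented for illustration, and it relies only on `difflib` from the Python standard library.

```python
from difflib import SequenceMatcher
from typing import List


def longest_shared_run(source: str, output: str) -> List[str]:
    """Return the longest run of consecutive words found in both texts."""
    src_words = source.split()
    out_words = output.split()
    # Compare word sequences rather than characters; autojunk is disabled
    # so that frequent words are not silently ignored on long inputs.
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(src_words), 0, len(out_words))
    return src_words[match.a : match.a + match.size]


# Invented sample strings, not real article text or real model output.
article = "the quick brown fox jumps over the lazy dog"
model_output = "a report said the quick brown fox jumps away"

run = longest_shared_run(article, model_output)
print(len(run), "words:", " ".join(run))
```

A long shared run suggests reproduction rather than paraphrase, but how long is long enough to matter is exactly the kind of question the courts are being asked to answer.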
We are asking the courts to decide exactly how much of the copyrighted material is used and how it is used, which in this case will be a challenge! I tend to think that using data for training should not be an inherent problem; the important question is how the model is used and what impact it has.
We tend to think of fair use as a one-step matter, like quoting a passage in your article. Our system has a set of legal ideas that are well prepared for that situation. But generative AI is more like two steps. In my opinion, to say that copyright is being violated, the content must not only be used for training but also be retrievable from the final model in a way that usurps the market for the original material. I don't think AI systems can yet separate the amount of input content that was used from the amount that can be extracted verbatim as output. However, is this really the case with ChatGPT? It will be interesting to see what the courts think about these issues.
DMCA
There is another interesting angle to these questions, and that is whether the DMCA (Digital Millennium Copyright Act) is relevant. You may be familiar with this law because it has been used for decades to force social media platforms to remove music and movie files that were posted without the copyright holder's authorization. The law is based on the idea that you can "crack down" on copyright infringers by removing content one piece at a time. However, when it comes to a training dataset, this obviously doesn't work: removing one or more problematic files from the training data means retraining the entire model, which for most generative AI comes at a high cost. In theory, you could still use the DMCA to force the removal of a model's output from a site, but proving which model produced the item would be a challenge. And that still does not get at the combination of input and output that I have described as the key to infringement.
Power issues
If any of these actions actually infringe copyright, the courts still have to decide what to do about it. In a sense, a lot of people think generative AI is "too big to fail": we can't do away with the practices that got us here because everyone loves ChatGPT, right? Generative AI (we are told) will revolutionize almost every industry!
While the question of whether copyright was infringed remains to be decided, I do feel that if it was, there should be consequences. Assuming it’s easier to ask for forgiveness than permission, at what point do we stop forgiving powerful people and institutions who skirt the law or blatantly violate it? This isn't entirely obvious. Without some people acting this way, we wouldn't have a lot of innovation today, but that doesn't necessarily mean it was worthwhile. On the other hand, will letting these situations go lead to a devaluation of the rule of law?
Like many current listeners of 99percentinvisible.org, I am currently reading The Power Broker by Robert Caro. It's fascinating to learn how Robert Moses handled questions of law in New York in the 20th century, because his way of handling zoning laws is reminiscent of the way Uber handled laws governing drivers in San Francisco in the early 2010s, and of the way big companies building generative AI handle copyright now. Instead of obeying the law, they adopt the attitude that legal restrictions do not apply to them because what they are building is so important and valuable.
However, I just don't believe this is true. Of course, every situation is different in some ways, but the notion that a powerful person can decide that what he thinks is a good idea is inevitably more important than what anyone else thinks is something that baffles me. Generative AI can be useful, but it seems disingenuous to claim that it is more important than having a culturally vibrant and creative society. Courts still have to decide whether generative AI has a chilling effect on artists and creators. However, the court cases filed by these creators argue that this is the case.
The future
The U.S. Copyright Office has not ignored these challenging issues, although they may have been somewhat slow to respond to them. Recently, they published a blog post talking about their plans for generative AI-related content. However, the article is sorely lacking in specifics, other than telling us there will be reports in the future. The department's work focuses on three aspects:
- “Digital replicas”: roughly deepfakes and digital twins of people (think stunt doubles and actors who have to be scanned on the job in order to be digitally imitated)
- "Copyright in works containing material generated by artificial intelligence"
- “Training AI models on copyrighted works”
These are important topics and I hope the results will be well thought out. (I'll write about these reports once they're out.) I hope the policymakers working on this are informed and skilled, because it's easy for bureaucrats to make the entire situation worse with ill-advised new rules.
Another possibility for the future is that ethically sourced datasets will be developed for training. This is something that some folks at HuggingFace have already done in the form of a code dataset called The Stack (https://www.bigcode-project.org/docs/about/the-stack/). Can we do something like this for other forms of content?
In conclusion
Regardless of what the government or industry proposes, the courts are addressing the above issues. What happens if the generative AI side loses a case in court?
This could mean at least some of the money generated by generative AI will be returned to creators. I'm not convinced that the whole idea of generative AI is going away, although we did see the end of many companies in the Napster audio-sharing era. A court might put companies producing generative AI out of business, or ban the production of generative AI models; it’s not impossible! However, I don't think that's the most likely outcome. Instead, I think we'll see some penalties and legal fragmentation around this (this model is okay, that model isn't, and so on), which may or may not make the situation legally clearer.
I really hope the courts will address the question of when and how generative AI models can be considered infringing, not by separating the input and output issues but by examining them as a whole, because I think that is essential to making sense of the situation.
If they do, we might be able to come up with a meaningful legal framework for the new technologies we're dealing with. If they don’t, I fear we’ll end up further mired in a law that is unprepared to guide our digital innovation. We need copyright laws that make more sense in our digital world. But we also need to intelligently protect human art, science, and creativity in all its forms, and I don’t think AI-generated content is worth trading that away for.