Generative artificial intelligence (AI) offers Europe a unique chance to lead in tech innovation and boost its digital competitiveness, building on the European Union’s strong talent base, world-class research and education, and access to computing power.
Generative AI systems are powered by foundation models trained on large, diverse datasets using self-supervision. This enables them to learn patterns and generate new content – such as text, images, audio, video, or code – from scratch.
Yet, the debate in the EU around generative AI – and especially its intersection with copyright – is often clouded by misunderstandings. To help set the record straight, this series of six bite-sized explainers tackles some of the most common misconceptions.
Explore these explainers to get clarity on what’s fact, what’s fiction, and why it matters.
Myths versus facts
- Do AI models contain copies of their training data?
- Can knowledge, facts and ideas be copyrighted?
- Does copyright law protect every type of data?
- Can owners of creative works prevent use for AI training?
- Does every government allow rightsholders to opt out of AI training?
- Do all rightsholders insist on mandatory licensing of content for AI training?
Do AI models contain copies of their training data?
No, artificial intelligence (AI) doesn’t store exact copies – it learns patterns and concepts.
Main takeaways:
- Generative AI models don’t store copies of the data they have been trained on.
- Instead, they learn patterns and concepts as numerical weights.
- Like someone who reads many books on a particular subject and then writes their own, generative AI doesn’t ‘copy’ but analyses information, enabling it to create original content.
- For example, when trained on text data, AI models reflect probabilities for certain word combinations, enabling coherent output and responses.
- However, exposure to certain content during training can affect the output generated later on, because each output is generated from learned statistical probabilities.
- For example, if an AI model is trained on tens of thousands of images of cats, it learns the defining features of a ‘cat’ and is therefore more likely to generate an accurate image of a cat when prompted.
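The idea of learning word-combination probabilities rather than storing text can be illustrated with a deliberately tiny sketch. It assumes a toy "bigram" model, which is far simpler than the foundation models described above; the function name `train_bigram_model` and the three-sentence corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count adjacent word pairs and turn the counts into probabilities.

    Toy illustration only: real generative models learn billions of
    numerical weights, not a lookup table of word pairs.
    """
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that follows it.
        for current_word, next_word in zip(words, words[1:]):
            pair_counts[current_word][next_word] += 1

    model = {}
    for current_word, followers in pair_counts.items():
        total = sum(followers.values())
        # Keep only the relative frequency of each following word.
        model[current_word] = {
            word: count / total for word, count in followers.items()
        }
    return model


# Hypothetical training corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(corpus)
print(model["the"])  # probabilities of the words seen after "the"
```

In this sketch, "cat" follows "the" in two of the six cases, so it receives a probability of one third. What the model holds is a table of numbers of this kind, not the sentences themselves.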
Can knowledge, facts and ideas be copyrighted?
No, copyright applies to specific creative expressions – not to raw facts or ideas.
Main takeaways:
- Facts, ideas, information, and knowledge simply cannot be copyrighted.
- They are considered public and free for everyone to use, something that EU copyright law also recognises.
- The right of everyone to access and disseminate information is upheld by fundamental rights and legal frameworks, including the Universal Declaration of Human Rights, the Convention for the Protection of Human Rights and Fundamental Freedoms, and the Charter of Fundamental Rights of the European Union.
- However, specific expressions of these ideas – like a book, painting, or news article – can be copyrighted because they are specific creative works.
Does copyright law protect every type of data?
No, copyright only protects specific creative works, not the underlying facts or ideas.
Main takeaways:
- Copyright law protects creative works fixed in a tangible medium – like books or music – but not any of the underlying facts or ideas.
- For example, you can’t copy someone else’s copyright-protected book, but you can try to learn as much from it as possible, and then use that knowledge to write your own.
- Copyright is about protecting original expression, not raw data or facts.
- Making this distinction is crucial to preventing overreach of copyright protection, as well as to upholding freedom of expression and information – something also recognised by the EU’s Copyright Directive and AI Act.
Can owners of creative works prevent use for AI training?
Yes, in the European Union rightsholders can opt out of having their data used in artificial intelligence (AI) training if they choose to do so.
Main takeaways:
- In the EU, rightsholders can prevent the creative works they own from being used in AI training with tools such as the universally accessible robots.txt protocol.
- Tech companies provide a whole range of tools that prevent crawling, giving rightsholders control over how their data is used.
- However, it’s important to note that creators’ right to opt out may not be used to stop lawful text and data mining (TDM) activities.
- Contrary to what some believe, creators can’t block TDM in every case, for example when it is carried out for research purposes or to improve accessibility.
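How a robots.txt opt-out works in practice can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not a description of any particular company's crawler: the crawler names `ExampleAIBot` and `OtherBot`, the helper `may_crawl`, and the example.com URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a website owner might publish.
# "ExampleAIBot" is an invented crawler name used for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text instead of downloading them.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/articles/1"
print(may_crawl(ROBOTS_TXT, "ExampleAIBot", page))  # the opted-out crawler
print(may_crawl(ROBOTS_TXT, "OtherBot", page))      # any other crawler
```

Here the site blocks one named crawler from the whole site while leaving it open to every other crawler. A crawler that follows the protocol runs a check like this before fetching a page.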
Does every government allow rightsholders to opt out of AI training?
No, the European Union stands apart from other major AI players regarding opt-out rights.
Main takeaways:
- Among the major jurisdictions in the AI race, the EU stands out by granting rightsholders a legal entitlement to opt out of text and data mining (TDM) for AI training purposes.
- The United States and Japan, for example, have exceptions in place that help to promote innovation and data accessibility without any TDM opt-out entitlement, as do Singapore, South Korea, Malaysia, Israel, and Taiwan.
- This divergence between the EU and the rest of the world risks limiting access to the latest AI innovations for businesses and users in Europe.
Do all rightsholders insist on mandatory licensing of content for AI training?
No, rightsholders – the owners of certain rights to a creative work – have very diverse views about AI training. And licensing is just one approach among many.
Main takeaways:
- Some vocal rightsholders want you to believe that mandated licensing of content for use in artificial intelligence (AI) training is the only viable way forward.
- In reality, however, many creators do not think compulsory licensing is the right or sole solution.
- Most websites and rightsholders do not block access to their data for AI training, which indicates they see no harm or need to opt out.
- Surveys* show that the views of creators are far more nuanced than some make it seem, as many also recognise the benefits.
- Cumbersome licensing of training data could slow generative AI innovation, ultimately also harming the media and creative sectors, which are among those who stand to benefit most from this type of innovation.
* Exploring Preference Signals for AI Training, Creative Commons