The case, Kneschke v LAION e.V., must be seen in the larger context of the surrounding debate on whether or not AI training/machine learning (ML) is privileged under European copyright law as a form of (sophisticated) text-and-data mining (TDM).
The Starting Point
TDM, or more precisely the reproduction of protected works in the process of automated analysis, is widely privileged under European copyright law, not only for the purpose of scientific research (Art. 3 DSM-Directive) but also for other purposes, including commercial and business uses (Art. 4 DSM-Directive). In the latter case, however, authors and right holders can opt out by explicitly reserving the use of their works.
The notion that AI training constitutes a (particularly sophisticated) form of text and data mining is largely rejected by authors and right holders on the grounds that the objective of training a generative AI model is said to extend beyond the mere extraction of information.
In Germany, this debate is further intensified by the fact that the German act of transposition (Sec. 44b UrhG) does not require remuneration even for "commercial" TDM.
Consequently, authors and right holders are attempting to explicitly reserve the use of their works against "commercial" TDM by including disclaimers on their websites.
This presents a number of additional challenges. The German act of transposition expressly goes beyond the requirements of European law in that it requires a machine-readable form, which has previously been understood restrictively under German case-law and typically implemented through a standard web protocol, robots.txt.
At the same time, there is currently no dedicated web protocol that has been specifically designed for expressing TDM reservations. Consequently, right holders and authors face a dilemma: robots.txt could obviously be used to exclude AI developers from crawling data, but this would also exclude search engine crawling, making websites de facto disappear from search results.
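The dilemma can be illustrated with a minimal Python sketch using the robots.txt parser from the standard library. The crawler names and the URL are hypothetical placeholders, and the blanket rule shown is only one possible configuration, not a description of any particular website.

```python
from urllib.robotparser import RobotFileParser

# A blanket rule addressed to every crawler (assumed example content).
BLANKET_RULES = [
    "User-agent: *",
    "Disallow: /",
]


def allowed(rules, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/photos/1.jpg"  # placeholder URL

# Hypothetical crawler names, chosen only for illustration.
ai_ok = allowed(BLANKET_RULES, "ExampleAIBot", URL)
search_ok = allowed(BLANKET_RULES, "ExampleSearchBot", URL)

print("AI crawler allowed:", ai_ok)
print("Search crawler allowed:", search_ok)
```

A rule of this kind does not distinguish by purpose: it shuts out the hypothetical search crawler just as it shuts out the hypothetical AI crawler, and it says nothing about TDM as such.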
The Case
The Hamburg case was brought by Mr Kneschke, a professional photographer, against LAION e.V., a private association providing a large dataset of image-text pairs free of charge for the purpose of training generative AI. The dataset in question does not provide any reproductions of the paired images, but only hyperlinks to images that are publicly available online. Among the almost 6 billion images, the dataset also contains one of the plaintiff's images, which is originally hosted on a website that explicitly prohibits the use of automated means for downloading or scraping, as set out in its (English) terms of service.
The point of contention is the methodology employed in the creation of the dataset: The defendant created the dataset based on a pre-existing one from 2008. To ascertain whether the provided descriptions matched the images in question, the defendant downloaded and temporarily stored the images, and then analyzed them by automated means.
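The workflow described above can be sketched roughly as follows. This is a schematic illustration only, not the defendant's actual code: the function names, the similarity score and the threshold value are assumptions introduced for the example.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image URL, text description)


def build_dataset(
    candidates: Iterable[Pair],
    download: Callable[[str], bytes],
    similarity: Callable[[bytes, str], float],
    threshold: float = 0.5,  # assumed cut-off, purely illustrative
) -> List[Pair]:
    """Keep only the pairs whose description appears to match the image."""
    kept: List[Pair] = []
    for url, description in candidates:
        image = download(url)                   # temporary copy of the image
        score = similarity(image, description)  # automated analysis
        del image                               # the copy is not retained
        if score >= threshold:
            kept.append((url, description))     # only link and text are stored
    return kept


# Toy stand-ins so that the sketch runs without network access.
def fake_download(url: str) -> bytes:
    return url.encode()


def fake_similarity(image: bytes, description: str) -> float:
    return 1.0 if description.encode() in image else 0.0


candidates = [
    ("https://example.com/cat.jpg", "cat"),
    ("https://example.com/dog.jpg", "tree"),
]
dataset = build_dataset(candidates, fake_download, fake_similarity)
print(dataset)
```

The point of the sketch is that the resulting dataset holds only hyperlinks and descriptions, while the image itself exists as a copy only for the duration of the comparison. It is that temporary copy which the dispute is about.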
The plaintiff sought an injunction based on copyright infringement to prevent future reproductions of his image. The defendant maintained that the reproduction during the creation of the dataset is privileged under German and European copyright law, in particular under the provisions on text and data mining.
The Decision
The Hamburg court dismissed the case, ruling that
- reproduction for the purpose of verifying the content and its description must be distinguished from use for AI training,
- reproduction for the purpose of verifying constitutes a form of text and data mining, and
- creating a dataset for AI training free of charge can be considered to fall under the privileged purpose of scientific research.
However, regarding a possible reservation of use against commercial TDM, the court suggested obiter that a disclaimer in natural language in the website's terms might be effective.
Takeaway: Guidance and Possible Challenges
The decision does not address the (controversial) issue of using web-scraped images for AI training. By differentiating between reproductions for data aggregation and those subsequently used during AI training, the court has managed to avoid this contested territory. At the same time, the decision addresses a significant first step of AI training, the aggregation of training data, and in doing so presents right holders, AI developers and data aggregators alike with new challenges:
On the one hand, the court applied a wide interpretation of the term "scientific purposes" as provided in Art. 3 DSM-Directive, thereby extending the privilege of "scientific" TDM to data aggregation, even when carried out by AI businesses of any form and irrespective of the fact that the dataset could later be used by others to train commercial AI applications.
On the other hand, the court's expansive interpretation of machine-readability introduces new challenges and risks for AI businesses. If reservations in natural language were deemed to suffice, data aggregators (regardless of their size and capacities) would be compelled to use AI systems with natural language processing capabilities to identify and interpret these reservations (irrespective of their extent, wording and language). This appears to be an entirely impractical proposition: it prompts the question why crawling for the aggregation of training data should be subject to a different standard than that applied to the large search engines crawling the internet, and it is also likely to produce significant error rates. It seems that in the court’s opinion the burden resulting from the absence of a standard web protocol for TDM reservations should be borne by AI developers and data aggregators.
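Why natural-language reservations are hard to detect reliably can be illustrated with a deliberately naive Python sketch. The keyword patterns and the three sample sentences are invented for this example. They are not quotations from any actual terms of service, and a real system would need to be far more sophisticated.

```python
import re

# Hypothetical English patterns; real terms of service vary widely.
PATTERNS = [
    r"automated (means|programs?)",
    r"scrap(e|ing)",
    r"text and data mining",
]


def looks_like_reservation(terms: str) -> bool:
    """Naive check: does the text contain one of the known phrases?"""
    text = terms.lower()
    return any(re.search(pattern, text) for pattern in PATTERNS)


# Invented sample sentences (English, German, and an English paraphrase).
english = "You may not use automated programs to download or scrape content."
german = "Die Nutzung der Inhalte für maschinelles Lernen bleibt vorbehalten."
paraphrase = "Robots are not welcome to copy our pictures."

for sample in (english, german, paraphrase):
    print(looks_like_reservation(sample), "-", sample)
```

The simple matcher recognises the first sentence but misses both the German wording and the English paraphrase, although all three are meant as reservations. Closing that gap is what would push aggregators towards full natural language processing, with the error rates this entails.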
Outlook
For now, AI training and the TDM privileges remain controversial: With its extensive interpretation of machine-readability, the court goes against the prevailing legal opinion, while at the same time granting data aggregators and AI developers a broad privilege for curating datasets even against the explicit will of authors and right holders. It is therefore no surprise that the plaintiff has already announced that he will closely consider his option to appeal.
For AI developers and data aggregators (profit-oriented or not), the decision creates legal uncertainty. An appeal would therefore be welcome for all concerned. At the same time, the development of a TDM-specific standard web protocol should be supported – and in fact, the World Wide Web Consortium (W3C) community has already initiated work on this.
Authored by Marvin Jaeschke