As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs), massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work out in legal terms is not yet clear. Although a few court cases involving the use of copyright material for training have been decided, many have not, and the detailed contours of the legal landscape remain uncertain.
However, there is an alternative to this “grab it all” approach. It involves using materials that are either in the public domain or released under a “permissive” licence that allows LLMs to be trained on them without any problems. There’s plenty of such material online, but its scattered nature puts it at a serious disadvantage compared to downloading everything without worrying about licensing issues. To address that, the Common Corpus was created and released just over a year ago by the French startup Pleias. A press release from the AI Alliance explains the key characteristics of the Common Corpus:
Truly Open: contains only data that is permissively licensed and provenance is documented
Multilingual: mostly representing English and French data, but contains at least 1[billion] tokens for over 30 languages
Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers
Extensively Curated: spelling and formatting has been corrected from digitized texts, harmful and toxic content has been removed, and content with low educational content has also been removed.
There are five main categories of material: OpenGovernment, OpenCulture, OpenScience, OpenWeb, and OpenSource:
OpenGovernment contains Finance Commons, a dataset of financial documents from a range of governmental and regulatory bodies. Finance Commons is a multimodal dataset, including both text and PDF corpora. OpenGovernment also contains Legal Commons, a dataset of legal and administrative texts. OpenCulture contains cultural heritage data like books and newspapers. Many of these texts come from the 18th and 19th centuries, or even earlier.
OpenScience data primarily comes from publicly available academic and scientific publications, which are most often released as PDFs. OpenWeb contains datasets from YouTube Commons, a dataset of transcripts from public domain YouTube videos, and websites like Stack Exchange. Finally, OpenSource comprises code collected from GitHub repositories which were permissibly licensed.
The initial release contained over 2 trillion tokens – the usual way of measuring the volume of training material, where tokens can be whole words and parts of words. A significant recent update of the corpus has taken that to over 2.267 trillion tokens. Just as important as the greater size is the wider reach: there are major additions of material from China, Japan, Korea, Brazil, India, Africa and South-East Asia. Specifically, the latest release contains data for eight languages with more than 10 billion tokens (English, French, German, Spanish, Italian, Polish, Greek, Latin) and 33 languages with more than 1 billion tokens. Because of the way the dataset has been selected and curated, it is possible to train LLMs on fully open data, which leads to auditable models. Moreover, as the original press release explains:
By providing clear provenance and using permissibly licensed data, Common Corpus exceeds the requirements of even the strictest regulations on AI training data, such as the EU AI Act. Pleias has also taken extensive steps to ensure GDPR compliance, by developing custom procedures to enable personally identifiable information (PII) removal for multilingual data. This makes Common Corpus an ideal foundation for secure, enterprise-grade models. Models trained on Common Corpus will be resilient to an increasingly regulated industry.
Another advantage for many users is that material with high “toxicity scores” has already been removed, thus ensuring that any LLMs trained on the Common Corpus will have fewer problems in this regard.
The Common Corpus is a great demonstration of the power of openness and permissive copyright licensing, and how they bring benefits that other approaches can’t match. For example: “Common Corpus makes it possible to train models compatible with the Open Source Initiative’s definition of open-source AI, which includes openness of use, meaning use is permitted for ‘any purpose and without having to ask for permission’.” That fact, along with the multilingual nature of the Common Corpus, would make the latest version a great fit for any EU move to create “public AI” systems, something advocated on this blog a few months back. The French government is already backing the project, as are other organisations supporting openness:
The Corpus was built up with the support and concerted efforts of the AI Alliance, the French Ministry of Culture as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
This dataset was also made in partnership with Wikimedia Enterprise and Wikidata/Wikimedia Germany. We’re also thankful to our partner Libraries Without Borders for continuous assistance on extending low resource language support.
The corpus was stored and processed with the generous support of the AI Alliance, Jean Zay (Eviden, Idris), Tracto AI, Mozilla.
The unique advantages of the Common Corpus mean that more governments should be supporting it as an alternative to proprietary systems, which generally remain black boxes in terms of where their training data comes from. Publishers too would be wise to fund it, since it offers a powerful resource explicitly designed to avoid some of the thorniest copyright issues plaguing the generative AI field today.
Robin Edgar