In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.
The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California, that he founded.
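The tables described above, with words and short phrases listed next to the articles that contain them, resemble what is often called an inverted index over n-grams. The sketch below is only an illustration of that general idea, not Malamud's actual pipeline or file format: the function names, the `max_n` phrase-length limit, and the toy article IDs and texts are all assumptions made for the example.

```python
import re
from collections import defaultdict


def ngrams(text, max_n=3):
    """Yield every run of 1..max_n consecutive words in text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles, max_n=3):
    """Map each word or short phrase to the set of article IDs containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for gram in ngrams(text, max_n):
            index[gram].add(article_id)
    return index


# Hypothetical toy corpus: the IDs and sentences are invented for illustration.
articles = {
    "paper-1": "Gene expression in yeast cells",
    "paper-2": "Yeast cells under heat stress",
}
index = build_index(articles)

# Look up which articles mention a phrase without storing any full text.
print(sorted(index["yeast cells"]))  # ['paper-1', 'paper-2']
```

An index like this records where a phrase occurs but keeps only short fragments, so the original article cannot be read back from it.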
Computer scientists already text mine papers to build databases of genes, drugs and chemicals found in the literature, and to explore papers’ content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are restricted to mining only open-access papers, or those articles they (or their institutions) have subscriptions to. Some publishers have said that researchers looking to mine the text of paywalled papers need their authorization.
And although free search engines such as Google Scholar have — with publishers’ agreement — indexed the text of paywalled literature, they only allow users to search with certain types of text queries, and restrict automated searching. That doesn’t allow large-scale computerized analysis using more specialized searches, Malamud says.
Michael Carroll, a legal researcher at the American University Washington College of Law in Washington DC, says that distributing the index should be legal worldwide because the files do not copy enough of an underlying article to infringe the publisher’s copyright — although laws vary by country. “Copyright does not protect facts and ideas, and these results would be treated as communication of facts derived from the analysis of the copyrighted articles,” he says.
The only legal question, Carroll adds, is whether Malamud’s obtaining and copying of the underlying papers was done without breaching publishers’ terms. Malamud says that he did have to get copies of the 107 million articles referenced in the index to create it; he declined to say how.
It is sad indeed that much research, a lot of it probably paid for by taxpayers and all of it eventually subsidised by customers of the companies that paid for it, is impossible or very hard for scientists to look up because of copyright. This is a clear impediment to the growth of wealth and knowledge, and it is not hard to understand why countries like China, which don’t let people sit on their copyrighted arses but make them innovate for a living, are doing much better at growth than the legally quagmired West.