Harvard And Google Release AI Training Dataset With Public Domain Books, Raising Copyright Questions: Self-Publishing News With Dan Holloway

Any process that improves by being trained on a set of materials will only ever get as good as the materials it’s trained on will allow. That’s as true of machine learning algorithms as it is of human beings. This week’s news that Harvard will release a dataset of 1 million volumes for AI training highlights efforts to address inequities caused by the need for high-quality training data.

ALLi News Editor, Dan Holloway

Ostensibly, the purpose behind the announcement that Harvard will be making a dataset of 1 million volumes available for AI training is to address the inequity that can result from the need for high-quality training fuel. The argument goes that at present, large tech companies with deep pockets have access to better sets of books to train their large language models. To make AI more equitable, therefore, more people with shallower pockets need access to better datasets. It’s very similar to the open data argument that has become commonplace across science.

Harvard’s Institutional Data Initiative

What Harvard has said it will do is release a high-quality dataset from the Institutional Data Initiative’s (IDI) database. The IDI, funded by OpenAI and Microsoft, works with academic and library institutions in the U.S. to bring together public domain books into a usable, single-location dataset. It will include the likes of Shakespeare as well as many public domain periodicals.

The move raises all kinds of questions—not least among them what Microsoft and OpenAI have to gain from making a valuable tool publicly available when they have their own commercial AI projects that have been trained on works still in copyright (something Sam Altman of OpenAI has said is essential for the very best AI). No doubt someone somewhere is already working on suffixing some term with “-washing” to coin an apt descriptor.

Google’s Controversial Contribution

But what raised my eyebrow was the involvement of Google. The company, which has its own Gemini AI that generates poor-quality headers clogging search results, has said it is proud to be involved and will help distribute the database. That involvement appears to include contributing works from its own books project.

Those of you with moderate to adequate memories will no doubt recall that Google Books wasn’t without controversy. While Google sought to make every title in the world searchable, copyright holders noted this might first be run past them (sound familiar?). As the contents of the new database are not yet fully known, the extent of Google Books’ contribution remains unclear. I am sure the IDI has done due diligence. On the other hand, talking of “-washing,” as I read these stories, I can’t seem to stop my mind drifting to some of Netflix’s fascinating documentaries on the process of money laundering.

Author: Dan Holloway

Dan Holloway is a novelist, poet and spoken word artist. He is the MC of the performance arts show The New Libertines, which has appeared at festivals and fringes from Manchester to Stoke Newington. In 2010 he was the winner of the 100th episode of the international spoken prose event Literary Death Match, and earlier this year he competed at the National Poetry Slam final at the Royal Albert Hall. His latest collection, The Transparency of Sutures, is available for Kindle at http://www.amazon.co.uk/Transparency-Sutures-Dan-Holloway-ebook/dp/B01A6YAA40
