Any process that improves by being trained on a set of materials can only ever be as good as those materials allow. That's as true of machine learning algorithms as it is of human beings. This week brought news that Harvard will release a dataset of 1 million volumes for AI training.
Ostensibly, the purpose of the move is to address the inequity that can result from the need for high-quality training data. The argument goes that, at present, large tech companies with deep pockets have access to better collections of books on which to train their large language models. To make AI more equitable, then, people with shallower pockets need access to better datasets too. It's very similar to the open data argument that has become commonplace across science.
Harvard’s Institutional Data Initiative
What Harvard has said it will do is release a high-quality dataset from the Institutional Data Initiative’s (IDI) database. The IDI, funded by OpenAI and Microsoft, works with academic and library institutions in the U.S. to bring together public domain books into a usable, single-location dataset. It will include the likes of Shakespeare as well as many public domain periodicals.
The move raises all kinds of questions—not least what Microsoft and OpenAI have to gain from making a valuable tool publicly available when they have their own commercial AI projects trained on works still in copyright (something Sam Altman of OpenAI has said is essential for the very best AI). No doubt someone somewhere is already suffixing a term with "-washing" to coin an apt descriptor.
Google’s Controversial Contribution
But what raised my eyebrow was the involvement of Google. The company, which has its own Gemini AI that generates poor-quality headers clogging search results, has said it is proud to be involved and will help distribute the database. That involvement appears to include contributing works from its own books project.
Those of you with moderate to adequate memories will no doubt recall that Google Books wasn't without controversy. While Google sought to make every title in the world searchable, copyright holders argued that the idea might first have been run past them (sound familiar?). As the contents of the new database are not yet fully known, the extent of Google Books' contribution remains unclear. I am sure the IDI has done its due diligence. On the other hand, talking of "-washing," as I read these stories I can't stop my mind drifting to some of Netflix's fascinating documentaries on the process of money laundering.
Thoughts or further questions on this post or any self-publishing issue?
If you’re an ALLi member, head over to the SelfPubConnect forum for support from our experienced community of indie authors, advisors, and our own ALLi team. Simply create an account (if you haven’t already) to request to join the forum and get going.
Non-members looking for more information can search our extensive archive of blog posts and podcast episodes packed with tips and advice at ALLi's Self-Publishing Advice Center.