Any process that improves by being trained on a set of materials can only ever be as good as those materials allow. That's as true of machine learning algorithms as it is of human beings. This week brought news that Harvard will release a dataset of 1 million volumes for AI training.
Ostensibly, the purpose of the move is to address the inequity that can result from the need for high-quality training data. The argument goes that, at present, large tech companies with deep pockets have access to better collections of books on which to train their large language models. To make AI more equitable, then, people with shallower pockets need access to better datasets too. It's very similar to the open data argument that has become commonplace across science.
Harvard’s Institutional Data Initiative
What Harvard has said it will do is release a high-quality dataset from the Institutional Data Initiative’s (IDI) database. The IDI, funded by OpenAI and Microsoft, works with academic and library institutions in the U.S. to bring together public domain books into a usable, single-location dataset. It will include the likes of Shakespeare as well as many public domain periodicals.
The move raises all kinds of questions—not least what Microsoft and OpenAI have to gain from making a valuable tool publicly available when they have their own commercial AI projects trained on works still in copyright (something Sam Altman of OpenAI has said is essential for the very best AI). No doubt someone somewhere is already suffixing a term with "-washing" to coin an apt descriptor.
Google’s Controversial Contribution
But what raised my eyebrow was the involvement of Google. The company, which has its own Gemini AI that generates poor-quality headers clogging search results, has said it is proud to be involved and will help distribute the database. That involvement appears to include contributing works from its own books project.
Those of you with moderate to adequate memories will no doubt recall that Google Books wasn't without controversy. While Google sought to make every title in the world searchable, copyright holders argued that the idea might first have been run past them (sound familiar?). As the contents of the new database are not yet fully known, the extent of Google Books' contribution remains unclear. I am sure the IDI has done its due diligence. On the other hand, talking of "-washing," as I read these stories I can't stop my mind drifting to some of Netflix's fascinating documentaries on the process of money laundering.
Thoughts or further questions on this post or any self-publishing issue?
If you’re an ALLi member, head over to the SelfPubConnect forum for support from our experienced community of indie authors, advisors, and our own ALLi team. Simply create an account (if you haven’t already) to request to join the forum and get going.
Non-members looking for more information can search our extensive archive of blog posts and podcast episodes packed with tips and advice at ALLi's Self-Publishing Advice Center.