Maintaining Truth in a Generative AI Era
Earlier this month, I participated in an event organized by Authors Alliance at UC Berkeley focused on DMCA 1202 and Generative AI Attribution Standards. Gathered in the Seaborg Room of the Faculty Club (named after Glenn T. Seaborg, winner of the 1951 Nobel Prize in Chemistry), a select group of academic legal experts, legal counsel and engineers from global technology companies, university publishers, and librarians discussed how the latest generative AI systems might provide attribution to sources in their outputs - e.g., you query a service like Claude or ChatGPT and it provides meaningful (a term much debated) attribution to sources substantiating the response to your prompt. TLDR: citations!
Throughout the day my eyes drifted not infrequently to the windows and the faculty glade just outside. The dendritic form of the trees was a comfort - a sort of subliminal map for the forks and branches of discussion throughout the day.
My primary interest in the meeting was to advocate for systems and services that can effectively support research and learning in the public interest. This positionality is anchored by my role as a senior research library administrator working at a public research university. I could dig into this positionality more, but I think another participant summed up the position well when they said their primary interest in the meeting was the maintenance of truth in a generative AI era.
In what follows, I share some thoughts from the meeting in the hope that they help colleagues working on related challenges.
On Attribution and Research in The Public Interest
By research in the public interest, I refer to the production of research that is intended to widely benefit the public. University research benefits the public in myriad ways - e.g., increasing crop yields, informing civic policy, expanding historical understanding, and forming the basis of next-generation medical treatments. I choose a public interest framing in order to dispel the notion that requiring generative AI systems to provide attribution is an edge-case preference rather than a requirement fundamental to the ongoing progress of society.
Simply put, we cannot make meaningful, research-informed advances without the ability to thoroughly evaluate arguments. Evaluating research-based arguments depends fundamentally on attribution. Without attribution, research is not research; it is opinion, and opinion is not viable ground for advancing society.
Research in the public interest requires attribution.
On Biting the Hand That Feeds You or a New Form of Symbiosis
Generative AI systems improve their performance by training on published content. The routes that generative AI companies take to acquire published content are many - e.g., scraping the web en masse, making use of so-called pirate libraries, striking licensing deals with publishers, digitizing books through partnerships with libraries, and rapidly acquiring and destructively scanning used books - see “Project Panama”.
Despite the importance of published materials to generative AI performance, these systems do not provide attribution to training data in prompt responses. At first glance, this looks like biting the hand that feeds you: the AI system breaks the provenance chain between training data and a prompt response. That break limits users' ability to discover the underlying published materials, and consequently causes source material providers (e.g., publishers, research libraries) to lose revenue and/or miss opportunities to connect users to high-quality sources for deeper investigation.
However, this bite could be the prelude to an emerging form of symbiosis between generative AI system developers and source material providers. The open source standard Model Context Protocol (MCP), recently donated by Anthropic to the Linux Foundation, is an interesting signal for us to examine. With MCP, a source material provider can expose their content to a generative AI system in real time, allowing users to query a trusted set of sources through the AI system while maintaining clear attribution - similar to how generative AI web search works (but less prone to hallucination) - and reaching collections beyond what web search can readily access (e.g., subscription content, specialized databases, content with other access restrictions).
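The attribution-preserving pattern MCP enables can be sketched without the SDK itself. Below is a minimal, hypothetical stand-in: the catalog, fields, and search logic are all invented for illustration, and a real MCP server would register a handler like this as a tool over the protocol. The point is the shape of the response - every result carries its provenance, so the model has something to cite.

```python
import json

# Hypothetical in-memory catalog standing in for a library collection.
CATALOG = [
    {"title": "Transuranium Elements", "author": "Glenn T. Seaborg",
     "url": "https://example.org/item/1"},
    {"title": "Attribution in Scholarly Communication", "author": "A. Librarian",
     "url": "https://example.org/item/2"},
]

def search_collection(query: str) -> str:
    """Toy tool handler: return matching items with citation metadata
    intact, mimicking what an MCP server tool would hand back."""
    hits = [item for item in CATALOG
            if query.lower() in item["title"].lower()]
    # Each hit keeps title/author/url, so attribution survives the round trip.
    return json.dumps({"query": query, "results": hits})

result = json.loads(search_collection("attribution"))
```

Because the tool's response is structured data rather than free text, the provenance chain from collection item to model output stays intact by construction.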
What kind of symbiosis will this turn out to be? Will it be mutualism (both benefit), commensalism (one benefits, one unaffected), or parasitism (one benefits, one harmed)? I believe that reaching the kind of relationship we want depends on us being active in the space.
Moving Forward
As a senior research library administrator, I left the event thinking that we must resource solutions that meet our need for attribution while also maintaining the health of the knowledge production ecosystem - a small lift, I know! Technical approaches that help us get there already exist, such as retrieval augmented generation (RAG) and the aforementioned emergent standard, MCP.
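To make the RAG route concrete, here is a minimal sketch with invented documents and a naive keyword retriever standing in for a real vector search; no model call is shown. What it demonstrates is how RAG supports attribution: the prompt is assembled from numbered, cited sources, so the model can point back to them.

```python
# Hypothetical documents; citations are invented for illustration.
DOCS = [
    {"id": "doc-1", "text": "Crop yields rose after soil-sensor adoption.",
     "citation": "Ag Research Quarterly (2024)"},
    {"id": "doc-2", "text": "Library consortia share cataloging costs.",
     "citation": "Journal of Library Administration (2023)"},
]

def retrieve(query: str, docs=DOCS):
    """Naive keyword overlap, standing in for embedding-based retrieval."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d["text"].lower().split())]

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt; numbered sources let the model cite [n]."""
    sources = retrieve(query)
    context = "\n".join(f"[{i+1}] {d['text']} ({d['citation']})"
                        for i, d in enumerate(sources))
    return (f"Answer using only the sources below, citing them as [n].\n"
            f"{context}\n\nQ: {query}")

prompt = build_prompt("What happened to crop yields?")
```

The design choice worth noting: attribution is enforced at prompt-construction time, not left to the model's memory of its training data.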
Research libraries are beginning to experiment with MCP in order to develop their own services that provide attribution-focused access to collections that they steward. Dan Cohen’s team at Northeastern University Library is doing fantastic work experimenting with connecting their collections via MCP to Claude.
If my sense-making is up to snuff, I gather that further research library investment in MCP experimentation is likely over multiyear timelines. It would be interesting to see research libraries experiment at the consortial level (e.g., the BTAA) to scale potential adoption to research libraries with varying resources.
As research libraries we should be excited about the bridge MCP represents while also planning effectively for the toll (e.g., staff time for implementation, ongoing maintenance, integration with existing systems). Venues like the Coalition for Networked Information, AI4LAM, and the Digital Library Federation provide key spaces for us to do that planning and experimentation together.
Maintaining truth is an evergreen rationale for research library collaboration.


