UPDATE: This article was updated on 2/17/23 to include comments from a Microsoft spokesperson.
As new artificial intelligence chatbots and search functions launch, a pressing question is emerging for subscription publishers: Is AI poaching paywalled information?
In recent weeks, tech giants Google and Microsoft both announced their intent to scrape information from the web and directly answer users’ questions using generative AI, rather than directing them to third-party websites.
Staffers at multiple publishers told Toolkits their technology and search engine optimization teams are now scrambling to understand if and how paywalled content is being used by AI chatbots, whether publishers can prevent chatbots from weaving their original reporting and information into responses, and – ultimately – if it’s even worth trying to stop them.
If adopted widely by consumers, AI chat functions will significantly reframe the relationship between publishers and search providers and could have meaningful implications for publishers’ business models. Although the prospect of audiences tapping AI chatbots for information instead of visiting websites is concerning for publishers across the board, it could pose a more immediate and pointed threat for publishers attempting to charge for access to digital content.
Many publishers’ subscription products are predicated on the idea that they provide paying customers with exclusive reporting, analysis, data and other information that isn’t freely available elsewhere. If audiences can easily gain access to the same information for free via AI chatbots, however, convincing them to pay publishers directly could become more challenging.
The potential for content “leakage” via chatbots could concern some publishers more than others, depending on the nature of their subscription products and content. Publishers offering highly specialized or entirely unique information, for example, may view it as a more significant issue than general news publishers do. Professional audiences often pay hefty prices to access proprietary data and research but might be less likely to do so if key data points and pieces of information are readily accessible via AI chatbots.
A lack of clarity
Publishers are well aware that search engines will surface portions of their content in their results pages and other products, even if they’re not always happy about it. But chatbots promise to take this a step further by providing users with complete answers to queries.
Publishers say they’re unclear if and how their content is currently being used to power AI chatbots, and that companies including Google, Microsoft and ChatGPT developer OpenAI have not been forthcoming with specifics. In short: they don’t know whether their content is being used to train AI services and, if it is, they’re not sure how to opt out.
When asked directly about the source material it draws from, ChatGPT says it “does not have the ability to bypass paywalls or access restricted content.” Nevertheless, some publishers believe that paywalled information is showing up directly in AI responses provided by the latest version of Microsoft’s Bing search engine, which Microsoft says is powered by technology from OpenAI, the developer of ChatGPT.
Belgian news site De Standaard, for example, recently published an article suggesting that Bing’s technology had dipped into subscriber-only material on its site to inform responses to queries about Belgian politician Paul Magnette. Wired conducted similar experiments and found that Bing’s chatbot returned information from the websites of The New York Times and The Wall Street Journal that is typically locked behind paywalls in some capacity.
A Microsoft spokesperson told Toolkits that Bing’s chat function only draws from content that publishers have opted to make visible to it, but that Microsoft is now in conversations with publishers and plans to update its technical guidelines.
However, the spokesperson declined to comment on whether all content that’s made accessible to Bing’s crawlers is eligible to appear in chatbot responses, or whether Microsoft will enable publishers to control what information from their sites might be surfaced in chatbot responses specifically. “As the new Bing experience is currently in preview these conversations are beginning and we’ll have more to share over time,” the spokesperson said.
Google has acknowledged that its Bard AI, which has yet to be released, has been trained using content from websites. It remains unclear exactly what content was used and how and when it was accessed, prompting some to suggest the company has been “purposely vague.” It also remains unclear what web content could be used to inform Bard AI’s responses after a full launch.
Crawling for chatbots
Some SEO experts said that they assume chatbots will initially be informed by the same data collection methods currently used to power search engines. Publishers can ostensibly control which portions of their content are made visible to search engines for potential inclusion in search results while flagging other content as off-limits.
If Google or Microsoft begin surfacing information from publishers’ sites in AI-driven chat responses, publishers might be able to opt out by restricting crawler access to their sites. Doing so, though, would presumably reduce the traffic search engines send to their sites.
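To make the opt-out mechanics concrete, here is a minimal sketch in Python. It assumes the restriction is expressed through a robots.txt file, which is one common way sites signal crawler permissions but is not named by the sources quoted here. The rules, the `/premium/` path, and the `can_crawl` helper are hypothetical and purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher. The "/premium/" path and
# the decision to single out one crawler are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /premium/

User-agent: *
Disallow:
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("bingbot", "https://example.com/premium/story"),
        ("bingbot", "https://example.com/news/story"),
        ("Googlebot", "https://example.com/premium/story"),
    ]:
        print(agent, url, can_crawl(ROBOTS_TXT, agent, url))
```

Under these example rules, the named crawler is refused anything under the hypothetical `/premium/` path while other crawlers remain unrestricted. A rule like this applies to everything that crawler does, so it cannot by itself separate ordinary search indexing from use in chatbot responses.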
The crawling situation becomes more nuanced when it comes to paywalled content. To help search engines evaluate and index subscriber-only content, many publishers currently allow search engine crawlers to hop their paywalls and “read” subscriber-only content, while blocking access for non-subscribing readers and other web services.
Those permissions may also allow search providers’ AI chatbots to pull information from paywalled content because publishers have “opted to make it visible” to those companies.
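The arrangement described above can be sketched roughly as follows. This is an illustration under stated assumptions, not any publisher’s actual implementation: the crawler token list, the `Visitor` type, and the `select_article_view` function are all hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical allowlist of search crawler tokens (an assumption for this sketch).
ALLOWED_CRAWLER_TOKENS = ("googlebot", "bingbot")


@dataclass
class Visitor:
    """Minimal stand-in for an incoming request."""
    user_agent: str
    is_subscriber: bool = False


def is_allowed_crawler(user_agent: str) -> bool:
    """Check whether the User-Agent string contains an allowlisted crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in ALLOWED_CRAWLER_TOKENS)


def select_article_view(visitor: Visitor) -> str:
    """Return "full" for subscribers and allowlisted crawlers, else "teaser"."""
    if visitor.is_subscriber or is_allowed_crawler(visitor.user_agent):
        return "full"
    return "teaser"


if __name__ == "__main__":
    crawler = Visitor("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)")
    reader = Visitor("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    print(select_article_view(crawler))
    print(select_article_view(reader))
```

Matching on the User-Agent string alone is easy to spoof, so a real deployment would typically verify crawlers by other means as well. The point of the sketch is that the full text goes to the search provider’s crawler, and nothing in that exchange indicates whether the text will feed a search index, a chatbot, or both.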
Publishers seek answers
Publishing executives interviewed for this piece expressed concern about how their content is being used by AI chatbots, but asked not to be named publicly. That’s mainly because they have more questions than answers at this point, and say they’re largely in the dark as to how their businesses could be affected.
Meanwhile, publishing trade groups are raising concerns on behalf of their members. “Unless there’s a specific agreement in place, there’s just really no revenue coming back to news publications. And it is highly problematic for our industry,” Danielle Coffey, executive vice president and general counsel at News Media Alliance, told Wired.
Publishers said they now intend to seek answers from Google, Microsoft, and operators of other AI chatbot services before evaluating how to proceed. Depending on how those companies intend to use their content, some publishers said they would consider controlling access to it much more closely – particularly to information that’s intended for paying subscribers only.
Others are taking a more defiant stance. Australian satire website The Chaser quipped last week that it was putting up a paywall specifically to prevent its content from becoming training material for AI chatbots. “One thing we don’t have to do is feed these learning machines our content for free,” the site’s editor wrote in a post explaining the decision.
Most publishers are unlikely to take such drastic action. Chatbot operators will still want access to publishers’ content, and will most likely appease publishers by continuing to drive traffic to their sites in some capacity.
In an interview with The Verge, Microsoft CEO Satya Nadella acknowledged that publishers would likely attempt to limit Bing’s access to their content if Microsoft ceases sending traffic to their sites. “Our bots are not going to be allowed to crawl if we are not driving traffic,” he said.
For now, publishers are left seeking answers and say they expect greater transparency from Microsoft, Google and other chatbot operators in the weeks ahead.