Fiosracht - 404 Media

Automattic May Start Selling Users' Data to Train AI Tools

404 Media published a concerning report, based on internal documents obtained from Automattic, that the company is preparing to sell user data to Midjourney and OpenAI. Automattic is the parent company of WordPress and Tumblr.

This blog is hosted on WordPress.com. I'm going to have to see if there is an opt-out option and read the terms and conditions attached to it. If that option is available, I would hope that being opted out is the default. People should be able to opt in if they want to. Subterfuge shouldn't need to be used.

A concern raised in the report is that when compiling a data dump from Tumblr for Midjourney/OpenAI, Cyle Gage (a product manager at Tumblr) stated that some data was included that shouldn't have been, such as:

private posts on public blogs

posts on deleted or suspended blogs

unanswered asks (normally these are not public until they’re answered)

private answers (these only show up to the receiver and are not public)

posts that are marked ‘explicit’ / NSFW / ‘mature’ by our more modern standards (this may not be a big deal, I don’t know)

content from premium partner blogs (special brand blogs like Apple’s former music blog, for example, who spent money with us on an ad campaign) that may have creative that doesn’t belong to us, and we don’t have the rights to share with this-parties; this one is kinda unknown to me, what deals are in place historically and what they should prevent us from doing.

Tumblr and Wordpress to Sell Users’ Data to Train AI Tools (Sam Cole/404 Media)

The benefit of having my own site is that I can move if I feel like I need to. I'll have to consider other options, whether that's moving to a new platform like Ghost or finding another hosting service.

It is disappointing to see Automattic moving in this direction. They have described themselves as the guardians of the open web, but this decision will have people considering whether to remove their Tumblrs or blogs to avoid them being included in a training set for a large language model.

The promise of the open web was that it allowed people to connect with each other in a new way. As Gita Jackson wrote:

The internet has been broken in a fundamental way. It is no longer a repository of people communicating with people; increasingly, it is just a series of machines communicating with machines.
The Internet Is Full of AI Dogshit (Gita Jackson/Aftermath)

This decision by Automattic, if the report is true, will make this problem worse in the short term. There's no guarantee that it will improve in the medium to long term either. Companies like OpenAI have made great promises of progress in the past only to renege on them when it suited them. Unfortunately, I have little faith that this will be any different.

I could be wrong. I hope that I am.

Yes, Google Results Have Gotten Worse

404 Media reported on a study published by German researchers from Leipzig University, Bauhaus-University Weimar, and the Center for Scalable Data Analytics and Artificial Intelligence titled "Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines".

Google isn't the only search engine dealing with this issue. Jason Koebler writes:

Notably, Google, Bing, and DuckDuckGo all have the same problems, and in many cases, Google performed better than Bing and DuckDuckGo by the researchers' measures.
Google Search Really Has Gotten Worse, Researchers Find (Jason Koebler/404 Media)

The research does highlight how much damage search engine optimization (SEO) has done to the ecosystem of the internet. The release of generative AI is only going to make the problem worse. Amazon is dealing with product titles and reviews being generated using ChatGPT.

David Roth had a good piece on Defector about the promises made by the developers and boosters of AI and its actual use in the present day.

One reason it is not very interesting is that everything they have touted as the future of some essential human thing or other—the future of art, or money—has mostly crashed out in ways that left behind very little useful residue. Another is that the ways in which AI is used in the present, by your lower-effort plagiarists and scammers, are so manifestly not the future of anything that works, but rather both the present and the future of shitting-up web search results, which is roughly analogous to saying that robocalls about homeowners insurance are the future of human communication.
The Future Of E-Commerce Is A Product Whose Name Is A Boilerplate AI-Generated Apology (David Roth/Defector)