Could Reddit's data be "poisoned" to prevent its use in training AI?

nodsocket@lemmy.world · edit-2 9 months ago

Could Reddit's data be "poisoned" to prevent its use in training AI?

FaceDeer@kbin.social · 9 months ago

Reddit’s surely got a copy of the PushShift archives, so it’ll have all the pre-sabotage versions of those comments.

Lvxferre@mander.xyz · 9 months ago

The PS archives are publicly available. If either OpenAI or Google were to use them, they wouldn’t pay Reddit Inc. a single penny; and yet Google is paying it 60 million dollars to do so. This means that there’s content they cannot retrieve through the PS archives that would still be valuable as LLM data.

FaceDeer@kbin.social · 9 months ago

They’re paying Reddit to not sue them.

Regardless, the content that’s available through PS is the content that people are talking about overwriting or deleting. They can’t edit or delete stuff that PushShift couldn’t see in the first place.

Lvxferre@mander.xyz · 9 months ago

> They’re paying Reddit to not sue them.

Given how many defences Google would have against that ant called Reddit suing it, ranging from actual fair points to “ackshyually”, I find it unlikely.

> Regardless, the content that’s available through PS is the content that people are talking about overwriting or deleting. They can’t edit or delete stuff that PushShift couldn’t see in the first place.

Emphasis mine. Can you back up this claim?

I’m asking this because the content from PS only goes up to March 2023; it’s literally a year old. There has been a lot of activity on Reddit in the meantime, and my impression is that the people talking about this are the ones who already erased their content during the APIcalypse, but kept using Reddit because some subject they’d like to use is “stuck” there.