
KaliYoni

macrumors 68000
Original poster
Feb 19, 2016
1,734
3,829
According to this Washington Post analysis, the forums on MacRumors are part of a Google-created dataset that is used to train AI products:

“…we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA”
(forums.macrumors.com is listed under the sources for Technology, as the #4 site)

So, I’d say anybody here who is highly concerned about privacy or does not want their future posts used to train AIs should review how they use MacRumors’ forums.
 

Apple_Robert

Contributor
Sep 21, 2012
34,599
50,286
In the middle of several books.
I see that Reddit wants to be paid for "excessive" use of its content (via API).

My personal opinion is that there should be pushback against this sort of thing; for example my own posts were made with the expectation that they would be readable by the public, but not with the expectation that they would be copied wholesale into another system.
Seeing how we are on a public forum on the internet, there can be no expectation of privacy.
 

Scepticalscribe

macrumors Haswell
Jul 29, 2008
64,229
46,662
In a coffee shop.
Seeing how we are on a public forum on the internet, there can be no expectation of privacy.

Perhaps.

Although I would dispute the "no expectation" - as this depends on circumstances - and I would imagine that this may well change in the future.

And I would argue that there can be an expectation that while one's own posts can be readable on this forum - which is what we did indeed sign up for - there is no expectation (let alone permission) - as @Nermal has already pointed out - that they would - or could - be copied wholesale into another system.
 

Nermal

Moderator
Staff member
Dec 7, 2002
20,682
4,117
New Zealand
And I would argue that there can be an expectation that while one's own posts can be readable on this forum - which is what we did indeed sign up for - there is no expectation (let alone permission) - as @Nermal has already pointed out - that they would - or could - be copied wholesale into another system.
Indeed, and my comment was more from a copyright perspective than a privacy one.
 

KaliYoni

macrumors 68000
Original poster
Feb 19, 2016
1,734
3,829
Now that I've thought about this more, my main concerns are:
  • A third party (parties?) has archived our posts and is using our thoughts and words to generate derivative content without any ability to opt-in or opt-out.
  • Did MacRumors know about this? Did Google ask permission?
  • How else are our posts being used by organizations not connected to MacRumors without our knowledge? Not all AIs are used to simulate human writing like Google Bard or OpenAI's ChatGPT. For example, what if a hacker group is training an identity theft AI? Or a state-sponsored intelligence agency is building an AI whose purpose is to find and track citizens who are living abroad?
I've always assumed that anything we post here would be crawled by search engines and made available in searches. But for me, having the entire MacRumors forums corpus hoovered up, archived, and repeatedly used to train black-box AI processes goes way beyond that.
 

Scepticalscribe

macrumors Haswell
Jul 29, 2008
64,229
46,662
In a coffee shop.
Now that I've thought about this more, my main concerns are:
  • A third party (parties?) has archived our posts and is using our thoughts and words to generate derivative content without any ability to opt-in or opt-out.
  • Did MacRumors know about this? Did Google ask permission?
  • How else are our posts being used by organizations not connected to MacRumors without our knowledge? Not all AIs are used to simulate human writing like Google Bard or OpenAI's ChatGPT. For example, what if a hacker group is training an identity theft AI? Or a state-sponsored intelligence agency is building an AI whose purpose is to find and track citizens who are living abroad?
I've always assumed that anything we post here would be crawled by search engines and made available in searches. But for me, having the entire MacRumors forums corpus hoovered up, archived, and repeatedly used to train black-box AI processes goes way beyond that.
Thank you for starting this thread and for raising this topic.

This subject is something I hadn't at all been aware of, let alone given any thought to - and, following your thread, I realised that the Guardian are also covering this story.

Your questions are very good ones, timely and apt and necessary.

As with so much else in the tech revolution, the extraordinary pace of the advances means that our attempts to deal with their consequences and effects are always lagging several steps behind.

But that is no reason not to try to address this.
 
  • Like
Reactions: heretiq

KaliYoni

macrumors 68000
Original poster
Feb 19, 2016
1,734
3,829
Thank you for starting this thread and for raising this topic.

This subject is something I hadn't at all been aware of, let alone given any thought to - and, following your thread, I realised that the Guardian are also covering this story.

Your questions are very good ones, timely and apt and necessary.

As with so much else in the tech revolution, the extraordinary pace of the advances means that our attempts to deal with their consequences and effects are always lagging several steps behind.

But that is no reason not to try to address this.

I’m glad others here, particularly somebody who takes the time to write clearly and thoughtfully, are thinking about this issue too. In many ways, what’s happening with AI technology right now seems like the period when social media companies first were able to integrate all the necessary components that enabled them to become financial, social, and political juggernauts. And of course, as has been the case with humans throughout history, it looks like a lot of the lessons learned earlier are being ignored now.

Is this the Guardian article you saw?
 

laptech

macrumors 68040
Apr 26, 2013
3,637
4,025
Earth
There is a misconception that the internet is a 'public place'. It is not, regardless of what anyone or any expert says. The internet is made up of a lot of private companies, private businesses and private individuals who 'allow' others to see their work for free. What they provide still belongs to them and thus permission must be granted if others want to use it. You cannot just come in, hoover up what you want and then claim 'it's public therefore I can do as I wish'.

These private companies, private businesses and private individuals make their stuff freely available to the public, but that does not mean the internet is public domain. Google is wrong in doing what it is doing, and thus website owners should complain to Google that it does not have the right to take stuff from their websites.
 
  • Like
Reactions: decafjava

maflynn

macrumors Haswell
May 3, 2009
73,633
43,637
So, I’d say anybody here who is highly concerned about privacy or does not want their future posts used to train AIs should review how they use MacRumors’ forums.
How is this any different than what Google already does (track and monetize) now?

There is a misconception that the internet is a 'public place'. It is not, regardless of what anyone or any expert says. The internet is made up of a lot of private companies, private businesses and private individuals who 'allow' others to see their work for free.
So what you're saying is the information from private companies is made public and free, i.e., the internet is a public place.

Google is wrong in doing what it is doing
So then, Google and every other search engine should shut down, and sites like the Internet Archive should also be shut down (https://archive.org/web/).
 
  • Like
Reactions: ixxx69

laptech

macrumors 68040
Apr 26, 2013
3,637
4,025
Earth
Because MR is heavily biased towards everything Apple, does this mean that every time someone asks the AI an Apple-related question, it is always going to respond favorably towards Apple?

It would be such a good way for Apple to get the AI loaded in its favor.
 

laptech

macrumors 68040
Apr 26, 2013
3,637
4,025
Earth
I wonder if this will turn out to be another Cambridge Analytica-type scandal, where freely available information is harvested for the financial gain of others but without the consent of those who provided it. Just because information is freely available does not make it free for others to use as they wish.
 
  • Like
Reactions: PhoenixDown

Abazigal

Contributor
Jul 18, 2011
19,754
22,346
Singapore
Is it possible for a website to opt out of being scanned in this manner? But then again, it’s difficult to request to not be a part of something you were never aware of in the first place.
 

maflynn

macrumors Haswell
May 3, 2009
73,633
43,637
Is it possible for a website to opt out of being scanned in this manner? But then again, it’s difficult to request to not be a part of something you were never aware of in the first place.
When you sign up to use Google, you sign away the rights to your data and as such they are free to use it as they wish. I don't know if that's the same when you use Google's AdSense and interact with Google Search, but I suspect that Google has a lot of really good lawyers to ensure that if you do business with them, then they gain access to your website.
 

Abazigal

Contributor
Jul 18, 2011
19,754
22,346
Singapore
When you sign up to use Google, you sign away the rights to your data and as such they are free to use it as they wish. I don't know if that's the same when you use Google's AdSense and interact with Google Search, but I suspect that Google has a lot of really good lawyers to ensure that if you do business with them, then they gain access to your website.

This makes me wonder if it is possible for a web service provider to make it so that content hosted on their servers can’t be scanned for the purpose of improving LLMs. Something like ATT, but for online content. Either nothing of value is obtained or the data extracted is rubbish.

This thought was in part inspired by another thread which mentioned how far behind Siri was in comparison to ChatGPT, and it made me wonder whether, instead of trying to catch up with the competition, Apple might find a way to hobble their progress in the name of privacy.
 

maflynn

macrumors Haswell
May 3, 2009
73,633
43,637
This makes me wonder if it is possible for a web service provider to make it so that content hosted on their servers can’t be scanned for the purpose of improving LLMs. Something like ATT, but for online content. Either nothing of value is obtained or the data extracted is rubbish.
I don't believe there is a way to distinguish a crawl/data scrape that will be used for LLMs vs. search engine indexing. You can block Google from crawling your site but that stops them completely, no index, no nothing.
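
One caveat on blocking: robots.txt rules are written per crawler name, so a site can turn away one named crawler while leaving others alone, although it cannot control what a crawler's operator later does with pages it was allowed to fetch. Below is a minimal sketch using Python's standard urllib.robotparser. It assumes the crawler to exclude is Common Crawl's CCBot (C4 was derived from Common Crawl data); the robots.txt contents and the forum URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away CCBot, allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Made-up URL standing in for a forum thread.
url = "https://forums.example.com/threads/some-thread/"

ccbot_allowed = parser.can_fetch("CCBot", url)
googlebot_allowed = parser.can_fetch("Googlebot", url)

print("CCBot allowed:", ccbot_allowed)
print("Googlebot allowed:", googlebot_allowed)
```

Note that robots.txt is a voluntary convention: it only affects crawlers that choose to honor it, and it does nothing about copies collected before the rule was added.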
 

Abazigal

Contributor
Jul 18, 2011
19,754
22,346
Singapore
I don't believe there is a way to distinguish a crawl/data scrape that will be used for LLMs vs. search engine indexing. You can block Google from crawling your site but that stops them completely, no index, no nothing.
Shucks, so much for that idea then. Thank you for taking the time to respond. :)
 

laptech

macrumors 68040
Apr 26, 2013
3,637
4,025
Earth
The problem with the internet is that EVERY company, business or individual that has a presence on the internet has the view 'What is mine is mine and what is on the internet is mine'. Why is this so? As soon as you open your web browser, your email, your IP, the name of the web browser you use, which website you entered and which website you went to when you left, the time you arrived and the time you left are all collected by the websites you visit. ALL these entities feel they have an automatic right to our information and thus they collect it. Even MR collects this type of information and they ALL do it automatically. Some countries have introduced laws to curb this data collection, so websites there have to show 'consent' windows when a person visits, but not every country does this, and people in the countries that do not need to ask themselves why their information is being collected without their consent.

Google most probably created crawler software whose purpose is to trawl the internet for 'free' content, harvest it and then report back to Google so Google can feed it into their AI creation.
 
  • Like
Reactions: heretiq