• Home
  • Latest
  • Fortune 500
  • Finance
  • Tech
  • Leadership
  • Lifestyle
  • Rankings
  • Multimedia

Trendingnow

1

Gen Zers are arriving at college unable to even read a sentence—professors warn it could lead to a generation of anxious and lonely graduates

2

'The golden years are not golden': Boomers are hoarding most of America's wealth and power because they're terrified of outliving their money

3

'We didn’t see this coming': Wall Street eats its forecasts as stocks sell off globally on fear of AI bubble ahead of SpaceX IPO

1

Gen Zers are arriving at college unable to even read a sentence—professors warn it could lead to a generation of anxious and lonely graduates

2

'The golden years are not golden': Boomers are hoarding most of America's wealth and power because they're terrified of outliving their money

3

'We didn’t see this coming': Wall Street eats its forecasts as stocks sell off globally on fear of AI bubble ahead of SpaceX IPO
TechMeta

A new web crawler launched by Meta last month is quietly scraping the internet for AI training data

By
Kali Hays
Kali Hays
Down Arrow Button Icon
By
Kali Hays
Kali Hays
Down Arrow Button Icon
August 20, 2024, 6:59 PM ET
Meta CEO Mark Zuckerberg is betting big on AI.
Meta CEO Mark Zuckerberg is betting big on AI.Jason Henry—Bloomberg/Getty Images

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although this crawler—dubbed Facebook External Hit—”has been used for different purposes over time, like sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesman said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”    

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July).

Flying under the radar

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

In order for a website to attempt to block a web scraper, it must deploy robots.txt, a line of code added to a codebase, in order to signal to a scraper bot that it should ignore that site’s information. However, typically the specific name of a scraper bot needs to be added as well in order for robots.txt to be respected. That’s difficult to accomplish if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt – it is not enforceable or legally binding in any way. 

Such scrapers are used to pull mass amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, its initial version of the model, used large datasets put together by other sources, like Common Crawl.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

The existence of the new crawler suggests Meta’s vast trove of data may no longer be enough, however, as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need new and quality training data to keep improving in functionality. Meta is on track to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

Are you a Meta employee or someone with insight or a tip to share? Contact Kali Hays securely through Signal at +1-949-280-0267 or at kali.hays@fortune.com.

About the Author
By Kali Hays
See full bioRight Arrow Button Icon

Latest in Tech

Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025

Most Popular

Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Fortune Secondary Logo
Rankings
  • 100 Best Companies
  • Fortune 500
  • Global 500
  • Fortune 500 Europe
  • Most Powerful Women
  • World's Most Admired Companies
  • See All Rankings
  • Lists Calendar
Sections
  • Finance
  • Fortune Crypto
  • Features
  • Leadership
  • Health
  • Commentary
  • Success
  • Retail
  • Mpw
  • Tech
  • Lifestyle
  • CEO Initiative
  • Asia
  • Politics
  • Conferences
  • Europe
  • Newsletters
  • Personal Finance
  • Environment
  • Magazine
  • Education
Customer Support
  • Frequently Asked Questions
  • Customer Service Portal
  • Privacy Policy
  • Terms Of Use
  • Single Issues For Purchase
  • International Print
Commercial Services
  • Advertising
  • Fortune Brand Studio
  • Fortune Analytics
  • Fortune Conferences
  • Business Development
  • Group Subscriptions
About Us
  • About Us
  • Press Center
  • Work At Fortune
  • Terms And Conditions
  • Site Map
  • About Us
  • Press Center
  • Work At Fortune
  • Terms And Conditions
  • Site Map
  • Facebook icon
  • Twitter icon
  • LinkedIn icon
  • Instagram icon
  • Pinterest icon

Latest in Tech

Brian Schimpf gestures with both hands as he speaks on stage.
Startups & VentureBrainstorm Tech
Anduril CEO Brian Schimpf says economic warfare is the ‘new normal’ for military conflicts—and the U.S. needs to get serious
By Lily Mae LazarusJune 8, 2026
9 minutes ago
Pentagon accuses Alibaba, Baidu and BYD, three of China’s biggest companies, of supporting the Chinese military
AsiaAlibaba Group
Pentagon accuses Alibaba, Baidu and BYD, three of China’s biggest companies, of supporting the Chinese military
By Kate O'Keeffe and BloombergJune 8, 2026
60 minutes ago
Twitch CEO: Social media has become ‘anti-social’ and can’t match the shared, human connection of live streaming
Big TechBrainstorm Tech
Twitch CEO: Social media has become ‘anti-social’ and can’t match the shared, human connection of live streaming
By Sebastian HerreraJune 8, 2026
1 hour ago
Two men sitting on chairs on a stage
Future of WorkBrainstorm Tech
Your career needs a ‘gym membership’ to keep up with continuous AI advancements, says Campus founder Tade Oyerinde
By Amanda GerutJune 8, 2026
1 hour ago
ChatGPT maker OpenAI confidentially files for IPO, a week after Anthropic
Startups & VentureOpenAI
ChatGPT maker OpenAI confidentially files for IPO, a week after Anthropic
By Bloomberg, Shirin Ghaffary and Bailey LipschultzJune 8, 2026
2 hours ago
Anthropic’s Boris Cherny, creator of Claude Code, says there are days he manages tens of thousands of AI agents at once
AIBrainstorm Tech
Anthropic’s Boris Cherny, creator of Claude Code, says there are days he manages tens of thousands of AI agents at once
By Sharon GoldmanJune 8, 2026
3 hours ago

Most Popular

Gen Zers are arriving at college unable to even read a sentence—professors warn it could lead to a generation of anxious and lonely graduates
Success
Gen Zers are arriving at college unable to even read a sentence—professors warn it could lead to a generation of anxious and lonely graduates
By Preston ForeJune 7, 2026
1 day ago
'The golden years are not golden': Boomers are hoarding most of America's wealth and power because they're terrified of outliving their money
Economy
'The golden years are not golden': Boomers are hoarding most of America's wealth and power because they're terrified of outliving their money
By Nick LichtenbergJune 7, 2026
2 days ago
'We didn’t see this coming': Wall Street eats its forecasts as stocks sell off globally on fear of AI bubble ahead of SpaceX IPO
Economy
'We didn’t see this coming': Wall Street eats its forecasts as stocks sell off globally on fear of AI bubble ahead of SpaceX IPO
By Jim EdwardsJune 8, 2026
14 hours ago
Trump stunned as stocks fall on great jobs report. Barclays explains why ‘we are entering the warning zone'
Big Tech
Trump stunned as stocks fall on great jobs report. Barclays explains why ‘we are entering the warning zone'
By Eva RoytburgJune 7, 2026
1 day ago
SpaceX's IPO will also be a massive selling event triggering big price dislocations across the stock market as investors dump shares to buy SPCX
Investing
SpaceX's IPO will also be a massive selling event triggering big price dislocations across the stock market as investors dump shares to buy SPCX
By Jason MaJune 7, 2026
1 day ago
Current price of oil as of June 8, 2026
Personal Finance
Current price of oil as of June 8, 2026
By Joseph HostetlerJune 8, 2026
11 hours ago

© 2026 Fortune Media IP Limited. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | CA Notice at Collection and Privacy Notice | Do Not Sell/Share My Personal Information
FORTUNE is a trademark of Fortune Media IP Limited, registered in the U.S. and other countries. FORTUNE may receive compensation for some links to products and services on this website. Offers may be subject to change without notice.