
A new web crawler launched by Meta last month is quietly scraping the internet for AI training data

By Kali Hays
August 20, 2024, 6:59 PM ET
Meta CEO Mark Zuckerberg is betting big on AI. Photo: Jason Henry, Bloomberg/Getty Images

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although this crawler, dubbed Facebook External Hit, “has been used for different purposes over time, like sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July).

Flying under the radar

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

To attempt to block a web scraper, a website must publish robots.txt, a plain-text file served from the site’s root that signals to a scraper bot which parts of the site it should ignore. Typically, however, the specific name of a scraper bot needs to be listed in the file for the rule to apply to that bot. That’s difficult to accomplish if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt; it is not enforceable or legally binding in any way.
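
As a rough illustration of how a per-bot robots.txt rule works, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The user-agent token `meta-externalagent` is an assumption (this article gives only the crawler's display name, Meta External Agent), and the `is_allowed` helper, the `OtherBot` name, and the `example.com` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents. The "meta-externalagent" token is an
# assumption for illustration; check the operator's own documentation
# for the exact name a crawler announces.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/article"  # hypothetical URL
    for bot in ("GPTBot", "meta-externalagent", "OtherBot"):
        print(bot, "allowed:", is_allowed(ROBOTS_TXT, bot, page))
```

A bot that is not named in the file (here, the hypothetical `OtherBot`) is allowed by default, which is why an undisclosed crawler name makes blocking harder. The check is also purely advisory: it only reports what the file requests, and nothing stops a crawler from skipping it.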

Such scrapers are used to pull mass amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, the initial version of the model used large datasets put together by other sources, like Common Crawl.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

The existence of the new crawler suggests Meta’s vast trove of data may no longer be enough, however, as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need fresh, high-quality training data to keep improving in functionality. Meta is on track to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

