A new web crawler launched by Meta last month is quietly scraping the internet for AI training data

By Kali Hays
August 20, 2024, 6:59 PM ET
Meta CEO Mark Zuckerberg is betting big on AI. (Jason Henry, Bloomberg/Getty Images)

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI models.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although this crawler, dubbed Facebook External Hit, “has been used for different purposes over time, like sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July).

Flying under the radar

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

To attempt to block a web scraper, a website must publish robots.txt, a plain-text file placed at the root of the site that signals to crawlers which parts of the site they should ignore. Typically, however, the specific name of a scraper bot needs to be listed in the file for that bot to be targeted. That’s difficult to accomplish if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt; compliance is voluntary, and the file is not enforceable or legally binding in any way.
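As a rough illustration of how robots.txt rules name individual bots, the sketch below uses Python’s standard urllib.robotparser module to check which crawlers a sample file would turn away. The user-agent token meta-externalagent, the example.com URL, and the is_allowed helper are assumptions made for illustration; site owners should confirm the exact token against Meta’s developer documentation. The check also only shows what a compliant crawler would do, since nothing forces a bot to honor the file.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. "meta-externalagent" is an assumed token for
# Meta's new crawler; "GPTBot" is the name the article gives for OpenAI's bot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # hypothetical URL
    for bot in ("GPTBot", "meta-externalagent", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

A bot whose name does not appear in the file falls through to the wildcard group, which is why an undisclosed crawler name makes targeted blocking hard.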

Such scrapers are used to pull mass amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, the initial version of the model used large datasets put together by other sources, like Common Crawl.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

The existence of the new crawler suggests Meta’s vast trove of data may no longer be enough, however, as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need fresh, high-quality training data to keep improving in functionality. Meta is on track to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

Are you a Meta employee or someone with insight or a tip to share? Contact Kali Hays securely through Signal at +1-949-280-0267 or at kali.hays@fortune.com.

© 2026 Fortune Media IP Limited. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | CA Notice at Collection and Privacy Notice | Do Not Sell/Share My Personal Information
FORTUNE is a trademark of Fortune Media IP Limited, registered in the U.S. and other countries. FORTUNE may receive compensation for some links to products and services on this website. Offers may be subject to change without notice.