Cloudflare, the company that blocks crawlers for you, launches a "crawl a whole site in one call" API aimed at RAG, incremental updates, and model training

動區BlockTempo

Cloudflare launched a brand new /crawl endpoint for its Browser Rendering service on March 10th (currently in Open Beta). This new feature allows developers to crawl entire websites through a single API call, automatically converting content into HTML, Markdown, or structured JSON formats, providing a powerful and compliant tool for building AI training datasets and RAG (Retrieval-Augmented Generation) pipelines.

As generative AI and RAG adoption grows, collecting website data efficiently and within site owners' rules has become a major challenge for developers. In response, cloud infrastructure provider Cloudflare announced on March 10th a new feature for its Browser Rendering service: the /crawl API endpoint.

Currently in open beta, this feature aims to let developers “crawl an entire website with just one API call.”

Asynchronous operation supporting Markdown and structured JSON

According to Cloudflare’s announcement, the new crawler API operates asynchronously. Developers only need to submit a starting URL, and the system will return a Job ID, with the backend using a headless browser to automatically discover and render web pages. Developers can check crawl progress and results at any time via this ID.
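
The submit-then-poll flow described above can be sketched as follows. This is a minimal illustration, not Cloudflare's client: the status strings, the shape of the job dictionary, and the get_status callable are all assumptions made for the example. The HTTP call is injected as a function so the polling logic stays independent of the actual endpoint path and authentication.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical status names; the real API may report different values.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_status(job_id) until the job reports a terminal state.

    get_status stands in for whatever HTTP request fetches the job record;
    it is passed in so this loop makes no claim about the real URL or auth.
    """
    for _ in range(max_attempts):
        job = get_status(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        # Job still running: wait before asking again.
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

In practice get_status would wrap an authenticated GET request for the job, and the caller would read the crawled pages from the returned record once the status is terminal.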

To fit into existing AI development workflows, the API offers multiple output formats. Besides traditional HTML, it can directly output Markdown, the format large language models (LLMs) handle best, and structured JSON generated by Workers AI. This significantly reduces the time developers spend on data cleaning and format conversion.

Built as a "well-behaved" crawler that adheres to compliance and protection mechanisms

Unlike the many malicious crawlers that try to bypass protections, Cloudflare's /crawl endpoint emphasizes "compliance and transparency." Cloudflare states that the endpoint operates as a signed agent that strictly follows the target website's robots.txt directives (including crawl-delay limits) and respects Cloudflare's own "AI Crawl Control" settings.

Additionally, Cloudflare explicitly states that this tool "will identify itself as a robot" and cannot bypass Cloudflare's bot detection systems or CAPTCHA challenges. The design is meant to keep crawling from overriding website owners' wishes or overloading their servers.
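
To make the robots.txt behavior above concrete, here is a small sketch of what honoring Disallow rules and a crawl delay looks like, using Python's standard urllib.robotparser. It illustrates the general convention only and is not Cloudflare's implementation; the robots.txt content, the example-crawler user agent, and the check_url helper are made up for the example.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def check_url(robots_txt: str, user_agent: str, url: str) -> Tuple[bool, Optional[int]]:
    """Return (allowed, crawl_delay_seconds) for a URL under the given rules."""
    parser = RobotFileParser()
    # Parse rules from text instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


allowed, delay = check_url(ROBOTS_TXT, "example-crawler", "https://example.com/blog/post")
blocked, _ = check_url(ROBOTS_TXT, "example-crawler", "https://example.com/private/report")
```

A compliant crawler would skip any URL for which the first value is False and wait at least the returned delay between requests to the same site.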

Incremental crawling to save costs, free plans available for testing

To improve efficiency and reduce costs, the API includes several advanced control features:

  • Incremental crawling: Supports modifiedSince and maxAge parameters to automatically skip unchanged or recently crawled pages, saving computational resources.
  • Fine-grained scope control: Developers can customize crawl depth, page limits, and use wildcards to include or exclude specific URL paths.
  • Static mode: For static websites that don’t require JavaScript rendering, set render: false to skip headless browser startup, enabling ultra-fast crawling.
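
The controls listed above might combine into a single request body along these lines. This is a sketch under stated assumptions: modifiedSince, maxAge, and render are the names given in the article, while url, depth, limit, includePatterns, and excludePatterns are hypothetical field names chosen for illustration. The value formats (an ISO 8601 timestamp and a number of seconds) are also assumptions, so check Cloudflare's documentation before relying on any of them.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_crawl_request(
    start_url: str,
    modified_since: Optional[datetime] = None,
    max_age_s: Optional[int] = None,
    render: bool = True,
    depth: Optional[int] = None,
    limit: Optional[int] = None,
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble a JSON-serializable request body, omitting unset options."""
    body: Dict[str, Any] = {"url": start_url, "render": render}
    if modified_since is not None:
        # Assumed format: ISO 8601 in UTC. Pass a timezone-aware datetime.
        body["modifiedSince"] = modified_since.astimezone(timezone.utc).isoformat()
    if max_age_s is not None:
        body["maxAge"] = max_age_s  # assumed unit: seconds
    if depth is not None:
        body["depth"] = depth  # hypothetical field name
    if limit is not None:
        body["limit"] = limit  # hypothetical field name
    if include:
        body["includePatterns"] = include  # hypothetical field name
    if exclude:
        body["excludePatterns"] = exclude  # hypothetical field name
    return body


# Static site, only pages changed since 1 March 2026, skipping an archive path.
request_body = build_crawl_request(
    "https://example.com/docs",
    modified_since=datetime(2026, 3, 1, tzinfo=timezone.utc),
    max_age_s=3600,
    render=False,
    depth=2,
    limit=100,
    exclude=["/docs/archive/*"],
)
```

Leaving unset options out of the body, rather than sending nulls, lets the service apply its own defaults.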

The crawl feature is currently available to Cloudflare Workers users on both free and paid plans. For teams that need regular website monitoring, research data collection, or an enterprise AI knowledge base, it is an attractive piece of infrastructure.
