Path Traversal in Crawl4AI File Downloads Enables Unauthenticated Arbitrary File Write and RCE
Crawl4AI's download handlers do not sanitize filenames taken from HTTP headers and page-controlled sources, so a path traversal lets an attacker write arbitrary files with attacker-chosen content. The Docker `/crawl` endpoint makes this exploitable without authentication, enabling remote code execution by overwriting shell rc files, injecting SSH keys, or planting cron jobs.
Affected
Vulnerability Description
Crawl4AI contains a critical path traversal vulnerability in its file download handlers, affecting two code paths: AsyncHTTPCrawlerStrategy and AsyncPlaywrightCrawlerStrategy. The root cause is missing validation of attacker-controlled filenames. The HTTP strategy takes the filename from the Content-Disposition response header without sanitization; the browser strategy uses the page-supplied suggested_filename value. Both values are joined directly to the downloads directory, with no check that the resulting path stays inside it, so an absolute path (e.g., /etc/cron.d/) or a traversal sequence (../) escapes the intended directory. Because the attacker controls the file contents as well as the filename, the flaw becomes an arbitrary file write (CWE-434) with remote code execution potential.
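The escape described above follows from how path joining behaves in Python. The sketch below is not Crawl4AI's code: `DOWNLOADS_DIR`, `naive_target`, and `escapes_downloads` are hypothetical names used only to illustrate the unsafe pattern (join an untrusted name, then write) and the confinement check that is missing. It uses `posixpath` so the result is the same on any platform.

```python
import posixpath

# Hypothetical downloads directory, chosen for illustration only.
DOWNLOADS_DIR = "/home/app/downloads"


def naive_target(filename: str) -> str:
    """Mimic the unsafe pattern: join an untrusted name with no confinement check."""
    # posixpath.join discards everything before an absolute component,
    # and normpath collapses "../" segments.
    return posixpath.normpath(posixpath.join(DOWNLOADS_DIR, filename))


def escapes_downloads(filename: str) -> bool:
    """Return True if the joined path lands outside DOWNLOADS_DIR."""
    target = naive_target(filename)
    return posixpath.commonpath([DOWNLOADS_DIR, target]) != DOWNLOADS_DIR


if __name__ == "__main__":
    for name in ["report.pdf", "../../../etc/cron.d/job", "/etc/cron.d/job"]:
        print(name, "->", naive_target(name), "escapes:", escapes_downloads(name))
```

Both the traversal form and the absolute form resolve to the same location outside the downloads directory, which is why stripping `../` alone would not be a sufficient fix.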
Proof-of-Concept Significance
This vulnerability is particularly severe because the HTTP-strategy sink is reachable without authentication on default Docker deployments, through the /crawl endpoint, when HTTPCrawlerConfig is supplied. The PoC demonstrates that an unauthenticated remote attacker can trigger the vulnerability by serving HTTP responses with crafted Content-Disposition headers or by hosting pages that suggest malicious download filenames. The only precondition is that the crawler is invoked against an attacker-controlled URL; no further user interaction is needed, which makes exploitation reliable and easy to automate at scale.
Detection Guidance
Defenders should monitor for: (1) unexpected file writes outside the designated downloads directory, particularly to sensitive locations (/etc/cron.d, ~/.ssh, system Python paths); (2) HTTP responses received by the crawler whose Content-Disposition headers contain path traversal sequences (../) or absolute paths starting with /; (3) Docker container logs showing file operations on suspicious paths; (4) filesystem audit logs (auditd, Windows ETW) capturing writes by the crawler process to locations outside the downloads directory; (5) new cron jobs, SSH keys, or Python modules appearing in sensitive locations at the same time as crawler activity. Log indicators include literal ../ sequences, URL-encoded traversal such as ..%2F, and filename values containing / characters. Consider YARA rules that match encoded traversal patterns in response headers in captured network traffic.
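The header indicators above can be turned into a simple log or traffic filter. The following is a minimal heuristic sketch, not a complete Content-Disposition parser; `extract_filename` and `is_suspicious_filename` are hypothetical helper names, and the regular expression handles only the common `filename=` and `filename*=` forms.

```python
import re
from typing import Optional
from urllib.parse import unquote


def extract_filename(content_disposition: str) -> Optional[str]:
    """Pull the filename parameter out of a Content-Disposition value (simplified)."""
    match = re.search(r'filename\*?\s*=\s*"?([^";]+)"?', content_disposition, re.IGNORECASE)
    return match.group(1).strip() if match else None


def is_suspicious_filename(raw: str) -> bool:
    """Flag filenames that contain traversal, absolute paths, or any path separator."""
    decoded = raw
    # Undo up to three layers of percent-encoding (e.g. %252e%252e%252f).
    for _ in range(3):
        once = unquote(decoded)
        if once == decoded:
            break
        decoded = once
    normalized = decoded.replace("\\", "/")
    if re.match(r"^[A-Za-z]:", normalized):  # Windows drive-letter path
        return True
    if "\x00" in normalized:  # embedded NUL byte
        return True
    # A legitimate download name is a single component, so any separator
    # (which also covers "../" and leading "/") is treated as an indicator.
    return "/" in normalized


if __name__ == "__main__":
    header = 'attachment; filename="..%2F..%2Fetc%2Fcron.d%2Fjob"'
    name = extract_filename(header)
    print(name, is_suspicious_filename(name or ""))
```

Decoding before matching matters because a filter that looks only for the literal `../` string misses the percent-encoded variants listed above.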
Mitigation Steps
Immediate actions: (1) Patch: Update crawl4ai to a version with filename sanitization; (2) Network isolation: Restrict unauthenticated access to the Docker /crawl endpoint using firewalls, reverse proxy authentication, or network policies; (3) Filesystem hardening: Run the crawler in a chroot, container, or VM with minimal filesystem permissions; (4) Input validation: Implement server-side filename sanitization that strips path separators, rejects absolute paths, and validates names against an allowlist pattern before any filesystem operation; (5) Principle of least privilege: Execute the crawler process as an unprivileged user without write access to system directories.
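The input-validation step in item (4) could look roughly like the sketch below. It is an illustration under stated assumptions, not the fix shipped by the project: `safe_download_path` and the `SAFE_NAME` allowlist are hypothetical, and the allowed character set should be adapted to the deployment.

```python
import re
from pathlib import Path

# Illustrative allowlist: starts with an alphanumeric character, then
# letters, digits, dot, underscore, or hyphen. This also rejects ".", "..",
# and dotfiles such as ".bashrc".
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,254}")


def safe_download_path(downloads_dir: str, untrusted_name: str) -> Path:
    """Map an untrusted filename to a path confined to downloads_dir, or raise ValueError."""
    name = untrusted_name.replace("\\", "/")

    # Reject absolute paths outright (POSIX root or Windows drive letter).
    if name.startswith("/") or re.match(r"^[A-Za-z]:", name):
        raise ValueError(f"absolute path rejected: {untrusted_name!r}")

    # Strip path separators by keeping only the final component.
    base = name.rsplit("/", 1)[-1]

    # Validate what remains against the allowlist.
    if not SAFE_NAME.fullmatch(base):
        raise ValueError(f"filename not allowed: {untrusted_name!r}")

    # Confinement check: the resolved target must sit under the resolved root,
    # which also catches symlinks inside the downloads directory.
    root = Path(downloads_dir).resolve()
    target = (root / base).resolve()
    if root not in target.parents:
        raise ValueError(f"path escapes downloads directory: {untrusted_name!r}")
    return target
```

The final confinement check is deliberately redundant with the earlier steps: if the allowlist is later loosened, the resolved-path comparison still prevents writes outside the downloads directory.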
Risk Assessment
This vulnerability is critical and readily exploitable. The absence of any authentication requirement on Docker deployments (common for CI/CD automation and public crawling services) dramatically lowers the barrier to exploitation. An attacker who can get the crawler to fetch a URL they control, whether directly through an exposed endpoint or indirectly through a supply-chain path, can achieve code execution on the target system. Because crafting the malicious HTTP response is simple and the flaw lies in the library's own download handling rather than in a particular environment, patching should be a high priority. Organizations running crawl4ai should treat this as requiring emergency remediation, especially if the service is internet-exposed or processes untrusted URLs.