Large-Scale Web Scraping and Crawling

Job description:

#### **About Us**
We’re building the first end-to-end testing platform for web agents, including a Browser Gym for RL-driven optimization. Our platform helps teams evaluate, benchmark, and improve web agents before they go live, ensuring they can handle real-world, dynamic environments.
With synthetic user simulations, automated evaluations, and large-scale benchmarking, we’re setting a new standard for web agent testing.
We’re a YC-backed team, and this is a founding engineering role—you’ll be one of the first hires defining how we crawl, structure, and analyze the open web at scale.
#### **The Role**
We need a Founding Web Scraping Engineer to build internet-scale web crawling infrastructure—not just scraping a single site, but handling millions of domains and evolving anti-bot defenses.
You’ll be responsible for designing robust, distributed crawling systems that adapt dynamically to web changes, optimize for efficiency, and ensure reliable data extraction.
#### **What You’ll Do**
* Build large-scale, distributed crawlers that intelligently prioritize, schedule, and optimize requests across millions of domains.
* Develop adaptive web scraping systems that handle DOM changes, WebSockets, AJAX-heavy sites, and dynamically loaded content.
* Optimize scraping performance and resilience, ensuring high-throughput data extraction with proxy/network optimizations and behavior-driven stealth tactics.
* Solve CAPTCHAs at scale, integrating third-party solvers, heuristic-based workarounds, and behavior-driven bypass techniques.
* Manage proxy and identity rotation, implementing session-aware scraping, JA3/TLS fingerprint spoofing, and request signature control.
* Structure and clean extracted data for downstream analytics, AI training, and benchmarking applications.
#### **What We’re Looking For**
* Expert-level experience in large-scale web scraping & crawling (Selenium, Puppeteer, Playwright, Scrapy, undetected-chromedriver).
* Deep knowledge of anti-bot detection strategies (TLS fingerprinting, JA3 signatures, request header anomalies, and bot behavior tracking).
* Hands-on expertise with CAPTCHA-solving strategies, including leveraging APIs, OCR-based approaches, and behavior-driven evasion.
* Proven experience building efficient proxy management systems, including rotating IPs across residential, datacenter, and mobile networks.
* Proficiency in Python, Go, or JavaScript, with experience in high-performance, parallelized scraping frameworks.
* Understanding of HTTP/2, HTTP/3, WebSockets, GraphQL, and browser-based fingerprinting.
* Experience designing scalable, fault-tolerant scraping infrastructure that adapts to changes in real time.
#### **Bonus Points**
* Experience with search engine-scale crawling.
* Background in LLM-driven web extraction or RL-enhanced adaptive crawling.
* Contributions to open-source scraping tools or web automation projects.
#### **Why Join?**
* Founding role—you’ll define and own our web crawling infrastructure from day one.
* Work at internet scale—building a system that dynamically adapts and scales across millions of domains.
* YC-backed—we’re building something that doesn’t exist yet, and you’ll be part of the core team making it happen.
