Large-Scale Web Scraping and Crawling

Job description:

#### **About Us**
We’re building the first end-to-end testing platform for web agents, including a Browser Gym for RL-driven optimization. Our platform helps teams evaluate, benchmark, and improve web agents before they go live, ensuring they can handle real-world, dynamic environments.
With synthetic user simulations, automated evaluations, and large-scale benchmarking, we’re setting a new standard for web agent testing.
We’re a YC-backed team, and this is a founding engineering role—you’ll be one of the first hires defining how we crawl, structure, and analyze the open web at scale.
#### **The Role**
We need a Founding Web Scraping Engineer to build internet-scale web crawling infrastructure: not a scraper for a single site, but a system that handles millions of domains and evolving anti-bot defenses.
You’ll be responsible for designing robust, distributed crawling systems that adapt dynamically to web changes, optimize for efficiency, and ensure reliable data extraction.
#### **What You’ll Do**
* Build large-scale, distributed crawlers that intelligently prioritize, schedule, and optimize requests across millions of domains.
* Develop adaptive web scraping systems that handle DOM changes, WebSockets, AJAX-heavy sites, and dynamically loaded content.
* Optimize scraping performance and resilience, ensuring high-throughput data extraction with proxy/network optimizations and behavior-driven stealth tactics.
* Solve captchas at scale, integrating third-party solvers, heuristic-based workarounds, and behavior-driven bypass techniques.
* Manage proxy and identity rotation, implementing session-aware scraping, JA3/TLS fingerprint spoofing, and request signature control.
* Structure and clean extracted data for downstream analytics, AI training, and benchmarking applications.
#### **What We’re Looking For**
* Expert-level experience in large-scale web scraping & crawling (Selenium, Puppeteer, Playwright, Scrapy, undetected-chromedriver).
* Deep knowledge of anti-bot detection strategies (TLS fingerprinting, JA3 signatures, request header anomalies, and bot behavior tracking).
* Hands-on expertise with captcha-solving strategies, including third-party solver APIs, OCR-based approaches, and behavior-driven evasion.
* Proven experience building efficient proxy management systems, including rotating IPs across residential, datacenter, and mobile networks.
* Proficiency in Python, Go, or JavaScript, with experience in high-performance, parallelized scraping frameworks.
* Understanding of HTTP/2, HTTP/3, WebSockets, GraphQL, and browser-based fingerprinting.
* Experience designing scalable, fault-tolerant scraping infrastructure that adapts to changes in real time.
#### **Bonus Points**
* Experience with search engine-scale crawling.
* Background in LLM-driven web extraction or RL-enhanced adaptive crawling.
* Contributions to open-source scraping tools or web automation projects.
#### **Why Join?**
* Founding role—you’ll define and own our web crawling infrastructure from day one.
* Work at internet scale—building a system that dynamically adapts and scales across millions of domains.
* YC-backed—we’re building something that doesn’t exist yet, and you’ll be part of the core team making it happen.
