The employer is a decentralized, Solana-based web-scraping network that allows users to monetize their unused internet bandwidth. By installing a browser extension, users securely share bandwidth to help AI companies crawl the web for public data, receiving Points (convertible to crypto tokens) as compensation.
They also operate a massive distributed crawler, giving them unique access to high-quality public web data at global scale.
They are hiring a Research Crawling Engineer (Full remote - USA/EU, 6-hour overlap with EST).
You will join a company at the forefront of developing a web-scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.
As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.
This Role Involves:
- Operating at the boundary of scale and reliability
- Adapting to constantly changing web environments
- Balancing throughput, coverage, and data quality
- Owning end-to-end data acquisition pipelines
MISSIONS
- Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
- Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites
- Develop pipelines for cleaning, deduplication, filtering, and normalization
- Construct and maintain datasets for research and model training
- Monitor crawl performance, coverage, and data quality; iterate quickly
- Collaborate with research teams to align data collection with modeling needs
- Optimize infrastructure for cost, latency, and reliability
Example Projects you could work on:
- Build a distributed crawler for a continuously updated, high-quality web project
- Design a system to classify and filter billions of pages for pretraining
- Extract structured data from dynamic, JS-heavy sites at scale
- Improve deduplication and quality scoring across multimodal datasets
Requirements
- Strong programming experience in one or more of: Go, Rust, Python, Java, or C++
- Experience working for reputable companies
- Experience building and maintaining large-scale web crawlers or large-scale data pipelines
- Experience designing high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
- Experience handling anti-bot systems, rate limits, and dynamic/JS-heavy sites
- Experience constructing and maintaining datasets for research and model training
- Solid understanding of HTTP, networking, and browser behavior
- Familiarity with distributed systems and parallel processing
- Experience working with large datasets (TB–PB scale preferred)
- Ability to debug unstable or adversarial environments
Preferred / Bonus:
- Experience with NLP pipelines or dataset curation for ML
- Familiarity with LLM pretraining data or retrieval systems
- Experience with headless browsers (e.g., Chrome DevTools Protocol, Playwright, Puppeteer)
- Knowledge of proxy systems, IP rotation, and large-scale request orchestration
- Background in data quality evaluation or benchmarking
- Experience running workloads on cloud or bare-metal infrastructure
Main Evaluation Criteria:
- Ability to design systems that scale without degrading quality
- Practical problem-solving under real-world constraints
- Speed of iteration and ownership
- Measurable improvements in data coverage, quality, or efficiency
Benefits
- Contract: Permanent role (Full remote - USA, or 6-hour overlap with EST).
- Salary: $150k to $225k, based on experience and demonstrated ability to operate at scale, plus equity package / tokens
Recruitment process:
- Recruiter / HR Call
- Technical Interview
- CEO Interview
- Final Interview