Research Crawling Engineer

Posted 2 months ago

$150K - $225K per year

5-10 years experience


AI Summary

Design and operate high-throughput, fault-tolerant systems for large-scale web data acquisition and model development. Collaborate with research teams to build data pipelines for cleaning, filtering, and normalizing billions of web pages.
The employer is a decentralized, Solana-based web-scraping network that allows users to monetize their unused internet bandwidth. By installing a browser extension, users securely share bandwidth to help AI companies crawl the web for public data, receiving Points (convertible to crypto tokens) as compensation.
They also operate a massive distributed crawler, giving them unique access to high-quality public web data at global scale.

They are hiring a Research Crawling Engineer (fully remote, USA/EU, 6-hour overlap with EST).

You will join a company at the forefront of developing a web-scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.

As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.



This Role Involves:

- Operating at the boundary of scale and reliability
- Adapting to constantly changing web environments
- Balancing throughput, coverage, and data quality
- Owning end-to-end data acquisition pipelines


MISSIONS


  • Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
  • Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites
  • Develop pipelines for cleaning, deduplication, filtering, and normalization
  • Construct and maintain datasets for research and model training
  • Monitor crawl performance, coverage, and data quality; iterate quickly
  • Collaborate with research teams to align data collection with modeling needs
  • Optimize infrastructure for cost, latency, and reliability

Example projects you could work on:

- Build a distributed crawler for a continuously updated, high-quality web project
- Design a system to classify and filter billions of pages for pretraining
- Extract structured data from dynamic, JS-heavy sites at scale
- Improve deduplication and quality scoring across multimodal datasets

Requirements

  • Strong programming experience in one or more of: Go, Rust, Python, Java, or C++
  • Experience working for reputable companies
  • Experience building and maintaining large-scale web crawlers or large-scale data pipelines
  • Experience designing high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
  • Experience handling anti-bot systems, rate limits, and dynamic/JS-heavy sites
  • Experience constructing and maintaining datasets for research and model training
  • Solid understanding of HTTP, networking, and browser behavior
  • Familiarity with distributed systems and parallel processing
  • Experience working with large datasets (TB–PB scale preferred)
  • Ability to debug unstable or adversarial environments

Preferred / Bonus:

  • Experience with NLP pipelines or dataset curation for ML
  • Familiarity with LLM pretraining data or retrieval systems
  • Experience with headless browsers (e.g., Chrome DevTools Protocol, Playwright, Puppeteer)
  • Knowledge of proxy systems, IP rotation, and large-scale request orchestration
  • Background in data quality evaluation or benchmarking
  • Experience running workloads on cloud or bare-metal infrastructure

Main Evaluation Criteria:

  • Ability to design systems that scale without degrading quality
  • Practical problem-solving under real-world constraints
  • Speed of iteration and ownership
  • Measurable improvements in data coverage, quality, or efficiency


Benefits

  • Contract: Permanent role (fully remote, USA or 6-hour overlap with EST).
  • Salary: $150k to $225k, based on experience and demonstrated ability to operate at scale, plus equity package / tokens

Recruitment process:

  • Recruiter / HR Call
  • Technical Interview
  • CEO Interview
  • Final Interview

