Peakflo

AI/ML Ops Engineer - (India/Remote)

Posted 5 months ago

India

1400K - 1800K per year

⭐ 2-5 years experience

AI Summary

The engineer will own the intersection of cloud infrastructure and AI systems, focusing on LLMOps, RAG performance, and production monitoring for finance agents. Responsibilities include managing vector databases, scaling orchestration frameworks, and ensuring zero-downtime model deployments.

🚀 Who we are & What we’re building

Peakflo is a rapidly growing agentic AI company. We are changing how global finance teams work through agentic workflows, and we are hiring exceptional people across disciplines to drive that transformation and advance our global strategy.

Our Growth Story: Peakflo is backed by top-tier global accelerators and investors. We are proud alumni of Y Combinator (W22) and the Google AI Accelerator. Our momentum and impact have been recognized globally by top tech and finance publications.

Our Culture: We believe in building a vibrant, high-performance culture that rewards curiosity, ownership, and innovation. Our team spans the globe, and we love coming together to solve hard problems and celebrate our wins. Most importantly, we are building an environment that provides the support and mentorship you need to succeed, learn, and grow. ❤️

💻 What we’re Looking For:

We are seeking a highly capable and autonomous AI/ML Ops Engineer to own the intersection of our cloud infrastructure and artificial intelligence systems. As a mid-level builder, you will be the backbone of our AI deployment lifecycle, ensuring our finance agents are highly performant, cost-effective, and exceptionally reliable. You will partner closely with our Operations team, Core Engineering, ML researchers, and the CTO to scale our platform and move cutting-edge models into production.

💪 What you’ll do

LLMOps, RAG & Agent Performance Management
- Production AI Monitoring: Deploy and manage observability frameworks to track the real-time performance of our AI agents. Monitor critical metrics such as latency, token usage, drift, and hallucination rates.
- Intelligent Routing & Fallbacks: Implement robust API routing, load balancing, rate limiting, and fallback mechanisms across multiple LLM providers to ensure 100% agent availability and reliability.
- Vector Database Infrastructure: Provision, scale, and maintain the vector database infrastructure (e.g., Pinecone, Milvus, Weaviate) that powers our Retrieval-Augmented Generation (RAG) pipelines.
- Orchestration & Frameworks: Maintain and scale resilient data pipelines utilizing advanced LLM orchestration tools and agentic frameworks (e.g., LangChain, LlamaIndex).
- Model Deployment: Streamline the transition of models from research to production, ensuring zero-downtime deployments, canary releases, and reliable versioning for our finance agents.

Cloud Infrastructure & Platform Engineering
- GCP Architecture: Architect, provision, and maintain secure, highly available, and scalable cloud infrastructure primarily utilizing Google Cloud Platform (GCP).
- Containerization & IaC: Leverage Docker and Kubernetes for container orchestration. Build out Infrastructure as Code (IaC) using Terraform to ensure reproducible and scalable environments across staging and production.
- CI/CD & Uptime: Design, implement, and optimize continuous integration and continuous deployment (CI/CD) pipelines to maintain high platform uptime and accelerate the pace of engineering delivery.
- Cost & Resource Optimization: Monitor cloud compute and LLM API usage, implementing FinOps best practices to optimize infrastructure costs without sacrificing performance as we scale.
- Security & Access Management: Configure secure network architectures, robust IAM policies, and secret management (e.g., GCP Secret Manager) to safeguard proprietary models and infrastructure.

Cross-Functional Collaboration
- Strategic Partnership: Act as the critical operational bridge between the Operations team, Engineering, Machine Learning, and the CTO to align technical infrastructure with business scalability needs.
- Developer Tooling: Build internal tools, CLI utilities, and dashboards that empower the ML and core engineering teams to test and ship agents faster and with higher confidence.

🕵️‍♀️ Who we’re looking for

- Experience: 2+ years of hands-on industry experience in DevOps, Cloud Engineering, or MLOps, demonstrating an ability to build autonomously and take ownership of complex systems.
- Cloud & Infrastructure Skills: Deep expertise in GCP, Kubernetes, Docker, and Terraform. Proven track record of building and managing robust CI/CD pipelines.
- AI/LLM Ecosystem: Practical experience deploying and managing applications built with LLM frameworks like LangChain, LlamaIndex, or similar technologies.
- Programming Proficiency: Strong background in Python and bash scripting for automation and system integrations.
- Mindset: A proactive, problem-solving attitude with the ability to navigate ambiguity, design scalable solutions, and work independently in a fast-paced startup environment.

➕ We’re Particularly Interested In People Who Have:

- Experience integrating specialized LLM observability tools (such as LangSmith, Weights & Biases, or equivalent).
- Previous experience managing high-throughput, low-latency APIs in a B2B SaaS or fintech environment.
- A strong grasp of cost-optimization strategies for large-scale cloud and GPU deployments.

🙂 Benefits:

- Competitive compensation package.
- Comprehensive health and wellness benefits.
- Opportunity for rapid career growth, working directly with the leadership team.
- Collaborative and innovative work environment.
- Flexible work hours and remote work options.
- Unlimited opportunities to shape the foundational infrastructure of a global AI platform.
