Optimize the vLLM inference engine to improve the speed and cost of running LLMs and diffusion models. Develop innovations for diverse hardware and architectures, including mixture-of-experts and multimodal models.
Inferact
8 Remote Job Openings at Inferact
The role involves creating high-quality technical content, tutorials, and demos to help developers adopt and scale vLLM. You will act as an educator-builder, explaining complex inference systems concepts and hosting workshops for the AI infrastructure community.
Own and operate high-performance GPU compute infrastructure to ensure health, availability, and observability for engineering teams. Standardize provisioning, scaling, and incident response across various neo-cloud and dedicated compute providers.
Member of Technical Staff, TPU & AMD GPU Performance Engineering
Inferact · Full Time · 6 days ago
Build and optimize AMD GPU and TPU backends, kernels, and compiler integrations to make vLLM a first-class inference engine on non-NVIDIA hardware. Improve critical paths such as attention, GEMM, and KV-cache while developing robust benchmarking infrastructure.
The role involves writing kernels and low-level optimizations to enhance the performance of vLLM as an inference engine. The engineer will collaborate with hardware vendors to maximize performance across various accelerator types.
The cloud orchestration engineer will build the operational backbone for vLLM, focusing on cluster management, deployment automation, and production monitoring. The role involves ensuring that vLLM deployments are observable, debuggable, and recoverable.
The role involves building distributed systems that power inference at a global scale. You will design and implement foundational layers to enable vLLM to serve models across thousands of accelerators with minimal latency and maximum reliability.
Member of Technical Staff, Exceptional Generalist (Remote)
Inferact · Full Time · 5 months ago
You will work across the entire vLLM stack, optimizing CUDA kernels, designing distributed orchestration systems, and implementing new model architectures. Your work will directly impact how the world runs AI inference.