Sr. Inference Engineer
Location: Hyderabad, Telangana, India
Department: Innovation
Job Role
As a Model Inference Engineer, you will bridge the gap between model training and production deployment. You will take high-performance checkpoints from our Training Engineers and transform them into optimized, production-ready artifacts. Your mission is to architect, build, and rigorously test inference servers that deliver our Voice AI capabilities across both real-time streaming and high-throughput batch scenarios.
You will also play a key role in hardware-software co-optimization, selecting the right compute profiles and implementing scaling strategies to balance high-fidelity audio quality with cost-efficient, reliable production delivery.
What you'll be responsible for
- Transform trained checkpoints into high-performance artifacts using TensorRT, ONNX, or TVM. Implement quantization strategies (FP16, INT8, FP8) to balance precision and performance (see the quantization sketch after this list).
- Architect and maintain inference servers using Triton Inference Server or vLLM. Implement efficient request handling through dynamic batching and streaming protocols such as gRPC and WebSockets (see the streaming sketch after this list).
- Profile and optimize model performance at the kernel level. Select and tune compute profiles across various NVIDIA GPU architectures (T4, L4, A100, H100) to maximize cost-efficiency (see the profiling sketch after this list).
- Design and execute rigorous performance tests to measure latency (TTFC), throughput, and memory usage (see the benchmarking sketch after this list). Ensure optimized models maintain the required acoustic fidelity and accuracy.
- Partner with Training Engineers to define export-friendly architectures and provide feedback on model performance in production-like environments.
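For illustration, a minimal quantization sketch using ONNX Runtime's post-training dynamic quantization; the file names are placeholders, and static INT8 or FP8 calibration via TensorRT follows a similar checkpoint-in, artifact-out flow:

```python
# A minimal sketch, assuming an already-exported ONNX checkpoint.
# "acoustic_model.onnx" is a placeholder path, not a real artifact.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="acoustic_model.onnx",        # exported FP32 checkpoint
    model_output="acoustic_model.int8.onnx",  # quantized serving artifact
    weight_type=QuantType.QInt8,              # INT8 weights, FP32 activations
)
```

Dynamic quantization needs no calibration data; static INT8 and FP8 add a calibration pass, trading setup effort for tighter control over precision loss.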
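A minimal streaming sketch, assuming a vLLM server exposing its OpenAI-compatible HTTP endpoint; the URL and model name are illustrative only. vLLM batches in-flight requests continuously by default, while Triton Inference Server handles the same concern through its dynamic batching model configuration:

```python
# A minimal sketch of consuming a streamed response from a vLLM endpoint.
# base_url and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-voice-model",  # hypothetical deployed model name
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,             # server flushes chunks as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```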
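A minimal profiling sketch with torch.profiler; the toy nn.Linear stands in for a real exported model, and deeper kernel work would typically continue in Nsight Systems or Nsight Compute:

```python
# A minimal sketch: rank ops in a forward pass by GPU time.
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device).eval()  # placeholder model
x = torch.randn(64, 1024, device=device)

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    model(x)

# The slowest kernels are the first candidates for fusion or tuning.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```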
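A minimal benchmarking sketch; TTFC is read here as time-to-first-chunk, and `fake_stream` is a stand-in for a real streaming client:

```python
# A minimal sketch of a TTFC/throughput harness. `stream_infer` is any
# callable that takes a request and yields response chunks.
import time
import statistics

def benchmark(stream_infer, requests):
    ttfc, total_time, n_chunks = [], 0.0, 0
    for req in requests:
        start = time.perf_counter()
        first = None
        for _ in stream_infer(req):
            if first is None:
                first = time.perf_counter() - start  # time to first chunk
            n_chunks += 1
        ttfc.append(first)
        total_time += time.perf_counter() - start
    p50 = statistics.median(ttfc)
    p95 = statistics.quantiles(ttfc, n=100)[94]  # needs a decent sample size
    print(f"TTFC p50={p50 * 1e3:.1f} ms  p95={p95 * 1e3:.1f} ms  "
          f"throughput={n_chunks / total_time:.1f} chunks/s")

def fake_stream(_req):
    time.sleep(0.02)       # simulated time to first chunk
    for _ in range(10):
        yield b"audio-chunk"
        time.sleep(0.005)  # simulated inter-chunk gap

benchmark(fake_stream, requests=range(50))
```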
Qualifications and other skills
- Experience writing or optimizing kernels using CUDA (C++) or Triton (Python) to accelerate non-standard operators (see the kernel sketch after this list).
- Familiarity with Apache TVM, Kubernetes, Docker, and managing GPU clusters for large-scale inference deployment.
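To ground the kernel bullet above, a minimal Triton (Python) sketch of a fused scale-and-add (out = alpha * x + y), the shape such non-standard operators often take; the names and the fusion itself are illustrative, and it assumes an NVIDIA GPU:

```python
# A minimal sketch of a fused elementwise Triton kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def scale_add_kernel(x_ptr, y_ptr, out_ptr, alpha, n_elems, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elems                 # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, alpha * x + y, mask=mask)

def scale_add(x, y, alpha=0.5):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)        # one program per 1024 elements
    scale_add_kernel[grid](x, y, out, alpha, n, BLOCK=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn_like(x)
torch.testing.assert_close(scale_add(x, y), 0.5 * x + y)
```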
What you'd have
- Deep practical experience with model serving frameworks such as vLLM, Triton Inference Server, and Ollama.
- Strong experience with model acceleration and runtime frameworks including TensorRT, TensorRT-LLM, and ONNX Runtime.
- Ability to optimize inference performance through batching, quantization, GPU utilization, and latency tuning for large-scale model serving.
- Ability to profile and identify bottlenecks across the entire stack—from Python/C++ code to GPU kernels and memory bandwidth.
- 5–7 years of industry experience in machine learning model optimization, inference systems, or ML infrastructure engineering.
- BE/BTech/ME/MTech/PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field preferred.
- Strong proficiency in C++ and Python for building high-performance machine learning and inference systems.
- Solid understanding of NVIDIA GPU architectures (e.g., Ampere, Hopper) and CUDA programming concepts for accelerated computing.
- Experience working with Linux-based environments, including system-level debugging and performance tuning.
- Familiarity with networking protocols and APIs such as gRPC, WebSockets, and HTTP/2 for real-time inference services.
- Proficiency with version control systems such as Git, and experience with collaborative software development workflows.
Why join us?
- Impactful Work: Play a pivotal role in safeguarding Tanla's assets, data, and reputation in the industry.
- Tremendous Growth Opportunities: Be part of a rapidly growing company in the telecom and CPaaS space, with opportunities for professional development.
- Innovative Environment: Work alongside a world-class team in a challenging and fun environment, where innovation is celebrated.
Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.