Sr. Inference Engineer

Location

Hyderabad, Telangana IN

Department

Innovation

Job Role

As a Model Inference Engineer, you will bridge the gap between model training and production deployment. You will take high-performance checkpoints from our Training Engineers and transform them into optimized, production-ready artifacts. Your mission is to architect, build, and rigorously test inference servers that deliver our Voice AI capabilities across both real-time streaming and high-throughput batch scenarios.

You will also play a key role in hardware-software co-optimization, selecting the right compute profiles and implementing scaling strategies to balance high-fidelity audio quality with cost-efficient, reliable production delivery.

What you'll be responsible for?

  • Transform trained checkpoints into high-performance artifacts using TensorRT, ONNX, or TVM. Implement quantization strategies (FP16, INT8, FP8) to balance precision and performance.
  • Architect and maintain inference servers using Triton Inference Server or vLLM. Implement efficient request handling through dynamic batching and streaming protocols (gRPC, WebSockets).
  • Profile and optimize model performance at the kernel level. Select and tune compute profiles across various NVIDIA GPU architectures (T4, L4, A100, H100) to maximize cost-efficiency.
  • Design and execute rigorous performance tests to measure latency (TTFC, time to first chunk), throughput, and memory usage. Ensure optimized models maintain the required acoustic fidelity and accuracy.
  • Partner with Training Engineers to define export-friendly architectures and provide feedback on model performance in production-like environments.
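The quantization work described above trades numeric precision for speed and memory. As a rough illustration of the idea (a minimal sketch, not the actual TensorRT or ONNX Runtime API), symmetric per-tensor INT8 quantization maps floats onto the int8 range with a single scale factor:

```python
# Minimal sketch of symmetric per-tensor INT8 quantization -- the kind of
# precision/performance trade-off an FP16/INT8/FP8 deployment pipeline makes.
# Illustrative only; real toolchains use calibration data and per-channel scales.

def quantize_int8(values):
    """Map floats to int8 codes with a single per-tensor scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0  # symmetric int8 uses the range [-127, 127]
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

weights = [0.02, -1.5, 0.73, 3.1, -2.8]
codes, scale = quantize_int8(weights)
recovered = dequantize_int8(codes, scale)

# Round-to-nearest bounds the per-element error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, recovered))
assert max_error <= scale / 2 + 1e-9
```

Calibration (choosing `scale` from representative activations rather than the raw max) is where most of the accuracy-preservation work happens in practice.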

Qualifications and other skills

  • Experience writing or optimizing kernels using CUDA (C++) or Triton (Python) to accelerate non-standard operators.
  • Familiarity with Apache TVM, Kubernetes, Docker, and managing GPU clusters for large-scale inference deployment.

What you'd have?

  • Deep practical experience with model serving frameworks such as vLLM, Triton Inference Server, and Ollama.
  • Strong experience with model acceleration and runtime frameworks including TensorRT, TensorRT-LLM, and ONNX Runtime.
  • Ability to optimize inference performance through batching, quantization, GPU utilization, and latency tuning for large-scale model serving.
  • Ability to profile and identify bottlenecks across the entire stack—from Python/C++ code to GPU kernels and memory bandwidth.
  • 5–7 years of industry experience in machine learning model optimization, inference systems, or ML infrastructure engineering.
  • BE/BTech/ME/MTech/PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field preferred.
  • Strong proficiency in C++ and Python for building high-performance machine learning and inference systems.
  • Solid understanding of NVIDIA GPU architectures (e.g., Ampere, Hopper) and CUDA programming concepts for accelerated computing.
  • Experience working with Linux-based environments, including system-level debugging and performance tuning.
  • Familiarity with networking protocols and APIs such as gRPC, WebSockets, and HTTP/2 for real-time inference services.
  • Proficiency with version control systems such as Git, and experience with collaborative software development workflows.
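Dynamic batching, mentioned among the serving skills above, can be sketched in a few lines: requests are buffered until either a maximum batch size or a wait deadline is reached, trading a few milliseconds of latency for much higher GPU throughput. This is a simplified illustration of the scheduling idea, not Triton's or vLLM's actual scheduler:

```python
import queue
import time

def collect_batch(request_queue, max_batch_size=8, max_wait_s=0.005):
    """Buffer requests until the batch is full or the wait budget expires.

    The first request starts the timer; later arrivals are folded into the
    same batch until max_batch_size or max_wait_s is hit, whichever comes
    first. Servers like Triton apply the same idea per model instance.
    """
    batch = [request_queue.get()]              # block for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(10):
    q.put(f"req-{i}")

first = collect_batch(q)    # fills to max_batch_size: requests are already queued
assert len(first) == 8
second = collect_batch(q)   # only two requests left; returns after the deadline
assert len(second) == 2
```

Tuning `max_batch_size` and `max_wait_s` against the latency budget (e.g. TTFC for streaming audio) is exactly the kind of throughput/latency trade-off this role owns.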

Why join us?

  • Impactful Work: Play a pivotal role in safeguarding Tanla's assets, data, and reputation in the industry.
  • Tremendous Growth Opportunities: Be part of a rapidly growing company in the telecom and CPaaS space, with opportunities for professional development.
  • Innovative Environment: Work alongside a world-class team in a challenging and fun environment, where innovation is celebrated.

Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.

Connect to join our team
