SRE II / III
Location
Hyderabad, Telangana IN
Department
Product & Engineering
Job Role
We are looking for a Senior Site Reliability Engineer (SRE) to ensure high availability, reliability, scalability, and performance of our CPaaS platforms supporting real-time communication services such as SMS, Voice, RCS, WhatsApp and APIs.
The SRE will work closely with the platform engineering, DevOps, networking, and NOC teams to build resilient distributed systems and improve operational excellence.
What will you be responsible for?
- Ensure 99.9%+ uptime for CPaaS messaging and voice services.
- Manage high-throughput, real-time communication platforms.
- Maintain SLOs, SLIs and error budgets for platform services.
- Conduct incident management, RCA, and post-mortems.
- Build and maintain CI/CD pipelines.
- Establish and maintain telecom integrations (SS7, SMPP, and SIP).
- Own entire platforms (production and lower environments): deploy, automate, maintain, and manage production systems.
- Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability.
- Collaborate with Agile teams to define technical requirements and best practices for containerized and cloud-native applications.
- Represent production support and site reliability in stand-ups, planning sessions, and architecture reviews.
Qualifications and other skills
What should you have?
- Strong knowledge of SRE principles (SLI/SLO/Error Budgets).
- Good working knowledge of applications at scale, middleware (RabbitMQ, Redis, Kafka, etc.), databases (PostgreSQL, ClickHouse, etc.), infrastructure, and Linux.
- Very good understanding of Docker and Kubernetes.
- Understanding of CI/CD and DevOps tooling such as Jenkins, Ansible, and shell scripting.
- Monitoring and logging: experience with monitoring and logging tools (e.g., Prometheus/Grafana, ELK, SolarWinds).
- Good experience with distributed systems.
- Prior experience working on high-traffic, highly scalable systems.
- Knowledge of the fundamentals of release automation (packaging, dependencies, deployment, compliance).
- Experience with project management tools such as Jira, and insight into quality analysis.
- Experience drafting root cause analyses and reports for production deployments.
- Strong incident management skills and ability to work in 24/7 production environments.
- Know-how in release and patch management and CCM.
Why join us?
- Impactful Work: Play a pivotal role in safeguarding Tanla's assets, data, and reputation in the industry.
- Tremendous Growth Opportunities: Be part of a rapidly growing company in the telecom and CPaaS space, with opportunities for professional development.
- Innovative Environment: Work alongside a world-class team in a challenging and fun environment, where innovation is celebrated.
Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.