SRE Lead
navneetkaur | Updated: January 23, 2025
About Trantor
Trantor is a technology services company focused on outsourced product development and digital re-engineering. Leveraging our CaptiveCoE™ engagement model, we operate as a seamless extension of our clients’ teams to provide rapid scalability with predictable budgets. Founded in 2012, Trantor has worked with customers across the Tech, FinTech, Media, and Cyber Security industries. We have centers in the US, India, Canada, and Costa Rica. We are consistently rated as the #1 employer in the region, with the ability to attract and retain technical talent. Our commitment to excellence and impactful results has translated into long-term relationships and value for our clients and solution partners. Please visit us at: https://trantorinc.com
Role and Responsibilities
Reliability Engineering:
- Design and implement strategies to ensure system reliability and scalability, meeting high traffic demands during peak events like FIFA 2026.
- Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical systems.
Monitoring and Observability:
- Set up and maintain monitoring and alerting systems using tools like CloudWatch, Datadog, or Prometheus.
- Create dashboards and alerts for key performance metrics, including latency, error rates, and throughput.
Incident Management:
- Develop and maintain incident response plans, ensuring quick resolution and root cause analysis.
- Lead on-call rotations and manage post-mortem processes for critical incidents.
Automation and Scalability:
- Automate routine operational tasks, including scaling, failover, and disaster recovery, using tools like AWS Lambda and scripting languages.
- Collaborate with DevOps and Platform Engineering teams to improve CI/CD pipelines and deployment reliability.
Security and Compliance:
- Work with the security team to implement best practices, including IAM role enforcement, encryption, and secure logging.
- Monitor compliance with regulatory requirements, including GDPR and AWS-specific standards.
Knowledge Sharing and Leadership:
- Provide mentorship to junior team members on SRE practices.
- Collaborate with the AWS architect to ensure operational alignment with infrastructure designs.
Skills and Qualifications
- 7+ years of experience in SRE or a related role with a focus on system reliability and performance.
- Strong proficiency in monitoring and observability tools (e.g., Datadog, Prometheus, Grafana).
- Expertise in scripting (Python, Go, or Bash) for automation and tooling.
- Hands-on experience with AWS services, including CloudWatch, Lambda, DynamoDB, and S3.
- Proven track record of managing high-traffic, consumer-facing applications with strict uptime requirements.
- Experience implementing and managing SLAs, SLOs, and SLIs.
- Knowledge of security best practices, including vulnerability assessments and IAM management.
- Familiarity with chaos engineering practices to test system resilience.
- AWS Certified DevOps Engineer – Professional certification.