Job Description
Are you a technical leader passionate about building robust, scalable, and resilient cloud infrastructure? Virtual AI Ltd is seeking a highly skilled Platform & Reliability Lead to oversee our critical AWS and Azure environments. In this role, you will be the guardian of our platform's health, driving uptime excellence and implementing cutting-edge observability practices.
As the Platform & Reliability Lead, you will own the strategy for system reliability, ensuring our services are not only available but also performant and secure. You will lead a team of engineers in continuous improvement initiatives, leveraging modern DevOps methodologies to automate processes and reduce time-to-market. If you thrive in a fast-paced, high-impact environment and have a deep understanding of cloud-native technologies, we want to hear from you.
Responsibilities
- Architect & Maintain: Design and maintain highly scalable, secure, and resilient cloud architectures across AWS and Azure.
- Uptime & Availability: Lead initiatives to ensure 99.9% uptime and drive rapid incident response to minimize downtime.
- Observability: Implement and manage comprehensive observability stacks (monitoring, logging, tracing) to gain deep insights into system performance.
- Reliability Engineering: Conduct failure mode analysis and develop robust disaster recovery and business continuity plans.
- CI/CD: Optimize and scale continuous integration and continuous deployment pipelines for maximum efficiency.
- Security: Enforce security best practices and compliance standards across all cloud resources.
- Continuous Improvement: Drive technical debt reduction and lead the team in adopting new technologies that improve system reliability.
Qualifications
- Education: Bachelor’s degree in Computer Science, Information Technology, or a related technical field.
- Experience: 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
- Cloud Expertise: Strong proficiency in AWS and/or Azure services (for example, EC2, S3, RDS, and Lambda on AWS, or their Azure equivalents).
- Tools: Deep knowledge of containerization (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation).
- Scripting: Proficiency in scripting languages such as Python, Go, or Bash.
- Observability: Experience with monitoring tools like Datadog, New Relic, or Prometheus.
- Leadership: Proven ability to lead technical projects and mentor engineering teams.