Job Description

Are you a technical leader passionate about building robust, scalable, and resilient cloud infrastructure? Virtual AI Ltd is seeking a highly skilled Platform & Reliability Lead to oversee our critical AWS and Azure environments. In this role, you will be the guardian of our platform's health, driving uptime excellence and implementing cutting-edge observability practices.
As the Platform & Reliability Lead, you will own the strategy for system reliability, ensuring our services are not only available but also performant and secure. You will lead a team of engineers in continuous improvement initiatives, leveraging modern DevOps methodologies to automate processes and reduce time-to-market. If you thrive in a fast-paced, high-impact environment and have a deep understanding of cloud-native technologies, we want to hear from you.

Responsibilities

Architect & Maintain: Design and maintain highly scalable, secure, and resilient cloud architectures across AWS and Azure.
Uptime & Availability: Lead initiatives to sustain 99.9% uptime (at most about 8.8 hours of downtime per year) and drive rapid incident response to minimize downtime.
Observability: Implement and manage comprehensive observability stacks (monitoring, logging, tracing) to gain deep insights into system performance.
Reliability Engineering: Conduct failure mode analysis and develop robust disaster recovery and business continuity plans.
CI/CD: Optimize and scale continuous integration and continuous deployment pipelines for maximum efficiency.
Security: Enforce security best practices and compliance standards across all cloud resources.
Continuous Improvement: Drive technical debt reduction and lead the team in adopting new technologies for system reliability.

Qualifications

Education: Bachelor’s degree in Computer Science, Information Technology, or a related technical field.
Experience: 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
Cloud Expertise: Strong proficiency in AWS and/or Azure services (for example EC2, S3, RDS, and Lambda on AWS, or their Azure equivalents).
Tools: Deep knowledge of containerization and orchestration (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation).
Scripting: Proficiency in scripting languages such as Python, Go, or Bash.
Observability: Experience with monitoring tools like Datadog, New Relic, or Prometheus.
Leadership: Proven ability to lead technical projects and mentor engineering teams.
