Job Description
Are you a seasoned Site Reliability Engineer passionate about building highly available, scalable, and resilient cloud-native applications? Vanguard Software Pte Ltd, a dynamic boutique software house, is seeking a Senior Site Reliability Engineer (SRE) to join our innovative team in Singapore.
We specialize in transforming complex business goals into robust, enterprise-grade web applications, leveraging cutting-edge cloud technologies and modern architectural patterns. At Vanguard, SRE is not just about keeping the lights on; it's about engineering a future where our systems are inherently reliable, performant, and secure. You'll be instrumental in shaping the reliability posture of our platforms, from infrastructure as code to automated deployments and proactive monitoring.
This is an unparalleled opportunity to work with a forward-thinking team that values innovation, collaboration, and continuous improvement. You'll play a critical role in designing, implementing, and optimizing our cloud infrastructure and services, ensuring exceptional uptime and responsiveness for our demanding enterprise clients. If you thrive in a challenging environment and are eager to make a significant impact on mission-critical systems, driving operational excellence and technical leadership, we want to hear from you. Join us and help build the future of enterprise cloud applications!
Responsibilities
- Design, implement, and maintain highly available, scalable, and secure cloud infrastructure on platforms like AWS, Azure, or GCP.
- Develop and implement robust automation for deployment, monitoring, and operational tasks using programming and scripting languages (Python, Go, Bash) and CI/CD pipelines.
- Establish and refine comprehensive monitoring, alerting, and logging systems to proactively identify and resolve production issues.
- Lead incident response, conduct thorough root cause analysis, and implement post-mortem processes to continuously improve system reliability.
- Collaborate closely with development teams to embed reliability practices throughout the software development lifecycle (SDLC).
- Implement and manage container orchestration platforms such as Kubernetes, ensuring optimal performance and resource utilization.
- Conduct performance tuning, capacity planning, and disaster recovery drills for critical systems.
- Champion SRE best practices, mentor junior engineers, and contribute to the overall architectural strategy and vision.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- Minimum of 5 years of experience in a Site Reliability Engineering (SRE), DevOps, or similar role.
- Deep expertise in at least one major cloud provider (AWS, Azure, GCP), including IaaS, PaaS, and serverless offerings.
- Proficiency with containerization technologies (Docker) and orchestration (Kubernetes).
- Strong command of Linux operating systems and networking fundamentals.
- Extensive experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Solid understanding of, and hands-on experience with, observability tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
- Proven ability to script and automate tasks using Python, Go, Bash, or similar languages.