Job Description
Join our dynamic team as a Senior Site Reliability Engineer and play a pivotal role in designing, implementing, and maintaining scalable, high-performance systems that power our digital infrastructure. You'll be instrumental in bridging the gap between development and operations, ensuring seamless deployment pipelines, robust automation, and exceptional system reliability. This position offers the opportunity to work with cutting-edge cloud technologies while driving continuous improvement in monitoring, incident response, and disaster recovery frameworks. Collaborate with cross-functional teams to optimize resource utilization and enhance service availability while maintaining industry-leading security standards. If you're passionate about building resilient systems and thrive in a fast-paced environment, this role is your chance to make a significant impact.
Responsibilities
- Design, implement, and maintain scalable infrastructure using cloud platforms (AWS/GCP/Azure)
- Develop and optimize CI/CD pipelines for automated deployments and releases
- Proactively monitor system health using observability tools (Prometheus, Grafana)
- Lead incident response and post-mortem analysis for production issues
- Automate operational processes to improve efficiency and reduce manual overhead
- Implement infrastructure-as-code solutions (Terraform, CloudFormation)
- Collaborate with development teams to troubleshoot complex system issues
- Establish and maintain SLOs/SLIs for critical services
Qualifications
- Bachelor's degree in Computer Science or equivalent practical experience
- 5+ years in SRE, DevOps, or infrastructure engineering roles
- Expertise in containerization (Docker) and container orchestration (Kubernetes)
- Strong programming skills in Python, Go, or Bash
- Experience with infrastructure-as-code and configuration management
- Knowledge of monitoring, logging, and alerting systems
- Proven ability to troubleshoot complex distributed systems
- Hands-on experience with at least one major cloud provider (AWS, GCP, or Azure)