Job Description
Are you an experienced Site Reliability Engineer looking for a flexible, remote opportunity that supports a US-based team? KMC Solutions is seeking a highly skilled SRE to join our dynamic engineering department. In this pivotal role, you will be responsible for ensuring the stability, scalability, and performance of our production environments. You will leverage our existing AWS Organizations structure to manage complex cloud infrastructure and play a critical role in successful product launches.
We are looking for a problem-solver who thrives in a fast-paced environment and can effectively bridge the gap between development and operations. As a remote SRE working the US shift, you will collaborate closely with cross-functional teams to drive efficiency, implement robust automation, and maintain high availability standards. If you are passionate about cloud technologies and want to make a tangible impact on product reliability, we want to hear from you.
Responsibilities
- Design, implement, and maintain scalable, reliable, and efficient AWS infrastructure using best practices.
- Manage and optimize AWS Organizations to maintain security compliance and meet governance standards.
- Monitor system performance and troubleshoot complex incidents during US business hours.
- Collaborate with development teams to improve CI/CD pipelines and automate deployment processes.
- Conduct capacity planning and performance tuning to support product launches.
- Drive incident response and post-mortem analysis to prevent future outages.
- Implement strategies for cloud cost optimization and disaster recovery.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
- Strong proficiency in AWS services (EC2, Lambda, S3, RDS, CloudWatch, etc.).
- Deep understanding of Linux system administration and scripting (Python, Bash, or Go).
- Experience with containerization technologies like Docker and Kubernetes.
- Familiarity with monitoring and logging tools such as Prometheus, Grafana, or ELK Stack.
- Strong communication skills and the ability to work effectively in a remote, US-shift environment.