Job Description
Join our dynamic team as a Site Reliability Engineer (SRE) at PT. Dollar Information Consultan Indonesia, where you'll play a pivotal role in ensuring the reliability and scalability of our cutting-edge GPU infrastructure. We're seeking a highly skilled professional with experience operating large-scale GPU cluster data centers, including those with thousands or tens of thousands of NVIDIA GPUs. In this role, you'll design, implement, and maintain the systems that power our AI and machine learning workloads. You'll work closely with development teams to build resilient systems, automate processes, and continuously improve the performance and efficiency of our infrastructure. If you're passionate about cloud technologies and distributed systems and have a knack for solving complex technical challenges, we invite you to apply and contribute to our innovative projects.
Responsibilities
- Design, implement, and maintain scalable and reliable GPU cluster infrastructure
- Monitor system performance, identify bottlenecks, and implement optimization strategies
- Develop automation tools and scripts for deployment, configuration, and monitoring
- Collaborate with development teams to ensure systems are built with reliability in mind
- Respond to incidents, perform root cause analysis, and implement preventive measures
- Manage capacity planning and resource allocation for large-scale GPU deployments
- Stay current with industry best practices and emerging technologies in SRE and GPU computing
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field
- 3+ years of experience in Site Reliability Engineering or a similar role
- Proven experience managing large-scale GPU clusters with thousands of NVIDIA GPUs
- Strong knowledge of Linux systems, networking, and cloud platforms (AWS, GCP, Azure)
- Proficiency in automation and orchestration tools such as Ansible, Terraform, and Kubernetes
- Experience with monitoring and observability tools (Prometheus, Grafana, etc.)
- Excellent problem-solving skills and ability to work in a fast-paced environment