Job Description

Join ByteDance's pioneering AI-driven search and recommendation team, where innovation meets scale. We're building cutting-edge systems that power personalized experiences for millions of global users. As a Site Reliability Engineer, you'll architect and maintain robust infrastructure that ensures the seamless operation of AI-driven applications. This role offers the unique opportunity to work at the intersection of artificial intelligence and distributed systems, solving complex challenges in performance optimization, automation, and resilience. You'll collaborate with world-class ML engineers and product teams to deliver solutions that define the future of digital experiences.
Your impact will be felt across our global ecosystem as you design fault-tolerant systems, implement proactive monitoring strategies, and drive continuous improvement initiatives. ByteDance provides an environment where your technical expertise will directly shape how AI applications scale and evolve. If you're passionate about building systems that power next-generation AI products while maintaining the highest standards of reliability and performance, this is your chance to make a significant impact.

Responsibilities

Design and implement scalable infrastructure for AI-powered applications and recommendation systems
Develop automation frameworks for deployment, monitoring, and incident response
Optimize system performance and reliability for high-traffic AI workloads
Collaborate with ML teams to ensure seamless integration of AI models into production environments
Implement robust monitoring and observability solutions using industry-standard tools
Lead post-mortem analysis and implement preventive measures for system incidents
Drive innovation in cloud-native technologies (Kubernetes, containers, serverless)
Contribute to SRE best practices and knowledge sharing within the organization

Qualifications

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
3+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Strong expertise in cloud platforms (AWS/GCP/Azure) and containerization technologies
Proficiency in scripting languages (Python, Bash, Go) and automation tools
Experience with CI/CD pipelines and infrastructure-as-code (Terraform, Ansible)
Understanding of machine learning workflows and AI application deployment patterns
Knowledge of monitoring solutions (Prometheus, Grafana) and observability practices
Excellent problem-solving skills with experience in high-availability systems

Site Reliability Engineer - AI Application

