Job Description
Join ByteDance's pioneering AI-driven search and recommendation team, where innovation meets scale. We're building cutting-edge systems that power personalized experiences for millions of users worldwide. As a Site Reliability Engineer, you'll architect and maintain robust infrastructure that ensures the seamless operation of AI-driven applications. This role offers a unique opportunity to work at the intersection of artificial intelligence and distributed systems, solving complex challenges in performance optimization, automation, and resilience. You'll collaborate with world-class ML engineers and product teams to deliver solutions that define the future of digital experiences.
Your impact will be felt across our global ecosystem as you design fault-tolerant systems, implement proactive monitoring strategies, and drive continuous improvement initiatives. ByteDance provides an environment where your technical expertise will directly shape how AI applications scale and evolve. If you're passionate about building systems that power next-generation AI products while maintaining the highest standards of reliability and performance, this is your chance to make a significant impact.
Responsibilities
- Design and implement scalable infrastructure for AI-powered applications and recommendation systems
- Develop automation frameworks for deployment, monitoring, and incident response
- Optimize system performance and reliability for high-traffic AI workloads
- Collaborate with ML teams to ensure seamless integration of AI models into production environments
- Implement robust monitoring and observability solutions using industry-standard tools
- Lead post-mortem analysis and implement preventive measures for system incidents
- Drive innovation in cloud-native technologies (Kubernetes, containers, serverless)
- Contribute to SRE best practices and knowledge sharing within the organization
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- 3+ years of experience in Site Reliability Engineering, DevOps, or a similar role
- Strong expertise in cloud platforms (AWS/GCP/Azure) and containerization technologies
- Proficiency in scripting languages (Python, Bash, Go) and automation tools
- Experience with CI/CD pipelines and infrastructure-as-code (Terraform, Ansible)
- Understanding of machine learning workflows and AI application deployment patterns
- Knowledge of monitoring solutions (Prometheus, Grafana) and observability practices
- Excellent problem-solving skills with experience in high-availability systems