Job Description
Are you a vigilant problem-solver with a passion for high-availability systems? Our team is looking for a dedicated Operations Support Engineer to serve as the critical first line of defense in maintaining our mission-critical services. In this role, you will be the backbone of our operational stability, ensuring 24/7 service uptime and seamless technical performance.
As an Operations Support Engineer, you will work in a fast-paced environment where you will diagnose complex issues, coordinate rapid incident response, and implement preventative measures to enhance system resilience. You will collaborate closely with engineering and DevOps teams to bridge the gap between development and production, ensuring that service interruptions are minimized and resolved with precision. If you thrive under pressure and enjoy dissecting technical challenges in a global-scale environment, we want to hear from you.
Responsibilities
- Monitor and maintain production environments around the clock to ensure maximum service uptime and performance.
- Serve as the primary point of contact for incident response, triage, and rapid technical troubleshooting.
- Coordinate with cross-functional engineering teams to escalate critical issues and provide clear status updates.
- Analyze system logs and performance metrics to identify recurring bottlenecks and implement long-term fixes.
- Maintain comprehensive documentation of incident logs, resolution steps, and post-mortem analysis.
- Automate routine operational tasks using scripting languages to reduce manual overhead.
- Participate in on-call rotations to provide continuous support for global service infrastructure.
- Implement proactive monitoring alerts to detect and mitigate potential service disruptions before they impact users.
Qualifications
- Bachelor’s degree in Computer Science, Information Technology, or a related technical field.
- 3-5 years of experience in an Operations, Site Reliability, or Technical Support Engineering role.
- Strong proficiency in Linux/Unix system administration and troubleshooting.
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) and container technologies such as Docker or Kubernetes.
- Solid understanding of networking concepts (DNS, load balancing, HTTP/HTTPS, TCP/IP).
- Proven ability to write automation scripts in Python, Bash, or Go.
- Excellent communication skills with the ability to convey complex technical issues to both technical and non-technical stakeholders.
- Ability to work effectively under pressure, on rotating shifts or on call.