Job Description
Are you a seasoned troubleshooter who thrives in high-stakes environments? We are seeking a highly skilled Senior Operations Support Engineer to join our dynamic team in Singapore. In this critical role, you will be the backbone of our 24/7 service availability, serving as the first line of defense in identifying, diagnosing, and resolving complex technical incidents.
You will work closely with cross-functional engineering teams to maintain system reliability, improve operational workflows, and ensure a seamless experience for our end-users. We are looking for a proactive problem-solver who does not just react to issues but builds robust monitoring systems and automated response mechanisms that prevent downtime before it occurs.
If you are passionate about site reliability, cloud infrastructure, and operational excellence, we want to hear from you. Join us in shaping the future of our digital services and delivering uncompromising performance in a fast-paced environment.
Responsibilities
- Lead real-time monitoring and incident response for core services to maintain 24/7 availability.
- Act as the primary point of escalation for complex technical issues, performing deep-dive root cause analysis.
- Develop and maintain operational playbooks and documentation to streamline incident management.
- Collaborate with DevOps and Software Engineering teams to implement reliability improvements and infrastructure patches.
- Design and optimize automated alerts to reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
- Manage cross-team communications during critical service outages, ensuring stakeholders are updated promptly.
- Mentor junior team members and foster a culture of technical excellence and continuous improvement.
Qualifications
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- At least 5 years of experience in technical support, site reliability engineering, or operations.
- Proven track record in managing cloud infrastructure (AWS, Azure, or GCP).
- Strong proficiency in Linux/Unix environments and scripting (Bash, Python).
- Deep understanding of networking protocols, load balancers, containers, and container orchestration (Docker, Kubernetes).
- Familiarity with monitoring and logging tools such as Datadog, Splunk, Prometheus, or ELK Stack.
- Exceptional analytical and problem-solving skills with the ability to handle high-pressure situations.
- Strong interpersonal skills with the ability to communicate technical concepts to non-technical stakeholders.