Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact worldwide while working with a team located across 5 continents. Razer is also a great place to work, providing the unique, gamer-centric #LifeAtRazer experience that will accelerate your growth, both personally and professionally.
Job Responsibilities :
We are seeking an experienced AIOps Engineer to enhance the reliability, performance, and operational intelligence of mission-critical payment platform infrastructure and services.
This role focuses on designing and implementing intelligent automation solutions, improving observability practices, and leveraging AI-assisted operational capabilities to proactively detect system risks and optimize platform performance. The AIOps Engineer will work closely with DevOps, SRE, and Software Engineering teams to reduce operational overhead, improve incident response efficiency, and support scalable transaction processing environments.
Role Overview
We are looking for an AIOps Engineer to help improve the reliability, visibility, and operational efficiency of our systems.
You will work with software, infrastructure, and operations teams to develop monitoring, automation, and AI-assisted solutions for production environments. This role is suitable for an engineer with practical experience in DevOps, site reliability engineering, platform engineering, software engineering, or data-related operations who is ready to take ownership of defined technical areas.
You will contribute to existing AIOps initiatives while receiving guidance on complex architectural, operational, and business-critical decisions.
Key Responsibilities
- Develop and maintain monitoring dashboards, alerts, and operational reports for applications and infrastructure.
- Analyse logs, metrics, traces, and system events to identify performance issues, abnormal behaviour, and recurring incidents.
- Build scripts and automation workflows to reduce repetitive operational work.
- Support the implementation of alert correlation, incident classification, and operational data analysis.
- Assist with production incident investigation, root-cause analysis, and follow-up improvements.
- Improve alert quality by tuning thresholds, reducing duplicate alerts, and adding useful operational context.
- Develop and maintain integrations between monitoring, ticketing, communication, and automation platforms.
- Apply suitable AI or machine-learning techniques to operational use cases such as anomaly detection, incident summarisation, log classification, and capacity forecasting.
- Test and validate automation before deployment to production environments.
- Maintain runbooks, technical documentation, operational procedures, and implementation records.
- Work with senior engineers and relevant teams on system reliability, capacity planning, and service improvements.
- Participate in reviews of operational performance and recommend practical improvements within assigned areas.
Requirements
- Bachelor’s degree in Computer Science, Software Engineering, Information Technology, Data Science, Computer Engineering, or a related field.
- Approximately 2–4 years of relevant experience in DevOps, SRE, platform engineering, software engineering, cloud operations, data engineering, or a related technical role.
- Practical programming or scripting experience using Python, Go, Bash, or a similar language.
- Working knowledge of Linux, networking, APIs, databases, and distributed applications.
- Experience with logs, metrics, dashboards, alerting, or observability tools.
- Familiarity with containers and cloud environments.
- Ability to investigate technical issues systematically and communicate findings clearly.
- Understanding of software development practices, version control, testing, and deployment workflows.
- Interest in applying AI and automation to practical operational problems.
Preferred Qualifications
- Experience with tools such as Prometheus, Grafana, OpenTelemetry, Elasticsearch, New Relic, Datadog, Splunk, or similar platforms.
- Familiarity with Docker, Kubernetes, AWS, Azure, or Google Cloud.
- Exposure to messaging systems, event-streaming platforms, or distributed systems.
- Basic experience with machine learning, statistical analysis, or large language model integrations.
- Experience in payment, fintech, e-commerce, or another high-availability environment.
- Familiarity with incident-management processes, service-level indicators, and service-level objectives.
What Success Looks Like
Within this role, you will be expected to:
- Deliver reliable monitoring and automation for assigned systems.
- Investigate operational issues with moderate independence.
- Produce clear technical findings and documentation.
- Improve existing workflows rather than redesigning the entire operational platform.
- Escalate high-risk or architectural decisions appropriately.
- Gradually take ownership of larger and more complex AIOps capabilities as your experience grows.
Pre-Requisites :
Razer is proud to be an Equal Opportunity Employer. We believe that diverse teams drive better ideas, better products, and a stronger culture. We are committed to providing an inclusive, respectful, and fair workplace for every employee across all the countries we operate in. We do not discriminate on the basis of race, ethnicity, colour, nationality, ancestry, religion, age, sex, sexual orientation, gender identity or expression, disability, marital status, or any other characteristic protected under local laws. Where needed, we provide reasonable accommodations - including for disability or religious practices - to ensure every team member can perform and contribute at their best.
Are you game?