Site Reliability Engineer (SRE)

Requisition ID:

R001523

Job Description:

Demonware creates and provides the online services behind hugely popular video game franchises such as Crash Bandicoot, Skylanders and Guitar Hero. We provide Matchmaking, Microtransactions, Storage, and Identity & Access Management services for almost half a billion users. Demonware has offices in Dublin, Ireland; Vancouver, Canada; and Shanghai, China, and is a wholly owned subsidiary of Activision Blizzard, Inc.

Team / Role Summary
Demonware’s Titles department is responsible for developing and running the services and infrastructure for some of the largest entertainment franchises in the world. We work side by side with game studios at Activision, like Vicarious Visions and Toys for Bob, to make radical game designs a reality at massive scale. We then ensure that those features keep running 24/7 for years after launch. We also work closely with all departments at Demonware to influence and execute on our technical direction so that we improve how we ship titles in the future.

We are looking for an experienced SRE who can provide input into all areas of service development: working as an embedded member of the development team, directly with other engineers, to ensure service design is reliable, scalable and has uptime at its core; automating to prevent issues from happening in the first place; debugging problems at scale; supporting services in production by being part of an on-call rota; and working closely with other departments to ensure that our platform and product technologies support a sustainable future for the Titles department.

This is a full-time position at Demonware’s Dublin office in Ireland.


Responsibilities

Create scalable services

  • Contribute improvements to the availability, scalability, latency, and efficiency of Demonware’s services
  • Be a part of a full-service, cross-disciplinary development team, participating in the full development process, including design, capacity planning, and production deployments, while promoting site reliability engineering best practices

Support scalable services

  • Contribute to our deployment and automation tools, and to the platform, so that we detect and address problems more efficiently and prevent them from recurring
  • Define and measure production title availability, navigating known downtime and service-level outages
  • Debug problems at scale for our mission-critical services, and help our platform and service teams to implement lasting fixes to recurring issues
  • Be a part of our on-call rotation, which is a responsibility you'll share with your engineering team and other Demonware engineers around the world

Influence our technology

  • Be an expert customer of our platform teams, helping them to shape our architecture. Influence and create new designs, architectures, standards, and methods for large-scale distributed systems
  • Influence a culture of service ownership at Demonware. Engage in training and mentoring to help develop other engineers with this mindset

Requirements

  • Minimum 5 years relevant work experience, including in a high-volume or critical production service environment
  • Experience working at scale: thousands of servers in a high-volume or critical production service environment
  • Automation / scripting skills and a desire to automate all the things
  • Comfortable with at least one scripting language, e.g. Python or Ruby
  • Experience with at least one major database, e.g. MySQL, Cassandra
  • Solid understanding of fundamental technologies, e.g. TCP/IP, Linux/Unix internals
  • Experience in configuration management systems, e.g. Ansible, Puppet, Terraform
  • Demonstrable capacity for an investigative approach and keenness to learn new technologies
  • Demonstrated excellence in communicating within and across teams
  • Working closely with Demonware’s developers in supporting new features and services
  • Developing new automation tools and improving existing ones
  • Troubleshooting issues with our titles, working as part of the team to deploy, test and troubleshoot
  • Investigating performance degradation and resolving issues
  • Participating in an on-call support rotation as required


Desirable

  • Experience working with public cloud providers and cloud technologies, e.g. Amazon Web Services (AWS), GCP
  • Experience working with container orchestration, e.g. Kubernetes
  • Experience in monitoring and metrics systems, e.g. Nagios, Zabbix, Graphite, ELK
  • A background in system administration is advantageous
  • Experience with the Python scripting language