Lead Incident Manager

Apr 26, 2024
Austin, United States
... Not specified
... Intermediate
Full time
... Office work

EA's Production Infrastructure & Engineering (PI&E) organization provides the essential platforms and infrastructure hosting solutions that power EA's live services. Our charter is to make EA's games and services available to all players anytime and anywhere. To do this, we focus on the high availability of infrastructure, primary services, and studio services. We aim to help developers to experiment and build new games quickly with infrastructure services on-demand and workflows that promote rapid development in the cloud. In all of this, we focus on being there for players where and when they want to play.

The Challenge Ahead

The Mission Control Center (MCC) resides within the EA Production Infrastructure & Engineering (PI&E) team, which is responsible for the infrastructure that our games run on. The MCC is the central point of contact for the PI&E team and plays a key role in driving online 'always on' services, keeping a watchful eye over all monitored endpoints to ensure continuous 24x7 uptime for our stakeholders.

As Lead Incident Manager, you will report to the MCC's Senior Manager.

Responsibilities:

  • You will work together with the MCC Manager to ensure that operations during a shift meet the proper SLAs and that standards are followed.
  • You will be the first point of escalation for MCC team members, partners, and stakeholders.
  • You will be Incident Manager for Severity 1 and Severity 2 Incidents (coordinating the incident from initial triage to resolution, engaging teams, and escalating long-running incidents).
  • During an assigned shift, you will assist with queue oversight, triage high-priority incidents, and identify the needs of the MCC and areas of improvement across the MCC team.
  • You will assist in tracking and providing data for internal group reports that detail the success and utilization of our Mission Control Center, disaster recovery policies and emergency/incident management drills.
  • You will analyze data to help provide results of emergency management and disaster recovery drills as defined by agreed incident escalation and disaster recovery policies.
  • You will partner with other EA Operational teams on a consistent basis to reduce systems downtime.
  • You will escalate emergencies to management as needed.
  • You will ensure MCC notification and escalation procedures are followed.
  • You understand the rigorous demands of a 24x7 real-time online operational environment.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or relevant experience.
  • 3+ years of experience with Systems Operations/Engineering organizational responsibilities, including ownership and management of incident escalation, resolution tracking, and resolution reporting.
  • 1+ years of experience working in a Lead capacity.
  • Knowledge of cloud technology offerings, networking, virtualization, and security fundamentals.
  • Experience with, or exposure to, Network Operations Center best practices.
  • Strong understanding of company resources such as databases, software applications, and organizational structure.
  • Strong incident management skills.
  • Demonstrated skills in quantitative, analytical, and conceptual thinking.
  • Ability to define problems, document and establish facts, and draw valid conclusions.
  • Excellent English verbal and written communication skills and confidence to communicate under crisis conditions.
  • Strong understanding of ITIL, especially Incident, Change and Problem Management – their purpose and how they are connected.
  • Availability for shift work, including weekends and holidays.