Site Reliability Engineer III
Feb 04, 2023
ROLE: Site Reliability Engineer III
EXP: 5-8 years
- You will build and operate distributed, large-scale, cloud-based infrastructure using modern open-source software solutions.
- You will help build and operate a unified platform across EA, extract and process massive data spanning 20+ game studios, and use the insights to serve massive online requests.
- You will use automation technologies to ensure repeatability, eliminate toil, reduce mean time to detection and resolution (MTTD & MTTR), and repair services.
- You will perform root cause analysis and post-mortems with an eye toward future prevention.
- You will design and build CI/CD pipelines.
- You will create monitoring, alerting, and dashboarding solutions that improve visibility into EA's application performance and business metrics.
- You will produce documentation and support tooling for online support teams.
- You will develop reporting systems that inform us of important metrics, detect anomalies, and forecast future results.
- You will develop and operate both SQL and NoSQL solutions.
- You will build complex queries to solve data mining problems.
- You will develop a large-scale online platform to personalize the player experience and provide reporting and feedback.
- You will help in interviewing and hiring the best candidates for the team.
- You will help mentor the team members and help them grow in their skill sets.
- You will be responsible for driving growth and modernization efforts and projects for the team.
Qualifications:
- 4+ years of experience with virtualization, containerization, cloud computing (AWS preferred), VMware ecosystems, Kubernetes, or Docker.
- 6+ years of experience supporting high-availability, production-grade data infrastructure and applications with defined SLIs and SLOs.
- Systems administration or cloud experience, including a strong understanding of Linux / Unix.
- Network experience, including an understanding of standard protocols/components.
- Automation and orchestration experience, including Terraform, Helm, Chef, and Packer.
- Experience writing code in Python, Golang, or Java.
- Experience with monitoring tech stacks like Prometheus, Grafana, Loki, and Alertmanager.
- Experience with distributed systems that serve massive concurrent requests.
- Experience working with large-scale systems and data platforms/warehouses.