Systems Design Engineer - Site Reliability Engr

Apr 03, 2024
Austin, United States
... Not specified
... Intermediate
Full time
... Office work


WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. 

AMD together we advance_




Position Description 
 

The Site Reliability Engineer (SRE) position is for the newly formed Cluster Platform Engineering (CPE) team in the Data Center Cluster Solutions (DCCS) organization, as part of the AMD (Advanced Micro Devices) Data Center Solutions Group (DSG) business unit. DCCS supports the cluster deployment needs of the Datacenter GPU (DCGPU) business unit. The SRE will be responsible for helping to create and automate processes that bring up and keep deployed GPU and CPU cluster systems running. This position will be focused on the operational aspects of large-scale GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) Cluster systems within AMD. 

 

The SRE will work closely with the CPE Platform Engineering (PE) and Data Center Operations (DCOps) teams as internal and external systems are brought up for customers. They will use software tools to replace manual processes with automation for tasks such as systems management and application monitoring. They will also work with the CPE Release Engineering (RE) team to develop and automate reliable processes for applying updates to cluster systems.

 

This position is an opportunity to help build a platform and create a world-class operation in support of this exciting growth area for AMD and for the industry. It reports to the Senior Manager of the Site Reliability Engineering team within the Cluster Platform Engineering group.

 

Role and Responsibilities 

 
This SRE role will primarily involve learning the AMD GPU cluster systems, assisting in the bring-up of these systems, and developing automation to keep them operational, as well as working with other DCGPU and DSG teams to incorporate requirements and address any issues on the systems.

 

Specific responsibilities of this position include: 

 

  • Work with the Platform Engineering team to develop and automate management of an infrastructure control plane and deployment system for GPU and CPU clusters.
  • Work with the Release Engineering team to automate the application of updates and system configuration management tools.
  • Resolve problem tickets reported by internal and external customers for GPU and CPU cluster systems.
  • Develop and enhance internal and third-party network and cluster management tools, applications, and processes that enable internal teams and customers to build, test, and optimize performance of high-performance networks supporting large-scale GPU and CPU cluster systems.
  • Assist in developing the software ecosystem needed for at-scale cluster operations providing Cluster-as-a-Service for AMD internal and customer access systems. This includes some involvement with rack-and-stack datacenter operations, at-scale software installation and configuration management, and at-scale system provisioning, helping to build and operate an on-prem cloud service for internal AMD stakeholders that forms a model for customer adoption.
  • Help create an enterprise-class operational model for internal cluster systems that provides a reliable, secure, automated infrastructure for rapid response to changing requirements, efficient use of assets, and a reference template for customer adoption.
  • Participate in a strong customer-centric culture focused on meeting commitments.

  

Experience and Qualifications 
 

  • 10+ years of experience in high-performance networks, platform hardware, firmware, and systems management solutions at scale.
  • Strong Linux system administration knowledge and skills around installation, configuration, package management, and system management across multiple OS (Operating System) distributions. Related skill in system performance tuning at user and kernel mode is a plus. 
  • Experience with virtualization and containerization including systems like KVM, Docker, podman, OpenShift, and Kubernetes. 
  • Strong experience with system automation and configuration management at scale using tools like Ansible, Salt, Chef, Puppet, bash, and Python. 
  • Experience working with dev teams developing and maintaining a CI/CD pipeline development environment. 
  • Strong networking knowledge and troubleshooting capability in large-scale Ethernet networks. Experience with RDMA/RoCE and InfiniBand is a big plus.
  • Experience using common industry tools to fix software issues and automate operational processes. 
  • Familiarity with database management, data analysis, storage filesystems, volume management like LVM, HW and SW RAID, and similar systems. 
  • Demonstrated track record of building and delivering complex operational solutions at scale, with the ability to learn new systems quickly in a rapidly changing environment.
  • Remote position but with ability to travel when required (up to 10%). 

 
Personal Characteristics 

 
Excellent Communication and People Skills – The ability to interact with various teams in order to accomplish operational goals. 

 

Technology Orientation – Affinity towards technology products. Someone who is curious and creative in applying technology for innovative ideas and applications. 

 

Outstanding Integrity – A thoroughly honest and forthright individual who is upfront and direct with subordinates, peers, and the management executives to whom they report.
  

Effective at working in a culturally diverse organization.
 
Education 

 
BSEE or relevant technical degree 

Linux system certifications from vendors such as Red Hat, Canonical, and SUSE are a plus.

 

LOCATION:  Austin

 

#LI-DR2 

 




At AMD, your base pay is one part of your total rewards package.  Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD’s Employee Stock Purchase Plan. You’ll also be eligible for competitive benefits described in more detail here.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
