GPU RAS Architect

Feb 22, 2024
Austin, United States
Intermediate
Full time
Office work


WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. 

AMD together we advance_




GPU RAS Architect

 

The Role:

We are looking for a talented and experienced RAS Architect (Reliability, Availability, and Serviceability) to join our team. As the RAS Architect, you will be responsible for designing and implementing robust systems, ensuring high levels of reliability, availability, and serviceability. You will work closely with cross-functional teams, including hardware engineers, system architects, and software developers, to create designs that meet stringent reliability requirements and deliver exceptional customer experiences. Are you ready to change the next generation of computing? Join us at the forefront of technological advancement.

 

 

In this role, you will be required to:

  • Design and develop server-level RAS architectures for AMD’s Data Center products. Define and analyze server-level reliability, availability, and serviceability requirements, ensuring compliance with industry standards and customer expectations. RAS solutions will be deployed in scale-out environments.
  • Collaborate with hardware and software teams to identify potential points of failure, conduct failure mode and effects analysis (FMEA), and propose mitigation strategies. Develop fault detection, isolation, and recovery mechanisms to ensure system resilience and minimize downtime.
  • Design redundancy and fault-tolerant mechanisms, including redundant components and error correction codes (ECC), to maximize system availability. Define and implement advanced diagnostics and monitoring capabilities to enable proactive system health management and predictive maintenance.
  • Evaluate and select appropriate technologies and components to optimize reliability, availability, and serviceability, considering factors such as mean time between failures (MTBF), mean time to repair (MTTR), and total cost of ownership (TCO).
  • Collaborate with vendors and suppliers to assess and integrate their RAS-related solutions into the overall system architecture. Conduct system-level simulations, analysis, and testing to validate and verify the effectiveness of the RAS architecture and its components.
  • Stay up-to-date with the latest advancements in RAS techniques, fault tolerance mechanisms, and industry trends to guide future system designs.
  • Contribute to all phases of product development, from product definition, architecture, and design through implementation, debugging, testing, and early customer support.

 

Key Qualifications:

  • 10+ years of experience
  • Strong C/C++ programming skills in a Linux environment, a strong understanding of Linux kernel internals, and strong code review skills.
  • Strong expertise in system-level architecture design, reliability engineering, and fault tolerance mechanisms, with experience optimizing RAS architectures for complex computing systems, data centers, or mission-critical applications.
  • Experience with fault-tolerant design principles and techniques, including redundancy, error correction codes (ECC), and error recovery mechanisms.
  • Proficiency in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis).
  • Excellent problem-solving skills, attention to detail, and the ability to analyze complex system-level issues.
  • Excellent written and oral communication skills, a good work ethic, a strong sense of teamwork, a love of producing quality work, and a commitment to finishing your tasks every day.
  • You are a self-starter who loves to find creative solutions to complicated problems.

 

Preferred qualifications:

  • BS, MS, or PhD in EE/CS or a related field, or equivalent experience
  • PMP certificate
  • Prior ASIC project management experience

Academic Credentials:  

  • Bachelor's or Master's degree in electrical or computer engineering

 

Location:

Austin, TX





At AMD, your base pay is one part of your total rewards package.  Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD’s Employee Stock Purchase Plan. You’ll also be eligible for competitive benefits described in more detail here.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
