LLM High-Performance Optimization Engineer

Sep 06, 2024
Beijing, China
Not specified
Intermediate
Full time
Office work


WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. 

AMD together we advance_




LLM High-Performance Optimization Architect
 
Our team's primary work is accelerating LLM training and inference on AMD GPUs. We adapt open-source LLM training and inference frameworks using ROCm tools, complete pre-training accuracy validation, and make full use of hardware resources to improve model training and inference performance.

 

 

Responsibilities:
- Develop and implement LLM training and inference frameworks that adapt to and fully utilize hardware resources to improve training and inference performance.
- Analyze and optimize accuracy and performance issues during model training and deployment.

 

Requirements:
- Proficient in PyTorch and familiar with common distributed machine learning frameworks such as Megatron-LM and DeepSpeed.
- Familiar with mainstream LLM inference engines, such as FasterTransformer/vLLM, and common inference optimization methods like FlashAttention, PagedAttention, Continuous Batching, and Speculative Decoding.
- Experience with multi-machine distributed training and inference.
- Experience in accelerating and optimizing deep learning applications, with the ability to perform targeted optimizations based on different scenarios and hardware platforms.
- Proficient in Python and C/C++ programming.
- Strong communication and teamwork skills, with the ability to read academic papers and communicate effectively in English.

 

Preferred:
- Experience in distributed training and acceleration of large models.
- Familiarity with CUDA kernel development and acceleration.

 

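Our team's main work is accelerating large-model training and inference on AMD GPUs. We adapt LLM training and inference frameworks using ROCm tools, complete pre-training accuracy validation, and make full use of hardware resources to improve model training and inference performance.

Responsibilities:
- Optimize large-model training and inference performance, reducing inference latency and increasing throughput, using techniques including but not limited to model pruning, quantization, distillation, and compression. Adapt to and make full use of hardware resources to control model deployment cost.
- Analyze and tune accuracy and performance issues in model training and deployment, identify and resolve bottlenecks, and improve model training and inference speed.
- Drive the research, development, and industrial adoption of deep learning optimization algorithms.

Requirements:
- Experience in accelerating and optimizing deep learning applications, with the ability to perform targeted optimizations for different scenarios and hardware platforms.
- Familiar with C/C++ programming and CUDA kernel development, with experience in low-level algorithm performance debugging and acceleration.
- Proficient in at least one deep learning framework such as TensorFlow or PyTorch, and familiar with common distributed machine learning frameworks such as Megatron, DeepSpeed, and HuggingFace Transformers.
- Familiar with mainstream LLM inference engines, such as FasterTransformer/vLLM, and common inference optimization methods such as FlashAttention, PagedAttention, Continuous Batching, and Speculative Decoding.
- Strong communication and teamwork skills, with the ability to work closely with cross-functional teams to solve problems and achieve shared goals.

Preferred:
- Experience with TensorRT/Triton/Cutlass.
- Experience bringing AIGC model inference and training acceleration into production.
- Familiarity with distributed inference acceleration frameworks and experience with distributed acceleration of very large models.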

 

#LI-FL1




Benefits offered are described:  AMD benefits at a glance.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.


