About the Role
- Country: Germany
- State/Province/County: Berlin
- City: Berlin
- Design, build, and rigorously optimize everything vital for large-scale training and/or fine-tuning across different model architectures, from data loading to distributed training to inference, to improve MFU (Model FLOPs Utilization) on large compute clusters. Collaborate closely and proactively with research scientists, translating research models and algorithms into high-performance, production-ready code and infrastructure. Implement, integrate, and test the latest advancements from research publications or open-source code.
- Relentlessly profile and resolve training performance bottlenecks, optimizing every layer of the training stack from data loading to model inference for speed and efficiency.
- Contribute to technology evaluations and selection of hardware, software, and cloud services that will define our AI infrastructure platform.
- Use MLOps frameworks (MLflow, Weights & Biases, etc.) to implement best practices across the model lifecycle – development, training, validation, and monitoring – ensuring reproducibility, reliability, and continuous improvement.
- Create thorough documentation for infrastructure, data pipelines, and training procedures, ensuring maintainability and knowledge transfer within the growing AI lab.
- Stay at the forefront of advancements in large-scale training strategies and data engineering, and proactively drive improvements and innovation in our workflows and infrastructure. We are looking for a high-agency individual who demonstrates initiative, problem-solving, and a commitment to delivering robust and scalable solutions with rapid prototyping and turnaround.
- Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.). Hands-on experience with large-scale multi-node GPU training and with other optimization strategies for developing large foundation models across various model architectures. Ability to scale solutions involving large datasets and sophisticated models on distributed compute infrastructure.
- Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.
- Strong communication and partnership skills, with a collaborative approach to working with research scientists and other engineers.
- Experience with MLOps best practices for model tracking, evaluation and deployment
- Bachelor's or Master's degree or equivalent experience in Computer Science, Engineering, or a related technical field.
- Long-term hands-on experience as a Data & AI Engineer or Machine Learning Engineer, specifically building and optimizing infrastructure for large-scale machine learning systems.
- Public GitHub profile demonstrating a track record of open-source contributions to relevant projects in data engineering or deep learning infrastructure is a BIG PLUS
- Experience writing CUDA/Triton/CUTLASS kernels
- Experience with performance monitoring and profiling tools for distributed training and data pipelines
At Siemens Energy, we are more than just an energy technology company. With ~100,000 dedicated employees in more than 90 countries, we develop the energy systems of the future, ensuring that the growing energy demand of the global community is met reliably and sustainably. The technologies created in our research departments and factories drive the energy transition and provide the base for one-sixth of the world's electricity generation.
Our global team is committed to making sustainable, reliable, and affordable energy a reality by pushing the boundaries of what is possible. We uphold a 150-year legacy of innovation that encourages our search for people who will support our focus on decarbonization, new technologies, and energy transformation.
- In addition to an attractive remuneration package in line with the market, you can expect an attractive employer-financed company pension scheme
- We also offer the opportunity to become a Siemens Energy shareholder
- We offer our employees the opportunity to work flexibly and remotely, and our inspiring offices provide space for collaboration and creativity
- The professional and personal development of our employees is very important to us. We provide them with opportunities to learn and develop in a self-determined way; various attractive programmes and learning materials are available for this purpose
- To support the "compatibility of family and work", we have a wide range of offers, e.g. flexible working time models, childcare places at many locations, the possibility of trial part-time work or even a sabbatical.