Job description:
For our client, we are looking for an AI Infrastructure & Inference Engineer (m/f/d) with a focus on GPU & LLM.
Duration: 5.1.26
Workload: Full-time
Location: Remote
Responsibilities:
- Design, implement, and optimize LLM and multimodal inference pipelines across multi-GPU, multi-node, and distributed environments.
- Build request-routing and load-balancing systems that deliver ultra-low-latency, high-throughput services.
- Develop auto-scaling and intelligent resource allocation to meet strict SLAs across multiple data centers.
- Make architectural trade-offs between latency, throughput, and cost efficiency for diverse workloads.
- Implement traffic shaping and multi-tenant orchestration for fair and reliable compute allocation.
- Collaborate with AI researchers, platform engineers, and ML practitioners to bring new model architectures to production.
- Automate system provisioning, deployment pipelines, and operational tasks using modern DevOps and MLOps practices.
- Monitor, profile, and benchmark system-level performance for maximum GPU utilization and uptime.
- Apply best practices in system security, observability (logging/metrics/tracing), and disaster recovery.
- Contribute to open-source ecosystems and internal tooling to push the boundaries of inference performance.
- Maintain comprehensive technical documentation and participate in continuous process improvements.
Required Skills
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in high-performance computing, GPU infrastructure, or distributed systems.
- Deep understanding of multi-GPU orchestration, workload scheduling, and distributed architectures.
- Proficiency in programming (Python or a similar language) and in systems automation scripting.
- Strong background in containerization (Docker), orchestration frameworks (Kubernetes), and CI/CD pipelines.
- Familiarity with observability tools such as Prometheus, Grafana, and OpenTelemetry.
- Strong understanding of OS-level performance (multi-threading, networking, memory management).
- Clear communication skills and the ability to work collaboratively across technical teams.
Preferred Skills
- Experience with NVIDIA DGX systems, NIM, TensorRT-LLM, or high-performance inference frameworks.
- Hands-on knowledge of CUDA, NCCL, Triton, MPI, NVLink, or InfiniBand networking.
- Experience deploying GPU clusters in both cloud and bare-metal environments.
- Familiarity with open-source inference ecosystems like SGLang, vLLM, or NVIDIA Dynamo.
- Knowledge of LLM optimization techniques for inference and fine-tuning acceleration.
- Understanding of enterprise security frameworks, compliance standards, and GDPR requirements.