100% remote - AI Infrastructure & Inference Engineer with focus on GPU & LLM

Job description:

For our client, we are looking for an AI Infrastructure & Inference Engineer (m/f/d) with a focus on GPU & LLM.
Duration: 5.1.26
Workload: Full-time
Location: Remote
Tasks:
- Design, implement, and optimize LLM and multimodal inference pipelines across multi-GPU, multi-node, and distributed environments.
- Build request routing and load balancing systems to ensure ultra-low latency, high-throughput services (see the routing sketch after this list).
- Develop auto-scaling and intelligent resource allocation to meet strict SLAs across multiple data centers.
- Architect trade-offs between latency, throughput, and cost efficiency for diverse workloads.
- Implement traffic shaping and multi-tenant orchestration for fair and reliable compute allocation.
- Collaborate with AI researchers, platform engineers, and ML practitioners to bring new model architectures to production.
- Automate system provisioning, deployment pipelines, and operational tasks using modern DevOps and MLOps practices.
- Monitor, profile, and benchmark system-level performance for maximum GPU utilization and uptime.
- Apply best practices in system security, observability (logging/metrics/tracing), and disaster recovery.
- Contribute to open-source ecosystems and internal tooling to push the boundaries of inference performance.
- Maintain comprehensive technical documentation and participate in continuous process improvements.
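
The routing and load-balancing bullet above describes the kind of logic this role owns. Below is a minimal, hypothetical Python sketch of least-loaded request routing across GPU workers; the class, method, and worker names are illustrative, not taken from the client's stack.

```python
from collections import Counter

class LeastLoadedRouter:
    """Route each request to the GPU worker with the fewest in-flight requests."""

    def __init__(self, workers):
        # Track in-flight request counts per worker, all starting at zero.
        self.in_flight = Counter({w: 0 for w in workers})

    def acquire(self):
        # Pick the worker currently holding the fewest open requests.
        worker = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[worker] += 1
        return worker

    def release(self, worker):
        # Call once the response has finished streaming back to the client.
        self.in_flight[worker] -= 1

router = LeastLoadedRouter(["gpu-node-0", "gpu-node-1", "gpu-node-2"])
w = router.acquire()   # dispatch the request to worker `w`
router.release(w)      # free the slot when the request completes
```

A production router would additionally weight by queue depth, KV-cache occupancy, and node health rather than a bare request count.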
Required skills
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in high-performance computing, GPU infrastructure, or distributed systems.
- Deep understanding of multi-GPU orchestration, workload scheduling, and distributed architectures.
- Proficiency with programming (Python or similar language) and systems automation scripting.
- Strong background in containerization (Docker), orchestration frameworks (Kubernetes), and CI/CD pipelines.
- Familiarity with observability tools such as Prometheus, Grafana, and OpenTelemetry (a minimal metrics sketch follows this list).
- Strong understanding of OS-level performance (multi-threading, networking, memory management).
- Clear communication skills and the ability to work collaboratively across technical teams.
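
To make the observability requirement concrete, here is a minimal sketch that exposes Prometheus metrics from an inference service using the official prometheus_client library; the metric names and the handle_request stub are hypothetical placeholders.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; a real service would follow team naming conventions.
REQUESTS = Counter("inference_requests_total", "Total inference requests", ["model"])
LATENCY = Histogram("inference_latency_seconds", "End-to-end request latency in seconds")

@LATENCY.time()  # records each call's duration into the histogram
def handle_request(model: str) -> None:
    REQUESTS.labels(model=model).inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for actual GPU inference

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:
        handle_request("llama-3-8b")
```

A Prometheus server would scrape the /metrics endpoint, with Grafana dashboards and alerts built on top.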
Preferred skills
- Experience with NVIDIA DGX systems, NIM, TensorRT-LLM, or high-performance inference frameworks.
- Hands-on knowledge of CUDA, NCCL, Triton, MPI, NVLink, or InfiniBand networking.
- Experience deploying GPU clusters in both cloud and bare-metal environments.
- Familiarity with open-source inference ecosystems like SGLang, vLLM, or NVIDIA Dynamo (see the vLLM sketch after this list).
- Knowledge of LLM optimization techniques for inference and fine-tuning acceleration.
- Understanding of enterprise security frameworks, compliance standards, and GDPR requirements.
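
As an illustration of the open-source inference ecosystems named above, this is a minimal offline-batch sketch using vLLM's public Python API; the model name, prompt, and sampling values are placeholders.

```python
from vllm import LLM, SamplingParams

# Placeholder model and sampling values; multi-GPU serving would raise
# tensor_parallel_size to the number of GPUs available on the node.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key to low-latency LLM serving is"], params)
for out in outputs:
    print(out.outputs[0].text)
```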

Be a part of our community

Join us on Telegram or Discord to get instant notifications about the newest freelance projects and talk to some of the smartest software engineers in the world.