Senior ML Ops Engineer (Remote)

ZOOLA TECH POLAND sp. z o.o.

Wrocław, Fabryczna
Remote

Requirements

Expected technologies

Kubernetes

PyTorch

Terraform

Helm

Kustomize

Optional technologies

AWS

Operating system

Windows

Our requirements

  • 5+ years of experience in MLOps, infrastructure, or platform engineering.
  • Experience setting up and scaling training and fine-tuning pipelines for ML models in production environments.
  • Strong expertise in Kubernetes, container orchestration, and cloud-native architecture (AWS preferred), particularly for GPU workloads.
  • Hands-on experience with training frameworks such as PyTorch Lightning, Hugging Face Accelerate, or DeepSpeed.
  • Proficiency in infrastructure-as-code (Terraform, Helm, Kustomize) and cloud platforms (AWS preferred).
  • Familiarity with artifact tracking, experiment management, and model registries (e.g., MLflow, W&B, SageMaker Experiments).
  • Strong Python engineering skills and experience debugging ML workflows at scale.
  • Experience deploying and scaling inference workloads using modern ML frameworks.
  • Deep understanding of CI/CD systems and their role in ML production.
  • Working knowledge of monitoring and alerting systems for ML workloads.
  • A strong sense of ownership and commitment to quality, security, and operational excellence.

Your responsibilities

  • Design and maintain infrastructure for training, evaluating, and deploying machine learning models at scale.
  • Manage GPU orchestration on Kubernetes (EKS), including node autoscaling, bin-packing, taints/tolerations, and cost-aware scheduling strategies (e.g., spot/preemptible GPUs).
  • Build and optimize CI/CD pipelines for ML code, data versioning, and model artifacts using tools like GitHub Actions, Argo Workflows, and Terraform.
  • Manage and optimize containerized ML workloads on Kubernetes (EKS), including node autoscaling, GPU orchestration, and runtime scheduling.
  • Develop and maintain observability for model and pipeline health (e.g., using Prometheus, Grafana, OpenTelemetry).
  • Collaborate with Data Scientists and ML Engineers to productionize notebooks, pipelines, and models.
  • Work with security and compliance teams to implement best practices for model serving and data access.
  • Support inference backends, including vLLM, Hugging Face, NVIDIA Triton, and other runtime engines, and optimize GPU utilization.
  • Develop tools to simplify model deployment, rollback, and A/B testing for experimentation and reliability.
  • Lead incident response and debugging of performance issues in production AI systems.
Published: 1 day ago
Expires: in 12 days
Work mode: Remote