Senior ML Ops Engineer (Remote)

ZOOLA TECH POLAND sp. z o.o.

Wrocław, Fabryczna

Remote

Requirements

Expected technologies

Kubernetes

PyTorch

Terraform

Helm

Kustomize

Optional technologies

AWS

Operating system

Windows

Our requirements

5+ years of experience in MLOps, infrastructure, or platform engineering.
Experience setting up and scaling training and fine-tuning pipelines for ML models in production environments.
Strong expertise in Kubernetes, container orchestration, and cloud-native architecture (AWS preferred), specifically with GPUs.
Hands-on experience with training frameworks like PyTorch Lightning, Hugging Face Accelerate, or DeepSpeed.
Proficiency in infrastructure-as-code (Terraform, Helm, Kustomize) and cloud platforms (AWS preferred).
Familiarity with artifact tracking, experiment management, and model registries (e.g., MLflow, W&B, SageMaker Experiments).
Strong Python engineering skills and experience debugging ML workflows at scale.
Experience deploying and scaling inference workloads using modern ML frameworks.
Deep understanding of CI/CD systems and their role in ML production.
Working knowledge of monitoring and alerting systems for ML workloads.
A strong sense of ownership and commitment to quality, security, and operational excellence.

Your responsibilities

Design and maintain infrastructure for training, evaluating, and deploying machine learning models at scale.
Manage GPU orchestration on Kubernetes (EKS), including node autoscaling, bin-packing, taints/tolerations, and cost-aware scheduling strategies (e.g., spot/preemptible GPUs).
Build and optimize CI/CD pipelines for ML code, data versioning, and model artifacts using tools like GitHub Actions, Argo Workflows, and Terraform.
Manage and optimize containerized ML workloads on Kubernetes (EKS), including node auto-scaling, GPU orchestration, and runtime scheduling.
Develop and maintain observability for model and pipeline health (e.g., using Prometheus, Grafana, OpenTelemetry).
Collaborate with Data Scientists and ML Engineers to productionize notebooks, pipelines, and models.
Work with security and compliance teams to implement best practices around model serving and data access.
Support inference backends, including vLLM, Hugging Face, NVIDIA Triton, and other runtime engines, and optimize GPU utilization.
Develop tools to simplify model deployment, rollback, and A/B testing for experimentation and reliability.
Lead incident response and debugging of performance issues in production AI systems.

Published	a day ago
Expires	in 12 days
Work mode	Remote