Staff Site Reliability Engineer - PRE

VISA

Hybrid

Requirements

Expected technologies

Hadoop

Kafka

HBase

Spark

Shell

Ansible

Python

Kerberos

Ranger

Active Directory

Our requirements

  • Sound knowledge of managing large-scale Hadoop platforms, including monitoring the platform, debugging issues, and tuning cluster performance.
  • In-depth knowledge of the Hadoop ecosystem, including ZooKeeper, HDFS, YARN, Hive, Spark, Trino, and Kafka.
  • Proven experience in debugging issues on both the Hadoop platform and the applications running on it.
  • Familiarity with security tools such as Kerberos and Ranger, and with Active Directory integrations.
  • Experience with cloud technologies, preferably AWS EMR.
  • Knowledge of Kubernetes, AI, and MLOps is an advantage.
  • Master's degree in Math, Science, Engineering, Computer Science, Information Systems, or a related field; OR Bachelor's degree in Math, Science, Engineering, Computer Science, Information Systems, or a related field AND a minimum of five (5) years of experience in a directly related field; OR
  • A minimum of five (5) years working on Hadoop systems.
  • The role involves performing Big Data SRE and Engineering activities on multiple open-source platforms such as Hadoop, Kafka, HBase, and Spark. The candidate should possess strong troubleshooting and debugging skills.
  • Other responsibilities include effective root cause analysis of major production incidents and the development of learning documentation. The person will identify and implement high-availability solutions for services with a single point of failure.
  • The role involves planning and performing capacity expansions and upgrades in a timely manner to avoid any scaling issues and bugs. This includes automating repetitive tasks to reduce manual effort and prevent human errors.
  • The successful candidate will tune alerting and set up observability to proactively identify issues and performance problems. They will also work closely with Level-3 teams in reviewing new use cases and cluster hardening techniques to build robust and reliable platforms.
  • The role involves creating standard operating procedure documents and guidelines for managing and using the platforms effectively. The person will apply DevOps tools, disciplines (incident, problem, and change management), and standards in day-to-day operations.
  • The individual will ensure that the Hadoop platform meets its performance and service level agreement requirements. They will also perform security remediation, automation, and self-healing as required.
  • The individual will concentrate on developing automations and reports to minimize manual effort. This can be achieved through various automation tools such as Shell scripting, Ansible, or Python scripting, or by using any other programming language.

Your responsibilities

  • Collaborate closely with L-3 teams to review new use cases and implement cluster hardening techniques, ensuring the development of robust and reliable platforms.
  • Foster cross-team collaboration, building and maintaining strong relationships with customer teams, user communities, architects, and engineering teams.
  • Work jointly on key deliverables to ensure production scalability and stability.
  • Automation: Hands-on experience with automation using Ansible, Shell, Python, or any other programming language. The ability to automate manual tasks is key in this role.
  • Observability: Knowledge of observability tools such as Grafana, opera, Prometheus, and Splunk.
  • Linux: Understanding of Linux, networking, CPU, memory, and storage.
  • Programming Languages: Knowledge of, and ability to code in, Python, Java, or another widely used programming language.
  • Communication: Excellent interpersonal skills, along with superior verbal and written communication abilities.
Work mode: Hybrid