
Site Reliability Engineer Job Opening in Monterrey at NOV Inc

Site Reliability Engineer



Job description

Overview

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a specialization in Application Performance Monitoring (APM) to join our team.

You will be a key player in ensuring the reliability, performance, and scalability of our mission-critical applications and systems.

You will work closely with software engineering and operations teams to proactively identify, analyze, and resolve performance issues.

The ideal candidate is a creative problem-solver with deep expertise in APM tools, particularly the Elastic Stack, and a passion for designing and implementing innovative solutions to complex technical challenges.

Responsibilities
  • APM Strategy: Design, implement, and manage our Application Performance Monitoring strategy using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
  • Deep Performance Analysis: Utilize APM tools to conduct in-depth performance analysis, tracing distributed requests, identifying bottlenecks, and optimizing application code and infrastructure.
  • Dashboarding and Alerting: Develop and maintain comprehensive dashboards, visualizations, and intelligent alerting systems in Grafana, Kibana, or other platforms to provide real-time insights into application health and performance.
  • Proactive Issue Resolution: Monitor systems to detect and respond swiftly to performance degradations, security threats, and system failures before they impact users.
  • Define and Track SLOs: Measure and optimize system performance by establishing and tracking key Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
  • Root Cause Analysis (RCA): Lead post-incident investigations to analyze the root cause of production issues, quantify business impact, and implement corrective actions to prevent recurrence.
  • Automation: Automate repetitive tasks, monitoring setups, and incident response processes to enhance efficiency and reduce manual intervention.
  • Collaboration: Partner with software engineering and operations teams to embed reliability and performance best practices into the entire development lifecycle.
  • Continuous Improvement: Continuously refine our systems, processes, and APM tooling to elevate reliability, performance, and observability.
  • Stakeholder Engagement: Engage with business stakeholders to understand key application pain points and solicit feedback to inform the platform roadmap.

Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a Site Reliability, DevOps, or Performance Engineering role.
  • Proven, hands-on experience with Application Performance Monitoring (APM) tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
  • Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
  • Strong understanding of SRE principles, Production Support Operations, DevOps, and CI/CD methodologies.
  • Proficiency in scripting languages such as Python, Bash, or PowerShell for automation and data analysis.
  • Solid understanding of Linux/Unix systems, networking fundamentals, and distributed systems architecture.
  • Experience with containerization and orchestration technologies, specifically Docker and Kubernetes.
  • Excellent problem-solving skills with the ability to perform deep-dive analysis and think creatively.
  • Strong communication and interpersonal skills, with the ability to collaborate effectively in a global, cross-functional team environment.

Desired Skills
  • Experience with Infrastructure as Code (IaC) automation tools like Ansible, Terraform, or Chef.
  • Knowledge of cloud-native services and serverless architectures (e.g., AWS Lambda, Azure Functions).
  • Familiarity with modern CI/CD tools and environments (e.g., GitHub, Azure DevOps, Jenkins).
  • Experience with other observability pillars, including metrics (Prometheus) and logging.
  • Knowledge of agile development methodologies.

Required Skill Profession: Computer Occupations




