London Prism Digital Ltd 1-2 Paris Garden London SE1 8ND
Major Incident Lead 2026-03-19 Major Incident Lead | SRE, Cloud & Incident Management | Mission-Critical Platforms | Global HealthTech You will lead the response to high-severity, customer-impacting incidents across a global managed services platform. Penpole 2026-04-19

Major Incident Lead

€90,000 - €100,000

Dublin
Jemima Allison


Major Incident Lead | SRE, Cloud & Incident Management | Mission-Critical Platforms | Global HealthTech

 

  • Dublin, Ireland (initially remote, 4 days onsite)
  • Permanent, Full-time
  • €90,000 – €100,000 base + bonus

 
The role:
You will lead the response to high-severity, customer-impacting incidents across a global managed services platform.

This is a hands-on incident leadership role, acting as the central point of coordination during major outages to ensure fast resolution, clear communication and minimal impact. You will operate in an SRE-led environment, combining real-time incident management with ongoing reliability improvements.

You will work across engineering, cloud and support teams to manage incidents end-to-end and drive post-incident improvements to prevent recurrence.
The role includes a paid on-call rotation. Incidents are relatively infrequent, so the focus is on being ready when it matters.
 
Non-Negotiables:

  • Major incident management experience (P1/P2 incidents)
  • SRE, production operations or service reliability background
  • Experience in 24/7, mission-critical environments
  • Strong understanding of incident, problem and change management (ITIL)
  • Experience with cloud platforms (AWS, Azure or GCP)
  • Experience with incident tooling (ServiceNow, Jira, PagerDuty)
  • Strong stakeholder communication (including senior leadership)
  • Ability to lead under pressure and make clear, structured decisions

 
What You’ll Work With

  • AWS / Azure / GCP
  • Kubernetes environments
  • ServiceNow / Jira / PagerDuty
  • Observability tooling (Grafana, Prometheus, Datadog, Splunk)
  • Cloud monitoring tools (CloudWatch)
  • SRE practices (SLIs, SLOs, error budgets)
  • Runbooks and incident playbooks
  • High-availability, distributed systems

 
Core Responsibilities:

  • Lead end-to-end management of major incidents (P1/P2)
  • Act as Incident Commander during live incidents
  • Coordinate across engineering, SRE, cloud and support teams
  • Deliver clear, timely updates to stakeholders and customers
  • Manage executive-level communication during critical events
  • Drive post-incident reviews and root cause analysis (RCA)
  • Ensure corrective and preventative actions are defined and tracked
  • Identify patterns, trends and repeat issues for improvement
  • Improve incident processes, tooling and runbooks
  • Ensure incidents meet SLA and regulatory requirements
  • Maintain audit-ready documentation and reporting
  • Participate in on-call rotation and incident simulations

 
Examples of the work:

  • Leading response to critical outages across cloud platforms
  • Coordinating multi-team incident resolution under time pressure
  • Communicating with senior stakeholders during live incidents
  • Running post-mortems to prevent repeat failures
  • Improving incident response processes and automation
  • Identifying reliability gaps across a 24/7 platform

 
Nice to Haves

  • Background in Site Reliability Engineering (SRE) roles
  • Experience in MSP or managed services environments
  • Experience in regulated or healthcare systems
  • Exposure to data platforms or high-throughput systems
  • Interest in AI or automation in incident management

 
Why Join
You will join a team responsible for mission-critical platforms where reliability is non-negotiable.

This role sits at the centre of operations, giving you visibility across engineering, cloud and customer environments. You will have real ownership over how incidents are handled and how the platform improves over time.

The business is also investing heavily in its cloud and managed services capability, making this a strong opportunity to step into a role with long-term growth and impact.
 
Employee Benefits

  • 12.5% annual bonus
  • Paid on-call allowance and call-out compensation
  • Clear progression as the Ireland function scales
  • High-impact role with exposure to global teams and leadership



 

Job reference: #BH-12386