Gizmo

Founding Site Reliability Engineer (SRE)

Posted 18 Days Ago
In-Office
London, Greater London, England
Expert/Leader
As the founding Site Reliability Engineer, you'll ensure performance and reliability, define SLIs/SLOs, conduct load testing, automate operations, and mentor engineers.

Gizmo is an AI startup on a mission to make learning so easy that anyone can learn anything. We're building Duolingo for anything: a platform that uses gamification and social mechanics to make learning fun.

With over 1 million monthly active users and $4M in annual recurring revenue, we’re already one of the fastest-growing startups in the UK. Backed by leading investors, we recently raised $22M in Series A funding to accelerate our vision of helping 1 billion people learn.

Role Overview
You will be our founding SRE. Reporting to the CTO, you will own capacity, performance and reliability for Gizmo’s full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You’ll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit.

Key Responsibilities

  • Define SLIs/SLOs for latency, availability and error rate; codify error budgets and partner with product teams on trade-offs.
  • Perform load testing, capacity modelling and up-front scalability design for PostgreSQL, OpenSearch, Redis, Hasura and Cloudflare Workers; produce data-driven scaling plans.
  • Extend metrics, structured logging and tracing; establish alert rules that page only on user-visible impact; build actionable runbooks.
  • Join the on-call rotation, lead blameless post-mortems, drive remediation work to closure and track MTTR/MTBF improvements.
  • Automate repetitive ops on Kubernetes and CI/CD; keep “toil” below 50% of your time by pushing fixes into code.
  • Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns in an SRE playbook.

Requirements
  • Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL).
  • You have software engineering experience.
  • Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs.
  • Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts.
  • Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer.
  • Self-starter with a maker mindset: we’re looking for ex-founders or individuals with start-up experience.
  • Start-up bias for action: you prioritise high-leverage fixes, ship iteratively and own outcomes end-to-end.
  • Collaborative and feedback-driven; you welcome post-mortem culture and continuous improvement.
  • Driven by impact: you prioritise work that moves the needle.

Nice-to-haves: experience with Hasura internals, Cloudflare Workers edge optimisation, or operating OpenSearch at scale.


Benefits
  • Highly competitive salary.
  • Equity included: you'll own a piece of what you're building.
  • Hybrid working model with 4 days in our East London office, ideally located between Shoreditch High Street, Old Street, and Liverpool Street stations.
  • The opportunity to become one of the earliest employees in one of the UK’s fastest-growing startups.
  • Private health insurance.

Top Skills

CI/CD
Grafana
Hasura
Kubernetes
OpenSearch
OpenTelemetry
Postgres
Prometheus
Redis
