Employment Type

Full-Time

Location

Remote - North America, Remote - South America

Tech Lead, Site Reliability Engineering (SRE)

At Edge & Node, we’re focused on building The Graph, a decentralized protocol for accessing and organizing the world’s knowledge and information. Subgraphs, a core technology developed by Edge & Node to access blockchain data, are widely used across web3 to power decentralized applications.

We’re a tight-knit, efficient team with a bias for action and a strong sense of ownership. Our teams have autonomy, low ego, and are trusted to drive projects end to end. We care deeply about building infrastructure for web3 use cases and collaborate across disciplines to make that happen. If you’re passionate about infrastructure that has a real impact on our users, enjoy solving hard problems, and thrive in a fast-paced environment, you’ll feel right at home.

The Engineering Operations team, including Site Reliability, works closely with engineering teams across Edge & Node to ensure the services we operate are reliable, performant, predictable, and secure. We are on a mission to take our service delivery to the next level.

About the Role

We're looking for a Tech Lead, Site Reliability Engineering (SRE) to help guide the future of infrastructure and reliability at Edge & Node within The Graph ecosystem. In this role, you'll be both a technical leader and hands-on contributor, driving industry best practices in reliability, infrastructure, observability, and operational excellence while collaborating cross-functionally to optimize and secure our systems.

You'll play a critical role in setting the technical direction for the team—shaping how we build, scale, and support our services—while also staying actively involved in the day-to-day work. This is a role for someone who’s passionate about infrastructure and security, has stellar interpersonal skills, and is excited to lead us toward a more scalable, observable, and resilient future. You will partner closely with the team’s manager as well as stakeholders across the engineering organization.


What You'll Do

  • Lead by example as a hands-on technical contributor, participating in on-call rotations, incident response, and the day-to-day work of the SRE team

  • Partner with engineering and product leadership to shape roadmaps, define team priorities, and plan work that improves reliability, performance, and scalability across the stack

  • Work alongside and support other SREs, using your leadership and interpersonal skills to foster a culture of continuous learning, blameless retrospectives, and technical excellence

  • Own the incident lifecycle, including root cause analysis and follow-up remediation, and work to make our systems increasingly self-healing

  • Drive SRE team strategy, advocating for industry best practices, standardization, and secure and optimized infrastructure

  • Architect and improve core infrastructure services, with an eye toward high availability, fault tolerance, performance, and end-to-end observability

  • Work across teams to challenge assumptions, fundamentally overhaul our systems, and improve documentation

  • Collaborate with external partners and vendors as needed to ensure the health of critical services

What We’re Looking For

  • Proven experience as a senior or lead SRE or DevOps engineer, ideally having led large-scale reliability initiatives or infrastructure transformation projects

  • Strong project or technical leadership skills, with a track record of guiding teammates and setting technical direction while still remaining hands-on

  • Deep knowledge of the SRE/DevOps domain, including incident response, security awareness, maintaining SLAs and uptime guarantees, observability, supporting internal development teams, project and capacity planning, and system architecture

  • Experience with both cloud and on-prem core infrastructure, ideally with Google Cloud Platform (GCP), bare-metal infrastructure, and Kubernetes (or similar orchestration tools)

  • Fluency in infrastructure as code, Terraform, automation tooling, CI/CD pipelines, and system monitoring solutions such as Grafana

  • Excellent interpersonal, leadership, and communication skills, with the ability to align stakeholders and motivate and unblock team members

  • Experience in web3, crypto, or blockchain is a plus (but not required)

About The Graph

The Graph is the indexing and query layer of the decentralized internet. As the first open data marketplace to introduce and standardize subgraphs, The Graph is a flagship solution for accessing blockchain data across web3.

Since launching in 2018, tens of thousands of developers have built subgraphs to power dapps across 90+ blockchains. As demand for web3 data grows, The Graph is evolving to support a broader range of data services and query languages, expanding what’s possible with decentralized infrastructure—now and in the future.

Discover more about how The Graph is shaping the future of decentralized physical infrastructure networks (DePIN) by following The Graph on X, LinkedIn, Instagram, Facebook, Reddit, and Medium. Join the community on The Graph’s Telegram, and join technical discussions on The Graph’s Discord.
