Site Reliability Engineer (SRE)

Company

Handoff

Date Posted

04-01-2026

Location

São Paulo, São Paulo, Brazil

Why join us?
Handoff is the AI agent that runs a construction company. We help remodelers automate estimating, streamline operations, and win more work, backed by real-time cost data, intuitive design, and workflows that “speak contractor.” With over 10,000 monthly active users and $6B in annualized project volume already flowing through our platform, we’re becoming the trusted partner for the people who build our homes.
We are backed by $25M+ raised from Y Combinator, Initialized, and Greycroft. Our team is distributed across hubs in Austin, São Paulo, and Buenos Aires, and we are deeply focused on building intuitive, high-impact solutions that make a real difference for our users.

Site Reliability Engineer at Handoff
You will own and elevate the reliability, scalability, and observability of Handoff’s platform. This is a hands-on role focused on preventing incidents, improving system resilience, and enabling fast, safe product development.
You’ll work closely with Backend, Fullstack, Data, and AI engineers to ensure our systems are production-ready, observable, and built to scale, while keeping a strong focus on user impact and developer velocity. This is not a pure ops role. We’re looking for someone who thinks like an engineer, codes regularly, and partners deeply with product and engineering teams.

What you'll do
    • Define and implement SLIs, SLOs, and error budgets for critical services, making reliability visible and measurable across the org.
    • Design, build, and maintain monitoring, logging, and alerting systems that surface real issues without unnecessary noise.
    • Lead and participate in incident response, owning detection, coordination, communication, and resolution.
    • Partner with engineers early in the design phase to improve reliability, scalability, and production readiness.
    • Improve CI/CD pipelines, deployment strategies, and rollback mechanisms to enable fast and safe releases.
    • Automate operational tasks and reduce toil through tooling, scripting, and infrastructure-as-code.
    • Ensure backup, disaster recovery, and failover strategies are defined, documented, and tested.
    • Monitor infrastructure performance and costs, proposing optimizations that balance reliability, speed, and efficiency.
    • Create and maintain runbooks, incident procedures, and reliability documentation that support the team as it scales.
About you
    • Strong experience as an SRE, Platform Engineer, or DevOps Engineer, or in a similar reliability-focused role.
    • Solid understanding of reliability fundamentals: availability, latency, error rates, throughput, and durability.
    • Hands-on experience with cloud platforms like AWS, GCP, or Azure.
    • Deep familiarity with observability tools such as Prometheus, Datadog, Grafana, OpenTelemetry, or similar.
    • Strong debugging skills and comfort working in high-pressure production incidents.
    • Experience improving CI/CD pipelines and release safety.
    • Ability to write production-quality code or scripts in languages like Python, Go, or Bash.
    • Experience with infrastructure-as-code and automation.
    • A pragmatic mindset that balances reliability with product velocity and real-world constraints.
    • Strong communication skills and comfort collaborating across engineering, product, and leadership.
    • Comfort in a fast-paced environment: you adapt quickly to changing priorities and balance rapid iteration with high-quality output.
What we offer
    • 💸 Competitive salary in USD
    • 💰 Attractive stock options
    • 🌴 Unlimited PTO
    • 🚛 Relocation allowance
    • 👨‍💻 Top-notch equipment
    • 🧳 Team offsites around the world (we've already been to more than 5 countries!)
If you’re motivated by owning reliability end-to-end and shaping infrastructure decisions that impact customers every day, we’d love to hear from you.
Please note that we will only consider applications submitted in English.