Site Reliability Engineer (SRE)
Company: Handoff
Date Posted: 04-01-2026
Location: São Paulo, São Paulo, Brazil
Why join us?
Handoff is the AI agent that runs a construction company. We help remodelers automate estimating, streamline operations, and win more work, backed by real-time cost data, intuitive design, and workflows that “speak contractor.” With over 10,000 monthly active users and $6B in annualized project volume already flowing through our platform, we’re becoming the trusted partner for the people who build our homes.
We are backed by $25M+ raised from Y Combinator, Initialized, and Greycroft. Our team is distributed across hubs in Austin, São Paulo, and Buenos Aires, and we are deeply focused on building intuitive, high-impact solutions that make a real difference for our users.
Site Reliability Engineer at Handoff
You will own and elevate the reliability, scalability, and observability of Handoff’s platform. This is a hands-on role focused on preventing incidents, improving system resilience, and enabling fast, safe product development.
You’ll work closely with Backend, Fullstack, Data, and AI engineers to ensure our systems are production-ready, observable, and built to scale, while keeping a strong focus on user impact and developer velocity. This is not a pure ops role. We’re looking for someone who thinks like an engineer, codes regularly, and partners deeply with product and engineering teams.
What you'll do
- Define and implement SLIs, SLOs, and error budgets for critical services, making reliability visible and measurable across the org.
- Design, build, and maintain monitoring, logging, and alerting systems that surface real issues without unnecessary noise.
- Lead and participate in incident response, owning detection, coordination, communication, and resolution.
- Partner with engineers early in the design phase to improve reliability, scalability, and production readiness.
- Improve CI/CD pipelines, deployment strategies, and rollback mechanisms to enable fast and safe releases.
- Automate operational tasks and reduce toil through tooling, scripting, and infrastructure-as-code.
- Ensure backup, disaster recovery, and failover strategies are defined, documented, and tested.
- Monitor infrastructure performance and costs, proposing optimizations that balance reliability, speed, and efficiency.
- Create and maintain runbooks, incident procedures, and reliability documentation that support the team as it scales.
About You
- Strong experience as an SRE, Platform Engineer, or DevOps Engineer, or in a similar reliability-focused role.
- Solid understanding of reliability fundamentals: availability, latency, error rates, throughput, and durability.
- Hands-on experience with cloud platforms like AWS, GCP, or Azure.
- Deep familiarity with observability tools such as Prometheus, Datadog, Grafana, OpenTelemetry, or similar.
- Strong debugging skills and comfort working in high-pressure production incidents.
- Experience improving CI/CD pipelines and release safety.
- Ability to write production-quality code or scripts in languages like Python, Go, or Bash.
- Experience with infrastructure-as-code and automation.
- A pragmatic mindset that balances reliability with product velocity and real-world constraints.
- Strong communication skills and comfort collaborating across engineering, product, and leadership.
- Comfort in a fast-paced environment: you adapt quickly to changing priorities and balance rapid iteration with high-quality output.
What we offer
- 💸 Competitive salary in USD
- 💰 Attractive stock options
- 🌴 Unlimited PTO
- 🚛 Relocation allowance
- 👨‍💻 Top-notch equipment
- 🧳 Team offsites around the world: we've already been to more than five countries!
If you’re motivated by owning reliability end-to-end and shaping infrastructure decisions that impact customers every day, we’d love to hear from you.
Please note that we will only consider applications submitted in English.