Principal Site Reliability Engineer, Data Protection Products
Company
ConnectWise
Date Posted
26-08-2025
Location
United States
Remote
ConnectWise is an industry-leading global software company with over 3,000 colleagues in North America, EMEA, and APAC. As a community-driven software company dedicated to the success of technology solution providers, we offer a suite of products that helps more than 45,000 partners manage their businesses better, sell more efficiently, automate service delivery, and remotely control technology so they can consistently deliver amazing customer experiences.
Our company is powered by our connections, our colleagues, and our community. And we accept all kinds.
Game-changers, innovators, culture-lovers—and humankind.
We invite discovery and debate. We recognize key moments as milestones.
We see you and value you for your unique contributions. Our inclusive, positive culture lays the foundation to ensure every colleague is valued for their perspectives and skills, giving you the choice of how YOU make a difference.
Curious? Read on to learn how YOU can make a difference at ConnectWise!
General Summary:
As a Principal Site Reliability Engineer, you will work as an integral member of product teams, helping to build, deploy, and monitor cloud services reliably. You will contribute to complex software development projects that maintain essential, revenue-critical services. You will also write code and build frameworks to monitor services deployed in production, driving reliability and performance at large scale. You will be responsible for the reliability, availability, and performance of our Elasticsearch infrastructure. We're seeking a talented Site Reliability Engineer who can work with minimal supervision, define test procedures, and collaborate effectively with Developers, Designers, Customer Support, and Engineering Leadership.
Essential Duties and Responsibilities:
· Build systems and infrastructure to monitor complex, large-scale distributed systems.
· Identify stability/performance issues and collaborate with developers to triage critical issues in production systems.
· Represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
· Devise ways to actively monitor system throughput, capacity, and reliability.
· Debug complex systems and evolve a running environment without causing downtime.
· Engage in service capacity planning and demand forecasting, as well as software performance analysis and system tuning.
· Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.
· Monitor and troubleshoot Elasticsearch performance issues and outages.
Who You Are
· Bachelor’s degree in Computer Science or equivalent work experience as a System Administrator with programming skills.
· Fundamental knowledge of technologies across a broad range of disciplines, including virtualization, storage, networking, server, and security.
· Understanding of systems and application design, including the operational trade-offs of various designs.
· Experience with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack.
· Proficiency in scripting languages such as Python.
· Experience with infrastructure-as-code tools such as Terraform or CloudFormation.
· Strong understanding of Linux system administration and networking concepts.
· Excellent troubleshooting and problem-solving skills.
· Ability to work independently and collaboratively in a fast-paced environment.
· Strong communication and interpersonal skills.
· Demonstrable knowledge of Unix, TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
· Experience in analyzing logs and troubleshooting large-scale distributed systems.
· Excellent organizational, time management, and communication skills.
Nice to Have
· Experience instrumenting and monitoring production systems using tools such as the ELK stack, Zabbix, Nagios, StatsD/Graphite, APM, etc.
· Experience with AWS infrastructure (including EC2, S3, VPC, Security Groups, and RDS) and related services.
· A working understanding of Docker, Vagrant, and configuration management tools like Ansible, Chef, or Puppet.
· Experience with one or more general-purpose programming/scripting languages, including but not limited to Python, Bash, Perl, or Go.
Benefits include:
· Medical Insurance
· Flexible PTO
· Flex Friday
· Hybrid Work Option Available
· Tuition Reimbursement
· And more!
ConnectWise is an Equal Opportunity Employer, dedicated to building a diverse and inclusive workforce and providing a workplace free from discrimination and harassment. ConnectWise provides equal employment opportunities to all employees and applicants without regard to race, ethnicity, color, religion, age, sex (including pregnancy), sexual orientation, gender, gender identity or expression, ancestry, national origin, citizenship status, physical or mental disability, genetic information, military/veteran status, marital status, familial or parental status, or any other characteristic or status protected by applicable federal, state and local laws.
The statements above are intended to describe the general nature and level of work being performed by individuals assigned to this job. Other duties may be assigned as needed. Reasonable accommodations may be made to enable qualified individuals with disabilities to perform the essential functions of the job and/or to receive other benefits and privileges of employment. If you need a reasonable accommodation for any part of the application and hiring process, please contact us at talentacquisition@connectwise.com or 1-800-671-6898.