Site Reliability Engineering Lead - Cloud (Directo
The Judge Group Inc.

Plano, Texas

Posted in IT
13 days ago

Job Info

Location: Plano, TX
Description: Our client is currently seeking a Site Reliability Engineering Lead

(Deep experience in Cloud with some experience in SRE)

Client is seeking a passionate and talented Site Reliability Engineering Director (Cloud SRE lead) with extensive hands-on cloud experience to join our newly formed CTO SRE practice. This team of SREs will own the tooling and processes that enable a successful cloud-based product, including alerting, monitoring, dashboards, test automation, upgrade plans, and runbooks. This is a greenfield opportunity, and the team will have a big impact on defining the processes, culture, and technologies used.

If you love designing, engineering, and running systems and infrastructure, and you have experience in cloud and a good handle on Site Reliability Engineering principles, then this is the place for you!

Successful candidates will have experience delivering and operating applications and services on public cloud platforms, will have experience running software as a service, and will be able to work well in a low-process, fast-moving environment on loosely defined problems. The ideal candidate will be able to drive technical discussions about cloud-native technologies and processes with the team and external partners, as well as contribute technically to the team.

The right individual will be motivated and will have a passion for automation, deployment processes, and enabling innovation.


Responsibilities

•Lead Cloud SRE Practice

•Define and meet scalability, availability, security, and performance goals

•Collaborate with the engineering team on projects as the expert on reliability, performance, and efficiency

•Build resilient cloud services that use telemetry and metrics to drive operational excellence

•Develop highly repeatable processes and have a mindset to automate everything from deploying software to mitigating system failures

•Manage auto-scaling of cloud infrastructure and instances to right-size workload and performance

•Drive reliability improvements to the product build and testing solutions

•Take part in a 24x7 on-call rotation

•Write "blameless" root cause analyses of issues in a fast-paced environment

•Document current and future configuration processes and policies

•Build systems and tooling experience, including testing and analysis

•Runbook / workflow / event-based automation


Qualifications

•Experience in Site Reliability Engineering discipline

•Strong ability to inspire engineering and support teams to up their reliability game

•Customer-focused understanding of the impact of architecture and processes on application SLIs

•Exceptionally forward thinking and innovative - stay on top of the latest trends, technologies, etc.

•Experience working in a fast-paced Agile/Scrum environment

•Experience with Azure DevOps and Infrastructure as Code (IaC)

•Advanced understanding of container orchestration (Kubernetes).

•Enterprise-level experience deploying, provisioning, and using Docker, Azure App Services, AKS, and Azure Storage

•Understanding of how to build modern cloud-native applications; able to discuss why particular technologies were or were not chosen.

•Experience troubleshooting enterprise scale platforms; building such platforms a plus

•Strong hands-on experience with PowerShell (Modules, DSC, etc.)

•Experience with monitoring and logging tools and platforms in Cloud

•Experience with Continuous Integration (CI) and Continuous Delivery (CD)

•Understanding of automation and its role in successfully scaling platforms and processes.

•Experience in solving complex problems with technology (automation, ML, etc.)

•Ability to work in a fast-paced, evolving, growing, and agile environment

•Strong level of curiosity and interest to learn

•Leadership skills, including coaching, teambuilding, and conflict resolution.

•Excellent analytical and problem-solving skills.

•Advanced multi-tasking and prioritization skills.

Basic Requirements

•Bachelor's degree (or equivalent) in Computer Science, Computer Engineering, Software Engineering, or a relevant engineering discipline, or 5+ years of DevOps experience.

•5+ years of hands-on cloud experience:

    •delivering or operating software on top of public or private cloud platforms

    •working on production-level software as a service

    •delivering upgrades to operational software as a service platforms

    •using or developing advanced CI/CD pipelines that quickly deliver code to production

•5+ years of experience as a technical leader within a team or as a people manager.

•Good understanding of the Site Reliability Engineering discipline.

•Experience with monitoring in a public cloud environment.

•Public cloud certification (Azure) or Kubernetes certification (CKA/CKAD) desirable.

•Experience in 24x7 operations with on-call responsibilities desirable.


This job and many more are available through The Judge Group. Find us on the web at
