Location: Plano, TX
Description: Our client is currently seeking a Site Reliability Engineering Lead
(Deep experience in Cloud with some experience in SRE)
The client is seeking a passionate and talented Site Reliability Engineering Director (Cloud SRE lead) with extensive hands-on cloud experience to join our newly formed CTO SRE practice. This team of SREs will own the tooling and processes that enable a successful cloud-based product, including alerting, monitoring, dashboards, test automation, upgrade plans, and runbooks. This is a greenfield opportunity, and the team will have a big impact on defining the processes, culture, and technologies used.
If you love designing, engineering, and running systems and infrastructure, and you have experience in cloud and a good handle on Site Reliability Engineering principles, then this is the place for you!
Successful candidates will have experience delivering and operating applications and services on public cloud platforms, will have experience running software as a service, and will be able to work well in a low-process, fast-moving environment on a loosely defined problem set. The ideal candidate will be able to drive technical discussions about cloud-native technologies and processes with the team and external partners, as well as contribute technically to the team.
The right individual will be motivated and will have a passion for automation, deployment processes, and enabling innovation.
•Lead Cloud SRE Practice
•Define and meet scalability, availability, security, and performance goals
•Collaborate with the engineering team on projects as the expert on reliability, performance, and efficiency
•Build resilient cloud services that use telemetry and metrics to drive operational excellence
•Develop highly repeatable processes, with a mindset of automating everything from deploying software to mitigating system failures
•Manage auto-scaling of cloud infrastructure and instances to right-size workload and performance
•Drive reliability improvements to the product build and testing solutions
•Take part in a 24x7 on-call rotation
•Produce "blameless" root cause analyses of issues in a fast-paced environment
•Document current and future configuration processes and policies
•Experience with build systems and tooling, including testing and analysis
•Runbook / workflow / event-based automation
•Experience in the Site Reliability Engineering discipline
•Strong ability to inspire engineering and support teams to up their reliability game
•Customer-focused understanding of the impact of architecture and processes on application SLIs
•Exceptionally forward-thinking and innovative; stays on top of the latest trends and technologies
•Experience working in a fast-paced Agile/Scrum environment
•Experience with Azure DevOps and Infrastructure as Code (IaC)
•Advanced understanding of container orchestration (Kubernetes).
•Enterprise-level experience deploying, provisioning, and using Docker, Azure App Services, AKS, and Azure Storage
•Understanding of how to build modern cloud-native applications, including the ability to discuss why particular technologies were or were not chosen.
•Experience troubleshooting enterprise scale platforms; building such platforms a plus
•Strong hands-on experience with PowerShell (Modules, DSC, etc.)
•Experience with monitoring and logging tools and platforms in Cloud
•Experience with Continuous Integration (CI) and Continuous Delivery (CD)
•Understanding of automation and its role in successfully scaling platforms and processes.
•Experience in solving complex problems with technology (automation, ML, etc.)
•Ability to work in a fast-paced, evolving, growing, and agile environment
•Strong level of curiosity and interest to learn
•Leadership skills, including coaching, teambuilding, and conflict resolution.
•Excellent analytical and problem-solving skills.
•Advanced multi-tasking and prioritization skills.
•Bachelor's degree (or equivalent) in Computer Science, Computer Engineering, Software Engineering, or a relevant engineering discipline, or 5+ years of DevOps experience.
•5+ years of hands-on cloud experience, including:
•delivering or operating software on top of public or private cloud platforms
•working on production-level software as a service
•delivering upgrades to operational software-as-a-service platforms
•using or developing advanced CI/CD pipelines that quickly deliver code to production
•5+ years of experience as a technical leader within a team or as a people manager.
•Good understanding of the Site Reliability Engineering discipline.
•Experience with monitoring in a public cloud environment.
•Public cloud certification (Azure) or Kubernetes certification (CKA/CKAD) desirable.
•Experience in 24x7 operations with on-call responsibilities desirable.