Senior DevOps SRE
As a Senior DevOps Site Reliability Engineer, you:
Have an automation-first mindset
Are passionate about performance, stability, and security
Believe in a proactive approach of prevention over mitigation, and mitigation over fixing
Are comfortable with change
Have had a positive experience working for a startup before
Are a U.S. Citizen with an active clearance or willing and able to undergo the clearance process, including polygraph
Required Skills - advanced knowledge of:
AWS: 3+ years of hands-on experience (Architect / DevOps / SysOps AWS Certification preferred)
Infrastructure as Code (Terraform)
Ansible automation
Kubernetes: 2+ years of in-depth experience deploying production applications / container orchestration
K8s scheduling, networking, security, load balancing
CI/CD (GitLab, Jenkins or Bamboo)
Python, Perl, or Golang
Best practices and IT operations for an always-up, always-available, mission-critical service
Desired Experience:
Implementing observability and monitoring in AWS, using Splunk / ELK / similar
EKS, ECS, ECR
Working in an agile environment, focused on rapid cycles and CD
Supporting, analyzing, and troubleshooting large-scale distributed mission-critical systems
Building software and/or platforms where security, regulatory compliance and high availability are critical
Strong understanding of Information Security in various environments
Responsibilities
Implement and support FedRAMP and other applicable USG standards, policies, and regulations
Set up, integrate, and maintain a scalable, stable set of CI/CD tools to support development, testing, and security scanning
Be accountable for a large-scale SaaS application with a mission-critical customer base
Manage multiple tools, infrastructure, and roles in a fast-paced environment
Own the availability of our SaaS infrastructure and application
Implement best-in-class AWS solutions using infrastructure as code
Collaborate with engineering and product to continuously improve service availability and quality
Be involved in the entire production lifecycle: code deployments, infrastructure management, and troubleshooting
Share ownership with the Dev team; own service availability and proactive issue prevention, using structured troubleshooting to mitigate issues
Work closely with our Dev and DevOps teams to ensure that our production services are secure, scalable, performant, and resilient