Nefeli Networks is an exciting early-stage startup in the NFV space. This is an opportunity to get in on the ground floor at a well-funded company, working on new technology with a great team of developers. We are based in Berkeley, CA, with an additional office in Sunnyvale, CA.
We are looking for a highly motivated DevOps/Site Reliability Engineer to join our exceptional team. The candidate we are looking for is ready to design, automate and support our cloud infrastructure and back-end systems, and to handle technical integration with our partners. The ideal candidate has some experience operating and supporting networking solutions and is familiar with automation tools and processes.
As a DevOps/SRE at Nefeli, you will play a critical role in helping us shape our software stack and hardware infrastructure. Your knowledge of design, analytics, development, coding, testing and application programming will strengthen our development team's ability to satisfy customer business and functional requirements. You will also be instrumental in deploying systems at customer sites.
Responsibilities:
- Improve the whole product lifecycle through inception, design, deployment, operation and refinement
- Design, build and operate cloud infrastructure to enable reliable and rapid deployment of microservices with effective monitoring and resilient operations
- Work with development teams to make sure applications are production ready, scalable and reliable from the ground up
- Identify and drive opportunities to improve automation for code deployment, management and visibility of application services
- Develop tools and frameworks to automate operational tasks and the deployment of machines, services and applications
- Write automation code for provisioning and operating infrastructure at massive scale
- Establish end-to-end monitoring and alerting on all critical components of the applications, including availability, latency and overall system health
- Participate in the on-call rotation supporting the platform and/or the production application
- Direct root cause and corrective action analysis of critical business and production issues
- Develop standard methodology for infrastructure orchestration and for troubleshooting application services in production
- Represent DevOps/SRE in design reviews and work with Engineering teams on operational readiness
Requirements:
- 5+ years of related experience
- Experience with modern logging/reporting tools such as Prometheus
- Experience with networking (e.g., TCP/IP, routing, network topologies & hardware, SDN, NFV)
- Experience with implementing monitoring tools such as Grafana, collectd and Zabbix
- Experience with etcd, NoSQL and time-series databases