After I was offered a management position, we were able to further drive,
automate, own and consolidate many other Agile/DevOps/CI/CD areas such as
Artifactory, WANdisco, Gerrit, Jenkins, Zuul, ArgoCD, code review, ... Most of
these areas became my responsibility over time, and I organically became the
SRE for many crucial applications. I now had the opportunity to apply the same
high standards for infrastructure automation and SOP/SLA/processes/support
to all other services, growing my team to 14+ members and counting.
The team was able to support, maintain, upgrade, patch, release, manage and
configure all applications with a small headcount relative to the scale at
which we operated, thanks to heavy investment in automation, support,
documentation, team skills, structure and processes.
Solving complex problems, identifying (infrastructure) bottlenecks,
troubleshooting at-scale networking and global deployments, designing
deployments and architecture, roadmapping, ... were constants during my time
at Western Digital.
On every occasion, I used these experiences to mentor, guide and work with
all team members, turning each one into a learning opportunity that allowed
them to grow in the difficult areas of troubleshooting, problem
identification, broadening their skill set, ... This is something I have been
doing for 20+ years, and it is not an easy trade to learn.
- Built out a global team to manage Agile tools
- Defined how small teams can operate at high velocity by focusing on
  Ansible automation and its development process
- Owned the CI/CD environment and infrastructure of essential engineering
  tools required for development/QA/release in the company
- Strong focus on hiring and people management; defined and led 3 verticals
  within the team
- By owning and managing many teams and services, became SRE (Site
  Reliability Engineer) for tools on which 15000+ engineers rely 24/7
- Designed and implemented a globally scaled distributed monitoring system
  using fully integrated Icinga2, Prometheus, PagerDuty and Teams
  notifications. Defined predictive failure detection: knowing what is about
  to fail is critical.
- Technologies used: Kubernetes, Ansible, Artifactory, Jenkins, WANdisco,
  Gerrit, SVN, Zuul, code review, Jira, ...