Tag Archives: devops

Continuous Improvement: The Path to Excellence

The quest for operational excellence is unending in Cloud Engineering and Operations. We want to do more, better, faster, with fewer errors and with the same number of people. Amidst this quest, the philosophy of Continuous Improvement, a concept well-articulated by James Clear, finds a resounding echo. The essence of this philosophy lies in embracing a culture of making small, consistent improvements daily, which, over time, aggregate to substantial advancements.

The Myths Holding Us Back

Often, there’s a misconception in the operational realm that a massive overhaul of processes, done once and for all, will lead to a toil-free, highly automated environment.

We long for this mythical event where a major transformation will take place overnight, and our lives and jobs will be near-perfect and forever joyful.

However, this notion of an overnight transformation is more of a myth. It portrays a misleading picture of reality that can lead to an endless cycle of stress and disappointment if we chase it relentlessly.

Taking a goal-oriented approach that concentrates on setting up a perfect environment as the objective is likely to lead us down a path of frustration. It can mask the inherent value of incremental progress and the compound benefits it brings over time.

Another common myth is that there’s this one engineer who comes up with an amazing solution and implementation all by himself. My experience has shown that this is far from the truth. Exceptional tools come from great teams that work together, slowly building more resources on top of previous work—the well-known idea of standing on the shoulders of giants.

The Power of Small, Daily Wins

Drawing parallels from James Clear’s elucidation, the real power lies in accumulating small wins daily. It’s about identifying a manual task that can be automated, a process that can be optimized, or a workflow that can be streamlined. Each small win reduces toil, improves efficiency, and enhances system reliability. This is the process-oriented approach.

My take is to use the Pareto principle, also known as the 80/20 rule: Find the 20% of the tasks that cause 80% of your pain – or toil – and be relentless in eliminating, automating or delegating them. Keep doing it for as many iterations as you need to reach your operational workload goals.

The 1% Rule: Compounding Operational Efficiency

Adopting the spirit of the 1% rule – improving by a mere 1% every day, can have a transformative effect in the cloud operational landscape. Over time, these daily increments compound, significantly enhancing operational efficiency, system reliability, and team satisfaction. The beauty of this approach is that it’s sustainable and less overwhelming for the teams involved.

The Journey Towards Operational Excellence

Operational excellence in Cloud Environments is not a destination but a journey. A journey marked by daily efforts to eliminate toil, automate repetitive tasks, and enhance system resilience. By adhering to the philosophy of Continuous Improvement, you will position yourself on a trajectory of sustained growth and excellence.

Boost Resilience with Upstream Thinking

In the high-speed realm of Information Technology, professionals often engage in a continuous cycle of troubleshooting, colloquially known as “firefighting.” Imagine an IT team constantly dealing with server crashes or software bugs only as they occur, causing operational disruptions and mounting frustration. That’s the firefighting approach. But there’s a game-changing alternative: upstream thinking. Inspired by Dan Heath’s book, “Upstream,” this concept encourages a proactive approach to IT, prioritizing the prevention of issues over firefighting. Think of it as building resilient systems that mitigate the risk of server crashes and designing software with robust error handling and prevention strategies.

Upstream thinking can transform the reactive chaos of firefighting into a structured, proactive environment focused on sustainable solutions.

The Power of Blameless Postmortems:

Blameless postmortems are an essential part of the upstream thinking process. They encourage an open, honest dialogue about incidents, focusing on learning and improvement rather than finding fault.

Blameless postmortems promote a culture of growth and resilience by providing a safe space for teams to discuss and learn from their mistakes.

Identifying Root Causes:

Embracing upstream thinking requires identifying and addressing the root causes of problems. Many techniques and frameworks, such as the “5 Whys” method and fishbone diagrams, can help IT professionals get to the heart of issues. By using these tools, organizations can uncover and resolve the underlying causes of problems rather than only addressing the symptoms.

Building Resilient Systems and Processes:

Resilience is the cornerstone of upstream thinking, and there are multiple strategies for building systems and processes that can stand the test of time and adversity. One such method is conducting a “premortem,” a unique practice where IT teams envision a hypothetical system failure and then brainstorm potential causes. This proactive method allows teams to identify and address issues before they occur, fortifying systems against potential failures.

Beyond premortems, other crucial practices include automation, proactive maintenance, and regular system updates. These strategies reduce manual effort, enhance system performance, and prevent possible errors and failures. Automation, for instance, can help eliminate human error and free up valuable time. Proactive maintenance and regular updates ensure that systems are always in their best health, reducing the chance of unexpected failures.

By combining these approaches, you’re not just responding to issues – you’re anticipating them, thus crafting systems and processes that are far more robust, reliable, and resilient.

Cultivating a Culture of Continuous Improvement:

Creating a culture of continuous improvement within IT organizations is essential for making upstream thinking a reality. This means establishing an environment where team members are encouraged to openly share insights, experiment with new approaches, and implement changes based on what they learn from blameless postmortems. This culture values collaboration, knowledge sharing, and small successes.


Incorporating upstream thinking into IT operations can transform how your organization handles problems. Shifting from firefighting to proactive problem-solving conserves resources and reduces stress, resulting in a more reliable and resilient IT environment.

Blameless postmortems and a culture of continuous improvement empower teams to tackle issues at their root, preventing recurrence in the future. Transform your IT operations by embracing upstream thinking.

Everything Sucks – Managing IT Risks: Strategies for IT Professionals.

As someone who has worked in the IT industry for many years, I have realized that technology is far from perfect. In fact, I would go so far as to say that everything sucks when it comes to technology.

IT professionals constantly deal with a never-ending barrage of issues, from unexpected hardware failures to software bugs and infrastructure breakdowns. It is Murphy’s Law all the way.

And while we often joke about the shortcomings of operating systems like Windows, even the most reliable and robust systems like Linux are not immune to bugs and glitches. The sheer complexity of software development means that dozens of bugs are likely lurking in every thousand lines of code, making it impossible to catch them all.

It is everything

But it’s more than just problematic software. Even the best hardware can fail unexpectedly, despite companies spending large sums on the latest and greatest equipment. Mean Time Between Failures (MTBF) might offer some guidance, but it’s often a source of delusion rather than certainty.

And when it comes to infrastructure, the fragility of the Internet can be mind-boggling. For example, one broken fibre cable in Egypt caused widespread disruption to millions across Africa, the Middle East, and South Asia. Given the countless potential points of failure and the constant threat of cybercriminals, it’s a miracle that the Internet works at all.

And let’s not even go into all the problems around Border Gateway Protocol (BGP), which is a fundamental protocol that helps keep the Internet running. It is based on trust rather than security. This means that every network operator must trust the information provided by others, even if they have no direct relationship with them. What could possibly go wrong, right?

But not all is lost

Despite all these challenges, there are ways to mitigate the risks and prepare for the worst.

It’s important to perform risk analyses and prioritize resources accordingly. While protecting against every potential threat is impossible, it’s crucial to focus on the most significant risks and allocate resources accordingly.

Performing risk analysis is a critical step for any IT professional in preparing for the worst. It involves identifying potential risks and evaluating the likelihood of those risks occurring, as well as the potential impact they could have. By conducting a risk analysis, IT professionals can better understand where their systems and infrastructure are vulnerable and prioritize resources accordingly.

Risk Matrix

One common risk analysis method uses a risk matrix, which assigns likelihood and impact scores to various risks to determine their overall risk level. Once the risks have been identified and evaluated, IT professionals can develop strategies to mitigate them and prepare for the worst.


For example, if a company relies heavily on a particular system, it might identify the failure of that system as a significant risk. They could then develop a backup plan, such as having redundant systems or backup servers, to minimize the impact of a potential failure.

It is a continuous process

It’s important to note that risk analysis is an ongoing process. Risks can change over time, and new ones can emerge, so it’s essential to regularly review and update risk analyses to ensure that IT professionals are always prepared for the worst.

IT professionals must acknowledge technology’s flaws and take action to prepare for potential risks. By performing risk analyses and prioritizing resources, we can develop effective strategies to minimize the impact of unexpected challenges and ensure critical systems remain operational. Let’s make risk analysis and mitigation strategies a priority in our work and ensure technology works for us.

Shift Left

Note: This article was originally written for my blog in Portuguese back in 2021.

Shift Left is a practice in software development where the aim is to find defects as early in the process as possible. A study from NIST shows that the cost of finding and fixing defects increases exponentially the farther it is found in the development cycle. Therefore, the ideal scenario is to find defects as early as possible, ideally in the design phase.

As a DevOps professional, a large part of my work has been focused on code quality in our business unit, which primarily develops code for embedded systems, FPGAs, industrial automation, and industrial robot controllers. Given the nature of our products, the cost of developing high-quality code is immense, and each release cycle is exceptionally long.

To solve these problems, our team has been working to implement a Shift Left approach by following these steps:

Writing Unit Tests Concurrently with Code

Writing unit tests concurrently with the code is the most critical part of Shift Left. Developers should not wait for the next phase, testing, to see if there are any obvious bugs in the code. Instead, most testing should be done in the implementation phase through unit tests that must be run constantly. This also helps ease the test team’s workload and lets them focus on more important things than testing the basics.

Code Review

The next step is code review. Developers should create a branch, write the code (including tests), and send it for review instead of merging it directly into the trunk. Code review allows for the early detection of bugs and can prevent these issues from propagating to later stages of development, saving time and resources. Code review allows developers to learn from one another and share best practices, resulting in better code quality and more effective teamwork, again addressing them on the left. See Google’s best practices for code review.


Human beings don’t like to have their work criticized, so pre-commit hooks can reduce criticism’s human element. Using pre-commit, a bunch of tests can be pre-programmed to run in the code about to be committed. Linters, tools that check for leaking secrets, styling tools, and others can be used. Pre-commit does not allow code to be pushed if it does not pass QA on the developer’s machine. It ensures that the basics are covered before the code goes into review.

Text Editor and Plugins

The text editor/IDE that the developer uses is as far left as possible. Developers can use various plugins and tools to improve the code. For instance, Microsoft Visual Studio Code is an excellent editor with many useful plugins, including language servers, linters, and AI-based plugins like Co-pilot, Sourcery and TabNine. Sonarlint, a Sonarqube plugin, can analyze code and display issues as soon as the user saves it.

By following the steps outlined above, software development teams can significantly reduce the number of defects that make it to the testing phase. While no single solution guarantees perfect code, combining the tools and techniques discussed and a strong focus on code quality can help minimize the total project delivery cost and time. For further information on how these steps can help reduce defects and improve the overall quality of code, I strongly recommend reading Steve McConnell’s book Code Complete. Take action today to improve your software development process and achieve better outcomes for your team and business.