Data Operations #

At this point, this is just a loose collection of thoughts on topics such as monitoring, support, incident management, and governance.


Monitoring #

My philosophy for monitoring is simple – no incident should occur a second time. Every recurring incident erodes trust, both internally and externally, and chips away at confidence in the reliability of data systems.

Effective monitoring extends far beyond surface-level alerting, which often leads to over-alerting and alert fatigue. The goal is to detect weak signals early, address root causes, and ensure that lessons from incidents drive continuous improvement. However, a key challenge lies in avoiding the trap of over-engineering for edge cases.

The real test of good monitoring is in how it guides problem-solving. When detecting or resolving issues becomes overly complex within the current codebase, I take it as a cue to step back and reassess the problem from a different angle. Simplifying the solution often leads to more robust outcomes. This is where monitoring loops back into engineering, creating a feedback cycle that enhances both system reliability and the quality of the underlying architecture, ideally reducing technical debt on the fly.

Having introduced Monte Carlo Data as a tool and established a data observability practice, my thinking on monitoring revolves largely around the Data Reliability lifecycle, which covers:

  1. Detect: issues with data freshness, volume, schema, lineage, distribution, and more, alerting appropriately without contributing to alert fatigue.
  2. Resolve: understand the impact of data quality issues, communicate, identify the root cause and fix the issue.
  3. Prevent: adapt monitoring and alerting if needed, and review overall practices to prevent similar issues in the future.
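The "Detect" step above can be sketched as a couple of threshold checks. This is a minimal illustration, not how Monte Carlo Data works internally; the SLA and volume-drop ratio are hypothetical values that would need tuning per table to avoid over-alerting.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds -- tune these per table to avoid over-alerting.
FRESHNESS_SLA = timedelta(hours=6)
VOLUME_DROP_RATIO = 0.5  # flag if volume falls below 50% of the trailing average

def detect_issues(last_loaded_at, row_count, trailing_avg_rows):
    """Return a list of detected issues for one table (the Detect step)."""
    issues = []
    age = datetime.now(timezone.utc) - last_loaded_at
    if age > FRESHNESS_SLA:
        issues.append(f"freshness: last load {age} ago exceeds SLA of {FRESHNESS_SLA}")
    if trailing_avg_rows and row_count < VOLUME_DROP_RATIO * trailing_avg_rows:
        issues.append(f"volume: {row_count} rows vs trailing average {trailing_avg_rows:.0f}")
    return issues
```

Keeping the check list per table (rather than one global rule) is what allows alerting to stay appropriate: a table that legitimately loads daily should not share a six-hour SLA with a streaming table.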

Support #

Providing effective support in a fast-evolving data environment is essential for maintaining reliability and user satisfaction. To achieve this, fostering strong collaboration between data engineers, DevOps, and IT support teams—along with comprehensive documentation—is crucial. These foundational elements ensure smoother operations and quicker issue resolution.

By applying frameworks like ITIL (Information Technology Infrastructure Library) to data engineering platform support, we can participate in and benefit from standard IT processes. ITIL’s emphasis on identifying recurring bottlenecks and addressing root causes, rather than symptoms, strengthens long-term platform stability and performance.
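The ITIL idea of separating incidents (symptoms) from problems (root causes) can be sketched in a few lines. This is an illustrative data model, not any particular ITSM tool's API; the names and threshold are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Incident:
    id: str
    summary: str
    problem_id: Optional[str] = None  # linked problem record, if one is identified

def recurring_problems(incidents, threshold=2):
    """Problems with at least `threshold` linked incidents are candidates
    for a root-cause fix rather than repeated symptom handling."""
    by_problem = defaultdict(list)
    for inc in incidents:
        if inc.problem_id:
            by_problem[inc.problem_id].append(inc.id)
    return {p: ids for p, ids in by_problem.items() if len(ids) >= threshold}
```

The value is in the linkage discipline: once every incident is tied to a problem record, recurring bottlenecks surface automatically instead of being rediscovered by memory.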

Support Optimization #

As of 2024, there’s significant interest in automating support through AI, with numerous success stories emerging. While I fully support automating repetitive tasks, I believe there’s a lack of focus on innovative approaches to uncovering bottlenecks and deeper issues within support processes.

In my view, GenAI is an exceptional tool for process mining and identifying systematic inefficiencies in support workflows. My first test using GenAI in BigQuery analyzed support patterns and revealed problems that, once addressed, reduced support cases for the Data Engineering team by over 20%. These issues ranged from incomplete documentation and convoluted request forms to unintuitive permissions setups that were all easy to fix.

For a deeper dive into the analysis, you can read more here.


Troubleshooting #

In my experience, providing good third-level support requires deep fundamental knowledge of systems and architecture, which allows for ideation around possible issues and thus often quicker identification of underlying root causes. I’ve found that this foundational expertise is far more valuable than surface-level tool skills when it comes to diagnosing complex issues that lower-tier support can’t resolve. Effective troubleshooting doesn’t always require advanced monitoring - just sufficient visibility and systems designed to throw meaningful, actionable errors.
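What "meaningful, actionable errors" means in practice: an error should state what failed, where, and what to check next. A minimal sketch, with hypothetical names throughout:

```python
class DataContractError(Exception):
    """Raised when loaded data violates its expected contract."""

def load_partition(table, partition, expected_schema, actual_schema):
    """Illustrative load step that fails loudly and actionably on schema drift."""
    if actual_schema != expected_schema:
        missing = sorted(set(expected_schema) - set(actual_schema))
        raise DataContractError(
            f"Schema mismatch loading {table}${partition}: "
            f"missing columns {missing}. "
            f"Check the upstream export job before retrying."
        )
    # ... actual load would happen here
```

An error like this lets third-level support start at the root cause (the upstream export) instead of reverse-engineering a generic stack trace, which is precisely the "sufficient visibility" the paragraph above argues for.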

For me, Root Cause Analysis (RCA) is essential, not just for fixing immediate problems but for driving long-term stability by preventing repeat issues. I believe that sharing RCA insights across teams within the same discipline plays a key role in shaping better architecture for the future. From my perspective, continuous improvement comes from strong collaboration, clear communication, and a commitment to refining infrastructure based on lessons learned.

Troubleshooting, even though it often arrives at inconvenient times, is a valuable learning opportunity.


Copyright (c) 2025 Nico Hein