It’s been a few years since my last visit to my favorite pub in Basel, but the memories of my past discussions about Operational Data Stores and the brewing analogy still linger. I’ve decided to revisit those concepts and remaster my series of articles, incorporating my newfound experiences and insights. Join me as I take a fresh look at the world of ODS and compare it to the art of brewing, exploring the similarities and differences between Ale and Lager and their respective uses in the world of data management. Get ready for a refreshing and educational journey into the realm of Operational Data Stores and brewing!
About the ODS and beer brewing
When discussing Operational Data Stores (ODS), it’s important to note that today’s focus is on real-time or near-real-time ODS. This is what defines it as operational. In the past, the term was misused to refer to exaggerated staging areas in data warehouses used for early access to D+1 data in business intelligence patterns. However, these concepts were proven incorrect due to issues such as:
- Using ETL in EDW as a source or driving structure that couldn’t keep up with the demand for up-to-date information.
- Ad-hoc reporting requirements that negatively impacted data and design quality in the ODS.
- The emergence of data lakes as a better solution for operational reporting.
The need for high-quality, real-time, or near real-time data led to modern, slim but well-monitored ODS development. Therefore, when we refer to ODS, we mean the real-time one, similar to how we refer to beer, not lemonade. The term ODS has various definitions, but most are misleading as it plays a different role in the EDW concept than in modern, real-time ODS.
The yeast strains and fermentation conditions are responsible for the differences in flavor, aroma, and other characteristics between ales and lagers.
Data integration and beer brewing have more in common than one may initially think. Both processes involve combining different elements to create a final product that is greater than the sum of its parts. Just as yeast strains and brewing conditions play a crucial role in the outcome of a beer, data integration and storage are crucial for the success of a business.
The top logical layer of data
The top logical layer in data refers to the way the data is presented and used in a business context. It represents a higher-level view of the data and how it supports business objectives. In a similar way, top-fermenting yeast strains play a significant role in the final flavor, aroma, and other characteristics of a beer. They are responsible for producing fruity, spicy, and sometimes estery flavors that can contribute to the unique character of a beer.
The bottom logical layer of data
The bottom logical layer in data refers to the actual storage and retrieval of the data. It represents a more technical view of the data and how it is organized and managed in the underlying systems. Similarly, bottom-fermenting yeast strains are crucial for a beer’s clean, crisp, and smooth flavors. They play a key role in the production of lagers, which are widely appreciated for their balanced and refined character.
Just as the choice of yeast strains and brewing conditions are critical to the final outcome of a beer, the choice of data integration and storage solutions can have a major impact on the success of a business. Both processes require careful planning, attention to detail, and a deep understanding of the underlying elements and how they work together.
In conclusion, data integration and beer brewing share a common goal: to create a final product that is greater than the sum of its parts. By combining different elements in the right way, both processes can produce results that are genuinely outstanding and provide valuable insights and opportunities for growth.
The bottom-wise data integration
The bottom-wise data integration, such as CDC (Change Data Capture), involves tracking changes in data at a low-level, often at the database level, and transferring these changes to other systems or data stores in real-time. This type of integration is typically used for high-volume data replication and can be an effective solution for businesses that need to keep multiple systems in sync.
The top-wise data integration
On the other hand, top-wise data integration, such as API and microservices integration, involves the use of APIs and microservices to expose and access data from different systems and applications. This type of integration is typically used for more complex business scenarios and allows for greater flexibility and control over the data integration process.
The bottom-wise integrated Operational Data Store (ODS) can be referred to as the “lager-wise ODS”, while the top-wise integrated ODS can be referred to as the “ale-wise ODS”.
Lager-wise ODS, similar to lagers in brewing, is characterized by its focus on efficiency and consistency. The data in a lager-wise ODS is integrated at a low level, often at the database level, using techniques such as CDC. This approach is well suited for high-volume data replication and helps to ensure that data is consistent and up-to-date across multiple systems.
Ale-wise ODS, similar to ales in brewing, is characterized by its focus on flexibility and control. The data in an ale-wise ODS is integrated through APIs and microservices, which provide greater control over the data integration process and support more complex business scenarios. This approach is well suited for businesses that require more control over their data integration efforts and need to expose and access data from multiple systems and applications.
Both lager-wise and ale-wise ODS approaches have their strengths and weaknesses, and the choice between the two often depends on the specific needs and requirements of the business and the data integration project. By understanding these differences, businesses can choose the right approach for their needs and create an effective data integration strategy that supports their goals and drives their success.
The reverse transformation logic is a critical component in the bottom-wise data integration process, as it helps to ensure that the data is properly transformed and aligned with the canonical model in the Operational Data Store. This allows the business to work with the data in a way that is consistent, reliable, and aligned with its goals and objectives.
The following illustration shows an example of an Operational Data Store (ODS). The data feeding and receiving take place on the same logical level (usually through ESB messages), so there’s no need for extensive transformation logic during this “fermenting” process. However, there is still a need to maintain data consistency (e.g. prevent outdated information from overwriting newer data). In terms of manageability, performance, and development effort, this approach is the most efficient and aligns with the desired outcome.
The Ale approach is event-based. In Ale, source systems initiate changes based on events that occur, such as sending messages or filling topics. This allows the source systems to initiate events immediately after they are committed, thus avoiding the transmission of false changes to the ODS. The source systems can also be initially loaded and reconciled through simulation using the same infrastructure as real-time loading. This requires more development on the source systems side, but there is still no coupling as everything operates based on agreed interfaces.
The following illustration shows an example of a Bottom-Fermented (Lager-wise) Operational Data Store. Data receiving occurs on the same logical level as in the previous type (usually through ESB messages). The feeding process uses Capture Data Change (CDC) technology, allowing the ODS to react to changes in the source database (raw data) and update its structures accordingly. The raw data from the source systems are transformed into logical data in the ODS, which results in a need for more data source-related logic in the ODS, making it more expensive to manage and develop. However, there are cases where a Bottom-Fermented ODS may be the better choice, such as when the data source cannot provide information from a “logical top” and there are no enterprise-wide processes in place to catch events and update both the ODS and the source systems in parallel. Overall, I prefer the Top-Fermented (Ale-wise) ODS, but the Bottom-Fermented (Lager-wise) option can also be effective in certain situations.
We previously discussed Lagers and Ales, but there is another type of top-fermented beer called Stout. Stout has a sweet and smooth taste with a rich, creamy foam. This same concept applies to the STOUT-WISE ODS – it is fast and efficient. Information is entered into the ODS at the same time it is entered into the back-end systems, resulting in low latency. However, there are concerns about data integrity and initial loading. This method is not suitable for simulating data, and so if it is used for persistent data, side-wise initial loading must be developed and designed. Reconciliation processes also need to be considered. In conclusion, the STOUT-WISE ODS is a real-time solution that is compatible with microservices and SOA architecture. It requires additional initial loading and reconciliation costs, but these costs are still less than the event control required for ALE-WISE systems.
Consistency control adds complexity to the picture. To ensure consistency, a two-phase processing approach must be implemented. Changes are first processed as unconfirmed, and after the backend system confirms successful processing, the ODS confirms the information. In a standard READ COMMITTED isolation level, the information should then be visible to ODS consumers.
In this scenario, there is no need for a two-phase commit. The Operational Data Store only receives confirmed information from the source of records. However, the backend system needs to take responsibility for sending event information about changes to the ODS. Let’s examine how this works in a similar example as in the STOUT integration (where an event is issued externally by the frontend or another source, which is not always necessary in the ALE-WISE integration)
The following picture shows an example of such collaboration of systems:
The challenges of ODS integration
Both lager-wise and ale-wise ODS integration have their own challenges, much like how both beer types have their own bitter notes.
- There is no room for downtime or inconsistency in the ODS.
- The possibility of defining load workflows is limited.
- There’s a risk of resource exhaustion and potential slowdowns for consumers.
- There are … well, it is bitter.
Despite these challenges, it’s necessary to accept the bitterness in order to achieve a functional ODS.
Service Level Agreement in ODS Design
Managing user expectations is a crucial and challenging aspect of ODS design. Real-time systems have limited capabilities, and users may expect an immediate response, but that may not always be possible. It’s important to work with users and understand their needs, and align their expectations with the limitations and costs of the technology.
There are two aspects of Service Level Agreement (SLA) in ODS:
- SLA level A – (fermenting) – is a latency of information on the way to ODS.
- SLA level B – (drinking) – is an answer time to requests from users to ODS.
There is crucial to feeling the difference between these two. You can accept latency (level A) in the range of seconds, but you expect response time (level B) in milliseconds.
SLA can also differ during working hours and some service time at the night, however with the globalization of our services all is expected to be the requirement of 24*7 availability.
At some systems, I have met a situation where there were both low latency and low response time, but there were long windows of time when the latency exceeded hours because of some service reconciliation loads. There is important to publish information about current data quality/latency level.
The best approach is to minimize these servicing windows.
We recommend using three performance (SLA) indicators:
- C90 – Latency 90% of cases shouldn’t exceed. Excesses like End of days or reconciliations do not count.
- MAX – Maximum latency. Excesses like End of days or reconciliations do not count.
- MAX-Excesses – Maximum latency during excesses like End of days or reconciliations.
Incorrect expectations can result in inappropriate use cases
As an example, consider the use of a Lager-style Operational data store for synchronous responses to requests. This includes all back-end processing, CDC integration, and ODS transformation logic. In today’s fast-paced environment, waiting more than 10 seconds for an application’s response is unrealistic, taking into account the cumulative time of each processing step.
Lager takes time to ferment. In comparison, some Ales have lower latency that may be more suitable for real-time applications. It’s crucial to set realistic expectations and requirements for each type of data to avoid issues in the future. Starting with a lower service level is easier than reducing it later as the system grows and scaling challenges arise.
We expect the data in the Operational Data Store to be valid and complete, meaning no information has been lost during the loading process or internal transformations. This requires extra measures for consistency management, as there are many potential points of failure, and messages or data changes may be lost and not processed. Unlike bulk data loading, it is not easy to check this.
There are three main challenges:
- Initial load – Even if you trust that the ODS will stay complete and consistent with event-based loading, it still needs to be loaded initially. This is easier with Lager-based systems that have a source table but more difficult with Ales. There are different ways to load it initially, such as using a different source or generating artificial events to fill the structures.
- Data – During the design phase, it is important to ensure that all necessary data sources are covered and loaded. This is particularly important for sparsely filled entities and attributes, which may be omitted and not easily tested.
- Events – In later phases, it is important to ensure that no events are lost in the process.
An initial load can be performed by simulating surrogate events for all source records. Although this may be resource-intensive, it uses the same infrastructure as the standard load and may not always be a bad idea.
The consistency of the ODS information is a crucial aspect that determines the success of the solution. No one wants outdated or incorrect information, just like stale beer. Inconsistency can lead to two types of problems:
- Outdated data – this is usually due to the latency of the system. As ODS are integrated asynchronously, the information is always “slightly outdated”, but we can minimize the impact on the business by avoiding publishing outdated information. In worst-case scenarios, we may miss some changes in the source data.
- Incorrect data – this is known as eventual inconsistency. This occurs when part of the information is updated and another part is not, causing temporarily incorrect information if the changes are interdependent. The ODS loads should follow the transactional logic of the source system as much as possible. Another scenario is the read uncommitted situation, where the ODS reacts to an event but does not account for the rollback of the change in the source.
Inconsistencies are inevitable in an ODS system, but the extent and impact of these inconsistencies can vary. It’s important to weigh the cost of the solution against the desired outcomes, taking into account expectation management. As an integration task, ODS loading must deal with challenges such as distributed systems, unordered messages, and distributed transactions, making it necessary to implement additional consistency mechanisms, such as Saga and Dirty Load (last to win), etc.
On the elementary level, different levels of data consistency control exist.
- Transactions – These are the most basic and reliable way to maintain consistency in processes. They bundle steps into a group that is either completed fully or not at all, without any half-done or uncertain states.
- Distributed transactions – These extend transactions to cover collaboration across multiple local transaction systems in distributed systems.
- Extra-transaction consistency – This involves two-step confirmation, where changes to data are prepared but actual actions occur only after final confirmation. It is typically handled by the outer logic of the solution framework but can utilize database engine-embedded mechanisms like Flashback.
Can you rely solely on event-based real-time loads to ensure consistency in your ODS? It’s important to design the system to be trustworthy, but there are several factors that could disrupt its consistency:
- Initial state – will be discussed in the chapter on initial load, when the system is rolled out and there are no events signaling the initial changes.
- Downtime – systems aren’t 100% available and the ODS is made up of several systems, including middleware infrastructure, that can experience downtime.
- Modifications to service data – errors in source data may be repaired in a non-standard manner, causing events to go unregistered.
- Transformation bugs – in an agile world, there are no perfect components and our ODS and its loads are constantly evolving. After new functionality is rolled out, glitches may occur that need to be detected and solved.
- Untested scenarios – the complexity of transformations and event processing creates a system that can never be fully tested. It’s impossible to simulate all possible sequences and causality to guarantee that something new won’t happen, making reconciliation a monitoring component as well.
The tasks of reconciliation are:
- Quality control – Reconciliation helps ensure the quality of data by detecting and alerting operators to inconsistencies in the complex event-based transformation process.
- Data repair – It is also responsible for fixing inconsistent data.
- Standard process – In some cases, real-time transformations may not be triggered by events, leading to temporary inconsistencies that can be resolved through reconciliation. It may also be used as a workaround for fixing bugs.
There are the following metrics and factors we have to consider designing the ODS from the point of view of performance.
- Latency refers to the delay between a change being made to a record in the source system and it becomes accessible to the target user. (SLA level A.)
- Response time is the elapsed time between a query being made to the ODS output service and receiving a response. (SLA level B.)
- The end-to-end cycle measures the time from a change being made by a source system used to its reflection in the ODS output service, reflecting the potential negative impact on the user experience if the service is not yet aware of the change.
- Throughput, on the other hand, evaluates the system’s ability to handle high volumes of data transfers during peak times or fluctuations and assesses the risk of reaching the system’s integration limits.
- Independence of Processes refers to the impact that one process has on the performance of another, where an increase in resources allocated to one stream (e.g. an influx of changed data) directly or indirectly affects the performance of other tasks. A specific instance of this is the dependence of the response time of the output service on the current INGESTION load. There could also appear the impact of different consuming processes, especially in synchronous adventures.
- Elasticity is the ability of a system to maintain optimal performance regardless of the volume of inputs. ROW-BASED change processing may require different optimization strategies compared to bulk processing, and choosing the wrong strategy (such as shared pool memory in Oracle) can result in decreased throughput and increased latency.
- Fake Updates: CDC technology responds to all changes to records, including false updates (changes to the same value) and updates to columns that are not relevant. For example, if a code list of operators is integrated from a table that also includes the date of the operator’s last login in the source system, a CDC event will be generated every time the operator logs in, even though this information is not needed.
- Impact of CDC and Supplemental Logging – Suppliers of CDC technologies often claim minimal impact on source systems, such as supplemental logging in Oracle, but this increases the demand on processor time and IO operations of redo-logs to a certain extent (a percentage). This needs to be taken into account, especially if the source system is close to reaching its resource limits.
- Increased Resource Load – Real-time transformations can result in increased operational events, such as HARD PARSE or DYNAMIC SAMPLING, that are not usually observed during integration tasks. Unlike bulk processing, processor performance, not database IO operations, becomes the bottleneck.
Shall we mix?
The choice of integration type for your ODS solution depends on its scope and size. Each type requires effort, so focusing on one can simplify the solution and reduce the time to market. A hybrid solution can optimize integration based on latency and SLA requirements and source system capabilities. Adding Ale or Stout capabilities to a complex Lager setup is not as challenging, but incorporating CDC into a Service-oriented infrastructure should only be considered as a last resort. Ales are not as bad when you get used to them.
Having middleware that can obtain data from the ODS, it may make sense to also support top-fermented brewing. This could reduce latency for critical tasks and improve the solution’s functionality. For STOUT-WISE ODS, you often have to deal with initial loads and reconciliation using raw data, so the existing bottom-fermentation infrastructure may be useful. However, if you don’t require CDC integration, don’t need to couple systems, and don’t have legacy systems with limited ESB integration capabilities, it may be best to stick with Ale or Stout.
Real-time processing is more costly than bulk loads due to the need to manage events, process impacts, and maintain consistency in near real-time. Although avoiding load process management can help, cost reduction can be achieved through “Early pruning” or “Early filtering.” These techniques involve:
- Only accepting real events from data sources used for ODS load.
- Filtering events that only change data not participating in the transformation.
- Eliminating redundant events promptly.
- Removing “no change” data manipulations.
- Reducing transformation data reads to only necessary records.
Early filtering can significantly reduce the workload during real-time processing and ODS load or reconciliation. This is especially important in cases of recursive processing, where changes to one entity may trigger further changes to the same entity. Early filtering is crucial to the success of these processes and should not be overlooked. Just as filtering is essential to brewing a great-tasting beer, it is equally important in building high-quality ODS and real-time loads.
Synchronous / Asynchronous
The synchronicity of loads is a critical aspect of real-time ETL. In a synchronous approach, every change is immediately reflected in the target, without any queuing. However, the approach is vulnerable to peaks, causing a significant increase in latency. On the other hand, the asynchronous approach involves queuing changes and processing them in bulk, at a certain interval. The bulk processing becomes more efficient as the number of changes increases and bulks grow larger. The belief that only a synchronous approach can be considered real-time and asynchronous can’t is a myth.
Dependencies and dirty approach
In the DIRTY approach forget about:
- Maintaining the exact order of changes – there is neither the time nor the resources for this.
- Aggregating signals based on source logical events – you must accept that these events are now separate from the main event, so each event can trigger a separate change even though it will eventually lead to a single transformation.
- Eventual consistency – it is difficult to keep track of source changes, so it is possible that the transformation process may start without having received all events and made a change, only to be triggered again later with the remaining change data. This is especially true for extra-transaction processes.
- So, the solution may not be entirely clean, but it is “organically dirty,” and we have to live with that.
I will describe methods of supporting consistency within an asynchronous extra-transactional world in one of the following chapters.
Reconciling the Operational data store usually requires downtime or at least a pause in the standard (real-time, event-based) load process. This is often not possible during business hours, and thus downtime may need to be scheduled outside of working hours, potentially leading to some inconsistencies in the data. This need for reconciliation may arise repeatedly, especially in global solutions. However, there is a solution to this issue: smooth reconciliation. It’s important to note that this solution requires early filtering as a prerequisite, just as filtering is essential for a good beer. In addition, reliable and detailed audit information about actual updates (not fake, no-change updates) must also be available. Then, the following process can be carried out.
The following process can be followed for smooth reconciliation:
- Mark the start time of reconciliation (T0) to keep track of changes that occur during the process.
- Load exact copies of the ODS stage tables (Mirror tables) based on current source data. This cannot be done in the standard stage as real-time loads influence it.
- Compare the Mirror tables with the stage tables. Differences can be ignored, and records that are different are noted. Based on early filtering, records that changed after T0 can also be ignored. A hypothetical operation for the difference (insert, update, delete) should be registered. The comparison should be done in a set-based manner for performance reasons.
- Repair the Stage and generate events. This step should be done row-based and cautiously to avoid slowing down the standard real-time processing. The following steps should be taken:
- Lock a record with a difference identified in the previous step.
- Compare the record again and check the updated timestamp. If it is newer than T0 or there is no longer a difference, ignore it. Otherwise, update the Stage record based on the Mirror one and generate an event.
- Unlock the record. The final comparison is necessary to ensure the record hasn’t changed after the previous set-based comparison.
Most transformations can be represented as queries that have the format of the target table as the result set. Transformations in the ODS transform data from lower levels into higher levels, for example, data from the stage (referred to as L0) are transformed into the normalized layer (referred to as L1), and data from the L1 layer are transformed into the denormalized layer (referred to as L2).
Unfortunately, not all transformations can be represented as queries with result sets in the target table format. This can occur for various reasons, including the use of target layer tables as sources, which can lead to recursive process dependencies. To avoid these issues, it’s important to employ early filtering in a reasonable manner, as this can help prevent system cycling.
One must be prepared for varying levels of data loads from source systems. These loads may mostly be small, but can also occasionally increase to large amounts. These large loads can be either scheduled, such as at the end of the business or day, or unscheduled, like when repairing data. Initially, I believed that the data load could be switched from real-time to bulk processing using a scheduler and that only row-based processing would be necessary. However, even regular processes are not always regular, so a mechanism for signals from the source system would need to be implemented. Additionally, irregular or partially regular large loads may occur without the source system’s awareness or during working hours, teaching us that our solution must be able to handle both small and large loads during standard operations. In my opinion, row-based processing is a myth, as it is not flexible enough to handle large amounts of data, and large loads are inevitable in both regular and irregular processing.
A well-designed metadata system is essential for effective system management. Metadata provides information about the structure, organization, and contents of data within a system. It acts as a roadmap for the system, providing clear and concise descriptions of what the data represents and how it is used. This information allows system administrators and developers to quickly understand the data and make informed decisions about how to manage it.
In order to make the system well manageable, it is important to have a metadata system that is separated from the actual data implementation. The metadata should be at the logical level, describing what should happen and free of all the technical details of how it should happen. The “how” should be described by patterns, which are reusable sets of code or configurations that provide a blueprint for data processing and management. This approach allows the metadata to be abstract and independent of the underlying technology, making it easier to maintain and evolve over time.
By generating code from the metadata using different patterns, developers can quickly and easily implement changes to the data processing and management system. This approach also provides a high level of flexibility and scalability, as new patterns can be developed and added as needed. Furthermore, the separation of metadata and code makes it easier to test and debug the system, as the metadata provides a clear and concise description of what the system should do, independent of the underlying code.
In conclusion, a well-designed metadata system is a critical component of any well-managed system. Providing clear and concise information about the data and its processing, allows system administrators and developers to make informed decisions and efficiently manage the system over time. Regenerate response
That way I can generate different codes from the same metadata by different patterns.
I will write more about metadata in some of the further chapters.
OtherS in the series
What do beer brewing and Operational Data Store data integration have in common?