Operational Data Store: A Perfect Blend of Data and Brewing Science – Chapter 4

In this fourth part of our series, we focus on the Lager-wise Operational Data Store (ODS). This approach to data integration prioritizes efficiency and consistency, using techniques such as CDC to integrate data at the database level. We examine the strengths and weaknesses of this approach and explore how it can support businesses’ data integration needs. Join us as we delve into the world of Lager-wise ODS.

LAGER-WISE ODS – CHALLENGES

Lager-wise ODS (Operational Data Store) is a low-level data integration approach that focuses on efficiency and consistency. It integrates data at the database level, often using techniques like Change Data Capture (CDC), to ensure that data is up to date and consistent across multiple systems. This approach is best suited for high-volume data replication and for keeping data accurate and current.

Lager-wise ODS

Lager-wise Operational Data Stores (ODS) can be complicated and not as real-time as expected, causing some confusion among those who prefer Ale-wise ODS. However, despite its challenges, Lager-wise ODS is gaining popularity in the Business Intelligence world, especially because it adopts an agile approach, reducing the burden of extensive integration tasks on legacy and source layer application owners. The use of a Metadata-driven approach and ETL patterns makes Lager-wise ODS a feasible and efficient option, even though it may seem “silly” to some.

You can read more about the ODS integration typology in the first chapter of the series.

Pros and cons of Lager-wise ODS vs. Ale-wise ODS

Pros:

  • Reduced definition effort required from the source system
  • Low latency thanks to real-time integration at the database level
  • Low risk of data loss
  • Consistent ETL definition for initial and reconciliation loads
  • Consistent data across multiple systems
  • High-volume data replication capabilities

Cons:

  • Need to define reactions to multiple data events on the source side
  • Rigid integration: the integration is done at a low level, which makes it harder to change or update in the future.
  • Limited control: the focus on efficiency and consistency can result in limited control over the data integration process.
  • May not support complex scenarios: the approach may not be suitable for businesses with complex data integration needs, as it may not provide enough control and flexibility.

Challenge 1 – Target vs. Source perspectives

The standard ETL process involves transforming data from multiple source tables into a single target table, following the rule of one mapping per target table. In contrast, real-time approaches focus on data events, where a change in a single source table can result in changes in multiple target tables. This shift in logic is a defining characteristic of real-time approaches and presents unique challenges.

The logic is based either on the side of one target table or on the side of one source table

Target-based Perspective in Transformation Logic

The transformation logic is defined from the requirements of the target structure, and the source data are mapped accordingly. A source-based perspective, which inverts this logic, is not feasible once the complexity exceeds a certain, mostly linear, level. For standard ETL approaches, both initial and incremental loads must be defined from the target-based perspective, even at the physical level.

Combined Perspective in Transformation Logic

The use of a combination of Target-wise and Source-wise perspectives aims to achieve consistency and manageability in real-time ETL transformations. This approach reduces redundancy between the two perspectives and helps maintain the integrity of the solution, as changes are made in one place starting with the logical design (Target-wise perspective). The goal is to have a consistent and manageable solution, especially in reconciliation and real-time loads.

The task can be addressed in two ways:

  1. Adaptive solution: The solution is designed to be as flexible as possible by using reusable parts of its physical design.
  2. Metadata-driven solution: The solution leverages the ability to split a single metadata definition into two different perspectives during the physical implementation process. Although there may still be some redundancy in the physical design, there is no risk of inconsistency because all branches are generated from the same metadata. This approach requires a fully metadata-driven solution, making it more time-consuming to implement, especially without a well-established metadata-driven ETL tool.
Two ways to address the lager-wise integration
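
To make the second option more tangible, here is a minimal, purely hypothetical sketch of shared mapping metadata; both the target-wise loads and the per-source real-time branches would be generated from the same rows, which is what removes the risk of inconsistency:

-- Hypothetical metadata table describing one logical mapping exactly once.
-- Generators read it twice: once to build the target-wise
-- (initial/reconciliation) load, once to build the source-wise real-time branches.
CREATE TABLE map_metadata (
    mapping_name    VARCHAR2(100),  -- e.g. 'M_TGT_SALES'
    target_table    VARCHAR2(100),  -- e.g. 'TGT_SALES'
    source_alias    VARCHAR2(30),   -- e.g. 'S1', 'S2'
    source_table    VARCHAR2(100),  -- e.g. 'L0_ORDER'
    join_condition  VARCHAR2(4000), -- how this source joins into the mapping
    column_mapping  VARCHAR2(4000)  -- target column <- source expression pairs
);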

Challenge 2 – Target vs. Source change identification

Target IDs and Source IDs are two methods for efficiently propagating source changes through each transformation. The Target ID approach transforms the changed source data into a target identifier for every affected source table. This can be complicated, but it is possible. The Source ID approach, on the other hand, requires storing the source table identifier in the target table. This adds a requirement on the target structure, but it allows a more direct reaction to changes in every source table. Using Source IDs limits the possible complexity of the transformation, but reacting to changes becomes easier. For example, Source IDs are used by Oracle Materialized Views for the fast refresh method.

  • Target IDs:
    • Requires transformation of source data into target identifier
    • May require building redundant transformation structures
    • Efficient in cases with minimal changes to LOV (List of values)
    • Can be handled at the execution level, as such changes occur rarely
Target IDs
  • Source IDs:
    • Requires storage of source table identifier in the target table
    • Limits scope of possible transformation complexity
    • Used in Oracle Materialized views for fast refresh methods
    • Avoids additional look-ups and makes handling LOV value changes easier
Source IDs

Challenge 3 – CRUD Matrix revival

The early theory of data modeling included the CRUD Matrix, which showed which processes impacted which tables and how. Although this may seem trivial compared with detailed mapping metadata, real-time ETL needs to understand how changes to specific source objects (at the table or column level) affect the target objects of a particular mapping. The CRUD matrix covered Create (Insert), Read, Update, and Delete operations. For real-time ETL it is crucial to map the active operations, so we use the IUD matrix (Insert, Update, Delete) instead.

Example

The example above shows the standard flows of data mapped from source tables to target tables. To process them more efficiently, we have to classify every source involved in a mapping by the IUD categories:

IUD Matrix for our example

The IUD matrix (formerly CRUD) is crucial for the Real-time ETL approach. We use the matrix to map how changes in source objects (tables and columns) impact the target object in a particular mapping. The process of seeking a target ID involves extracting part of the final transformation and determining an ID transformation for each source object. The more source tables involved, the more complex the entire process becomes. For instance, a mapping with three source tables is considered a simple case, but it could involve multiple look-ups in reality.

Target IDs approach in the implementation example

Now we reconsider all the processes using the IUD matrix. Look-ups usually make up the largest share of source tables in large transformations, and they have only U (Update) in the matrix. To prune the processing, we therefore switch all target ID look-ups that have only U in the matrix to the source ID approach. This requires the mappings and target tables to be adapted to work with Source IDs as well. The following picture shows the first wave of reduction; compare it with the equivalent above:

Reduced Target IDs approach

Note: In our recent solutions we usually skip the classification of these types of changes. The metadata-driven approach is powerful enough to outweigh the advantages of a detailed classification of the IUD impact. The effort necessary to classify the IUD impact would be overhead in most situations, so it has stayed in theory and can be used in special cases.

Challenge 4 – Orchestration and the “Dirty” approach

Real-time transformation orchestration is a crucial aspect of data management and processing. There are two main approaches to real-time transformation orchestration, the Fine approach and the Dirty approach.

The Fine approach seeks to establish a strict order of transformations, ensuring that all updates, inserts, and deletes are performed in a predetermined order to maintain consistency in the target data. This approach is known for its strict adherence to order and high levels of accuracy. However, it also has several significant drawbacks.

One of the main drawbacks of the Fine approach is its complexity. To maintain the strict order of transformations, a significant amount of orchestration is required, which can be difficult to manage efficiently. This can lead to deadlocks, which are situations where two or more processes are waiting for each other to finish, causing the entire system to become blocked.

Another drawback of the Fine approach is the workload it creates. The orchestration process is demanding, requiring significant computational resources and adding to the overall workload of the system. This can result in reduced performance, especially in large-scale real-time transformations.

The Dirty approach, on the other hand, offers a much simpler solution to real-time transformation orchestration. This approach operates on the principle of always using the most recent data available, regardless of the order in which transformations take place. The Dirty approach is much less complex to manage than the Fine approach, and it is much less prone to deadlocks and performance issues.

One of the main advantages of the Dirty approach is its ability to handle changes to the source data in real time. The Dirty approach always uses the most recent data available, which means that it can handle changes to the source data without any delays. This makes the Dirty approach ideal for real-time transformations that require fast and accurate data processing.

In conclusion, the Dirty approach offers a much simpler and more efficient solution to real-time transformation orchestration compared to the Fine approach. The Dirty approach is less prone to deadlocks and performance issues, and it offers faster and more accurate data processing.

Orchestration challenge of mutual operations

Example:
Let’s have two operations on the same target record, where the record is created by the first of them. If the order gets mixed up, the Fine approach fails, because you either try to update a record that does not exist yet or insert one that already exists. In the Dirty approach, the final Merge operation acts on the current state of the target table, so it never fails. It also always re-reads the current information from the L0 (source-wise) layer, so the last operation always works with the most recent data.
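
A minimal sketch of the Dirty principle, assuming hypothetical tables tgt_customer and l0_customer (names and columns are illustrative only): whichever of the two events triggers the run, the merge reads the current state of the L0 mirror and brings the target record to that state, so the order cannot break it.

-- Dirty approach: react to a change event by merging the *current* L0 state,
-- regardless of the order in which the events arrived.
MERGE INTO tgt_customer t
USING (
    SELECT c.customer_id,
           c.customer_name,
           c.status
      FROM l0_customer c                 -- current state of the source mirror
     WHERE c.customer_id = :changed_id   -- key taken from the change event
) s
ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET t.customer_name = s.customer_name,
               t.status        = s.status
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, status)
    VALUES (s.customer_id, s.customer_name, s.status);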

Pipes – queues of the Source or Target IDs of changes

The asynchronous approach for real-time data transformations has been widely recommended. This involves transferring information about changes in the source data to the ETL processes in any form, such as logs, queues, or pipes. The process of detecting changes in the source and refreshing a mirror of all sources in the L0 layer fills a pipe with information about what has changed. This information is then passed on to the ETL process that transforms the data from the L0 layer to the L1 layer.

In this approach, it is assumed that a complete mirror of the source (L0 layer) is maintained and all lookups are performed against this mirror.

Key terms:

  • CDC (Change Data Capture): A method to capture changes in source data, based on the REDO log information on the source side.
  • L0 layer: A complete mirror of the source replicated objects, maintained using CDC technology.
  • L1 layer: The target integrated data after transformation.
  • L0 to L1 ETL: The ETL processes that transform data from the L0 layer into the L1 layer form.
  • Pipe: Information about the keys of changed records.
The pipe. We could call it a queue too, but from now on let’s call the queues that transfer identifiers “pipes”.

The next step is to decide what information the pipe should contain, how to manage the consumers of each pipe, etc. There are two main classifications of pipes:

  1. Type of key contained: The simplest way to fill the pipe is to use information directly from the CDC, in which case the source primary key of the record is used. Another possibility is to use the target record identifier, which often requires look-ups during the L0 load phase.
  • PIS pipe: This type of pipe contains source identifiers and can have multiple consumers depending on the number of ETL processes using the same source.
  • PIT pipe: This type of pipe contains target identifiers and is associated with a single ETL process, but can be filled from several sources based on the number of sources used in a particular ETL mapping.
  2. Type of consumer management: The records in the pipe must be posted, consumed, and then deleted. Honestly, that is challenging in real time; we will return to this topic later.
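
To illustrate the first classification, here is a hedged sketch of how the two pipe variants could be materialized as tables (all names are hypothetical): a PIS pipe carries the source primary key plus one processed flag per consumer, while a PIT pipe carries the target identifier for its single consuming mapping.

-- Hypothetical PIS pipe: one row per changed source record.
-- The multi-consumer model is handled here with one flag per consuming mapping.
CREATE TABLE pis_customer (
    src_customer_id  NUMBER    NOT NULL,   -- source primary key taken from CDC
    change_ts        TIMESTAMP NOT NULL,
    processed_map_a  CHAR(1) DEFAULT 'N',  -- consumer flag for mapping A
    processed_map_b  CHAR(1) DEFAULT 'N'   -- consumer flag for mapping B
);

-- Hypothetical PIT pipe: one row per affected target record,
-- single consumer (the one L1 mapping that owns the target).
CREATE TABLE pit_party (
    tgt_party_id     NUMBER    NOT NULL,   -- target identifier
    change_ts        TIMESTAMP NOT NULL,
    processed        CHAR(1) DEFAULT 'N'
);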

Source-wise Pipes (PIS)

PIS pipes are simpler to fill but more complicated to use. They require a multi-consumer model, which can be addressed through pipe multiplicators or a system of flags, one for each consumer. As a result, the mapping would need to have several clones or parametric variations, one for each incoming pipe. To better understand, let’s return to the example used in the chapters above.

The following picture shows the distribution of PIS pipes to tasks of ETL mappings:

PIS pipes and consumers

The following pattern shows an example of how to prepare a Merge source:

WITH PIS as(
..pipe select
)
,SRC as (
..select of the source transformation, written so that the join with the pipe allows predicate push-down
)
SELECT 
 .. target columns
FROM PIS 
JOIN SRC ON .. pipe determination condition 
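
One possible concrete instantiation of the pattern, under assumed, hypothetical names (pis_order, l0_order, l0_customer); the result set would then feed a merge into the target table:

WITH pis AS (
    -- pipe select: keys of the changed source records
    SELECT src_order_id
      FROM pis_order
     WHERE processed_map_sales = 'N'   -- the consumer flag belonging to this particular mapping
),
src AS (
    -- source transformation, written so that the key predicate
    -- from the join with the pipe can be pushed down to l0_order
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           o.amount
      FROM l0_order    o
      JOIN l0_customer c ON c.customer_id = o.customer_id
)
SELECT src.order_id,
       src.order_date,
       src.customer_name,
       src.amount
  FROM pis
  JOIN src ON src.order_id = pis.src_order_id;   -- pipe determination condition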

In the PIS concept, the delete pattern has to be designed separately. It requires adding the source identifiers of each source to the target table. For example, when the transformation is a join of three source tables with single-column primary keys, you have to add three “service” columns to the target table.

In the example below, we use two source tables, the second of them in two instances. The target table will therefore contain three source identifiers, which also explains why we use the alias name from the transformation definition as a prefix instead of the short name of the source table.

...
From <Source1> as S1
Join <Source2> as S2 on <condition for S2>
Join <Source2> as S3 on <condition for S3> 

As a result, we have to add three new service columns:

 ,S1_<name of primary key column of Source1> <datatype>
 ,S2_<name of primary key column of Source2> <datatype>
 ,S3_<name of primary key column of Source2> <datatype>
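
Expressed concretely with hypothetical names (tgt_example as the target, an order table as Source1 and a customer table as Source2 used twice), the addition could look like this:

ALTER TABLE tgt_example ADD (
    s1_order_id    NUMBER,  -- primary key of Source1, alias S1
    s2_customer_id NUMBER,  -- primary key of Source2, alias S2
    s3_customer_id NUMBER   -- primary key of the second Source2 instance, alias S3
);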

The following pattern shows an example of how to prepare a Delete source (both Merge and Delete operations can be done in one statement):

WITH PIS as(
..pipe select
)
,SRC as (
..select of the source transformation, written so that the join with the pipe allows predicate push-down
)
SELECT 
 ..target columns
FROM PIS 
JOIN <Target table> 
  ON ..target records for the PIS values, join based on the source identifier in the target
LEFT JOIN SRC ON .. pipe determination condition 
WHERE ..anti-join condition when there is nothing found on the source side
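
Under the same hypothetical names, plus a target tgt_sales that stores the source identifiers as service columns, the delete branch could look like this; the anti-join keeps only the pipe keys whose records no longer appear in the source transformation, and the selected target rows are then marked as deleted:

WITH pis AS (
    SELECT src_order_id
      FROM pis_order
     WHERE processed_map_sales = 'N'
),
src AS (
    SELECT o.order_id
      FROM l0_order    o
      JOIN l0_customer c ON c.customer_id = o.customer_id
)
SELECT t.s1_order_id,
       t.s2_customer_id                      -- identification of target records to be (soft) deleted
  FROM pis
  JOIN tgt_sales t
    ON t.s1_order_id = pis.src_order_id      -- source identifier stored in the target
  LEFT JOIN src
    ON src.order_id = pis.src_order_id       -- pipe determination condition
 WHERE src.order_id IS NULL;                 -- anti-join: nothing found on the source side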

Note: The PIS approach is used efficiently by Oracle Materialized Views for the fast refresh method.
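
For comparison, this is how the same idea appears in Oracle fast-refreshable join materialized views, where the ROWID of each base table plays the role of a Source ID (the table names are illustrative):

-- Materialized view logs with ROWID play the role of source-wise pipes.
CREATE MATERIALIZED VIEW LOG ON orders    WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;

-- A fast-refreshable join MV must select the ROWID of each base table,
-- i.e. it stores the "source identifiers" in the target.
CREATE MATERIALIZED VIEW mv_order_customer
REFRESH FAST
AS
SELECT o.ROWID AS o_rid,
       c.ROWID AS c_rid,
       o.order_id,
       o.amount,
       c.customer_name
  FROM orders o, customers c
 WHERE o.customer_id = c.customer_id;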

The PIS approach, which uses source primary keys as the basis for its pipes, has both advantages and disadvantages.

Advantages:

  • Efficient processing with no additional look-ups required, as all lookups are performed on the target side.
  • Centralized transformation logic: the transformation into the L1 layer does not impact the L0 load and there is no need for an intermediate layer.

Disadvantages:

  • Need for replication of the ETL process, as a separate instance is required for each source pipe. This can result in deployment and maintenance difficulties, especially when using a traditional batch-based ETL tool. To mitigate these issues, a metadata-driven ETL solution may be necessary, but it also comes with additional infrastructure costs, such as running multiple sessions in Informatica PowerCenter.

Target-wise Pipes (PIT)

Target pipes are more complex to fill, but easier to use. They can be implemented as a single-consumer model. The main advantage is the simple and quick implementation of L1 ETL and the ability to perform the Delete operation in one step along with the Merge. However, the drawback is that all PITs must be known at the time of L0 definition, making L0 loads no longer independent, and CDC sessions become more complex and challenging to manage.

The PIT Pipe

Both Merge and Delete operations can be done in one step:

WITH PIT as(
..pipe select
)
,SRC as (
..select of the source transformation, written so that the join with the pipe allows predicate push-down
)
SELECT 
 .. target columns
FROM PIT 
LEFT JOIN SRC ON .. pipe determination condition 

According to the pattern above, we work with two alternatives based on the result:

  • Record from SRC was found – leads to Merge (Insert/Update) operation because the source record exists.
  • Record from SRC was not found – leads to a Delete operation: the source record no longer exists, but its key was in the PIT, so something made it disappear from the source transformation and it should therefore be deleted (marked as deleted).

Note: We call this approach the Microbatch Full Refresh Pattern.
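
A hedged, single-statement sketch of the pattern under hypothetical names (pit_party, l0_party, tgt_party): a pipe key with a live source record is merged (insert/update), while a pipe key whose source record has disappeared is only marked as deleted:

MERGE INTO tgt_party t
USING (
    SELECT pit.tgt_party_id,
           src.party_id AS src_found_id,    -- NULL when the record vanished from the source
           src.party_name,
           src.status
      FROM pit_party pit
      LEFT JOIN (
           -- source transformation over the L0 layer, non-deleted records only
           SELECT p.party_id, p.party_name, p.status
             FROM l0_party p
            WHERE p.deleted_flag = 'N'
      ) src ON src.party_id = pit.tgt_party_id   -- pipe determination condition
     WHERE pit.processed = 'N'
) s
ON (t.tgt_party_id = s.tgt_party_id)
WHEN MATCHED THEN
    UPDATE SET t.party_name   = NVL(s.party_name, t.party_name),
               t.status       = NVL(s.status, t.status),
               t.deleted_flag = CASE WHEN s.src_found_id IS NULL THEN 'Y' ELSE 'N' END
WHEN NOT MATCHED THEN
    INSERT (tgt_party_id, party_name, status, deleted_flag)
    VALUES (s.tgt_party_id, s.party_name, s.status, 'N')
    WHERE s.src_found_id IS NOT NULL;   -- do not insert records already gone from the source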

Advantages of the PIT approach:

  • Simplified L1 ETL implementation, as the module only requires one instance.
  • Easy delete operation as a single step without having to query the target.

Disadvantages of the PIT approach:

  • Separation of ETL logic between L1 ETL and processing of PITs.
  • Complex L0 load sessions, caused by the look-ups needed to fill the PITs, which make the loads slower and less manageable.

PIS to PIT – both source and target-wise pipes in one solution

The PIS to PIT approach combines the advantages of the two existing approaches, PIS (Source-wise Pipes) and PIT (Target-wise Pipes), in data integration and ETL (Extract, Transform, Load) processes. In this approach, the L0 (Level 0) pipes are filled as PIS pipes, which keeps the L0 loads independent. An intermediate element, PIS2PIT, then performs the necessary look-ups and fills the PITs based on the PISs. The L1 (Level 1) ETL process is solved in the same way as in the PIT approach. This improves the manageability of the solution compared to the separate PIS and PIT approaches, at the cost of slightly increased latency due to the additional PIS2PIT step. The L0 tables in this approach require a “soft delete” approach, meaning records are only marked as deleted instead of being physically deleted. The PIS2PIT process must also access the deleted records in L0 to construct the “former” key.

PIS to PIT

The PIS2PIT process works as follows:

  1. The L0 process fills the PIS pipes with CDC information; each PIS acts as a single-consumer pipe.
  2. The PIS2PIT process fills multiple PIT pipes based on the PIS information, distributing the PIS data to the required PITs, each of which is again a single-consumer pipe.
  3. The look-up logic in the PIS2PIT process uses the soft-deleted records from the L0 layer to construct keys for deleted records.
  4. The L1 process operates based on the filled PIT pipes and uses only the non-deleted records from the L0 layer.
The schema illustrates the complexity of PIS to PIT relationships. PIS to PIT is designed to delegate the complex peer-to-peer dependencies to separate nodes.
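
A hedged sketch of the PIS2PIT look-up step under hypothetical names: it translates the source keys from a PIS into target keys for the corresponding PIT, deliberately including soft-deleted L0 records so that the “former” key can still be constructed:

-- Distribute source-wise changes (PIS) to a target-wise pipe (PIT).
INSERT INTO pit_party (tgt_party_id, change_ts, processed)
SELECT DISTINCT p.party_id,      -- target key resolved by the look-up
       SYSTIMESTAMP,
       'N'
  FROM pis_customer pis
  JOIN l0_customer  c ON c.customer_id = pis.src_customer_id  -- soft-deleted rows are NOT filtered out here
  JOIN l0_party     p ON p.customer_id = c.customer_id
 WHERE pis.processed_pis2pit = 'N';   -- in PIS to PIT, PIS2PIT is the pipe's only consumer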

Advantages:

  • Simplified L1 ETL with all processing in a single instance.
  • Easy Delete operation through a single step, without querying the target.
  • Complete independence of L0 loads due to the intermediate element separating them from L1 logic.

Disadvantages:

  • ETL logic separated between L1 ETL and PIS2PIT processing.
  • The additional step of PIS2PIT increases latency slightly.
  • L0 tables require a “soft delete” approach with deleted records marked rather than removed. PIS2PIT must process even deleted L0 records to construct the “former” key.

Conclusion

Lager-wise ODS presents various challenges in terms of efficient data processing, management, and storage. The pipe approach, with its combination of PIS and PIT, addresses these challenges by balancing flexibility and manageability. This approach is worth exploring further, and in the following chapters we will delve deeper into the intricacies of Lager-wise ODS and the various solutions it offers.

Others in the series