Operational Data Store: A Perfect Blend of Data and Brewing Science – Chapter 2

Metadata is often referred to as “data about data” and refers to the information that describes and contextualizes the data within a system. In the context of an ODS (Operational Data Store) system, metadata plays a crucial role in ensuring the effectiveness and maintainability of the system. By utilizing metadata to drive development, organizations can streamline processes, improve data governance, and create a more manageable data environment. In this article, we will delve into the importance of metadata in ODS systems, including its different forms and levels of abstraction. From there, we will explore the benefits of metadata-driven development and how it can be used to create a well-designed, effective, and sustainable ODS system.

Metadata

Metadata plays a crucial role in the organization and management of data systems. It can be categorized into three main types: Contextual, Static, and Dynamic.

  • Contextual metadata includes business data dictionaries, which define data domains, and entities, and standardize terminology.
  • Static metadata describes data models, such as conceptual, logical, and physical data models.
  • Dynamic metadata describes data flows, transformations, and manipulations, and can be represented by a logical transformation model, stripped of the platform and design specifics, that can be used to generate implementation-specific scripts, views, procedures, or code. The key to successful metadata-driven development of an ODS system is to maintain consistent and minimal dynamic metadata, which is free of standard or implementation specifics.

Abstraction

Metadata in data management can be understood at different levels of abstraction, similar to data models in database design. Just as a conceptual data model provides a high-level view of the data, and a physical data model provides a more concrete and specific implementation, dynamic metadata can also be viewed at different levels of abstraction. Logical dynamic metadata strips away platform-specific information and functional design patterns, while physical dynamic metadata includes detailed script and implementation specifics. By keeping logical dynamic metadata minimal, it remains flexible and adaptable to changes in technology and implementation.

In the world of data management, just like physical data models can be abstracted from their logical form to create a more universal representation, dynamic metadata can also be abstracted from their most concrete form. Just as physical data models are generated from their logical forms using DBMS-specific generators, dynamic metadata can also be generated from a more abstract representation using pattern-specific generators. This allows for greater flexibility and adaptability in the face of changing technology or new implementations. By keeping the logical representation of dynamic metadata stripped of implementation-specific features, it is easier to maintain consistency with static metadata describing the source and target data structures, leading to a more effective and manageable ODS system.

Generator independency

Consistency

Having consistent relationships between static and dynamic metadata is crucial in the metadata-driven development of ODS systems. It helps ensure that dynamic metadata is accurate and up-to-date with the latest changes in the source and target data structures. The relationships also allow for automated suggestions and error prevention during the design process, making the overall development process more efficient and effective.

Prescriptive and descriptive metadata

Prescriptive metadata refers to the metadata that is intentionally defined and used to generate a solution. On the other hand, descriptive metadata simply describes the current state of systems and is often obtained through reverse engineering existing systems.

When discussing a metadata-driven approach, we are referring to prescriptive metadata.

Using metadata and patterns to generate particular modules.

Collaboration, documentation

The metadata-driven approach not only provides a central repository for all data-related information but also facilitates documentation and collaboration. With clear and up-to-date metadata, teams can better understand the data and its usage, as well as ensure that everyone is working with the same information. This leads to improved data quality, reduced errors, and more efficient processes. Furthermore, having all metadata in one place helps to ensure that everyone has access to the information they need, improving communication and collaboration across the team.

Evolution of ETL towards a metadata-driven approach

Prehistorical Age

In the prehistorical age of ETL development, standard ETL tools were the main means of implementation. The majority of tasks were stereotypical, requiring manual typing of code or drawing lines in graphical tools by a team of ETL developers. This era was marked by a heavy reliance on time and material accounting, with most tasks requiring manual effort.

Prehistorical age

Age of Pioneers

In the age of pioneers, standard ETL tools were complemented with tools that could solve the most stereotypical tasks based on standard metadata, typically in the form of a data dictionary or an add-on to data design tools. At this time, about 20% of tasks could be automated. The introduction of these tools marked the beginning of a shift towards a more efficient and streamlined approach to ETL development.

Age of pioneers

Age of Conquerors

The age of conquerors was marked by the introduction of metadata-driven ETL tools, a complex metadata repository, and ETL patterns. This era saw a significant increase in the number of tasks that could be generated, with approximately 98% of tasks now automated. The use of metadata-driven tools and patterns enabled organizations to streamline their ETL processes and increase efficiency.

Age of conquerors

Metadata-driven

The metadata-driven era of ETL development is characterized by the use of modular metadata-driven solutions, manageable ETL patterns, and independent logical metadata. With 100% of tasks generated, this era represents the pinnacle of ETL development, where organizations can achieve maximum efficiency and flexibility in their ETL processes. The ability to manage ETL patterns and logical metadata independently makes it possible to easily switch between different systems and vendors without sacrificing consistency or efficiency.

Metadata-driven ETL

Differently from our historical experience when we had to prove the viability of the metadata-driven concept now it is more about the decision of whether to do it or not. Starting a new greenfield project you can involve a metadata-driven approach from the very start without compromises. It is much harder to implement metadata-driven into existing solutions than to start it at the very start.

Why metadata-driven:

Adopting a metadata-driven approach to building an ODS system can bring several benefits as well as some challenges.

Pros:

  • Centralized repository: Keeping all metadata in a centralized location makes it easier to maintain, update and reuse them.
  • Consistency and accuracy: By having relationships between static and dynamic metadata, it helps ensure that the metadata are consistent and accurate, reducing the risk of errors.
  • Improved collaboration: With all the metadata and documentation in one place, it enables better collaboration among different teams, departments, and stakeholders.
  • Faster development: Automating the generation of physical data models and dynamic metadata based on patterns can help reduce manual effort and speed up the development process.
  • Increases independence from systems and vendors of target infrastructure as it is possible to switch from one to another by changing patterns.
  • Provides a proof of concept for new implementations by replacing old patterns with new ones.
  • Helps navigate away from dead-ends in solution design.
  • Allows for solution design in metadata even before all technical and implementation details are resolved.

Cons:

  • Initial investment: Implementing a metadata-driven approach can require a significant initial investment in terms of resources, tools, and infrastructure.
  • Dependency on tools: The success of a metadata-driven approach depends on the availability and reliability of the tools used to manage and automate the metadata.
  • Requires a significant investment in metadata management and maintenance.
  • May have a steep learning curve for designers unfamiliar with metadata-driven development.
  • Requires adherence to strict metadata standards and governance.
  • This may result in increased overhead for small projects where metadata management is not required.
  • There may be resistance from teams who are accustomed to working with more traditional, manual approaches.

In conclusion, while a metadata-driven approach can bring several benefits, it is important to carefully assess the needs, resources, and risks involved before making the decision to adopt it.

While with Batch ETL the approach is an option (however inefficient), in ODS, especially the lager-wise one, that is the must.

Requirements

the requirements for a metadata management system can be categorized into several groups:

  1. Descriptive functions:
  • A logical picture of the system and its history
  • Visualization of metadata
  • Impact analysis and development support
  • QA
  1. Prescriptive functions:
  • Support in generating parts of the system
  • Automatic testing and QA
  • Automation – metadata-driven development
  1. Support:
  • Team collaboration
  • Version control
  • Branches (e.g., DEV vs. PROD FIX)
  • High Availability
  • Deployment
  1. Extensibility:
  • New meta-entities
  • New relationship types
  • New attributes, directives
  • New interfaces (API, GUI, Scanners, Generators)
  1. Exchange:
  • Scan Database schemas
  • Scan (synchronize) Data models
  • Scan SQL, PL/SQL, T-SQL, P-SQL
  • Generate DDL/DML
  • Generate XSD, WSDL, etc.
  • Generate Operational metadata
  • Generate necessary documentation
  1. Housekeeping:
  • Backup/restore
  • Statistics, optimization
  • Aging, log retention
  1. Metadata QA:
  • Rules
  • Issue reports
  • Checklists
  1. Security:
  • Roles
  • Access Auditing
  1. Additional requirements:
  • Completeness – the ability to store and maintain all required attributes
  • Open interfaces, APIs
  • User-friendly customizable interfaces
  • Impact analysis
  • Support – the ability to analyze the impact of changes on metadata and the system
  • Integration – the ability to integrate with other systems and tools
  • Flexibility – the ability to support different metadata use cases and workflows
  • Scalability – the ability to handle large amounts of metadata efficiently
  • Automation – the ability to automate repetitive metadata tasks and processes.

Static metadata

The static metadata domains can be categorized as follows:

  1. Data domains: describes the different types of data being managed and processed.
  2. Schema and Data Sets: defines the structure of the data and how it is organized, including tables, views, files, data frames, documents, key-value pairs, and column families.
  3. Relationships: describes how different data sets and elements are related to each other.
  4. Constraints and Indexes: defines the rules and conditions for managing and processing the data, such as indexes and data constraints.
  5. Attributes: describes the characteristics and properties of the data, such as data type, length, and value.

Dynamic metadata

The categories of dynamic metadata are:

  • Process metadata: Includes information about the processes involved in ETL such as transformations, interfaces, and targets.
  • Task metadata: Contains information about specific tasks such as dependencies, resources, and attributes.
  • Dataset metadata: Covers information about the source and target datasets, including levels and relationships between datasets.
  • Transformation metadata:
    • Transformation rules
    • Mapping information
    • Join conditions
    • Filter conditions
    • Aggregation information
    • Calculated fields
    • Data enrichment information
    • Data type conversions
    • Null handling information
    • Key generation information
    • Change data capture information

The metadata should be organized in a way that supports the generation of the required implementation objects, such as queries, data transformations, and so on. This helps in automating the ETL development process, improving efficiency, and reducing manual errors.

High availability

High Availability refers to the ability of a system to remain operational and accessible even in the event of hardware or software failures or other disruptions. This is important for a metadata management system because if the metadata system is down, it can impact the ability to generate fixes or changes for the production environment, even if it does not directly impact the functionality of the production systems themselves. To ensure high availability, the metadata management system should be designed with redundancy, failover mechanisms, and other measures to minimize downtime and ensure continuous operation.

Extensibility

The steps in the process of extending metadata can be summarized as follows:

  1. Prototype development with simultaneous description in metadata
  2. Gap analysis to identify missing descriptive metadata
  3. Extension of metadata system to cover the descriptive gap
  4. Comprehensive gap analysis to identify missing prescriptive metadata
  5. Extension of metadata system to cover the prescriptive gap
  6. Building a code generator based on the metadata
  7. Automating the prescriptive metadata.

These are examples of the types of changes that may occur frequently in a metadata system and require extensibility to support updates and modifications to the system (ordered from the most common to initial or rare ones):

  • Changes in generators
  • Changes in GUI (ergonomics)
  • Documentation generators
  • New directives for generators in the metadata structure
  • Import/export processes
  • New metadata entities, relationships, and attributes

Those are the only options for competencies necessary for various types of extensions:

  • Possible handover of the development to internal people
  • Embedded extensibility as a system feature
  • Provided by supplier expanding base product
  • Provided by the supplier as a customization

Metadata-Driven Automation

Metadata-Driven Automation is a powerful technique that enables organizations to streamline and automate various tasks in their data pipeline by leveraging a single source of metadata. By capturing the business requirements and technical details of the data transformations, reconciliation processes, and test cases in a structured metadata repository, organizations can use that information to generate the necessary implementation scripts and artifacts in a consistent and efficient manner. This not only saves time and reduces the risk of manual errors, but it also enables organizations to quickly respond to changing requirements and update their data pipelines with minimal disruption. With Metadata-Driven Automation, organizations can focus on the critical aspects of their data operations, such as defining their data governance policies and ensuring the quality of their data, while leaving the tedious and repetitive tasks to be handled automatically by the metadata management system.

Metadata-driven automation example

Approaches to “Specialities” in Metadata-Driven Automation

In the field of metadata-driven automation, there are three main approaches to handling the special cases or exceptions potentially arising during code generation. These approaches are manual overrides, pattern directives, and custom transformations.

The problem: Specialities

Manual override approach

The manual override approach involves manually editing the generated code after it has been produced by the generator. This approach is the least flexible of the three, as it requires human intervention each time we generate a new version of the code. Furthermore, manual overrides are a risk because they increase the likelihood of human error, which can cause the generated code to behave in unexpected ways.

Manual amendments (override)

Custom transformation approach

The custom transformation approach involves incorporating custom code into the metadata, which is then used by the generator to produce the final code. This approach is less flexible than the pattern directive approach, as it requires that the custom code be rewritten each time the system is adapted to a new platform or methodology. Furthermore, custom transformations increase the risk of manual overrides, as they require human intervention to make changes to the custom code each time a new version of the code is generated.

Custom transformation approach

Pattern directive approach

The pattern directive approach involves incorporating special case handling into the metadata. This approach is the most flexible of the three, as it allows the system to adapt to new requirements or platforms with minimal manual intervention. For example, when the generator encounters a pattern directive in the metadata, it will generate code that is customized to handle the specified case. This approach reduces the risk of manual overrides, while still allowing the system to be adapted to new requirements.

Pattern directives

In conclusion, each of the three approaches to handling special cases in metadata-driven automation has its own advantages and disadvantages. The manual override approach is the least flexible and carries the greatest risk of human error, while the custom transformation approach reduces the risk of manual overrides but reduces the system’s ability to adapt to new requirements.

The pattern directive approach provides a balance between flexibility and stability, making it the preferred approach for many metadata-driven automation projects.

Metadata-driven approach and vendor lock

A well-designed metadata-driven solution allows for flexibility and scalability without being locked into a specific vendor.

The metadata acts as the central source of information and enables the automation of various processes and transformations, resulting in a streamlined and efficient system. The ability to modify or extend the metadata structure to fit changing business needs, without being limited to a single vendor’s offerings, helps to avoid vendor lock-in and ensures the solution can evolve and adapt as the organization grows and changes. A metadata-driven approach also allows for greater transparency and accountability, as the metadata provides a clear record of the relationships and transformations within the system.

The well-designed metadata-driven solution brings a much more comprehensive level of independency than an optically universal solution. It is much easier to develop new generators for new patterns (e.g. old pattern deploying to Informatica PowerCenter to the new one for Oracle PL/SQL or the opposite way) than to port the solution from one RDBMS to another.

Conclusion

In conclusion, metadata management is a critical aspect of a successful ODS system. A metadata-driven approach provides a clear structure and governance of metadata, leading to greater efficiency and automation in the development and deployment of ODS solutions. The key elements of a successful metadata management system include well-defined metadata entities, relationships, and attributes, as well as robust access management, versioning, deployment and testing processes, and extensibility options. The ability to hand over development to internal teams embed extensibility as a system feature, and supplier support for expanding the base product or customization are also important competencies to consider. With a well-designed metadata management system in place, organizations can fully leverage the benefits of a metadata-driven approach to ODS, streamlining the development process and increasing the overall efficiency and quality of their ODS solutions.

OtherS in the series