Our series highlights the key aspects of ETL workflow management, namely metadata, dependencies, and resource management.
When I first began designing ETL systems, I realized that efficient process management is just as crucial as any other aspect of the ETL approach. Most current workflow systems prioritize easy visualization in a structured format, but they fall short in robustness and performance. A more advanced alternative is a heap-based workflow system, which opens the door to operations research methods and so brings greater robustness and performance to ETL systems.
Early experience with promoting metadata-driven solutions showed that only a fully integrated metadata-driven approach, covering the logical process, transformation, and data models, yields optimized processes and manageable solutions. Optimal transformations require proper workflow management, and vice versa. Logical transformation metadata should be free of implementation details, while workflow metadata should contain only logical dependencies and resource-management information.
Structure vs. Heap
In a structured workflow, you specify when each task should be executed. In a heap-based workflow, you place all tasks in a single heap and define constraints for their execution. The constraints include:
- predecessor dependency (addresses dependencies arising from referential integrity, data readiness, etc.),
- resource consumption (handles balancing of system resources, exclusive filling of certain segments, and temporary unavailability of certain objects),
- condition function (takes care of additional timing requirements, branches, and skips).
The heap-based workflow can be transformed into a directed graph and optimized with advanced methods.
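The three constraints can be combined into a single eligibility check. Below is a minimal sketch; the function and parameter names (`eligible`, `predecessors`, `consumption`) are illustrative assumptions, not part of any particular engine:

```python
def eligible(task, done, predecessors, free, consumption, condition=None):
    """Check the three heap constraints for one task.

    task          -- task name
    done          -- set of completed task names
    predecessors  -- dict: task -> set of predecessor task names
    free          -- dict: resource -> currently available amount
    consumption   -- dict: task -> dict of resource -> required amount
    condition     -- optional callable task -> bool (run condition)
    """
    # 1. predecessor dependency: all predecessors finished in this load
    if not predecessors.get(task, set()) <= done:
        return False
    # 2. resource consumption: every required resource is available now
    needs = consumption.get(task, {})
    if any(free.get(r, 0) < amount for r, amount in needs.items()):
        return False
    # 3. condition function: optional timing/branch/skip check
    return condition(task) if condition else True
```

A task enters execution only when all three checks pass; everything else stays in the heap.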
Manageability
The problem of manageability should be addressed at the definition stage, ideally through a ‘process interface’ between local workflow models. The runtime process workflow model should carry metadata at the finest level, and execution should be managed with operations research methods. There are two levels of metadata definition, each with its own set of requirements:
- logical, definition level (manageable and well-documented),
- physical, engine level (effective at runtime).
Metadata for the second level is generated from the first level, so the quality of the second level is not harmed, as it is in most current solutions. A process workflow should be modeled the way data structures are: first create a model, then generate the necessary elements such as tables, indexes, and constraints. Most processes run only one instance of the load process at a time, but an advanced workflow allows simultaneous loads; this is more complex, since dependencies must then be considered between tasks and even between instances of the same task. The simpler first variant, which covers the most common cases, is addressed first.
Essential Components of the Heap Model
- T represents a Task, the basic unit of a process workflow that performs an action, typically running an ETL module (ETL mapping) with specific parameters.
- R represents a Resource, either real or virtual, defined to limit parallel task execution. Resources can model real resources, such as CPU time, IO, or a lock on a particular segment, or virtual resources that stand in for a group of resources when the real resource consumption of tasks is hard to identify. Best practice is to begin with common performance resources, preventing large tasks from running concurrently, and to refine toward real resources based on runtime statistics.
- D(T, T) represents a Dependency, which links a predecessor task to a successor task. Best practices have shown that modeling only the predecessor-successor dependency is sufficient. Mutual exclusion, which is often defined, is usually a result of resource consumption conflict and can be better addressed through Resources. A task cannot be executed until all its predecessors are completed in the current load.
- R(T) represents Resource consumption, which specifies the resource consumption of a particular task. A task cannot be executed until all required resources are available in the required amount at that moment.
- Tc represents a Task run condition, an optional property of a task that determines whether the task should be executed or skipped. Optional execution of certain groups of tasks can be handled with such conditions on each task, or by defining a separate heap when the group is large and self-contained.
- Tp represents Task priority, which determines the order of execution when multiple tasks satisfy all dependency and resource requirements. Priorities are determined offline based on runtime statistics and operations research heuristic models.
- Tt represents Task estimated time, an attribute of a task filled with runtime statistics as an average execution time.
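The components above map naturally onto plain data structures. The following sketch is one possible representation; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Resource:
    """R: a real or virtual resource that limits parallel execution."""
    name: str
    available: float = 100.0   # default 100 (percent) at the initial setting

@dataclass
class Task:
    """T: the basic unit of a process workflow."""
    name: str
    consumption: dict = field(default_factory=dict)  # R(T): resource -> amount
    predecessors: set = field(default_factory=set)   # D(T, T): predecessor names
    condition: Optional[Callable[[], bool]] = None   # Tc: optional run condition
    priority: int = 0                                # Tp: offline-tuned priority
    estimated_time: float = 0.0                      # Tt: avg runtime statistic
```

A generator would fill these structures from the logical-level process models rather than having designers write them by hand.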
Runtime Components
Runtime components are the elements that are active and in operation during the execution of a process. In the context of process workflow, runtime components may include Load, Task run, Task runtime status, Thread, etc. These elements work together to carry out the process as defined in the heap model.
- L represents Load: a single execution of the process, e.g. the daily load.
- T(L) represents Task Execution: the execution of a specific task in a specific load.
- T(L)s represents Task Run Status: the runtime status (e.g. None, Wait, Running, Failed, Done). The status ‘Wait’ requires an additional attribute holding the time of the first possible run. The status ‘Failed’ requires a description of the failure and the number of occurrences. The status ‘Done’ should have a flag indicating a skipped run.
- X represents Thread: a background process during runtime that searches for eligible tasks to execute and performs them. After completion (whether successful or not), the thread looks for another runnable task and executes it. If there are no more executable tasks, the thread sleeps for a while. The number of threads limits the parallelism, so there are usually 10 or more active threads in an ETL workflow engine. In earlier implementations, a sleeping thread would wake up after a certain time and plan its own start. In advanced solutions, a sleeping thread is also awakened by another thread when a task finishes and more than one runnable task is discovered.
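The runtime-side state described above can be sketched as a status enumeration plus a per-load task-run record. The names and fields below are assumptions chosen to match the attributes the text calls for (wait-until time, failure details, skipped flag):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RunStatus(Enum):
    NONE = "None"
    WAIT = "Wait"        # requires: time of first possible run
    RUNNING = "Running"
    FAILED = "Failed"    # requires: failure description and occurrence count
    DONE = "Done"        # requires: skipped-run flag

@dataclass
class TaskRun:
    """T(L): the execution of one task within one load."""
    task: str
    load_id: str                         # L: e.g. the daily load
    status: RunStatus = RunStatus.NONE   # T(L)s
    wait_until: Optional[float] = None   # used with WAIT
    failure_msg: Optional[str] = None    # used with FAILED
    failure_count: int = 0
    skipped: bool = False                # used with DONE
```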
The Heap and the Process
A Heap is a collection of all Tasks that have dependencies and defined resource consumption. While it may seem most efficient to consolidate all daily ETL (Extract, Transform, Load) tasks into a single large Heap, this approach can be difficult to manage. It’s important to note that just because tasks are placed in a single Heap, it doesn’t necessarily mean they need to be designed that way. Alternatively, it may be more manageable to group related tasks together in a logical process with a dependencies model.
Several process models on a logical level are used to generate one heap of tasks with dependencies and resource consumption defined.
Process interface
When it comes to organizing things between processes, there are multiple methods available. The most prevalent approach is to create a dependency-oriented graph on the process level. This means that any given process can be dependent on the completion of one or more other processes. However, this approach often leads to one of two negative outcomes.
- Firstly, the workflow may not be fine-grained enough, leaving little room for optimization.
- Secondly, the processes may be too small, making them difficult to manage effectively, and the overall structure of the processes may not align with the business’s requirements or competencies.
To enable fine-grained interactions between processes, a process interface is utilized.
A process interface can be implemented as a pair of Milestone-type tasks present in both processes, with a defined state (contract) of processing in the sourcing process. That way, designers of the consuming process can rely on all requirements specified in the interface contract being fulfilled (e.g. all records of the Party entity are loaded, including identifiers).
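Because the milestone task appears in both processes (as a successor in the source process and as a predecessor in the target process), generating one heap from the two process models reduces to merging their dependency definitions. A minimal sketch, with assumed names and a simplified task-to-predecessors representation:

```python
def merge_processes(source_deps, target_deps, milestone):
    """Merge two per-process dependency dicts (task -> set of predecessors)
    into one heap; the shared milestone task stitches the processes together."""
    heap = {}
    for deps in (source_deps, target_deps):
        for task, preds in deps.items():
            heap.setdefault(task, set()).update(preds)
    # the milestone must be defined (with its predecessors) in the source process
    assert milestone in heap, "milestone must appear in at least one process"
    return heap

# hypothetical example: "M_party_done" is the interface milestone
heap = merge_processes(
    {"party_load": set(), "M_party_done": {"party_load"}},  # source process
    {"account_load": {"M_party_done"}},                     # target process
    "M_party_done",
)
```

In the merged heap, `account_load` can start only after `M_party_done`, i.e. only once the source process has fulfilled the contract.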
Redundant dependencies
After generating your workflow graph (which is typically done automatically by a generator), you may discover that some of the dependencies appear to be redundant or transitive. In such cases, there are two potential approaches for dealing with these dependencies:
- The first approach is not to generate them (transitive reduction) and to maintain them only at the logical level. These dependencies must still be kept on record, since they are critical for impact analyses and for managing future changes.
- The second approach is to generate the dependencies and flag them as “Redundant.” Doing so enables the engine to disregard them during execution but retains them for operational tasks such as impact research during problem-solving.
Circular dependencies
Circular dependencies can lead to deadlock in workflow processing. It’s crucial to check for circular dependencies during the generation of heap workflow metadata from process metadata, and after any ad-hoc changes. This check should be done before committing any changes in the dependency structure. If circular dependencies are identified, they should be reported, and the changes shouldn’t be committed.
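The pre-commit check amounts to cycle detection on the generated dependency graph, which a depth-first search handles directly. A sketch, again using the assumed task-to-predecessors representation:

```python
def find_cycle(preds):
    """Return a list of tasks forming a cycle (closed with its first task),
    or None if the dependency graph is acyclic."""
    nodes = set(preds) | {p for ps in preds.values() for p in ps}
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited / on current path / finished
    color = {t: WHITE for t in nodes}
    path = []

    def visit(task):
        color[task] = GREY
        path.append(task)
        for p in preds.get(task, set()):
            if color[p] == GREY:      # back edge: a cycle closes here
                return path[path.index(p):] + [p]
            if color[p] == WHITE:
                cycle = visit(p)
                if cycle:
                    return cycle
        path.pop()
        color[task] = BLACK
        return None

    for task in sorted(nodes):
        if color[task] == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return None
```

If `find_cycle` returns a non-empty path, the generator reports it and refuses to commit the dependency change.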
Another type of deadlock can occur when the total resource consumption required exceeds the available resources. In such cases, operations may need to temporarily halt processing of specific phases. When processing is stopped for this reason, the state should simply be reported and notified.
To address these issues, a virtual node is introduced in the “source” process that is visible to the “target” process. The definition of this node should include the owning process, the consuming process(es), and a description of the interface’s meaning (e.g., “table T1 is loaded, table T2 is loaded from source system S1, and table T3 has all key attributes loaded”). This description serves as a contract between the two processes, and it is essential that both parties understand their responsibilities. While the narrative description may seem redundant, since it is implied by the dependencies, it is crucial: it allows the source process owner to change its processing and still understand how to reorganize dependencies to fulfill the contract.
Resources
Resource limitation has several attributes:
- The available amount of a resource, which defaults to 100 (a percentage) at the initial setting.
- Resource consumption, expressed as a percentage of the total available amount. It is best practice to keep the available amount modifiable, so it can be scaled in response to issues that arise.
- Type of resource, which can be either standard or pending. A standard resource is blocked in the appropriate amount during task execution and released after completion (regardless of whether it was successful). In contrast, a pending resource is blocked at the start of the task and remains blocked even after completion, until another task releases it through a negative consumption setting. Pending resources are typically used to block certain actions during intervals that exceed one task frame.
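The standard/pending distinction can be sketched as a small resource-pool class. The names (`ResourcePool`, `can_book`, `book`, `release`) are assumptions; the point is that pending resources are not returned on completion, and a later task returns them via negative consumption:

```python
class ResourcePool:
    def __init__(self, available):
        self.free = dict(available)     # resource name -> available amount

    def can_book(self, needs):
        """True if every positively-consumed resource is available now."""
        return all(self.free.get(r, 0) >= a for r, a in needs.items() if a > 0)

    def book(self, needs):
        # a negative amount releases a pending resource booked earlier
        for r, amount in needs.items():
            self.free[r] = self.free.get(r, 0) - amount

    def release(self, needs, pending=()):
        # standard resources come back after completion; pending ones do not
        for r, amount in needs.items():
            if r not in pending and amount > 0:
                self.free[r] += amount
```

For example, a task booking `{"seg_lock": 100}` with `seg_lock` marked pending keeps the segment blocked after it finishes; a later task with consumption `{"seg_lock": -100}` unblocks it.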
Threads
A thread is a background process that executes tasks independently by following a series of steps:
- It puts a lock on the choosing process.
- It selects a set of tasks available for further execution.
- It selects a task for execution and marks it as running.
- It books resources needed for the task.
- It unlocks the choosing process.
- It executes the task and logs the result, changing the task’s state based on the result.
- It repeats the above steps until there are no more tasks available for execution. In such cases, the thread switches to sleeping mode for a while.
More advanced solutions may have additional steps. For example, after the thread finishes executing a task, it checks how many tasks are available for further execution. Depending on the result, it may:
- Fall asleep for a longer period (about 5 minutes) if there are no tasks left and another thread is active, since that thread will wake it when new tasks become runnable.
- Fall asleep for a short period (about one minute) if there are no tasks remaining and no other thread is active.
- Execute the single remaining task if there is only one available.
- Execute the task with higher priority if there are multiple tasks available and wake up another thread to execute the remaining tasks.
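The thread loop above can be sketched in a compact, self-contained form. This is a simplified simulation under stated assumptions: task bodies are no-ops standing in for ETL modules, a spin-continue stands in for the sleep/wake-up logic, and the single lock guards the whole choosing step (task selection plus resource booking) so two threads can neither pick the same task nor overbook a resource:

```python
import threading

def run_heap(tasks, free, n_threads=4):
    """tasks: name -> (predecessors: set, consumption: dict, priority: int).
    free: resource name -> available amount. Returns the set of done tasks."""
    done, running = set(), set()
    lock = threading.Lock()

    def runnable():
        return [t for t, (preds, cons, _p) in tasks.items()
                if t not in done and t not in running
                and preds <= done                                # dependencies met
                and all(free.get(r, 0) >= a for r, a in cons.items())]

    def worker():
        while True:
            with lock:                          # lock the choosing process
                cands = runnable()
                if not cands:
                    # heap drained, or nothing can ever run again (deadlock)
                    if len(done) == len(tasks) or not running:
                        return
                    task = None
                else:
                    task = max(cands, key=lambda t: tasks[t][2])  # priority Tp
                    running.add(task)           # mark as running
                    for r, a in tasks[task][1].items():
                        free[r] -= a            # book resources
            if task is None:
                continue                        # a real engine would sleep here
            # execute the task (no-op stand-in for the ETL module), then log
            with lock:
                running.discard(task)
                done.add(task)
                for r, a in tasks[task][1].items():
                    free[r] += a                # release standard resources

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

For instance, with tasks `b` and `c` both depending on `a` and together needing more than the available `cpu`, the threads serialize `b` and `c` through the resource constraint while still running independent work in parallel.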