In Data Lineage, we demonstrated that
<<trace>> relationships can be a superior alternative for documenting data-lineage, and provide a framework for Enterprise Lineage. This blog is concerned with the automatic derivation of lineage constraints for
<<metadata>> Information Items, and its implementation.
<<trace>> dependency is inherently a graph problem, but within software architecture the problem is a finite graph, so it does not need to be transferred to, or interrogated with, Graph Database technology – any 64-bit operating system can hold the entire graph in memory while processing.
Modelling oddities like
[Chicken] has a trace dependency to
[Egg] has a trace dependency to
[Chicken] can be ignored, because a recursive search gains no additional information from revisiting a node – and the construct can be useful when an in-memory object is sourced from a database row, and the database row is sourced from the in-memory object.
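A visited-set makes the recursive trace search tolerant of Chicken/Egg loops. The sketch below is illustrative only (the function and edge-map names are assumptions, not part of the EA.Gen.Model API):

```python
# Hypothetical sketch: depth-first trace search that tolerates cycles
# by tracking visited nodes instead of recursing forever.
def collect_trace_sources(node, trace_edges, visited=None):
    """Return every node reachable via <<trace>> from `node`, ignoring loops."""
    if visited is None:
        visited = set()
    sources = set()
    for target in trace_edges.get(node, []):
        if target in visited:
            continue  # Chicken/Egg loop: already seen, nothing new to learn
        visited.add(target)
        sources.add(target)
        sources |= collect_trace_sources(target, trace_edges, visited)
    return sources

# A Chicken/Egg loop terminates instead of recursing forever:
edges = {"Chicken": ["Egg"], "Egg": ["Chicken"]}
print(sorted(collect_trace_sources("Chicken", edges)))  # ['Chicken', 'Egg']
```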
The trace-graph can be expressed (in JSON) as an array of objects, where each object is either a text node or a trace-graph of further dependencies – a flattened array of all dependencies can be produced by removing all the [ and ] brackets from the array.
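The bracket-stripping flattening can be sketched as a small recursive function over nested JSON arrays (an illustrative sketch, not the blog's implementation):

```python
import json

# Sketch: a trace-graph as nested JSON arrays, where each entry is either
# a text node or another nested trace-graph. Flattening is equivalent to
# stripping the '[' and ']' brackets.
def flatten(trace_graph):
    flat = []
    for item in trace_graph:
        if isinstance(item, list):
            flat.extend(flatten(item))  # descend into nested dependencies
        else:
            flat.append(item)
    return flat

graph = json.loads('["Report", ["Staging", ["SourceTable"]], "Reference"]')
print(flatten(graph))  # ['Report', 'Staging', 'SourceTable', 'Reference']
```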
Lineage is not concerned with the process or algorithm of any transformation, but with the list of information used – its reagents.
This implementation uses Sparx Enterprise Architect, which stores the repository information in a normalised relational database that can be accessed via an Entity Data Model, using EA.Gen.Model to provide an object-graph view of the repository database.
The derivation consists of three parts: recursive derivation of Attribute data-lineage; recursive derivation of Information Item data-lineage constraints; and derivation of Enterprise Lineage.
Starting from the
<<metadata>> Information Item, a recursive search through
<<trace>> abstractions finds common attributes (matched either by name or by alias) that imply lineage. This is implemented as a separate pair of functions (including
mapDataLineage) so that all data lineage can be refreshed once, for all entities, when derivation is scheduled. The script can be tailored to site-specific requirements.
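The name/alias matching step can be sketched as follows. This is a minimal illustration: the `Attribute` record and helper names are assumptions, not the EA.Gen.Model API or the mapDataLineage implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: two attributes imply lineage when the target's name
# or alias matches the source's name or alias (case-insensitive).
@dataclass(frozen=True)
class Attribute:
    name: str
    alias: str = ""

def identifiers(attr):
    """Names under which an attribute can be matched (case-insensitive)."""
    return {s.lower() for s in (attr.name, attr.alias) if s}

def implied_lineage(target_attrs, source_attrs):
    """Pair each target attribute with source attributes sharing a name or alias."""
    pairs = []
    for t in target_attrs:
        for s in source_attrs:
            if identifiers(t) & identifiers(s):
                pairs.append((t.name, s.name))
    return pairs

target = [Attribute("CustomerKey"), Attribute("FullName", alias="CustomerName")]
source = [Attribute("CustomerName"), Attribute("CustomerKey")]
print(implied_lineage(target, source))
# [('CustomerKey', 'CustomerKey'), ('FullName', 'CustomerName')]
```

The alias check is what lets a renamed column still imply lineage; an overloaded name like "Id" would match far too widely, which is why the next paragraph warns against such names.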
It’s generally recommended that overloaded names like “Name” and “Id” are not used in data warehouses, because Business Intelligence tools (Power BI, Tableau, QlikView, etc.) will assume that they represent the same domain. If absolutely necessary, a copy-columns view object can avoid implied lineage by using a different name – this is preferable to changing the script.
The result of the recursive search is stored as a lineage Tag value on each attribute.
There is no fragile dependency on the trace model being free of Chicken/Egg loops.
Starting from the
<<metadata>> Information Item, the
dataLineageReport function recursively gathers all attribute lineage Tag values into an in-memory dictionary so that every object needed for Lineage is referenceable.
For every attribute of every element referenced by the
<<metadata>> Information Item, a lineage constraint is created by recursively searching the dictionary to expand the constraint until all source reagents are added.
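The constraint expansion can be sketched as a recursive walk over the dictionary of lineage Tag values, stopping at entries with no further lineage (the source reagents). Names and the tag-dictionary shape are illustrative assumptions:

```python
# Sketch: expand an attribute's lineage constraint by recursively searching
# the in-memory dictionary of lineage Tag values until only source reagents
# (entries with no further lineage) remain.
def expand_constraint(attribute, lineage_tags, seen=None):
    """Expand an attribute's lineage until all source reagents are reached."""
    if seen is None:
        seen = set()
    reagents = set()
    for upstream in lineage_tags.get(attribute, []):
        if upstream in seen:
            continue  # guard against trace loops
        seen.add(upstream)
        if upstream in lineage_tags:   # has its own lineage: keep expanding
            reagents |= expand_constraint(upstream, lineage_tags, seen)
        else:                          # no further lineage: a source reagent
            reagents.add(upstream)
    return reagents

tags = {
    "Mart.Revenue": ["Staging.Revenue"],
    "Staging.Revenue": ["ERP.InvoiceLine", "ERP.FxRate"],
}
print(sorted(expand_constraint("Mart.Revenue", tags)))
# ['ERP.FxRate', 'ERP.InvoiceLine']
```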
Starting from the
<<metadata>> Information Item, all
<<flow>> Information Flow references are recursively gathered into an in-memory dictionary of the elements referenced (including components, classes, actors, processes, etc.).
For every data entity referenced by a
<<flow>>, a dictionary is produced of data references with their onward
<<trace>> references. This dictionary is then recursively expanded to include a reference to every source reagent.
For every data-entity included in every trace reference to the
<<metadata>> Information Item, a constraint is created by recursively searching the flow dictionary and filtering flows with the rule:
For each Element in the flow, if the next flow conveys a data entity with data-lineage common to the previous flow, and common with the constraint item, then it is inferred that this is an extension of the flow lineage.
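The filtering rule above can be sketched as a walk along the flow chain that stops as soon as the lineage overlap is broken. The data structures and names are illustrative assumptions, not the Lineage.fs implementation:

```python
# Sketch of the flow-filtering rule: keep a flow only when the entity it
# conveys shares data-lineage with the previous flow's entity AND with the
# constraint item; otherwise the lineage chain is broken.
def shares_lineage(a, b, lineage):
    """True when two entities have at least one common lineage reagent."""
    return bool(lineage.get(a, set()) & lineage.get(b, set()))

def extend_flow_lineage(flows, constraint_item, lineage):
    """Return the prefix chain of flows inferred to extend the flow lineage."""
    chain = []
    previous = constraint_item
    for entity in flows:
        if shares_lineage(entity, previous, lineage) and \
           shares_lineage(entity, constraint_item, lineage):
            chain.append(entity)
            previous = entity
        else:
            break  # lineage chain broken: stop extending
    return chain

lineage = {
    "Metadata": {"src"},
    "Extract": {"src"},
    "Load": {"src"},
    "Unrelated": {"other"},
}
print(extend_flow_lineage(["Extract", "Load", "Unrelated"], "Metadata", lineage))
# ['Extract', 'Load']
```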
The Lineage.fs script is included as an example with the Enterprise Hub, and can be scheduled either as a real-time change trigger (where a complex graph is calculated in a few seconds), as a scheduled refresh job, or both:
<!-- realtime scheduling on change -->
<connection name="name">
  <triggers>
    <trigger class="EA.Gen.Hub.Script.ScriptJob" assembly="EA.Gen.Hub.Script"
             description="lineage" workflow=".\Lineage.fs"
             elementClass="Element" type="InformationItem"/>
  </triggers>
</connection>

<!-- batch scheduling via Quartz -->
<schedule>
  <job name="PathwiseComplexity" startup="true" startAt="01:00" interval="01:00"
       frequency="Daily" connections="name">
    <trigger class="EA.Gen.Hub.Script.ScriptJob" assembly="EA.Gen.Hub.Script"
             description="lineage" workflow=".\Lineage.fs"/>
  </job>
</schedule>
This kind of highly recursive and data-intensive function cannot reasonably be performed within a client-side add-in, but is fast and efficient when scheduled through an Enterprise Hub.
When combined with the change-governance capability of the Enterprise Hub it is possible to include lineage reviews in the governance process.