What is Data Lineage?

Data lineage is the journey of your organization's data: from its creation at the source, through its movement and transformation into and out of the data platforms and repositories within and outside of your business, to its final resting place in the applications and business intelligence relied on for operations and decision-making. End-to-end data lineage provides a complete, detailed visual of how a given dataset flows into, across and out of your organization. It can help you plan IT modernization efforts, troubleshoot data quality issues, tackle data governance and regulatory compliance, and quickly see and understand the available data and its relationships within your organization.

Why is data lineage important?

Tracing the source and flow of data is an arduous task. As your organization accumulates information systems, it also accumulates data entry points and transformation rules for ever-moving data. Additionally, data integration tools, extract-transform-load (ETL) tools, procedural code and even APIs and business intelligence (BI) reports aggregate and transform data constantly. As a result, it’s difficult to manually compile and understand the complicated web of data formed among the systems within your organization and present it in a simple visual flow. Automated data lineage can provide IT, data governance teams and business users with current visibility and context of organizational data to work more efficiently, make sounder decisions, and better leverage and protect the data at their disposal.

Data lineage helps you understand where data originates, how it is transformed and how it moves through your organization. It can also bring together technical and business attributes and governance, spotlight sensitive data and other data classifications, and deliver data quality visibility, including helping users quickly conduct root-cause analysis of data quality issues. Whether used on its own or alongside automated impact analysis capabilities, data lineage can also be extremely useful for scoping and assessing the impact of potential data management, data intelligence and data platform migration efforts.

Benefits of data lineage

  • Business impact and data landscape visibility: Awareness and understanding of available data and its transformation and flow across multiple systems can influence and support the business decisions you make. For example, understanding the journey and context of data can help you answer questions about business intelligence reports, including who owns the data, what governance guidance applies, and how fit the data is for your use. With data lineage, you can better understand which data is aligned with critical business initiatives, where and how the data was sourced, and whether it should be trusted given its upstream lineage journey, associated data quality scores and more. Organizations can also more easily see the risk areas where critical privacy information may be exiting the company.
  • Data governance foundational support: Data ownership, accountability and traceability are at the heart of a sound data governance program. Combined with automation, data lineage can help data owners, data stewards, and IT and business users understand and validate data usage. Automated data lineage can deliver horizontal data lineage, from initial data source to target and all data transformations in between, as well as vertical data lineage, which provides insight into the associations between business and technical assets. End-to-end data lineage, both upstream and downstream, can reveal when and by whom any changes were made, and it brings together technical and business governance in a central visual for all data users. Data lineage is an essential transparency tool for data governance teams.
  • Regulatory compliance and auditability: Business terms and data policies should be implemented through documented business rules within your data intelligence and governance solution. You can then use data lineage to track regulatory compliance with those rules, incorporating auditability and validation controls each time the data is transformed. Data lineage also offers visual and time-saving proof to customers, employees and auditors that you are taking compliance regulations and privacy concerns seriously and implementing programs to support your efforts.
  • Collaboration across teams: Data lineage provides a common view of the data within your organization and its journey across business lines and disparate legacy and modern analytical environments. This transparency of the data and its usage fuels additional data sharing and clearly identifies critical handoffs during data integration projects. With one central reference point, teams throughout your organization can more efficiently work together to ensure the data within your organization is fully leveraged and properly protected.
  • Data quality visibility and targeting: Root-cause analysis is the first step in addressing poor data quality. With data lineage insight into data quality, data stewards can trace the information flow backwards to determine whether standardizations and transformations of the data were performed correctly. End-to-end data lineage allows you to fix a data quality problem at its source and enables teams to proactively monitor key data pipelines to ensure data integrity. Data lineage can also provide business users with a direct view of data fitness that builds trust.
  • IT planning and data management: Data lineage provides the insight and information needed for data architects and data engineers to understand the as-is state of your current data landscape and smartly manage and plan for IT modernization, data warehouse migrations, and other critical efforts. It provides you with a visual project scope of how many applications, transformation rules and business units will be impacted. Visibility of sensitive and other data classifications, data quality indicators and readily available governance context help IT teams better manage the data within the organization.
  • Data preparation: For data analysts preparing data, data lineage delivers the roadmap needed to ensure the best-fitting data is selected for the intended analysis. Data lineage can provide analysts with a visual of the systems, environments, tables and columns available. It serves as a reference point for both physical and logical naming and can include data quality, sensitive data indicators, and governance guardrails to efficiently guide the work of analysts and shorten data preparation time.
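
To make the root-cause analysis benefit concrete, here is a minimal Python sketch of tracing a data quality problem backwards through lineage. It assumes lineage is stored as simple (source, target) pairs and that each dataset has a pass/fail quality result. The dataset names, the EDGES and QUALITY_OK structures, and both functions are hypothetical illustrations, not the API of any particular lineage product.

```python
from collections import deque

# Hypothetical lineage edges: each pair is (source dataset, target dataset).
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.sales"),
    ("staging.orders", "warehouse.sales"),
    ("warehouse.sales", "bi.revenue_report"),
]

# Hypothetical data quality results per dataset (True means the check passed).
QUALITY_OK = {
    "crm.customers": True,
    "erp.orders": False,
    "staging.customers": True,
    "staging.orders": False,
    "warehouse.sales": False,
    "bi.revenue_report": False,
}


def parents_of(edges):
    """Index the edges so each target maps to its direct sources."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)
    return parents


def upstream(node, edges):
    """Return every dataset that feeds `node`, directly or indirectly."""
    parents = parents_of(edges)
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def root_causes(node, edges, quality_ok):
    """Failing datasets at or upstream of `node` whose own inputs all passed."""
    parents = parents_of(edges)
    candidates = upstream(node, edges) | {node}
    return sorted(
        name for name in candidates
        if not quality_ok[name]
        and all(quality_ok[p] for p in parents.get(name, []))
    )


print(root_causes("bi.revenue_report", EDGES, QUALITY_OK))
```

In this toy landscape the failing report is traced back to a single source system, which is where a data steward would start the fix. A real tool would work from harvested metadata rather than hand-written pairs.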

Is data lineage part of data governance?

Yes. Effective data governance requires that business glossaries, data dictionaries and data privacy policies live in one central location. But as the volume of data grows in an organization, those objects end up in different formats in different places, expressed in different ways. Data lineage helps the organization easily track, monitor and update them over time. Data lineage offers advantages in governance practices such as consistency of terminology, data quality remediation, auditing of data pipelines and root-cause analysis during data incidents.

What are the different types of data lineage?

Data lineage includes both business lineage and technical lineage.

Business lineage focuses on the data source overview to help data governance teams and business users understand the flow of data between different data sources within an organization’s data landscape.

Technical lineage focuses on the organization’s data flow at the table and column levels, the underlying transformation layers and rules, temporary tables/files, and other objects of interest for technical users to understand an organization’s data journey.

Both business and technical lineage views are critical for organizations to understand the value of their data and easily assess the impact of changes. The ability to drill down from business lineage to technical lineage — or vice versa — provides added flexibility for data analysts to quickly get the right perspective of data flow.
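
One way to picture the relationship between the two views is to treat business lineage as a roll-up of technical lineage. The Python sketch below assumes column-level edges written as "system.table.column" strings with a transformation rule attached. The sample edges and the function names (system_of, business_view, drill_down) are hypothetical and chosen only for illustration.

```python
# Hypothetical column-level (technical) lineage: (source, target, rule),
# where each endpoint is written as "system.table.column".
TECHNICAL_LINEAGE = [
    ("crm.customers.email", "staging.customers.email", "lowercase"),
    ("crm.customers.id", "staging.customers.customer_id", "copy"),
    ("staging.customers.customer_id", "warehouse.dim_customer.customer_key", "key lookup"),
    ("staging.customers.email", "warehouse.dim_customer.email", "copy"),
]


def system_of(column_path):
    """Take 'system.table.column' and return just the system name."""
    return column_path.split(".")[0]


def business_view(technical_edges):
    """Collapse column-level edges into distinct system-to-system flows."""
    flows = {(system_of(src), system_of(dst)) for src, dst, _rule in technical_edges}
    return sorted(flows)


def drill_down(technical_edges, source_system, target_system):
    """Return the column-level edges behind one system-to-system flow."""
    return [
        edge for edge in technical_edges
        if system_of(edge[0]) == source_system and system_of(edge[1]) == target_system
    ]


print(business_view(TECHNICAL_LINEAGE))
for edge in drill_down(TECHNICAL_LINEAGE, "crm", "staging"):
    print(edge)
```

Keeping the column-level edges as the single source and deriving the overview from them means the two views cannot drift apart, which is one plausible design for supporting drill-down in both directions.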

Not all automated data lineage is delivered in the same way. Referred lineage may pull lineage together based on element or attribute names and composition. More-detailed data lineage is based on code at the element level and is more trustworthy.
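
As a rough illustration of lineage pulled together from names alone, the Python sketch below pairs columns across two systems whenever their normalized names match. The normalization rule, the column names and the match_by_name function are assumptions made for this example. It also shows why this approach is less trustworthy: a name match suggests a relationship but does not prove that data actually flows between the two columns.

```python
import re


def normalize(name):
    """Lowercase a column name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def match_by_name(source_columns, target_columns):
    """Pair source and target columns whose normalized column names are equal."""
    by_name = {}
    for column in source_columns:
        key = normalize(column.split(".")[-1])
        by_name.setdefault(key, []).append(column)
    pairs = []
    for column in target_columns:
        key = normalize(column.split(".")[-1])
        for source in by_name.get(key, []):
            pairs.append((source, column))
    return pairs


# Hypothetical columns harvested from two systems.
SOURCE = ["crm.customers.Customer_ID", "crm.customers.email"]
TARGET = ["staging.customers.customerid", "staging.customers.created_at"]

print(match_by_name(SOURCE, TARGET))
```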

Data lineage is quickly evolving to become more intelligent, using AI to capture patterns in difficult-to-parse code.

What is a data catalog and lineage?

A data catalog is a centralized view of the data assets within an organization used to manage, leverage, protect, understand and socialize an organization’s data. A data catalog comprises the metadata, or data describing your data, that is collected from the data sources, data platforms, applications and business intelligence systems throughout your business. It is then further enriched with business context and data governance guidance. Automated data cataloging takes advantage of automated data discovery and metadata harvesting capabilities using platform-specific data connectors for data-at-rest metadata.

Automated data lineage leverages the data-at-rest metadata. It also captures data-in-motion metadata, taking advantage of data connector reverse-engineering capabilities. It parses the code involved in data movement, such as procedural code and ETL, to map the flow of data throughout an organization. The resulting lineage aids IT, data governance teams and business users in visually understanding the data assets available within the organization. It specifies the journey of data from initial data source to final target applications, with all data transformations in between.
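
To give a feel for what parsing data movement code involves, here is a deliberately simplified Python sketch that extracts table-level lineage from one INSERT INTO ... SELECT statement using regular expressions. The SQL text, the table names and the table_lineage function are hypothetical. A production lineage tool would need a full parser for each SQL dialect and ETL format, and this sketch ignores subqueries, CTEs and column-level mappings.

```python
import re


def table_lineage(sql):
    """Return (sources, target) for a simple INSERT INTO ... SELECT statement.

    Returns None when no INSERT INTO target is found.
    """
    target_match = re.search(r"\binsert\s+into\s+([\w.]+)", sql, re.IGNORECASE)
    if target_match is None:
        return None
    # Table names that directly follow FROM or JOIN are treated as sources.
    sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
    return sorted(set(sources)), target_match.group(1)


# Hypothetical ETL statement that loads a warehouse table.
ETL_SQL = """
INSERT INTO warehouse.sales (customer_id, total)
SELECT c.id, SUM(o.amount)
FROM staging.customers AS c
JOIN staging.orders AS o ON o.customer_id = c.id
GROUP BY c.id
"""

sources, target = table_lineage(ETL_SQL)
for source in sources:
    print(f"{source} -> {target}")
```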

What is a data lineage tool?

A data lineage tool uses automation to discover and map the flow of data from initial data sources, through transformation and processing, to final destinations. For example, in mapping the data pipelines populating your data warehouse, a data lineage tool might reverse-engineer the ETL tasks that immediately precede the loading of the data into the target warehouse. Working backwards from those tasks to the data sources they read, the tool accumulates a task list and documents each discovered piece of lineage. At the end of the sequence, the process will have reverse-mapped the pipelines for populating that warehouse. Automated data lineage tools make it practical to harvest data lineage and keep it current. They can save data stewards up to 95 percent of the time traditionally spent on manual data lineage methods.
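
The reverse-mapping walk described above can be sketched as a worklist loop. The Python example below assumes each ETL task is already known by the datasets it reads and writes. The TASKS dictionary, its task names and the reverse_map function are hypothetical, and a real tool would derive this information by reverse-engineering the ETL code rather than from a hand-written table.

```python
from collections import deque

# Hypothetical ETL tasks, each described by the datasets it reads and writes.
TASKS = {
    "load_sales": {
        "reads": ["staging.orders", "staging.customers"],
        "writes": ["warehouse.sales"],
    },
    "stage_orders": {"reads": ["erp.orders"], "writes": ["staging.orders"]},
    "stage_customers": {"reads": ["crm.customers"], "writes": ["staging.customers"]},
    "export_marketing": {"reads": ["crm.customers"], "writes": ["marketing.contacts"]},
}


def reverse_map(target, tasks):
    """Walk backwards from `target`, collecting tasks and lineage records."""
    discovered = []   # task names, in the order they were found
    lineage = []      # (source dataset, task name, target dataset)
    seen = {target}
    queue = deque([target])
    while queue:
        dataset = queue.popleft()
        for name, task in tasks.items():
            if dataset not in task["writes"]:
                continue  # this task does not populate the current dataset
            if name not in discovered:
                discovered.append(name)
            for source in task["reads"]:
                lineage.append((source, name, dataset))
                if source not in seen:
                    seen.add(source)
                    queue.append(source)
    return discovered, lineage


task_list, records = reverse_map("warehouse.sales", TASKS)
print(task_list)
for record in records:
    print(record)
```

Tasks that do not contribute to the chosen warehouse table, such as the marketing export in this sample, never enter the task list, so the result is scoped to the pipelines that actually populate the target.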

Where can I get help with data lineage?

erwin Data Intelligence by Quest® delivers automated data lineage from source to target, with all data transformations in between. IT, data governance teams and business users can navigate data lineage with technical and business context, data quality visibility and sensitive data indicators.

Get started now

Learn how erwin Data Intelligence can help you see and understand your entire data landscape with business-friendly data lineage, impact analysis, data quality scoring and more. Maximize the business impact of your data.