Demystifying Data Lineage: A Comprehensive Guide

The life of a passport

Ross Moore

Not everyone has a passport, but for those who do and who travel extensively, a passport paints a fascinating picture. With all its stamps, signatures, and colors, it’s a work of art. Add journal entries, receipts, and tickets, and a life story unfolds.

This recorded story of where a well-traveled person has been is a form of data lineage.

Data, and data, and more data

Data is everywhere. Not only that, but the staggering amount of data has created numerous roles for attending to data: data scientist, data analyst, data engineer, data science engineer, architect, librarian, specialist, visualization specialist – and that’s just a sampling.

Data is key to digital transformation. With all of that data coming in, having reliable data to analyze and engineer is critical, and this leads to data intelligence. Data intelligence answers the 5 Ws: Who, What, Where, When, and Why. “Where did the data come from?” and “When did it move there?” are just two of the many questions that must be answered to have an intelligent view of data and, therefore, to make intelligent decisions about and with the data.

Data lineage is one way to design solutions that properly track and use the data. Data lineage “is the process of tracking data as it moves within an organization to understand its origins, the ways it’s been modified, as well as who is using it and how.”

The CIA triad (Confidentiality, Integrity, and Availability) plays a major role in data assurance. The data needs to be available at the right times to the right people, and that’s determined by the type of data, the needs of internal and external customers, and the business vision and mission.

The data needs to be secured to ensure that it’s viewable and changeable only by authorized personnel. In addition to securing the data, compliance and regulations have to be thrown into the mix. On top of that, data privacy is a factor. I mention it because one aspect of privacy is customer-driven: regardless of regulatory needs, customers expect data privacy, and it is a growing business advantage for businesses that want to get ahead of upcoming regulations.

More about Data Lineage

Data lineage tracks and documents the flow of data from its origin to its destination, including transformations and changes along the way. This provides a clear understanding of how data moves through an organization's systems, enabling better data management, governance, and error tracing in data analytics processes.

Key aspects of data lineage include:
  • Data origin: Identifying where the data comes from, such as a specific data source or system.
  • Data flow: Understanding how the data moves through various systems, processes, and transformations.
  • Data changes: Tracking any modifications or transformations that occur to the data as it moves through the organization.
  • Data destination: Determining where the data ultimately ends up, such as a data warehouse or a reporting tool.
Data lineage uses metadata (data about the data) to enable both end users and data management professionals to track the history of data assets and get information about their business meaning or technical attributes.
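The four aspects above can be captured as plain metadata. The following is a minimal sketch, not a reference to any particular lineage product: the class names, field names, and the example systems (`crm`, `etl_pipeline`, `warehouse`) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LineageEvent:
    """One step in a data asset's history. Field names are hypothetical."""
    system: str    # where the data was at this step
    action: str    # what happened: "created", "email_masked", "loaded", ...
    actor: str     # who or what performed the step
    at: datetime   # when it happened


@dataclass
class LineageRecord:
    """Metadata describing the path of a single data asset."""
    asset: str
    events: List[LineageEvent] = field(default_factory=list)

    def record(self, system: str, action: str, actor: str,
               at: Optional[datetime] = None) -> None:
        """Append a step, stamped with the current UTC time by default."""
        when = at or datetime.now(timezone.utc)
        self.events.append(LineageEvent(system, action, actor, when))

    @property
    def origin(self) -> Optional[str]:
        """Data origin: the system of the first recorded step."""
        return self.events[0].system if self.events else None

    @property
    def destination(self) -> Optional[str]:
        """Data destination: the system of the most recent step."""
        return self.events[-1].system if self.events else None

    def flow(self) -> List[str]:
        """Data flow: systems visited, in order, without consecutive repeats."""
        path: List[str] = []
        for event in self.events:
            if not path or path[-1] != event.system:
                path.append(event.system)
        return path

    def changes(self) -> List[str]:
        """Data changes: the actions applied, in order."""
        return [event.action for event in self.events]


# Example with made-up system and job names.
orders = LineageRecord(asset="customer_orders")
orders.record("crm", "created", "sales_app")
orders.record("etl_pipeline", "email_masked", "nightly_job")
orders.record("warehouse", "loaded", "nightly_job")

print(orders.origin, "->", orders.destination)
print(orders.flow())
print(orders.changes())
```

Each event also stores an actor and a timestamp, so the same record can answer the Who and When questions alongside Where and What.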

Data Governance

Data lineage is a crucial component of data governance: it provides transparency and traceability in data movement and transformations, helps ensure data quality, and supports compliance with regulations.

Here are some ways that data lineage can help with data governance:
  • Regulatory compliance: Data lineage can help organizations comply with regulations by providing a clear understanding of where data comes from, how it's transformed, and where it's stored. This information can be used to demonstrate compliance with regulations such as GDPR and HIPAA.
  • Data quality: Data lineage can help ensure data quality by identifying the source of data quality issues and enabling organizations to trace them back to their origin. This can help organizations improve their data quality processes and prevent future issues.
  • Risk management: Data lineage can help organizations manage risks associated with data by providing a clear understanding of where data comes from, how it's transformed, and where it's stored. This information can be used to identify potential risks and take steps to mitigate them.
  • Data discovery: Data lineage can help organizations discover new data sources and understand how they relate to existing data sources. This can help organizations identify new opportunities for data analysis and improve their overall data management processes.
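Tracing a data quality issue back to its origin amounts to walking the lineage upstream. Here is a small sketch, assuming lineage is stored as a mapping from each dataset to the datasets it is built from. The dataset names and the `trace_to_origins` helper are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical lineage: each dataset maps to the datasets it is built from.
UPSTREAM: Dict[str, List[str]] = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
    "orders_raw": [],
    "crm_export": [],
}


def trace_to_origins(dataset: str, upstream: Dict[str, List[str]]) -> Set[str]:
    """Return the root sources (datasets with no parents) feeding `dataset`."""
    origins: Set[str] = set()
    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and repeated work
        seen.add(current)
        parents = upstream.get(current, [])
        if not parents:
            origins.add(current)  # nothing upstream: this is a source
        stack.extend(parents)
    return origins


# A bad number on the dashboard could come from either of these sources.
print(sorted(trace_to_origins("sales_dashboard", UPSTREAM)))
```

In this made-up example, a suspicious figure on `sales_dashboard` narrows the investigation to `orders_raw` and `crm_export` rather than every system in the organization.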

Data Security

“Defenders think in lists. Attackers think in graphs. As long as this is true, attackers will win.” To win the data security challenge, data defenders need to understand the relationship between their internal and external sources and repositories. Where is the customer’s data? When was the data changed? Where are the customers? How do our products relate to our customers? Seeing these relationships – thinking in graphs – leads to a much better security posture, because the data can be protected properly according to its importance and flow.
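To make "thinking in graphs" concrete, the sketch below treats data flows as a directed graph and asks where data from one system can end up. It is an illustration only: the system names and the `downstream_of` helper are invented for this example.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical data flows: each system maps to the systems it sends data to.
FLOWS: Dict[str, List[str]] = {
    "web_signup_form": ["crm"],
    "crm": ["billing", "warehouse"],
    "billing": ["warehouse"],
    "warehouse": ["bi_dashboard", "partner_export"],
    "bi_dashboard": [],
    "partner_export": [],
}


def downstream_of(source: str, flows: Dict[str, List[str]]) -> Set[str]:
    """Return every system that data from `source` can reach (breadth-first)."""
    reached: Set[str] = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for target in flows.get(current, []):
            if target != source and target not in reached:
                reached.add(target)
                queue.append(target)
    return reached


# Where can customer data entered in the CRM end up?
print(sorted(downstream_of("crm", FLOWS)))
```

A flat list of systems would show `crm` as a single line item. The graph view shows that customer data from it also reaches `partner_export`, which may need the same level of protection.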

What to Look For in a Data Lineage Tool

When selecting a data lineage tool, here are some key features to consider.

1. Robust data import capabilities: The tool should be able to import data from a variety of sources, including databases, data warehouses, and data lakes.

2. Column- and field-level lineage: The tool should provide a granular view of data lineage, showing how individual columns and fields are transformed and used throughout the organization.

3. Data tagging: The tool should be able to apply metadata tags to data sets to help describe and characterize them for better data management.

4. Visual representation: The tool should provide a visual representation of data lineage, allowing users to explore the data flow, transformations, and dependencies in a more intuitive way.

5. Data quality management: The tool should help users identify data quality issues and trace them back to their origin, enabling organizations to improve their data quality processes and prevent future issues.

6. Regulatory compliance: The tool should help organizations comply with regulations by providing a clear understanding of where data comes from, how it's transformed, and where it's stored.
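Column-level lineage and data tagging (features 2 and 3 above) work well together: once a tool knows which source columns feed a derived column, tags can follow the data. The sketch below shows the idea under simple assumptions. The column names, the tags, and the `inherited_tags` helper are hypothetical and do not describe how any specific tool behaves.

```python
from typing import Dict, List, Optional, Set

# Hypothetical column-level lineage: target column -> source columns.
COLUMN_SOURCES: Dict[str, List[str]] = {
    "report.full_name": ["crm.first_name", "crm.last_name"],
    "report.order_total": ["orders.unit_price", "orders.quantity"],
    "report.contact": ["report.full_name", "crm.email"],
}

# Hypothetical tags applied directly to source columns.
TAGS: Dict[str, Set[str]] = {
    "crm.first_name": {"pii"},
    "crm.last_name": {"pii"},
    "crm.email": {"pii", "contact"},
    "orders.unit_price": {"financial"},
}


def inherited_tags(column: str,
                   sources: Dict[str, List[str]],
                   tags: Dict[str, Set[str]],
                   _seen: Optional[Set[str]] = None) -> Set[str]:
    """Tags on `column` plus tags inherited from every upstream column."""
    seen = _seen if _seen is not None else set()
    if column in seen:
        return set()  # already visited: avoids infinite loops on cycles
    seen.add(column)
    result = set(tags.get(column, set()))
    for parent in sources.get(column, []):
        result |= inherited_tags(parent, sources, tags, seen)
    return result


print(sorted(inherited_tags("report.contact", COLUMN_SOURCES, TAGS)))
print(sorted(inherited_tags("report.order_total", COLUMN_SOURCES, TAGS)))
```

Here `report.contact` carries no tag of its own, yet it inherits `pii` from the columns it is derived from, which is the kind of detail table-level lineage alone would miss.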

Data from, for, and to People

The game Telephone, where one person whispers a message to the next person, on down the line from player to player, until the last person says out loud what was heard (it’s never the original message!), shows how quickly data degrades in transmission.

Data lineage is required for a company’s data to remain viable; losing data integrity negatively impacts the business value of data. The better the insight into all aspects of the data, the better the decisions a company can make, and the better the value it can provide.

About the Author:

Ross Moore

Ross Moore is the Cyber Security Support Analyst with Passageways. He has experience with ISO 27001 and SOC 2 Type 2 implementation and maintenance. Over the course of his 20+ years of IT and Security, Ross has served in a variety of operations and infosec roles for companies in the manufacturing, healthcare, real estate, business insurance, and technology sectors. He holds (ISC)2’s SSCP along with CompTIA’s Pentest+ and Security+ certifications, a B.S. in Cyber Security and Information Assurance from WGU, and a B.A. in Bible/Counseling from Johnson University. He is also a regular writer at Bora.
