Data Management Glossary
Data is a collection of qualitative or quantitative values from which inferences can be drawn to aid decision-making. Examples of organizational data include sales figures, capital expenditure (CAPEX), and operating expenditure (OPEX).
A database is a facility used to organize, store, manage, safeguard, and control access to data. A database's design follows a schema, which defines how the data is organized so that programs and queries can access it easily. Examples include relational database management systems (RDBMS), in-memory databases, object-oriented databases (OODBMS), NoSQL databases, and NewSQL databases. Each of these has its own set of pros and cons.
Data management is a collection of processes required to collect, store, control, protect, operate on, and deliver data. The system is built up of a network of data storage units (databases, data warehouses, data marts, etc.), data collection tools, data retrieval mechanisms, and tools/processes that determine data governance. The entire system is then integrated with data analytics tools to derive meaningful insights. A data strategy also forms a core part of the data management function, where it strives to establish accountability for data that originates in, or is endemic to, particular areas of responsibility.
A database management system (DBMS) aids in the creation and management of databases. It handles the storage and data management part of the system. Many data manipulation operations are performed by users in a DBMS. All the applications dependent on the DBMS must be integrated to ensure the smooth functioning of both the application and the system. The DBMS is essentially a toolkit for database management.
A relational database collects information that is organized in predefined relationships. The data is stored in a tabular format which defines the relationships between various data points. Relational databases use structured query language (SQL) to let administrators communicate with the database, join tables, insert and delete data, and more.
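The operations named above (inserting, joining, and deleting) can be sketched with Python's built-in sqlite3 module and an in-memory database. The table names (customers, orders) and all rows are hypothetical, invented only for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
""")

# Insert hypothetical rows into both tables.
conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ben")])
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 50.0)])

# Join the tables on the shared key and total the orders per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Delete one customer's orders, then count the rows that remain.
conn.execute("DELETE FROM orders WHERE customer_id = ?", (2,))
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()
```

The relationship between the two tables lives in the customer_id column, which is what makes the join possible.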
Data that is not stored in a tabular format is typically held in a NoSQL database. Unlike relational databases, NoSQL databases support a variety of data models: document, key-value, wide-column, and graph, to name a few. They scale well and can therefore handle high volumes of data. NoSQL databases are in high demand, particularly among Web 2.0 companies.
Data that is set in a standardized format, has a well-defined structure, complies with a data model, and is easy to access is termed structured data. It can range from a simple Excel sheet to data accessed from a relational database. Financial transaction information, geographic and climatic information, and demographic targeting data for marketing can all be classified as structured data.
Any data that does not conform to predefined standards and lacks structure or architecture is called unstructured data. It is not organized into rows and columns, which makes it more difficult to store, analyze, and search. It is typically not stored in a relational database, and accessing it is harder. Examples include raw Internet of Things (IoT) data, video and audio files, social media comments, and call center transcripts. Unstructured data is usually stored in data lakes, NoSQL databases, or modern data warehouses.
A simple definition of semi-structured data is data that can’t be organized in relational databases or doesn’t have a strict structural framework, yet does have some structural properties or a loose organizational framework. Semi-structured data includes text that is organized by subject or topic or fits into a hierarchical programming language, yet the text within is open-ended, having no structure itself. A good example of semi-structured data is e-mail – which includes some structured data, like the sender and recipient addresses, but also unstructured data, like the message itself.
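The e-mail example can be sketched with Python's standard email package. The message text and addresses below are made up for illustration; the point is only that the headers can be read by name while the body is free text.

```python
from email import message_from_string

# A hypothetical raw message: header lines, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Hi Bob, the figures look better than expected. Let's talk tomorrow.\n"
)

msg = message_from_string(raw)

# Structured part: named header fields that can be looked up by key.
headers = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
}

# Unstructured part: free text with no internal schema.
body = msg.get_payload()
```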
Data mapping matches fields from one database or data structure to those in another. It is usually the first step, and it must be completed before data migration, data integration, or other data management actions can run smoothly. It is especially helpful when data is populated from multiple sources. Most data analytics tools are designed to act as a single source of truth (one source for all data queries), and data mapping supports this by keeping the processed data consistent: duplicates, conflicting data, and similar problems are removed.
Data modeling is the process of visually representing data flows within a system, as a whole or in parts, to understand the connections between data points and structures. Data modeling helps us understand the relationships between various data types and how they can be organized or grouped based on attributes. From this flow diagram, software engineers can define the characteristics of the data formats, structures, and database handling functions to efficiently support the data flow requirements.
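In its simplest form, field matching can be expressed as a lookup table from source field names to target field names. The sketch below assumes a hypothetical source system and target schema; FIELD_MAP, map_record, and every field name are illustrative, not taken from any real tool.

```python
# Hypothetical mapping from a source system's field names to a target schema.
FIELD_MAP = {
    "cust_name": "customer_name",
    "e_mail": "email",
    "tel": "phone",
}


def map_record(source: dict, field_map: dict) -> dict:
    """Rename source fields to their target names; unmapped fields are dropped."""
    return {
        target: source[src]
        for src, target in field_map.items()
        if src in source
    }


source_row = {"cust_name": "Asha Rao", "e_mail": "asha@example.com", "fax": "n/a"}
target_row = map_record(source_row, FIELD_MAP)
```

Dropping unmapped fields (here, fax) is one possible design choice; a stricter variant could raise an error instead so that no source field is lost silently.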
A data warehouse is a comprehensive single source of storage into which data flows from different sources, both internal and external. Data engineers and other stakeholders access data for business intelligence (BI), reporting, and analytics through a data warehouse. A modern data warehouse plays a central role in data-driven decision-making and can manage all data types, structured and unstructured. Modern data warehouses are cloud-ready, enabling access at all times.
Extremely large datasets consisting of structured, unstructured, and semi-structured data that traditional data processing software cannot handle are called big data. Big data is defined by its five V’s: Velocity, Veracity, Volume, Variety, and Value. Velocity is the speed at which data is generated, Veracity is the degree to which the data can be trusted, Volume is the amount of data handled, Variety is the range of data types stored, and Value is what the data provides in a business context. There are dedicated systems to mine big data for deep insights that aid in data-driven decision-making.
Data integration is the process of bringing data from multiple sources to a single source of truth. This is aimed at breaking data silos across the enterprise and beyond – including partners as well as third-party data sources and use cases. Techniques include bulk/batch data movement, extract, transform, load (ETL), change data capture, data replication, data virtualization, streaming data integration, data orchestration, and more.
Data from disparate sources and formats is unified in a virtual layer. This process is called data virtualization. It centralizes data security and governance and delivers data to users in real time. This saves the time otherwise spent duplicating data and helps users discover, access, act on, and manipulate data in real time regardless of its physical location, format, or protocol.
Data fabric is a combination of architecture and technology to break data silos and improve ease of data access for self-service data consumption. This concept is agnostic to location, sources, and data types, and enhances the end-to-end data management capabilities. It also automates data discovery, governance, and consumption enabling companies to quickly access and share data regardless of where it is or how it was generated.
A data pipeline defines the flow of data from a source to its intended target via different elements connected in series. The output of one element is the input of the next. A data pipeline helps connect various data storage systems, and the data flow can be automated to run at specific intervals.
Data silos denote a situation where data access from other departments or functions is restricted. This can prove quite disastrous, as businesses would be impacted in terms of cost, an inability to predict market changes, and a loss of agility in responding to market fluctuations. Data silos also lead to duplication of data, which leads to gaps in coordination between teams.
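The idea of elements connected in series, each consuming the previous element's output, can be sketched as a chain of functions. The stage names and sample values below are hypothetical, and a production pipeline would normally rely on a scheduler or orchestration tool rather than a plain function call.

```python
from functools import reduce
from typing import Callable, Iterable

Stage = Callable[[list], list]


def run_pipeline(data: list, stages: Iterable[Stage]) -> list:
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


def drop_missing(rows: list) -> list:
    """Remove entries that carry no value."""
    return [r for r in rows if r is not None]


def to_float(rows: list) -> list:
    """Convert text values to numbers."""
    return [float(r) for r in rows]


def round_2(rows: list) -> list:
    """Round each number to two decimal places."""
    return [round(r, 2) for r in rows]


result = run_pipeline(["1.5", None, "2.257"], [drop_missing, to_float, round_2])
```

Because each stage only depends on the shape of its input, stages can be reordered, added, or removed without touching the others.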
Data wrangling, also called data munging, is the process of transforming and mapping raw data into a format that databases and applications can access and read. The process may include structuring, cleaning, enriching, and validating data as necessary to make raw data useful. This process ensures that data is appropriate for downstream activities like analytics and reporting.
Data security is the practice of protecting data from unauthorized access or exposure, disaster, system failure, data corruption, data theft, and more. It also ensures that data is readily accessible to approved users and applications. It spans physical data security methods as well as the policies set around data security. Data encryption, key management, redundancy and backup practices, and access controls are some of the methods used. Amid rising data threats and privacy concerns, data security plays a central role. Data backup is a major requirement of Business Continuity Plans (BCP).
Data privacy deals with the activity of restricting and controlling relevant data access within and outside the organization. This policy determines the kind of data that can be accessed and stored within the organization’s database, the level of consent required, and other regulatory requirements set by various governing bodies.
Data quality determines the reliability and usefulness of the data. Data is of good quality when it satisfies five requirements: accuracy, completeness, consistency, reliability, and recency. Data quality is ensured by implementing an end-to-end data strategy supported by industry standards, best practices, tools, and systems, together with a fool-proof data management policy.
Data validation is the process of ensuring that data is cleansed, of good quality, accurate, and valid, so that its usefulness is established before it is consumed for analytics and reporting. Data authentication, data cleansing, and checking that data is free of errors and duplicates are key steps in implementing data validation. Businesses need trustworthy, accurate, and quality data for decision-making, and data validation ensures precisely that.
Data cleansing, also called data scrubbing, is the process of fixing incorrect data, clearing duplicate records, completing incomplete data, and so on, to ensure that the organization has access to clean, accurate, and usable data. This is a major step in data analysis, as erroneous data can have catastrophic effects in the long run. A simple example is when two sales representatives enter the same proper noun in different ways, such as one person's name spelled with a ‘y’ by one representative and with an ‘i’ by the other. This leads to duplicate records, which inflate the sales figures and falsify revenue in the system.
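A minimal sketch of rule-based validation follows. The rules, the field names (customer_id, amount, email), and the validate_record helper are assumptions chosen for illustration; real validation rules depend on the organization's own data model.

```python
def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # Required identifier must be present and non-empty.
    if not record.get("customer_id"):
        errors.append("customer_id is missing")
    # Amount must be a number and must not be negative.
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    # Very rough e-mail check, only for illustration.
    if "@" not in str(record.get("email", "")):
        errors.append("email looks malformed")
    return errors


good = validate_record({"customer_id": "C1", "amount": 10.5, "email": "a@example.com"})
bad = validate_record({"customer_id": "", "amount": -3, "email": "nope"})
```

Returning every problem at once, rather than stopping at the first, makes it easier to report on a whole batch before it reaches analytics.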
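The spelling example can be sketched with fuzzy string matching from Python's standard difflib module. This is one possible technique, not one the text prescribes; the 0.9 threshold, the names, and the deal amounts are all hypothetical, and in practice near-matches would usually be reviewed before records are merged.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when two names are nearly identical after normalizing case and spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def dedupe(records: list) -> list:
    """Keep the first record in each group of near-identical customer names."""
    kept = []
    for rec in records:
        if not any(similar(rec["customer"], k["customer"]) for k in kept):
            kept.append(rec)
    return kept


records = [
    {"customer": "Kathryn Smith", "deal": 500},
    {"customer": "Kathrin Smith", "deal": 500},  # same deal, entered by a second rep
    {"customer": "Ben Okoye", "deal": 200},
]
clean = dedupe(records)

# The duplicate inflates the total until it is removed.
inflated_total = sum(r["deal"] for r in records)
clean_total = sum(r["deal"] for r in clean)
```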
Data integrity ensures that data stays unchanged over the long term. Once data has been entered into the database after prior steps like data cleansing, wrangling, and validation, users can rest assured that it will not change for any reason. This assurance is what data integrity provides. Although data integrity deals with reliability and dependability, it is sometimes also treated as a synonym for data quality.
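One common way to detect that stored data has changed is to record a checksum when the data is written and recompute it later. The sketch below uses SHA-256 from Python's hashlib; the payload is made up, and a checksum only detects changes, it does not prevent them.

```python
import hashlib


def checksum(payload: bytes) -> str:
    """SHA-256 digest of the payload, as a hex string."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical record; the digest is saved alongside the data at write time.
stored = b"order_id=42;amount=199.00"
recorded_digest = checksum(stored)

# Later: recompute the digest and compare it to detect any change.
unchanged = checksum(b"order_id=42;amount=199.00") == recorded_digest
tampered = checksum(b"order_id=42;amount=999.00") == recorded_digest
```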
Data Governance is a set of principles that ensures consistency and reliability of the data. Data Governance takes stock of legal and regulatory requirements, along with introducing and adhering to industry best practices. Data Governance is also highly subjective as it encompasses the organization’s process standards as part of its structure. This includes the process of loading and storing data, access restrictions, and more.
Data stewardship is the process of implementing and controlling data governance policies and procedures to ensure data accuracy, reliability, integrity, and security. The data leaders overseeing data stewardship have control over the procedures and tools used to handle, store, and protect data.
Data architecture is a framework of models, policies, and standards used by an organization to manage the data flow. Data has to be regularly cleaned to improve ease of access which will help other team members. Successful data architecture standardizes the processes to capture, store, transform and deliver usable data to people who need it. It identifies the business users who will consume the data and their varying requirements.
Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise’s official shared master data assets. It includes policies and procedures for defining, managing, and controlling (or governing) the handling of master data. Centralized master data management eliminates conflict and confusion that stems from scattered databases with duplicate information and uncoordinated data that might be out-of-date, corrupted, or displaced in time – updated in one place but not in another. Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies, and a chart of accounts.
Data analytics is the process of crunching data to understand patterns and trends from which meaningful insights for business decisions can be derived. The results are generally represented in a visual format like graphs or charts, which are later incorporated into various reports or dashboards. There are four stages of evolution in the data analytics maturity curve: Descriptive, Diagnostic, Predictive, and Prescriptive.
Data mining is a tool in the broader spectrum of data analytics that helps by sifting through large data sets to identify patterns and relationships. For example, data mining might reveal the most common factors associated with a rise in insurance claims. Data mining can be conducted manually or automatically with machine learning technology.
Data profiling deals with understanding the statistics and traits of a dataset, such as its accuracy, completeness, and validity. Data profiling helps in data validation and data cleansing efforts, as it helps detect quality issues like duplication, data errors, missing data points, and inconsistencies.
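A minimal sketch of profiling a single column is shown below: it counts rows, missing values, distinct values, and values that occur more than once. The profile helper and the sample rows are hypothetical.

```python
from collections import Counter


def profile(rows: list, column: str) -> dict:
    """Basic profile of one column: row count, missing values, and duplicates."""
    values = [row.get(column) for row in rows]
    # Treat both absent keys and empty strings as missing.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


rows = [
    {"email": "a@example.com"},
    {"email": "b@example.com"},
    {"email": "a@example.com"},
    {"email": ""},
    {},
]
report = profile(rows, "email")
```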
As opposed to previous years, data has taken over the way businesses operate. It’s at the forefront of the world powered by information. Organizations use this data, process it, analyze it, and derive meaningful insights from it to aid in decision making. This process of leveraging technology to make data-driven decisions that will positively impact the business and revenue is termed Business Intelligence.
Often called the most basic form of data analytics, descriptive analytics deals with breaking down big numbers into consumable smaller formats. It helps in understanding basic trends but doesn’t help with deeper analysis. It stops at answering the “what”. Small everyday operations rely on descriptive analytics for their day-to-day planning.
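As a small illustration of answering the "what", the sketch below condenses a list of daily sales into a few summary figures using Python's statistics module. The sales numbers are invented.

```python
from statistics import mean, median

# Hypothetical daily sales for one working week.
daily_sales = [120, 95, 130, 110, 145]

# Reduce the raw numbers to a handful of figures that are easy to read.
summary = {
    "total": sum(daily_sales),
    "mean": mean(daily_sales),
    "median": median(daily_sales),
    "best_day": max(daily_sales),
    "worst_day": min(daily_sales),
}
```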
The next step in the analytics journey, diagnostic analytics is aimed at understanding the “why” behind an occurrence. Studying the cause helps organizations mitigate problems or plan better for the future. Diagnostic analytics uses data drilling, data mining, and correlation analysis to uncover underlying causes. For example, when product marketing teams plan a product launch campaign, diagnostic analytics reports on previous campaigns help them plan better.
As the name suggests, predictive analytics is the science of predicting future scenarios in business based on historical data. It relies on advanced statistical analysis, data mining, and machine learning for the system to come out with a comprehensive prediction. It helps business leaders in data-driven decision-making and proactively mitigating risks. An everyday example would be analyzing a potential candidate’s past payment behavior and predicting on-time payment probability for a bank to extend credit lines.
As the final stage of the data analytics maturity curve, prescriptive analytics builds on the results of descriptive, diagnostic, and predictive analytics to suggest a probable course of action and help businesses make informed decisions. Continuing the previous example, if the individual is a serial defaulter, the system can advise the loan officer not to sanction loans to that individual, given the history of defaults and the poor credit score.
A recent technological advancement in the field of data analytics, behavior analytics helps in revealing consumer behavior insights across platforms like eCommerce, online games, web applications, etc. This will help businesses tailor their services or offerings to resonate with the end-user.
Correlation analysis is a statistical technique used to measure the relationship between two variables. A coefficient close to +1 or -1 indicates a strong relationship between the two variables, while a value near 0 indicates a weak one. It is mostly employed during quantitative analysis of data points collected through methods like polls and surveys. A simple example of correlation analysis would be to compare the sales data of Thor merchandise against the sale of Thor: Love and Thunder tickets.
A data lake is a centralized repository in which enterprise-wide data, whether structured, semi-structured, or unstructured, is saved. A data lake enforces access restrictions pending authorization while also improving the ease of data access. The data can be stored in its native format without the need to structure it, and various types of analytics can be run on it, including big data processing, data visualization, and dashboard creation. Data lakes are highly scalable, and complex data operations can be performed inside them.
A dashboard is a tool used by organizations to group various data relations and visually represent them in different graphical formats based on business requirements to track the performance of key variables. Self-service tools like Tableau and Power BI enable data analysts to create dashboards in the required formats with various visualization options.
A data mart is a data storage unit aimed at retaining data for a specific business line or department. Summarized data of a specific business function is kept within the data mart for ease of data access. For example, to close the year’s books, the accounting department can use a data mart to get specialized access to specific data sets.
Exploratory data analysis (EDA) is one of the initial steps a data scientist or data engineer takes when presented with a new data set. It is an initial investigation aimed at understanding the characteristics of the data and the relationships between variables, and at testing hypotheses and assumptions about the data with statistical graphs and other data visualization tools.
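The text above does not name a specific measure, so the sketch below assumes the Pearson correlation coefficient, which ranges from -1 to +1. The weekly ticket and merchandise figures are invented purely to show the calculation.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, and the spread of each variable.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (spread_x * spread_y)


# Hypothetical weekly figures: tickets sold and merchandise units sold.
tickets = [100, 200, 300, 400]
merchandise = [12, 19, 33, 41]
r = pearson(tickets, merchandise)
```

A value of r near +1 here would suggest that merchandise sales rise with ticket sales; it would not by itself show that one causes the other.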