Analytics Glossary

A single source of truth for all your analytics questions

The bias metric is a descriptive statistic that is computed before and after training the model. Its score has two extremes. A low bias score denotes that the model’s assumptions or predictions are almost correct, and it usually means the predictions are split roughly evenly, with about half above the actual values and half below. If the model instead shows a high bias score together with low variance, it is likely underfitting and the architecture needs to be revisited.
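The glossary does not give a formula for the bias score, so the sketch below assumes one common choice, the mean signed error (the average of predicted minus actual). The function name and the data are hypothetical and only illustrate how balanced predictions cancel out while systematic overshooting does not.

```python
def mean_signed_error(actual, predicted):
    """Average of (predicted - actual); assumed here as the bias score."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


actual = [10.0, 12.0, 14.0, 16.0]
balanced = [11.0, 11.0, 15.0, 15.0]   # two predictions above, two below
too_high = [13.0, 15.0, 17.0, 19.0]   # every prediction overshoots by 3

bias_balanced = mean_signed_error(actual, balanced)
bias_too_high = mean_signed_error(actual, too_high)
print(bias_balanced, bias_too_high)
```

A score near zero corresponds to the "equally split" case described above; a score far from zero signals a systematic error.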

An algorithm is a set of instructions or a method used to generate a machine learning model. Linear regression, decision trees, support vector machines, and neural networks are a few examples. An algorithm is different from a machine learning model: a model is the output of an algorithm and comprises model data and a prediction procedure.

Continuous variables can take on any value between the lowest and highest points of measurement. Some examples are speed, distance, and lifespan. They are an important element in inferential statistics but are not preferred in data mining.

Convergence is the state a machine learning algorithm reaches when the loss (the difference between the true value from the data set and the predicted value) is minimal and barely changes from one iteration to the next.
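As a rough illustration, the sketch below fits a one-parameter model y = w * x with gradient descent and stops once the loss changes by less than a tolerance between iterations. The function name, learning rate, tolerance, and data are all assumptions chosen for the example, not values from this glossary.

```python
def fit_until_converged(xs, ys, learning_rate=0.01, tolerance=1e-9, max_iterations=10_000):
    """Gradient descent on mean squared error for y = w * x.

    Returns (w, loss, iterations_used). Stops when the loss barely changes.
    """
    w = 0.0
    previous_loss = float("inf")
    loss = previous_loss
    for iteration in range(1, max_iterations + 1):
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if abs(previous_loss - loss) < tolerance:
            return w, loss, iteration          # converged
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
        previous_loss = loss
    return w, loss, max_iterations             # gave up before converging


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                      # generated by y = 2x
weight, final_loss, steps = fit_until_converged(xs, ys)
print(weight, final_loss, steps)
```

The tolerance controls what "almost non-existent" change means; a looser value stops earlier with a less precise weight.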

Deduction is a top-down approach to arriving at a solution, where the technique starts by assuming a hypothesis. The hypothesis is then tested and observations are drawn to reach a conclusion.

Deep learning is a subset of machine learning in which the system mimics the way humans acquire knowledge. Backed by years of research, it plays an important role in data science, particularly in predictive modeling. Its applications include, but are not limited to, computer vision, signal processing, medical diagnosis, and self-driving cars. Modern computing power has driven the rise of deep learning systems.

The number of attributes or features in a data set is called its dimension. A data set with more than a hundred attributes is generally referred to as a high-dimensional data set, and calculations on such data sets are often difficult.

An attribute is a characteristic that describes an observation, such as height, weight, or size. For simpler reference, we can picture attributes as column headers or column names in a spreadsheet.

Clustering is the activity of dividing or grouping data points based on similar traits. Data points in a cluster are similar to each other and different from the data points in other clusters. It’s an unsupervised learning method used to draw inferences without labeled data.
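One well-known clustering method is k-means. The sketch below is a minimal one-dimensional version with hand-picked starting centroids; the function name, the data, and the starting values are assumptions for illustration, and real projects would normally rely on a tested library implementation.

```python
def k_means_1d(points, initial_centroids, max_iterations=10):
    """Minimal 1-D k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break                              # assignments are stable
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]        # no labels are supplied
centroids, clusters = k_means_1d(points, initial_centroids=[0.0, 6.0])
print(centroids, clusters)
```

No labels are passed in; the two groups emerge only from how close the values are to each other.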

The act of predicting outside the range of the available data is called extrapolation. When extrapolation goes beyond the range of the training data, machine learning algorithms tend to become unreliable, because nothing in training constrained the model there.
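A small sketch of why this happens: a straight line is fitted to data that actually follows y = x squared on the range 0 to 5, and is then asked for a prediction at x = 10. The data, the helper name, and the choice of a straight-line model are assumptions made for the example.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [x ** 2 for x in train_x]            # the true relationship is curved
slope, intercept = least_squares_line(train_x, train_y)


def predict(x):
    return slope * x + intercept


# Largest error inside the training range versus the error at x = 10.
worst_in_range_error = max(abs(predict(x) - x ** 2) for x in train_x)
extrapolation_error = abs(predict(10.0) - 10.0 ** 2)
print(worst_in_range_error, extrapolation_error)
```

Inside the training range the line is a tolerable approximation; outside it, the error grows quickly because the model never saw how the curve continues.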

Statistical analysis deals with collecting and analyzing data to uncover hidden trends and patterns. It aims to remove bias through numerical analysis of the data. Interpreting research data, developing statistical models, and planning surveys are some of its key application areas.

The mean, or average, is one of the primary measures used in statistical analysis to determine the pattern or trend of the data. It’s relatively easy to calculate: sum up the data points and divide the figure by the total number of data points. The mean alone cannot be considered a sole indicator for decision-making. It must be combined with other statistical indicators to make a more accurate decision.
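The calculation can be sketched in a few lines of Python. The data set is made up and deliberately contains one outlier, to show why the mean should not be read on its own.

```python
import statistics

data = [2, 4, 6, 8, 100]                       # hypothetical values; 100 is an outlier

manual_mean = sum(data) / len(data)            # sum of points / number of points
library_mean = statistics.mean(data)           # same result from the standard library
middle_value = statistics.median(data)         # a second indicator for comparison

print(manual_mean, library_mean, middle_value)
```

The single outlier pulls the mean far above most of the values, while the median stays close to them.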

The median is the middle value in a set of data, separating the higher half of the data from the lower half. The first step is arranging the data in ascending order. If there is an odd number of data points, the value in the middle position is the median; if there is an even number, the median is the average of the two middle values.
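The odd and even cases can be sketched as follows. The function name is illustrative; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)                   # step 1: ascending order
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]                    # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle pair


odd_case = median([7, 1, 5])
even_case = median([7, 1, 5, 3])
print(odd_case, even_case)
```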

The mode, or modal value, is the data point that occurs most often in a given data set. Along with the median and mean, the mode forms the trio of measures of central tendency. A data set can have no mode, or more than one mode.
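A minimal sketch of the three cases (one mode, several modes, no mode) is shown below. It assumes the convention that a data set in which no value repeats has no mode; the function name is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when no value repeats (assumed convention)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                              # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


single_mode = modes([1, 2, 2, 3])
two_modes = modes([1, 1, 2, 2, 3])
no_mode = modes([1, 2, 3])
print(single_mode, two_modes, no_mode)
```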

A set of data from which a sample is drawn for statistical analysis is called a population. Generally, a population consists of data showing similar attributes. For example, from a data set consisting of people, a population can be residents of North America, Caucasian males, etc.

A sample is a subset of a population that is used for statistical analysis and from which inferences can be drawn about that population. The idea of sampling is to let analysts perform their operations on a manageable set of data. Arriving at the ideal set of sample data is a time-consuming activity.

Variance is used to analyze how the data is spread out. Mathematically, variance is the average of the squared deviations from the mean, which makes it the square of the standard deviation. It measures the spread of a probability distribution. In real-world scenarios, finance managers rely on variance to calculate the risk and reward of an investment portfolio.
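Variance can be computed as the average squared deviation from the mean. The sketch below uses the population form (dividing by n); a sample variance would divide by n - 1 instead. The data values and function name are made up for illustration.

```python
def population_variance(values):
    """Average of squared deviations from the mean (divides by n)."""
    if not values:
        raise ValueError("variance of an empty data set is undefined")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]    # hypothetical values, mean = 5
variance = population_variance(data)
print(variance)
```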

Covariance describes how two variables vary in relation to one another. If the variables tend to move together, the covariance is positive; if they move inversely, it is negative. The major difference between variance and covariance is the number of variables: variance deals with one variable, whereas covariance deals with two. A real-world example of covariance is in molecular biology, where the behavior of certain DNA is studied using the covariance method.
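The sign behavior can be sketched with a hand-written sample covariance (dividing by n - 1). The variable names and numbers are hypothetical and chosen so that one pair moves together and the other moves inversely.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long series (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


hours = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]            # moves with hours
falling = [10.0, 8.0, 6.0, 4.0, 2.0]           # moves against hours

positive_cov = sample_covariance(hours, rising)
negative_cov = sample_covariance(hours, falling)
print(positive_cov, negative_cov)
```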

Standard deviation is another widely used statistical measure. It shows how far the data points typically deviate from their calculated mean, that is, the spread of the data around the mean. It is calculated as the square root of the variance and helps in predicting future trends.
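The relationship between the two measures can be checked with Python's standard library. The sketch uses the population versions (`pvariance` and `pstdev`) and made-up data.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]    # hypothetical values, mean = 5

std_from_variance = math.sqrt(statistics.pvariance(data))   # square root of the variance
std_direct = statistics.pstdev(data)                         # library shortcut

print(std_from_variance, std_direct)
```

Because it is expressed in the same units as the data, the standard deviation is usually easier to interpret than the variance.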

Correlation analysis is a statistical technique used to measure the relationship between two variables. A correlation coefficient close to +1 or -1 indicates a strong correlation between the two variables, while a value near 0 indicates a weak one. It is mainly employed during the quantitative analysis of data points collected through methods like polls and surveys. A simple example of correlation analysis would be to compare the sales of Thor merchandise with the sales of Thor: Love and Thunder tickets.
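A common way to quantify the relationship is the Pearson correlation coefficient, sketched below. The ticket and merchandise figures are invented purely for illustration and do not come from any real sales data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; ranges from -1 to +1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return sxy / math.sqrt(sxx * syy)


ticket_sales = [100.0, 150.0, 200.0, 250.0, 300.0]     # hypothetical daily figures
merch_sales = [20.0, 32.0, 41.0, 49.0, 60.0]           # hypothetical daily figures

r = pearson_r(ticket_sales, merch_sales)
print(r)
```

A strong correlation on its own does not show that one variable causes the other.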

Regression analysis is used to determine the relationship between a dependent variable and one or more independent variables. With a single independent variable (two variables in total) it’s called simple linear regression; with more than two variables it’s known as multiple regression. It helps analysts understand the magnitude of the relationship between the variables and predict how one variable behaves in relation to the others.
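For the two-variable case, the fitted line can be sketched with ordinary least squares. The function name and the data are assumptions for illustration; the data was generated from y = 2x + 1 so the result is easy to check.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates for y = slope * x + intercept (one independent variable)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the independent variable must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


independent = [1.0, 2.0, 3.0, 4.0]
dependent = [3.0, 5.0, 7.0, 9.0]               # generated by y = 2x + 1

slope, intercept = fit_simple_regression(independent, dependent)
predicted_at_5 = slope * 5.0 + intercept
print(slope, intercept, predicted_at_5)
```

The slope expresses the magnitude of the relationship: how much the dependent variable changes per unit change in the independent variable.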

Hypothesis testing is a statistical analysis method used to test the validity of a hypothesis or an assumption made about a data set. It helps us establish whether a relationship between two variables is supported by the data. The hypotheses involved come in different types, such as the null hypothesis, the alternative hypothesis, simple hypotheses, and composite hypotheses.
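One way to test a hypothesis without extra libraries is an exact permutation test, sketched below. The null hypothesis is that the two groups come from the same distribution; the function name and the measurements are hypothetical, and this is only one of many possible tests.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    at_least_as_extreme = 0
    splits = 0
    # Enumerate every way of relabelling the pooled values into two groups.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-12:     # small slack for float rounding
            at_least_as_extreme += 1
        splits += 1
    return at_least_as_extreme / splits


treated = [12.0, 14.0, 15.0, 16.0]             # hypothetical measurements
control = [8.0, 9.0, 10.0, 11.0]               # hypothetical measurements

p_value = permutation_p_value(treated, control)
print(p_value)
```

A small p-value means a difference this large would rarely arise if the null hypothesis were true, which is evidence for the alternative hypothesis. Enumerating every split is only practical for small groups.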

Sample size determination, or data sampling, is a technique used to derive a sample that is representative of the entire population. This method is used when the population is very large. You can choose from various data sampling techniques such as snowball sampling, convenience sampling, and random sampling.
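Of the techniques listed, random sampling is the simplest to sketch. The example below draws a simple random sample without replacement using Python's standard library; the population, the sample size of 50, and the seed are arbitrary choices for illustration.

```python
import random

population = list(range(1, 1001))              # hypothetical population of 1,000 IDs

rng = random.Random(42)                        # fixed seed so the draw is repeatable
sample = rng.sample(population, k=50)          # simple random sample, no repeats

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

Comparing the sample mean with the population mean gives a quick sense of how representative a particular draw is.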

Often called the most basic form of data analytics, descriptive analytics deals with breaking down big numbers into smaller, consumable formats. It helps in understanding basic trends but doesn’t help with deeper analysis. It stops at answering the “what”. Small everyday operations rely on descriptive analytics for their day-to-day planning.

As the name suggests, predictive analytics is the science of predicting future scenarios in business based on historical data. It relies on advanced statistical analysis, data mining, and machine learning to develop a comprehensive prediction. It helps business leaders make data-driven decisions and proactively mitigate risks. An everyday example would be a bank analyzing an applicant’s past payment behavior to predict the probability of on-time payment before extending a credit line.

As the final stage of the data analytics maturity curve, prescriptive analytics feeds on the results of descriptive, diagnostic, and predictive analytics to suggest a probable course of action and help businesses make informed decisions. Building on the previous example, if the individual is a serial defaulter, the system can advise the mortgage officer not to sanction loans for that individual, given the history of defaults and the poor credit score.

Exploratory Data Analysis, known by its abbreviation EDA, is one of the first steps a data scientist or data engineer takes when presented with a new data set. It is an initial investigation aimed at understanding the characteristics of the data and the relationships between variables, and at testing hypotheses and presumptions about the data with statistical graphs and other data visualization tools.

Inferential statistical analysis uses analytical tools to draw inferences about a population from the sample data points collected. It helps in making generalizations about the population by using various analytical tests and tools. Many sampling techniques are used to pick random samples that represent the population accurately. There are different types of inferential analysis, such as parameter estimation, confidence intervals, and hypothesis testing.

Causal statistical analysis focuses on determining the cause-and-effect relationship between different variables within the raw data. In simple words, it determines why something happens and its effect on other variables. Businesses can use this methodology to determine the reason for a failure.


© 2022 Purpleslate Private Limited | Made with 🤍 at Chennai, India