
Exploratory Data Analysis

December 28, 2022 (updated April 25th, 2023) · Data Analytics, Data Visualization · 5 min read


“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey. Exploratory Data Analysis is the process of performing initial investigations on data to discover trends and patterns, spot anomalies, test hypotheses and check assumptions with the help of statistical summary and graphical representations.

A smart strategy is to first comprehend the data and then try to extract as many insights from it as you can. It is all about making sense of the data at hand before digging deep into it.


EDA is used before modeling to see what the data can tell us. It is not easy to look at a whole spreadsheet and determine the important characteristics of the data, and deriving insights from plain numbers alone can be overwhelming.

EDA is not a formal process with a predefined set of rules; it is a mindset for how data analysis should be carried out. It is a philosophy of how to dissect a data set, what to look for, how to look, and how to interpret. In the initial phases of EDA, we should feel free to investigate every idea that crosses our minds. Some may pan out, while others can be dead ends.

EDA has a broader scope: it is an approach to data analysis that postpones the usual assumptions about which kind of model to fit. Although EDA relies heavily on the collection of techniques called "statistical graphics," it is not identical to statistical graphics per se.

Techniques and tools

With a few exceptions, most EDA techniques are graphical. The primary function of EDA is open-minded exploration, and graphics give analysts unmatched power to do this: they expose the data's structure and repeatedly surface new, unexpected insights.

The specific graphical methods used in EDA are straightforward. They include:

Plotting raw data as data traces, histograms, probability plots, block plots, lag plots, and so on.

Plotting simple statistics computed from the raw data, such as mean plots, standard-deviation plots, box plots, and main-effects plots.

Using numerous plots per page, for example, to leverage our innate pattern-recognition ability.
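As a sketch of the first two ideas, the quantities behind these plots (histogram bin counts, lag-plot pairs, simple statistics) can be computed with NumPy before any plotting library is involved. The data here is a synthetic stand-in, not a real data set:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=500)  # toy data, an assumption

# Histogram: bin counts reveal the distribution's shape.
counts, edges = np.histogram(data, bins=20)

# Lag-plot data: pairs (x[t], x[t+1]) expose serial correlation.
lag_pairs = np.column_stack([data[:-1], data[1:]])

# Simple statistics that the corresponding plots summarize visually.
mean, std = data.mean(), data.std()
```

Passing `counts`/`edges` to a bar chart, or `lag_pairs` to a scatter plot, reproduces the graphical versions of these checks.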

The most commonly utilized data science tools for developing an EDA are:

Python: widely used in EDA, for example to identify missing values in a data set, an important step in deciding how to handle them before machine learning.
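A minimal pandas sketch of that missing-value check, using a small made-up table (the column names and values are illustrative, not from any real data set):

```python
import pandas as pd
import numpy as np

# Toy records with deliberate gaps (illustrative data, an assumption).
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [52000, 48000, np.nan, 61000, 58000],
    "city": ["NY", "LA", "NY", None, "SF"],
})

# Count and rank missing values per column.
missing = df.isna().sum().sort_values(ascending=False)

# Share of rows with no missing values at all.
complete_ratio = df.dropna().shape[0] / len(df)
```

The per-column counts drive the handling decision: drop a sparse column, impute a lightly affected one, or drop the few incomplete rows.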

R: widely used among statisticians and data scientists for developing statistical observations and data analyses.

1. Visualization methods

Univariate visualization means visualizing each field in the raw dataset individually, along with a statistical summary.
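For one numeric field, the statistical summary that accompanies a univariate plot can be produced with pandas. The prices below are toy values chosen to include one suspicious point:

```python
import pandas as pd

# Toy values for a single field (illustrative, an assumption).
prices = pd.Series([12.5, 14.0, 13.2, 55.0, 12.9, 13.6, 14.1])

# Count, mean, std, min, quartiles, max: the numeric companion
# to a histogram of the same field.
summary = prices.describe()

# prices.hist(bins=10) would draw the matching histogram (needs matplotlib).
```

The gap between the median and the maximum in `summary` already hints at the 55.0 anomaly before any chart is drawn.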

Bivariate visualizations and statistical summaries enable you to evaluate the link between each field in the dataset and the target variable under consideration.
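One common bivariate summary is the correlation of each field with the target. This sketch assumes a hypothetical dataset where `sales` is the target; the column names and numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical dataset: does advertising spend relate to sales (the target)?
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "store_size": [5, 3, 6, 2, 4],
    "sales": [14, 25, 33, 46, 52],
})

# Correlation of every other field with the target variable.
corr_with_target = df.corr()["sales"].drop("sales")

# df.plot.scatter(x="ad_spend", y="sales") would visualize the strongest link.
```

Fields with correlations near +1 or -1 are the first candidates for a scatter plot against the target.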

Multivariate visualizations are used to map and analyze interactions between multiple fields in data.
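The numeric core of multivariate displays such as pair plots and correlation heatmaps is the pairwise correlation matrix. A small sketch on invented data:

```python
import pandas as pd

# Toy data with three interacting fields (illustrative values, an assumption).
df = pd.DataFrame({
    "temp": [20, 25, 30, 35, 40],
    "humidity": [80, 70, 65, 50, 40],
    "energy_use": [30, 38, 47, 60, 71],
})

# Full pairwise correlation matrix across all fields.
corr = df.corr()

# seaborn.pairplot(df) or seaborn.heatmap(corr) would render this graphically.
```

Reading the matrix row by row shows which pairs of fields move together and which move in opposite directions.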

Techniques for clustering and dimension reduction aid in the creation of graphical displays of high-dimensional data with many variables. Clustering is a technique that is widely used in market segmentation, pattern recognition, and image compression.
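To make the clustering idea concrete, here is a minimal NumPy sketch of Lloyd's k-means algorithm on two well-separated synthetic blobs standing in for market segments. In practice one would reach for a library implementation (for example scikit-learn's `KMeans`, with PCA for dimension reduction first); this is only an illustration of the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two synthetic "segments" (illustrative data, an assumption).
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, iters=20):
    """Minimal Lloyd's-algorithm sketch, not a production implementation."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Move each center to the mean of its assigned points,
        # keeping the old center if a cluster happens to be empty.
        centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(X, k=2)
```

On data this cleanly separated the two recovered clusters coincide with the two blobs, which is exactly the subgroup structure EDA is trying to surface.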

2. Non-visualization methods

Univariate non-graphical: this is the most basic type of data analysis, in which the data being evaluated consists of only one variable. Since there is only one variable, no causes or correlations are dealt with. The main objective of univariate analysis is to describe the data and find patterns within it.
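For a single categorical variable, this amounts to frequencies, shares, and the mode, with no plot required. The payment methods below are a toy example:

```python
import pandas as pd

# Single categorical variable from a toy transactions log (an assumption).
payments = pd.Series(["card", "cash", "card", "card", "transfer", "cash"])

# Frequencies, the most common value, and each value's share of the total.
freq = payments.value_counts()
mode = payments.mode()[0]
share = freq / len(payments)
```

These three summaries already answer "which values are most common, and how dominant are they?" for one field.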

Multivariate non-graphical: multivariate data is made up of multiple variables. In general, multivariate non-graphical EDA techniques use cross-tabulation or statistics to illustrate the relationship between two or more variables in the data.
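Cross-tabulation in pandas is `pd.crosstab`. The survey-style records here are hypothetical, invented to show how churn might vary with contract type:

```python
import pandas as pd

# Hypothetical records: does churn vary with contract type? (an assumption)
df = pd.DataFrame({
    "contract": ["monthly", "annual", "monthly", "monthly", "annual", "annual"],
    "churned":  ["yes", "no", "yes", "no", "no", "no"],
})

# Raw counts of each (contract, churned) combination.
table = pd.crosstab(df["contract"], df["churned"])

# Row-normalized version: churn rate within each contract type.
rates = pd.crosstab(df["contract"], df["churned"], normalize="index")
```

The normalized table is usually the more readable of the two, since it compares rates rather than raw counts across groups of different sizes.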

Predictive models, such as linear regression, use statistics and data to make predictions.
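A minimal linear-regression sketch: ordinary least squares fits a slope and intercept to observed points, here toy values lying near the line y = 2x + 1 (an assumption chosen so the fit is easy to check):

```python
import numpy as np

# Toy observations near y = 2x + 1 (illustrative data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit slope and intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict for a new observation.
y_pred = slope * 5.0 + intercept
```

The fitted slope and intercept land close to the generating values of 2 and 1, which is the sanity check one would do before trusting the model's predictions.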

A data item or object that differs considerably from the other (so-called normal) items is referred to as an outlier. Errors in measurement or execution may be the cause. Outlier mining is the analysis used to discover outliers.
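One common outlier-mining rule is the interquartile-range (IQR) fence used by box plots: flag points more than 1.5 × IQR beyond the quartiles. A sketch on toy measurements with one planted anomaly:

```python
import pandas as pd

# Toy measurements with one obvious anomaly (illustrative data).
values = pd.Series([10, 11, 9, 10, 12, 11, 10, 95])

# Classic IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```

Whether a flagged point is a measurement error to drop or a genuine extreme to keep is a judgment call that the rule itself cannot make.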


EDA is fundamentally a creative process. The key to asking good questions, as with most creative processes, is to come up with a lot of them.

1. Which values are the most common? Why?

2. Which values are rare? Why? Is that in line with your expectations?

3. Are there any visible unusual patterns? What could be causing them?

Clusters of similar values suggest that subgroups exist in your data. To get more insight into subgroups, ask the questions below.

1. How are the entities within the cluster similar to each other?

2. How do the entities in separate clusters differ from each other?

3. How can the clusters be explained or described?

If we follow up each question with a new question based on our findings, we can quickly drill down into the most interesting parts of the data and develop a set of thought-provoking questions. To summarize, EDA helps us:

  • maximize insight into a data set;
  • uncover underlying structure;
  • extract important variables;
  • detect outliers and anomalies;
  • test underlying assumptions;
  • develop parsimonious models; and
  • determine optimal factor settings.

It can also help identify obvious errors, reveal patterns within the data, detect outliers or anomalous events, and uncover interesting relations among the variables.

It is also used to ensure that the results produced by data analysis are valid and applicable to the desired business outcomes and goals. Once EDA is complete and insights are drawn, its findings can be applied to more advanced data analysis or modeling, such as machine learning.


EDA provides the context necessary to create an acceptable model for the problem at hand and to accurately understand its results, making it an essential step to take before delving into machine learning or statistical modeling. Data scientists can benefit from EDA to ensure that the outcomes they provide are reliable, accurately interpreted, and applicable to the intended business contexts.