Data visualization is an essential skill for any data scientist. It is an interdisciplinary field that deals with the graphical representation of data and information in formats such as charts, graphs, and maps.
Data visualization forms the core of data storytelling, where data is presented in a framework that simplifies the communication of information. It takes the complexity out of presenting a data story, so it’s important to understand the types of visualization available and when each is best used. In this article, we’ll discuss some common data visualization methods and where you can use them while building your data story.
A bar chart is a graphical representation of data used to compare different categories. Each bar stands for one category, and its length encodes a numeric value, such as the number of customers or the average spending per customer in each segment. Bar charts can also show the size of items relative to other groups or categories. The feature that most clearly distinguishes a bar chart from a histogram is that its bars represent separate categories rather than a continuous development over an interval. The critical issues with a bar chart are that:
- They fail to reveal in-depth details such as patterns, effects, and causes
- They can be easily manipulated with false data
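As a rough illustration of comparing categories with bars, the sketch below counts how often each category occurs and draws a text bar whose length is scaled to the largest count. The function name `bar_chart_rows`, the `width` parameter, and the sample order data are hypothetical and chosen only for this example; it also assumes a non-empty input.

```python
from collections import Counter


def bar_chart_rows(categories, width=20):
    """Return one text row per category, bar length scaled to the largest count."""
    counts = Counter(categories)
    largest = max(counts.values())  # raises ValueError on empty input
    rows = []
    for name, count in sorted(counts.items()):
        bar = "#" * round(count / largest * width)
        rows.append(f"{name:<8} {bar} {count}")
    return rows


# Hypothetical data: the customer segment of each order.
orders = ["retail", "retail", "online", "retail", "online", "partner"]
for row in bar_chart_rows(orders):
    print(row)
```

Because every bar starts from zero and shares one scale, the lengths can be compared directly.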
Pie charts are useful for showing the proportions of a whole, and they can also show how the components relate to one another. For example, if you have three groups of data, A, B, and C, the pie chart shows the proportion of each group relative to the others. The basic premise of a pie chart is to divide a circle into segments and size each segment according to the percentage of its category. The percentages must add up to 100%, which is represented by the complete circle. It gives the viewer a superficial but quick understanding of how the categories are distributed proportionally. There are a few downsides to using pie charts:
- Value representation is restricted because slice size is inversely related to the number of categories: the more categories there are, the smaller the slices become, which affects readability
- They take up more space than the alternatives and often need a legend to explain the information they present
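The arithmetic behind a pie chart can be sketched in a few lines: each category's share of the total becomes a percentage, and the same share of 360 degrees becomes the slice angle. The function name `pie_slices` and the values for groups A, B, and C are made up for illustration, and the sketch assumes the values are positive.

```python
def pie_slices(values):
    """Map each category to (percentage of the whole, slice angle in degrees)."""
    total = sum(values.values())
    return {
        name: (100 * value / total, 360 * value / total)
        for name, value in values.items()
    }


# Hypothetical values for three groups.
groups = {"A": 50, "B": 30, "C": 20}
for name, (percent, angle) in pie_slices(groups).items():
    print(f"{name}: {percent:.1f}% of the circle, {angle:.1f} degrees")
```

Adding more groups to `groups` shrinks every angle, which is the readability problem described in the first bullet above.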
A histogram is a graphical representation of the distribution of numerical data. It is used to display the distribution of a continuous variable and shows how many values fall into each interval or range. The horizontal axis represents the intervals (bins) of the variable, and the vertical axis represents the frequency, that is, the number of values that fall into each interval. Histograms work well for data such as test scores, where the goal is to see how the values are spread. However, histograms are not free from problems:
- They depend heavily on how the intervals are chosen and on the variable’s minimum and maximum values
- They are not well suited to comparing distributions, and they do not make it possible to distinguish between continuous and discrete variables
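To make the idea of counting values per interval concrete, here is a minimal sketch that splits a range into equal-width bins and counts how many values land in each. The function name `histogram_counts`, its parameters, and the test scores are assumptions made for this example, not part of any particular library.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:  # values outside the range are ignored
            # The top edge (value == high) is placed in the last bin.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Hypothetical test scores between 50 and 100, grouped into 5 bins of width 10.
scores = [52, 58, 61, 67, 70, 73, 75, 78, 84, 91, 100]
print(histogram_counts(scores, low=50, high=100, bins=5))  # [2, 2, 4, 1, 2]
```

Calling the function again with a different `bins` value on the same scores changes the shape of the result, which shows how sensitive a histogram is to the choice of intervals.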
Scatter plots are a type of visual representation that shows how two variables relate to each other. They are drawn on a Cartesian plane, with one variable along each axis and one point per observation, which helps in determining the relationship between the two variables. If you want your visualizations “to talk” about their findings, scatter plots are ideal because they let you see the values of both variables together on one plot. However, there are certain disadvantages to using scatter plots:
- They show the direction of a correlation but do not quantify its degree. Also, only the relationship between two variables can be shown
- They are effective only when the number of data points is modest. Too many data points lead to clutter, and it becomes very hard to draw conclusions from such a plot
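Since the plot alone does not put a number on the degree of correlation, one common companion (not something the plot itself provides) is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it with the standard formula; the function name `pearson_r` and the sample points are hypothetical, and it assumes both lists have equal length and neither variable is constant.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # Division fails if either variable has zero spread (all values equal).
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical points that would be drawn on the scatter plot.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
print(round(pearson_r(xs, ys), 2))  # 0.8
```

A value near 1 or -1 indicates a strong linear relationship, and the sign gives the direction that the scatter plot shows visually.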
The Gantt chart is a bar chart that shows the progression of activities over time. It’s used to show the order in which tasks will be completed and how long each task will take. It’s an effective tool in project management because it helps in planning and in estimating the total time to project completion, which in turn aids budget determination, resource allocation, and other project management activities. A Gantt chart is very easy to read. Rows designate the activities and columns depict the timeline. The duration of each activity is represented by the length of its bar, and the bars can be color-coded based on various factors. The downsides of a Gantt chart are:
- It’s harder to prepare than a usual bar chart, and if the project is too complex, preparing it becomes time-consuming and stressful
- The success of a Gantt chart depends on its ability to capture the end-to-end list of tasks and their timelines. In projects involving multiple touchpoints, this becomes a challenge
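The row-per-activity layout described above can be sketched as text: each task is offset by its start day and drawn with a bar as long as its duration, and the project length is the latest finish. The task list, the tuple layout `(name, start_day, duration_days)`, and both function names are assumptions for illustration only.

```python
def gantt_rows(tasks):
    """Render each (name, start_day, duration_days) task as one text row."""
    rows = []
    for name, start, duration in tasks:
        # Offset the bar by the start day; its length is the duration.
        rows.append(f"{name:<8}" + " " * start + "=" * duration)
    return rows


def project_length(tasks):
    """Total project time: the latest finishing day across all tasks."""
    return max(start + duration for _, start, duration in tasks)


# Hypothetical project plan, with days counted from the project start.
plan = [("Design", 0, 3), ("Build", 3, 5), ("Test", 6, 4)]
for row in gantt_rows(plan):
    print(row)
print("Total days:", project_length(plan))  # Total days: 10
```

Note that the result is only as good as the task list: a task missing from `plan` simply does not appear, which mirrors the second downside above.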
A WordCloud is a data visualization method that uses words as the elements of a graphic. WordClouds are useful for showing the frequency of terms in a text by varying their size. The words are arranged in a cluster that looks like a cloud, hence the term WordCloud. They are also good at representing metadata across categories. Think of the names of organizations sized by their revenue: the larger an organization’s name appears, the bigger its revenue. They are very straightforward to understand but come with a few flaws:
- Long words take up more space than short words. If the long words are also large, the short words become harder to read
- They’re not known for their analytical accuracy and are mostly used for presentation purposes rather than drawing impactful insights from them
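The link between term frequency and word size can be sketched as a simple linear scaling: count each word, then map the least frequent word to a minimum font size and the most frequent to a maximum. The function name `word_sizes`, the size limits, and the sample text are hypothetical; real word cloud tools also handle layout and stop words, which this sketch leaves out. It assumes the text contains at least one word.

```python
import re
from collections import Counter


def word_sizes(text, min_size=10, max_size=40):
    """Map each word to a font size scaled linearly by its frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    lowest = min(counts.values())  # raises ValueError if no words are found
    highest = max(counts.values())
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in counts.items()
    }


# Hypothetical text: "data" appears 3 times, "chart" twice, "story" once.
print(word_sizes("data data data chart chart story"))
# {'data': 40.0, 'chart': 25.0, 'story': 10.0}
```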
Line plots, also called line graphs, are used to display data that is continuous. They are useful for comparing two or more variables, showing trends over time, or showing relationships between variables. Line plots use the horizontal axis to show the values of one variable and the vertical axis to show a second variable (or a set of continuous data). The points joined by the line represent individual observations. A typical line graph shows gradual change over time, with the Y-axis representing the quantitative value and the X-axis the timescale. There are certain issues associated with line plots:
- They’re not great at representing quantitative measures containing decimal points
- There is a risk of clutter when multiple lines are plotted in the same two-dimensional space
A stem-and-leaf plot is used to display the distribution of a variable. The stem is the leading digit or digits of each value, and the leaf is its last digit, so 47 has stem 4 and leaf 7. It can be used for any numerical variable, such as age or weight, and is most commonly used for small sets of quantitative data rather than categorical variables. Stem-and-leaf plots give a quick overview of the distribution whilst showing the mode of the measures. They also highlight outliers quickly. Their major downside is that they depend on the size of the dataset:
- Smaller datasets are not effectively communicated with stem and leaf plots
- Larger datasets lead to clutter because every value is written out in the plot
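Splitting values into stems and leaves is mechanical enough to sketch directly. The example below assumes non-negative whole numbers, takes everything except the last digit as the stem, and lists the leaves in sorted order. The function name `stem_and_leaf` and the ages are made up for illustration, and unlike some textbook plots this sketch skips stems that have no values.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 47 -> stem 4, leaf 7
        groups[stem].append(leaf)
    return {stem: groups[stem] for stem in sorted(groups)}


# Hypothetical ages.
ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]
for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
# 2 | 3 5
# 3 | 1 4 4 8
# 4 | 2 7
# 5 | 1
```

The longest row (stem 3 here) shows where the values are concentrated, and the repeated leaf 4 shows the mode.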
Data visualization is a powerful way to present your data engagingly and understandably. With a range of methods available, you can choose the one that best suits your needs. The list above is in no way exhaustive, as data visualization is a field where continuous innovation takes place. Newer methods of representing data stories are emerging, and they will give business decision-makers easy and simple representations for seeing the big picture.