5 Data Visualization Concepts Every Data Scientist Should Know!

Margaret Awojide
Published in Analytics Vidhya
4 min read · Jan 7, 2021

1. Anscombe’s Quartet

Anscombe’s Quartet was constructed by the statistician Francis Anscombe in 1973. It comprises four datasets that have nearly identical summary statistics, yet look very different when graphed.

Anscombe’s Quartet Dataset

In each of the four datasets above, the mean of the X values is 9 and the mean of the Y values is approximately 7.5. Likewise, the (population) Standard Deviation of the X values is approximately 3.16 and that of the Y values is approximately 1.94.
The summary statistics are nearly identical, but when we draw a scatter plot of each X and Y pair, we see that the datasets are actually very different!
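
We can check these figures with a short script. This is a minimal sketch using only Python’s standard library. The values are the published quartet typed in by hand, the names `quartet` and `summarize` are my own, and the standard deviation is the population form (`statistics.pstdev`), which is the one that reproduces 3.16 and 1.94.

```python
from statistics import mean, pstdev

# Anscombe's quartet: datasets I, II and III share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean of x, mean of y, population SD of x, population SD of y),
    each rounded to two decimal places."""
    return (round(mean(xs), 2), round(mean(ys), 2),
            round(pstdev(xs), 2), round(pstdev(ys), 2))


# Print one summary row per dataset; the rows come out almost identical.
for name, (xs, ys) in quartet.items():
    print(name, summarize(xs, ys))
```

The four printed rows agree to within rounding, which is exactly why the numbers alone cannot tell the datasets apart.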

Relying on summary statistics alone can be very misleading, so it is important that we visualize our data to understand it better.

Image Credit: Wikipedia

2. Explanatory and Exploratory Analysis

There are two main purposes for visualization in Data Science: to explore and to explain. Exploratory Analysis is used when looking for relationships in data. The visuals do not need to be perfect, because we are simply looking for patterns and trying to understand the data better. Explanatory Analysis, on the other hand, is used to highlight insights in the data and to tell a story to an audience.

Exploratory Visuals are used for finding relationships and summarizing the main characteristics of data, whereas Explanatory Visuals are used to highlight insights in data and tell a story to the audience.

Credit: Pinterest (Exploratory Data Analysis)

3. Visual Encodings

Visual Encodings are mappings from data to display elements. These display elements include Position (on the X and Y axes), Shape, Size, Angle, Length, etc. They help us convey our data to our audience in the best way. The most important display elements are Position and Length.

Color is an important display element because it helps to highlight differences in data for our audience, but it should be used only when it is necessary. Color should be added to communicate your data and not just for beautification. Because some readers are color blind, Data Scientists are encouraged to choose colors that include their color-blind audience and, as such, to avoid red-green color palettes.

Visual Encodings should be used only when they are necessary and should not be overused, or they will defeat their purpose.

4. Chart Junk and the Data Ink Ratio

Chart junk includes every visual element in a chart or graph that is not necessary and distracts the viewer from the information presented. Examples include shading, 3-dimensional charts, heavy grid lines, etc.

Credit: Wikipedia
Credit: Medium

The Data Ink Ratio, introduced by Edward Tufte, is the ratio of the ink used to present the data to the total ink used in the graphic. The higher the data ink ratio, the less chart junk there is and the better the chart.
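
As a rough illustration, the ratio can be worked out as data ink divided by the total ink in the graphic, which is Tufte’s formulation. This is only a sketch: the function name `data_ink_ratio` and the ink measurements (imagine pixel counts) are hypothetical and chosen just to show the arithmetic.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of a graphic's ink that presents data (a value between 0 and 1)."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical ink measurements for the same bar chart, before and after
# removing shading and heavy grid lines. The bars (data ink) stay the same;
# only the non-data ink shrinks.
cluttered = data_ink_ratio(data_ink=1200, total_ink=4800)
cleaned = data_ink_ratio(data_ink=1200, total_ink=1600)

print(f"cluttered: {cluttered:.2f}, cleaned: {cleaned:.2f}")
```

Removing the junk does not change the data at all, yet the ratio rises because less of the chart is spent on things that are not data.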

When creating charts, it is important to remove unnecessary and distracting visual elements and maximize the data ink ratio.

5. Data Integrity and the Lie Factor

Data Integrity is the accuracy, completeness and validity of data. It is important to create charts that maintain the integrity of your data. One measure of data integrity in plots is the Lie Factor (by Edward Tufte), which is the degree to which a visualization misrepresents the data values being plotted: the size of the effect shown in the graphic divided by the size of the effect in the data.
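
The Lie Factor can be sketched in a few lines. Here the size of an effect is taken as the relative change between two values, following Tufte’s definition. The function names and the bar chart values are hypothetical and only illustrate what happens when a value axis does not start at zero.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return (second - first) / first


def lie_factor(data_first, data_second, shown_first, shown_second):
    """Size of the effect shown in the graphic divided by the effect in the data."""
    data_effect = effect_size(data_first, data_second)
    if data_effect == 0:
        raise ValueError("the data values must differ")
    return effect_size(shown_first, shown_second) / data_effect


# Hypothetical bar chart: the values are 50 and 55, but the axis starts at 45,
# so the bars are drawn 5 and 10 units tall.
axis_start = 45
values = (50, 55)
heights = tuple(v - axis_start for v in values)

print(round(lie_factor(*values, *heights), 2))
```

A Lie Factor of 1 means the graphic shows the effect at its true size. In this example the data grows by 10% while the drawn bar doubles in height, so the Lie Factor is about 10.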

Credit: Towards Data Science

Most Visualization Lies use:
a. Extra Dimensions
b. Dual Axis
c. Wrong Binning Method
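
To see how the binning choice alone can change the story, here is a minimal sketch that counts the same values with two different bin widths. The `bin_counts` helper and the `ages` values are hypothetical and exist only for illustration.

```python
from collections import Counter


def bin_counts(values, width, start=0):
    """Count values per bin and return {bin lower edge: count}.

    Bin k covers the half-open range [start + k*width, start + (k+1)*width).
    Empty bins are left out of the result.
    """
    counts = Counter((v - start) // width for v in values)
    return {start + k * width: counts[k] for k in sorted(counts)}


# Hypothetical ages with two clusters, one around 20 and one around 40.
ages = [18, 19, 20, 21, 22, 23, 38, 39, 40, 41, 42, 43]

print(bin_counts(ages, width=5))   # narrow bins keep the two clusters apart
print(bin_counts(ages, width=50))  # one wide bin merges them into a single bar
```

With a width of 5 the gap between the two groups is visible, while a width of 50 collapses everything into one bar and hides it.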

Credit: Business Insider

The chart shows the 2012 Presidential Run using a 3-dimensional pie chart. Notice that, as a result, the green slice with 60% looks bigger than the red slice with 70% because of the extra dimension.

This is another chart that could mislead the audience into believing that there is a large difference between the two bars, because the value axis does not begin at zero.

Generally, a good visualization should:

Maintain Data Integrity
Minimize Chart Junk
Maximize the Data Ink Ratio
Avoid Using Display Elements Unless Absolutely Necessary

Thank You!
