Visualize your data: DOs and DON'Ts

Some design principles

The Minard Map

Matplotlib & Seaborn

Matplotlib


						import matplotlib.pyplot as plt
						fig = plt.figure()

Matplotlib


						import matplotlib.pyplot as plt
						fig, ax = plt.subplots()

Matplotlib


						import matplotlib.pyplot as plt
						fig, axes = plt.subplots(1, 2, figsize=(12, 4))

Matplotlib


									import matplotlib.pyplot as plt
									fig, axes = plt.subplots(1, 2, figsize=(4, 8))
									x = [0.2, 0.6]
									y = [0.4, 0.5]
									axes[0].scatter(x, y, s=60)
									axes[1].bar(x, y, width=0.1)

Matplotlib


									import matplotlib.pyplot as plt
									fig, axes = plt.subplots(1, 2, figsize=(4, 8))
									x = [0.2, 0.6]
									y = [0.4, 0.5]
									axes[0].scatter(x, y, s=200, color="r")
									axes[1].bar(x, y, width=0.1)
									axes[1].set_xticks([])
									axes[1].set_title("a barchart", fontsize=16)

Modify overall aesthetics in plot call.

Modify individual elements via set_element(attribute).

Seaborn

A high-level interface to Matplotlib for statistical analyses. Get inspired by the gallery.


									import matplotlib.pyplot as plt
									import pandas as pd
									import seaborn as sns
									sns.set_theme(style = "whitegrid")
									iris = sns.load_dataset("iris")
									iris = pd.melt(iris, "species", var_name = "measurement")
									fig, ax = plt.subplots()
									sns.stripplot(
										data = iris,
										x = "value", 
										y = "measurement",
										hue = "species",
										dodge = True,
										alpha = .25,
										ax = ax
									)
									ax.set_ylabel("")

Why Matplotlib?

... Matplotlib is a pain in the ass to learn.

I know Matplotlib.

Great synergy with seaborn.

Other packages like networkx build on it.

Matplotlib offers very fine control over all figure elements.

For publication-level plots it pays off to learn Matplotlib.

Overview: Python plotting packages

package	style	pro	con
matplotlib	object oriented	very customizable, many extensions	very complex
seaborn	declarative	very sensible defaults, easy(er) to use	less customizable, but can fall back to matplotlib
plotly	declarative	easy to use, made for interactive visualizations, also available for R	less customizable, the web-app workflow takes some getting used to
altair	declarative	easy to use	less customizable
bokeh	grammar of graphics	rather straightforward to use, can handle very big data	less customizable, the web-app workflow takes some getting used to
plotnine	grammar of graphics	similar to ggplot2 in R, straightforward to use	less customizable

Plot types

dim 1	dim 2	type	function
numerical	-	histogram	sns.histplot()
numerical	categorical	bar chart	sns.barplot()
numerical	time	time series	ax.plot()
numerical	numerical	scatter plot	sns.scatterplot()

Example study

"New conceptions of truth foster misinformation in online public political discourse"

(-) Conceptions of "truth" splinter into two distinct camps.

(-) "Truth-seeking" aims to uncover factual information and update one's beliefs.

(-) "Belief-speaking" conceptualizes truth as "authenticity" and "speaking one's mind".

(-) We measure "truth-seeking" and "belief-speaking" in tweets by U.S. Congress Members.

Example data

"New conceptions of truth foster misinformation in online public political discourse"

Histogram: number of tweets per account


									import matplotlib.pyplot as plt
									import pandas as pd
									import seaborn as sns
									users = pd.read_csv("users.csv")

									sns.histplot(
										data=users, 
										x="tweet_count"
									)

Adapt aspect ratio to data


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.histplot(
										data=users, 
										x="tweet_count",
										ax=ax
									)

Choose intuitive bin width


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.histplot(
										data=users, 
										x="tweet_count",
										ax=ax,
										bins=range(0, 3510, 250),
										shrink=0.8
									)
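
The bins=range(0, 3510, 250) argument places bin edges at multiples of 250. When the data maximum is not known in advance, the edges could be derived from the data. A small sketch, assuming non-negative values and an integer bin width; round_bin_edges is a hypothetical helper, not a seaborn or Matplotlib function.

```python
import math

def round_bin_edges(values, width):
    """Return bin edges at multiples of `width` that cover all of `values`.

    Assumes non-negative values and an integer `width`.
    """
    upper = math.ceil(max(values) / width) * width
    # Guarantee at least one bin, even if every value is 0.
    upper = max(upper, width)
    return list(range(0, upper + 1, width))

edges = round_bin_edges([12, 480, 3499], width=250)
```

The result can be passed directly as bins=edges to sns.histplot().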

De-cluttering


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.histplot(
										data=users, 
										x="tweet_count",
										ax=ax,
										bins=range(0, 3510, 250),
										shrink=0.8,
										edgecolor=None
									)
									ax.spines["top"].set_visible(False)
									ax.spines["right"].set_visible(False)
									ax.set_xlabel("Tweet count", fontsize=16)
									ax.set_ylabel("User count", fontsize=16)

Barchart: two categories


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.barplot(
										data=belief_speaking, 
										x="proportion",
										y="time_period",
										hue="party",
										ax=ax
									)

Use known metaphors


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.barplot(
										data=belief_speaking, 
										x="proportion",
										y="time_period",
										hue="party",
										ax=ax,
										palette=["#0015BC", "#FF0000"],
										hue_order=["Democrat", "Republican"]
									)

De-cluttering


									sns.barplot(
										data=belief_speaking, 
										x="proportion",
										y="time_period",
										hue="party",
										ax=ax,
										palette=["#0015BC", "#FF0000"],
										hue_order=["Democrat", "Republican"]
									)
									ax.spines['right'].set_visible(False)
									ax.spines['top'].set_visible(False)
									ax.legend(frameon=False, fontsize=16)
									ax.set_ylabel("")
									ax.set_xlabel("% of Tweets", fontsize=16)
									ax.tick_params(axis='both', labelsize=12)

Alternative category representation


									sns.barplot(
										data=belief_speaking, 
										x="proportion",
										y="party",
										hue="time_period",
										ax=ax,
										palette=[(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)],
										hue_order=["2010 to 2013", "2019 to 2022"]
									)

Time-series: number of tweets


									counts = pd.read_csv(
										"tweet_counts.csv",
										parse_dates = ["date"]
									)
									counts = counts.set_index("date")
									dem = counts[counts["party"] == "Democrat"]
									rep = counts[counts["party"] == "Republican"]

									fig, ax = plt.subplots(figsize = (9, 4))
									ax.plot(
										dem.index,
										dem["tweet_count"],
										color = "#0015BC", 
									)

Rolling average


									counts = counts.set_index("date")
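									# average only the numeric column; the string column "party" cannot be averaged
									dem = counts[counts["party"] == "Democrat"][["tweet_count"]]\
										.rolling("90D").mean()
									rep = counts[counts["party"] == "Republican"][["tweet_count"]]\
										.rolling("90D").mean()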

									fig, ax = plt.subplots(figsize = (9, 4))
									ax.plot(
										dem.index,
										dem["tweet_count"],
										color = "#0015BC", 
									)
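
To see what the rolling average does in isolation, here is a sketch on synthetic daily counts. The column name mirrors the slide; the numbers are made up. A time-based window such as "90D" needs a sorted DatetimeIndex, hence the sort_index() call.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts for one year (illustration only).
dates = pd.date_range("2020-01-01", periods=365, freq="D")
rng = np.random.default_rng(1)
daily = pd.DataFrame({"tweet_count": rng.poisson(lam=20, size=365)}, index=dates)

# Mean over the trailing 90 days for every date.
smooth = daily.sort_index().rolling("90D").mean()
```

With a time-based window, the first rows are averaged over however many days are available so far, so the smoothed series has no missing values at the start.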

Labels & de-cluttering


									fig, ax = plt.subplots(figsize = (9, 4))
									ax.plot(
										dem.index,
										dem["tweet_count"],
										color = "#0015BC", 
										label="Democrat"
									)
									ax.set_ylabel("Tweet count", fontsize=12)
									ax.legend(frameon=False, loc=6)
									ax.spines['right'].set_visible(False)
									ax.spines['top'].set_visible(False)

Annotations


									p_elections = [pd.to_datetime(date) for\
									            date in ["2012-11-06", "2016-11-08", "2020-11-03"]]
									s_elections = [pd.to_datetime(date) for\
									            date in ["2013-11-05", "2014-11-04", "2015-11-03",
									                     "2017-11-07", "2018-11-06", "2019-11-05"]]
									for el in p_elections:
									    ax.plot([el, el], [0, 750], color="k")
									for el in s_elections:
									    ax.plot([el, el], [0, 600], "--", color="grey")
									    
									ax.text(pd.to_datetime("2013-07-01"), 650,
									        "Congress elections", color="grey", fontsize=12)
									ax.text(pd.to_datetime("2015-03-01"), 800,
									        "Presidential elections", color="k", fontsize=12)

Scatterplot: followers vs. following


									fig, ax = plt.subplots(figsize = (7, 4))
									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										ax = ax
									)

Logarithmic axis scales


									fig, ax = plt.subplots(figsize = (7, 4))
									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										ax = ax
									)
									ax.set_yscale("log")
									ax.set_xscale("log")
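
A side-by-side sketch of why log axes help with heavy-tailed counts. The data are synthetic (log-normal) and only stand in for follower numbers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up, heavy-tailed data standing in for follower / following counts.
rng = np.random.default_rng(2)
followers = rng.lognormal(mean=9, sigma=2, size=300)
following = rng.lognormal(mean=6, sigma=1, size=300)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.scatter(followers, following, s=10)
ax_lin.set_title("linear axes")

ax_log.scatter(followers, following, s=10)
ax_log.set_xscale("log")
ax_log.set_yscale("log")
ax_log.set_title("log axes")
```

On linear axes a few very large accounts squeeze everything else into a corner. Note that values of zero cannot be placed on a log axis.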

Labels & de-cluttering


									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										ax=ax
									)
									ax.set_xlabel("followers", fontsize=12)
									ax.set_ylabel("following", fontsize=12)
									ax.legend(frameon=False, fontsize=12)
									ax.spines['right'].set_visible(False)
									ax.spines['top'].set_visible(False)

Adding another data dimension


									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										size = "N_tweets",
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										alpha = 0.3,
										linewidth = 0,
										ax = ax
									)
									ax.legend(frameon = False, loc = 9, fontsize = 12,
										bbox_to_anchor = [1.2, 0.9, 0, 0], 
									)

Visualizing words: wordcloud

Visualizing words: scattertext

Visualizing words: Word Shift Graphs

Quantify which words contribute to a difference between two texts.

Quantify how words contribute to the difference.

Can be used for comparing texts according to word proportions, sentiment, ... (any dictionary).

Ryan Gallagher: "whenever you think about using a wordcloud, use a Word Shift Graph instead".
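
The simplest quantity behind such a graph is the difference in relative word frequency between two texts. A hand-rolled sketch of that idea, assuming whitespace tokenization and two non-empty texts; proportion_shift is a hypothetical helper and not the implementation of any package.

```python
from collections import Counter

def proportion_shift(text_a, text_b):
    """Per-word difference in relative frequency: p_b(word) - p_a(word)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    words = set(counts_a) | set(counts_b)
    shifts = {w: counts_b[w] / total_b - counts_a[w] / total_a for w in words}
    # Largest absolute contributions first, as in a word shift plot.
    return sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)

shifts = proportion_shift("truth facts evidence truth", "truth feel believe feel")
```

Positive values mark words that are relatively more frequent in the second text, negative values words that are more frequent in the first. Plotting the top entries as horizontal bars gives a basic word shift graph.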

Visualizing networks: spring layout

Code source

Visualizing networks: known underlying structure

Code source

Random things I thought might be helpful

Colors

Coloring for colorblindness

Color hunt color palettes
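
Seaborn also ships a palette intended to be distinguishable for common forms of colorblindness. A short sketch of how to get it:

```python
import seaborn as sns

palette = sns.color_palette("colorblind")  # sequence of RGB tuples
hex_codes = palette.as_hex()               # hex strings, e.g. for reuse in other tools
```

The palette object can be passed as palette=palette to seaborn plotting functions; individual entries work as color=... in Matplotlib calls.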

Image filetypes


							plt.savefig("my_figure.png", dpi=300)

							# great to import in Inkscape or Adobe Illustrator
							plt.savefig("my_figure.svg")

							plt.savefig("my_figure.pdf")

							# works only partially: EPS has no transparency, so alpha is lost :(
							plt.savefig("my_figure.eps")

Plot coding style: use functions


							def PlotSomethingNice(ax, data, param="xy"):
								...
								...

							def PlotAnotherNiceThing(ax, data, param="xy"):
								...
								...

							fig, axes = plt.subplots(2, 2)

							PlotSomethingNice(axes[0][0], subset1, param="first part")
							PlotSomethingNice(axes[0][1], subset2, param="second part")
							PlotAnotherNiceThing(axes[1][0], df, param="do this")
							PlotAnotherNiceThing(axes[1][1], df, param="do something else")
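
A runnable variant of the pattern above, with one concrete plotting function. The function name, the DataFrame and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_histogram(ax, data, column, title=""):
    """Draw a de-cluttered histogram of one column on the given axes."""
    ax.hist(data[column], bins=20)
    ax.set_title(title)
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

# Made-up data for illustration only.
rng = np.random.default_rng(3)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.exponential(size=100)})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
plot_histogram(axes[0], df, "a", title="column a")
plot_histogram(axes[1], df, "b", title="column b")
```

Because each function receives its axes as an argument, panels can be rearranged or reused in another figure without touching the plotting code.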

Large data

Problem 1: since every data point is an object, Matplotlib gets VERY slow for large amounts of data.

Problem 2: figure files also get VERY large because of this.

Solution 1: subsample your data.


							less_data = lots_of_data.sample(frac=0.05, random_state=42)

Solution 2: rasterize your figures and save as .png.
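
A sketch of Solution 2 on synthetic data. Saving as .png rasterizes the whole figure. As an alternative for vector formats, rasterized=True on a single artist turns only the data points into a bitmap while axes and text stay vector. The file names and the temporary directory are arbitrary choices.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

# Made-up large data set: 50,000 points.
rng = np.random.default_rng(4)
x, y = rng.normal(size=(2, 50_000))

fig, ax = plt.subplots()
# rasterized=True only matters for vector output (pdf, svg).
ax.scatter(x, y, s=1, alpha=0.3, rasterized=True)

out_dir = tempfile.gettempdir()
png_path = os.path.join(out_dir, "large_scatter.png")
pdf_path = os.path.join(out_dir, "large_scatter.pdf")
fig.savefig(png_path, dpi=300)  # fully rasterized
fig.savefig(pdf_path, dpi=300)  # points as bitmap, axes and labels as vectors
```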

Summary

Matplotlib + Seaborn offers both high abstraction and excessive detail.

Use known visual metaphors and de-clutter your plots for maximum clarity.

Whenever you do something unexpected in your visuals (resampling or excluding data, truncating axes), DESCRIBE IT in the text.