Visualize your data: DOs and DONT's

Some design principles

The Minard Map

Matplotlib & Seaborn

Matplotlib


						import matplotlib.pyplot as plt
						fig = plt.figure()
						

Matplotlib


						import matplotlib.pyplot as plt
						fig, ax = plt.subplots()
						

Matplotlib


						import matplotlib.pyplot as plt
						fig, axes = plt.subplots(1, 2, figsize=(12, 4))
						

Matplotlib


									import matplotlib.pyplot as plt
									fig, axes = plt.subplots(1, 2, figsize=(4, 8))
									x = [0.2, 0.6]
									y = [0.4, 0.5]
									axes[0].scatter(x, y, s=60)
									axes[1].bar(x, y, width=0.1)
								

Matplotlib


									import matplotlib.pyplot as plt
									fig, axes = plt.subplots(1, 2, figsize=(4, 8))
									x = [0.2, 0.6]
									y = [0.4, 0.5]
									axes[0].scatter(x, y, s=200, color="r")
									axes[1].bar(x, y, width=0.1)
									axes[1].set_xticks([])
									axes[1].set_title("a barchart", fontsize=16)
								

Modify overall aesthetics in plot call.

Modify individual elements via set_element(attribute).

Seaborn

A high-level interface to Matplotlib for statistical analyses. Get inspiredy by the gallery.


									import seaborn as sns
									sns.set_theme(style = "whitegrid")
									iris = sns.load_dataset("iris")
									iris = pd.melt(iris, "species", var_name = "measurement")
									fig, ax = plt.subplots()
									sns.stripplot(
										data = iris,
										x = "value", 
										y = "measurement",
										hue = "species",
										dodge = True,
										alpha = .25,
										ax = ax
									)
									ax.set_ylabel("")
								

Why Matplotlib?

... Matplotlib is a pain in the ass to learn.

I know Matplotlib.

Great synergy with seaborn.

Other packages like networkx build on it.

Matplotlib offers very fine control over all figure elements.

For publication-level plots it pays off to learn Matplotlib.

Overview: Python plotting packages

package style pro con
matplotlib object oriented very customizable, many extensions very complex
seaborn declarative very sensible defaults, easy(er) to use less customizable, but can fall back to matplotlib
plotly declarative easy to use, made for interactive visualizations, also available for R less customizable, getting used to web apps might take some getting used to
altair declarative easy to use less customizable
bokeh grammar of graphics rather straight forward to use, can handle very big data less customizable, getting used to web apps might take some getting used to
plotnine grammar of graphics similar to ggplot2 in R, straight forward to use less customizable

Plot types

dim 1 dim 2 type function
numerical - histogram sns.histplot()
numerical categorical bar chart sns.barplot()
numerical time time series ax.plot()
numerical numerical scatter plot sns.scatterplot()

Example study

"New conceptions of truth foster misinformation in online public political discourse"

(-) Conceptions of "truth" splinter into two distinct camps.

(-) "Truth-seeking" aims to uncover factual information and update one's beliefs.

(-) "Belief-speaking" conceptualizes truth as "authenticity" and "speaking one's mind".

(-) We measure "truth-seeking" and "belief-speaking" in tweets by U.S. Congress Members.

Example data

"New conceptions of truth foster misinformation in online public political discourse"

Histogram: number of tweets per account


									import matplotlib.pyplot as plt
									import seaborn as sns
									users = pd.read_csv("users.csv")

									sns.histplot(
										data=users, 
										x="tweet_count"
									)
								

Adapt aspect ration to data


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.histplot(
										data=users, 
										x="tweet_count",
										ax=ax
									)
								

Choose intuitive bin width


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.histplot(
										data=users, 
										x="tweet_count",
										ax=ax,
										bins=range(0, 3510, 250),
										shrink=0.8
									)
								

De-cluttering


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.histplot(
										data=users, 
										x="tweet_count",
										ax=ax,
										bins=range(0, 3510, 250),
										shrink=0.8,
										edgecolor=None
									)
									ax.spines["top"].set_visible(False)
									ax.spines["right"].set_visible(False)
									ax.set_xlabel("Tweet count", fontsize=16)
									ax.set_ylabel("User count", fontsize=16)
								

Barchart: two categories


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.barblot(
										data=belief_speaking, 
										x="proportion",
										y="time_period",
										hue="party",
										ax=ax
									)
								

Use known metaphors


									fig, ax = plt.subplots(figsize=(8, 4))

									sns.barblot(
										data=belief_speaking, 
										x="proportion",
										y="time_period",
										hue="party",
										ax=ax,
										palette=["#0015BC", "#FF0000"],
										hue_order=["Democrat", "Republican"]
									)
								

De-cluttering


									sns.barblot(
										data=belief_speaking, 
										x="proportion",
										y="time_period",
										hue="party",
										ax=ax,
										palette=["#0015BC", "#FF0000"],
										hue_order=["Democrat", "Republican"]
									)
									ax.spines['right'].set_visible(False)
									ax.spines['top'].set_visible(False)
									ax.legend(frameon=False, fontsize=16)
									ax.set_ylabel("")
									ax.set_xlabel("% of Tweets", fontsize=16)
									ax.tick_params(axis='both', labelsize=12) 
								

Alternative category representation


									sns.barblot(
										data=belief_speaking, 
										x="proportion",
										y="party",
										hue="time_period",
										ax=ax,
										palette=[(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)],
										hue_order=["2010 to 2013", "2019 to 2022"]
									)
								

Time-series: number of tweets


									counts = pd.read_csv(
										"tweet_counts.csv",
										parse_dates = ["date"]
									)
									counts = counts.set_index("date")
									dem = counts[counts["party"] == "Democrat"]
									rep = counts[counts["party"] == "Republican"]

									fig, ax = plt.subplots(figsize = (9, 4))
									ax.plot(
										dem.index,
										dem["tweet_count"],
										color = "#0015BC", 
									)
								

Rolling average


									counts = counts.set_index("date")
									dem = counts[counts["party"] == "Democrat"]\
										.rolling("90D").mean()
									rep = counts[counts["party"] == "Republican"]\
										.rolling("90D").mean()

									fig, ax = plt.subplots(figsize = (9, 4))
									ax.plot(
										dem.index,
										dem["tweet_count"],
										color = "#0015BC", 
									)
								

Labels & de-cluttering


									fig, ax = plt.subplots(figsize = (9, 4))
									ax.plot(
										dem.index,
										dem["tweet_count"],
										color = "#0015BC", 
										label="Democrat"
									)
									ax.set_ylabel("Tweet count", fontsize=12)
									ax.legend(frameon=False, loc=6)
									ax.spines['right'].set_visible(False)
									ax.spines['top'].set_visible(False)
								

Annotations


									p_elections = [pd.to_datetime(date) for\
									            date in ["2012-11-06", "2016-11-08", "2020-11-03"]]
									s_elections = [pd.to_datetime(date) for\
									            date in ["2013-11-05", "2014-11-04", "2015-11-03",
									                     "2017-11-07", "2018-11-06", "2019-11-05"]]
									for el in p_elections:
									    ax.plot([el, el], [0, 750], color="k")
									for el in s_elections:
									    ax.plot([el, el], [0, 600], "--", color="grey")
									    
									ax.text(pd.to_datetime("2013-07-01"), 650,
									        "Congress elections", color="grey", fontsize=12)
									ax.text(pd.to_datetime("2015-03-01"), 800,
									        "Presidential elections", color="k", fontsize=12)
								

Scatterplot: followers vs. following


									fig, ax = plt.subplots(figsize = (7, 4))
									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										ax = ax
									)
								

Logarithmic axis scales


									fig, ax = plt.subplots(figsize = (7, 4))
									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										ax = ax
									)
									ax.set_yscale("log")
									ax.set_xscale("log")
								

Labels & de-cluttering


									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										ax=ax
									)
									ax.set_xlabel("followers", fontsize=12)
									ax.set_ylabel("following", fontsize=12)
									ax.legend(frameon=False, fontsize=12)
									ax.spines['right'].set_visible(False)
									ax.spines['top'].set_visible(False)
								

Adding another data dimension


									sns.scatterplot(
										data = users, 
										x = "followers_count", 
										y = "following_count", 
										size = "N_tweets",
										hue = "party",
										palette = ["#0015BC", "#FF0000"],
										hue_order = ["Democrat", "Republican"],
										alpha = 0.3,
										linewidth = 0,
										ax = ax
									)
									ax.legend(frameon = False, loc = 9, fontsize = 12,
										bbox_to_anchor = [1.2, 0.9, 0, 0], 
									)
								

Visualizing words: wordcloud

Visualizing words: scattertext

Visualizing words: Word Shift Graphs

  • Quantify which words contribute to a difference between two texts.
  • Quantify how words contribute the difference.
  • Can be used for comparing texts according to word proportions, sentiment, ... (any dictionary).
  • Ryan Gallagher: "whenever you think about using a wordcloud, use a Word Shift Graph instead".
  • Visualizing networks: spring layout

    Code source

    Visualizing networks: known underlying structure

    Code source

    Random things I thought might be helpful

    Colors

    Coloring for colorblindness

    Color hunt color palettes

    Image filetypes

    
    							plt.savefig("my_figure.png", dpi=300)
    
    							# great to import in Inkscape or Adobe Illustrator
    							plt.savefig("my_figure.svg")
    
    							plt.savefig("my_figure.pdf")
    
    							# this does not work :(
    							plt.savefig("my_figure.eps")
    						

    Plot coding style: use functions

    
    							def PlotSomethingNice(ax, data, param="xy"):
    								...
    								...
    
    							def PlotAnotherNiceThing(ax, data, param="xy"):
    								...
    								...
    
    							fig, axes = plt.subplots(2, 2)
    
    							PlotSomethingNice(axes[0][0], subset1, param="first part")
    							PlotSomethingNice(axes[0][1], subset2, param="second part")
    							PlotAnotherNiceThing(axes[1][0], df, param="do this")
    							PlotAnotherNiceThing(axes[1][1], df, param="do something else")
    						

    Large data

    Problem 1: since every data point is an object, Matplotlib gets VERY slow for large amounts of data.

    Problem 2: figure files also get VERY large because of this.

    Solution 1: subsample your data.

    
    							less_data = lots_of_data.sample(frac=0.05, random_state=42)
    						

    Solution 2: rasterize your figures and save as .png.

    Summary

    Matplotlib + Seaborn offers both high abstraction and excessive detail.

    Use known visual metaphors and de-clutter your plots for maximum clarity.

    Whenever you do something unexpected in your visuals (resampling or excluding data, truncating axes), DESCRIBE IT in the text.