Distribution plots

A distribution plot shows how the values of a single continuous variable are spread across their range, using visual elements such as histograms or density curves.

By showing the shape, skewness, and variability of the data, distribution plots help you understand the underlying distribution, identify anomalies or outliers, and make informed decisions about statistical modeling and analysis.

import plotly.io as pio

pio.renderers.default = "sphinx_gallery"
import plotly.express as px
import statsplotly

distplot requires one dimension argument over which the distribution is computed and plotted :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
)
fig.show()

Slicing data

Data can be sliced along a dimension with slicer, and a color palette can be specified with color_palette :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
    color_palette="Set2_r",
    slicer="sex",
)
fig.show()

Set equal_bins to True to use the same bins across slices :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
    color_palette="Set2_r",
    equal_bins=True,
    slicer="sex",
)
fig.show()
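
To make the idea of shared bins concrete, here is a minimal sketch of one way to compute common bin edges from the pooled values and then count each slice against them. It uses only the standard library and toy values; the helper names `shared_bin_edges` and `bin_counts` are hypothetical, and this is not statsplotly's internal implementation.

```python
# Illustrative sketch only: the helpers below are hypothetical and do not
# reflect how statsplotly computes its bins internally.
def shared_bin_edges(groups, n_bins):
    """Return n_bins + 1 equally spaced edges spanning the pooled values."""
    pooled = [v for values in groups.values() for v in values]
    lo, hi = min(pooled), max(pooled)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_counts(values, edges):
    """Count values per bin; the last bin includes its right edge."""
    n_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    counts = [0] * n_bins
    for v in values:
        # Map the value to a bin index, clamping the maximum into the last bin.
        idx = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy values standing in for two slices of a column (assumed data).
groups = {"A": [10.0, 12.0, 18.0, 30.0], "B": [14.0, 22.0, 26.0, 50.0]}
edges = shared_bin_edges(groups, n_bins=4)
counts = {name: bin_counts(values, edges) for name, values in groups.items()}
print(edges)
print(counts)
```

Because both slices are counted against the same edges, their bars line up bin by bin and can be compared directly.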

By default, the distributions of the slicer levels are overlaid in the order in which they appear in the DataFrame.

Use slice_order to choose an ordering better suited to this particular visualization :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
    color_palette="Set2_r",
    equal_bins=True,
    slice_order=["Male", "Female"],
    slicer="sex",
)
fig.show()

Setting histogram norm

The histnorm parameter controls the normalization of bins :

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
    color_palette="Set2_r",
    equal_bins=True,
    histnorm="probability",
    slicer="sex",
)
fig.show()
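
As a rough illustration of what a "probability" normalization means, the sketch below divides each bin count by the total count so that the bar heights sum to 1. This assumes the parameter follows the same definition as Plotly's histogram histnorm option; the function name `probability_norm` and the counts are hypothetical.

```python
# Illustrative sketch only: assumes "probability" means count / total count,
# as in Plotly's histogram normalization.
def probability_norm(counts):
    """Scale bin counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


counts = [3, 0, 1, 0]  # assumed toy bin counts
probabilities = probability_norm(counts)
print(probabilities)
```

With this normalization, slices of different sizes can be compared by shape rather than by raw counts.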

Combining the step, rug, hist and kde parameters allows fine control over how the underlying distribution is represented.

The central tendency of the distribution can also be plotted :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
    rug=True,
    kde=True,
    hist=False,
    bins=20,
    color_palette="Set2_r",
    central_tendency="mean",
    slicer="sex",
)
fig.show()

Empirical Cumulative Distribution Function

Set ecdf to True to plot the Empirical Cumulative Distribution Function :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    x="total_bill",
    ecdf=True,
    hist=True,
    bins=50,
    equal_bins=True,
    histnorm="probability",
    color_palette="Set2_r",
    slicer="sex",
)
fig.show()
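
For readers unfamiliar with the ECDF, the following minimal sketch shows the standard definition: F(x) is the fraction of observations less than or equal to x. It uses only the standard library and toy values; the helper name `ecdf` is hypothetical and the code is not taken from statsplotly.

```python
from bisect import bisect_right


# Illustrative sketch only: a textbook ECDF, not statsplotly's implementation.
def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


sample = [10.0, 12.0, 18.0, 30.0]  # assumed toy data
F = ecdf(sample)
print(F(9.0), F(12.0), F(18.0), F(30.0))
```

The resulting step function rises from 0 to 1 and does not depend on any choice of bins, which is why it complements a histogram well.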

Horizontal histograms

Specifying a y dimension instead of x plots a horizontal histogram.

Here we also use step lines instead of filled bars to display the bins :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    y="total_bill",
    step=True,
    hist=True,
    equal_bins=False,
    bins=20,
    color_palette="Set2",
    slicer="sex",
    central_tendency="median",
)
fig.show()

Drawing horizontal and vertical lines

The hlines and vlines parameters are convenience arguments for drawing horizontal or vertical lines attached to a slice of the data.

This is useful to highlight particular values on the distribution :

df = px.data.tips()

fig = statsplotly.distplot(
    data=df,
    y="total_bill",
    step=True,
    hist=True,
    equal_bins=False,
    bins=20,
    color_palette="Set2",
    slicer="sex",
    central_tendency="median",
    hlines={"Female": ("Actual value", 50), "Male": ("Actual value", 60)},
)
fig.show()

Full details of the API : distplot().