When exploring a dataset, you’ll often want to get a quick understanding of the distribution of certain numerical variables within it. A common way of visualizing the distribution of a single numerical variable is by using a histogram. A histogram divides the values of a numerical variable into “bins” and counts the number of observations that fall into each bin. Drawing these binned counts as adjacent bars gives an immediate, intuitive sense of how the values of the variable are distributed.
This recipe will show you how to go about creating a histogram using R. Specifically, you’ll be using R’s hist() function and ggplot2.
In this example, you’ll visualize the distribution of session duration for a website. The steps in this recipe are divided into the following sections:
- Data Wrangling
- Data Exploration & Preparation
- Data Visualization
You can find implementations of all of the steps outlined below in this example Mode report. Let’s get started.
Data Wrangling
You’ll use SQL to wrangle the data you’ll need for your analysis. For this example, you’ll be using the `sessions` dataset available in Mode’s Public Data Warehouse. Using the schema browser within the editor, make sure your data source is set to the Mode Public Warehouse data source and run the following query to wrangle your data:
Once the SQL query has finished running, rename it to `Sessions` so that you can easily identify it within the R notebook.
Data Exploration & Preparation
Now that you have your data wrangled, you’re ready to move over to the R notebook to prepare your data for visualization. Mode automatically pipes the results of your SQL queries into R dataframes, which are collected in a list assigned to the variable `datasets`. You can use the following line of R to access the results of your SQL query as a dataframe and assign them to a new variable:
`sessions <- datasets[['Sessions']]`
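Before plotting, it can help to take a quick look at the column you’re about to visualize. The sketch below is one way to do that and is not part of the original recipe. So that it runs on its own, it replaces `datasets[['Sessions']]` with a simulated stand-in; the simulated values and the `sessions_clean` name are assumptions made for illustration.

```r
# Stand-in for datasets[['Sessions']]; these values are simulated, not Mode data.
set.seed(42)
sessions <- data.frame(
  session_duration_seconds = c(round(rexp(500, rate = 1 / 12)), NA, NA)
)

# Confirm the column exists and is numeric before plotting.
str(sessions)

# Minimum, quartiles, mean, maximum, and the number of missing values.
summary(sessions$session_duration_seconds)

# Keep only rows with a usable, non-negative duration.
keep <- !is.na(sessions$session_duration_seconds) &
  sessions$session_duration_seconds >= 0
sessions_clean <- sessions[keep, , drop = FALSE]
nrow(sessions_clean)
```

In a Mode notebook you would skip the simulated data frame and run the same checks on the real `sessions` variable.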
Data Visualization
To create a histogram, you’ll use R’s `hist()` function. Since you are only interested in the distribution of the `session_duration_seconds` variable, you’ll pass that column to `hist()` so that the plot is limited to the variable of interest:
`# Using hist() function in base graphics to make a histogram
histinfo <- hist(sessions$session_duration_seconds, main = "Histogram with Default Parameters")`
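The object returned by `hist()` holds the bins and counts behind the plot, which ties back to the definition of a histogram at the top of this recipe. The sketch below uses a small hand-made vector (illustrative values, not the sessions data) so the counts can be checked by eye, and reproduces them with `cut()` and `table()`.

```r
# Small hand-made vector so the bin counts can be checked by eye.
durations <- c(1, 2, 2, 3, 5, 7, 8, 8, 9, 12, 14, 19)

# plot = FALSE returns the histogram object without drawing anything.
histinfo <- hist(durations, breaks = seq(0, 20, by = 5), plot = FALSE)

histinfo$breaks  # bin edges: 0 5 10 15 20
histinfo$counts  # observations per bin: 5 4 2 1
histinfo$mids    # bin midpoints: 2.5 7.5 12.5 17.5

# The same counts computed by hand: cut() assigns each value to a bin and
# table() counts them. By default hist() uses right-closed bins (a, b] and
# places values equal to the lowest break in the first bin.
manual_counts <- as.vector(table(cut(durations,
                                     breaks = histinfo$breaks,
                                     include.lowest = TRUE)))
manual_counts
```

Note that the value 5 lands in the first bin, not the second, because the bins are closed on the right.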
You can further customize the appearance of your histogram by passing additional parameters to the `hist()` function:
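`# Label the axes, fix the axis ranges, and fill the bars
hist(sessions$session_duration_seconds, main = "Adding grid lines and ticks", xlab = "Session Duration (in seconds)", ylab = "Count", xlim = c(0, 55), ylim = c(0, 49000), col = "lightgrey")
# Dashed horizontal grid lines: ticks on the right-hand axis stretched across the plot (tck = 1)
axis(4, labels = FALSE, col = "lightgrey", lty = 2, tck = 1)`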
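The parameters above change how the plot looks, but `xlim` only crops the view and does not change the bins themselves. To control the bins, `hist()` also accepts a `breaks` argument. The sketch below uses simulated durations (an assumption, so that it runs on its own) to contrast a suggested number of bins with an explicit vector of bin edges; the 5-second width is an arbitrary choice for illustration.

```r
# Simulated durations in seconds; a stand-in for the real column.
set.seed(1)
durations <- round(rexp(1000, rate = 1 / 12))

# A single number is only a suggestion; hist() rounds to "pretty" bin edges.
h_suggested <- hist(durations, breaks = 20, plot = FALSE)

# A vector of edges is used exactly. It must cover the full range of the data.
edges <- seq(0, max(durations) + 5, by = 5)
h_exact <- hist(durations, breaks = edges, plot = FALSE)

length(h_suggested$counts)  # number of bins hist() actually chose
length(h_exact$counts)      # number of 5-second bins
diff(h_exact$breaks)        # every bin is exactly 5 seconds wide
```

Explicit edges are useful when you want bins that line up with meaningful units, such as 5-second or 1-minute intervals.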
You can also use ggplot2 to create and style histograms in R, with additional features such as kernel density estimates:
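`library(ggplot2)
p <- ggplot(sessions, aes(x = session_duration_seconds)) +
  geom_histogram(aes(y = ..density..), # Histogram with density instead of count on y-axis (newer ggplot2 versions prefer after_stat(density))
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#FF6666") # Assumed density overlay: the layer after the trailing "+" was missing from the source
p <- p + labs(x = "Session Duration (in seconds)", y = "Density", title = "Density Curve using ggplot2") + coord_fixed(ratio = 100)
p
# The start of this call was cut off in the source; ggsave() and the file name are assumptions
ggsave("session_duration_histogram.png", plot = p,
       width = 5,
       height = 8,
       dpi = 1200)`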
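If ggplot2 is not available, a similar density overlay can be drawn with base graphics. This is a minimal sketch rather than part of the original recipe, and it uses simulated durations (an assumption) so that it runs on its own. The key step is `freq = FALSE`, which puts the histogram on the density scale so that it is comparable with the curve from `density()`.

```r
# Simulated durations in seconds; a stand-in for the real column.
set.seed(7)
durations <- rexp(1000, rate = 1 / 12)

# freq = FALSE plots density instead of counts, so the bar areas sum to 1.
h <- hist(durations, freq = FALSE, col = "white", border = "black",
          main = "Density curve using base graphics",
          xlab = "Session Duration (in seconds)")

# Kernel density estimate with the default Gaussian kernel and bandwidth.
kde <- density(durations)
lines(kde, col = "red", lwd = 2)

# Sanity check: bar areas (density * bin width) add up to 1.
sum(h$density * diff(h$breaks))
```

Because both the bars and the curve are on the density scale, you can read them against the same y-axis.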