[This article was first published on coding-the-past, and kindly contributed to R-bloggers].


1. What is a violin plot?

A violin plot is a mirrored density plot that is rotated 90 degrees as shown in the picture. It depicts the distribution of numeric data.

Visual description of what a violin plot is. First a density curve is shown. Second, a mirrored version of it is shown and lastly it is rotated by 90 degrees.
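The three steps in the picture can be reproduced by hand. The sketch below is only an illustration with simulated values (the data and the object names are assumptions, not part of the original example): it estimates a density curve, mirrors it around zero and draws it with the data values on the vertical axis. geom_violin() does this work for you and differs in details such as trimming and scaling.

```r
library(ggplot2)

# Simulated values, used only to illustrate the construction
set.seed(42)
values <- rnorm(200, mean = 10, sd = 2)

# Step 1: estimate the density curve
d <- density(values)

# Steps 2 and 3: mirror the curve around zero and put the data values
# on the vertical axis, which is the "rotated" view
violin_outline <- data.frame(
  width = c(d$y, -rev(d$y)),
  value = c(d$x, rev(d$x))
)

p_outline <- ggplot(violin_outline, aes(x = width, y = value)) +
  geom_polygon(fill = "gray80", color = "black")

# Run print(p_outline) to draw the hand-made violin
```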


2. When should you use a violin plot?

A violin plot is useful to compare the distribution of a numeric variable across different subgroups in a sample. For instance, the distribution of heights of a group of people could be compared across gender with a violin plot.
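As a quick illustration of the height example, the sketch below uses simulated heights. The means and standard deviations are arbitrary assumptions chosen for the demonstration, not real measurements.

```r
library(ggplot2)

# Simulated heights in centimeters; the parameters are made up for illustration
set.seed(1)
heights <- data.frame(
  gender = rep(c("Female", "Male"), each = 100),
  height_cm = c(rnorm(100, mean = 165, sd = 7),
                rnorm(100, mean = 178, sd = 7))
)

# One violin per gender, so the two distributions can be compared side by side
p_heights <- ggplot(heights, aes(x = gender, y = height_cm)) +
  geom_violin()

# Run print(p_heights) to draw the plot
```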


3. How to code a ggplot2 violin plot?

First, map the variable you want to use to split your sample into groups to the x position aesthetic in ggplot2. Second, map the numeric variable whose distribution you would like to analyze to the y position aesthetic. This is done with aes(x = dimension, y = variable_of_interest) inside the ggplot() function. The last step is to add the geom_violin() layer.
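Before moving to the amphitheater data, here is a minimal sketch of these steps on the iris dataset that ships with R. It places the grouping variable (Species) on x and the numeric variable (Sepal.Length) on y, the same layout used in the amphitheater code further down. The choice of iris is ours, purely for illustration.

```r
library(ggplot2)

# Steps 1 and 2: map the grouping and the numeric variable inside aes()
p_iris <- ggplot(data = iris, aes(x = Species, y = Sepal.Length))

# Step 3: add the violin layer
p_iris <- p_iris + geom_violin()

# Run print(p_iris) to draw the plot
```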

To exemplify these steps, we will examine the capacity of Roman amphitheaters across different regions of the Roman Empire. The data for this comes from the cawd R package, maintained by Professor Sebastian Heath. This package contains several datasets about the Ancient World, including one about the Roman Amphitheaters. To install the package, use devtools::install_github("sfsheath/cawd").


Learn more about Roman amphitheaters in this informative article by Laura Klar, Department of Greek and Roman Art, The Metropolitan Museum of Art:

Theater and Amphitheater in the Roman World

After loading the package, use data() to see the available data frames. We will be using the ramphs dataset, which contains characteristics of the Roman amphitheaters. For this example, we will use column 2 (title), column 7 (capacity) and column 8 (mod.country), which specifies the modern country where the amphitheater was located. We will also consider only the three modern countries with the largest number of amphitheaters: Tunisia, France and Italy. The code below loads and filters the relevant data.


library(cawd)
library(ggplot2)

# Store the dataset in df1
df1 <- ramphs

# Select all rows of relevant columns
df2 <- df1[ ,c(2,7,8)]

# Filter only the rows where modern country is either Tunisia, France or Italy
df3 <- df2[df2$mod.country %in% c("Tunisia", "France", "Italy"), ]

# Delete NAs
df4 <- na.omit(df3)

# Plot a basic ggplot2 violin plot
ggplot(data = df4, aes(x=mod.country, y=capacity))+
  geom_violin()

Basic violin plot
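The selection and filtering steps can also be written by column name rather than by position, which tends to be easier to read. The sketch below runs on a tiny toy data frame that only borrows the column names title, capacity and mod.country. All rows are invented placeholders, not values from ramphs.

```r
# Toy stand-in for ramphs: titles and capacities are invented placeholders
toy <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  capacity = c(5000, NA, 12000, 8000, 20000),
  mod.country = c("Tunisia", "Italy", "Spain", "France", "Italy"),
  stringsAsFactors = FALSE
)

keep <- c("Tunisia", "France", "Italy")

# Same logic as df2 to df4: pick columns, keep three countries, drop NAs
toy_clean <- na.omit(subset(
  toy,
  mod.country %in% keep,
  select = c(title, capacity, mod.country)
))

print(toy_clean)
```

With the real data, the same call would take ramphs in place of toy.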

We can further customize this plot to make it look better and fit this page theme. In the code below we improve the following aspects:

  • geom_violin(color = "#FF6885", fill = "#2E3031", size = 0.9) changes the outline color, the fill color and the line width of the violin plot;
  • geom_jitter(width = 0.05, alpha = 0.2, color = "gray") adds the individual data points, jittered to reduce overplotting and show where the points are concentrated;
  • coord_flip() flips the two axes so that it is more evident that a violin plot is simply a mirrored density curve;
  • the remaining layers add a title, axis labels and a new theme to the plot.


To learn more about geom_jitter, please see this link.
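One detail worth knowing: jittering is random, so the points land in slightly different places every time the plot is drawn, and by default geom_jitter() also nudges points along the numeric axis. The sketch below, again on the built-in iris data (our choice, for illustration only), passes position_jitter() with a seed so the result is reproducible and sets height = 0 so the numeric values are not moved.

```r
library(ggplot2)

p_jitter <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  geom_jitter(
    # seed fixes the random offsets; height = 0 leaves the y values untouched
    position = position_jitter(width = 0.05, height = 0, seed = 123),
    alpha = 0.2, color = "gray"
  )

# Run print(p_jitter) to draw the plot
```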


ggplot(data = df4, aes(x=mod.country, y=capacity))+
  geom_violin(color = "#FF6885", fill = "#2E3031", size = 0.9)+
  geom_jitter(width = 0.05, alpha = 0.2, color = "gray")+
  ggtitle("Roman Amphitheaters")+
  xlab("Modern Country")+
  ylab("Capacity of Spectators")+
  coord_flip()+
  theme_bw()+
  theme(text=element_text(color = 'white'),
      # Changes panel, plot and legend background to dark gray:
      panel.background = element_rect(fill = '#2E3031'),
      plot.background = element_rect(fill = '#2E3031'),
      legend.background = element_rect(fill='#2E3031'),
      legend.key = element_rect(fill = '#2E3031'),
      # Changes legend texts color to white:
      legend.text =  element_text(color = 'white'),
      legend.title = element_text(color = 'white'),
      # Changes color of plot border to white:
      panel.border = element_rect(color = 'white'),
      # Eliminates grids:
      panel.grid.minor = element_blank(),
      panel.grid.major = element_blank(),
      # Changes color of axis texts to white
      axis.text.x = element_text(color = 'white'),
      axis.text.y = element_text(color = 'white'),
      axis.title.x = element_text(color= 'white'),
      axis.title.y = element_text(color= 'white'),
      # Changes axis ticks color to white
      axis.ticks.y = element_line(color = 'white'),
      axis.ticks.x = element_line(color = 'white'),
      legend.position = "bottom")

Final violin plot
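If you plan to reuse this dark styling in several plots, one option is to wrap the theme() call in a small function. The helper below, theme_dark_page(), is a hypothetical name of our own and not part of ggplot2. It is a shortened sketch of the theme above, shown on the built-in iris data.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): bundles the dark page styling
theme_dark_page <- function(bg = "#2E3031", fg = "white") {
  theme_bw() +
    theme(
      text = element_text(color = fg),
      axis.text = element_text(color = fg),
      axis.ticks = element_line(color = fg),
      panel.background = element_rect(fill = bg),
      plot.background = element_rect(fill = bg),
      panel.border = element_rect(color = fg, fill = NA),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank()
    )
}

p_dark <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(color = "#FF6885", fill = "#2E3031") +
  coord_flip() +
  theme_dark_page()

# Run print(p_dark) to draw the plot
```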

Note that amphitheaters in the territory of modern Tunisia tended to have less variation in their capacity and most of them were below 10,000 spectators. On the other hand, amphitheaters in the Italian Peninsula exhibit greater variation.

Can you guess what the outlier at the far right of the Italian distribution is? Yes! It’s the Flavian Amphitheater at Rome, also known as the Colosseum, with an impressive capacity of 50,000 people. If you have any questions, please feel free to comment below!
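These visual impressions can be checked numerically. The sketch below uses a small placeholder data frame in place of df4: only the 50,000 figure comes from the text, and every other row is invented for illustration. It computes the spread of capacity per country and looks up the row with the largest capacity.

```r
# Placeholder rows standing in for df4; only the 50,000 figure is from the text
amph <- data.frame(
  title = c("Colosseum", "Site B", "Site C", "Site D", "Site E", "Site F"),
  capacity = c(50000, 20000, 9000, 7000, 6000, 15000),
  mod.country = c("Italy", "Italy", "Tunisia", "Tunisia", "Tunisia", "France"),
  stringsAsFactors = FALSE
)

# Spread of capacity within each country (interquartile range)
spread <- aggregate(capacity ~ mod.country, data = amph, FUN = IQR)
print(spread)

# Row with the largest capacity, i.e. the point at the far end of the plot
largest <- amph[which.max(amph$capacity), ]
print(largest)
```

On the real data, df4[which.max(df4$capacity), ] would return the corresponding row.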


4. Conclusions

  • A violin plot is a mirrored density curve that is useful for exploring and comparing data distributions;
  • Coding a ggplot2 violin plot can be easily accomplished with geom_violin().


Continue reading: Unveiling Roman Amphitheaters with a ggplot2 violin plot

Analyses and Implications of Utilizing Violin Plots in Data Visualization

The article above describes what violin plots are and how to build them in R. These plots are mirrored density plots that depict the distribution of numeric data. The article also provides an illustrative example of how to generate a violin plot with the ggplot2 package.

Long-Term Implications

Violin plots have far-reaching applications in data analysis, well beyond R programming. They present an intuitive and compact way to visualize and compare data distributions across different subgroups or categories within a dataset. This is beneficial in diverse fields such as finance, sales, healthcare, physics and the social sciences.

To exemplify these cases, imagine a company trying to compare its monthly sales across different regions or a healthcare researcher analyzing the spread of disease symptoms across diverse demographic subgroups. Violin plots can offer excellent visual insights into these exploratory data questions.

Possible Future Developments

While violin plots have significant merits, they show one numeric variable at a time, and conveying multivariate distributions intuitively and compactly remains an open question. Developing visual aids for that purpose is therefore a promising direction for improving data analysis capability.

Besides, as the importance of presenting complex data in accessible formats continues to grow across industries, we can expect an increasing number of tools and programming languages to adopt and refine violin plot capabilities.

Actionable Advice

For both seasoned coders and beginners in data analysis, continue exploring and honing violin plot techniques. Given the growing analytics demand across industries, developing skills in efficiently conveying complex data insights puts you at an advantage.

Educational institutions should consider integrating data visualization techniques such as violin plots in their curriculum, given the pressing need to comprehend and convey complex data across academic disciplines.

Meanwhile, companies should encourage data analysis literacy among employees, enabling them to understand and utilize such visual tools for better business decisions. Providing easy-to-understand resources and opportunities for learning would be a significant starting point in this direction.

Lastly, future developers should consider the idea of designing more user-friendly tools that help generate violin plots as well as other forms of data visualizations, with minimal coding know-how.

Note: The use of any software or package such as R or ggplot2 should align with their usage license agreements and guidelines.
