The Goal

As a data scientist, you often find yourself creating a notebook that documents a new model or an update to an existing one. Recently I was in exactly this situation, analyzing how my model performed within windows of time across the prediction period. I wanted a visualization to show these results.

The Problem

The first visualization I came up with, shown below, displayed the counts of correct and incorrect predictions within each window. My predictions appeared to improve over time because the number of incorrect predictions decreased. However, the total number of predictions was also decreasing over time, so while this chart is helpful in some ways, it is also a bit misleading.

# library() stops with an error if a package is missing; require() only returns FALSE
library(dplyr)
library(ggplot2)

pred_stats <- read.csv('pred_stats.csv')

ggplot() + 
  geom_bar(data = pred_stats, aes(fill=pred_correct, x=current_days_down_period, y=count), position="stack", stat="identity")+
  labs(title = 'Random Forest Model Predicting on 2020',        
       subtitle = 'Prediction Performance by Current Days Not Flown Period',
       x = 'Current Days Not Flown',
       y = 'Count',
       fill = '') 

I decided to switch from counts to percentages. The chart below confirms that the previous plot was a little misleading: my predictions actually became slightly worse as time went on.

ggplot() + 
  geom_bar(data = pred_stats, aes(fill=pred_correct, x=current_days_down_period, y=count), position="fill", stat="identity")+
  labs(title = 'Random Forest Model Predicting on 2020',        
       subtitle = 'Prediction Performance by Current Days Not Flown Period',
       x = 'Current Days Not Flown',
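       y = 'Percent',
       fill = '') +
  # position = "fill" produces proportions from 0 to 1; label them as percentages
  scale_y_continuous(labels = scales::percent)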

I could just stick with this visualization, but showing that the number of predictions was decreasing over time was also important to me. I could use both plots, but I had this plot for multiple years and multiple models, so using both would put a lot of plots in my notebook. I decided to combine the two: keep the first plot with the counts, then overlay a line plot showing the percent of incorrect predictions.

Hadley Wickham, the creator of ggplot2, is not a fan of dual axis plots. His words on the topic: “I agree that they can be useful when the axes are simple linear transformations of each other, but I don’t think they’re useful enough for me to spend hours to implement them.”

Because of this, a dual axis plot is challenging to implement in ggplot2, and is really only possible when the secondary axis is a one-to-one transformation of the primary axis. In my case the two axes are related, although not in a way that maps directly onto that requirement.

Scaling the Second Axis

As mentioned above, when you create a secondary axis in ggplot2 it has to relate to the first axis. My secondary axis was the percentage of incorrect predictions, while the primary axis showed the total count of predictions. While these two are related, there wasn’t an easy operation to convert from the count to the percent. However, the counts on all of my plots topped out at just over 200, so I made the secondary axis half the primary: a count of 200 on the left lines up with 100 percent on the right. You can see the syntax for setting up the two axes in the code below.

scale_y_continuous(name = 'Count', sec.axis = sec_axis(~./2, name = 'Percent Incorrect'))

You also need to apply the inverse transformation to your second y variable, because it is still plotted against the primary axis and does not actually scale off of the secondary axis. In this case, that meant defining my y variable like this in both geom_point() and geom_line():

y = percent_incorrect*2

These two steps created the secondary axis and made sure that the secondary y variable lined up with it properly.
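The factor of 2 was specific to my data, where the tallest bar was just over 200. Here is a minimal sketch of the general idea, using made-up numbers and a hypothetical scale_factor variable (neither comes from my notebook). Pick the factor so that 100 percent lands near the top of the count axis, multiply the percentages by it for plotting, and divide by it inside sec_axis().

```r
# Made-up per-period totals and percentages, for illustration only
totals <- c(210, 160, 120, 90, 60)
percent_incorrect <- c(20, 25, 30, 35, 40)

# Choose the factor so that 100% maps to roughly the tallest bar
scale_factor <- round(max(totals) / 100)   # 2 for these totals

# Percent -> position on the primary (count) axis, as used in aes(y = ...)
y_plotted <- percent_incorrect * scale_factor

# Position on the primary axis -> percent, as used in sec_axis(~ . / scale_factor)
percent_back <- y_plotted / scale_factor

# The two transformations undo each other
stopifnot(isTRUE(all.equal(percent_back, percent_incorrect)))
```

If the two transformations are not exact inverses of each other, the line will still draw, but the values read off the secondary axis will be wrong.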

Fixes to geom_line Issues

I thought I would be set at this point, but my line was not showing up on the plot. The points appeared correctly, with no line connecting them. The first fix was to set group = 1 in geom_line, which puts all the points in a single group so they are drawn as one line.
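My understanding of why group = 1 is needed: when the x variable is categorical, ggplot2 by default puts each category in its own group, so every "line" contains a single point and nothing gets drawn. The sketch below illustrates this with made-up data. The toy data frame and its column names are hypothetical, and it assumes a factor on the x axis, as I had.

```r
library(ggplot2)

# Made-up data with a categorical x, for illustration only
toy <- data.frame(
  period = factor(c("0-30", "31-60", "61-90"),
                  levels = c("0-30", "31-60", "61-90")),
  pct = c(20, 25, 30)
)

# Each period is its own group, so geom_line() has nothing to connect
p_no_line <- ggplot(toy, aes(x = period, y = pct)) +
  geom_point() +
  geom_line()

# Putting every row in one group connects the points
p_line <- ggplot(toy, aes(x = period, y = pct)) +
  geom_point() +
  geom_line(group = 1)
```

Printing p_no_line shows only the points, while p_line shows the points joined by a line.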

The second fix took longer to figure out. Initially I had created a column holding the percentage of incorrect predictions on each row that represented an incorrect prediction count. For rows representing correct predictions I just put NA, because I only wanted to show the percentage of incorrect predictions. The code that builds this column is below.

pred_stats %>% 
  mutate(percent_incorrect = ifelse(pred_correct == 'Incorrect',
                             count/total*100,
                             NA))

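Knowing there were NAs in the column used for the y variable, I added na.rm = TRUE to my geom_line to deal with this. Yet my line was still not showing up. The most likely reason is that geom_line treats an NA in the middle of a line as a break, and na.rm = TRUE only silences the warning for NAs at the start or end of a line. With an NA row sitting between my incorrect rows, there were no segments left to draw. I decided to try a different approach and create a separate dataframe of incorrect percentages, which removes the NAs. That dataframe is created below.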

pred_stats_per <- pred_stats %>% 
  filter(pred_correct == 'Incorrect') %>% 
  mutate(percent_incorrect = count/total*100)

pred_stats_per
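Both snippets above assume that pred_stats already has a total column holding the number of predictions in each period. If your data only has the counts, here is a minimal sketch of deriving it with dplyr. The toy_stats data frame and its values are made up for illustration.

```r
library(dplyr)

# Made-up counts in the same shape as pred_stats, for illustration only
toy_stats <- data.frame(
  current_days_down_period = rep(c("0-30", "31-60"), each = 2),
  pred_correct = rep(c("Correct", "Incorrect"), times = 2),
  count = c(160, 40, 90, 30)
)

toy_per <- toy_stats %>%
  group_by(current_days_down_period) %>%
  mutate(total = sum(count)) %>%            # predictions per period
  ungroup() %>%
  filter(pred_correct == "Incorrect") %>%   # keep one row per period, no NAs
  mutate(percent_incorrect = count / total * 100)

toy_per$percent_incorrect   # 20 25
```

Computing the total before filtering matters here: after the filter, only the incorrect counts remain and the per-period total can no longer be recovered.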

Using the original dataframe for the counts in the geom_bar and then this new dataframe for the percentages in the geom_point and geom_line worked. The results are below.

ggplot() + 
  geom_bar(data = pred_stats, aes(fill=pred_correct, x=current_days_down_period, y=count), position="stack", stat="identity") +
  geom_point(data = pred_stats_per, aes(x=current_days_down_period, y=percent_incorrect*2), size = 3, color = 'black') +
  geom_line(data = pred_stats_per, aes(x=current_days_down_period, y=percent_incorrect*2), color = 'black', size = 1, group = 1, na.rm = TRUE) +
  scale_fill_manual(values = c("Correct" = "#00C1AA", "Incorrect" = "#FF6666")) +
  scale_y_continuous(name = 'Count',
                     sec.axis = sec_axis(~./2, name = 'Percent Incorrect')) +
  labs(title = 'Random Forest Model Predicting on 2020',        
       subtitle = 'Prediction Performance by Current Days Not Flown Period',
       x = 'Current Days Not Flown',
       fill = '') 

Adjusting the Legend & Scaling the Axes

This plot showed exactly what I wanted: the relationship between the counts of correct and incorrect predictions, as well as the percentage of incorrect predictions. I was nearly done, but couldn’t help trying to add to the legend that the line represented the “Percent Incorrect”. Luckily this proved to be a simple addition: I added lty = 'Percent Incorrect' to the aes of geom_line.

Additionally I added scale_linetype('') to remove an unwanted title on the new legend.

My final adjustment was to set the tick marks on each axis using breaks = seq(), as shown in the code below.

ggplot() + 
  geom_bar(data = pred_stats, aes(fill=pred_correct, x=current_days_down_period, y=count), position="stack", stat="identity") +
  geom_point(data = pred_stats_per, aes(x=current_days_down_period, y=percent_incorrect*2), size = 3, color = 'black') +
  geom_line(data = pred_stats_per, aes(x=current_days_down_period, y=percent_incorrect*2, lty = 'Percent Incorrect'), color = 'black', group = 1, na.rm = TRUE) +
  scale_fill_manual(values = c("Correct" = "#00C1AA", "Incorrect" = "#FF6666")) +
  scale_linetype('') +
  scale_y_continuous(name = 'Count',
                     breaks = seq(0,250,25),
                     sec.axis = sec_axis(~./2, name = 'Percent Incorrect', breaks = seq(0,105,15))) +
  labs(title = 'Logistic Regression Model Predicting on 2019',
       subtitle = 'Prediction Performance by Current Days Not Flown Period',
       x = 'Current Days Not Flown',
       fill = '')
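Since I needed this plot for several years and models, one option is to wrap the steps in a function. The sketch below is only an illustration: plot_pred_performance and its scale_factor argument are hypothetical names, the toy data is made up, and it assumes the input has the same columns used throughout this post (current_days_down_period, pred_correct, count, total).

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper; `stats` must have the columns used in this post
plot_pred_performance <- function(stats, title, scale_factor = 2) {
  # Separate data frame of incorrect percentages, so the line has no NAs
  stats_per <- stats %>%
    filter(pred_correct == "Incorrect") %>%
    mutate(percent_incorrect = count / total * 100)

  ggplot() +
    geom_bar(data = stats,
             aes(x = current_days_down_period, y = count, fill = pred_correct),
             position = "stack", stat = "identity") +
    geom_point(data = stats_per,
               aes(x = current_days_down_period,
                   y = percent_incorrect * scale_factor),
               size = 3, color = "black") +
    geom_line(data = stats_per,
              aes(x = current_days_down_period,
                  y = percent_incorrect * scale_factor,
                  lty = "Percent Incorrect"),
              color = "black", group = 1) +
    scale_fill_manual(values = c("Correct" = "#00C1AA", "Incorrect" = "#FF6666")) +
    scale_linetype("") +
    scale_y_continuous(name = "Count",
                       sec.axis = sec_axis(~ . / scale_factor,
                                           name = "Percent Incorrect")) +
    labs(title = title, x = "Current Days Not Flown", fill = "")
}

# Made-up data, for illustration only
toy_stats <- data.frame(
  current_days_down_period = rep(c("0-30", "31-60"), each = 2),
  pred_correct = rep(c("Correct", "Incorrect"), times = 2),
  count = c(160, 40, 90, 30),
  total = c(200, 200, 120, 120)
)

p <- plot_pred_performance(toy_stats, title = "Toy Model Predicting on Toy Data")
```

Keeping the scale factor as a single argument means the multiplication in aes() and the division in sec_axis() cannot drift apart.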

Conclusion

Like many problems, attempting to solve this one led to more issues, but in the end I am happy with my final result and what it will communicate to my reader. In my research I was unable to find a single resource that covered this problem, and had to piece together solutions from multiple sources. I am hopeful that documenting my struggles will benefit those of you who run into the same problem in the future.