This post is part 2 of a series on Data Manipulation in R. In part 1, my colleague showed the uses of the
data.table package. In this part I will explore the
dplyr package using the same examples used in part 1.
dplyr is a package for the R programming language and is part of the tidyverse. The purpose of the package is simple data manipulation. Documentation can be found here.
dplyr can be used with any basic dataframe. It can also be used with a data.table object, but that would be a little pointless since data.table has similar functions.
We will be using the same data that was used in the data.table blog. As a reminder, the data consists of information on fights in the Ultimate Fighting Championship (UFC) for select fighters. Each row is a single fight between two fighters,
F2 for Fighter 1 and Fighter 2, respectively. For each fight, the data includes the fighter names, the date and name of the event, the weight class, whether or not the fight was a title fight for a championship belt, the method of victory for the winning fighter, the time and round in which the fight ended, the reach and height of each fighter, whether or not Fighter 1 won the fight, and the number of strikes landed and attempted for each fighter.
library(dplyr) df <- read.csv('https://raw.githubusercontent.com/NateLatshaw/Tutorials/main/data/ufc_fighter_histories.csv')
One of the arguments made in advantage of dplyr is its readability. The syntax is very easy to learn and easy to understand even for someone not familiar with the package.
Like all packages in the tidyverse, dplyr uses the pipe
%>% (short-cut in rstudio: CTRL+Shift+M) which makes the code easier to write and read.
The syntax is:
dataframe %>% verb()
The pipe can be used to chain together multiple verbs at the same time.
The verbs used to perform the functions are generally self-explanatory. We will focus on some of the most used verbs.
Filtering Rows and Selecting Columns
We will filter our dataframe for the opponents and dates of all of Conor McGregor’s UFC fights. To accomplish this we will use both
df %>% filter(F1_Name == 'Conor McGregor') %>% select(Opponent = F2_Name, FightDate = Date)
In the example above you see that we chained together filter and select. It is possible to filter for more than one condition, simply seperate the conditions with a ‘,’. It is also important to note you can filter using the following logical and boolean operators:
< > <= >= == is.na() !is.na() %in% ! | & xor()
select, is used to select the columns you would like returned, in addition the columns are renamed in the same step simply using the ‘=’ symbol. Another usual functionality is if you want to select all but one column, instead of listing the columns you want put a ‘-’ symbol in front of the column(s) you don’t want.
Creating and Manipulating Columns
To create a new column in a
dplyr, we use the verb
mutate. We will create the same new columns as in the data.table blog, striking accuracy. We compute striking accuracy by dividing the number of strikes landed by the number of strikes attempted for each fighter.
df %>% mutate(F1_StrikeAccuracy = F1_StrikesLanded / F1_StrikesAttempted, F2_StrikeAccuracy = F2_StrikesLanded / F2_StrikesAttempted) %>% select(F1_Name, F1_StrikeAccuracy, F2_Name, F2_StrikeAccuracy)
As shown above, again you can create multiple columns at the same time using a ‘,’ to separate each new column name and condition. We then used select again to only show the columns we are interested in.
Group By and Summarize
To perform operations by group in
dplyr, we use
group_by(). For instance, we can compute the mean number of strikes landed per fight for each fighter by combining group_by with
df %>% group_by(FighterName = F1_Name) %>% summarize(MeanStrikes = mean(F1_StrikesLanded))
Summarise can be used in many ways. Some of the most common uses are:
- n() – number of values
- n_distinct() – number of unique values
Here is another example using
summarize, where we return the number of UFC fights per fighter among fighters with 15 or more fights.
df %>% group_by(FighterName = F1_Name) %>% summarise(Total_Fight_Count = n()) %>% filter(Total_Fight_Count >= 15)
To sort by a value or multiple values, use the verb
arrange. Using the output above, if we wanted to sort by Total_Fight_Count in decreasing order we would add the following to the code:
df %>% group_by(FighterName = F1_Name) %>% summarise(Total_Fight_Count = n()) %>% filter(Total_Fight_Count >= 15) %>% arrange(desc(Total_Fight_Count))
Notice using desc() gives us descending order. It is also possible to sort by multiple columns by using a ‘,’ between different columns. It will give priority to the sort in the order listed.
As shown above, chaining verbs together can very helpful and make your code more concise. It is also possible to use the same verb more than once, for example if you wanted to create a new column with mutate, then filter on that column, then create another new column using mutate again.
Suppose we want to compute the cumulative fight number of each fighter’s career, where a 1 represents the first fight of his/her career.
First, we need to group by
F1_Name and then sort by
Date to insure the fights are in the correct order. Then, we can use mutate to create a new column that sequentially counts the number of fights for each fighter. We will then only show the date, fighter’s name, opponent, and fight number. We’ll also arrange by fighter in the end to clearly show the count increasing.
df %>% group_by(F1_Name) %>% arrange(Date) %>% mutate(F1_CareerFightNumber = row_number()) %>% select(Date, Fighter = F1_Name, Opponent = F2_Name, F1_CareerFightNumber) %>% arrange(Fighter)
Alternatively, if we wanted to create a column including the cumulative number of wins for each Fighter 1 (up to and including the current fight), then we would sort
DT appropriately and use the
cumsum function with the proper
by to cumulatively sum the
F1_Win column for each fighter.
df %>% group_by(F1_Name) %>% arrange(Date) %>% mutate(F1_CareerWins = cumsum(F1_Win)) %>% select(Date, Fighter = F1_Name, Opponent = F2_Name, F1_Win, F1_CareerWins) %>% arrange(Fighter)
The options are endless for as far as you want to take it. From here you could filter for a certain number of career wins or group by the fighter and filter for the maximum career wins to have only the most recent fight show up for each fighter. These are just a couple of the limitless options you have.
In addition to the
merge function in
dplyr has several join functions.
However, in cases where a full merge is not needed, like when you want to add only one column from one dataframe to another, a join may be more appropriate.
The join functions available in dplyr are:
To illustrate how to perform a join, we will first define
# create two data.tables df1 <- data_frame(food = c('apple', 'banana', 'carrot'), type = c('fruit', 'fruit', 'vegetable')) df2 <- data_frame(id = c('carrot', 'apple', 'banana'), color = c('orange', 'red', 'yellow'))
Suppose we want to add the
color column in
df1. In this example I will use a left_join, but because it is such a simple example any of the joins would have the same result.
df1 %>% left_join(df2, by = c('food' = 'id'))
Notice the syntax above to specify which columns to match on from each dataframe since they have different names. If the name was the same, you would simply list that once. You can also join on more than one column.
I hope that this blog has shown the simplicity of both reading and writing code using dplyr in order to manipulate data in R. The question might now be, what is the advantage of using data.table?
data.table has a very clear advantage when it comes to performance, and for that reason if you are working with large datasets, learning the syntax for
data.table would be worth your time. If you are working with smaller datasets,
dplyr is quick and easy to use.