In the R programming language, data.table
is both a package and an object within that package. The data.table
package serves as a wrapper for the data.frame
object in base
R. A data.table
object features many enhancements over a data.frame
– to include increased speed and memory allocation efficiency. The value of these enhancements to the user increases as the size of the data increases.
R users tend to use either the data.table
or the dplyr
package for data manipulation. In general, the data.table
package provides superior performance for large tables (say, 1 million rows or more). In addition, the data.table
package generally provides more concise syntax, but there is some debate over which package provides the more user-friendly syntax. See this StackOverflow post for a detailed comparison of these two packages. To quote Hadley Wickham’s response to that post:
There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!
One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs filter, select, mutate, arrange and summarise, along with a “by group” adverb. That constraint is a big help when you’re learning data manipulation, because it helps order your thinking about the problem.
Every data.table
is a data.frame
so users may use dplyr
operations on a data.table
. However, the dtplyr package was added to the Tidyverse to allow dplyr
users to take advantage of data.table
efficiency while writing dplyr
code. The package translates dplyr
code to data.table
code and then executes the data.table
code, so the user experiences data.table
speed, plus a translation cost that scales with the complexity of the code (not the size of the data), with dplyr
syntax.
Getting started with the data.table
package
Creating a data.table
should look familiar for those who have worked with data in base
R.
library(data.table)
DT <- data.table(Numbers = 1:3,
Letters = c('a', 'b', 'c'),
Booleans = T)
DT
## Numbers Letters Booleans
## 1: 1 a TRUE
## 2: 2 b TRUE
## 3: 3 c TRUE
To convert an existing data.frame
to a data.table
users can use the setDT
function.
DF <- data.frame(Numbers = 1:3,
Letters = c('a', 'b', 'c'),
Booleans = T)
DT <- setDT(DF)
str(DT)
## Classes 'data.table' and 'data.frame': 3 obs. of 3 variables:
## $ Numbers : int 1 2 3
## $ Letters : Factor w/ 3 levels "a","b","c": 1 2 3
## $ Booleans: logi TRUE TRUE TRUE
## - attr(*, ".internal.selfref")=<externalptr>
Finally, the fread
function is used to read in data as a data.table
from a CSV. Even though it is not shown here, the fwrite
function is used to save a data.table
as a CSV. Both fread
and fwrite
are extremely fast compared to functions offered in readr
and base
R, especially as the size of the data increases.
Now we’ll read in some data from GitHub in order to explore the data.table
package. The data consists of information on fights in the Ultimate Fighting Championship (UFC) for select fighters. Each row is a single fight between two fighters, F1
and F2
for Fighter 1 and Fighter 2, respectively. For each fight, the data includes the fighter names, the date and name of the event, the weight class, whether or not the fight was a title fight for a championship belt, the method of victory for the winning fighter, the time and round in which the fight ended, the reach and height of each fighter, whether or not Fighter 1 won the fight, and the number of strikes landed and attempted for each fighter.
DT <- fread('https://raw.githubusercontent.com/NateLatshaw/Tutorials/main/data/ufc_fighter_histories.csv')
head(DT)
## F1_Name F2_Name Date
## 1: Amanda Nunes Julia Budd 2011-01-07
## 2: Amanda Nunes Alexis Davis 2011-09-10
## 3: Amanda Nunes Sheila Gaff 2013-08-03
## 4: Amanda Nunes Germaine de Randamie 2013-11-06
## 5: Amanda Nunes Cat Zingano 2014-09-27
## 6: Amanda Nunes Shayna Baszler 2015-03-21
## Event WeightClass TitleFight
## 1: Strikeforce Challengers 13 Women's Featherweight FALSE
## 2: Strikeforce - Barnett vs Kharitonov Women's Bantamweight FALSE
## 3: UFC 163: Aldo vs Jung Women's Bantamweight FALSE
## 4: UFC Fight Night: Fight for the Troops 3 Women's Bantamweight FALSE
## 5: UFC 178: Johnson vs Cariaso Women's Bantamweight FALSE
## 6: UFC Fight Night: Maia vs LaFlare Women's Bantamweight FALSE
## Result EndRound EndTime F1_Reach_in F2_Reach_in F1_Height_in F2_Height_in
## 1: Knockout 1 0:14 69 67 68 67
## 2: Knockout 2 4:53 69 68 68 66
## 3: Knockout 1 2:08 69 NA 68 65
## 4: Knockout 1 3:56 69 71 68 69
## 5: Knockout 3 1:21 69 68 68 66
## 6: Knockout 1 1:56 69 NA 68 67
## F1_Win F1_Loss F1_StrikesLanded F2_StrikesLanded F1_StrikesAttempted
## 1: TRUE FALSE 11 3 19
## 2: FALSE TRUE 66 126 109
## 3: TRUE FALSE 16 5 25
## 4: TRUE FALSE 23 1 67
## 5: FALSE TRUE 36 93 54
## 6: TRUE FALSE 17 1 20
## F2_StrikesAttempted
## 1: 6
## 2: 168
## 3: 10
## 4: 6
## 5: 116
## 6: 8
Basic syntax
In its most basic form, the data.table
syntax is: DT[i, j, by]
. Relating this syntax to SQL, this can be thought of as:
SELECT j
FROM DT
WHERE i
GROUP BY by
In other words, this syntax is saying: Take DT
, subset rows on i
, and then select or compute j
for each group in by
. This basic syntax will be shown more concretely in subsequent examples.
Filtering rows and selecting columns
For now we will focus on filtering rows in i
and selecting columns in j
while ignoring by
. To subset our data on a single fighter, say Conor McGregor, we can simply call DT[F1_Name == 'Conor McGregor']
, which returns another data.table
. Note that when only using i
we do not need any trailing commas to denote an empty j
or an empty by
. Also, when using a data.table
we do not need to redundantly refer back to the original object name inside the square brackets. In other words, we do not need to call DT[DT$F1_Name == 'Conor McGregor']
as we would in base
R.
To select only certain columns, say the dates in which these fights occurred, we can call DT[, Date]
. Note that we need the preceding comma to denote an empty i
when only using j
, but again, we do not need a trailing comma to denote an empty by
. Also, passing a single column name or a vector of column names in j
returns a single vector containing the selected column (or a single vector of concatenated columns if multiple columns are selected). If we instead wish to return another data.table
, then we need to pass the desired column names to j
as a list
. That is, DT[, list(Date)]
returns a data.table
while DT[, date]
returns a vector. Finally, instead of using list()
in j
, users of data.table
often use .()
instead, which is a more concise alias of list()
.
Bringing these concepts together, we can now return a data.table
containing the opponents and dates for all Conor McGregor’s UFC fights.
# subset on Conor McGregor fights
# select opponent name (F2_Name) and date of fight (Date)
# return a data.table
DT[F1_Name == 'Conor McGregor', .(F2_Name, Date)]
## F2_Name Date
## 1: Marcus Brimage 2013-04-06
## 2: Max Holloway 2013-08-17
## 3: Diego Brandao 2014-07-19
## 4: Dustin Poirier 2014-09-27
## 5: Dennis Siver 2015-01-18
## 6: Chad Mendes 2015-07-11
## 7: Jose Aldo 2015-12-12
## 8: Nate Diaz 2016-03-05
## 9: Nate Diaz 2016-08-20
## 10: Eddie Alvarez 2016-11-12
## 11: Khabib Nurmagomedov 2018-10-06
## 12: Donald Cerrone 2020-01-18
Note that renaming columns in the resulting data.table
is as simple as renaming the selected columns in j
.
# subset on Conor McGregor fights
# select opponent name (F2_Name) and date of fight (Date)
# rename selected columns
# return a data.table
DT[F1_Name == 'Conor McGregor', .(Opponent = F2_Name, FightDate = Date)]
## Opponent FightDate
## 1: Marcus Brimage 2013-04-06
## 2: Max Holloway 2013-08-17
## 3: Diego Brandao 2014-07-19
## 4: Dustin Poirier 2014-09-27
## 5: Dennis Siver 2015-01-18
## 6: Chad Mendes 2015-07-11
## 7: Jose Aldo 2015-12-12
## 8: Nate Diaz 2016-03-05
## 9: Nate Diaz 2016-08-20
## 10: Eddie Alvarez 2016-11-12
## 11: Khabib Nurmagomedov 2018-10-06
## 12: Donald Cerrone 2020-01-18
Creating and manipulating columns
To create a new column in a data.table
, we use :=
in j
. For instance, we compute striking accuracy by dividing the number of strikes landed by the number of strikes attempted for each fighter.
# create new striking accuracy columns
DT[, F1_StrikeAccuracy := F1_StrikesLanded / F1_StrikesAttempted]
DT[, F2_StrikeAccuracy := F2_StrikesLanded / F2_StrikesAttempted]
# display striking accuracy for each fighter in Conor McGregor fights
head(DT[F1_Name == 'Conor McGregor', .(F1_Name, F1_StrikeAccuracy, F2_Name, F2_StrikeAccuracy)])
## F1_Name F1_StrikeAccuracy F2_Name F2_StrikeAccuracy
## 1: Conor McGregor 0.6111111 Marcus Brimage 0.2926829
## 2: Conor McGregor 0.4104046 Max Holloway 0.3076923
## 3: Conor McGregor 0.5441176 Diego Brandao 0.4347826
## 4: Conor McGregor 0.3103448 Dustin Poirier 0.5263158
## 5: Conor McGregor 0.5416667 Dennis Siver 0.3142857
## 6: Conor McGregor 0.6166667 Chad Mendes 0.6792453
Notice that using :=
in j
creates a new column in the data.table
without having to reassign DT
by calling something like DT <- ...
. Creating and modifying columns in a data.table
using :=
is done by reference. While this may lead to some undesirable functionality (more on that later), modification by reference is a memory efficient way to process data.
When passing a list to j
using .()
, a new column is not created in the original data.table
. For instance, below we generate an identical data.table
to the one printed above, but by using .()
in j
, the original data.table
, DT
, is not modified.
# create a data.table with new columns without modifying DT
head(DT[F1_Name == 'Conor McGregor', .(F1_Name,
F1_StrikeAccuracy = F1_StrikesLanded / F1_StrikesAttempted,
F2_Name,
F2_StrikeAccuracy = F2_StrikesLanded / F2_StrikesAttempted)])
## F1_Name F1_StrikeAccuracy F2_Name F2_StrikeAccuracy
## 1: Conor McGregor 0.6111111 Marcus Brimage 0.2926829
## 2: Conor McGregor 0.4104046 Max Holloway 0.3076923
## 3: Conor McGregor 0.5441176 Diego Brandao 0.4347826
## 4: Conor McGregor 0.3103448 Dustin Poirier 0.5263158
## 5: Conor McGregor 0.5416667 Dennis Siver 0.3142857
## 6: Conor McGregor 0.6166667 Chad Mendes 0.6792453
In addition to performing operations between columns in j
, we can also use functions. For instance, we can find the mean number of strikes landed by Conor McGregor across all of his fights.
# mean strikes landed by Conor McGregor per fight
DT[F1_Name == 'Conor McGregor', mean(F1_StrikesLanded)]
## [1] 58.5
Note that we use mean
inside the square brackets, which allows the data.table
package to handle the operation. We could get the same answer by pulling mean
outside of the square brackets: mean(DT[F1_Name == 'Conor McGregor', F1_StrikesLanded])
. This would return the same answer but is not the most efficient way to use data.table
since it would only use data.table
to return a vector of strikes landed by Conor McGregor and then use base
R to compute the mean of that vector.
Group by
To perform operations by group in data.table
, we add a column name or a list of column names to by
in DT[i, j, by]
. For instance, we can compute the mean number of strikes landed per fight for each fighter.
# mean strikes landed by fighter
DT[, mean(F1_StrikesLanded), by = F1_Name]
## F1_Name V1
## 1: Amanda Nunes 58.81250
## 2: Conor McGregor 58.50000
## 3: Holly Holm 76.41667
## 4: Israel Adesanya 67.33333
## 5: Kamaru Usman 148.66667
## 6: Khabib Nurmagomedov 83.00000
## 7: Max Holloway 102.52174
## 8: Ronda Rousey 25.75000
## 9: Stipe Miocic 89.70588
## 10: Valentina Shevchenko 104.27273
Note that when using by
, the resulting object is always another data.table
, even if a list
is not passed to j
. Also, notice that we can pass a list
to j
and by
and rename the resulting columns.
# mean strikes landed by fighter
# rename columns in j and by
DT[, .(MeanStrikes = mean(F1_StrikesLanded)), by = .(FighterName = F1_Name)]
## FighterName MeanStrikes
## 1: Amanda Nunes 58.81250
## 2: Conor McGregor 58.50000
## 3: Holly Holm 76.41667
## 4: Israel Adesanya 67.33333
## 5: Kamaru Usman 148.66667
## 6: Khabib Nurmagomedov 83.00000
## 7: Max Holloway 102.52174
## 8: Ronda Rousey 25.75000
## 9: Stipe Miocic 89.70588
## 10: Valentina Shevchenko 104.27273
Recall that since we passed a list
to j
instead of using :=
to create the new column, DT
is not modified.
Additional features
Now that we have some familiarity with the basic data.table
syntax of DT[i, j, by]
, we will explore some common use cases.
Special symbol: .N
One powerful special symbol for a data.table
is .N
, which is used in j
to count the number of rows in the current group. For instance, DT[, .N]
returns the number of rows in DT
. Similarly, DT[F1_Name == 'Conor McGregor', .N]
returns the number of Conor McGregor fights, or more generally, the number of rows satisfying the condition in i
. We can also use .N
to quickly count the number of rows by group. In this case, we can return the number of UFC fights for each fighter.
# number of rows (or UFC fights) by fighter
DT[, .N, by = F1_Name]
## F1_Name N
## 1: Amanda Nunes 16
## 2: Conor McGregor 12
## 3: Holly Holm 12
## 4: Israel Adesanya 9
## 5: Kamaru Usman 12
## 6: Khabib Nurmagomedov 13
## 7: Max Holloway 23
## 8: Ronda Rousey 12
## 9: Stipe Miocic 17
## 10: Valentina Shevchenko 11
Chaining
Chaining multiple sequential operations on a single data.table
is simple: just add another set of square brackets. For instance, the general form of two chained operations is DT[i, j, by][i, j, by]
, but this can be extended beyond just two. As an example, we can return the number of UFC fights per fighter among fighters with 15 or more fights.
# number of rows (or UFC fights) by fighter
# only return fighters with 15 or more fights
DT[, .N, by = F1_Name][N >= 15]
## F1_Name N
## 1: Amanda Nunes 16
## 2: Max Holloway 23
## 3: Stipe Miocic 17
Sorting
There are multiple efficient ways to sort a data.table
. We can use the order
function in i
to sort a data.table
. For instance, referring back to the table we created showing the number of fights among fighters with at least 15 fights, we can add another operation that sorts the resulting data.table
in descending order of the number of fights (notice the minus in front of the column name inside order
to reverse the order from ascending to descending).
# number of rows (or UFC fights) by fighter
# only return fighters with 15 or more fights
# order table by number of fights in descending order
DT[, .N, by = F1_Name][N >= 15][order(-N)]
## F1_Name N
## 1: Max Holloway 23
## 2: Stipe Miocic 17
## 3: Amanda Nunes 16
The setkey
function also sorts columns in ascending order and changes the data.table
by reference, which means that we do not have to reassign the object using <-
to modify the data.table
. For instance, we can sort DT
in ascending order by Date
and then by F1_Name
name.
# sort DT by Date and F1_Name
setkey(DT, Date, F1_Name)
# print result
head(DT[, .(Date, F1_Name)])
## Date F1_Name
## 1: 2011-01-07 Amanda Nunes
## 2: 2011-08-12 Ronda Rousey
## 3: 2011-09-10 Amanda Nunes
## 4: 2011-10-08 Stipe Miocic
## 5: 2011-11-18 Ronda Rousey
## 6: 2012-01-20 Khabib Nurmagomedov
Using setkey
is typically the preferred method of sorting a data.table
because it also adds an attribute to the object itself that marks it as sorted. The columns by which the data.table
is sorted are set as keys, which enables fast joins.
Joining
To merge a data.table
with another, the merge
function works the same way it does in base
R. However, in cases where a full merge is not needed, like when you want to add only one column from one data.table
to another, a join may be more appropriate.
To illustrate how to perform a join, we will first define DT1
and DT2
, each of which is a data.table
.
# create two data.tables
DT1 <- data.table(food = c('apple', 'banana', 'carrot'), type = c('fruit', 'fruit', 'vegetable'))
DT2 <- data.table(id = c('carrot', 'apple', 'banana'), color = c('orange', 'red', 'yellow'))
Suppose we want to add the color
column in DT2
to DT1
. Then, we set the same key using setkey
for each data.table
. Note that the keys do not need to have the same name.
# set the keys
setkey(DT1, food)
setkey(DT2, id)
Finally, to perform the join, we place DT2
in i
of DT1
and create a new column using :=
by placing i.
in front of the column name from DT2
.
# join the color column from DT2 onto DT1
DT1[DT2, food_color := i.color]
# print result
DT1
## food type food_color
## 1: apple fruit red
## 2: banana fruit yellow
## 3: carrot vegetable orange
In this simple example, a merge would have been more appropriate, but if DT2
had dozens of columns and we only wanted to add a single column to DT1
, then a join is much more memory efficient.
Cumulative counts and sums
When working with panel data, it’s often helpful to compute cumulative or rolling metrics. Generating such features in data.table
is fast and concise. For instance, suppose we want to compute the cumulative fight number of each fighter’s career, where a 1 represents the first fight of his/her career.
First, we need to use setkey
to sort DT
properly – that is, by F1_Name
and then by Date
. Then, we can create a new column in j
that sequentially counts the number of fights for each group specified in by
. We can use a sequence to create this new column that counts from 1 to the number of fights observed for each fighter, which is denoted as .N
in data.table
. Note that once DT
is properly sorted, we are simply producing a column representing the cumulative row count for each F1_Name
.
# sort DT by F1_Name and Date
setkey(DT, F1_Name, Date)
# create career fight number for Fighter 1
DT[, F1_CareerFightNumber := 1:.N, by = F1_Name]
# print result
DT[F1_Name == 'Conor McGregor', .(Date, F1_Name, F2_Name, F1_CareerFightNumber)]
## Date F1_Name F2_Name F1_CareerFightNumber
## 1: 2013-04-06 Conor McGregor Marcus Brimage 1
## 2: 2013-08-17 Conor McGregor Max Holloway 2
## 3: 2014-07-19 Conor McGregor Diego Brandao 3
## 4: 2014-09-27 Conor McGregor Dustin Poirier 4
## 5: 2015-01-18 Conor McGregor Dennis Siver 5
## 6: 2015-07-11 Conor McGregor Chad Mendes 6
## 7: 2015-12-12 Conor McGregor Jose Aldo 7
## 8: 2016-03-05 Conor McGregor Nate Diaz 8
## 9: 2016-08-20 Conor McGregor Nate Diaz 9
## 10: 2016-11-12 Conor McGregor Eddie Alvarez 10
## 11: 2018-10-06 Conor McGregor Khabib Nurmagomedov 11
## 12: 2020-01-18 Conor McGregor Donald Cerrone 12
Alternatively, if we wanted to create a column including the cumulative number of wins for each Fighter 1 (up to and including the current fight), then we would sort DT
appropriately and use the cumsum
function with the proper by
to cumulatively sum the F1_Win
column for each fighter.
# sort DT by F1_Name and Date
setkey(DT, F1_Name, Date)
# create cumulative number of wins for Fighter 1 (including the current fight)
DT[, F1_CareerWins := cumsum(F1_Win), by = F1_Name]
# print result
DT[F1_Name == 'Conor McGregor', .(Date, F1_Name, F2_Name, F1_Win, F1_CareerWins)]
## Date F1_Name F2_Name F1_Win F1_CareerWins
## 1: 2013-04-06 Conor McGregor Marcus Brimage TRUE 1
## 2: 2013-08-17 Conor McGregor Max Holloway TRUE 2
## 3: 2014-07-19 Conor McGregor Diego Brandao TRUE 3
## 4: 2014-09-27 Conor McGregor Dustin Poirier TRUE 4
## 5: 2015-01-18 Conor McGregor Dennis Siver TRUE 5
## 6: 2015-07-11 Conor McGregor Chad Mendes TRUE 6
## 7: 2015-12-12 Conor McGregor Jose Aldo TRUE 7
## 8: 2016-03-05 Conor McGregor Nate Diaz FALSE 7
## 9: 2016-08-20 Conor McGregor Nate Diaz TRUE 8
## 10: 2016-11-12 Conor McGregor Eddie Alvarez TRUE 9
## 11: 2018-10-06 Conor McGregor Khabib Nurmagomedov FALSE 9
## 12: 2020-01-18 Conor McGregor Donald Cerrone TRUE 10
Usage in functions
As mentioned earlier, a data.table
object is often modified by reference, which results in memory efficient performance for many operations. However, modification by reference does have potentially undesirable consequences, especially when performing standard data.table
operations inside a function.
To illustrate this issue, we first create a new data.table
named newDT
.
# create new data.table
newDT <- data.table(ID = c('A', 'B', 'C'), ones = 1)
newDT
## ID ones
## 1: A 1
## 2: B 1
## 3: C 1
Then we create a function called AddColumns
that takes any data.table
as input and adds a column of twos to the object using :=
and a column of threes using base
R. Finally, we execute this function on newDT
to create another data.table
named newDT2
.
# define AddColumns function
AddColumns <- function(inputDT){
inputDT[, twos := 2]
inputDT$threes <- 3
return(inputDT)
}
# execute function on newDT to create newDT2
newDT2 <- AddColumns(newDT)
# print result
newDT2
## ID ones twos threes
## 1: A 1 2 3
## 2: B 1 2 3
## 3: C 1 2 3
We see that AddColumns
performed as intended. However, when we look back at newDT
, we see that executing AddColumns
modified this original data.table
.
newDT
## ID ones twos
## 1: A 1 2
## 2: B 1 2
## 3: C 1 2
The special character :=
modifies columns by reference. Hence, even though newDT
is passed to a function and modified within that function, a second copy of that object is not created in memory. Instead, a pointer is stored in memory that points back to the original object, so when a column is added in the function using :=
, the original object is also modified because that is the only data.table
object stored in memory.
However, we see that the threes
column was not added to newDT
. Modifying objects in base
R (or otherwise outside of data.table
) is typically not done by reference. In this case, that means a second copy of the data.table
is created in memory, and then the threes
column is added to the copy. This is often how we expect functions to work, but it is not memory efficient since it involves creating a second copy of the data in memory.
This illuminates one of the trade-offs with data.table
. We gain memory allocation efficiency by modifying by reference, but we need to be aware that certain operations will result in behavior that is different from the typical R functionality.
In this case with the AddColumns
function, we simply need to force R to make a copy of inputDT
by calling inputDT <- copy(inputDT)
. This breaks the pointer between newDT
and inputDT
such that any modification of inputDT
inside the function will no longer modify newDT
. The example below recreates newDT
, redefines the AddColumns
function, and shows that newDT
is not modified after executing the function.
# recreate newDT
newDT <- data.table(ID = c('A', 'B', 'C'), ones = 1)
# redefine AddColumns function
AddColumns <- function(inputDT){
inputDT <- copy(inputDT)
inputDT[, twos := 2]
inputDT$threes <- 3
return(inputDT)
}
# execute function on newDT to recreate newDT2
newDT2 <- AddColumns(newDT)
# print newDT
newDT
## ID ones
## 1: A 1
## 2: B 1
## 3: C 1
Conclusion
Those who work with large datasets in R and decide to spend the time required to learn the data.table
syntax will continually reap the performance rewards throughout their work. However, for many R users who tend to work with small to medium size datasets, the readability and simplicity of the syntax in the dplyr
package may outweigh the relatively small efficiency gains that data.table
would bring to operations on data of that size.
While this post covers much of the basic data.table
syntax, it really only scratches the surface. The data.table
package contains much more functionality and many more shortcuts than those described here.