Nate Latshaw

By Nate Latshaw

In the R programming language, data.table is both a package and an object within that package. The data.table package serves as a wrapper for the data.frame object in base R. A data.table object features many enhancements over a data.frame – to include increased speed and memory allocation efficiency. The value of these enhancements to the user increases as the size of the data increases.

R users tend to use either the data.table or the dplyr package for data manipulation. In general, the data.table package provides superior performance for large tables (say, 1 million rows or more). In addition, the data.table package generally provides more concise syntax, but there is some debate over which package provides the more user-friendly syntax. See this StackOverflow post for a detailed comparison of these two packages. To quote Hadley Wickham’s response to that post:

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs filter, select, mutate, arrange and summarise, along with a “by group” adverb. That constraint is a big help when you’re learning data manipulation, because it helps order your thinking about the problem.

Every data.table is a data.frame so users may use dplyr operations on a data.table. However, the dtplyr package was added to the Tidyverse to allow dplyr users to take advantage of data.table efficiency while writing dplyr code. The package translates dplyr code to data.table code and then executes the data.table code, so the user experiences data.table speed, plus a translation cost that scales with the complexity of the code (not the size of the data), with dplyr syntax.

Getting started with the data.table package

Creating a data.table should look familiar for those who have worked with data in base R.

library(data.table)
DT <- data.table(Numbers = 1:3, 
                 Letters = c('a', 'b', 'c'), 
                 Booleans = T)
DT
##    Numbers Letters Booleans
## 1:       1       a     TRUE
## 2:       2       b     TRUE
## 3:       3       c     TRUE

To convert an existing data.frame to a data.table users can use the setDT function.

DF <- data.frame(Numbers = 1:3, 
                 Letters = c('a', 'b', 'c'), 
                 Booleans = T)
DT <- setDT(DF)
str(DT)
## Classes 'data.table' and 'data.frame':   3 obs. of  3 variables:
##  $ Numbers : int  1 2 3
##  $ Letters : Factor w/ 3 levels "a","b","c": 1 2 3
##  $ Booleans: logi  TRUE TRUE TRUE
##  - attr(*, ".internal.selfref")=<externalptr>

Finally, the fread function is used to read in data as a data.table from a CSV. Even though it is not shown here, the fwrite function is used to save a data.table as a CSV. Both fread and fwrite are extremely fast compared to functions offered in readr and base R, especially as the size of the data increases.

Now we’ll read in some data from GitHub in order to explore the data.table package. The data consists of information on fights in the Ultimate Fighting Championship (UFC) for select fighters. Each row is a single fight between two fighters, F1 and F2 for Fighter 1 and Fighter 2, respectively. For each fight, the data includes the fighter names, the date and name of the event, the weight class, whether or not the fight was a title fight for a championship belt, the method of victory for the winning fighter, the time and round in which the fight ended, the reach and height of each fighter, whether or not Fighter 1 won the fight, and the number of strikes landed and attempted for each fighter.

DT <- fread('https://raw.githubusercontent.com/NateLatshaw/Tutorials/main/data/ufc_fighter_histories.csv')
head(DT)
##         F1_Name              F2_Name       Date
## 1: Amanda Nunes           Julia Budd 2011-01-07
## 2: Amanda Nunes         Alexis Davis 2011-09-10
## 3: Amanda Nunes          Sheila Gaff 2013-08-03
## 4: Amanda Nunes Germaine de Randamie 2013-11-06
## 5: Amanda Nunes          Cat Zingano 2014-09-27
## 6: Amanda Nunes       Shayna Baszler 2015-03-21
##                                      Event           WeightClass TitleFight
## 1:              Strikeforce Challengers 13 Women's Featherweight      FALSE
## 2:     Strikeforce - Barnett vs Kharitonov  Women's Bantamweight      FALSE
## 3:                   UFC 163: Aldo vs Jung  Women's Bantamweight      FALSE
## 4: UFC Fight Night: Fight for the Troops 3  Women's Bantamweight      FALSE
## 5:             UFC 178: Johnson vs Cariaso  Women's Bantamweight      FALSE
## 6:        UFC Fight Night: Maia vs LaFlare  Women's Bantamweight      FALSE
##      Result EndRound EndTime F1_Reach_in F2_Reach_in F1_Height_in F2_Height_in
## 1: Knockout        1    0:14          69          67           68           67
## 2: Knockout        2    4:53          69          68           68           66
## 3: Knockout        1    2:08          69          NA           68           65
## 4: Knockout        1    3:56          69          71           68           69
## 5: Knockout        3    1:21          69          68           68           66
## 6: Knockout        1    1:56          69          NA           68           67
##    F1_Win F1_Loss F1_StrikesLanded F2_StrikesLanded F1_StrikesAttempted
## 1:   TRUE   FALSE               11                3                  19
## 2:  FALSE    TRUE               66              126                 109
## 3:   TRUE   FALSE               16                5                  25
## 4:   TRUE   FALSE               23                1                  67
## 5:  FALSE    TRUE               36               93                  54
## 6:   TRUE   FALSE               17                1                  20
##    F2_StrikesAttempted
## 1:                   6
## 2:                 168
## 3:                  10
## 4:                   6
## 5:                 116
## 6:                   8

Basic syntax

In its most basic form, the data.table syntax is: DT[i, j, by]. Relating this syntax to SQL, this can be thought of as:

SELECT j

FROM DT

WHERE i

GROUP BY by

In other words, this syntax is saying: Take DT, subset rows on i, and then select or compute j for each group in by. This basic syntax will be shown more concretely in subsequent examples.

Filtering rows and selecting columns

For now we will focus on filtering rows in i and selecting columns in j while ignoring by. To subset our data on a single fighter, say Conor McGregor, we can simply call DT[F1_Name == 'Conor McGregor'], which returns another data.table. Note that when only using i we do not need any trailing commas to denote an empty j or an empty by. Also, when using a data.table we do not need to redundantly refer back to the original object name inside the square brackets. In other words, we do not need to call DT[DT$F1_Name == 'Conor McGregor'] as we would in base R.

To select only certain columns, say the dates in which these fights occurred, we can call DT[, Date]. Note that we need the preceding comma to denote an empty i when only using j, but again, we do not need a trailing comma to denote an empty by. Also, passing a single column name or a vector of column names in j returns a single vector containing the selected column (or a single vector of concatenated columns if multiple columns are selected). If we instead wish to return another data.table, then we need to pass the desired column names to j as a list. That is, DT[, list(Date)] returns a data.table while DT[, date] returns a vector. Finally, instead of using list() in j, users of data.table often use .() instead, which is a more concise alias of list().

Bringing these concepts together, we can now return a data.table containing the opponents and dates for all Conor McGregor’s UFC fights.

# subset on Conor McGregor fights
# select opponent name (F2_Name) and date of fight (Date)
# return a data.table
DT[F1_Name == 'Conor McGregor', .(F2_Name, Date)]
##                 F2_Name       Date
##  1:      Marcus Brimage 2013-04-06
##  2:        Max Holloway 2013-08-17
##  3:       Diego Brandao 2014-07-19
##  4:      Dustin Poirier 2014-09-27
##  5:        Dennis Siver 2015-01-18
##  6:         Chad Mendes 2015-07-11
##  7:           Jose Aldo 2015-12-12
##  8:           Nate Diaz 2016-03-05
##  9:           Nate Diaz 2016-08-20
## 10:       Eddie Alvarez 2016-11-12
## 11: Khabib Nurmagomedov 2018-10-06
## 12:      Donald Cerrone 2020-01-18

Note that renaming columns in the resulting data.table is as simple as renaming the selected columns in j.

# subset on Conor McGregor fights
# select opponent name (F2_Name) and date of fight (Date)
# rename selected columns
# return a data.table
DT[F1_Name == 'Conor McGregor', .(Opponent = F2_Name, FightDate = Date)]
##                Opponent  FightDate
##  1:      Marcus Brimage 2013-04-06
##  2:        Max Holloway 2013-08-17
##  3:       Diego Brandao 2014-07-19
##  4:      Dustin Poirier 2014-09-27
##  5:        Dennis Siver 2015-01-18
##  6:         Chad Mendes 2015-07-11
##  7:           Jose Aldo 2015-12-12
##  8:           Nate Diaz 2016-03-05
##  9:           Nate Diaz 2016-08-20
## 10:       Eddie Alvarez 2016-11-12
## 11: Khabib Nurmagomedov 2018-10-06
## 12:      Donald Cerrone 2020-01-18

Creating and manipulating columns

To create a new column in a data.table, we use := in j. For instance, we compute striking accuracy by dividing the number of strikes landed by the number of strikes attempted for each fighter.

# create new striking accuracy columns
DT[, F1_StrikeAccuracy := F1_StrikesLanded / F1_StrikesAttempted]
DT[, F2_StrikeAccuracy := F2_StrikesLanded / F2_StrikesAttempted]
# display striking accuracy for each fighter in Conor McGregor fights
head(DT[F1_Name == 'Conor McGregor', .(F1_Name, F1_StrikeAccuracy, F2_Name, F2_StrikeAccuracy)])
##           F1_Name F1_StrikeAccuracy        F2_Name F2_StrikeAccuracy
## 1: Conor McGregor         0.6111111 Marcus Brimage         0.2926829
## 2: Conor McGregor         0.4104046   Max Holloway         0.3076923
## 3: Conor McGregor         0.5441176  Diego Brandao         0.4347826
## 4: Conor McGregor         0.3103448 Dustin Poirier         0.5263158
## 5: Conor McGregor         0.5416667   Dennis Siver         0.3142857
## 6: Conor McGregor         0.6166667    Chad Mendes         0.6792453

Notice that using := in j creates a new column in the data.table without having to reassign DT by calling something like DT <- .... Creating and modifying columns in a data.table using := is done by reference. While this may lead to some undesirable functionality (more on that later), modification by reference is a memory efficient way to process data.

When passing a list to j using .(), a new column is not created in the original data.table. For instance, below we generate an identical data.table to the one printed above, but by using .() in j, the original data.table, DT, is not modified.

# create a data.table with new columns without modifying DT
head(DT[F1_Name == 'Conor McGregor', .(F1_Name, 
                                       F1_StrikeAccuracy = F1_StrikesLanded / F1_StrikesAttempted, 
                                       F2_Name, 
                                       F2_StrikeAccuracy = F2_StrikesLanded / F2_StrikesAttempted)])
##           F1_Name F1_StrikeAccuracy        F2_Name F2_StrikeAccuracy
## 1: Conor McGregor         0.6111111 Marcus Brimage         0.2926829
## 2: Conor McGregor         0.4104046   Max Holloway         0.3076923
## 3: Conor McGregor         0.5441176  Diego Brandao         0.4347826
## 4: Conor McGregor         0.3103448 Dustin Poirier         0.5263158
## 5: Conor McGregor         0.5416667   Dennis Siver         0.3142857
## 6: Conor McGregor         0.6166667    Chad Mendes         0.6792453

In addition to performing operations between columns in j, we can also use functions. For instance, we can find the mean number of strikes landed by Conor McGregor across all of his fights.

# mean strikes landed by Conor McGregor per fight
DT[F1_Name == 'Conor McGregor', mean(F1_StrikesLanded)]
## [1] 58.5

Note that we use mean inside the square brackets, which allows the data.table package to handle the operation. We could get the same answer by pulling mean outside of the square brackets: mean(DT[F1_Name == 'Conor McGregor', F1_StrikesLanded]). This would return the same answer but is not the most efficient way to use data.table since it would only use data.table to return a vector of strikes landed by Conor McGregor and then use base R to compute the mean of that vector.

Group by

To perform operations by group in data.table, we add a column name or a list of column names to by in DT[i, j, by]. For instance, we can compute the mean number of strikes landed per fight for each fighter.

# mean strikes landed by fighter
DT[, mean(F1_StrikesLanded), by = F1_Name]
##                  F1_Name        V1
##  1:         Amanda Nunes  58.81250
##  2:       Conor McGregor  58.50000
##  3:           Holly Holm  76.41667
##  4:      Israel Adesanya  67.33333
##  5:         Kamaru Usman 148.66667
##  6:  Khabib Nurmagomedov  83.00000
##  7:         Max Holloway 102.52174
##  8:         Ronda Rousey  25.75000
##  9:         Stipe Miocic  89.70588
## 10: Valentina Shevchenko 104.27273

Note that when using by, the resulting object is always another data.table, even if a list is not passed to j. Also, notice that we can pass a list to j and by and rename the resulting columns.

# mean strikes landed by fighter
# rename columns in j and by
DT[, .(MeanStrikes = mean(F1_StrikesLanded)), by = .(FighterName = F1_Name)]
##              FighterName MeanStrikes
##  1:         Amanda Nunes    58.81250
##  2:       Conor McGregor    58.50000
##  3:           Holly Holm    76.41667
##  4:      Israel Adesanya    67.33333
##  5:         Kamaru Usman   148.66667
##  6:  Khabib Nurmagomedov    83.00000
##  7:         Max Holloway   102.52174
##  8:         Ronda Rousey    25.75000
##  9:         Stipe Miocic    89.70588
## 10: Valentina Shevchenko   104.27273

Recall that since we passed a list to j instead of using := to create the new column, DT is not modified.

Additional features

Now that we have some familiarity with the basic data.table syntax of DT[i, j, by], we will explore some common use cases.

Special symbol: .N

One powerful special symbol for a data.table is .N, which is used in j to count the number of rows in the current group. For instance, DT[, .N] returns the number of rows in DT. Similarly, DT[F1_Name == 'Conor McGregor', .N] returns the number of Conor McGregor fights, or more generally, the number of rows satisfying the condition in i. We can also use .N to quickly count the number of rows by group. In this case, we can return the number of UFC fights for each fighter.

# number of rows (or UFC fights) by fighter
DT[, .N, by = F1_Name]
##                  F1_Name  N
##  1:         Amanda Nunes 16
##  2:       Conor McGregor 12
##  3:           Holly Holm 12
##  4:      Israel Adesanya  9
##  5:         Kamaru Usman 12
##  6:  Khabib Nurmagomedov 13
##  7:         Max Holloway 23
##  8:         Ronda Rousey 12
##  9:         Stipe Miocic 17
## 10: Valentina Shevchenko 11

Chaining

Chaining multiple sequential operations on a single data.table is simple: just add another set of square brackets. For instance, the general form of two chained operations is DT[i, j, by][i, j, by], but this can be extended beyond just two. As an example, we can return the number of UFC fights per fighter among fighters with 15 or more fights.

# number of rows (or UFC fights) by fighter
# only return fighters with 15 or more fights
DT[, .N, by = F1_Name][N >= 15]
##         F1_Name  N
## 1: Amanda Nunes 16
## 2: Max Holloway 23
## 3: Stipe Miocic 17

Sorting

There are multiple efficient ways to sort a data.table. We can use the order function in i to sort a data.table. For instance, referring back to the table we created showing the number of fights among fighters with at least 15 fights, we can add another operation that sorts the resulting data.table in descending order of the number of fights (notice the minus in front of the column name inside order to reverse the order from ascending to descending).

# number of rows (or UFC fights) by fighter
# only return fighters with 15 or more fights
# order table by number of fights in descending order
DT[, .N, by = F1_Name][N >= 15][order(-N)]
##         F1_Name  N
## 1: Max Holloway 23
## 2: Stipe Miocic 17
## 3: Amanda Nunes 16

The setkey function also sorts columns in ascending order and changes the data.table by reference, which means that we do not have to reassign the object using <- to modify the data.table. For instance, we can sort DT in ascending order by Date and then by F1_Name name.

# sort DT by Date and F1_Name
setkey(DT, Date, F1_Name)
# print result
head(DT[, .(Date, F1_Name)])
##          Date             F1_Name
## 1: 2011-01-07        Amanda Nunes
## 2: 2011-08-12        Ronda Rousey
## 3: 2011-09-10        Amanda Nunes
## 4: 2011-10-08        Stipe Miocic
## 5: 2011-11-18        Ronda Rousey
## 6: 2012-01-20 Khabib Nurmagomedov

Using setkey is typically the preferred method of sorting a data.table because it also adds an attribute to the object itself that marks it as sorted. The columns by which the data.table is sorted are set as keys, which enables fast joins.

Joining

To merge a data.table with another, the merge function works the same way it does in base R. However, in cases where a full merge is not needed, like when you want to add only one column from one data.table to another, a join may be more appropriate.

To illustrate how to perform a join, we will first define DT1 and DT2, each of which is a data.table.

# create two data.tables
DT1 <- data.table(food = c('apple', 'banana', 'carrot'), type = c('fruit', 'fruit', 'vegetable'))
DT2 <- data.table(id = c('carrot', 'apple', 'banana'), color = c('orange', 'red', 'yellow'))

Suppose we want to add the color column in DT2 to DT1. Then, we set the same key using setkey for each data.table. Note that the keys do not need to have the same name.

# set the keys
setkey(DT1, food)
setkey(DT2, id)

Finally, to perform the join, we place DT2 in i of DT1 and create a new column using := by placing i. in front of the column name from DT2.

# join the color column from DT2 onto DT1
DT1[DT2, food_color := i.color]
# print result
DT1
##      food      type food_color
## 1:  apple     fruit        red
## 2: banana     fruit     yellow
## 3: carrot vegetable     orange

In this simple example, a merge would have been more appropriate, but if DT2 had dozens of columns and we only wanted to add a single column to DT1, then a join is much more memory efficient.

Cumulative counts and sums

When working with panel data, it’s often helpful to compute cumulative or rolling metrics. Generating such features in data.table is fast and concise. For instance, suppose we want to compute the cumulative fight number of each fighter’s career, where a 1 represents the first fight of his/her career.

First, we need to use setkey to sort DT properly – that is, by F1_Name and then by Date. Then, we can create a new column in j that sequentially counts the number of fights for each group specified in by. We can use a sequence to create this new column that counts from 1 to the number of fights observed for each fighter, which is denoted as .N in data.table. Note that once DT is properly sorted, we are simply producing a column representing the cumulative row count for each F1_Name.

# sort DT by F1_Name and Date
setkey(DT, F1_Name, Date)
# create career fight number for Fighter 1
DT[, F1_CareerFightNumber := 1:.N, by = F1_Name]
# print result
DT[F1_Name == 'Conor McGregor', .(Date, F1_Name, F2_Name, F1_CareerFightNumber)]
##           Date        F1_Name             F2_Name F1_CareerFightNumber
##  1: 2013-04-06 Conor McGregor      Marcus Brimage                    1
##  2: 2013-08-17 Conor McGregor        Max Holloway                    2
##  3: 2014-07-19 Conor McGregor       Diego Brandao                    3
##  4: 2014-09-27 Conor McGregor      Dustin Poirier                    4
##  5: 2015-01-18 Conor McGregor        Dennis Siver                    5
##  6: 2015-07-11 Conor McGregor         Chad Mendes                    6
##  7: 2015-12-12 Conor McGregor           Jose Aldo                    7
##  8: 2016-03-05 Conor McGregor           Nate Diaz                    8
##  9: 2016-08-20 Conor McGregor           Nate Diaz                    9
## 10: 2016-11-12 Conor McGregor       Eddie Alvarez                   10
## 11: 2018-10-06 Conor McGregor Khabib Nurmagomedov                   11
## 12: 2020-01-18 Conor McGregor      Donald Cerrone                   12

Alternatively, if we wanted to create a column including the cumulative number of wins for each Fighter 1 (up to and including the current fight), then we would sort DT appropriately and use the cumsum function with the proper by to cumulatively sum the F1_Win column for each fighter.

# sort DT by F1_Name and Date
setkey(DT, F1_Name, Date)
# create cumulative number of wins for Fighter 1 (including the current fight)
DT[, F1_CareerWins := cumsum(F1_Win), by = F1_Name]
# print result
DT[F1_Name == 'Conor McGregor', .(Date, F1_Name, F2_Name, F1_Win, F1_CareerWins)]
##           Date        F1_Name             F2_Name F1_Win F1_CareerWins
##  1: 2013-04-06 Conor McGregor      Marcus Brimage   TRUE             1
##  2: 2013-08-17 Conor McGregor        Max Holloway   TRUE             2
##  3: 2014-07-19 Conor McGregor       Diego Brandao   TRUE             3
##  4: 2014-09-27 Conor McGregor      Dustin Poirier   TRUE             4
##  5: 2015-01-18 Conor McGregor        Dennis Siver   TRUE             5
##  6: 2015-07-11 Conor McGregor         Chad Mendes   TRUE             6
##  7: 2015-12-12 Conor McGregor           Jose Aldo   TRUE             7
##  8: 2016-03-05 Conor McGregor           Nate Diaz  FALSE             7
##  9: 2016-08-20 Conor McGregor           Nate Diaz   TRUE             8
## 10: 2016-11-12 Conor McGregor       Eddie Alvarez   TRUE             9
## 11: 2018-10-06 Conor McGregor Khabib Nurmagomedov  FALSE             9
## 12: 2020-01-18 Conor McGregor      Donald Cerrone   TRUE            10

Usage in functions

As mentioned earlier, a data.table object is often modified by reference, which results in memory efficient performance for many operations. However, modification by reference does have potentially undesirable consequences, especially when performing standard data.table operations inside a function.

To illustrate this issue, we first create a new data.table named newDT.

# create new data.table
newDT <- data.table(ID = c('A', 'B', 'C'), ones = 1)
newDT
##    ID ones
## 1:  A    1
## 2:  B    1
## 3:  C    1

Then we create a function called AddColumns that takes any data.table as input and adds a column of twos to the object using := and a column of threes using base R. Finally, we execute this function on newDT to create another data.table named newDT2.

# define AddColumns function
AddColumns <- function(inputDT){
  inputDT[, twos := 2]
  inputDT$threes <- 3
  return(inputDT)
}
# execute function on newDT to create newDT2
newDT2 <- AddColumns(newDT)
# print result
newDT2
##    ID ones twos threes
## 1:  A    1    2      3
## 2:  B    1    2      3
## 3:  C    1    2      3

We see that AddColumns performed as intended. However, when we look back at newDT, we see that executing AddColumns modified this original data.table.

newDT
##    ID ones twos
## 1:  A    1    2
## 2:  B    1    2
## 3:  C    1    2

The special character := modifies columns by reference. Hence, even though newDT is passed to a function and modified within that function, a second copy of that object is not created in memory. Instead, a pointer is stored in memory that points back to the original object, so when a column is added in the function using :=, the original object is also modified because that is the only data.table object stored in memory.

However, we see that the threes column was not added to newDT. Modifying objects in base R (or otherwise outside of data.table) is typically not done by reference. In this case, that means a second copy of the data.table is created in memory, and then the threes column is added to the copy. This is often how we expect functions to work, but it is not memory efficient since it involves creating a second copy of the data in memory.

This illuminates one of the trade-offs with data.table. We gain memory allocation efficiency by modifying by reference, but we need to be aware that certain operations will result in behavior that is different from the typical R functionality.

In this case with the AddColumns function, we simply need to force R to make a copy of inputDT by calling inputDT <- copy(inputDT). This breaks the pointer between newDT and inputDT such that any modification of inputDT inside the function will no longer modify newDT. The example below recreates newDT, redefines the AddColumns function, and shows that newDT is not modified after executing the function.

# recreate newDT
newDT <- data.table(ID = c('A', 'B', 'C'), ones = 1)
# redefine AddColumns function
AddColumns <- function(inputDT){
  inputDT <- copy(inputDT)
  inputDT[, twos := 2]
  inputDT$threes <- 3
  return(inputDT)
}
# execute function on newDT to recreate newDT2
newDT2 <- AddColumns(newDT)
# print newDT
newDT
##    ID ones
## 1:  A    1
## 2:  B    1
## 3:  C    1

Conclusion

Those who work with large datasets in R and decide to spend the time required to learn the data.table syntax will continually reap the performance rewards throughout their work. However, for many R users who tend to work with small to medium size datasets, the readability and simplicity of the syntax in the dplyr package may outweigh the relatively small efficiency gains that data.table would bring to operations on data of that size.

While this post covers much of the basic data.table syntax, it really only scratches the surface. The data.table package contains much more functionality and many more shortcuts than those described here.