With the third week concluding, I want to share some updates about my Google Summer of Code project. These past few weeks have definitely been busier and I’ve gotten the chance to work with some more difficult problems, and I’m excited to say that I’ve learned quite a lot.
Consistent Replacement with NULL PR #6157
This aimed to solve an inconsistency with the replacement of row columns in data.table
s. As Issue #5558 depicts, when replacing a list column of a single-row data.table
with list(NULL)
, the column is deleted instead of replaced by an empty list. This is inconsistent with replacement of multi-row data.table
s as well as base R data.frame
s. This change also came with a host of documentation changes, as we expect that this change will cause revdep (reverse dependency) issues. Here’s a link to one of the comments I left in the PR explaining some of the changes with examples: https://github.com/Rdatatable/data.table/pull/6167#issuecomment-2163811494.
Function Wrappers to get a DT without Keys PR #6175
This simple PR closes issue #981. Currently, a data.table
stores its keys as a column, which may be unnecessary in certain cases. For example, the original issue outlines a use-case for a key-less function to return only the data of a data.table
for use in regressions. The fix to this issue is just a very simple wrapper using data.table
’s subset syntax and the .SD
key-word:
keyless <- function(x) x[, .SD, .SDcol = -key(x)]
Obviously, some type-checking and error throwing had to be implemented as well, and some unit tests. Although now after some discussion, it seems that something like this could be easily done by the user, so we’re leaning towards adding the example above to our vignettes to help users with this very niche use-case.
Subset Shallow Copy PR #6182
This small PR updated ?set
documentation and vignettes to reflect a slight inconsistency in the copying of data.table
s when doing subset operations. Currently, the behavior when we subset a data.table
with i
creates a deep copy of the data.table
and allows us to change the data.table
using the :=
operator in-place:
DT[a > 3] # returns a data.table where column a values a > 3 is true.
DT[a > 3, b := 2] # updates column b values to 2 whenever a > 3 is true.
However, a shallow copy is made whenever the i
argument is not provided or equal to TRUE
, which is essentially a reference to the original data.table
. This means that both the original and the copy may be updated by reference:
DT = data.table(a = 1:5)
DT[, address(a)]
# 0x55dd1a8e3758
dt1 = DT[] # doesn't create a copy, same as 'dt1 = DT'
dt1[, address(a)]
# 0x55dd1a8e3758
dt1[, a := 2] # updates dt1 AND DT
dt1[, address(a)]
# 0x55dd1a8e3758
all.equal(DT, dt1) # since dt1 isn't a copy, DT is also updated by reference when using ':='
# TRUE
# --------------------------------------- Copies when subset is non-empty
dt2 = DT[1:.N] # copies rows 1 - nrow(DT)
dt2[, address(a)]
# 0x55dd16ebbdd8
dt2[, a := 2] # only updates dt2
dt2[, address(a)]
# 0x55dd16ebbdd8
all.equal(DT, dt2) # dt2 is a copy, therefore DT isn't updated by reference
# Column \'a\': Mean relative difference: 0.5384615
set() not Updating Rows Should Still Add Cols Issue #5409
This issue brings up an inconsistency with set()
and other forms of assignment such as :=
when adding new columns. Apparently, when set()
doesn’t select a row to perform an operation on (or is not a valid row, such as negative indices), the internal C code terminates early and doesn’t complete the addition of a new column. This is confusing as set()
is supposed to be the exact same as the other assignment forms but with low overhead.
x <- data.table(a = 1)
x[0L, "b"] <- character(0)
x
#> a b
#> <num> <char>
#> 1: 1 <NA>
x <- data.table(a = 1)
x[0L, b := character(0)]
x
#> a b
#> <num> <char>
#> 1: 1 <NA>
# ---------------------------------------
x <- data.table(a = 1)
set(x, 0L, "b", character(0))
x
#> a
#> <num>
#> 1: 1
This doesn’t seem very intuitive because under the hood, I expect all the forms of setting a data.table
to use the same parts of the code, but this is not the case. For example, when using set()
, and selecting rows by indices, I get an warning about coercion when I pass in a double instead of an integer, however this doesn’t happen when using :=
. This means that very similar functions are calling different parts of the code, and as Ben suggests, there should be a refactoring of our internal code to bring these different forms closer together.
Open PRs
Some open PRs that are still under work/need more discussion: