Third Week of GSoC | Personal Blog

With the third week concluding, I want to share some updates about my Google Summer of Code project. These past few weeks have definitely been busier, I’ve had the chance to work on some more difficult problems, and I’m excited to say that I’ve learned quite a lot.

Consistent Replacement with NULL PR #6157

This PR aimed to resolve an inconsistency in how list columns are replaced in data.tables. As Issue #5558 describes, when a list column of a single-row data.table is replaced with list(NULL), the column is deleted instead of being replaced by an empty list. This is inconsistent with how replacement works for multi-row data.tables as well as base R data.frames. The PR also came with a host of documentation changes, as we expect this change to cause revdep (reverse dependency) issues. Here’s a link to one of the comments I left in the PR explaining some of the changes with examples: https://github.com/Rdatatable/data.table/pull/6167#issuecomment-2163811494.

Function Wrappers to get a DT without Keys PR #6175

This simple PR closes issue #981. Currently, a data.table keeps its key columns alongside the rest of its data, which may be unnecessary in certain cases. For example, the original issue outlines a use-case for a key-less function that returns only the non-key data of a data.table for use in regressions. The fix for this issue is just a very simple wrapper using data.table’s subset syntax and the .SD symbol:

keyless <- function(x) x[, .SD, .SDcols = -key(x)]

Obviously, some type-checking and error handling had to be implemented as well, along with some unit tests. After some discussion, though, it seems that users could easily do something like this themselves, so we’re leaning towards adding the example above to our vignettes instead, to help users with this very niche use-case.
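
To give a rough idea of what that type-checking could look like, here is a minimal sketch. It is my own illustration rather than the code from the PR: the error message and the choice to return a copy when the table has no key are assumptions.

```r
library(data.table)

# Hypothetical, more defensive version of keyless(); not the PR's implementation.
keyless <- function(x) {
  if (!is.data.table(x)) stop("'x' must be a data.table")
  k <- key(x)
  # Assumption: with no key there is nothing to drop, so hand back a copy of x.
  if (is.null(k)) return(copy(x))
  # Keep every column that is not part of the key.
  x[, .SD, .SDcols = setdiff(names(x), k)]
}

DT <- data.table(id = c("b", "a", "c"), v = c(2, 1, 3), key = "id")
res <- keyless(DT)
names(res) # only the non-key column "v" remains
key(DT)    # the original table keeps its key
```

Using setdiff() on the column names sidesteps the negation syntax, at the cost of being slightly more verbose than the one-liner.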

Subset Shallow Copy PR #6182

This small PR updated the ?set documentation and the vignettes to reflect a slight inconsistency in how data.tables are copied during subset operations. Currently, subsetting a data.table with i returns a deep copy of the selected rows, while combining i with the := operator updates the original data.table in place:

DT[a > 3] # returns a new data.table containing the rows where a > 3 is true.

DT[a > 3, b := 2] # updates column b values to 2 whenever a > 3 is true.

However, when the i argument is not provided or is equal to TRUE, only a shallow copy is made, which is essentially a reference to the original data.table. This means that both the original and the copy may be updated by reference:

DT = data.table(a = 1:5)
DT[, address(a)]
# 0x55dd1a8e3758

dt1 = DT[] # doesn't create a copy, same as 'dt1 = DT'
dt1[, address(a)]
# 0x55dd1a8e3758

dt1[, a := 2] # updates dt1 AND DT
dt1[, address(a)]
# 0x55dd1a8e3758

all.equal(DT, dt1) # since dt1 isn't a copy, DT is also updated by reference when using ':='
# TRUE

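# --------------------------------------- Copies when subset is non-empty

DT = data.table(a = 1:5) # start again from a fresh DT, since the one above was updated by reference
dt2 = DT[1:.N] # copies rows 1 - nrow(DT)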
dt2[, address(a)]
# 0x55dd16ebbdd8

dt2[, a := 2] # only updates dt2
dt2[, address(a)]
# 0x55dd16ebbdd8

all.equal(DT, dt2) # dt2 is a copy, therefore DT isn't updated by reference
# Column 'a': Mean relative difference: 0.5384615
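
When an independent table is needed even though no rows are being subset, an explicit copy() avoids the shared-reference behavior shown above. A minimal sketch (dt3 is just an illustrative name):

```r
library(data.table)

DT = data.table(a = 1:5)

dt3 = copy(DT)   # explicit deep copy, independent of DT
dt3[, a := 2L]   # only updates dt3

DT$a                             # DT still holds its original values
address(DT$a) == address(dt3$a)  # compares where the two columns live in memory
```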

set() not Updating Rows Should Still Add Cols Issue #5409

This issue brings up an inconsistency between set() and other forms of assignment, such as :=, when adding new columns. Apparently, when the row index passed to set() doesn’t select any row to operate on (or isn’t a valid row, such as a negative index), the internal C code terminates early and never adds the new column. This is confusing, as set() is supposed to behave exactly like the other assignment forms, just with lower overhead.

x <- data.table(a = 1)
x[0L, "b"] <- character(0)
x
#>        a      b
#>    <num> <char>
#> 1:     1   <NA>

x <- data.table(a = 1)
x[0L, b := character(0)]
x
#>        a      b
#>    <num> <char>
#> 1:     1   <NA>

# ---------------------------------------

x <- data.table(a = 1)
set(x, 0L, "b", character(0))
x
#>        a
#>    <num>
#> 1:     1

This doesn’t seem very intuitive: under the hood, I would expect all the forms of assigning to a data.table to share the same code, but this is not the case. For example, when using set() and selecting rows by index, I get a warning about coercion when I pass in a double instead of an integer, but this doesn’t happen when using :=. This means that very similar functions are calling different parts of the code and, as Ben suggests, our internal code should be refactored to bring these different forms closer together.
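
For contrast, here is a minimal sketch of a case where the two forms do agree: when a valid row is selected, both := and set() add the new column and leave the untouched row as NA. The table and column names are made up for illustration.

```r
library(data.table)

x1 <- data.table(a = 1:2)
x1[1L, b := "z"]        # add column b through `:=`, assigning only row 1

x2 <- data.table(a = 1:2)
set(x2, 1L, "b", "z")   # same assignment through set()

all.equal(x1, x2)       # the two tables end up with the same contents
```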

Open PRs

Some open PRs that are still in progress or need more discussion:

#6165

#6158

Tags: R GSoC Google data.table