The first week of the Google of Summer of Code has concluded, and it’s been a good start to the project. This week, I submitted 5 PRs, with two merged, two being reviewed and one that is still a work in progress. Below, I’ll outline some of the work that has been done so far!
Fixing Windows Parsing Issue PR #6150
This PR closed issue #6141. This PR addressed a GHA (GitHub Action) failure with a part of our test suite with certain Windows builds. More specifically, the default Windows encoding in this particular test wasn’t able to handle the multibyte, Japanese characters used in some of the tests, causing the test script to not parse properly:
Error in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
18450:119: unexpected INCOMPLETE_STRING
18449: DT = data.table(strrep(ja_ichi, 1L:4L))
18450: test(2253.10, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c("
I was able to recreate this failure on my Linux machine by setting locales to use the Linux equivalent Latin-1 encoding.
My first attempt at a fix was to use strrep()
and unicode escapes to avoid the explicit use of Japanese characters, so that the tests are able to be parsed properly at least:
# Before
c("一", "一一", "一一一", "一一一一")
# After
ja_ichi = "\u4E00"
strrep(ja_ichi, 1:4L)
This allowed R-CMD-Check
to parse properly, except the tests would still fail if the system encoding didn’t support multibyte characters. This is because the tests were written and passed on machines with UTF-8, which handles characters differently than other encodings, such as the Windows’ English_United States.1252, which was the encoding used in this GHA machine. To get around this, I simply wrapped the tests in a local function, changed the system LC_CTYPE
to use UTF-8 manually, and then restored the user’s old locale with on.exit()
.
Although this change looks good on paper and has been merged, we need to keep an eye on the CI tests and see if the changes actually worked, which is difficult as of right now as we are facing some unrelated issues with CI tests. After these issues are resolved, we are able to confirm if the changes work as intended.
Update “fifelse()” Documentation to Reflect Evaluation Behavior PR #6151
This PR addresses Issue #4549. data.table provides a fast and robust alternative to base-R’s ifelse()
called fifelse()
, offering much better performance and identical syntax. In many cases, it makes sense for users to substitute replace ifelse()
with fifelse()
.
However, a data.table user reported that when he wanted to use fifelse()
in a recursive function, it would fail, and the fifelse()
documentation wasn’t helpful in detailing such case. Below is an MRE for the reported issue:
# use base::ifelse
gcd_base <- function(x,y) {
r <- x%%y;
return(ifelse(r, gcd_base(y, r), y))
}
gcd_dt <- function(x,y) {
r <- x%%y;
return(data.table::fifelse(r, gcd_dt(y, r), y))
}
gcd_base(10, 1)
# [1] 1
gcd_dt(10, 1)
# Error: C stack usage 7971780 is too close to the limit
We can see that although both functions are identical, the fifelse()
version recursed infinitely, and the reason is not immediately obvious. In the discussion that occurred after the issue was reported, Cole pointed out that fifelse()
evaluates both yes
and no
fields during the call to the internal C API. This means that if a recursive function call lies in either field, the function would attempt to call itself before the call to our C API, resulting in an infinite loop.
It was agreed that a documentation change is the best way to close this issue, as modifying the current behaviour while maintaining performance would mean sacrificing type-checking, and probably cause unforseen issues down the road. So, I updated the “details” section of the ?fifelse
doc to better reflect this behavior, and point users to fcase()
, which offers a similar feature. This also called for an update for ?fcase
documentation as well, showing users how it can be used to handle recursion instead of fifelse()
through an example.
New Value for fread() blank.lines.skip PR #6158
This PR closes Issue #5611.
This new value for blank.lines.skip
gives users the option to turn line skipping off for every blank line, including those at the beginning of the file. Currently, fread()
starts reading at the first non-blank line of each file, which means blank lines at the beginning of each file are skipped even if users don’t wish to skip any blank lines at all. The solution to this is to turn off skipping with blank.lines.skip = "none"
, so we don’t break existing code by changing blank.lines.skip = FALSE
’s behaviour.
Old:
file <- tempfile()
writeLines(c("", "", NA), file)
data.table::fread(file, blank.lines.skip = FALSE)
#> V1
#> <lgcl>
#> 1: NA
New:
file <- tempfile()
writeLines(c("", "", NA), file)
data.table::fread(file, blank.lines.skip = "none")
#> V1
#> <lgcl>
#> 1: NA
#> 2: NA
#> 3: NA
I achieved this by adding a check before we do the initial skipping of lines depending on the value of blank.lines.skip
. This PR is still a work in progress, because in order for the changes to work, I have to bypass two assertions, which I’m not confident about. However, even with the bypassing, the changes seem to work, so I’ll be taking time to work on this further next week.
Update “strip.white” Documentation in fread() PR #6159
This was a really quick documentation update, closed issue #2492:
Old:
default is TRUE. Strips leading and trailing whitespaces of unquoted fields. If FALSE, only header trailing spaces are removed.
New:
Logical, default TRUE, in which case leading and trailing whitespace is stripped from unquoted “character” fields. “numeric” fields are always stripped of leading and trailing whitespace.
This behavior changed sometime ago but the documentation was never updated to reflect.
fwrite() Headers are no Longer Automatically Quoted when “na” is Provided. PR #6165
fwrite()
had some interesting behavior where when the na
argument is non-empty, and when quote
isn’t specified, headers/column names are quoted, quite unintuitively. To fix this issue, I added a new internal variable within fwrite()
to track the quoting logic we want with column names, which in most cases except the one described above, is the same as with fields. I actually wrote a more in-depth explanation here so feel free to read more if you’re interested.
Closing Thoughts
Overall, this week has been quite productive, and although several tasks were trivial, I learned quite a bit from the codebase. With this progression, I expect to be tackling larger tasks very soon. Thanks for reading!