GSoC Midterm Evaluations

By Joshua Wu

The midterm evaluations for GSoC have just concluded! Happy to say that I’ve passed and earned the first portion of the stipend for my work on the data.table project. These past few weeks I’ve made significant progress in my open tasks, went on a short trip to San Francisco and learned some cool things.

Progress Indicator for Large “by” Operations PR #6228

This PR closes Issue #3060. As many of data.table’s users would know, performing “by” operations can sometimes take minutes or even hours with large datasets. A highly requested feature was to include a progress indicator or bar to visually represent the completion of these large operations in real time. Currently, it is possible but cumbersome to print a progress bar for these by operations due to some very helpful special symbols in data.table, such as .BY (Read more about special symbols here!).

A helpful example by user @kodiologist of how one could print a progress bar in R with data.table’s special symbols:

bar = txtProgressBar(style = 3, min = 0, max = nrow(unique(d[, GROUP_EXPRESSION))))
d[, by = GROUP_EXPRESSION,
   {setTxtProgressBar(bar, .GRP)
    J_EXPRESSION}]
close(bar)

To solve this issue, I had to dive deep into data.table codebase. First, I took a high-level look at the function responsible for handling the square bracket syntax, ie DT[...]. After lots of print statements, I found that data.table handles “by” operations either with a SEXP dogroups call to our C API, or handled separately by gforce if the operation is optimized by data.table (more about gforce here).

From there, the implementation came pretty easily. Although this PR still isn’t finished, a working prototype has been finished that produces desirable output. I used a style of reporting similar to fwrite, but instead we give number of groups processed out of total groups, time elapsed, percentage completed, time elapsed and estimated time remaining. Getting these values were fairly easy, as the main loop I wrote my code in kept track of the index of the current group and the total number of groups.

Here’s a working example of the progress indicator I played around with when using the very helpful devtools package:

> library(devtools)
> load_all()
> dt = data.table(a = 1:100000000) # operation has to be large enough (>1s)
> dt[, 1, by = a]
# Processed 100000000 groups out of 100000000. 100% done. Time elapsed: 20s. ETA: 0s.
                   a    V1
               <int> <num>
        1:         1     1
        2:         2     1
        3:         3     1
        4:         4     1
        5:         5     1
       ---                
 99999996:  99999996     1
 99999997:  99999997     1
 99999998:  99999998     1
 99999999:  99999999     1
100000000: 100000000     1

I’m very excited to get this PR across the finish line, and will be working on it diligently for the next week!

“fwrite” Quoting Consistency Merged and Revdeps

Recently, PR #6165 was merged and with it came some changes to default behavior. As listed in the PR description, I had to change 7 test, which means this change in behavior may be breaking for other packages that depend on data.table. At first, I thought this would mean running the revdep.R script included in the data.table repository, but after talking with my mentors, I found out that revdeps (reverse dependencies) are automatically checked daily against the master branch (development version of data.table) with the Northern Arizona University’s Monsoon supercomputer. This would cut down the check time from over 10-20 days to less than 12 hours. See more about data.table reverse dependency checks here..

Now, I’m in the process of monitoring these reverse dependency checks, reporting issues with other packages/data.table, submitting PRs to the other packages to help them resolve these issues or to discuss further options if reverse dependency breakages are too severe.

GSoC Midterm Evaluation and Reflection

GSoC evaluations are done by project mentors, and I’m really happy to say that I passed the first evaluation! My evaluating mentor left some very encouraging words for me, complimenting me on the quality and frequency of my communication and contributions. This means I will continue with the program and receive a portion of the total stipend paid out to GSoC students. Looking back at it now that I’m halfway through the program, I’d say the reason for my success so far is the fun I have contributing to this open source project. I feel like I’m constantly learning every day that I’m looking at code, and merging PRs brings lots of joy. Even when my code isn’t perfect or has problems, my reviewers are always kind and bring wonderful insights. I’m looking forward to continue my contributions to data.table for the rest of my time here with GSoC, and even far into the future!

Share: LinkedIn