A few weeks ago, I accidentally generated 126TB worth of data. Before going home on a Thursday, I submitted a few R jobs to the cluster. When I checked up on them Friday after lunch, I discovered they had generated 126TB worth of text files.
After I had killed the running jobs and deleted the files, I launched a mini-investigation. By some miracle I avoided getting a stern email from IT, but I figured I should avoid making hard drive hogging a habit.
The culprit turned out to be some data.table code I had written and never properly tested. In my eagerness to get results, I had run the same script on several datasets before testing it on one. The script in question was one of the first ones I’d written using the data.table package, and some of my assumptions about its syntax turned out to be horribly, horribly wrong.
To avoid similar incidents in the future, I have gone back to basics and learned the syntax properly. In the process I also recreated my mistake for fun, further driving home the point.
Introduction to data.table
data.table is an R package for handling large datasets. It extends the built-in data.frame structure to enable faster operations and updating columns by reference.
The introduction vignette is a great guide for getting started. There are also several vignettes that focus on summarizing and manipulating data, but for this short introduction I’m going to focus on selecting rows and columns.
While the data.table syntax looks similar to the data.frame one, there are a few obvious differences. Firstly, data.table treats column names as variables within the [...] operator. Instead of using the dollar sign to select columns, we can refer to them directly by name. Secondly, row selection takes precedence over column selection. For a data.frame, the statement [1:4] selects the first four columns, whereas for a data.table it selects the first four rows.
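To make the two differences concrete without needing the flights file, here is a minimal sketch on a toy table. The column names (a to e) and values are made up for illustration and are not from the vignette.

```r
library(data.table)

# Toy data; column names and values are hypothetical.
df <- data.frame(a = 1:6, b = letters[1:6], c = 6:1, d = LETTERS[1:6], e = 0)
dt <- as.data.table(df)

# data.frame: a single index inside [ ] selects columns.
dim_df <- dim(df[1:4])   # 6 rows, 4 columns

# data.table: a single index inside [ ] selects rows.
dim_dt <- dim(dt[1:4])   # 4 rows, 5 columns

# Column names are evaluated as variables inside [...] for a data.table,
# while the data.frame needs the df$ prefix and a trailing comma.
big_dt <- dt[a > 3]
big_df <- df[df$a > 3, ]

print(dim_df)
print(dim_dt)
```

Both filters keep the same three rows; only the spelling differs.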
To see these differences more clearly, consider the NYC flights data from the vignette. The example below shows how to select all flights out of JFK with data.table and data.frame.
library(data.table);
dt <- fread('~/data.fun/flights14.csv');
df <- read.csv('~/data.fun/flights14.csv');
jfk.dt <- dt['JFK' == origin];
jfk.df <- df['JFK' == df$origin, ];
print(jfk.dt);
##        year month day dep_delay arr_delay carrier origin dest air_time
##     1: 2014     1   1        14        13      AA    JFK  LAX      359
##     2: 2014     1   1        -3        13      AA    JFK  LAX      363
##     3: 2014     1   1         2         9      AA    JFK  LAX      351
##     4: 2014     1   1         2         1      AA    JFK  LAX      350
##     5: 2014     1   1        -2       -18      AA    JFK  LAX      338
##    ---
## 81479: 2014    10  31        -4       -21      UA    JFK  SFO      337
## 81480: 2014    10  31        -2       -37      UA    JFK  SFO      344
## 81481: 2014    10  31         0       -33      UA    JFK  LAX      320
## 81482: 2014    10  31        -6       -38      UA    JFK  SFO      343
## 81483: 2014    10  31        -6       -38      UA    JFK  LAX      323
##        distance hour
##     1:     2475    9
##     2:     2475   11
##     3:     2475   19
##     4:     2475   13
##     5:     2475   21
##    ---
## 81479:     2586   17
## 81480:     2586   18
## 81481:     2475   17
## 81482:     2586    9
## 81483:     2475   11
Conveniently, data.table also has a different way of printing. By default, only the top and bottom five rows are printed to screen, effectively obliterating the need to use the
The learning curve gets steeper when you start manipulating columns. data.table has a column assignment operator, :=, for adding, updating, and deleting columns by reference. The operator is used directly within the [...], and there’s no need to reassign the object. For example, we can use the following code to add a column indicating if the flight occurred in the morning or afternoon.
dt[, ampm := ifelse(hour < 12, 'am', 'pm')];
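What "by reference" means in practice is easiest to see with two names for the same table. The sketch below uses a toy table with only an hour column (a stand-in for the flights data); the variable names alias and snapshot are made up for illustration.

```r
library(data.table)

# Toy stand-in for the flights data; only the hour column matters here.
dt <- data.table(hour = c(9, 11, 13, 21))

# A plain assignment does not copy a data.table: both names refer to the same object.
alias <- dt
dt[, ampm := ifelse(hour < 12, 'am', 'pm')]
alias_has_ampm <- 'ampm' %in% names(alias)   # the alias sees the new column too

# copy() creates an independent object that later := calls will not touch.
snapshot <- copy(dt)
dt[, ampm := NULL]                            # delete the column by reference
snapshot_has_ampm <- 'ampm' %in% names(snapshot)

print(c(alias = alias_has_ampm, snapshot = snapshot_has_ampm))
```

If you want to experiment with := without touching the original table, working on a copy() is the safer habit.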
Syntax can also be different when selecting columns – and this turned out to be my 126TB downfall. When selecting a single column of a data.frame, base R will simplify to a vector by default. With data.table… it depends.
If you select a column using list notation (dt[[ 'year' ]]), data.table will return a vector. Similarly, dt[, year] will simplify to a vector. However, dt[, 'year'] will return a single column data table.
str( dt[, year] );
## int [1:253316] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##         year
##      1: 2014
##      2: 2014
##      3: 2014
##      4: 2014
##      5: 2014
##     ---
## 253312: 2014
## 253313: 2014
## 253314: 2014
## 253315: 2014
## 253316: 2014
Unfamiliar with the syntax, I inadvertently used paste to join two single-column data table objects together. The result is a monster of a character string. Rather than pasting the vectors element-wise, the whole columns are pasted together into a single string. In the dummy example below, combining three columns of a ~250,000 row data table results in a string with over 3 million characters.
test <- paste0(dt[, 'year'], '-', dt[, 'month'], '-', dt[, 'day']); print( nchar(test) );
## [1] 3252502
To make matters worse, I assigned the resulting string to a column. Instead of storing it once, it was stored as every element of a column in a large data.table. To data.table’s credit, it was able to deal with my accidental monster. R ran as normal, and I only noticed the problems when I checked the available space on disk.
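For contrast, here is a minimal sketch of the buggy call next to an element-wise version, on a three-row toy table with made-up dates. The length check before the assignment is one possible guard, not something the original script contained.

```r
library(data.table)

# Three-row toy table; the values are hypothetical.
dt <- data.table(
    year  = c(2014L, 2014L, 2014L),
    month = c(1L, 1L, 10L),
    day   = c(1L, 2L, 31L)
    )

# Buggy: each dt[, 'col'] is a one-column data.table, so paste0 collapses
# every column into one string and returns a single element.
bug <- paste0(dt[, 'year'], '-', dt[, 'month'], '-', dt[, 'day'])

# Element-wise: refer to the columns as variables inside [...],
# which hands plain vectors to paste0.
ok <- dt[, paste0(year, '-', month, '-', day)]

print(length(bug))   # one string for the whole table
print(length(ok))    # one string per row

# Cheap guard before assigning: the new column should have one value per row.
stopifnot(length(ok) == nrow(dt))
dt[, date := ok]
```

Because := recycles a length-one value across all rows without complaint, a check like this has to happen before the assignment to catch the problem.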
Recreating my Bug
Intrigued by the large consequences of a tiny bug in notation, I wanted to see the effect it would have on a smaller dataset. The flights14.csv data from the vignette is 15MB. I used this file and the bug shown above as a minimal example of my original problem.
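A back-of-the-envelope estimate shows why a size limit is a good idea even for a 15MB input. The row count and string length below are taken from the outputs shown earlier; the assumptions are mine: one byte per character, and the other columns and separators are ignored.

```r
# Numbers from the outputs above.
n.rows  <- 253316     # rows in flights14.csv
n.chars <- 3252502    # characters in the pasted string

# Assumption: the string is written out once per row, at one byte per character.
bytes.on.disk <- n.rows * n.chars

print(bytes.on.disk / 1e9)   # size in GB, roughly 824
```

Under those assumptions, the text file would reach roughly 824GB if nothing stopped it.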
To run this experiment without actually crashing my computer, I used a size-restricted Docker container. Docker can be run with different storage drivers and supporting backend filesystems. In order to use the --storage-opt size option, we need to use the overlay storage driver with the XFS filesystem.
The overlay storage driver can be used for Docker on both Mac and Linux, but the XFS filesystem cannot be mounted on Mac. I had to run Linux on a VirtualBox virtual machine to get it to work on my laptop. I installed CentOS on the virtual machine. Its default filesystem is XFS, and I wanted to minimize setup for the Docker container. In the end I had to do three things to get Docker up and running:
- Download the minimal CentOS distribution and install on virtual machine
- Install Docker community edition by following the official instructions. I skipped the part about devicemapper drivers as I needed to use overlay anyways.
- Add pquota mount option to XFS. This was the trickiest part, and I had a few unsuccessful attempts. In the end I got it to work by following these instructions.
Afterwards, I was ready to launch Docker containers with the --storage-opt size option. An easy way to check if everything works as expected is to run df -h after launching a container. When I did not restrict the size, I got the following output (the virtual machine itself is restricted to 20GB).
--storage-opt size=5G restricts the space available to 5GB.
Just what we wanted! With my safe space set up, I moved on to setting up the container itself. I first wrote a minimal version of the culprit R script that could be run from the container, and saved it as crash.R.
library(data.table);
dt <- fread('flights14.csv');
bug <- paste0(dt[, 'year'], '-', dt[, 'month'], '-', dt[, 'day']);
dt[, bug := bug];
write.table(dt, 'monster.txt');
To execute this within a Docker container, we need to create a Dockerfile with setup instructions. The Dockerfile details what base image the new container should be based on, and what software should be installed. The r-base Docker image contains the latest version of R. Additionally, we need to install the data.table package from CRAN. The COPY command copies the script and data file to the container, and CMD sets a default command to be executed when the container is launched.
FROM r-base
MAINTAINER Erle Holgersen
RUN Rscript -e "install.packages('data.table', repos='https://cloud.R-project.org');"
COPY flights14.csv crash.R src/
WORKDIR src/
CMD ["Rscript", "crash.R"]
From here on out things are fairly straightforward. We first need to build an image from our Dockerfile by running
docker build -t crash .
This will most likely take a few minutes, depending on what Docker images you have built in the past. Afterwards we can launch the container in detached mode and with a 3GB size restriction with the command
docker run --storage-opt size=3G -d -t crash
Detached mode runs the container in the background, which allows us to keep monitoring the state of the container through docker ps. The -s flag displays details on the size of the container. My container took about two and a half minutes to grow to 3.22GB, and then exited. When I tried to re-enter the container, I got a “disk limit exceeded” error. On the bright side, deleting the Docker container was trivial, and instantly recovered the 3GB of space!
Reading documentation and testing your code are good ideas. And if you really want to overload a hard drive, you can use a Docker container.