What happened to the streak? rle: run length encoding

A common task is to isolate records when some condition changes. Failure analysis is an obvious example. When working in a tidy environment, this isn’t always easy because of its strong column bias. Looking back at previous rows is clunky, at best.

The sometimes overlooked which() provides one part of the solution. Consider a simple case of two vectors, easily combinable into a data frame, one of which indicates a classification and the other, whether or not that classification is present.

genres <- c("Action", "Animation", "Comedy", "Documentary", "Romance", "Short")
row_vector <- c(0, 0, 1, 1, 0, 0)
indices <- min(which(row_vector == TRUE))
indices

## [1] 3

genres is a character vector and row_vector a numeric vector of the same length, and the correspondence between them depends on position, with the Action/0 pair first and the Short/0 pair last. Python users may think of the pairing as a dictionary, although here nothing but position ties the two vectors together.

Working from the inside out, which() returns the positions of the elements of row_vector for which the comparison is TRUE (a numeric 1 compares equal to TRUE because R coerces TRUE to 1), and min() picks the first of those positions. So indices is a single-element vector holding 3, and genres[3] evaluates to Comedy.
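As a small sketch of the same idea, the positions returned by which() can be used directly to index the other vector. The name hits is introduced here for illustration, and row_vector is assumed to hold exactly one flag per genre.

```r
genres     <- c("Action", "Animation", "Comedy", "Documentary", "Romance", "Short")
row_vector <- c(0, 0, 1, 1, 0, 0)   # assumed: one flag per genre

hits <- which(row_vector == 1)      # positions of every flagged genre
hits                                # 3 4
genres[hits]                        # all flagged genres
genres[min(hits)]                   # first flagged genre
genres[max(hits)]                   # last flagged genre
```

Dropping min() keeps every matching position, which is what the run-length approach relies on later.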

Using the positions of elements in one vector to identify elements in another provides a way to use rle(), the run-length encoding function.

What rle() does is record, for each run of consecutive identical values in a vector, the value itself and the number of times it repeats in a row.

As usual, it helps to run the help() example, with some inspection:

x <- rev(rep(6:10, 1:5))
x

##  [1] 10 10 10 10 10  9  9  9  9  8  8  8  7  7  6

y <- rle(x)
y

## Run Length Encoding
##   lengths: int [1:5] 5 4 3 2 1
##   values : int [1:5] 10 9 8 7 6

str(y)

## List of 2
##  $ lengths: int [1:5] 5 4 3 2 1
##  $ values : int [1:5] 10 9 8 7 6
##  - attr(*, "class")= chr "rle"

The two pieces of y, y$lengths (or y[[1]]) and y$values, tell us that there are five repetitions of 10, four of 9, etc.
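A few quick checks, offered as a sketch, show how the pieces of an rle object behave: the difference between single and double brackets, the round trip through inverse.rle(), and the fact that a value that reappears later starts a new run. The vector z is made up for illustration.

```r
x <- rev(rep(6:10, 1:5))
y <- rle(x)

is.list(y[1])        # TRUE: single brackets return a one-element list
is.integer(y[[1]])   # TRUE: double brackets return the lengths vector itself

# inverse.rle() rebuilds the original vector from the encoding
identical(inverse.rle(y), x)

# Runs are about consecutive repeats: the second "a" run is counted separately
z <- rle(c("a", "a", "b", "a"))
z$lengths            # 2 1 1
z$values             # "a" "b" "a"
```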

Let’s create a simple data frame:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))
my_data <- tibble(
  type  = rep(1:2, each = 9),
  hour  = rep(1:9, 2),
  event = c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)
)
my_data

## # A tibble: 18 x 3
##     type  hour event
##    <int> <int> <dbl>
##  1     1     1     1
##  2     1     2     1
##  3     1     3     1
##  4     1     4     0
##  5     1     5     0
##  6     1     6     0
##  7     1     7     0
##  8     1     8     1
##  9     1     9     1
## 10     2     1     1
## 11     2     2     1
## 12     2     3     0
## 13     2     4     0
## 14     2     5     0
## 15     2     6     1
## 16     2     7     0
## 17     2     8     1
## 18     2     9     0

Let 1 in column event indicate success and 0 failure. Where does each string of successes turn into failure?

runs <- rle(my_data$event)
runs <- tibble(lengths = runs$lengths, values = runs$values)
runs

## # A tibble: 8 x 2
##   lengths values
##     <int>  <dbl>
## 1       3      1
## 2       4      0
## 3       4      1
## 4       3      0
## 5       1      1
## 6       1      0
## 7       1      1
## 8       1      0

sequences <- tibble(lengths = runs$lengths, values = runs$values) %>%
  mutate(indices = cumsum(lengths))
sequences

## # A tibble: 8 x 3
##   lengths values indices
##     <int>  <dbl>   <int>
## 1       3      1       3
## 2       4      0       7
## 3       4      1      11
## 4       3      0      14
## 5       1      1      15
## 6       1      0      16
## 7       1      1      17
## 8       1      0      18
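The indices column is the cumulative sum of the run lengths, so each entry is the row at which that run ends. A minimal sketch of the arithmetic in base R, with starts and ends as names chosen here for illustration:

```r
event <- c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)
r <- rle(event)

ends   <- cumsum(r$lengths)        # last row of each run
starts <- ends - r$lengths + 1L    # first row of each run

ends     # 3 7 11 14 15 16 17 18
starts   # 1 4 8 12 15 16 17 18
```

Having both ends makes it possible to pull out a whole run, for example event[starts[2]:ends[2]] for the first string of failures.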

post_zero <- sequences %>% filter(values == 0)
post_zero

## # A tibble: 4 x 3
##   lengths values indices
##     <int>  <dbl>   <int>
## 1       4      0       7
## 2       3      0      14
## 3       1      0      16
## 4       1      0      18

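# The join only adds post_zero's columns, which select(1:3) drops; filter() keeps the success runs.
result <- left_join(sequences, post_zero, by = "indices") %>%
  select(1:3) %>% filter(values.x == 1)
colnames(result) <- c("lengths", "runs", "indices")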
result

## # A tibble: 4 x 3
##   lengths  runs indices
##     <int> <dbl>   <int>
## 1       3     1       3
## 2       4     1      11
## 3       1     1      15
## 4       1     1      17
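An alternative sketch, not the method used above, selects the same rows by asking directly for success runs whose next run is a failure run, using dplyr's lead(). The name ended is an assumption made for illustration. A success run at the very end of the data has no following run, so lead() returns NA for it and filter() drops it, which seems reasonable for a streak that has not yet ended.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))

event <- c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)
r <- rle(event)

sequences <- tibble(lengths = r$lengths, values = r$values) %>%
  mutate(indices = cumsum(lengths))

# Success runs that are immediately followed by a failure run
ended <- sequences %>%
  filter(values == 1, lead(values) == 0)

ended$indices   # 3 11 15 17
```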

my_data[result$indices,]

## # A tibble: 4 x 3
##    type  hour event
##   <int> <int> <dbl>
## 1     1     3     1
## 2     2     2     1
## 3     2     6     1
## 4     2     8     1

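The variable type = 1 had one string of successes that ended at hour three; type = 2 had three, ending at hours two, six and eight. Because rle() was run on the whole event column, the run that ends at hour two of type = 2 actually starts at hour eight of type = 1.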
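If runs should not carry over from one type to the next, the encoding can be done per group. The following is a minimal base R sketch rather than the approach used above: it splits the data on type, runs rle() on each piece, and looks up the hour at which each run ends. The names runs_by_type, success_runs and end_hour, and the use of a plain data.frame instead of a tibble, are choices made here for illustration.

```r
my_data <- data.frame(
  type  = rep(1:2, each = 9),
  hour  = rep(1:9, 2),
  event = c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)
)

# Encode runs separately within each type
runs_by_type <- do.call(rbind, lapply(split(my_data, my_data$type), function(d) {
  r <- rle(d$event)
  data.frame(
    type     = d$type[1],
    lengths  = r$lengths,
    values   = r$values,
    end_hour = d$hour[cumsum(r$lengths)]   # hour of the last row in each run
  )
}))

# Keep the success runs only
success_runs <- subset(runs_by_type, values == 1)
success_runs
```

With this data, the four-row run found earlier splits into two runs of length two: one that closes out type = 1 at hour nine and one that opens type = 2 and ends at hour two.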

More interesting, of course, is the case where hour is a datetime object and you can bring date arithmetic into play.
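As a hedged illustration of that idea, the sketch below attaches made-up hourly timestamps to the first nine events and uses difftime() to measure how long each run lasted. The start date, the one-hour spacing, and the names stamps and streaks are all assumptions, not part of the original data.

```r
# Hypothetical timestamps: one reading per hour from an arbitrary start
stamps <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + 3600 * (0:8)
event  <- c(1, 1, 1, 0, 0, 0, 0, 1, 1)

r      <- rle(event)
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1L

streaks <- data.frame(
  value    = r$values,
  start    = stamps[starts],
  end      = stamps[ends],
  duration = difftime(stamps[ends], stamps[starts], units = "hours")
)
streaks
```

Here duration is the time between the first and last reading of a run, so a run of three hourly readings spans two hours.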

The main point is that if you can design a logical test to mutate a numeric column, rle() provides a straightforward way of subsetting sequences based on the test result.
