A common task is to isolate records when some condition changes. Failure analysis is an obvious example. When working in a tidy
environment, this isn’t always easy because of its strong column bias. Looking back at previous rows is clunky, at best.
The sometimes overlooked which() provides one part of the solution. Consider a simple case of two vectors, easily combinable into a data frame: one holds a classification and the other indicates whether or not that classification is present.
genres <- c("Action", "Animation", "Comedy", "Documentary", "Romance", "Short")
row_vector <- c(0, 0, 1, 1, 0, 0)
indices <- min(which(row_vector == TRUE))
indices
## [1] 3
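Both genres and row_vector are vectors of equal length, one character and one numeric, and the correspondence depends on position, with the Action 0 pair being the first and Short 0 the last. Python users may see a resemblance to zipping two lists, or loosely to a dictionary, although here nothing but position ties the two vectors together.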
The indices object is a length-one vector. Working from the inside out, row_vector == TRUE compares each element with TRUE, which is coerced to 1; which() returns the positions where that comparison holds (here 3 and 4); and min() picks the first of those positions. So, we end up with 3 and genres[3] evaluates to Comedy.
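The positional lookup described above can be sketched in a few lines. The name present is only an illustrative stand-in for row_vector, and match() is shown as one base R alternative to min(which()) for finding the first hit.

```r
genres  <- c("Action", "Animation", "Comedy", "Documentary", "Romance", "Short")
present <- c(0, 0, 1, 1, 0, 0)   # illustrative stand-in for row_vector

# Positions where the classification is present
hits <- which(present == 1)
hits           # 3 4

# Use those positions to index the other vector
genres[hits]   # "Comedy" "Documentary"

# match() returns the position of the first match only
first_hit <- match(1, present)
genres[first_hit]   # "Comedy"
```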
Using the positions of elements in one vector to identify elements in another provides a way to use rle(), the run-length encoding function. What rle() does is record, for each run of consecutive identical values in a vector, the value and the number of times it appears in a row. As usual, it helps to run the help() example, with some inspection:
x <- rev(rep(6:10, 1:5))
x
## [1] 10 10 10 10 10 9 9 9 9 8 8 8 7 7 6
y <- rle(x)
y
## Run Length Encoding
## lengths: int [1:5] 5 4 3 2 1
## values : int [1:5] 10 9 8 7 6
str(y)
## List of 2
## $ lengths: int [1:5] 5 4 3 2 1
## $ values : int [1:5] 10 9 8 7 6
## - attr(*, "class")= chr "rle"
The two pieces of y, y$lengths (or y[[1]]) and y$values (or y[[2]]), tell us that there are five repetitions of 10, four of 9, etc.
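The help() example uses values that never recur, which can hide the fact that rle() counts consecutive runs rather than totals. A small sketch with a made-up vector z shows the difference, and also that base R's inverse.rle() rebuilds the original vector from the encoding.

```r
z <- c(1, 1, 0, 1)   # made-up vector: the value 1 appears in two separate runs
r <- rle(z)

r$lengths   # 2 1 1  (three runs, not "three 1s and one 0")
r$values    # 1 0 1

# The encoding is lossless: inverse.rle() reverses it
identical(inverse.rle(r), z)   # TRUE
```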
Let’s create a simple data frame:
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))
my_data <- tibble(
  type  = rep(1:2, each = 9),
  hour  = rep(1:9, 2),
  event = c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)
)
my_data
## # A tibble: 18 x 3
## type hour event
## <int> <int> <dbl>
## 1 1 1 1
## 2 1 2 1
## 3 1 3 1
## 4 1 4 0
## 5 1 5 0
## 6 1 6 0
## 7 1 7 0
## 8 1 8 1
## 9 1 9 1
## 10 2 1 1
## 11 2 2 1
## 12 2 3 0
## 13 2 4 0
## 14 2 5 0
## 15 2 6 1
## 16 2 7 0
## 17 2 8 1
## 18 2 9 0
Let 1 in column event indicate success and 0 failure. Where does each string of successes turn into failure?
runs <- rle(my_data$event)
# Convert the rle object to a tibble with one row per run
runs <- tibble(lengths = runs$lengths, values = runs$values)
runs
## # A tibble: 8 x 2
## lengths values
## <int> <dbl>
## 1 3 1
## 2 4 0
## 3 4 1
## 4 3 0
## 5 1 1
## 6 1 0
## 7 1 1
## 8 1 0
# The cumulative sum of run lengths is the row of my_data where each run ends
sequences <- runs %>% mutate(indices = cumsum(lengths))
sequences
## # A tibble: 8 x 3
## lengths values indices
## <int> <dbl> <int>
## 1 3 1 3
## 2 4 0 7
## 3 4 1 11
## 4 3 0 14
## 5 1 1 15
## 6 1 0 16
## 7 1 1 17
## 8 1 0 18
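The indices column works because the cumulative sum of the run lengths is the position of the last element of each run. A minimal base R sketch of that idea follows; the names ends and starts are illustrative, and the start positions are an extra not computed in the steps above.

```r
event <- c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)   # same values as my_data$event
r <- rle(event)

ends   <- cumsum(r$lengths)        # last position of each run
starts <- ends - r$lengths + 1     # first position of each run

data.frame(values = r$values, starts = starts, ends = ends)
# ends:   3 7 11 14 15 16 17 18
# starts: 1 4  8 12 15 16 17 18
```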
post_zero <- sequences %>% filter(values == 0)
post_zero
## # A tibble: 4 x 3
## lengths values indices
## <int> <dbl> <int>
## 1 4 0 7
## 2 3 0 14
## 3 1 0 16
## 4 1 0 18
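# Join on indices and keep the runs of successes (values.x == 1);
# indices is the row of my_data where each of those runs ends
result <- left_join(sequences, post_zero, by = "indices") %>% select(1:3) %>% filter(values.x == 1)
colnames(result) <- c("lengths", "runs", "indices")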
result
## # A tibble: 4 x 3
## lengths runs indices
## <int> <dbl> <int>
## 1 3 1 3
## 2 4 1 11
## 3 1 1 15
## 4 1 1 17
my_data[result$indices,]
## # A tibble: 4 x 3
## type hour event
## <int> <int> <dbl>
## 1 1 3 1
## 2 2 2 1
## 3 2 6 1
## 4 2 8 1
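For type = 1, one string of successes turned into failure after hour three; for type = 2, three did, after hours two, six and eight. Because rle() was run on the whole event column without regard to type, the four-row run that ends at hour two of type = 2 actually begins at hour eight of type = 1.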
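If runs should not cross from one type into the next, one possible variation is to apply rle() within each group. The sketch below is an assumption-laden alternative rather than the workflow shown above: success_run_ends() is a hypothetical helper, and it deliberately drops a run of successes that is still open at the end of a group, since that run never turns into failure.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))

my_data <- tibble(
  type  = rep(1:2, each = 9),
  hour  = rep(1:9, 2),
  event = c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0)
)

# Hypothetical helper: positions where a run of 1s is followed by another run
success_run_ends <- function(event) {
  r    <- rle(event)
  ends <- cumsum(r$lengths)
  keep <- r$values == 1 & seq_along(ends) < length(ends)
  ends[keep]
}

# slice() evaluates the helper separately for each type
result_by_type <- my_data %>%
  group_by(type) %>%
  slice(success_run_ends(event)) %>%
  ungroup()

result_by_type
```

For this data the grouped version selects the same four rows as before (hour 3 for type 1; hours 2, 6 and 8 for type 2), but it would no longer report a run that merely continues across the boundary between types.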
More interesting, of course, is the case where hour is a datetime object and you can bring date arithmetic into play.
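As a rough illustration of the datetime case, the sketch below uses invented hourly timestamps (the stamp vector and its start date are assumptions, not data from this post) and base R's difftime() to measure how long each run lasted.

```r
# Invented hourly timestamps, one per observation
stamp <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + 3600 * (0:8)
event <- c(1, 1, 1, 0, 0, 0, 0, 1, 1)

r      <- rle(event)
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Elapsed time between the first and last observation of each run
duration <- difftime(stamp[ends], stamp[starts], units = "hours")

data.frame(values = r$values,
           start  = stamp[starts],
           end    = stamp[ends],
           hours  = as.numeric(duration))
# hours: 2 3 1
```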
The main point is that if you can design a logical test to mutate a numeric column, rle() provides a straightforward way of subsetting sequences based on the test result.
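To make that concrete, here is a small sketch with made-up sensor readings and an arbitrary threshold of 80 (both are assumptions for illustration). The logical test is passed straight to rle(), and runs are then filtered by value and length.

```r
reading <- c(71, 74, 82, 85, 83, 70, 69, 90, 91, 72)   # made-up values

# The logical test defines the runs
r      <- rle(reading > 80)
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep only runs where the test is TRUE for at least two readings in a row
long <- r$values & r$lengths >= 2

starts[long]   # 3 8
ends[long]     # 5 9

# Pull out the readings belonging to the first such run
reading[starts[long][1]:ends[long][1]]   # 82 85 83
```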