Intelligently peruse data • sift

sift facilitates intelligent & efficient exploration of datasets.

# install.packages("devtools")
devtools::install_github("sccmckenzie/sift")

sift is designed to work seamlessly with tidyverse.

library(tidyverse) # needed for below examples
library(sift)

1. `sift::sift()`

Imagine `dplyr::filter()` that includes neighboring observations.

Perhaps you remember the Utah monolith. The buzz surrounding its discovery (and disappearance) served as a welcome diversion from the otherwise upsetting twists and turns of 2020.

Suppose we are asked: what else was happening in the world around this time?

Let’s peruse the nyt2020 dataset to refresh our memory.

nyt2020 %>% 
  filter(str_detect(headline, "Monolith")) %>% 
  glimpse()
#> Rows: 1
#> Columns: 6
#> $ headline     <chr> "Monolith Discovered in Utah Desert"
#> $ abstract     <chr> "A metal monolith, planted firmly in the ground with no c~
#> $ byline       <chr> "By Storyful"
#> $ pub_date     <date> 2020-11-24
#> $ section_name <chr> "Science"
#> $ web_url      <chr> "https://www.nytimes.com/video/science/earth/100000007471~

The monolith story broke on 2020-11-24. Prior to writing this documentation, I certainly would not have remembered this happening in November specifically.

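Let’s take a peek at other headlines published within 2 days of that date. The filter below uses strict inequalities, so it returns 2020-11-23 through 2020-11-25.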

nyt2020 %>% 
  filter(pub_date > "2020-11-22",
         pub_date < "2020-11-26") %>% 
  select(headline, pub_date)
#> # A tibble: 15 x 2
#>    headline                                                           pub_date  
#>    <chr>                                                              <date>    
#>  1 Biden Has Chosen a Secretary of State                              2020-11-23
#>  2 Pat Quinn, Who Promoted A.L.S. Ice Bucket Challenge, Dies at 37    2020-11-23
#>  3 Business Leaders, Citing Damage to Country, Urge Trump to Begin T~ 2020-11-23
#>  4 Pandemic Crowds Bring ‘Rivergeddon’ to Montana’s Rivers            2020-11-23
#>  5 No, Joe Biden did not have a maskless birthday party last week.    2020-11-23
#>  6 Monolith Discovered in Utah Desert                                 2020-11-24
#>  7 Coronavirus in N.Y.: Latest Updates                                2020-11-24
#>  8 Two Darwin Notebooks, Missing for Decades, Were Most Likely Stolen 2020-11-24
#>  9 Recent Commercial Real Estate Transactions                         2020-11-24
#> 10 Trump Administration Approves Start of Formal Transition to Biden  2020-11-24
#> 11 The C.D.C. is considering shortening its recommended quarantine p~ 2020-11-25
#> 12 A Poem of Gratitude From Nebraska                                  2020-11-25
#> 13 Casualties From Banned Cluster Bombs Nearly Doubled in 2019, Most~ 2020-11-25
#> 14 Iran Frees British-Australian Scholar in Prisoner Swap             2020-11-25
#> 15 A Poem of Gratitude From West Virginia                             2020-11-25

Notice that it took two steps to achieve the above result. We first had to find the date of the monolith story, then make a second call to filter() with that date hard-coded. This procedure would quickly become a nuisance after a few iterations.

sift() provides an interface to perform this exact process in one step.

nyt2020 %>% 
  sift(pub_date, scope = 2, str_detect(headline, "Monolith")) %>% 
  select(headline, pub_date)
#> # A tibble: 15 x 2
#>    headline                                                           pub_date  
#>    <chr>                                                              <date>    
#>  1 Biden Has Chosen a Secretary of State                              2020-11-23
#>  2 Pat Quinn, Who Promoted A.L.S. Ice Bucket Challenge, Dies at 37    2020-11-23
#>  3 Business Leaders, Citing Damage to Country, Urge Trump to Begin T~ 2020-11-23
#>  4 Pandemic Crowds Bring ‘Rivergeddon’ to Montana’s Rivers            2020-11-23
#>  5 No, Joe Biden did not have a maskless birthday party last week.    2020-11-23
#>  6 Monolith Discovered in Utah Desert                                 2020-11-24
#>  7 Coronavirus in N.Y.: Latest Updates                                2020-11-24
#>  8 Two Darwin Notebooks, Missing for Decades, Were Most Likely Stolen 2020-11-24
#>  9 Recent Commercial Real Estate Transactions                         2020-11-24
#> 10 Trump Administration Approves Start of Formal Transition to Biden  2020-11-24
#> 11 The C.D.C. is considering shortening its recommended quarantine p~ 2020-11-25
#> 12 A Poem of Gratitude From Nebraska                                  2020-11-25
#> 13 Casualties From Banned Cluster Bombs Nearly Doubled in 2019, Most~ 2020-11-25
#> 14 Iran Frees British-Australian Scholar in Prisoner Swap             2020-11-25
#> 15 A Poem of Gratitude From West Virginia                             2020-11-25

Under the hood, sift() passes str_detect(headline, "Monolith") to dplyr::filter(), then augments the filtered observations with any rows falling within the 2-day window around them (specified by pub_date and scope = 2).
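The two-step idea can be sketched by hand on toy data. This is only an illustration, not sift's actual implementation: `events`, `manual_sift()` and its arguments are hypothetical names, and the strict `<` comparison is an assumption chosen to mirror the output shown above.

```r
library(dplyr)

# Toy data (hypothetical; not shipped with sift)
events <- tibble(
  day   = as.Date("2020-11-20") + 0:8,
  label = c("a", "b", "c", "d", "target", "e", "f", "g", "h")
)

# Hypothetical helper illustrating the idea behind sift()
manual_sift <- function(data, scope, hit) {
  days    <- as.numeric(data$day)  # dates as day counts
  anchors <- days[hit]             # step 1: rows matching the condition
  # step 2: keep every row whose distance to any anchor is below `scope`
  keep <- vapply(days, function(d) any(abs(d - anchors) < scope), logical(1))
  filter(data, keep)
}

result <- manual_sift(events, scope = 2, hit = events$label == "target")
result$label
```

With these assumptions, the matching row and its immediate neighbors on either side are retained.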

2. `sift::break_join()`

Harness the combined power of `dplyr::left_join()` & `findInterval()`.

Take a look at the structure of us_uk_pop and us_uk_leaders below. How would you join these two datasets together? Specifically, we want each row in us_uk_pop to contain information (name, party) for the leader at that time.

us_uk_pop %>% 
  group_by(country) %>% 
  slice_head(n = 3)
#> # A tibble: 6 x 3
#> # Groups:   country [2]
#>   country date       population
#>   <chr>   <date>          <int>
#> 1 UK      1995-01-21   57997197
#> 2 UK      1996-01-19   58168519
#> 3 UK      1997-01-21   58346633
#> 4 USA     1995-01-20  268039654
#> 5 USA     1996-01-20  271231546
#> 6 USA     1997-01-19  274606475

us_uk_leaders
#> # A tibble: 11 x 4
#>    country name     start      party       
#>    <chr>   <chr>    <date>     <chr>       
#>  1 USA     Bush     1989-01-20 Republican  
#>  2 USA     Clinton  1993-01-20 Democratic  
#>  3 USA     Bush     2001-01-20 Republican  
#>  4 USA     Obama    2009-01-20 Democratic  
#>  5 UK      Thatcher 1979-05-04 Conservative
#>  6 UK      Major    1990-11-28 Conservative
#>  7 UK      Blair    1997-05-02 Labour      
#>  8 UK      Brown    2007-06-27 Labour      
#>  9 UK      Cameron  2010-05-11 Conservative
#> 10 UK      May      2016-07-13 Conservative
#> 11 UK      Johnson  2019-07-24 Conservative

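If you look closely at the dates in us_uk_pop, they typically fall around January 20th (US inauguration day). Joining by country and calendar year (year(date) matched to year(start)) would sweep this inconvenient detail under the rug: the 2001-01-18 reading, for example, would be attributed to Bush even though Clinton was still in office.

For one country alone, we could use findInterval().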

us_uk_pop %>% 
  filter(country == "USA") %>% 
  mutate(name = filter(us_uk_leaders, country == "USA")$name[findInterval(date, filter(us_uk_leaders, country == "USA")$start)])
#> # A tibble: 19 x 4
#>    country date       population name   
#>    <chr>   <date>          <int> <chr>  
#>  1 USA     1995-01-20  268039654 Clinton
#>  2 USA     1996-01-20  271231546 Clinton
#>  3 USA     1997-01-19  274606475 Clinton
#>  4 USA     1998-01-20  278053607 Clinton
#>  5 USA     1999-01-20  281419130 Clinton
#>  6 USA     2000-01-19  284594395 Clinton
#>  7 USA     2001-01-18  287532638 Clinton
#>  8 USA     2002-01-19  290270187 Bush   
#>  9 USA     2003-01-18  292883010 Bush   
#> 10 USA     2004-01-17  295487267 Bush   
#> 11 USA     2005-01-19  298165797 Bush   
#> 12 USA     2006-01-19  300942917 Bush   
#> 13 USA     2007-01-21  303786752 Bush   
#> 14 USA     2008-01-21  306657153 Bush   
#> 15 USA     2009-01-21  309491893 Obama  
#> 16 USA     2010-01-18  312247116 Obama  
#> 17 USA     2011-01-18  314911752 Obama  
#> 18 USA     2012-01-19  317505266 Obama  
#> 19 USA     2013-01-21  320050716 Obama

The above code is hard to read. Additionally, there is no straightforward way to extend it to handle UK & USA rows together.

break_join() provides a simple interface that combines the functionality of dplyr::left_join() and findInterval().
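To unpack the lookup used above, here is findInterval() on its own, using a few USA dates taken from the tables in this section. The variable names (`start`, `leader`, `date`, `idx`) are chosen for illustration.

```r
# Term start dates and leader names for the USA, sorted by start date
start  <- as.Date(c("1989-01-20", "1993-01-20", "2001-01-20", "2009-01-20"))
leader <- c("Bush", "Clinton", "Bush", "Obama")

# A few population dates to look up
date <- as.Date(c("1995-01-20", "2001-01-18", "2002-01-19", "2009-01-21"))

# For each date, findInterval() returns the index of the last start <= date
idx <- findInterval(date, start)
idx          # 2 2 3 4
leader[idx]  # "Clinton" "Clinton" "Bush" "Obama"
```

Note that findInterval() requires the break vector (`start` here) to be sorted in non-decreasing order.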

break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))
#> Joining, by = "country"
#> # A tibble: 38 x 5
#>    country date       population name    party     
#>    <chr>   <date>          <int> <chr>   <chr>     
#>  1 USA     1995-01-20  268039654 Clinton Democratic
#>  2 USA     1996-01-20  271231546 Clinton Democratic
#>  3 USA     1997-01-19  274606475 Clinton Democratic
#>  4 USA     1998-01-20  278053607 Clinton Democratic
#>  5 USA     1999-01-20  281419130 Clinton Democratic
#>  6 USA     2000-01-19  284594395 Clinton Democratic
#>  7 USA     2001-01-18  287532638 Clinton Democratic
#>  8 USA     2002-01-19  290270187 Bush    Republican
#>  9 USA     2003-01-18  292883010 Bush    Republican
#> 10 USA     2004-01-17  295487267 Bush    Republican
#> # ... with 28 more rows

Notice that country was detected as a common variable (courtesy of dplyr::left_join()).

Alternatively, we could have supplied by explicitly.

# effectively the same as above call
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"), by = "country")
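For comparison, a hand-rolled version of the same idea might apply findInterval() within each country. The sketch below is not how break_join() is implemented; `pop`, `leaders` and `manual` are small toy objects made up for illustration, with the same column names as the datasets above.

```r
library(dplyr)

# Toy stand-ins (hypothetical subsets, not the package datasets)
pop <- tibble(
  country = c("USA", "USA", "UK", "UK"),
  date    = as.Date(c("1996-01-20", "2002-01-19", "1996-01-19", "1998-01-20"))
)
leaders <- tibble(
  country = c("USA", "USA", "UK", "UK"),
  name    = c("Clinton", "Bush", "Major", "Blair"),
  start   = as.Date(c("1993-01-20", "2001-01-20", "1990-11-28", "1997-05-02"))
)

manual <- pop %>%
  group_by(country) %>%
  group_modify(function(rows, key) {
    # Leaders for the current country, sorted as findInterval() requires
    ref <- leaders %>% filter(country == key$country) %>% arrange(start)
    idx <- findInterval(rows$date, ref$start)
    idx[idx == 0] <- NA  # dates before the first start get NA
    rows$name <- ref$name[idx]
    rows
  }) %>%
  ungroup()

manual
```

Groups come back sorted by country, so the UK rows appear first in `manual`.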

Additional arguments supplied to ... are passed on to dplyr::left_join() and findInterval().

set.seed(1)
a <- tibble(x = 1:5, y = runif(5, 4, 6))
b <- tibble(y = c(4, 5), z = c("A", "B"))

break_join(a, b, brk = "y")
#> # A tibble: 5 x 3
#>       x     y z    
#>   <int> <dbl> <chr>
#> 1     5  4.40 A    
#> 2     1  4.53 A    
#> 3     2  4.74 A    
#> 4     3  5.15 B    
#> 5     4  5.82 B

break_join(a, b, brk = "y", all.inside = TRUE)
#> # A tibble: 5 x 3
#>       x     y z    
#>   <int> <dbl> <chr>
#> 1     5  4.40 A    
#> 2     1  4.53 A    
#> 3     2  4.74 A    
#> 4     3  5.15 A    
#> 5     4  5.82 A
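The effect of all.inside is easiest to see by calling findInterval() directly. The values below are illustrative, chosen to resemble the y column above.

```r
breaks <- c(4, 5)
y <- c(4.40, 4.53, 4.74, 5.15, 5.82)

# Default: values at or above the last break map to the last index
default_idx <- findInterval(y, breaks)

# all.inside = TRUE restricts indices to 1, ..., length(breaks) - 1,
# so with two breaks every value maps to index 1
inside_idx <- findInterval(y, breaks, all.inside = TRUE)

default_idx  # 1 1 1 2 2
inside_idx   # 1 1 1 1 1
```

This is why every row is matched to "A" in the second break_join() call.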

3. `sift::kluster()`

Imagine 1D K-means, except K is chosen automatically.

Consider the faithful dataset.

A density plot of eruptions clearly shows 2 clusters.

Currently, these clusters are implicit, meaning we do not have a categorical variable associating each observation with a cluster. We could manually assign clusters by drawing a line at, say, 3.0.
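The manual approach could look like the following sketch, which uses only base R and the built-in faithful dataset. The names `cutoff` and `manual_cluster` are illustrative, and 3.0 is the hand-picked threshold mentioned above.

```r
# Estimate the density of eruption durations; plot(d) would draw it
d <- density(faithful$eruptions)

# Hand-picked threshold separating short and long eruptions
cutoff <- 3.0

# Assign cluster 1 to short eruptions and cluster 2 to long ones
manual_cluster <- ifelse(faithful$eruptions < cutoff, 1L, 2L)

# Cluster sizes and the range of durations within each cluster
table(manual_cluster)
tapply(faithful$eruptions, manual_cluster, range)
```

The drawback is that the threshold has to be chosen by eye for every new variable.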

kluster() does this automatically - no extra inputs needed.

k <- kluster(faithful$eruptions)
