Imagine a dplyr::filter that also returns neighboring observations. Control how many neighbors are included with the sift.col and scope arguments.

sift(.data, sift.col, scope, ...)

Arguments

.data

A data frame.

sift.col

Column name, as a symbol, to serve as the "sifting/augmenting" dimension. Its values must be non-missing and coercible to numeric.

scope

Specifies augmentation bandwidth relative to "key" observations. Parameter should share the same scale as sift.col.

If length 1, bandwidth used is +/- scope.

If length 2, bandwidth used is (-scope[1], +scope[2]).

...

Expressions passed to dplyr::filter; the rows they select serve as the "key" observations. The same data-masking rules used in dplyr::filter apply here.
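The scope rules can be made concrete with a small sketch. scope_window() below is a hypothetical helper, not part of the package; it only shows how a length-1 or length-2 scope might translate into a numeric window around one key value. Whether sift() treats the window endpoints as inclusive is not stated here, so the bounds should be read as nominal.

```r
# Hypothetical helper, not part of the package: turn `scope` into the
# numeric window around a single key value.
scope_window <- function(key_value, scope) {
  stopifnot(length(scope) %in% c(1, 2))
  if (length(scope) == 1) scope <- c(scope, scope)  # symmetric: +/- scope
  key <- as.numeric(key_value)                      # sift.col must coerce to numeric
  c(lower = key - scope[1], upper = key + scope[2])
}

scope_window(10, 2)        # lower = 8,  upper = 12
scope_window(10, c(0, 2))  # lower = 10, upper = 12 (look forward only)

# A Date coerces to a count of days, so on a <date> column scope is in days.
w <- scope_window(as.Date("2020-01-10"), 2)
as.Date(w, origin = "1970-01-01")  # 2020-01-08 and 2020-01-12
```

Because the window is computed on the numeric scale of sift.col, a scope of 2 means two days for a Date column but two units of whatever a plain numeric column measures.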

Value

A sifted data frame, with 2 additional columns:

  • .cluster <int>: Identifies resulting group formed by each key observation and its neighboring rows. When the key observations are close enough together, the clusters will overlap.

  • .key <lgl>: TRUE indicates key observation.

Details

sift() can be understood as a 2-step process:

  1. .data is passed to dplyr::filter, using subsetting expression(s) provided in .... We'll refer to these intermediate results as "key" observations.

  2. For each key observation, sift expands the row selection in both directions along the dimension specified by sift.col. Any row from the original dataset within scope units of a key observation is captured in the final result.

Essentially, this allows us to "peek" at neighboring rows surrounding the key observations.
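The two steps can be imitated in base R. The sketch below is illustrative only: sift_sketch() is a hypothetical name, it takes the key rows as a logical vector instead of filter expressions, it treats window endpoints as inclusive (an assumption), and it omits the .cluster column. It is not the package's implementation.

```r
# Illustrative re-creation of the two steps in base R; sift_sketch() is a
# hypothetical name and this is not how the package itself is written.
sift_sketch <- function(data, col, scope, is_key) {
  if (length(scope) == 1) scope <- c(scope, scope)
  x <- as.numeric(data[[col]])

  # Step 1: the "key" observations (what the filter expressions would select).
  key_values <- x[is_key]

  # Step 2: keep every row that falls inside the window of at least one key.
  # Endpoints are treated as inclusive here (an assumption).
  near_key <- vapply(
    x,
    function(v) any(v >= key_values - scope[1] & v <= key_values + scope[2]),
    logical(1)
  )

  out <- data[near_key, , drop = FALSE]
  out$.key <- is_key[near_key]  # flag which of the kept rows are keys
  out
}

events <- data.frame(day = 1:10, label = letters[1:10])
res <- sift_sketch(events, "day", scope = 1, is_key = events$label == "e")
res$day   # 4 5 6
res$.key  # FALSE TRUE FALSE
```

The key row (day 5) comes back together with its neighbors on days 4 and 6, which is the "peek" described above.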

Examples

# See current events from same timeframe as 2020 Utah Monolith discovery.
sift(nyt2020, pub_date, scope = 2, grepl("Monolith", headline))
#> # A tibble: 15 x 8
#>    headline   abstract   byline  pub_date   section_name web_url  .cluster .key
#>    <chr>      <chr>      <chr>   <date>     <chr>        <chr>       <dbl> <lgl>
#>  1 Biden Has… The presi… "By Gi… 2020-11-23 U.S.         https:/…        1 FALSE
#>  2 Pat Quinn… Mr. Quinn… "By Co… 2020-11-23 U.S.         https:/…        1 FALSE
#>  3 Business … At the ur… "By Ka… 2020-11-23 U.S.         https:/…        1 FALSE
#>  4 Pandemic … As urbani… "By Ji… 2020-11-23 U.S.         https:/…        1 FALSE
#>  5 No, Joe B… The video… "By Li… 2020-11-23 Technology   https:/…        1 FALSE
#>  6 Monolith … A metal m… "By St… 2020-11-24 Science      https:/…        1 TRUE
#>  7 Coronavir… Upper Man… "By Tr… 2020-11-24 New York     https:/…        1 FALSE
#>  8 Two Darwi… Cambridge… "By Me… 2020-11-24 World        https:/…        1 FALSE
#>  9 Recent Co… Recent co… "By Is… 2020-11-24 Business Day https:/…        1 FALSE
#> 10 Trump Adm… A key off… "By Mi… 2020-11-24 U.S.         https:/…        1 FALSE
#> 11 The C.D.C… Federal h… "By Ro… 2020-11-25 World        https:/…        1 FALSE
#> 12 A Poem of… The New Y… ""      2020-11-25 U.S.         https:/…        1 FALSE
#> 13 Casualtie… The war i… "By Ri… 2020-11-25 World        https:/…        1 FALSE
#> 14 Iran Free… Iranian s… "By Fa… 2020-11-25 World        https:/…        1 FALSE
#> 15 A Poem of… The New Y… ""      2020-11-25 U.S.         https:/…        1 FALSE
# or Biden's presidential victory.
sift(nyt2020, pub_date, scope = 2, grepl("Biden is elected", headline))
#> # A tibble: 15 x 8
#>    headline   abstract   byline  pub_date   section_name web_url  .cluster .key
#>    <chr>      <chr>      <chr>   <date>     <chr>        <chr>       <dbl> <lgl>
#>  1 As China’… New telev… By Viv… 2020-11-06 World        https:/…        1 FALSE
#>  2 Al Roker,… Mr. Roker… By Joh… 2020-11-06 Business Day https:/…        1 FALSE
#>  3 The Lates… A new stu… By Ame… 2020-11-06 U.S.         https:/…        1 FALSE
#>  4 Secretari… While sta… By Sha… 2020-11-06 U.S.         https:/…        1 FALSE
#>  5 Democracy… WILMINGTO… By Tho… 2020-11-06 U.S.         https:/…        1 FALSE
#>  6 Joe Biden… WILMINGTO… By Kat… 2020-11-07 U.S.         https:/…        1 TRUE
#>  7 Biden def… Joseph R.… By Mik… 2020-11-07 U.S.         https:/…        1 FALSE
#>  8 Tension, … The news … By Joh… 2020-11-07 Business Day https:/…        1 FALSE
#>  9 Voters Sa… About a f… By Sab… 2020-11-07 U.S.         https:/…        1 FALSE
#> 10 After War… Amid the … By Rei… 2020-11-07 U.S.         https:/…        1 FALSE
#> 11 Turkey’s … President… By Car… 2020-11-08 Business Day https:/…        1 FALSE
#> 12 There’s n… On Twitte… By Jim… 2020-11-08 U.S.         https:/…        1 FALSE
#> 13 A Nation … Nebraska … By Dio… 2020-11-08 U.S.         https:/…        1 FALSE
#> 14 Five Take… As he add… By Ada… 2020-11-08 U.S.         https:/…        1 FALSE
#> 15 Read Joe … In his vi… By Mat… 2020-11-08 U.S.         https:/…        1 FALSE
# We can specify lower & upper scope to see what happened AFTER Trump tested positive.
sift(nyt2020, pub_date, scope = c(0, 2), grepl("Trump Tests Positive", headline))
#> # A tibble: 10 x 8
#>    headline   abstract   byline  pub_date   section_name web_url  .cluster .key
#>    <chr>      <chr>      <chr>   <date>     <chr>        <chr>       <dbl> <lgl>
#>  1 "Trump Te… The presi… "By Pe… 2020-10-02 U.S.         https:/…        1 TRUE
#>  2 "‘You don… President… "By Da… 2020-10-02 U.S.         https:/…        1 FALSE
#>  3 "17 Repub… Seventeen… "By Ca… 2020-10-02 U.S.         https:/…        1 FALSE
#>  4 "TV news … Televisio… "By Mi… 2020-10-02 U.S.         https:/…        1 FALSE
#>  5 ""         Positive … ""      2020-10-02 World        https:/…        1 FALSE
#>  6 "Battling… The blaze… "By Ma… 2020-10-03 World        https:/…        1 FALSE
#>  7 "Obama of… Former Pr… "By Li… 2020-10-03 U.S.         https:/…        1 FALSE
#>  8 "Contact … Tracing i… "By Be… 2020-10-03 World        https:/…        1 FALSE
#>  9 "What to … Dr. Conle… "By Al… 2020-10-03 U.S.         https:/…        1 FALSE
#> 10 "Trump’s … Outside e… "By Gi… 2020-10-03 Health       https:/…        1 FALSE
# sift recognizes dplyr group specification.
library(dplyr)
library(mopac)
#> Error in library(mopac): there is no package called ‘mopac’
express %>% group_by(direction) %>% sift(time, 30, plate == "EAS-1671") # row augmentation performed within groups.
#> Error in group_by(., direction): object 'express' not found