Join two tables based on observations in one table happening after observations in the other. Each table must have a user_id column, which must always match for two observations to be joined, and a time column, which must be greater in y than in x for the two to be joined. Supports all types of dplyr joins (inner, left, anti, etc.) and requires a type argument to specify which observations in a funnel get kept (see details for supported types).

after_join(x, y, by_time, by_user, mode = "inner",
  type = "first-first", max_gap = NULL, min_gap = NULL,
  gap_col = FALSE, suffix = c(".x", ".y"))

after_inner_join(x, y, by_time, by_user, type, max_gap = NULL,
  min_gap = NULL, gap_col = FALSE, suffix = c(".x", ".y"))

after_left_join(x, y, by_time, by_user, type, max_gap = NULL,
  min_gap = NULL, gap_col = FALSE, suffix = c(".x", ".y"))

after_right_join(x, y, by_time, by_user, type, max_gap = NULL,
  min_gap = NULL, gap_col = FALSE, suffix = c(".x", ".y"))

after_full_join(x, y, by_time, by_user, type, max_gap = NULL,
  min_gap = NULL, gap_col = FALSE, suffix = c(".x", ".y"))

after_anti_join(x, y, by_time, by_user, type, max_gap = NULL,
  min_gap = NULL, gap_col = FALSE, suffix = c(".x", ".y"))

after_semi_join(x, y, by_time, by_user, type, max_gap = NULL,
  min_gap = NULL, gap_col = FALSE, suffix = c(".x", ".y"))

Arguments

x

A tbl representing the first event to occur in the funnel.

y

A tbl representing an event to occur in the funnel.

by_time

A character vector to specify the time columns in x and y. This would typically be a datetime or a date column. These columns are used to filter for time y being after time x.

by_user

A character vector to specify the user or identity columns in x and y.

mode

The method used to join: "inner", "full", "anti", "semi", "right", "left". Each also has its own function, such as after_inner_join.

type

The type of funnel used to distinguish between event pairs, such as "first-first", "last-first", or "any-firstafter". See details for more.

max_gap

Optional: the maximum gap allowed between events. Can be a integer representing the number of seconds or a difftime object, such as as.difftime(2, units = "hours").

min_gap

Optional: the maximum gap allowed between events. Can be a integer representing the number of seconds or a difftime object, such as as.difftime(2, units = "hours").

gap_col

Whether to include a numeric column, .gap, with the time difference in seconds between the events.

suffix

If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

Details

type can be any combination of first, last, any, lastbefore, firstwithin with first, last, any, firstafter. Some common ones you may use include:

first-first

Take the earliest x and y for each user before joining. For example, you want the first time someone entered an experiment, followed by the first time someone ever registered. If they registered, entered the experiment, and registered again, you do not want to include that person.

first-firstafter

Take the first x, then the first y after that. For example, you want when someone first entered an experiment and the first course they started afterwards. You don't care if they started courses before entering the experiment.

lastbefore-firstafter

First x that's followed by a y before the next x. For example, in last click paid ad attribution, you want the last time someone clicked an ad before the first subscription they did afterward.

any-firstafter

Take all Xs followed by the first Y after it. For example, you want all the times someone visited a homepage and their first product page they visited afterwards.

any-any

Take all Xs followed by all Ys. For example, you want all the times someone visited a homepage and all the product pages they saw afterward.

Examples

library(dplyr)
#> #> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’: #> #> filter, lag
#> The following objects are masked from ‘package:base’: #> #> intersect, setdiff, setequal, union
landed <- tribble( ~user_id, ~timestamp, 1, "2018-07-01", 2, "2018-07-01", 2, "2018-07-01", 3, "2018-07-02", 4, "2018-07-01", 4, "2018-07-04", 5, "2018-07-10", 5, "2018-07-12", 6, "2018-07-07", 6, "2018-07-08" ) %>% mutate(timestamp = as.Date(timestamp)) registered <- tribble( ~user_id, ~timestamp, 1, "2018-07-02", 3, "2018-07-02", 4, "2018-06-10", 4, "2018-07-02", 5, "2018-07-11", 6, "2018-07-10", 6, "2018-07-11", 7, "2018-07-07" ) %>% mutate(timestamp = as.Date(timestamp)) after_inner_join(landed, registered, by_user = "user_id", by_time = "timestamp", type = "first-first")
#> # A tibble: 4 x 3 #> user_id timestamp.x timestamp.y #> <dbl> <date> <date> #> 1 1 2018-07-01 2018-07-02 #> 2 3 2018-07-02 2018-07-02 #> 3 6 2018-07-07 2018-07-10 #> 4 5 2018-07-10 2018-07-11
# You can use different methods of joining: after_left_join(landed, registered, by_user = "user_id", by_time = "timestamp", type = "first-first")
#> # A tibble: 6 x 3 #> user_id timestamp.x timestamp.y #> <dbl> <date> <date> #> 1 1 2018-07-01 2018-07-02 #> 2 2 2018-07-01 NA #> 3 4 2018-07-01 NA #> 4 3 2018-07-02 2018-07-02 #> 5 6 2018-07-07 2018-07-10 #> 6 5 2018-07-10 2018-07-11
after_anti_join(landed, registered, by_user = "user_id", by_time = "timestamp", type = "any-any")
#> # A tibble: 4 x 2 #> user_id timestamp #> <dbl> <date> #> 1 2 2018-07-01 #> 2 2 2018-07-01 #> 3 4 2018-07-04 #> 4 5 2018-07-12