posit::glimpse()

dplyr 1.1.0 is coming soon!! 🎉🎉

We are so excited to introduce you to the new features we've been working on, including:
- Temporary inline grouping with `.by`
- Non-equi joins
- Faster `arrange()`

And SO much more!

Check out the blog post from @davis tidyverse.org/blog/2022/11/dpl

www.tidyverse.org: dplyr 1.1.0 is coming soon. This post introduces some of the exciting new features coming in 1.1.0, and includes a call-for-feedback as we finalize the release.

@posit_glimpse I'm so excited to have joins with expressions!
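For readers who have not seen the new join syntax, here is a minimal sketch of a non-equi join using `join_by()`. The `sales` and `prices` tables are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical data: sales events and the price periods they fall into.
sales <- tibble(
  item = c("a", "a", "b"),
  day  = c(2, 7, 4)
)
prices <- tibble(
  item  = c("a", "a", "b"),
  start = c(1, 5, 1),
  end   = c(4, 9, 9),
  price = c(10, 12, 20)
)

# Match rows with equal `item` where `day` falls inside [start, end].
# In each join_by() expression, the left side refers to `sales`
# and the right side refers to `prices`.
priced <- inner_join(
  sales, prices,
  by = join_by(item, day >= start, day <= end)
)
priced
```

Each sale lands in exactly one price period here, so `priced` has one row per sale.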

@posit_glimpse Temporary grouping in dplyr just for one function call is something I'll definitely use! So many bugs have been caused by grouping lingering for the next call, e.g. when converting some columns into factors while the dataframe is still grouped tidyverse.org/blog/2022/11/dpl #rstats
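A minimal sketch of the kind of lingering-group bug described above, using a made-up data frame (`df`, `g1`, `g2`, and `value` are assumptions for illustration):

```r
library(dplyr)

df <- tibble(
  g1    = c("a", "a", "b", "b"),
  g2    = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)

# group_by() + summarise() drops only the last grouping level,
# so the result is still grouped by g1 and sum() runs per g1 group.
lingering <- df %>%
  group_by(g1, g2) %>%
  summarise(value = sum(value), .groups = "drop_last") %>%
  mutate(share = value / sum(value))

# With .by the summary is ungrouped, so sum() runs over all rows.
inline <- df %>%
  summarise(value = sum(value), .by = c(g1, g2)) %>%
  mutate(share = value / sum(value))

is_grouped_df(lingering)  # TRUE
is_grouped_df(inline)     # FALSE
```

The two pipelines look almost identical, but `share` is a within-`g1` share in the first and an overall share in the second.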

@posit_glimpse @davis the difference in ordering between the alternative grouping syntaxes is regrettable. Feels like a pitfall, not a pit of success.

@milesmcbain @posit_glimpse having some way to maintain order by appearance has come up quite a few times, it’s fairly hard to hack your way into it right now. I imagined that it really won’t matter too much, because:
1) if you already sort your data while exploring, it’ll be sorted after the summary too
2) does the ordering of summary results matter all that much until the very end when you need some kind of human consumable table?

@milesmcbain @posit_glimpse

3) with the addition of `.locale` in arrange(), any sorting we would do would be in the C locale, which isn’t likely to be what you want for table output, so then we end up sorting twice for no reason. It’s easier to just leave results in the order we found them and let the user sort once with their preferred arrange() options if the order actually matters (including possible desc() usage too)
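To make the locale point concrete, here is a small sketch assuming dplyr 1.1.0 or later with default options. The `df` data is made up, and the `"en"` locale call is guarded because it needs the stringi package.

```r
library(dplyr)

df <- tibble(name = c("banana", "Cherry", "apple"))

# Default ordering uses the C locale, where uppercase letters
# sort before lowercase ones.
c_order <- arrange(df, name)
c_order$name

# A human-friendly ordering can be requested with .locale,
# which relies on stringi being installed.
if (requireNamespace("stringi", quietly = TRUE)) {
  en_order <- arrange(df, name, .locale = "en")
  print(en_order$name)
}
```

Sorting once, at the end, with the locale you actually want avoids paying for a C-locale sort that you would then override.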

@posit_glimpse @milesmcbain @davis I can see cases where I’d want to keep the original ordering for mutate, but for summarizing, when the groups are collapsed, it’s not intuitive to me what it even means if the data isn’t sorted. It uses the order of the first appearance of a given combination of the grouping variable? (Like adding a row number, taking the min of it in the summary, and then sorting by that?)
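The "row number, min, then sort" reading can be checked directly. This sketch reuses the `month`/`temp` data from the `.by` help page; the `row` and `first_row` helper columns are hypothetical names.

```r
library(dplyr)

df <- tibble(
  month = c("jan", "jan", "feb", "feb", "mar"),
  temp  = c(20, 25, 18, 20, 40)
)

# Order of first appearance, as produced by .by.
via_by <- df %>%
  summarise(average_temp = mean(temp), .by = month)

# Manual emulation: record the first row of each group, sort by it.
manual <- df %>%
  mutate(row = row_number()) %>%
  group_by(month) %>%
  summarise(average_temp = mean(temp), first_row = min(row)) %>%
  arrange(first_row) %>%
  select(-first_row)

identical(via_by$month, manual$month)
```

On this data both approaches return the groups as jan, feb, mar, which is consistent with that interpretation.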

@davis @posit_glimpse a fair few times I’ve grouped by date to summarise at a daily granularity and then taken advantage of the data being in ascending order by date to do some other operation, e.g. lag(). Not sure if I always manually arranged or not. But if I haven’t, that code becomes a trap for anyone who wants to swap it to the alternative syntax.
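One way to make that pattern robust under either syntax is an explicit arrange() before the order-dependent step. A sketch with a hypothetical, deliberately unsorted `events` table:

```r
library(dplyr)

# Hypothetical event log, not in date order.
events <- tibble(
  date  = as.Date(c("2022-11-03", "2022-11-01", "2022-11-02", "2022-11-01")),
  count = c(5, 1, 3, 2)
)

daily <- events %>%
  summarise(total = sum(count), .by = date) %>%
  arrange(date) %>%                      # explicit sort; .by keeps first-appearance order
  mutate(change = total - lag(total))    # lag() now sees ascending dates
daily
```

Without the arrange() call, the `.by` version would compute `change` in the order 11-03, 11-01, 11-02.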

@milesmcbain @davis I get the sense that quite a few people are surprised that summarise() automatically arranges, but it's possible that more people would be surprised if it didn't

@hadleywickham @davis I can’t say I ever really questioned it as a rookie. Now I can appreciate it could be frustrating for some people. Should summarise() get a .order param to provide the escape hatch, and keep both grouping alternatives the same?

@milesmcbain @davis that would be one way to go, if it seems like .by has legs. But that would feel extra weird for the default to be different. We might reconsider

@posit_glimpse @davis I think that .by should return exactly the same results as you would have gotten from group_by() + ungroup(). The difference in ordering is really confusing. Otherwise, it seems like fine syntax to avoid sneaky grouped tibbles messing with your analysis.

@eliocamp @davis it's already not identical because you can do (e.g.) group_by(z = x + y), but not .by = c(z = x + y), so we thought this could generally be an opportunity to rethink the behaviour.
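A sketch of the difference, with a made-up `df`: because `.by` selects existing columns rather than computing expressions, the usual workaround is to create the key with mutate() first.

```r
library(dplyr)

df <- tibble(
  x = c(1, 2, 1, 2),
  y = c(1, 0, 1, 2),
  v = c(10, 20, 30, 40)
)

# group_by() is data-masking, so the key can be computed on the fly.
a <- df %>%
  group_by(z = x + y) %>%
  summarise(v = sum(v))

# .by is tidy-select: create the column first, then select it.
b <- df %>%
  mutate(z = x + y) %>%
  summarise(v = sum(v), .by = z)
```

Here both results contain the same groups and sums; in general only the row order may differ, since `.by` keeps first-appearance order.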

@hadleywickham @davis Are you open to allowing something like that in the future? Like data.table's by = list(z = x + y).

Will group_by eventually move to not sorting?

@posit_glimpse @davis

If I understand right, `group_by()` uses <data-masking>, while `.by` uses <tidy-select>.

Is it worth making that distinction more prominent? (I guess computing in grouping is a bit of an edge case)

@ijlyttle @posit_glimpse we put that in the table of differences with group_by on the .by specific help page! dplyr.tidyverse.org/dev/refere

dplyr.tidyverse.org: Temporary grouping with .by (dplyr_by)

There are two ways to group in dplyr:

- Persistent grouping with group_by()
- Temporary grouping with .by

This help page is dedicated to explaining where and why you might want to use the latter.

Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of .by is to allow you to place that grouping specification alongside the code that actually uses it. As an added benefit, with .by you no longer need to remember to ungroup() after summarise(), and summarise() won't ever message you about how it's handling the groups!

This great idea comes from data.table, which allows you to specify by alongside modifications in j, like: dt[, .(x = mean(x)), by = g].

Supported verbs

- mutate()
- summarise()
- reframe()
- filter()
- slice() and its variants, such as slice_head()

Differences between .by and group_by()

| .by | group_by() |
|---|---|
| Grouping only affects a single verb | Grouping is persistent across multiple verbs |
| Selects variables with tidy-select | Computes expressions with data-masking |
| Summaries use existing order of group keys | Summaries sort group keys in ascending order |

Using .by

Let's take a look at the two grouping approaches using this expenses data set, which tracks costs accumulated across various ids and regions:

    expenses <- tibble(
      id = c(1, 2, 1, 3, 1, 2, 3),
      region = c("A", "A", "A", "B", "B", "A", "A"),
      cost = c(25, 20, 19, 12, 9, 6, 6)
    )
    expenses
    #> # A tibble: 7 x 3
    #>      id region  cost
    #>   <dbl> <chr>  <dbl>
    #> 1     1 A         25
    #> 2     2 A         20
    #> 3     1 A         19
    #> 4     3 B         12
    #> 5     1 B          9
    #> 6     2 A          6
    #> 7     3 A          6

Imagine that you wanted to compute the average cost per region.
You'd probably write something like this:

    expenses %>%
      group_by(region) %>%
      summarise(cost = mean(cost))
    #> # A tibble: 2 x 2
    #>   region  cost
    #>   <chr>  <dbl>
    #> 1 A       15.2
    #> 2 B       10.5

Instead, you can now specify the grouping inline within the verb:

    expenses %>%
      summarise(cost = mean(cost), .by = region)
    #> # A tibble: 2 x 2
    #>   region  cost
    #>   <chr>  <dbl>
    #> 1 A       15.2
    #> 2 B       10.5

Grouping with .by is temporary, meaning that since expenses was an ungrouped data frame, the result after applying .by will also always be an ungrouped data frame, regardless of the number of grouping columns.

    expenses %>%
      summarise(cost = mean(cost), .by = c(id, region))
    #> # A tibble: 5 x 3
    #>      id region  cost
    #>   <dbl> <chr>  <dbl>
    #> 1     1 A         22
    #> 2     2 A         13
    #> 3     3 B         12
    #> 4     1 B          9
    #> 5     3 A          6

Compare that with group_by() %>% summarise(), where summarise() generally peels off 1 layer of grouping by default, typically with a message that it is doing so:

    expenses %>%
      group_by(id, region) %>%
      summarise(cost = mean(cost))
    #> `summarise()` has grouped output by 'id'. You can override using the `.groups`
    #> argument.
    #> # A tibble: 5 x 3
    #> # Groups:   id [3]
    #>      id region  cost
    #>   <dbl> <chr>  <dbl>
    #> 1     1 A         22
    #> 2     1 B          9
    #> 3     2 A         13
    #> 4     3 A          6
    #> 5     3 B         12

Because .by grouping is temporary, you don't need to worry about ungrouping, and it never needs to emit a message to remind you what it is doing with the groups.

Note that with .by we specified multiple columns to group by using the tidy-select syntax c(id, region). If you have a character vector of column names you'd like to group by, you can do so with .by = all_of(my_cols). It will group by the columns in the order they were provided.

To prevent surprising results, you can't use .by on an existing grouped data frame:

    expenses %>%
      group_by(id) %>%
      summarise(cost = mean(cost), .by = c(id, region))
    #> Error in `summarise()`:
    #> ! Can't supply `.by` when `.data` is a grouped data frame.
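The help page mentions `.by = all_of(my_cols)` without showing it. A minimal sketch on the same `expenses` data, where `my_cols` is just an illustrative character vector:

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Column names held in a character vector (illustrative).
my_cols <- c("region", "id")

# all_of() turns the character vector into a tidy-select specification;
# the group columns appear in the order given in my_cols.
by_names <- expenses %>%
  summarise(cost = mean(cost), .by = all_of(my_cols))
by_names
```

Because `region` comes first in `my_cols`, it is also the first column of the result.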
So far we've focused on the usage of .by with summarise(), but .by works with a number of other dplyr verbs. For example, you could append the mean cost per region onto the original data frame as a new column rather than computing a summary:

    expenses %>%
      mutate(cost_by_region = mean(cost), .by = region)
    #> # A tibble: 7 x 4
    #>      id region  cost cost_by_region
    #>   <dbl> <chr>  <dbl>          <dbl>
    #> 1     1 A         25           15.2
    #> 2     2 A         20           15.2
    #> 3     1 A         19           15.2
    #> 4     3 B         12           10.5
    #> 5     1 B          9           10.5
    #> 6     2 A          6           15.2
    #> 7     3 A          6           15.2

Or you could slice out the maximum cost per combination of id and region:

    expenses %>%
      slice_max(cost, n = 1, by = c(id, region))
    #> # A tibble: 5 x 3
    #>      id region  cost
    #>   <dbl> <chr>  <dbl>
    #> 1     1 A         25
    #> 2     2 A         20
    #> 3     3 B         12
    #> 4     1 B          9
    #> 5     3 A          6

Result ordering

When used with .by, summarise(), reframe(), and slice() all maintain the ordering of the existing data. This is different from group_by(), which has always sorted the group keys in ascending order.

    df <- tibble(
      month = c("jan", "jan", "feb", "feb", "mar"),
      temp = c(20, 25, 18, 20, 40)
    )

    # Uses ordering by "first appearance" in the original data
    df %>%
      summarise(average_temp = mean(temp), .by = month)
    #> # A tibble: 3 x 2
    #>   month average_temp
    #>   <chr>        <dbl>
    #> 1 jan           22.5
    #> 2 feb           19
    #> 3 mar           40

    # Sorts in ascending order
    df %>%
      group_by(month) %>%
      summarise(average_temp = mean(temp))
    #> # A tibble: 3 x 2
    #>   month average_temp
    #>   <chr>        <dbl>
    #> 1 feb           19
    #> 2 jan           22.5
    #> 3 mar           40

If you need sorted group keys, we recommend that you explicitly use arrange() either before or after the call to summarise(), reframe(), or slice(). This also gives you full access to all of arrange()'s features, such as desc() and the .locale argument.

Verbs without .by support

If a dplyr verb doesn't support .by, then that typically means that the verb isn't inherently affected by grouping. For example, pull() and rename() don't support .by, because specifying columns to group by would not affect their implementations.
That said, there are a few exceptions to this where sometimes a dplyr verb doesn't support .by, but does have special support for grouped data frames created by group_by(). This is typically because the verbs are required to retain the grouping columns, for example:

- select() always retains grouping columns, with a message if any aren't specified in the select() call.
- distinct() and count() place unspecified grouping columns at the front of the data frame before computing their results.
- arrange() has a .by_group argument to optionally order by grouping columns first.

If group_by() didn't exist, then these verbs would not have special support for grouped data frames.
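The grouped-data-frame special cases listed above can be seen on the same `expenses` data. This is a small sketch, with the select() message silenced so the example runs quietly:

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)
grouped <- group_by(expenses, region)

# select() keeps the grouping column even though only `cost` was requested
# (dplyr emits a message about adding it, silenced here).
selected <- suppressMessages(select(grouped, cost))
names(selected)

# count() places the grouping column in front of the counted column.
counted <- count(grouped, id)
names(counted)

# arrange() ignores the groups unless .by_group = TRUE is supplied.
arranged <- arrange(grouped, cost, .by_group = TRUE)
arranged$cost
```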

@davis @posit_glimpse

Looks great, sorry I missed that!

[homer-backs-into-shrubbery.gif]