dplyr 1.1.0 is coming soon!!
We are so excited to introduce you to the new features we've been working on, including:
- Temporary inline grouping with `.by`
- Non-equi joins
- Faster `arrange()`
And SO much more! #rstats
Check out the blog post from @davis https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/
@posit_glimpse @davis Oh, boy. I might never again teach group_by().
@posit_glimpse I'm so excited to have joins with expressions!
@posit_glimpse Temporary grouping in dplyr just for one function call is something I'll definitely use! So many bugs have been caused by grouping lingering for the next call, e.g. when converting some columns into factors while the dataframe is still grouped https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/ #rstats
@posit_glimpse @davis the difference in ordering between the alternative grouping syntax is regrettable. Feels like a pitfall not a pit of success.
@milesmcbain @posit_glimpse having some way to maintain order by appearance has come up quite a few times, it’s fairly hard to hack your way into it right now. I imagined that it really won’t matter too much, because:
1) if you already sort your data while exploring, it’ll be sorted after the summary too
2) does the ordering of summary results matter all that much until the very end when you need some kind of human consumable table?
3) with the addition of .locale in arrange(), any sorting we would do would be in the C locale, which isn’t likely to be what you want for table output, so then we end up sorting twice for no reason. It’s easier to just leave results in the order we found them and let the user sort once with their preferred arrange() options if the order actually matters (including possible desc() usage too)
@posit_glimpse @milesmcbain @davis I can see cases where I’d want to keep the original ordering for mutate, but for summarizing, when the groups are collapsed, it’s not intuitive to me what it even means if the data isn’t sorted. It uses the order of the first appearance of a given combination of the grouping variable? (Like adding a row number, taking the min of it in the summary, and then sorting by that?)
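For readers following along, here is a minimal sketch of the ordering difference being discussed, assuming dplyr 1.1.0 or later and R 4.1 or later (for the native pipe). The data frame `df` and its columns `g` and `x` are made up for illustration. It compares `group_by()` + `summarise()`, `summarise(.by = )`, and the "row number, min, sort" recipe described in the post above.

```r
library(dplyr)

# Hypothetical toy data: the groups first appear in the order b, a, c
df <- tibble(
  g = c("b", "a", "b", "c", "a"),
  x = 1:5
)

# group_by() + summarise(): group keys come back sorted
sorted <- df |>
  group_by(g) |>
  summarise(total = sum(x))

# .by: group keys come back in order of first appearance
by_first <- df |>
  summarise(total = sum(x), .by = g)

# The manual recipe from the post: track the first row of each group,
# then sort the summary by it
manual <- df |>
  mutate(row = row_number()) |>
  group_by(g) |>
  summarise(total = sum(x), first_row = min(row)) |>
  arrange(first_row) |>
  select(-first_row)

sorted$g    # "a" "b" "c"
by_first$g  # "b" "a" "c"
manual$g    # "b" "a" "c"
```

On this toy data the manual recipe and `.by` give the same row order, which matches the "order of first appearance" reading of the post.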
@davis @posit_glimpse a fair few times I’ve grouped by date to summarise at a daily granularity and then taken advantage of the data being in ascending order by date to do some other operation, e.g. lag(). Not sure if I always manually arranged or not. But if I haven’t, that code becomes a trap for anyone who wants to swap it to the alternative syntax.
@davis @posit_glimpse so the answer to 2. is sometimes!
@milesmcbain @davis I get the sense that quite a few people are surprised that summarise() automatically arranges, but it's possible that more people would be surprised if it didn't
@hadleywickham @davis I can’t say I ever really questioned it as a rookie. Now I can appreciate it could be frustrating for some people. Should summarise() get a .order param to provide the escape hatch, and keep both grouping alternatives the same?
@milesmcbain @davis that would be one way to go, if it seems like .by has legs. But that would feel extra weird for the default to be different. We might reconsider
@hadleywickham @milesmcbain @davis I didn't know that!
@posit_glimpse @davis I think that .by should return the exact same results as you would've gotten from group_by() + ungroup(). The difference in ordering is really confusing. Otherwise, it seems like a fine syntax to avoid sneaky grouped tibbles messing with your analysis.
@hadleywickham @davis Are you open to allowing something like that in the future? Like data.table's by = list(z = x + y).
Will group_by eventually move to not sorting?
If I understand right, `group_by()` uses <data-masking>, while `.by` uses <tidy-select>.
Is it worth making that distinction more prominent? (I guess computing in grouping is a bit of an edge case)
@ijlyttle @posit_glimpse we put that in the table of differences with group_by on the .by specific help page! https://dplyr.tidyverse.org/dev/reference/dplyr_by.html#differences-between-by-and-group-by-
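A small sketch of what the data-masking versus tidy-select distinction means in practice, assuming dplyr 1.1.0 or later. The data frame and the column names `x`, `y`, `v`, and `z` are hypothetical. With `group_by()` the grouping column can be computed inline; with `.by` one way to get the same groups is to create the column with `mutate()` first and then select it.

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(
  x = c(1, 2, 1),
  y = c(1, 2, 1),
  v = c(10, 20, 30)
)

# group_by() is data-masking: an expression can be computed while grouping
with_group_by <- df |>
  group_by(z = x + y) |>
  summarise(v = sum(v))

# .by is tidy-select: it selects existing columns,
# so the computed column is added first
with_by <- df |>
  mutate(z = x + y) |>
  summarise(v = sum(v), .by = z)

# Tidy-select also allows selection helpers, e.g. grouping by both x and y
by_helpers <- df |>
  summarise(v = sum(v), .by = c(x, y))
```

On this data both approaches produce the same two groups, with `z` equal to 2 and 4.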