R get name of list element in lapply

Lists

Lists are a key data type in R but can be avoided by beginners due to being a bit harder to work with. The do however hold the key to many of Rs more advanced features.

Nội dung chính Show

The apply family functions
Working with the result of lists
Example - create custom ANOVA
Where to go from here - purrr()
Video liên quan

The apply family functions

We move now to apply functions. We will look at the three most popular: lapply(), vapply() and apply()

lapply()

List apply will take each element of your list, call a function upon it, then return a list of the results. lapply() can be really useful to write compact, understandable code.

It is often used to replace what would be for loops in other languages, and offers benefits like better memory management and vectorisation. Its one reason some cite use of for() loops in R code as bad R-style, although there is still some scope for them.

Guide to using lapply()

Lets build up an lapply function, that will work on a list of data.frames.

Choose what you want the lapply() function to iterate across. In this example we have a list of data.frames, a_list_of_df
Choose our function. This will operate on each element,
Test the function on one element i.e.the contents of a_list_of_df[[x]] NOT a_list_of_df[x]. Once happy we know it should work on the entire list.
Any constant arguments get added to the end of the lapply - e.g.is you are using sum() you may want all the sum functions to ignore NA e.g. sum(x, na.rm = TRUE)
Construct the full function result <- lapply(a_list, myfunction, constant_arguments = TRUE)

Example

Here we would like to just check the number of rows for each data.frame, so the function chosen is nrow().

Testing the function chosen against our example, we see it works as expected:

nrow(a_list_of_df[[1]]) # 345

We can now apply this to all the data.frames in the list:

lapply(a_list_of_df, nrow)

Note we dont add arguments to the function call (nrow, not nrow()).

If we want to add arguments to the function, we can supply those as named arguments to lapply()

Here we take advantage that a data.frame is a list of equal length vectors:

## this applies mean to every column lapply(web_data, mean, trim = 0.5)

Using your own function

What if you want to supply your own function? You can predefine it before the lapply() and supply it:

my_func <- function(x){ sum(x) } lapply(my_vector, my_func)

or you can add it straight into the lapply function itself - this is normally done when the function is small:

A function defined this way is called an anonymous function, as it has no name.

lapply(my_vector, function(x){ sum(x) })

vapply()

vapply() is an extension on top of lapply(). It works in the same way, but you also supply a FUN.VALUE which is a template of what you expect the result to be. If the result is not of that class, then it will raise an error.

This is good to stop any nasty surprises messing up your code later, and is recommended over lapply() when creating functions.

It also has the USE.NAMES argument which is useful to have the list output have the same names as the input list.

apply()

apply() is similar but lets you work with data.frame rows as inputs, instead of list elements. Here each element is a data.frame row, which you can then operate on from your supplied function:

apply(mtcars, MARGIN = 2, sum)## mpg cyl disp hp drat wt qsec vs ## 642.900 198.000 7383.100 4694.000 115.090 102.952 571.160 14.000 ## am gear carb ## 13.000 118.000 90.000

Exercise

Write a function that iterates over this list of data.frames supplied below, to output the sum of each column. The output should be a list of numeric vectors, each vector element being the maxium number.

## don't touch this, it just creates your data a_list_of_df <- lapply(1:10, function(x){ data.frame(matrix(runif(1000), ncol = 10)) }) ## perhaps you want to pass more arguments in, replace as you see fit your_function <- function(a_data_frame){ } your_result <- lapply(a_list_of_df, your_function)

Hint - use colSums

How would you change this to use vapply ?

Working with the result of lists

There are some handy functions to work with lists: I use mostly Reduce(). This takes a list in its second argument and applies the function given in its first argument in turn, adding the result of the previous call to the next.

As an example, you can use rbind() which binds a data.frame to another and apply it to a list of data.frames. Reduce will rbind() the first two data.frames, then take that result and rbind() it to the third, then take that result and rbind() it to the forth, etc. until you have added up all the data.frames into one:

Reduce(rbind, list_of_df)

Example - create custom ANOVA

Lets make our own function that performs ANOVA on some data using lapply() over a data.frames columns. There are R functions that will do all these steps for you, but its educational to build it up yourself.

First step is to create the data:

library(tidyverse) anova_data <- web_data %>% filter(deviceCategory == "desktop") %>% select(date, sessions, channelGrouping) %>% spread(channelGrouping, sessions) %>% select(-date)

19	133	307	17	431	555	131	68	NA
156	1003	196	43	1077	1060	226	158	3
35	1470	235	29	696	489	179	66	90
31	1794	321	70	1075	558	235	46	898
27	1899	309	74	1004	478	218	47	461
21	1972	204	299	974	494	246	47	418

Do the different groups differ significantly from each other? You can probably see for yourself in this dataset, but for cases such as AB tests this is less obvious.

Recap - 1-way ANOVA

The steps for 1-way ANOVA were outlined earlier, and repeated here:

Calculate the mean for each column
Calculate the mean for all columns combined
Calculate sum of squares error
Calculate sum of squares between groups
Calculate mean square errors
Determine the actual F-value
Look up the critical F-value
Compare actual and critical F-values

We create a function that will create this from fundamentals, using lapply() to help us perform on each column (each element of our list)

Calculate the mean per colum

col_mean <- lapply(anova_data, mean, na.rm = TRUE) ## compare with.. colMeans(anova_data)## (Other) Direct Display Email Organic Search ## 50.79812 1397.08920 416.37559 104.92958 733.07042 ## Paid Search Referral Social Video ## 725.51643 353.30047 105.32394 NA

Calculate the sum of squares for each column

my_ss <- function(df_col){ ## df_col is a column of the data.frame we pass into the lapply mean_col <- mean(df_col, na.rm = TRUE) ## squares error se <- (df_col - mean_col)^2 ## sum of squares error sum(se, na.rm = TRUE) } ss_a <- lapply(anova_data, my_ss) SS_e <- Reduce(sum, ss_a) SS_e## [1] 418843067

Calculate the total mean

totalMean <- sum(colSums(anova_data), na.rm = TRUE) / (ncol(anova_data) * nrow(anova_data)) totalMean## [1] 431.8226

Calculate the sum of squares between groups

my_ssg <- function(df_col){ mean_col <- mean(df_col, na.rm = TRUE) ((totalMean - mean_col)^2)*length(df_col) } ssg <- lapply(anova_data, my_ssg) SS_b <- Reduce(sum, ssg) SS_b## [1] 313925827

Calculate mean square errors

df1 <- (nrow(anova_data)*ncol(anova_data)) - ncol(anova_data) MSe <- SS_e / df1 MSe## [1] 219519.4

Calculate mean square between groups

df2 <- ncol(anova_data) - 1 MSb <- SS_b / df2 MSb## [1] 39240728

Determine the F-value

Fvalue <- MSb / MSe Fvalue## [1] 178.7574

Look up critical F-value

Fcrit <- pf(.95, df1 = df1, df2 = df2) Fcrit## [1] 0.3939492

Compare the Actual vs critical F-Value

## If TRUE, we can reject null hypothesis that the means of the groups are equal ## The number of sessions does differ by channel Fvalue > Fcrit## [1] TRUE

How would we really do this?

anova_data2 <- web_data %>% filter(deviceCategory == "desktop") %>% select(sessions, channelGrouping) anova_data2$channelGrouping <- as.factor(anova_data2$channelGrouping) anova2 <- aov(sessions ~ channelGrouping , data = anova_data2, na.action = na.exclude) summary(anova2)## Df Sum Sq Mean Sq F value Pr(>F) ## channelGrouping 8 309380887 38672611 176.1 <2e-16 *** ## Residuals 1907 418843067 219635 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The numbers are a little different due to ways of treating NAs in the data.

Where to go from here - purrr()

For more advanced useage, purrr is a tidyverse package which is very useful for both working with lists and manipulating data.

Consult the cheatsheet of list-columns for some ideas, and this Jennifer Bryan presentation

Top List List