After reading through this, you should have a working understanding of what functions are in R, how to use them, and how to write your own. Parts of these notes were inspired by what Thiago Silva taught in 2021.
One of the most useful attributes of R is the ability to make and use
packages. Packages allow custom code, data, and documentation to be
published and used by anyone. Up until now, we have been looking at
functions (e.g., hist
or plot
) that are
available in base R (i.e., they are part of the R language, and anyone
who installs R can use these functions). But by downloading and
installing a package, it is possible to use functions written by anyone
in the R community. As an example, we can download the swirl package (https://swirlstats.com/).
The swirl package is a useful package for learning R step-by-step. You
can get started by running the code below.
install.packages("swirl");
library("swirl");
swirl();
##
## | Hi! Type swirl() when you are ready to begin.
The install.packages
command is all that you need to
install a package directly from the Comprehensive R Archive Network
(CRAN; https://cran.r-project.org/). After R installs the
package, the library
function is used to deploy the package
functions into the console. There are 18000+ packages in R, all free to
download and use. If you use one of these packages, you can cite it with
the citation()
function in base R.
citation("swirl");
## To cite package 'swirl' in publications use:
##
## Kross S, Carchedi N, Bauer B, Grdina G (2020). _swirl: Learn R, in
## R_. R package version 2.4.5,
## <https://CRAN.R-project.org/package=swirl>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {swirl: Learn R, in R},
## author = {Sean Kross and Nick Carchedi and Bill Bauer and Gina Grdina},
## year = {2020},
## note = {R package version 2.4.5},
## url = {https://CRAN.R-project.org/package=swirl},
## }
As one more example, a popular package is ggplot2
, which
some people prefer to use in plotting instead of the available base R
functions. We can return to the Bumpus
dataset from the sessions on Getting
Started and Coding
Types.
dat <- read.csv("Bumpus_data.csv"); # Read in the data
In the session on Getting
Started, we mad a scatterplot to show the relationship between
sparrow body mass (wgt
) and total length
(totlen
) using the plot
function. If we wanted
to plot sparrow weight versus total length using a slightly different
plotting style from plot
, then we could download
ggplot2
and run the ggplot
function.
install.packages("ggplot2");
library("ggplot2");
ggplot(data = dat, mapping = aes(x = wgt, y = totlen)) +
geom_point();
There are ways to make the plot above look much nicer, but I am not
an expert in ggplot
. For your own data, I recommend
exploring and trying new tools and new code. Most of what you try will
not work out they way you want it to, but you will almost always learn
something new and get practice.
Functions can be written in R, then used in ways that make data
wrangling, analysis, and all kinds of other tasks much more easy and
efficient. Being able to write your own functions is a highly useful
skill. For example, if you collect data in a particular way more than
once, and need to reorganise those data to a different format than you
collected for statistical analysis, then it might be useful to write a
function that automates this so that you can simply run the function on
your collected dataset. To write a function, you use the
function
function, which specifies the function name, the
arguments that your custom function will take (along with any defaults),
what the function does, and what the function returns. Below is a very
simple example that converts a Fahrenheit value to Celsius.
F_to_C <- function(F_temp = 70){
C_temp <- (F_temp - 32) * 5/9;
return(C_temp);
}
The function is assigned to the name F_to_C
. The
function takes one argument F_temp
, which has a default
value of 70. It then calculates (F_temp - 32) * 5/9
and
assigns it to the variable C_temp
. It then returns
C_temp
. To input this function into R, we just need to
highlight the whole function and run it. After it is run, then we can
then use F_to_C
just as we would any other function. This
is demonstrate below with a Fahrenheit temperature of 100.
F_to_C(F_temp = 100);
## [1] 37.77778
Note that we do not need to actually write out F_temp
as
an argument. R recognises the order of arguments, or in this case, that
there is only one argument in the F_To_C
function, so it
must be F_temp
. We can therefore do the same thing as above
without specifying the argument.
F_to_C(100);
## [1] 37.77778
Also note that because we gave the argument F_temp
a
default value of 70, if we do not specify any argument, then it runs the
function with this default F_temp = 70
.
F_to_C();
## [1] 21.11111
If we realy want to, we can also set the function without
return
. In this case, R simply returns the last variable
assigned.
F_to_C <- function(F_temp = 70){
C_temp <- (F_temp - 32) * 5/9;
}
The above leaves out the return argument, but it still returns
C_temp
because this is the last variable assigned.
converted_temp <- F_to_C(100);
print(converted_temp);
## [1] 37.77778
It is general best to use return
to specify what a
function should return.
We can also create custom functions that use custom functions. For
example, assume that we wanted to convert from temperature Fahrenheit to
temperature Kelvin. Temperature Kelvin is simply Celsius plus 273.15,
\(K = C + 273.15\). We could convert
from Fahrenheit to Kelvin by calculating
((F_temp - 32) * 5/9) + 273.15
, or we could make use of the
F_to_C
function within a new function
F_to_K
.
F_to_K <- function(F_temp){
K_temp <- F_to_C(F_temp = F_temp) + 273.15;
return(K_temp);
}
What is going on above? We are defining the new function
F_to_K
, which takes an argument F_temp
and
converts it to Kelvin. It does this by passing this F_temp
into the previously written function F_to_C
. It calculates
the output of F_to_C(F_temp)
and then adds 273.15 and
assigns the value to K_temp
. Then it returns the calculated
K_temp
.
F_to_K(F_temp = 100);
## [1] 310.9278
Note that I have not given F_temp
a default value in the
F_to_K
function above, so we cannot just call it without
defining a value. If we do, then we get an error message.
F_to_K();
## Error in F_to_K(): argument "F_temp" is missing, with no default
We can try something else. Maybe we want to write a function that converts degrees Fahrenheit to either Celsius or Kelvin. We can do this with a new function that includes two arguments, one specifying the degrees Fahrenheit, and the second specifying whether the function should convert to Celsius or Fahrenheit.
F_convert <- function(F_temp = 70, conversion = "Celsius"){
if(conversion == "Celsius"){
converted <- F_to_C(F_temp = F_temp);
}
if(conversion == "Kelvin"){
converted <- F_to_K(F_temp = F_temp);
}
return(converted);
}
The below will get the job done, provided we use it correctly.
F_convert(F_temp = 70, conversion = "Kelvin");
## [1] 294.2611
But the function could be written better. If, for example, we spell “Kelvin” incorrectly, or even forget to capitalise it, then we will get an uninformative error.
F_convert(F_temp = 70, conversion = "kelvin");
## Error in F_convert(F_temp = 70, conversion = "kelvin"): object 'converted' not found
To improve the function, we can redefine it to make sure it gives an
informative error message. We can check to make sure that
conversion
is either “Celsius” or “Kelvin”, and that
“F_temp” is a numeric input. If this is not the case, then the function
should stop
with an error message.
F_convert <- function(F_temp = 70, conversion = "Celsius"){
if(conversion != "Celsius" & conversion != "Kelvin"){
stop("conversion argument must be 'Celsius' or 'Kelvin'.")
}
if(is.numeric(F_temp) == FALSE){
stop("F_temp argument must be numeric");
}
if(conversion == "Celsius"){
converted <- F_to_C(F_temp = F_temp);
}else{
converted <- F_to_K(F_temp = F_temp);
}
return(converted);
}
Note that the first if
statement in the function above
checks to see if the argument conversion
does not
equal “Celsius” (conversion != "Celsius"
)
and (&
) does not equal
“Kelvin” (conversion != "Kelvin"
). If this is
TRUE
, then conversion
must be incorrectly
specified because it does not equal one of the two permissible options.
We therefore need to return an error message. Note that since we have
established that conversion
must be either “Celsius” or
“Kelvin” (returning an error message if not), then we can just use the
else
at the end of the
if(conversion == "Celsius")
statement. We can try the
incorrect specification of conversion = "kelvin"
again.
F_convert(F_temp = 70, conversion = "kelvin");
## Error in F_convert(F_temp = 70, conversion = "kelvin"): conversion argument must be 'Celsius' or 'Kelvin'.
Similarly, if we specify a value of F_temp
that is not
numeric, then is.numeric(F_temp)
will be
FALSE
, causing the function to stop
with a
relevant error message.
F_convert(F_temp = "seventy", conversion = "Kelvin");
## Error in F_convert(F_temp = "seventy", conversion = "Kelvin"): F_temp argument must be numeric
We can run the function correctly now.
F_convert(F_temp = 70, conversion = "Kelvin");
## [1] 294.2611
Note that we could also assign the output of the above to a new variable.
new_variable <- F_convert(F_temp = 70, conversion = "Kelvin");
print(new_variable);
## [1] 294.2611
It is often useful to write error messages like this into functions even if you know that you will be the only person who uses them. Next, try to write a function that converts from Celsius to either Fahrenheit or Kelvin. The ‘Details’ pulldown below gives one potential way to do it.
C_convert <- function(C_temp, conversion = "Kelvin"){
if(conversion != "Fahrenheit" & conversion != "Kelvin"){
stop("conversion argument must be 'Fahrenheit' or 'Kelvin'.")
}
if(is.numeric(C_temp) == FALSE){
stop("C_temp argument must be numeric");
}
if(conversion == "Celsius"){
converted <- C_temp + 273.15;
}else{
converted <- (C_temp * 9/5) + 32;
}
return(converted);
}
Try creating some other functions that might be useful for your own purposes, or that do something fun.
There is one trick that we can do with functions called recursion. Recursion occurs when a function calls itself until some kind of stopping condition is met. This is almost never necessary to use once you know how to use a loop (which we will learn in the next session). Since learning to code, there has been exactly one time I can recall in which I basically needed to use recursive programming, and this was programming in C, not R. Nevertheless, it is a bit of a cool trick to know.
Suppose we wanted to create a function that samples from a range of
values interval
iteratively until it finds the value that
makes it stop (stop_number
). Further suppose that we want
to record all of the values tried until sampling the
stop_number
. There are many ways that we could do this in
R, but the program below shows how to do it with recursive
programming.
recursive_function <- function(stop_number, interval = 1:10, rand_tries = NULL){
rand_number <- sample(x = interval, size = 1);
rand_tries <- c(rand_tries, rand_number);
if(rand_number == stop_number){
return(rand_tries);
}else{
new_try <- recursive_function(stop_number = stop_number,
interval = interval,
rand_tries = rand_tries);
}
}
Verbally, the program first randomly samples a number
rand_number
(by default the random number is from 1 to 10).
The number is added to a growing list called rand_tries
,
which starts out as NULL
with the first call to the
function. If the rand_number
equals the
stop_number
, then the function returns the list
rand_tries
. But if it does not equal
stop_number
, then the function runs again, so a new
rand_number
is selected and added to
rand_tries
. This continues until the
rand_number == stop_number
condition is met, at which point
the function returns the list rand_tries
. This is a bit
confusing and counter-intuitive at first, and, as already mentioned,
there almost never any need for it once you know how to use a loop.
Nevertheless, we can run the function to see the output.
new_rec1 <- recursive_function(stop_number = 4, interval = 1:10);
return(new_rec1);
## [1] 9 6 1 5 6 6 9 9 10 2 9 3 3 2 1 2 10 9 1 4
We can try it again.
new_rec2 <- recursive_function(stop_number = 4, interval = 1:10);
return(new_rec2);
## [1] 5 7 7 5 9 6 10 9 10 5 10 8 10 8 6 9 7 5 9 10 1 7 9 2 10
## [26] 6 1 3 3 7 4
One more time.
new_rec3 <- recursive_function(stop_number = 4, interval = 1:10);
return(new_rec3);
## [1] 2 4
Note that we get a different answer each time, with a vector of varying lengths.
There are a few additional things to keep in mind when writing functions (note, these are useful points from Thiago that I would have probably otherwise forgotten to include).
The code run within an argument executes within its own local environment. What this means is that any objects assigned in a function cannot be used unless they are returned by the function. Here is a quick example.
sum_three_vals <- function(starting_val){
aa <- starting_val + 0;
bb <- starting_val + 2;
cc <- starting_val + 4;
dd <- aa + bb + cc;
return(dd);
}
The function sum_three_vals
defined above takes one
argument starting_val
and uses it to assign a value to
aa
, bb
, and cc
in the function.
It then sums aa
, bb
, and cc
and
returns the summed value.
sum_three_vals(starting_val = 1);
## [1] 9
We get the answer of 9 as expected, but note that the ‘global’
environment outside of the function does not retain aa
,
bb
, or cc
. If we try to print one of these
values, R cannot find it.
print(aa);
## Error in eval(expr, envir, enclos): object 'aa' not found
This is not true in the other direction, going from the global to the
local. For example, say that we first define xx
,
yy
, and zz
, then a function summing them
up.
xx <- 1;
yy <- 2;
zz <- 3;
sum_xxyyzz <- function(){
dd <- xx + yy + zz;
return(dd);
}
We can now see that the function does technically work.
sum_xxyyzz()
## [1] 6
Nevertheless, this is really not good practice. The
sum_xxyyzz
function is not portable. You could not actually
use it unless the three xx
, yy
, and
zz
were already calculated, and if these values change,
then the function could return unexpected values. It is important to
always make functions self contained, so we can rewrite the above.
sum_xxyyzz <- function(xx, yy, zz){
dd <- xx + yy + zz;
return(dd);
}
Now we specify all of the arguments that the function takes, and the funciton is not reliant upon any other objects in the global environment. Note that the function no longer works by itself as before.
sum_xxyyzz();
## Error in sum_xxyyzz(): argument "xx" is missing, with no default
This is because the function is looking for xx
,
yy
, and zz
to be assigned as arguments, and
with no defaults, it will (correctly and helpfully) produce an error
message.
Note that if arguments are not specified, then R assumes that any
values input are defined in the order of the function arguments. We can
consider a function that just adds two values, val1
and two
times val2
.
add_two_eg <- function(val1, val2){
out_val <- val1 + (2 * val2);
return(out_val);
}
We can specify what we want the values to be in any order. Consider
val1 = 1
and val2 = 2
.
add_two_eg(val1 = 1, val2 = 2);
## [1] 5
We can write this with the order of arguments flipped and get the same answer.
add_two_eg(val2 = 2, val1 = 1);
## [1] 5
But if we do not specify the arguments explicitly, then R just
assumes that they are in the order of the original function
val1, val2
. We can see this below, where we get a different
answer if we set 2
first.
add_two_eg(2, 1);
## [1] 4
In the above, R just assumes that we meant
add_two_eg(val1 = 2, val2 = 1);
.
Note that all of the examples that we have used return a single value. This was done for simplicity, but it is not necessary. For example, we might want to return a list of values instead.
func_list <- function(name = "A name", value = 3, vector = 1:10){
new_list <- list(name, value * 2, vector);
return(new_list);
}
The above function is a very simple example. It simply returns the
three arguments name
, value
(times 2), and
vector
as a list.
func_list(name = "Grace Hopper", value = 85, vector = 1906:1992);
## [[1]]
## [1] "Grace Hopper"
##
## [[2]]
## [1] 170
##
## [[3]]
## [1] 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920
## [16] 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
## [31] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950
## [46] 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965
## [61] 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## [76] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
Lastly, we can use the function do.call
to read
arguments into a function as a list. Here is what that looks like, using
the func_list
example from above.
my_args <- list("Grace Hopper", 85, vector = 1906:1992)
do.call(what = func_list, args = my_args);
## [[1]]
## [1] "Grace Hopper"
##
## [[2]]
## [1] 170
##
## [[3]]
## [1] 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920
## [16] 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
## [31] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950
## [46] 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965
## [61] 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## [76] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
The do.call
function seems a bit unnecessary, bit it is
actually quite useful in some context. I have
used it when writing R packages that include functions with many
arguments, and a need for flexibility given user input. But it is rarely
needed, in my experience, for data analysis.