Stirling Coding Club - BaseR

1. Practice dataset 1

The file greti.csv in the GitHub repository contains biometric data of 101 individual birds.

Load this file into R as a dataframe.
Examine the data (e.g. using summary(), str(), names()).
Calculate some summary statistics (e.g. a mean or SD) for male and female individuals (column “SEX”), for both body weight (WT) and wing length (WING).
Add the mean wing length and mean body weight of each sex as new columns to the data frame.

2. Practice datasets 2 & 3

The files cc_age.csv and cc_wing.csv contain different biometric data of the same individuals, but in two separate tables.

Load these files into separate data frames.
Create a new data frame that contains all the biometric data in a single table.
(Hint: be careful. Do all individuals occur in each table, and why is that important?)
Calculate the mean wing length for each age category.

3. Practice dataset 4

The file birdlist.Rdata in the GitHub repository contains more bird biometric data.

Load this file.
Examine the data. Note that this is not a simple dataframe. Each element represents a different species.
Calculate the mean of wing lengths for each species.
Calculate the mean wing length for each age class for each species.

Solutions - Practice dataset 1

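# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0.0,
# where read.csv() otherwise reads text columns as character
dat <- read.csv('greti.csv', header = TRUE, stringsAsFactors = TRUE)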
head(dat)

##      RING  SPEC SEX WING   WT
## 1 L555028 GRETI   F   76 18.7
## 2 L555044 GRETI   F   72 18.9
## 3 L555050 GRETI   F   74 18.8
## 4 L555050 GRETI   F   75 19.0
## 5 L555052 GRETI   F   75 20.1
## 6 L555052 GRETI   F   75 20.2

str(dat)

## 'data.frame':    97 obs. of  5 variables:
##  $ RING: Factor w/ 80 levels "L555027","L555028",..: 2 3 5 5 6 6 8 8 9 13 ...
##  $ SPEC: Factor w/ 1 level "GRETI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEX : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WING: int  76 72 74 75 75 75 73 72 76 74 ...
##  $ WT  : num  18.7 18.9 18.8 19 20.1 20.2 18.5 18.2 19.2 21.7 ...

summary(dat)

##       RING       SPEC    SEX         WING             WT       
##  L555027: 3   GRETI:97   F:49   Min.   :70.00   Min.   :16.80  
##  L555052: 3              M:48   1st Qu.:73.00   1st Qu.:18.50  
##  L555054: 3                     Median :75.00   Median :19.00  
##  L555050: 2                     Mean   :75.11   Mean   :19.04  
##  L555056: 2                     3rd Qu.:77.00   3rd Qu.:19.70  
##  L555100: 2                     Max.   :80.00   Max.   :21.70  
##  (Other):82                     NA's   :5
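
The NA's row in the summary() output above shows that WING has missing values. As a minimal sketch, the number of missing values per column can also be counted directly. The data frame toy and its values are made up for illustration and are not taken from greti.csv.

```r
# Toy data frame with made-up values (hypothetical, not from greti.csv)
toy <- data.frame(
  SEX  = c("F", "F", "M", "M", "M"),
  WING = c(74, NA, 77, 76, NA),
  WT   = c(18.7, 18.9, 19.5, 19.8, 20.1)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts
```

Columns with a count above zero need na.rm = TRUE (or similar handling) when calculating summary statistics.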

sex_wt_mean <- tapply(dat$WT, dat$SEX, mean)
sex_wt_mean

##        F        M 
## 18.50612 19.59043
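
The same tapply() pattern should work for the other summary statistic mentioned in the exercise, the standard deviation. This is only a sketch on a made-up data frame (toy and its values are hypothetical), using sd() in place of mean().

```r
# Toy data with made-up weights (hypothetical, not from greti.csv)
toy <- data.frame(
  SEX = c("F", "F", "F", "M", "M", "M"),
  WT  = c(18, 19, 20, 19, 21, 23)
)

# Standard deviation of WT within each level of SEX
sex_wt_sd <- tapply(toy$WT, toy$SEX, sd)
sex_wt_sd
```

Any function that takes a numeric vector and returns a single value can be used in the third argument in the same way.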

There are missing values (NAs) in the wing length column WING, so simply calculating a mean returns NA for both sexes:

sex_wing_mean <- tapply(dat$WING, dat$SEX, mean)
sex_wing_mean

##  F  M 
## NA NA

To avoid this, we can use a custom “mean” function that ignores the NAs:

mean_no_na <- function(x) {
  # na.rm = TRUE drops missing values before the mean is calculated
  return(mean(x, na.rm = TRUE))
}

And now we can use tapply() to calculate the means using this function:

sex_wing_mean <- tapply(dat$WING, dat$SEX, mean_no_na)
sex_wing_mean

##        F        M 
## 73.57447 76.71111

To add these means to the data frame as new columns, one option is to use match() to look up the mean for each sex (wing and weight) for every row of the dat table. To be able to do this, we first want to express sex_wing_mean and sex_wt_mean as data frames.

sex_wing_mean <- as.data.frame(sex_wing_mean)
sex_wt_mean <- as.data.frame(sex_wt_mean)

We can now call match() for each of the two means, creating a new column for each. Two important things to note here. First, we have to make sure we refer to the first (and only) column in each of the “means” tables. Second, the “means” tables have no explicit “sex” column to match against. In this case, the code for each sex is the row name in each table, so we need to refer to it in match() using the row.names() function.

dat$sex_wing_mean <- sex_wing_mean[,1][match(dat$SEX, row.names(sex_wing_mean))]
dat$sex_wt_mean <- sex_wt_mean[,1][match(dat$SEX, row.names(sex_wt_mean))]
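
To make the lookup step easier to follow, here is a small sketch of what match() returns and how those positions are used as an index. The data frame toy, its values, and the names wing_means and idx are made up for illustration. The sketch keeps the means as a named vector rather than converting them to a data frame, which is a simplification of the approach above.

```r
# Toy data with made-up values (hypothetical, not from greti.csv)
toy <- data.frame(
  SEX  = c("F", "M", "F", "M"),
  WING = c(74, 78, NA, 76)
)

# Group means; the names of the result are the sex codes
wing_means <- tapply(toy$WING, toy$SEX, mean, na.rm = TRUE)

# For each row of toy, the position of its sex code among those names
idx <- match(toy$SEX, names(wing_means))
idx

# Indexing the means with those positions gives one value per row
toy$sex_wing_mean <- as.vector(wing_means[idx])
toy
```

Each row ends up with the mean of its own sex, which is the same result the data frame version produces via row.names().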

Solutions - Practice datasets 2 & 3

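# stringsAsFactors = TRUE so that str() reports factor levels on R >= 4.0.0
cc_age <- read.csv("cc_age.csv", header = TRUE, stringsAsFactors = TRUE)
cc_wing <- read.csv("cc_wing.csv", header = TRUE, stringsAsFactors = TRUE)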

We can examine the data sets using e.g. str(). Among other things, this shows how many “levels” there are in the variable ring_no, i.e. how many different individuals each table contains. Note that there are fewer individuals in the cc_wing table. This means that we should match the cc_wing data into the cc_age table; if we do it the other way around we will lose the individuals for which we have age data but no wing lengths.

cc_age$wing <- cc_wing$wing_length[match(cc_age$ring_no, cc_wing$ring_no)]
head(cc_age)

##   ring_no species_name age wing
## 1  KPJ625   Chiffchaff   3   63
## 2  KPJ623   Chiffchaff   3   62
## 3  KPJ621   Chiffchaff   3   63
## 4  KPJ617   Chiffchaff  3J   NA
## 5  KPJ613   Chiffchaff   3   56
## 6  KPJ601   Chiffchaff  3J   62
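
As a possible alternative to match(), base R's merge() can join two tables on a shared column. The sketch below uses two made-up tables (age_tab and wing_tab, with hypothetical ring numbers and values) and first checks which individuals are missing from the wing table.

```r
# Toy tables with made-up ring numbers and values (hypothetical)
age_tab <- data.frame(
  ring_no = c("A1", "A2", "A3", "A4"),
  age     = c("3", "3J", "4", "3"),
  stringsAsFactors = FALSE
)
wing_tab <- data.frame(
  ring_no     = c("A1", "A3", "A4"),
  wing_length = c(63, 58, 61),
  stringsAsFactors = FALSE
)

# Rings present in the age table but absent from the wing table
missing_rings <- setdiff(age_tab$ring_no, wing_tab$ring_no)
missing_rings

# all.x = TRUE keeps every row of age_tab; unmatched rows get NA
combined <- merge(age_tab, wing_tab, by = "ring_no", all.x = TRUE)
combined
```

Without all.x = TRUE, merge() keeps only the individuals found in both tables, which is the loss of data the hint in the exercise warns about.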

We can now use the data in the new column to calculate the mean wing length for each age category. Again, there are missing values in the wing length data, so we need to ignore these when calculating the mean. A plain call to mean returns NA for every age category that contains a missing value:

tapply(cc_age$wing, cc_age$age, mean)

##     3    3J     4 
##    NA    NA 58.75

Instead of using the mean_no_na() function we defined earlier, we can also do this in a single line with an anonymous function (without creating a named function first). This is a bit less easy to read, but it does exactly the same thing:

tapply(cc_age$wing, cc_age$age, function(x) mean(x, na.rm = TRUE))

##        3       3J        4 
## 60.56000 60.22727 58.75000
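
A third option is to pass na.rm = TRUE straight through tapply(), which forwards any further arguments to the function it calls. The sketch below shows this on a made-up data frame (toy and its values are hypothetical).

```r
# Toy data with made-up values (hypothetical, not from cc_age.csv)
toy <- data.frame(
  age  = c("3", "3", "3J", "3J", "4"),
  wing = c(60, NA, 62, 64, 58)
)

# Arguments after the function name are passed on to mean() for each group
age_wing_mean <- tapply(toy$wing, toy$age, mean, na.rm = TRUE)
age_wing_mean
```

This should give the same result as the named helper and the anonymous function, with no wrapper function needed.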

Solutions - Practice dataset 4

load("birdlist.Rdata")
str(birdlist)

## List of 2
##  $ cc:'data.frame':  101 obs. of  3 variables:
##   ..$ ring_no    : Factor w/ 101 levels "BLP601","BLP620",..: 101 100 99 98 97 96 12 11 10 9 ...
##   ..$ age        : Factor w/ 3 levels "3","3J","4": 1 1 1 2 1 2 2 2 2 2 ...
##   ..$ wing_length: int [1:101] 63 62 63 NA 56 62 57 61 63 63 ...
##  $ ww:'data.frame':  141 obs. of  3 variables:
##   ..$ ring_no    : Factor w/ 141 levels "BLP621","BLP622",..: 141 140 139 138 137 136 9 8 7 6 ...
##   ..$ age        : Factor w/ 3 levels "3","3J","4": 1 1 2 1 2 2 3 3 3 3 ...
##   ..$ wing_length: int [1:141] 66 68 63 67 66 66 62 64 64 64 ...

So this is a list of two data frames, one per species. We can calculate the mean for one of the columns in each data frame using lapply(). Note that we have to make sure to ignore any missing values!

lapply(birdlist, function(x) mean(x$wing_length, na.rm = TRUE))

## $cc
## [1] 59.93548
## 
## $ww
## [1] 65.62698
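
If a named vector is more convenient than a list, sapply() can be used in the same way as lapply(). This is a sketch on a made-up list (toy_list, its element names and values are hypothetical, not from birdlist.Rdata).

```r
# Toy list of data frames with made-up values (hypothetical)
toy_list <- list(
  sp1 = data.frame(wing_length = c(60, 62, NA)),
  sp2 = data.frame(wing_length = c(65, 67, 66))
)

# sapply() simplifies the per-element results to a named numeric vector
species_means <- sapply(toy_list, function(d) mean(d$wing_length, na.rm = TRUE))
species_means
```

The names of the result are the names of the list elements, so each mean stays labelled with its species.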

We can do more complex operations by creating our own custom function. In this case, the function calculates the mean wing length for each age class, in each list element (i.e. for each species).

age_means <- function(x) {
  # x is the data frame of one species; w is the wing lengths of one age class
  tapply(x$wing_length, x$age, function(w) mean(w, na.rm = TRUE))
}

We now apply this custom function to the list by doing:

lapply(birdlist, age_means)

## $cc
##        3       3J        4 
## 60.56000 60.22727 58.75000 
## 
## $ww
##        3       3J        4 
## 65.27500 65.09091 66.03125
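
Another way to get the same species-by-age means, sketched here under the assumption that all list elements share the same columns, is to stack the list into one data frame and give tapply() two grouping variables. The list toy_list, its values, and the names all_birds and means_tab are made up for illustration.

```r
# Toy list with made-up values (hypothetical, not from birdlist.Rdata)
toy_list <- list(
  sp1 = data.frame(age = c("3", "3", "4"), wing_length = c(60, 62, 58)),
  sp2 = data.frame(age = c("3", "4", "4"), wing_length = c(66, 64, NA))
)

# Stack the list into one data frame and record the species of each row
all_birds <- do.call(rbind, toy_list)
all_birds$species <- rep(names(toy_list), times = sapply(toy_list, nrow))

# A list of two grouping variables gives a species-by-age table of means
means_tab <- tapply(all_birds$wing_length,
                    list(all_birds$species, all_birds$age),
                    mean, na.rm = TRUE)
means_tab
```

The result is a matrix with one row per species and one column per age class, which can be easier to compare than a list of separate vectors.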

Stirling Coding Club - BaseR - Practice

Jeroen Minderman

01/10/2019