SCC: Some tips & tricks using Base-R

class: top, left, title-slide

# SCC: Some tips & tricks using Base-R
## with a focus on data manipulation
### Jeroen Minderman (<a href="mailto:jeroen.minderman2@stir.ac.uk" class="email">jeroen.minderman2@stir.ac.uk</a>)
### 26 September 2019 (updated: 02/10/19)

---

# "Base-R"

## Wot now?

https://stat.ethz.ch/R-manual/R-devel/library/base/DESCRIPTION

```html
Package: base 
Version: 3.7.0 
Priority: base 
Title: The R Base Package 
Author: R Core Team and contributors worldwide 
Maintainer: R Core Team <R-core@r-project.org> 
Description: Base R functions. 
License: Part of R 3.7.0 
Suggests: methods 
Built: R 3.7.0; ; Mon Sep 23 01:12:01 UTC 2019; unix 
```

"Base-R": the R functionality you get on first installation, without any extra packages

---

# What *"extra packages"* be this?

## Tonnes of options for data wrangling in R
--
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/05/tidyverse-default.png)

---
class: inverse, center, middle

![](https://media.giphy.com/media/gsSEDeDJvKRDq/source.gif)

---

# So why use Base-R?

.left-column[
]

.right-column[
{{content}}
]

--
- You are a masochist

![](https://media.giphy.com/media/YPjFIgSccMHS0/source.gif)
---

# So why use Base-R?

.left-column[
]

.right-column[
{{content}}
]

--
- You are a masochist
{{content}}
--
- You like to know the basics
{{content}}
--
- You like to be flexible
{{content}}
--
- Backwards "portability" (?)
{{content}}
--
- You prefer "base R" coding style
{{content}}
--
- **Not reliant on "third party" packages**

{{content}}
--
**Basically, it's a matter of ___preference___...  **

{{content}}
--
*..but there are some advantages!*

---

# Session outline

.left-column[
]

.right-column[
{{content}}
]
--
Just a few examples of handy functions for data wrangling in base-R...
{{content}}
--
1. Matrices, dataframes, indexing (a *very brief* recap)
{{content}}
--
2. Using *match()* to merge data frames
{{content}}
--
3. Doing stuff with rows and columns: *apply()*
{{content}}
--
4. Doing stuff with factors: *tapply()*
{{content}}
--
5. Lists and doing stuff to them: *lapply()*
{{content}}
--
6. ?   
{{content}}

**Health warning:** your mileage (and methods) may vary.  
This is just one way, and one I like using.

---

#Matrices, dataframes, indexing (1)

##"Row, Column"

- So, .content-box-red[`df1[i,j]`]  = row *i* and column *j* from data frame (or matrix) *df1*.
- Leaving one of the indexes blank means return "all"

.pull-left[

```r
df1
```

```
##   c1 c2
## 1  a  1
## 2  b  2
## 3  c  3
```
]
--
.pull-right[

```r
df1[1,]
```

```
##   c1 c2
## 1  a  1
```

```r
df1[,2]
```

```
## [1] 1 2 3
```
]

---

#Matrices, dataframes, indexing (2)

##Indexing works on both **matrices** and dataframes.

- Matrices are just tables of numbers 
- So you can think of them of matrices in the mathematical sense

.pull-left[

```r
mat1
```

```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
```

```r
mat1[1,]
```

```
## [1] 1 3
```
]
--
.pull-right[

```r
eigen(mat1)$values
```

```
## [1]  5.3722813 -0.3722813
```

```r
mat1[1,] %*% mat1[2,]
```

```
##      [,1]
## [1,]   14
```
]

---

#Matrices, dataframes, indexing (3)

##Indexing works on both matrices and **dataframes**.

- Dataframes are matrices with a few more features
- They (can) have descriptive row- and column names
- They can have columns with different data types

.pull-left[

```r
df2
```

```
##      id measure
## row1  A     1.1
## row2  B     2.2
## row3  C     3.3
```

```r
df2[2,]
```

```
##      id measure
## row2  B     2.2
```
]
--
.pull-right[
Calculus doesn't (always) work...

```r
eigen(df2)
```
But you can express (numeric) dataframes as matrices or vice versa:

```r
df3 <- as.data.frame(mat1)
```
]

---

#Matrices, dataframes, indexing (4)

##Dataframes - indexing by column- or row names

.pull-left[

```r
df2
```

```
##      id measure
## row1  A     1.1
## row2  B     2.2
## row3  C     3.3
```

```r
names(df2)
```

```
## [1] "id"      "measure"
```

```r
row.names(df2)
```

```
## [1] "row1" "row2" "row3"
```
]
--
.pull-right[

```r
df2[,2]
df2$measure
df2[,"measure"]
```

```
## [1] 1.1 2.2 3.3
```

]

---

#Matrices, dataframes, indexing (5)

##Dataframes - more uses of column/row names

.pull-left[

```r
df2
```

```
##      id measure
## row1  A     1.1
## row2  B     2.2
## row3  C     3.3
```

```r
names(df2)
```

```
## [1] "id"      "measure"
```

```r
names(df2) <- c("indiv","val")
df2
```

```
##      indiv val
## row1     A 1.1
## row2     B 2.2
## row3     C 3.3
```
]
--
.pull-right[
**Adding a column**

```r
df2$newcol <- c(7.0,7.3,7.9)
df2
```

```
##      indiv val newcol
## row1     A 1.1    7.0
## row2     B 2.2    7.3
## row3     C 3.3    7.9
```
**Removing a column**:

```r
df2$val <- NULL
df2
```

```
##      indiv newcol
## row1     A    7.0
## row2     B    7.3
## row3     C    7.9
```

]

---

#Using _match()_ to merge data frames (1)

##Now for something more interesting..!
- `match()` matches values from one vector to values from another.
- We can use this to "match up" data from one dataframe to another (equivalent of "merge" or "joins").

.pull-left[

```r
df2
```

```
##      indiv newcol
## row1     A    7.0
## row2     B    7.3
## row3     C    7.9
```
]
--
.pull-right[

```r
df3
```

```
##   id multimeasure
## 1  B           81
## 2  B           87
## 3  C           91
## 4  C           93
```
]

.content-box-red[
We want to insert 'ncol' values for each 'indiv' from dataframe `df2` into dataframe `df3`, by individual identifier.
]

---

#Using _match()_ to merge data frames (2)

.pull-left[

```r
df3
```

```
##   id multimeasure
## 1  B           81
## 2  B           87
## 3  C           91
## 4  C           93
```
]

.pull-right[

```r
df2
```

```
##      indiv newcol
## row1     A    7.0
## row2     B    7.3
## row3     C    7.9
```
]

```r
df3$newcol <- df2$newcol[match(df3$id, df2$indiv)]
```

.center[

```r
df3
```

```
##   id multimeasure newcol
## 1  B           81    7.3
## 2  B           87    7.3
## 3  C           91    7.9
## 4  C           93    7.9
```
]

---

# Working with rows and columns: _apply()_ (1)
The `apply()` function takes a dataframe/matrix and "applies" a function to one of its dimensions: rows, or columns.

.center[

```r
df4
```

```
##   col1 col2
## 1  1.1  4.2
## 2  2.1  5.2
## 3  3.1  6.2
```
Rows, columns...
]
--
.pull-left[
**Row means** (rows = 1)

```r
apply(df4, 1, mean)
```

```
## [1] 2.65 3.65 4.65
```
]
--
.pull-right[
**Column medians** (row = 2)

```r
apply(df4, 2, median)
```

```
## col1 col2 
##  2.1  5.2
```
]

---

# Working with rows and columns: _apply()_ (2)

`apply()` works with custom functions, too!

.pull-left[

```r
df4[2,2] <- NA
df4
```

```
##   col1 col2
## 1  1.1  4.2
## 2  2.1   NA
## 3  3.1  6.2
```

```r
apply(df4, 2, mean)
```

```
## col1 col2 
##  2.1   NA
```
]
--
.pull-right[

```r
mean_no_na <- function(x) {
 return(mean(x, na.rm=T))
}

apply(df4, 2, mean_no_na)
```

```
## col1 col2 
##  2.1  5.2
```

]

---

# Working with factors: _tapply()_ (1)

- `tapply()` is similar to apply() but can apply a function to a vector while subsetting using another vector  
- This is useful for calculating group means, for example.

.pull-left[

```r
df3
```

```
##   id multimeasure newcol
## 1  B           81    7.3
## 2  B           87    7.3
## 3  C           91    7.9
## 4  C           93    7.9
```
]
--
.pull-right[

```r
df3$id
```

```
## [1] B B C C
## Levels: B C
```
]
--
.center[

```r
tapply(df3$multimeasure, df3$id, mean)
```

```
##  B  C 
## 84 92
```
]

---

# Working with factors: _tapply()_ (2)

- What if we want to add the individual means to the table?
- One option is to use `tapply()` to both calculate both means and the number of measures per group (using `length()`)
- Then use `rep()` to create a new column

```r
id_means <- tapply(df3$multimeasure, df3$id, mean)
id_obs <- tapply(df3$multimeasure, df3$id, length)
```
--
.left-column[

```r
id_means
```

```
##  B  C 
## 84 92
```

```r
id_obs
```

```
## B C 
## 2 2
```
]
--
.right-column[

```r
means_col <- rep(id_means, id_obs)
df3$means_col <- means_col
df3
```

```
##   id multimeasure newcol means_col
## 1  B           81    7.3        84
## 2  B           87    7.3        84
## 3  C           91    7.9        92
## 4  C           93    7.9        92
```
]

---

# Working with lists: _lapply()_ (1)

- Lists are collections of other objects, e.g. matrices or dataframes.
- They are convenient in e.g. working with more than "two dimensions"

```r
mylist <- list(df2, df3)
mylist
```

```
## [[1]]
##      id  c2
## row1  A 7.0
## row2  B 7.3
## row3  C 7.9
## 
## [[2]]
##   id c1  c2 c3
## 1  B 81 7.3 84
## 2  B 87 7.3 84
## 3  C 91 7.9 92
## 4  C 93 7.9 92
```

---

# Working with lists: _lapply()_ (2)

- They can be indexed in the same way as matrices/DF's, but using double square brackets`[[i]]`

.pull-left[

```r
mylist[[1]]
```

```
##      id  c2
## row1  A 7.0
## row2  B 7.3
## row3  C 7.9
```

```r
mylist[[1]][1,]
```

```
##      id c2
## row1  A  7
```
]
--
.pull-right[

```r
mylist[[2]]
```

```
##   id c1  c2 c3
## 1  B 81 7.3 84
## 2  B 87 7.3 84
## 3  C 91 7.9 92
## 4  C 93 7.9 92
```

```r
mylist[[2]][,1]
```

```
## [1] B B C C
## Levels: B C
```
]

---

# Working with lists: _lapply()_ (3)

- Lists can also be named

```r
names(mylist) <- c("element1", "element2")
```

.pull-left[

```r
mylist
```

```
## $element1
##      id  c2
## row1  A 7.0
## row2  B 7.3
## row3  C 7.9
## 
## $element2
##   id c1  c2 c3
## 1  B 81 7.3 84
## 2  B 87 7.3 84
## 3  C 91 7.9 92
## 4  C 93 7.9 92
```
]
--
.pull-right[

```r
mylist[[2]]
mylist[["element2"]]
```

```r
mylist$element2
```

```
##   id c1  c2 c3
## 1  B 81 7.3 84
## 2  B 87 7.3 84
## 3  C 91 7.9 92
## 4  C 93 7.9 92
```
]

---
# Working with lists: _lapply()_ (4)

- Finally, `lapply()` can be used to apply functions to (elements of) lists.
- Only two arguments, the name of the list and the function to apply.
- Simple example, find the column names (`names(`) for each list element:

```r
lapply(mylist, names)
```

```
## $element1
## [1] "id" "c2"
## 
## $element2
## [1] "id" "c1" "c2" "c3"
```
---
# Working with lists: _lapply()_ (5)

This is far more interesting when using custom functions.
For example, calculate the mean for column `c2` in each list element:

.pull-left[

```r
mylist
```

```
## $element1
##      id  c2
## row1  A 7.0
## row2  B 7.3
## row3  C 7.9
## 
## $element2
##   id c1  c2 c3
## 1  B 81 7.3 84
## 2  B 87 7.3 84
## 3  C 91 7.9 92
## 4  C 93 7.9 92
```
]

--
.pull-right[

```r
c2_mean <- function(x) {
 return(mean(x$c2))
}

lapply(mylist, c2_mean)
```

```
## $element1
## [1] 7.4
## 
## $element2
## [1] 7.6
```
]

---
class: inverse, center, middle

![](https://media.giphy.com/media/NhmsYsjPq5ZKg/giphy.gif)

# Wanna play?

[https://stirlingcodingclub.github.io/Base-R/Base-R-practice.html](https://stirlingcodingclub.github.io/Base-R/Base-R-practice.html)  
or the Github repo itself:
[https://github.com/StirlingCodingClub/Base-R](https://github.com/StirlingCodingClub/Base-R)