Simulating data is a useful tool for better understanding statistical analyses and planning experimental designs. These notes illustrate how to simulate data using a variety of different functions in the R programming language, then discuss how data simulation can be used in research. These notes borrow heavily from a Stirling Coding Club session on randomisation, and to a lesser extent from a session on linear models. After working through these notes, the reader should be able to simulate their own data sets and use them to explore data visualisation and statistical analysis. These notes are also available as a PDF.
The ability to generate simulated data is very useful in a lot of research contexts. Simulated data can be used to better understand statistical methods, or in some cases to actually run statistical analyses (e.g., simulating a null distribution against which to compare a sample). Here I want to demonstrate how to simulate data in R. This can be accomplished with base R functions including rnorm, runif, rbinom, rpois, and rgamma; all of these functions sample univariate data (i.e., one variable) from a specified distribution. The function sample can be used to sample elements from an R object with or without replacement. Using the MASS library, the mvrnorm function will sample multiple variables with a known correlation structure (i.e., we can tell R how variables should be correlated with one another) and normally distributed errors.
Below, I will first demonstrate how to use some common functions in R for simulating data. Then, I will illustrate how these simulated data might be used to better understand common statistical analyses and data visualisation.
Simulating univariate random numbers
Below, I introduce some base R functions that simulate (pseudo)random numbers from a given distribution. Note that most of what follows in this section is a recreation of a similar section in the notes for randomisation analysis in R.
Sampling from a uniform distribution
The runif function returns some number (n) of random numbers from a uniform distribution with a range from \(a\) (min) to \(b\) (max), such that \(X\sim\mathcal U(a,b)\) (verbally, \(X\) is sampled from a uniform distribution with the parameters \(a\) and \(b\)), where \(-\infty < a < b < \infty\) (verbally, \(a\) is greater than negative infinity but less than \(b\), and \(b\) is finite). The default is to draw from a standard uniform distribution (i.e., \(a = 0\) and \(b = 1\)), as done below.
rand_unifs_10 <- runif(n = 10, min = 0, max = 1);
The above code stores a vector of ten numbers, rand_unifs_10, shown below. Note that the numbers will be different each time we re-run the runif function above.
## [1] 0.32624815 0.66646266 0.08118940 0.43764419 0.30294628 0.18661117
## [7] 0.19025308 0.26780157 0.53324194 0.05166473
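If we want the same values on every re-run (e.g., for a reproducible example), we can fix the random seed with the base R function set.seed before sampling. A minimal sketch, with an arbitrary seed value of my own choosing:
set.seed(2024); # Any integer; fixes the (pseudo)random number stream
rand_unifs_seeded <- runif(n = 10, min = 0, max = 1);
print(rand_unifs_seeded); # Identical values every time these lines are re-run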
We can visualise the standard uniform distribution that is generated by plotting a histogram of a very large number of values created using runif.
rand_unifs_10000 <- runif(n = 10000, min = 0, max = 1);
hist(rand_unifs_10000, xlab = "Random value (X)", col = "grey",
main = "", cex.lab = 1.5, cex.axis = 1.5);
The random uniform distribution is special in one key way: the algorithm for generating random uniform numbers is the starting point for generating random numbers from other distributions, using methods such as rejection sampling, inverse transform sampling, or the Box-Muller method (Box and Muller 1958).
Sampling from a normal distribution
The rnorm function returns some number (n) of randomly generated values given a set mean (\(\mu\); mean) and standard deviation (\(\sigma\); sd), such that \(X\sim\mathcal N(\mu,\sigma^2)\). The default is to draw from a standard normal (a.k.a. "Gaussian") distribution (i.e., \(\mu = 0\) and \(\sigma = 1\)).
rand_norms_10 <- rnorm(n = 10, mean = 0, sd = 1);
The above code stores a vector of 10 numbers, shown below.
## [1] -1.62366254 0.44484705 -0.36194250 0.02640805 -0.21144759 0.85520311
## [7] 1.52687944 1.20101330 0.12107444 -0.53072527
We can verify that a standard normal distribution is generated by plotting a histogram of a very large number of values created using rnorm.
rand_norms_10000 <- rnorm(n = 10000, mean = 0, sd = 1);
hist(rand_norms_10000, xlab = "Random value (X)", col = "grey",
main = "", cex.lab = 1.5, cex.axis = 1.5);
Generating a histogram using data from a simulated distribution like this is often a useful way to visualise distributions, or to see how samples from the same distribution might vary. For example, if we wanted to compare the above distribution with a normal distribution that had a standard deviation of 2 instead of 1, then we could simply sample 10000 new values in rnorm with sd = 2 instead of sd = 1 and create a new histogram with hist. If we wanted to see what the distribution of sampled data might look like given a low sample size (e.g., 10), then we could repeat the process of sampling from rnorm(n = 10, mean = 0, sd = 1) multiple times and look at the shape of the resulting histograms (both ideas are sketched below).
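A minimal sketch of both ideas; the object name and the 2-by-2 panel layout are my own choices:
rand_norms_sd2 <- rnorm(n = 10000, mean = 0, sd = 2); # Wider distribution
hist(rand_norms_sd2, xlab = "Random value (X)", col = "grey",
     main = "", cex.lab = 1.5, cex.axis = 1.5);
par(mfrow = c(2, 2)); # Show four small-sample histograms side by side
for(i in 1:4){
    hist(rnorm(n = 10, mean = 0, sd = 1), xlab = "Random value (X)",
         col = "grey", main = "");
}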
Sampling from a Poisson distribution
Many processes in biology can be described by a Poisson distribution. A Poisson process describes events happening with some given probability over an interval of time or space, such that \(X\sim Poisson(\lambda)\), where the rate parameter \(\lambda\) is both the mean and variance of the Poisson distribution (note that, by definition, \(\lambda > 0\), and although \(\lambda\) can be any positive real number, the data are always non-negative integers, as with count data). Sampling from a Poisson distribution can be done in R with rpois, which takes only two arguments, specifying the number of values to be returned (n) and the rate parameter (lambda).
rand_poissons <- rpois(n = 10, lambda = 1.5);
print(rand_poissons);
## [1] 2 1 1 2 1 1 2 2 0 2
There are no default values for rpois. We can plot a histogram of a large number of values to see the distribution when \(\lambda = 4.5\), as below.
rand_poissons_10000 <- rpois(n = 10000, lambda = 4.5);
hist(rand_poissons_10000, xlab = "Random value (X)", col = "grey",
main = "", cex.lab = 1.5, cex.axis = 1.5);
Sampling from a binomial distribution
Sampling from a binomial distribution in R with rbinom is a bit more complex than using runif, rnorm, or rpois. Like those previous functions, the rbinom function returns some number (n) of random numbers, but the arguments and output can be slightly confusing at first. Recall that a binomial distribution describes the number of 'successes' for some number of independent trials (\(\Pr(success) = p\)). The rbinom function returns the number of successes after size trials, in which the probability of success in each trial is prob. For a concrete example, suppose we want to simulate the flipping of a fair coin 1000 times, and we want to know how many times that coin comes up heads ('success'). We can do this with the following code.
coin_flips <- rbinom(n = 1, size = 1000, prob = 0.5);
print(coin_flips);
## [1] 496
The above result shows that the coin came up heads 496 times. Note, however, the (required) argument n above. This allows the user to set the number of sequences to run. In other words, if we set n = 2, then this could simulate the flipping of a fair coin 1000 times once to see how many times heads comes up, then repeating the whole process a second time to see how many times heads comes up again (or, if it is more intuitive, the flipping of two separate fair coins 1000 times each).
coin_flips_2 <- rbinom(n = 2, size = 1000, prob = 0.5);
print(coin_flips_2);
## [1] 500 474
In the above, a fair coin was flipped 1000 times and returned 500 heads, and then another fair coin was flipped 1000 times and returned 474 heads. As with the rnorm and runif functions, we can check to see what the distribution of the binomial function looks like if we repeat this process. Suppose, in other words, that we want to see the distribution of the number of times heads comes up after 1000 flips. We can, for example, simulate the process of flipping 1000 times in a row with 10000 different coins using the code below.
coin_flips_10000 <- rbinom(n = 10000, size = 1000, prob = 0.5);
I have not printed the above coin_flips_10000 for obvious reasons, but we can use a histogram to look at the results.
hist(coin_flips_10000, xlab = "Random value (X)", col = "grey",
main = "", cex.lab = 1.5, cex.axis = 1.5);
As would be expected, most of the time ‘heads’ occurs around 500 times out of 1000, but usually the actual number will be a bit lower or higher due to chance. Note that if we want to simulate the results of individual flips in a single trial, we can do so as follows.
flips_10 <- rbinom(n = 10, size = 1, prob = 0.5);
print(flips_10);
## [1] 0 0 0 0 1 0 1 1 0 0
In the above, there are n = 10 trials, but each trial consists of only a single coin flip (size = 1). But we can equally well interpret the results as a series of n coin flips that come up either heads (1) or tails (0). This latter interpretation can be especially useful to write code that randomly decides whether some event will happen (1) or not (0) with some probability prob.
The sample function
Sometimes it is useful to sample a set of values from a vector or list. The R function sample is very flexible for sampling a subset of numbers or elements from some structure (x) in R according to some set probabilities (prob). Elements can be sampled from x some number of times (size) with or without replacement (replace), though an error will be returned if the size of the sample is larger than x but replace = FALSE (the default).
Sampling random numbers from a list
To start out simple, suppose we want to ask R to pick a random number from one to ten with equal probability.
rand_number_1 <- sample(x = 1:10, size = 1);
print(rand_number_1);
## [1] 3
The above code will set rand_number_1 to a randomly selected value, in this case 3. Because we have not specified a probability vector prob, the function assumes that every element in 1:10 is sampled with equal probability. We can increase the size of the sample to 10 below.
rand_number_10 <- sample(x = 1:10, size = 10);
print(rand_number_10);
## [1] 3 7 9 1 6 4 8 10 5 2
Note that all numbers from 1 to 10 have been sampled, but in a random order. This is because the default is to sample without replacement, meaning that once a number has been sampled for the first element in rand_number_10, it is no longer available to be sampled again. To allow sampling with replacement, we can change the default.
rand_number_10_r <- sample(x = 1:10, size = 10, replace = TRUE);
print(rand_number_10_r);
## [1] 4 8 10 9 9 6 3 8 6 4
Note that the numbers {4, 6, 8, 9} are now repeated in the set of randomly sampled values above. We can also specify the probability of sampling each element, with the condition that these probabilities need to sum to 1. Below is an example in which the numbers 1-5 are each sampled with a probability of 0.05, while the numbers 6-10 are each sampled with a probability of 0.15, thereby biasing sampling toward larger numbers.
prob_vec <- c( rep(x = 0.05, times = 5), rep(x = 0.15, times = 5) );
rand_num_bias <- sample(x = 1:10, size = 10, replace = TRUE, prob = prob_vec);
print(rand_num_bias);
## [1] 10 6 8 6 3 10 4 9 6 2
Note that rand_num_bias above contains more numbers from 6-10 than from 1-5.
Sampling random characters from a list
Sampling characters from a list of elements is no different than sampling numbers, but I am illustrating it separately because I find that I often sample characters for conceptually different reasons. For example, if I want to create a simulated data set that includes three different species, I might create a vector of species identities from which to sample.
species <- c("species_A", "species_B", "species_C");
This gives three possible categories, which I can now use sample to draw from. Assume that I want to simulate the sampling of these three species, perhaps with species_A being twice as common as species_B and species_C. I might use the following code to sample 24 times.
sp_sample <- sample(x = species, size = 24, replace = TRUE,
                    prob = c(0.5, 0.25, 0.25));
Below are the values that get returned.
## [1] "species_B" "species_B" "species_A" "species_A" "species_A" "species_A"
## [7] "species_A" "species_A" "species_A" "species_C" "species_A" "species_A"
## [13] "species_A" "species_C" "species_A" "species_C" "species_B" "species_A"
## [19] "species_A" "species_C" "species_A" "species_B" "species_B" "species_B"
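We can check how closely the sampled frequencies match the specified probabilities using the base R function table; species_A should appear roughly twice as often as each of the other two species.
table(sp_sample);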
Simulating correlated variables
We can generate variables \(X_{1}\) and \(X_{2}\) that have a known correlation \(\rho\) with one another. The code below does this for two standard normal random variables with a sample size of 10000, such that the correlation between them is 0.3.
N <- 10000;
rho <- 0.3;
x1 <- rnorm(n = N, mean = 0, sd = 1);
x2 <- (rho * x1) + sqrt(1 - rho*rho) * rnorm(n = N, mean = 0, sd = 1);
Mathematically, these variables are generated by first simulating the sample \(x_{1}\) (x1 above) from a standard normal distribution. Then, \(x_{2}\) (x2 above) is calculated as below,
\[x_{2} = \rho x_{1} + \sqrt{1 - \rho^{2}} x_{rand},\]
where \(x_{rand}\) is a sample from a normal distribution with the same variance as \(x_{1}\). A simple call to the R function cor will confirm that the correlation does indeed equal rho (with some sampling error).
cor(x1, x2);
## [1] 0.307195
This is useful if we are only interested in two variables, but there is a much more efficient way to generate any number of variables with different variances and correlations to one another. To do this, we need to use the MASS library, which can be installed and loaded as below.
install.packages("MASS");
library("MASS");
In the MASS library, the function mvrnorm can be used to generate any number of variables for a pre-specified covariance structure.
Suppose we want to simulate a data set of three measurements from a species of organisms. Measurement 1 (\(M_1\)) has a mean of \(\mu_{M_{1}} =\) 159.54 and variance of \(Var(M_1) =\) 12.68, measurement 2 (\(M_2\)) has a mean of \(\mu_{M_{2}} =\) 245.26 and variance of \(Var(M_2) =\) 30.39, and measurement 3 (\(M_3\)) has a mean of \(\mu_{M_{3}} =\) 25.52 and variance of \(Var(M_3) =\) 2.18. Below is a table summarising these values.
measurement | mean | variance |
---|---|---|
M1 | 159.54 | 12.68 |
M2 | 245.26 | 30.39 |
M3 | 25.52 | 2.18 |
Further, we want the covariance between \(M_1\) and \(M_2\) to equal \(Cov(M_{1}, M_{2}) =\) 13.95, the covariance between \(M_1\) and \(M_3\) to equal \(Cov(M_{1}, M_{3}) =\) 3.07, and the covariance between \(M_2\) and \(M_3\) to equal \(Cov(M_{2}, M_{3}) =\) 4.7. We can put all of this information into a covariance matrix \(\textbf{V}\) with three rows and three columns. The diagonal of the matrix holds the variances of each variable, with the off-diagonals holding the covariances (note also that the variance of a variable \(M\) is just the variable’s covariance with itself; e.g., \(Var(M_{1}) = Cov(M_{1}, M_{1})\)).
\[ \textbf{V} = \begin{pmatrix} Var(M_{1}) & Cov(M_{1}, M_{2}) & Cov(M_{1}, M_{3}) \\ Cov(M_{2}, M_{1}) & Var(M_{2}) & Cov(M_{2}, M_{3}) \\ Cov(M_{3}, M_{1}) & Cov(M_{3}, M_{2}) & Var(M_{3}) \end{pmatrix}. \]
In R, we can create this matrix as follows.
matrix_data <- c(12.68, 13.95, 3.07, 13.95, 30.39, 4.70, 3.07, 4.70, 2.18);
cv_mat <- matrix(data = matrix_data, nrow = 3, ncol = 3, byrow = TRUE);
rownames(cv_mat) <- c("M1", "M2", "M3");
colnames(cv_mat) <- c("M1", "M2", "M3");
Here is what cv_mat looks like (note that it is symmetric about the diagonal).
## M1 M2 M3
## M1 12.68 13.95 3.07
## M2 13.95 30.39 4.70
## M3 3.07 4.70 2.18
Now we can add the means to a vector in R.
mns <- c(159.54, 245.26, 25.52);
We are now ready to use the mvrnorm function in R to simulate some number n of sampled organisms with these three measurements. We use the mvrnorm arguments mu and Sigma to specify the vector of means and the covariance matrix, respectively.
sim_data <- mvrnorm(n = 40, mu = mns, Sigma = cv_mat);
Here are the example data below.
M1 | M2 | M3 |
---|---|---|
155.2651 | 238.8919 | 24.07884 |
160.1899 | 244.7838 | 26.29690 |
156.1196 | 245.0469 | 24.62519 |
158.9075 | 248.4630 | 26.37611 |
159.5190 | 243.4377 | 24.76282 |
165.0315 | 253.6626 | 28.54185 |
162.6464 | 245.9994 | 25.20926 |
163.5349 | 254.9935 | 26.59349 |
158.4219 | 240.4827 | 24.17196 |
164.4322 | 256.9492 | 26.96259 |
156.5350 | 238.5975 | 26.10578 |
164.5456 | 257.1145 | 25.67510 |
157.9869 | 243.0143 | 26.22417 |
168.2971 | 253.9956 | 28.94757 |
162.3973 | 248.6685 | 25.05107 |
159.7965 | 247.9087 | 26.98007 |
163.9835 | 253.6689 | 27.17218 |
160.0837 | 248.8687 | 26.39687 |
157.7130 | 244.6729 | 24.51724 |
160.3263 | 239.3547 | 24.66200 |
156.7600 | 243.4928 | 24.19963 |
160.3282 | 243.3611 | 26.19117 |
158.3644 | 244.2543 | 26.21110 |
159.3612 | 244.3898 | 25.38269 |
155.8408 | 233.9546 | 23.58026 |
158.8804 | 241.5434 | 24.51186 |
159.4017 | 244.3609 | 26.12040 |
165.2231 | 250.1320 | 26.40259 |
152.1460 | 235.6425 | 23.15554 |
157.7196 | 243.5617 | 25.10958 |
158.3931 | 237.6672 | 26.11306 |
161.3122 | 250.6453 | 27.13218 |
162.4960 | 249.4207 | 24.81866 |
160.6010 | 248.4333 | 25.49542 |
158.8079 | 244.3131 | 24.84637 |
166.7818 | 255.5355 | 27.66026 |
156.3915 | 245.6346 | 24.18855 |
161.9607 | 251.4644 | 26.88508 |
155.0859 | 239.4893 | 22.74358 |
156.4410 | 245.6795 | 26.01055 |
We can check to confirm that the mean values of each column are correct using apply.
apply(X = sim_data, MARGIN = 2, FUN = mean);
## M1 M2 M3
## 159.95074 246.03878 25.65274
And we can check to confirm that the covariance structure of the data is correct using cov.
cov(sim_data);
## M1 M2 M3
## M1 12.233584 17.090830 3.596888
## M2 17.090830 33.898138 5.655008
## M3 3.596888 5.655008 1.871685
Note that the values are not exact, but they should become closer to the specified values as we increase the sample size n. We can visualise the data too; for example, we might look at the close correlation between \(M_{1}\) and \(M_{2}\) using a scatterplot, just as we would for data sampled from the field.
par(mar = c(5, 5, 1, 1));
plot(x = sim_data[,1], y = sim_data[,2], pch = 20, cex = 1.25, cex.lab = 1.25,
cex.axis = 1.25, xlab = expression(paste("Value of ", M[1])),
ylab = expression(paste("Value of ", M[2])));
We could even run an ordination on these simulated data. For example, we could extract the principal components with prcomp, then plot the first two PCs to visualise these data (a quick sketch is below). We might, for example, want to compare different methods of ordination using a data set with different, pre-specified properties (e.g., Minchin 1987). We might also want to use simulated data sets to investigate how different statistical tools perform. I show this in the next section, where I put a full data set together and run linear models on it.
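Here is a minimal sketch of that ordination idea, using the sim_data matrix from above (the decision to scale the variables is my own):
pca_out <- prcomp(sim_data, scale. = TRUE); # Principal component analysis
plot(x = pca_out$x[, 1], y = pca_out$x[, 2], pch = 20,
     xlab = "PC1", ylab = "PC2");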
Simulating a full data set
Putting everything together, here I will create a data set of three different species from which three different measurements are taken. We can just call these measurements 'length', 'width', and 'mass'. For simplicity, let us assume that these measurements always covary in the same way that we saw with \(\textbf{V}\) (i.e., cv_mat) above. But let us also assume that we have three species with slightly different mean values. Below is the code that will build a new data set of \(N = 20\) samples with four columns: species, length, width, and mass.
N <- 20;
matrix_data <- c(12.68, 13.95, 3.07, 13.95, 30.39, 4.70, 3.07, 4.70, 2.18);
cv_mat <- matrix(data = matrix_data, nrow = 3, ncol = 3, byrow = TRUE);
mns_1 <- c(159.54, 245.26, 25.52);
sim_data_1 <- mvrnorm(n = N, mu = mns_1, Sigma = cv_mat);
colnames(sim_data_1) <- c("Length", "Width", "Mass");
# Below, I bind a column indicating 'species_1' identity
species <- rep(x = "species_1", times = N); # Repeats N = 20 times
sp_1 <- data.frame(species, sim_data_1);
Let us add one more data column. Suppose that we can also sample the number of offspring each organism has, and that the mean number of offspring that an organism has equals one tenth of the organism's mass. To do this, we can use rpois, taking advantage of the fact that the argument lambda can be a vector rather than a single value. So to get the number of offspring for each organism based on its body mass, we can just insert the mass vector sp_1$Mass times 0.1 for lambda.
offspring <- rpois(n = N, lambda = sp_1$Mass * 0.1);
sp_1 <- cbind(sp_1, offspring);
I have also bound the offspring number to the data set sp_1. Here is what it looks like below.
species | Length | Width | Mass | offspring |
---|---|---|---|---|
species_1 | 160.0961 | 249.7890 | 25.93705 | 3 |
species_1 | 160.5342 | 247.5876 | 26.06193 | 3 |
species_1 | 159.9338 | 244.3800 | 26.44370 | 2 |
species_1 | 155.9637 | 246.8168 | 27.22432 | 4 |
species_1 | 158.8237 | 246.1629 | 26.73524 | 2 |
species_1 | 156.2496 | 241.0867 | 23.84185 | 4 |
species_1 | 157.5895 | 242.6299 | 25.90330 | 4 |
species_1 | 158.2515 | 250.9222 | 25.90506 | 2 |
species_1 | 160.8467 | 245.7263 | 24.37506 | 2 |
species_1 | 156.7972 | 243.3848 | 25.46426 | 4 |
species_1 | 156.0726 | 246.5050 | 25.81432 | 2 |
species_1 | 165.4721 | 252.4606 | 25.94337 | 3 |
species_1 | 155.4319 | 240.2571 | 24.42234 | 1 |
species_1 | 161.3542 | 248.6756 | 24.97249 | 0 |
species_1 | 163.3283 | 246.9854 | 27.75439 | 3 |
species_1 | 158.3656 | 248.1103 | 25.26652 | 0 |
species_1 | 165.0526 | 249.2232 | 26.04674 | 1 |
species_1 | 160.7815 | 245.3200 | 25.64529 | 2 |
species_1 | 162.6027 | 249.0639 | 26.49572 | 2 |
species_1 | 157.5804 | 246.8049 | 26.35311 | 3 |
To add two more species, let us repeat the process two more times, but change the expected mass just slightly each time. The code below does this, and puts everything together in a single data set.
# First make species 2
mns_2 <- c(159.54, 245.26, 25.52 + 3); # Add a bit to the mean mass
sim_data_2 <- mvrnorm(n = N, mu = mns_2, Sigma = cv_mat);
colnames(sim_data_2) <- c("Length", "Width", "Mass");
species <- rep(x = "species_2", times = N); # Repeats N = 20 times
offspring <- rpois(n = N, lambda = sim_data_2[,3] * 0.1);
sp_2 <- data.frame(species, sim_data_2, offspring);
# Now make species 3
mns_3 <- c(159.54, 245.26, 25.52 + 4.5); # Add a bit more
sim_data_3 <- mvrnorm(n = N, mu = mns_3, Sigma = cv_mat);
colnames(sim_data_3) <- c("Length", "Width", "Mass");
species <- rep(x = "species_3", times = N); # Repeats N = 20 times
offspring <- rpois(n = N, lambda = sim_data_3[,3] * 0.1);
sp_3 <- data.frame(species, sim_data_3, offspring);
# Bring it all together in one data set
dat <- rbind(sp_1, sp_2, sp_3);
Our full data set now looks like the one below.
species | Length | Width | Mass | offspring |
---|---|---|---|---|
species_1 | 160.0961 | 249.7890 | 25.93705 | 3 |
species_1 | 160.5342 | 247.5876 | 26.06193 | 3 |
species_1 | 159.9338 | 244.3800 | 26.44370 | 2 |
species_1 | 155.9637 | 246.8168 | 27.22432 | 4 |
species_1 | 158.8237 | 246.1629 | 26.73524 | 2 |
species_1 | 156.2496 | 241.0867 | 23.84185 | 4 |
species_1 | 157.5895 | 242.6299 | 25.90330 | 4 |
species_1 | 158.2515 | 250.9222 | 25.90506 | 2 |
species_1 | 160.8467 | 245.7263 | 24.37506 | 2 |
species_1 | 156.7972 | 243.3848 | 25.46426 | 4 |
species_1 | 156.0726 | 246.5050 | 25.81432 | 2 |
species_1 | 165.4721 | 252.4606 | 25.94337 | 3 |
species_1 | 155.4319 | 240.2571 | 24.42234 | 1 |
species_1 | 161.3542 | 248.6756 | 24.97249 | 0 |
species_1 | 163.3283 | 246.9854 | 27.75439 | 3 |
species_1 | 158.3656 | 248.1103 | 25.26652 | 0 |
species_1 | 165.0526 | 249.2232 | 26.04674 | 1 |
species_1 | 160.7815 | 245.3200 | 25.64529 | 2 |
species_1 | 162.6027 | 249.0639 | 26.49572 | 2 |
species_1 | 157.5804 | 246.8049 | 26.35311 | 3 |
species_2 | 160.6686 | 249.5760 | 26.84163 | 2 |
species_2 | 162.8042 | 246.9504 | 26.91129 | 6 |
species_2 | 157.0182 | 238.9712 | 24.16082 | 2 |
species_2 | 155.2083 | 238.8988 | 24.10644 | 5 |
species_2 | 155.3958 | 240.9339 | 23.35060 | 1 |
species_2 | 166.3838 | 253.2659 | 28.15383 | 4 |
species_2 | 160.0383 | 246.0190 | 27.22687 | 6 |
species_2 | 165.6907 | 245.9201 | 27.50694 | 2 |
species_2 | 165.4295 | 249.8916 | 25.70299 | 2 |
species_2 | 160.1070 | 242.6397 | 23.56593 | 1 |
species_2 | 166.1218 | 253.5333 | 26.40232 | 2 |
species_2 | 162.6699 | 250.7782 | 26.53539 | 1 |
species_2 | 156.2615 | 241.2675 | 25.07833 | 0 |
species_2 | 157.3258 | 250.9699 | 25.09196 | 2 |
species_2 | 157.0622 | 237.3069 | 25.25632 | 1 |
species_2 | 156.6568 | 246.1136 | 23.47298 | 0 |
species_2 | 161.1317 | 251.4044 | 26.87695 | 1 |
species_2 | 165.4140 | 251.9115 | 26.71136 | 5 |
species_2 | 156.3627 | 238.7914 | 24.00011 | 1 |
species_2 | 156.3154 | 242.4368 | 23.26298 | 5 |
species_3 | 157.3481 | 243.0727 | 24.69311 | 3 |
species_3 | 159.2547 | 242.8557 | 24.85465 | 4 |
species_3 | 161.1190 | 247.3744 | 27.03653 | 2 |
species_3 | 157.2414 | 239.8949 | 25.17715 | 2 |
species_3 | 156.0892 | 242.4798 | 22.56819 | 3 |
species_3 | 156.4469 | 237.6946 | 25.42245 | 3 |
species_3 | 154.2984 | 238.9348 | 25.88996 | 2 |
species_3 | 159.8300 | 240.9981 | 21.57141 | 1 |
species_3 | 155.4234 | 239.1683 | 26.60738 | 2 |
species_3 | 162.4654 | 247.9817 | 22.47856 | 1 |
species_3 | 162.2042 | 249.6214 | 25.80513 | 2 |
species_3 | 164.0491 | 242.2735 | 25.04036 | 3 |
species_3 | 161.1303 | 244.6946 | 24.88290 | 2 |
species_3 | 153.4831 | 232.7507 | 23.02110 | 1 |
species_3 | 159.0085 | 244.0369 | 24.36326 | 3 |
species_3 | 156.6950 | 240.2346 | 23.13310 | 3 |
species_3 | 151.9992 | 241.2254 | 22.63430 | 2 |
species_3 | 161.7864 | 242.6725 | 24.89283 | 1 |
species_3 | 166.4894 | 254.0258 | 29.23088 | 1 |
species_3 | 161.3459 | 241.7033 | 24.46137 | 1 |
To summarise, we now have a simulated data set of measurements from three different species, all of which have known variances and covariances of length, width, and mass. Each species has a slightly different mean mass, and for all species, each unit of mass increases the expected number of offspring by 0.1. Because we know these properties of the data for certain, we can start asking questions that might be useful to know about our data analysis. For example, given this covariance structure and these small differences in mass, is a sample size of 20 per species really enough to detect a significant difference among species masses using an ANOVA? What if we tried to test for differences among masses using some sort of randomisation approach instead? Would this give us more or less power? Let us run an ANOVA to see if the difference between group means (which we know exists) is recovered.
aov_result <- aov(Mass ~ species, data = dat);
summary(aov_result);
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 13.89 6.943 3.137 0.051 .
## Residuals 57 126.16 2.213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It appears not! What about the relationship between body mass and offspring production that we know exists? Below is a scatterplot of the data for the three different species.
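The code producing the figure is not shown above; here is a minimal sketch of one way to draw such a scatterplot (the colour choices are my own):
sp_colours <- c("black", "red", "blue")[as.factor(dat$species)]; # One colour per species
plot(x = dat$Mass, y = dat$offspring, pch = 20, cex = 1.25,
     col = sp_colours, xlab = "Body mass", ylab = "Offspring number");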
This looks like there might be a positive relationship, but it is very difficult to determine just from the scatterplot. We can use a generalised linear mixed model to test it, with species as a random effect, as we might do if these were data sampled from the field (do not worry about the details of the model here; the key point is that we can use the simulated data with known properties to assess the performance of a statistical test).
library(lme4);
## Loading required package: Matrix
mod <- glmer(offspring ~ Mass + (1 | species), data = dat, family = "poisson");
## boundary (singular) fit: see help('isSingular')
summary(mod);
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: offspring ~ Mass + (1 | species)
## Data: dat
##
## AIC BIC logLik deviance df.resid
## 210.1 216.4 -102.0 204.1 57
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.5051 -0.6416 -0.2359 0.5355 2.2414
##
## Random effects:
## Groups Name Variance Std.Dev.
## species (Intercept) 2.514e-18 1.586e-09
## Number of obs: 60, groups: species, 3
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.36171 1.44195 -0.944 0.345
## Mass 0.08626 0.05636 1.531 0.126
##
## Correlation of Fixed Effects:
## (Intr)
## Mass -0.998
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
There does not appear to be any effect here either! To get one, it appears that we will need to simulate a larger data set (or a bigger effect size – or just get lucky when re-simulating a new data set).
Note that I have run a model that might be reasonable given the structure of our data. But the advantage of working with simulated data, and knowing for certain what the relationship is between the underlying variables, is that we can explore different statistical techniques. For example, we know that our response variable offspring is count data, so we are supposed to specify a Poisson error structure using the family = "poisson" argument above, right? But what would happen if we just used a normal error structure anyway? Would this really be so bad? Now is the opportunity to test, because we know what the correct answer is supposed to be! Trying statistical methods that are normally ill-advised can actually be useful here, as it can help us see for ourselves when a technique is bad, or perhaps when it really is not (e.g., Ives 2015).
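As a sketch of that experiment (not a recommendation), we could fit the same fixed and random effects with a normal error structure using lmer from the already-loaded lme4 package, and compare the output with the Poisson model above:
# Same model structure, but Gaussian errors instead of Poisson
mod_norm <- lmer(offspring ~ Mass + (1 | species), data = dat);
summary(mod_norm);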
Simulating data can be a powerful tool for learning and investigating different statistical analyses. The main benefits of using simulated data are flexibility and certainty. Simulation gives us the flexibility to explore any number of hypotheticals, including different sample sizes, effect sizes, relationships between variables, and error distributions. It also works from a point of certainty; we know what the real relationship is between variables, and what the actual effect sizes are, because we define them when generating random samples. So if we want to better understand what would happen if we were unable to sample an important variable in our system, or if we were to use a biased estimator, or if we were to violate key model assumptions, simulated data are a very useful tool.