07_Random_Numbers.utf8

In this lesson we learn about generating “random” numbers in R.

Remember: In this lesson, all R code is below the R Source button, while the output is hidden. To see it you will need to click on the green “R Output” button. At the bottom of the page, you will find a bar which allows you to change the theme of the webpage (changing colors and format) so it can easily adapt to your system and preferences. There you also find “Code highlighting” which changes how R code is displayed to you, and Toggle R code and Figures.

13. Generating “Random” Numbers

The random word is between quotes above because R - or any software - cannot really generate truly random numbers. It gets really close though. This is because softwares use an algorithm for generating a sequence of numbers whose properties approximate the properties of sequences of random numbers. This is termed a pseudorandom number generator, which is not truly random, because it is completely determined by a relatively small set of initial values, called the seed.

Sampling from a vector

Below we will sample from several vectors, and emulate tossing a coin, rolling a die, playing the lottery, etc.

# Toss a coin
sample(0:1, 1)

[1] 0

# Toss 10 coins
sample(0:1, 10, replace = T)

 [1] 0 1 1 0 0 1 0 0 1 0

# Roll a die
sample(1:6, 1)

[1] 1

# Roll 10 dice
sample(1:6, 10, replace = T)

 [1] 1 2 1 1 4 3 6 1 5 4

# Roll a Icosahedron (20 sides die)
sample(1:20, 10, replace = T)

 [1]  3  3  1  4  7 20  4 15 18  7

# Lottery
sample(1:49, 6, replace = F)

[1] 49 28 46 44 35 32

# Pick a card, any card...
sample(1:54, 6, replace = F)

[1] 43 22 45  9  6 39

# That is a bit clunky, let's try something more advanced
sample(paste(rep(c(1:10, "Jacks", "Queen", "King"), 4), rep(c("of Hearts", "of Diamonds", 
    "of Spades", "of Clubs"), 13), sep = " "), 1)

[1] "7 of Spades"

Advanced Concept: Random & Distributions

In computational statistics, and in R, random numbers are described by a distributions, which is a function specifying the probability that a random number is within some range. If the random number is continuous this is called “probability density function”, if the random number is discrete then the term is “probability mass function”.

If you want to learn more about Probability density functions, you can run the code below in your RStudio, which will produce a Shiny App displaying the probability density functions for the Normal, Poisson and Beta distributions. Here’s the link if you want the raw App

install.packages("shiny", dependencies = TRUE)
library(shiny)
runGist("00641dba8784bb81bd10")

Sampling from a Distribution

How to choose a random number in R? As a language for statistical analysis, R has a comprehensive library of functions for generating random numbers from various statistical distributions.

Distribution	R Function
Uniform	runif
Normal	rnorm
Student’s t	rt
F	rf
chi-squared	rchisq
Exponential	rexp
Log normal	rlnorm
Beta	rbeta
Binomial	rbinom
Negative Binomial	rnbinom
Poisson	rpois
Gamma	rgamma
Weibull	rweibull
Cauchy	rcauchy
Multinomial	rmultinom
Geometric	rgeom
?Distributions	full list

The Uniform Distribution

If you want to generate a decimal number where any value (including fractional values) between the stated minimum and maximum is equally likely, the runif() function is what you are looking for.This function generates values from the Uniform distribution. Here’s how to generate one random number between 0 and 1:

runif(1, min = 0, max = 1)

[1] 0.9137376

Of course, when you run this, you’ll get a different number, but it will definitely be between 0 and 1. You won’t get the values 0 or 1 exactly, either.

If you want to generate multiple random values, you can generate several values at once by specifying the number of values you want as the first argument to runif. Here’s how to generate 10 values between 0 and 1.

runif(10, 0, 1)

 [1] 0.87885608 0.81477497 0.65235701 0.87882472 0.79091215 0.02547837
 [7] 0.52762316 0.93280762 0.60910035 0.06869623

The Normal Distribution

$ϕ (z) = \frac{1}{\sqrt{2 π}} e^{- \frac{1}{2} z^{2}}, z \in R$

To generate numbers from a normal distribution, use rnorm().

# By default the mean is 0 and the standard deviation is 1
rnorm(1)

[1] 1.289339

rnorm(5)

[1]  0.9932258  1.0766345  0.2174389 -0.9219810  0.0725744

mean(rnorm(100))

[1] 0.03337551

# Define mean and standard deviation
rnorm(1, mean = 5, sd = 1)

[1] 5.393273

rnorm(5, mean = 5, sd = 3)

[1] 4.672877 7.401792 6.644993 4.941394 5.270967

rnorm(100, mean = 10, sd = 10)

 [1] 10.31894530 16.80250304 18.10784722  0.08568995 21.59440470
 [6] 19.55401943 -1.69759765 21.59461800  0.35120539 21.87267300
[11] -1.15231776 15.59054433  2.37211899  6.32556566 -7.61343484
[16]  3.89236555 29.18125869 16.23048050 -0.92424896  8.63113547
[21] 23.29894397 12.99061960  5.43935907 -3.31674416 -3.15785654
[26] 16.48814651 16.93066353  5.48412664 10.72263970 15.75018507
[31] 28.26182093  8.83974250  3.00131295 21.18738242 14.93161070
[36]  7.03396348 12.32072099 13.77648226 -3.47017397  3.50100191
[41] -2.96389458 -8.95821856 -2.33917897 12.35430728 17.69840678
[46] -0.75574540 18.27382409 33.62624533 15.87790376  6.87102435
[51] -2.59023601  9.84552986 11.87886572  6.17844728  3.39351901
[56] -6.17943581 30.33384480  5.50139386  4.57538429 -3.10978809
[61] -4.94923862  5.19114587 24.20771765 25.34217982 22.80346400
[66] -1.81224864  8.40414483 11.09749019 25.27509982 10.19619880
[71] 16.13360918 14.56508488 19.90997433 16.86898027  0.30329242
 [ reached getOption("max.print") -- omitted 25 entries ]

# Check that the distribution looks right, make a histogram of the numbers
hist(rnorm(100, mean = 10, sd = 10), breaks = 10)

The Poisson Distribution

To generate numbers from a poisson distribution, use rpois(). The Poisson distribution is popular for modeling the number of times an event occurs in an interval of time or space. The Poisson distribution may be useful to model events such as:

The number of meteors greater than 1 meter diameter that strike earth in a year
The number of occurrences of the DNA sequence “ACGT” in a gene
The number of patients arriving in an emergency room between 11 and 12 pm

In probability theory, a Poisson process is a stochastic process that counts the number of independent events in a given time interval. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

$p (x; λ) = \frac{e^{- λ} λ^{x}}{x!} for x = 0, 1, 2, \dots$

# 
rpois(10, 1)

 [1] 1 1 0 2 0 3 0 2 1 0

rpois(10, 4)

 [1] 5 5 2 5 4 5 4 4 3 7

# In the poisson ditribution mean and variance are the same: lambda
rpois(10, lambda = 1)

 [1] 1 1 1 0 0 0 2 2 1 0

rpois(10, lambda = 5)

 [1] 6 6 6 6 6 6 2 6 5 6

# Check that the distribution looks right, make a histogram of the numbers
hist(rpois(1000, lambda = 1), breaks = 5)

13.1 Wrap-up Exercise

Let’s say I have a categorical variable which can take the values A, B, C and D. How can I generate 10000 random data points and control for the frequency of each? For example: A = 10% B = 20% C = 65% D = 5%. Any ideas how to do this? Don’t fret if you have no ideas. I didn’t when I first tried to solve it. But it helped me a great deal to practice the skills I learned and how they can be useful.

Solution 1: Elegant and quickest

sample(LETTERS[1:4], 10000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))

 [1] "C" "C" "C" "A" "B" "C" "C" "B" "A" "A" "D" "A" "C" "C" "C" "B" "B"
[18] "B" "B" "C" "D" "C" "C" "C" "A" "C" "A" "A" "A" "C" "C" "C" "C" "B"
[35] "C" "C" "C" "B" "C" "C" "B" "C" "C" "C" "C" "C" "A" "B" "D" "C" "A"
[52] "C" "B" "D" "C" "B" "C" "D" "B" "C" "C" "C" "C" "C" "C" "D" "C" "C"
[69] "D" "A" "A" "C" "C" "C" "C"
 [ reached getOption("max.print") -- omitted 9925 entries ]

# But how do I know it is right? Save it to a 'vector' named x
x <- sample(LETTERS[1:4], 10000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))

# And then table its results
table(x)

x
   A    B    C    D 
1042 2008 6441  509

# If you want the check the exact proportions
prop.table(table(x))

x
     A      B      C      D 
0.1042 0.2008 0.6441 0.0509

Solution 2: Clever, but dirty

x <- sample(c(rep("A", 0.1*10000), 
              rep("B", 0.2*10000),
              rep("C", 0.65*10000), 
              rep("D", 0.05*10000)),
            1000)
x

 [1] "C" "C" "C" "C" "D" "C" "D" "C" "D" "C" "C" "B" "B" "B" "B" "C" "C"
[18] "C" "B" "A" "A" "C" "C" "A" "C" "B" "C" "C" "C" "D" "C" "C" "B" "C"
[35] "B" "C" "C" "C" "C" "D" "A" "B" "B" "C" "A" "A" "C" "C" "C" "C" "C"
[52] "D" "C" "C" "C" "C" "C" "C" "C" "C" "D" "A" "C" "B" "C" "A" "B" "C"
[69] "C" "C" "C" "A" "C" "B" "C"
 [ reached getOption("max.print") -- omitted 925 entries ]

# Equally good, but perhaps less straightforward
x <- rep(c("A","B","C","D"), 10000*c(0.1,0.2,0.65,0.05))
x

 [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[18] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[35] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[52] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[69] "A" "A" "A" "A" "A" "A" "A"
 [ reached getOption("max.print") -- omitted 9925 entries ]

# But how do I know it is right? table its results and check the exact proportions
prop.table(table(x))

x
   A    B    C    D 
0.10 0.20 0.65 0.05

Solution 3: Brute force (reversed thinking?)

x <- runif(10000) # Why use runif()? Hint: what it is the function's default?

# Now we rename x values
x[x <= 0.1]              <- "A"
x[x >  0.1  & x <= 0.3]  <- "B"
x[x >  0.3  & x <= 0.95] <- "C"
x[x >  0.95 & x <= 1]    <- "D"

# Check
prop.table(table(x))

x
     A      B      C      D 
0.1014 0.1997 0.6482 0.0507