In this lesson we learn about datasets: how to create them and how to manipulate them.

14. Datasets

14.1 Creating data

It is often useful to know how to create a dataset in R. We created one earlier, and we reproduce the code here. A dataset can be created in R with the function data.frame(), which we fill with the variables we would like to include. We conveniently named our dataset data. Inside data.frame(), each variable gets its own name, and we assign the values we want to that name.

We use set.seed() because it makes all ‘pseudo-random’ functions in R yield the same results on different computers. Its purpose is reproducibility: without it, each of you would get different pseudo-random values. Cool, right?

set.seed(1234)

data <- data.frame(                                             
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan", 
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = rpois(10, 30),
  Height  = rnorm(10, 170, 10),
  Weight  = rnorm(10, 70, 15),
  Gender  = gl(2, 5, labels = c("Male", "Female")),
  Courses = rpois(10, 2)
)
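To see what set.seed() buys us, here is a minimal sketch that resets the seed and draws again. The seed value 42 and the object names are arbitrary choices for this illustration and are not part of the lesson's dataset.

```r
# Draw five values, reset the seed, and draw again.
set.seed(42)             # 42 is an arbitrary seed chosen for this sketch
first_draw <- rnorm(5)

set.seed(42)             # same seed again
second_draw <- rnorm(5)

# Same seed, same numbers
identical(first_draw, second_draw)

# Without resetting the seed, the random stream simply continues
third_draw <- rnorm(5)
identical(first_draw, third_draw)
```

The first comparison should be TRUE and the second FALSE, which is why the seed is set right before the dataset is created.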

Now we print our dataset in the R console with the function print(). Typing the name of the object on its own, data, gives the same output.

print(data)
      Names Age   Height   Weight Gender Courses
1      Alan  23 170.6446 76.89384   Male       1
2     Brian  31 179.5949 59.59420   Male       2
3    Carlos  31 168.8971 48.27693   Male       0
4    Dalton  25 164.8899 78.62134   Male       2
5     Ethan  32 160.8880 54.64516   Male       0
6     Flora  26 161.6283 69.77293 Female       4
7      Gaia  35 194.1584 55.96077 Female       0
8     Helen  26 171.3409 86.53446 Female       3
9    Ingrid  27 165.0931 62.86610 Female       0
10 Jennifer  20 165.5945 59.35840 Female       2
data
      Names Age   Height   Weight Gender Courses
1      Alan  23 170.6446 76.89384   Male       1
2     Brian  31 179.5949 59.59420   Male       2
3    Carlos  31 168.8971 48.27693   Male       0
4    Dalton  25 164.8899 78.62134   Male       2
5     Ethan  32 160.8880 54.64516   Male       0
6     Flora  26 161.6283 69.77293 Female       4
7      Gaia  35 194.1584 55.96077 Female       0
8     Helen  26 171.3409 86.53446 Female       3
9    Ingrid  27 165.0931 62.86610 Female       0
10 Jennifer  20 165.5945 59.35840 Female       2

We can also print only the first three observations, to inspect big datasets.

head(data, 3)
   Names Age   Height   Weight Gender Courses
1   Alan  23 170.6446 76.89384   Male       1
2  Brian  31 179.5949 59.59420   Male       2
3 Carlos  31 168.8971 48.27693   Male       0

Alternatively, we can print the last four rows to check that the data were created correctly.

tail(data, 4)
      Names Age   Height   Weight Gender Courses
7      Gaia  35 194.1584 55.96077 Female       0
8     Helen  26 171.3409 86.53446 Female       3
9    Ingrid  27 165.0931 62.86610 Female       0
10 Jennifer  20 165.5945 59.35840 Female       2
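As a small sketch of how head() and tail() behave on a longer dataset, the example below uses a made-up data frame called toy with 20 rows (the name and its columns are assumptions for illustration only). When the second argument is omitted, both functions return six rows.

```r
# A hypothetical 20-row data frame, just for this illustration
toy <- data.frame(id = 1:20, value = (1:20)^2)

head(toy)          # no second argument: the first 6 rows
tail(toy, 2)       # the last 2 rows

# Count how many rows each call returns
n_head <- nrow(head(toy))
n_tail <- nrow(tail(toy, 2))
n_head
n_tail
```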

Importantly, you can explore your data in a new window with the command View(). Note that there are two buttons on the left-hand side: one that pops the data up in a new window, and another that you can use to filter the cases you are interested in inspecting.

View(data)

Important note: all variables must have the same number of observations; in our case, 10. If I supplied one fewer name (say I forgot to include Jennifer), R would give an error and the dataset would not be created. In that case the following error would be returned:

Error in data.frame(Names = c("Alan", "Brian", "Carlos", "Dalton", "Ethan", : arguments imply differing number of rows: 9, 10

R is telling you where the error is occurring (in the call to data.frame()) and also its reason: “arguments imply differing number of rows: 9, 10”. In my eyes, it even gives you the solution: one variable has nine rows, but we need 10 to make it work.
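The same error can be reproduced safely on a smaller scale. The sketch below uses two made-up vectors of different lengths (short_names and ages are hypothetical names) and wraps the call in tryCatch() so that the error message is captured as text instead of stopping the script.

```r
# Two hypothetical variables with different numbers of observations
short_names <- c("Alan", "Brian")      # 2 values
ages        <- c(23, 31, 31)           # 3 values

# Checking the lengths first shows the mismatch before data.frame() complains
lengths(list(Names = short_names, Age = ages))

# tryCatch() returns the error message as a character string
result <- tryCatch(
  data.frame(Names = short_names, Age = ages),
  error = function(e) conditionMessage(e)
)
result
```

Comparing lengths before building the data frame is a quick way to find which variable is missing a value.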

14.2 Indexing [or Data Manipulation]

Indexing means selecting (subsetting) parts of R objects. Let’s use our data example to learn about indexing. There are many ways to do this. We will see several methods here, although the list is not exhaustive.

If we want to select only the number of R courses people have attended in the past, we place the operator ‘$’ between the name of the data.frame and the name of the variable. (Note: for this type of subsetting it is important that there are no spaces in the variable’s name. Instead of “Students names”, choose either Students.names or Students_names.)

data$Courses
 [1] 1 2 0 2 0 4 0 3 0 2

Another way to do the same thing is to refer to the dimensions of the data. With this method we use square brackets in R: [] or [[]]. We know our data has 10 rows and 6 columns, so its dimensions are 10 x 6. Inside the brackets, the row index goes before the comma and the column index after it. So, if I want only Courses, which is the sixth column, I can select it by typing:

data[, 6]
 [1] 1 2 0 2 0 4 0 3 0 2

In the same way, if we are interested in all the data collected for Carlos, we can select it by typing:

data[3, ]
   Names Age   Height   Weight Gender Courses
3 Carlos  31 168.8971 48.27693   Male       0

Consequently, if we want to know how many R courses Carlos has taken in the past, we can select it by typing:

data[3, 6]
[1] 0
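Columns can also be selected by name instead of by position, which keeps working if the column order changes. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in for data) to compare single and double square brackets.

```r
# A hypothetical three-row data frame, standing in for the lesson's data
toy <- data.frame(
  Names   = c("Alan", "Brian", "Carlos"),
  Age     = c(23, 31, 31),
  Courses = c(1, 2, 0)
)

toy[, "Courses"]             # column by name: a plain vector
toy[["Courses"]]             # double brackets: the same vector
toy["Courses"]               # single brackets, no comma: a one-column data frame
toy[3, "Courses"]            # row 3, column Courses
toy[2:3, c("Names", "Age")]  # rows 2 to 3, two columns by name
```

The practical difference is the type of the result: toy[["Courses"]] gives the values themselves, while toy["Courses"] keeps the data frame structure.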

Now let’s expand this a little by using logical conditions to select more complex subsets of data. For example, we may be interested in subsetting the data for females, or for those older than 25 years. Let’s start by selecting the observations for females. We need to tell R that we want all rows for which the value of Gender is equal to Female. Note that equality is tested with a double equals sign. We translate this in the following way: data$Gender == "Female".

data[data$Gender == "Female", ]
      Names Age   Height   Weight Gender Courses
6     Flora  26 161.6283 69.77293 Female       4
7      Gaia  35 194.1584 55.96077 Female       0
8     Helen  26 171.3409 86.53446 Female       3
9    Ingrid  27 165.0931 62.86610 Female       0
10 Jennifer  20 165.5945 59.35840 Female       2

If we want those older than 25 years:

data[data$Age > 25, ]
   Names Age   Height   Weight Gender Courses
2  Brian  31 179.5949 59.59420   Male       2
3 Carlos  31 168.8971 48.27693   Male       0
5  Ethan  32 160.8880 54.64516   Male       0
6  Flora  26 161.6283 69.77293 Female       4
7   Gaia  35 194.1584 55.96077 Female       0
8  Helen  26 171.3409 86.53446 Female       3
9 Ingrid  27 165.0931 62.86610 Female       0

And if we want both, female and older than 25 years:

data[data$Gender == "Female" & data$Age > 25, ]
   Names Age   Height   Weight Gender Courses
6  Flora  26 161.6283 69.77293 Female       4
7   Gaia  35 194.1584 55.96077 Female       0
8  Helen  26 171.3409 86.53446 Female       3
9 Ingrid  27 165.0931 62.86610 Female       0
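It can help to look at the logical vectors on their own before using them inside the brackets. The following sketch uses a made-up four-row data frame (toy, is_female and over_25 are hypothetical names) and also shows the ‘|’ (or) and ‘!’ (not) operators alongside ‘&’.

```r
# A hypothetical four-row data frame for this illustration
toy <- data.frame(
  Names  = c("Alan", "Brian", "Flora", "Jennifer"),
  Age    = c(23, 31, 26, 20),
  Gender = c("Male", "Male", "Female", "Female")
)

is_female <- toy$Gender == "Female"   # one TRUE/FALSE per row
over_25   <- toy$Age > 25
is_female
over_25

both   <- toy[is_female & over_25, ]  # both conditions hold
either <- toy[is_female | over_25, ]  # at least one condition holds
males  <- toy[!is_female, ]           # negation of the first condition

both$Names
either$Names
males$Names
```

Each row is kept only where the combined condition is TRUE, so ‘&’ narrows the selection and ‘|’ widens it.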

Another method is to use the subset() function in R. The subset function is available in base R and can be used to return subsets of a vector, matrix, or data frame which meet a particular condition.

subset(data, Courses > 1)
      Names Age   Height   Weight Gender Courses
2     Brian  31 179.5949 59.59420   Male       2
4    Dalton  25 164.8899 78.62134   Male       2
6     Flora  26 161.6283 69.77293 Female       4
8     Helen  26 171.3409 86.53446 Female       3
10 Jennifer  20 165.5945 59.35840 Female       2
subset(data, Height < 178 & Weight > 65)
   Names Age   Height   Weight Gender Courses
1   Alan  23 170.6446 76.89384   Male       1
4 Dalton  25 164.8899 78.62134   Male       2
6  Flora  26 161.6283 69.77293 Female       4
8  Helen  26 171.3409 86.53446 Female       3
subset(data, Age > 29)
   Names Age   Height   Weight Gender Courses
2  Brian  31 179.5949 59.59420   Male       2
3 Carlos  31 168.8971 48.27693   Male       0
5  Ethan  32 160.8880 54.64516   Male       0
7   Gaia  35 194.1584 55.96077 Female       0
subset(data, Age < 23 | Courses > 5)
      Names Age   Height  Weight Gender Courses
10 Jennifer  20 165.5945 59.3584 Female       2
subset(data, Names == "Gaia")
  Names Age   Height   Weight Gender Courses
7  Gaia  35 194.1584 55.96077 Female       0
subset(data, Gender == "Male", select = c(Age, Courses))
  Age Courses
1  23       1
2  31       2
3  31       0
4  25       2
5  32       0
subset(data, Courses != 0 & Gender == "Female" & Age > 24)
  Names Age   Height   Weight Gender Courses
6 Flora  26 161.6283 69.77293 Female       4
8 Helen  26 171.3409 86.53446 Female       3
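A subset() call can be rewritten with square brackets, which shows that the two methods are doing the same job. The sketch below assumes a small made-up data frame (toy) and compares both forms; with subset(), column names are written without the data frame prefix or quotes.

```r
# A hypothetical four-row data frame for this illustration
toy <- data.frame(
  Names   = c("Alan", "Brian", "Flora", "Gaia"),
  Age     = c(23, 31, 26, 35),
  Gender  = c("Male", "Male", "Female", "Female"),
  Courses = c(1, 2, 4, 0)
)

# Rows: males older than 24. Columns: Names and Courses.
with_subset  <- subset(toy, Gender == "Male" & Age > 24,
                       select = c(Names, Courses))
with_bracket <- toy[toy$Gender == "Male" & toy$Age > 24,
                    c("Names", "Courses")]

with_subset
all.equal(with_subset, with_bracket)  # TRUE when both give the same result
```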

14.3 Indexing vectors

Remember that we created three vectors called n.R.courses, height and height.female in lessons 04 and 05? Let’s use these to learn how to index vectors.

If I would like to select my first observation of n.R.courses, I make use of square brackets.

n.R.courses[1]
[1] 1

If I would like to select my first and my second observations of n.R.courses, I make use of the ‘:’ operator, which denotes sequence. So, for R, I am effectively telling it I want observations from 1 to 2.

n.R.courses[1:2]
[1] 1 2

What if I would like to select my first and my third observations of n.R.courses? Since ‘:’ cannot be used here, we combine the wanted positions with the c() function.

n.R.courses[c(1, 3)]
[1] 1 0

What if I would like to select my first observation and my third through fifth observations of n.R.courses?

n.R.courses[c(1, 3:5)]
[1] 1 0 2 0

What if I would like to know which of the observed heights are taller than 170 cm? In this case we can make use of logical operators. As we saw earlier, logical operations return either TRUE or FALSE, so we can use them to ‘select’ the observations we want given a condition. We just write the condition as before:

height > 170
[1] FALSE  TRUE  TRUE FALSE  TRUE

Then, we put this condition inside the square brackets after the vector’s name, which returns the values we want.

height[height > 170]
[1] 172.8 180.8 174.3

Inside R, this is equivalent to:

height[c(FALSE, TRUE, TRUE, FALSE, TRUE)]
[1] 172.8 180.8 174.3

Now, what if I would like to know the positions of the observed heights that are taller than 170 cm?

which(height > 170)
[1] 2 3 5
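The pieces above fit together as in the following sketch. The vector h is a made-up stand-in for height: its second, third and fifth values match the output shown above, while the other two values are invented for this illustration.

```r
# Hypothetical heights; positions 1 and 4 are invented values
h <- c(165.0, 172.8, 180.8, 168.0, 174.3)

taller <- h > 170        # logical vector, one entry per observation
positions <- which(taller)

h[taller]                # select with the logical vector
h[positions]             # select with the positions: same result
sum(taller)              # TRUE counts as 1, so this counts the matches
```

Logical indexing and which() lead to the same values; which() is handy when the positions themselves are of interest.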

14.4 Understanding your data

Let’s assume we received this data from someone, and we have to analyze it. One of the first things you need to do is to understand the basic characteristics of your data. How many variables/columns does it have? How many observations? What are the characteristics of each variable? And so on.

Dimensions

Here you are asking R to return the number of rows and columns.

dim(data)
[1] 10  6
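If only one of the two dimensions is needed, nrow() and ncol() return them separately. A minimal sketch, assuming a made-up data frame called toy:

```r
# A hypothetical data frame with 3 rows and 2 columns
toy <- data.frame(
  Names = c("Alan", "Brian", "Carlos"),
  Age   = c(23, 31, 31)
)

dim(toy)      # rows first, then columns
nrow(toy)     # number of observations
ncol(toy)     # number of variables
length(toy)   # for a data frame, this is also the number of columns
```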

Variable names

Here you are asking R to return the names of each variable in your dataset.

names(data)
[1] "Names"   "Age"     "Height"  "Weight"  "Gender"  "Courses"
colnames(data)
[1] "Names"   "Age"     "Height"  "Weight"  "Gender"  "Courses"

Variable classes

Here you get the class of every variable (column) in your dataset.

sapply(data, class)
    Names       Age    Height    Weight    Gender   Courses 
 "factor" "integer" "numeric" "numeric"  "factor" "integer" 

Structure of the data

Here you are asking R to return the characteristics of each variable in your dataset. It shows a compact summary of each variable along with its first observations. For example, it tells us that there are 10 obs. of 6 variables. It also lists each variable by name and displays its class. Gender, for example, is a factor with two levels, one for Male and another for Female. Levels are the different possible values of a factor. We could make a factor out of Height with levels “tall” and “short”. Names, in turn, has 10 levels because it contains only unique entries (the different names of our hypothetical participants). Note that the output below was produced with strings converted to factors, which was the default of data.frame() before R 4.0.0; from R 4.0.0 onward, Names is kept as a character (chr) variable unless you set stringsAsFactors = TRUE.

str(data)
'data.frame':  10 obs. of  6 variables:
 $ Names  : Factor w/ 10 levels "Alan","Brian",..: 1 2 3 4 5 6 7 8 9 10
 $ Age    : int  23 31 31 25 32 26 35 26 27 20
 $ Height : num  171 180 169 165 161 ...
 $ Weight : num  76.9 59.6 48.3 78.6 54.6 ...
 $ Gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 2 2 2 2 2
 $ Courses: int  1 2 0 2 0 4 0 3 0 2
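The idea of turning Height into a factor with levels “tall” and “short” can be sketched as follows. The heights are made-up values and the 170 cm cut-off is an arbitrary assumption for this illustration; the lesson itself does not define one.

```r
# Hypothetical heights in cm
heights <- c(170.6, 179.6, 168.9, 164.9, 160.9)

# Label each height, then fix the order of the levels explicitly
height_group <- factor(
  ifelse(heights > 170, "tall", "short"),   # 170 is an assumed cut-off
  levels = c("short", "tall")
)

levels(height_group)     # the possible values of the factor
table(height_group)      # how many observations fall in each level

# Put both versions side by side and inspect them
toy <- data.frame(Height = heights, Group = height_group)
str(toy)
classes <- sapply(toy, class)
classes
```

Here str() reports Height as num and Group as a factor with two levels, in the same layout as the output above.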

Summaries of your data

summary(data)
     Names        Age            Height          Weight         Gender 
 Alan   :1   Min.   :20.00   Min.   :160.9   Min.   :48.28   Male  :5  
 Brian  :1   1st Qu.:25.25   1st Qu.:164.9   1st Qu.:56.81   Female:5  
 Carlos :1   Median :26.50   Median :167.2   Median :61.23             
 Dalton :1   Mean   :27.60   Mean   :170.3   Mean   :65.25             
 Ethan  :1   3rd Qu.:31.00   3rd Qu.:171.2   3rd Qu.:75.11             
 Flora  :1   Max.   :35.00   Max.   :194.2   Max.   :86.53             
 (Other):4                                                             
    Courses   
 Min.   :0.0  
 1st Qu.:0.0  
 Median :1.5  
 Mean   :1.4  
 3rd Qu.:2.0  
 Max.   :4.0  
              
# Best for continuous variables
psych::describe(data[, c("Age", "Height", "Weight", "Courses")])
        vars  n   mean    sd median trimmed   mad    min    max range skew
Age        1 10  27.60  4.58  26.50   27.62  5.93  20.00  35.00 15.00 0.01
Height     2 10 170.27 10.01 167.25  168.46  5.56 160.89 194.16 33.27 1.25
Weight     3 10  65.25 12.23  61.23   64.71 11.21  48.28  86.53 38.26 0.35
Courses    4 10   1.40  1.43   1.50    1.25  2.22   0.00   4.00  4.00 0.39
        kurtosis   se
Age        -1.29 1.45
Height      0.48 3.16
Weight     -1.39 3.87
Courses    -1.37 0.45
# Best for categorical variables
Hmisc::describe(data[, c("Names", "Gender", "Courses")])
data[, c("Names", "Gender", "Courses")] 

 3  Variables      10  Observations
---------------------------------------------------------------------------
Names 
      n missing  unique 
     10       0      10 

          Alan Brian Carlos Dalton Ethan Flora Gaia Helen Ingrid Jennifer
Frequency    1     1      1      1     1     1    1     1      1        1
%           10    10     10     10    10    10   10    10     10       10
---------------------------------------------------------------------------
Gender 
      n missing  unique 
     10       0       2 

Male (5, 50%), Female (5, 50%) 
---------------------------------------------------------------------------
Courses 
      n missing  unique    Info    Mean 
     10       0       5    0.92     1.4 

           0  1  2  3  4
Frequency  4  1  3  1  1
%         40 10 30 10 10
---------------------------------------------------------------------------

Contingency tables

xtabs(~Gender, data = data)
Gender
  Male Female 
     5      5 
xtabs(~Names, data = data)
Names
    Alan    Brian   Carlos   Dalton    Ethan    Flora     Gaia    Helen 
       1        1        1        1        1        1        1        1 
  Ingrid Jennifer 
       1        1 
xtabs(~Courses, data = data)
Courses
0 1 2 3 4 
4 1 3 1 1 

Two-way contingency tables for categorical data

xtabs(~Courses + Gender, data = data)
       Gender
Courses Male Female
      0    2      2
      1    1      0
      2    2      1
      3    0      1
      4    0      1

This recipe creates a useful table showing the number of instances that belong to each class as well as the percentage that this represents from the entire dataset.

cbind(Frequencies = table(data$Gender),
      Percentage = prop.table(table(data$Gender))*100
      )
       Frequencies Percentage
Male             5         50
Female           5         50
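The same recipe extends to two-way tables. In the sketch below, prop.table() is given margin = 1 so that percentages are computed within each row. The data frame toy is made up for this illustration.

```r
# A hypothetical five-row data frame
toy <- data.frame(
  Gender  = c("Male", "Male", "Male", "Female", "Female"),
  Courses = c(0, 2, 2, 0, 4)
)

counts <- xtabs(~ Gender + Courses, data = toy)
counts

# margin = 1: each row (each gender) sums to 100
row_pct <- prop.table(counts, margin = 1) * 100
round(row_pct, 1)
rowSums(row_pct)
```

Using margin = 2 instead would give percentages within each column, and omitting margin gives percentages of the whole table.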