09_Importing_Data_in

In this lesson we learn about how to import data into R. This lesson was inspired on Datacamp’s manuals on loading data in R, which you can find here and here

Remember: In this lesson, all R code is below the R Source button, while the output is hidden. To see it you will need to click on the green “R Output” button. At the bottom of the page, you will find a bar which allows you to change the theme of the webpage (changing colors and format) so it can easily adapt to your system and preferences. There you also find “Code highlighting” which changes how R code is displayed to you, and Toggle R code and Figures.

15. Importing data into R

Instead of creating your dataset - as we did in the last lesson - it is often necessary to import data into R, so you start working on your analysis. Before we move on and learn how to load data into R, it might be useful to go over the below checklist which can make it easier to import the data correctly, and potentially avoid loading problems.

In spreadsheets, the first row is usually reserved for the header. while the first column is used to identify the sampling unit (e.g., persons, cases).
While naming variables you want to avoid names, values or fields with blank spaces. If you do this, you incur in the risk of each word being interpreted as a separate variable, resulting in errors that are related to the number of elements per line in your data set.
If you want to concatenate words, try inserting a ‘.’ or a ’_’ in between to words instead of a space.
Short names are preferred over longer names
Try to avoid using names that are known functions such as mean, t, sd, log, etc.
Try to avoid using names that contain symbols such as ?, $,%, ^, &, *, (, ),-,?,,,<,>, /, |, , [ ,] ,{, }, and #.
Delete any comments that you have made in your Excel file to avoid extra columns or NA’s to be added to your file
Make sure that any missing values in your data set are indicated with NA, or that you specify this when loading the data.
If you have other statistical software that you use - and since industry standard softwares such as SPSS, SAS, STATA, MPlus are all capable of saving data in .csv format - perhaps the easiest way to load data in R is through saving the dataset into a .csv format.

15.1 Preparing Your R Workspace

In R, you want to make sure that the Enviroment and loaded packages from another session does not - by accident - interfere with the current analysis. So, it might be a good idea to start with a clean slate. One way of doing this is by typing in the below code in your console.

# This prompts garbage collector to take place.  This is useful to prompt R
# to return memory to the operating system.  gc() also return a summary of
# the occupy memory
gc()

# Restarts the R session for a clean slate
.rs.restartR()

# Unloads all previously loaded packages It may give an error in the lapply
# if there were no loaded packages
pkgs <- names(sessionInfo()$otherPkgs)
pkgs <- paste("package:", pkgs, sep = "")
lapply(pkgs, detach, character.only = TRUE, unload = TRUE)

# This removes all R objects from the environment
rm(list = ls(all = TRUE))

# And this just cleans the console for a fresh start
cat("\f")

Next, you might also find it handy to know where your working directory is set at the moment:

getwd()

And you might consider changing the path that you get as a result of this function, maybe to the folder in which you have stored your data set (note the bars are inverted):

setwd("C:/Users/myusername/folder/R Workshop/Datasets")

15.1 Datasets

For the purposes of this lessons, some datasets are available for your to download, place in your working directory and load in R. Please follow this link.

15.2 Load in R .txt and .dat

For these file extensions you want to use the read.table() function

dat <- read.table("pbcseq.dat")

You also want to check if the number of columns is the same for all rows.

cf <- count.fields("pbcseq.dat")
cf
# unique(cf) find number of unique observations on your vector all(cf == 19)
# Alternatevely, check if all are 19

15.3 Coding Missing data

Below we learn how to define “Missing data” while loading data in, For example, in our file pbcseq.dat missing data was codded as “.”. In R, however, missing is coded as “NA”s, so we need to make sure to tell our read.table() function about that difference.

dat <- read.table("pbcseq.dat", na.strings = ".")

15.3.1 Make sure to include the header

dat <- read.table("pbcseq.dat", header = TRUE, na.strings = ".")

15.4 Load in R *.csv

dat <- read.csv("aids.csv")

15.5 Load in R a dataset file from a url

read.table("http://flavioazevedo.github.io/blog/R-Introduction-Course/Datasets/sexualat.dat")

15.5 Load in R Excel .xls Files

Quite frequently, the sample data is in Excel format, and needs to be imported into R prior to use. For this, we can use the function read.xls from the gdata package. It reads from an Excel spreadsheet and returns a data frame. The following shows how to load an Excel spreadsheet named “mydata.xls”. This method requires Perl runtime to be present in the system, which you can download from this link.

library(gdata)  # load gdata package 
dat <- read.xls("mydata.xls")  # read from first sheet
# help(read.xls) # documentation

Alternatively, we can use the function loadWorkbook from the XLConnect package to read the entire workbook, and then load the worksheets with readWorksheet. The XLConnect package requires Java to be pre-installed, which you can download from this link.

library(XLConnect)
wk <- loadWorkbook("mydata.xls")
df <- readWorksheet(wk, sheet = "Sheet1")

15.5 Load in R SPSS Files

For the data files in SPSS format, it can be opened with the function read.spss from the foreign package. There is a “to.data.frame” option for choosing whether a data frame is to be returned. By default, it returns a list of components instead.

library(foreign)  # load the foreign package 
dat <- read.spss("myfile", to.data.frame = TRUE)
# help(read.spss) # documentation

If you would like to know more about loading data into R, see this link.

The official CRAN manual on importing/exporting data can be found here

16 Exporting Data

There are numerous methods for exporting R objects into other formats . For SPSS, SAS and Stata. you will need to load the foreign packages. For Excel, you will need the xlsReadWrite package.

16.1 To A Tab Delimited Text File

write.table(mydata, "c:/mydata.txt", sep = "\t")

16.2 To an Excel Spreadsheet

library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")

16.3 To SPSS

library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package = "SPSS")

16.4 To SAS

library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package = "SAS")

16.5 To Stata

library(foreign)
write.dta(mydata, "c:/mydata.dta")

16.6 .RDA or RData

If your data file is one that you have saved in R as an .rdata file, you can read it in as follows:

load(".RDA")

In RStudio, you have the option to download data via the menu as well. A good - and visual - tutorial on how to do it can be found here