1 tidyverse: collection of R packages for EDA

We introduce the tidyverse package and some basic functions in its sub-packages for exploratory data analysis (EDA). For more details, please see https://www.tidyverse.org/. This section is based on Dr. Bradley Boehmke’s short course for MSBA students at Lindner College of Business. The course materials can be downloaded from here.

install.packages("tidyverse")

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1.1 Data Manipulation with dplyr

1.1.1 Filtering and Indexing

We introduce the dplyr package, which provides some very user-friendly functions for data manipulation. These functions are:

filter()
select()
arrange()
rename()
mutate()

1.1.1.1 Filtering (Subsetting) data

Here I introduce four ways to get subsets of data that satisfy certain logical conditions: subset(), logical vectors, SQL, and filter(). This kind of operation is called filtering in Excel. Knowing any one of these well is enough. Do not worry about memorizing the syntax; you can always look it up.

Suppose we want to get the observations that have Sepal.Length > 5 and Sepal.Width > 4. We can use logical operators: != not equal to; == equal to; | or; & and.

Use subset function

data(iris)
subset(x = iris, subset = Sepal.Length > 5 & Sepal.Width > 4)

You can omit the argument names x = and subset =

subset(iris, Sepal.Length > 5 & Sepal.Width > 4)

Use logical vectors

iris[(iris$Sepal.Length > 5 & iris$Sepal.Width > 4), ]
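To see what the logical-vector approach is doing, it can help to build the vector separately before using it as a row index. The following is a minimal sketch; the names keep, by_vector, and by_subset are only for illustration.

```r
data(iris)

# One TRUE/FALSE value per row of iris
keep <- iris$Sepal.Length > 5 & iris$Sepal.Width > 4
length(keep)   # same as nrow(iris)
sum(keep)      # number of rows that satisfy both conditions

# Use the logical vector as the row index
by_vector <- iris[keep, ]

# subset() should give the same rows
by_subset <- subset(iris, Sepal.Length > 5 & Sepal.Width > 4)
all.equal(by_vector, by_subset)
```

Storing the logical vector first also makes it easy to reuse the same condition, for example to count matching rows without creating a new data frame.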

Use SQL statement

install.packages('sqldf')
library(sqldf)
sqldf('select * from iris where `Sepal.Length` > 5 and `Sepal.Width` > 4')

In earlier versions of sqldf, all dots (.) in variable names had to be changed to underscores (_).

filter() is a powerful function in the dplyr package that performs filtering much like the Filter feature in Excel.

# filter by row observations
data(iris)
iris_filter <- filter(iris, Sepal.Length <= 5 & Sepal.Width > 3)
iris_filter2 <- filter(iris, Species == "setosa", Sepal.Width <= 3 | Sepal.Width >= 4)
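Two details of filter() are worth checking for yourself: conditions separated by commas are combined with "and", and rows where the condition evaluates to NA are dropped, whereas indexing with a logical vector keeps them as NA rows. Here is a small sketch; the data frame df is made up for illustration.

```r
library(dplyr)
data(iris)

# A comma between conditions behaves like &
with_comma <- filter(iris, Sepal.Length > 5, Sepal.Width > 4)
with_and   <- filter(iris, Sepal.Length > 5 & Sepal.Width > 4)
identical(with_comma, with_and)

# Made-up data with a missing value
df <- data.frame(x = c(1, NA, 3))

# filter() drops the row where x > 1 is NA
kept_by_filter <- filter(df, x > 1)

# Logical indexing returns an extra all-NA row for it
kept_by_index <- df[df$x > 1, , drop = FALSE]

nrow(kept_by_filter)
nrow(kept_by_index)
```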

1.1.1.2 Subsetting the Dataset: Random Sample

The following code randomly samples (without replacement) 90% of the rows of the original dataset and assigns them to a new variable, iris_sample.

iris_sample <- iris[sample(x = nrow(iris), size = nrow(iris)*0.90),]

The dplyr package provides more convenient ways of generating random samples. You can take a fixed number of rows using sample_n() or a fraction of the rows using sample_frac(), as follows:

install.packages('dplyr')
library(dplyr)
iris_sample <- sample_frac(iris, 0.9)

# using dplyr for logical subsetting
filter(iris, Sepal.Length> 5, Sepal.Width > 4)
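Random samples change every time the code is run. If you need the same sample again (for example, to reproduce a result), you can fix the random seed first. The sketch below assumes an arbitrary seed of 123 and uses the illustrative names train and holdout for the sampled rows and the remaining rows.

```r
library(dplyr)
data(iris)

# Same seed, same sample (the seed value itself is arbitrary)
set.seed(123)
iris_sample_a <- sample_frac(iris, 0.9)
set.seed(123)
iris_sample_b <- sample_frac(iris, 0.9)
identical(iris_sample_a, iris_sample_b)

# Base R version that also keeps the rows that were NOT sampled
set.seed(123)
idx <- sample(x = nrow(iris), size = round(nrow(iris) * 0.90))
train   <- iris[idx, ]    # the 90% sample
holdout <- iris[-idx, ]   # the remaining 10%
nrow(train)
nrow(holdout)
```

Keeping the index vector idx is what makes it possible to recover the complementary rows with a negative index.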

I recommend that you go through the dplyr tutorial and the lubridate tutorial. These packages make common data manipulation tasks and working with dates and times much easier in R.

1.1.1.3 Sorting

Sorting by one or more variables is a common operation on datasets. With RStudio version 0.99+, you can sort a dataset while viewing it by clicking a column header.

To do it with code, let’s suppose that you would like to find the 5 rows in the iris dataset with the largest Sepal.Length.

iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ]

The syntax is cleaner with the arrange() function in the dplyr package:

arrange(iris, desc(Sepal.Length))[1:5, ]
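As a quick check that the two approaches agree, you can compare the sorted values they return. This is only a sketch; head() is used here in place of [1:5, ] as one possible way to take the first rows, and the names top5_base and top5_dplyr are illustrative.

```r
library(dplyr)
data(iris)

# Base R: order() returns row positions, largest Sepal.Length first
top5_base <- iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ]

# dplyr: sort in descending order, then keep the first 5 rows
top5_dplyr <- head(arrange(iris, desc(Sepal.Length)), 5)

# Both should contain the same five largest values
top5_base$Sepal.Length
top5_dplyr$Sepal.Length
```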

1.1.1.4 Select columns

If you want to select one or more variables of a data frame, there are two ways to do it. The first is indexing with “[]”; the second is the select() function in dplyr. For example, suppose we want to select the variable “Sepal.Length”:

iris[, "Sepal.Length"]

or alternatively select two variables: “Sepal.Length”, “Sepal.Width”

iris[, c("Sepal.Length", "Sepal.Width")]

On the other hand, select() in the dplyr package can be used to filter by column, i.e., to keep or drop variables.

# Keep the variables Sepal.Length and Sepal.Width
varname <- c("Sepal.Length", "Sepal.Width")
# all_of() marks varname as an external character vector of column names;
# a bare select(iris, varname) is ambiguous and triggers a tidyselect note
iris_select <- select(iris, all_of(varname))

# verify if we did correctly
names(iris_select)

## [1] "Sepal.Length" "Sepal.Width"

# This is equivalent to 
iris_select <- iris[,varname]

What about dropping variables?

iris_select2 <- select(iris, -Sepal.Length, -Petal.Length, -Species)
names(iris_select2)

## [1] "Sepal.Width" "Petal.Width"

This is equivalent to

varname <- c("Sepal.Length", "Petal.Length", "Species")
iris_select2 <- iris[,!names(iris) %in% varname]
names(iris_select2)

## [1] "Sepal.Width" "Petal.Width"
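select() also accepts helper functions that match columns by name, which saves typing when several variables share a prefix. The sketch below uses starts_with() and all_of(); the object names sepal_cols, drop_vars, and iris_dropped are only for illustration.

```r
library(dplyr)
data(iris)

# Keep every column whose name starts with "Sepal"
sepal_cols <- select(iris, starts_with("Sepal"))
names(sepal_cols)
## [1] "Sepal.Length" "Sepal.Width"

# Drop columns whose names are stored in a character vector
drop_vars <- c("Sepal.Length", "Petal.Length", "Species")
iris_dropped <- select(iris, -all_of(drop_vars))
names(iris_dropped)
## [1] "Sepal.Width" "Petal.Width"
```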

1.1.1.5 Exercise

It is even easier if you know the positions of the variables that you want to keep or drop. Try to obtain iris_select and iris_select2 by using numeric column indices, i.e., “dataname[, variable_index]”.

1.1.2 Re-ordering columns and sorting rows

# re-ordering the columns
iris_order <- select(iris, Species, Petal.Width, everything())
names(iris_order)

## [1] "Species"      "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Petal.Length"

# sorting rows by a particular variable
iris_sort <- arrange(iris, Sepal.Length)
# sorting by more than one variable
iris_sort2 <- arrange(iris, Sepal.Length, Sepal.Width)
# descending order
iris_sort_desc <- arrange(iris, desc(Sepal.Length))

Note that missing values are always sorted at the end.
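The iris dataset has no missing values, so here is a small made-up data frame (scores, purely hypothetical) to show this behavior of arrange() in both ascending and descending order.

```r
library(dplyr)

# Made-up data with one missing value in x
scores <- data.frame(id = 1:4, x = c(2, NA, 3, 1))

# Ascending: the NA row comes last
asc <- arrange(scores, x)
asc$x
## [1]  1  2  3 NA

# Descending: the NA row still comes last
dsc <- arrange(scores, desc(x))
dsc$x
## [1]  3  2  1 NA
```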

1.1.3 Renaming variable

# rename() uses the form new_name = old_name
iris_rename <- rename(iris, SL = Sepal.Length, SW = Sepal.Width)
names(iris_rename)

## [1] "SL"           "SW"           "Petal.Length" "Petal.Width"  "Species"

1.1.4 Creating New Variables

# add a new column: the ratio of sepal length to sepal width
iris_newvar <- mutate(iris, Sepal.L_W = Sepal.Length / Sepal.Width)
names(iris_newvar)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"      "Sepal.L_W"
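mutate() can also create several variables in one call, and a later expression may use a variable created earlier in the same call. A minimal sketch follows; the new column names Petal.L_W and Ratio.Diff are made up for illustration.

```r
library(dplyr)
data(iris)

iris_more <- mutate(
  iris,
  Sepal.L_W  = Sepal.Length / Sepal.Width,   # sepal length-to-width ratio
  Petal.L_W  = Petal.Length / Petal.Width,   # petal length-to-width ratio
  Ratio.Diff = Sepal.L_W - Petal.L_W         # uses the two columns just created
)
names(iris_more)
```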

1.1.4.1 Exercise

Try to obtain iris_newvar WITHOUT using the mutate() function. (You may need multiple steps, which is why mutate() is so useful, especially when you need to create many new variables.)

1.2 Data Visualization with ggplot2

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. More details can be found at http://ggplot2.org/. Here is a very good tutorial.
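To give a feel for the layered approach, here is a minimal sketch of a scatter plot of the iris data. The choice of variables, the title, and the axis labels are only for illustration.

```r
library(ggplot2)
data(iris)

# Start with the data and the aesthetic mappings, then add layers with +
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +                          # one point per flower
  labs(title = "Sepal dimensions by species",
       x = "Sepal length",
       y = "Sepal width")

# Printing the object draws the plot; the legend is created automatically
print(p)
```

Because the plot is stored as an object, you can keep adding layers to p (for example, another geom) before printing it.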

1.3 Things to remember

How to obtain basic summary statistics
Summary statistics by groups
Pivot table
Use of “[ ]” for subsetting and indexing
Functions in the dplyr package.
Basic R graphics.
