Exploratory Data Analysis (EDA) is usually the first step when you analyze data, because you want to know what information the dataset carries. In this lab, we introudce basic R functions for EDA in both quantitative and graphical approaches. In the end, we will also learn some useful functions for data manipulation, which is often necessary in data analysis.

1 Exploratory Data Analysis

1.1 Basic summary Statistics

Before start, always do

  • set the working directory!
  • create a new R script (unless you are continuing last project)
  • Save the R script.

Let’s first load the Iris dataset. This is a very famous dataset in almost all data mining, machine learning courses, and it has been an R build-in dataset. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginicaand Iris versicolor). Four features(variables) were measured from each sample, they are the length and the width of sepal and petal, in centimeters. It is introduced by Sir Ronald Fisher in 1936.

  • 3 Species

  • Four features of flower: length and the width of sepal and petal

go to top

1.2 Explore the iris Dataset with R

1.2.1 Load Data

The iris flower data set is included in R. It is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

First, load iris data to the current workspace

data(iris)
iris

1.2.2 What is in the dataset?

You can use head() or tail() to print the first or last few rows of a dataset:

head(iris)

Check dimensionality, the dataset has 150 rows(observations) and 5 columns (variables)

dim(iris)
## [1] 150   5

Another way to get the dim is to use ncol or nrow:

ncol(iris)
## [1] 5
nrow(iris)
## [1] 150

Variable names or column names

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

You can also use this command

colnames(iris)

Structure of the dataframe, note that the difference between num and Factor

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

By default, R treat strings as factors (or categorical variables), in many situations (for example, building a regression model) this is what you want because R can automatically create “dummy variables” from the factors. However when merging data from different sources this can cause errors. In this case you can use stringsAsFactors = FALSE option in read.table.

class(iris[,1])
## [1] "numeric"
class(iris[,5])
## [1] "factor"

1.2.3 Simple summary statistics

Try the summary() function.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

It only produces the location statistics for continues variable, and count for categorical variable. How about standard deviation, another important summary statistic?

sd(iris$Sepal.Length)
## [1] 0.8280661
quantile(iris$Sepal.Length)
##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

Using apply() to calculate a particular statistic for multiple variables at the same time.

apply(iris[,1:4], 2, sd)  # "2" means "by column"
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377

1.3 Summary by groups

1.3.1 Use aggregate() function to find summary statistics by group.

# group mean
aggregate(.~Species, iris, mean) 
# group standard deviation
aggregate(.~Species, iris, sd)

What if there are multiple “factor” variables?

Let’s first artificially create a new column that categorizes “Sepal.Length” by quantile.

# use function cut for categorization based on quntiles
Cate.SepalLength<- cut(iris$Sepal.Length, breaks=quantile(iris$Sepal.Length), include.lowes=T)
# Add the created categorical variable to the data
iris1<- iris
iris1$Sepal.Length.Cate <- Cate.SepalLength

Average value of numerical varialbes by two categorical variables: Species and Sepal.Length.Cate:

aggregate(.~Species+Sepal.Length.Cate, iris1, mean)

1.3.2 Pivot table

# One-way count table
table(iris1$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
# Two-way count table
table(iris1$Species, iris1$Sepal.Length.Cate)
##             
##              [4.3,5.1] (5.1,5.8] (5.8,6.4] (6.4,7.9]
##   setosa            36        14         0         0
##   versicolor         4        20        17         9
##   virginica          1         5        18        26

1.4 Exercise:

Download the customer data and read into R.

customer <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv")
  1. How many rows and columns of the dataset?
  2. Print first few rows the dataset.
  3. Obtain the summary statistics (Min, Median, Max, Mean and Std.) for Age, EducationYears, HHIncome, and CreditDebt.
  4. Obtain the mean of HHIncome by MaritalStatus
  5. Obtain a pivot table of LoanDefault vs. JobCategory. Which Job Category has the highest and lowerst loan default rate?

go to top

2 Data Manipulation

2.1 Re-ordering columns and sorting rows

Sorting by one or more variables is a common operation that you can do with datasets. With RStudio version 0.99+, you can sort a dataset when viewing it by clicking column header.

To do it with code, let’s suppose that you would like to find the top 5 rows in iris dataset with largest Sepal.Length.

iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ] 

2.2 Sorting

Sorting by one or more variables is a common operation that you can do with datasets. With RStudio version 0.99+, you can sort a dataset when viewing it by clicking column header.

To do it with code, let’s suppose that you would like to find the top 5 rows in iris dataset with largest Sepal.Length.

iris[order(iris$Sepal.Length, decreasing = TRUE)[1:5], ] 

The syntax is cleaner with the arrange() function in the dplyr package:

arrange(iris, desc(Sepal.Length))[1:5, ]

2.3 Select columns

If you want to select one or more variables of a data frame, there are two ways to do that. First is using indexing by “[]”. Second is select() function in dplyr. For example, suppose we want to select variable “Sepal.Length”:

iris[, "Sepal.Length"]

or alternatively select two variables: “Sepal.Length”, “Sepal.Width”

iris[, c("Sepal.Length", "Sepal.Width")]

2.4 Missing data

2.4.1 Detect missing values

Recall the customer dataset we in previous exercise.

customer <- read.csv("data/CustomerData.csv")
# How many missing values are in customer dataset?
sum(is.na(customer))
## [1] 76
# How many missing values are in each variable?
data.frame(num_missing=colSums(is.na(customer)))

2.4.2 How to deal with missing values?

  • Simply delete those observations(rows) that contains missing value.
  • Impute missing values by certain statistics, predictions or random number from estimated distributions.
# Simply delete rows with missings
clean_customer<- na.omit(customer)
nrow(clean_customer)
## [1] 4940
# Impute missing values by median
medHS<- median(customer$HouseholdSize, na.rm = T)
customer$HouseholdSize[is.na(customer$HouseholdSize)==T]<- medHS

2.5 Exercise:

Download the customer data and read into R.

customer <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv")
  1. How many rows and columns of the dataset?
  2. Print first few rows the dataset.
  3. Obtain the summary statistics (Min, Median, Max, Mean and Std.) for Age, EducationYears, HHIncome, and CreditDebt.
  4. Obtain the mean of HHIncome by MaritalStatus
  5. Obtain a pivot table of LoanDefault vs. JobCategory. Which Job Category has the highest and lowerst loan default rate?

go to top