1 Introduction to R and R Studio

1.1 Powerful Statistical language: R

Without a doubt, R has been gaining ground on the other programming languages popular among data scientists and statisticians, because it is free, open source, and backed by an active community full of great researchers. Most statisticians use R to analyze data, fit models, conduct research, or even write papers. Researchers and many software engineers contribute to the R community, and this collaboration is helping R become one of the world’s most remarkable statistical programming languages. R Studio, REvolution, ggplot2, and Shiny are a few great tools in this community that play a significant role in making this language great.

For people who want to study frontier statistical methodologies, R lets them focus on implementing the statistical methods themselves. One can quickly implement new estimation methods or algorithms for simulation or comparison purposes. A researcher can also quickly develop an R package for a proposed tool, which makes the tool usable by others, often soon after the method is published. Such fast distribution of newly published methods could not have happened years ago, when statisticians knew little about software development. With the easy development process for R packages, researchers in bioinformatics, statistics, and computer science can package their new methods very quickly. Other researchers can then apply the latest techniques from those packages to accelerate their research or business, almost as simply as clicking a button in SPSS. Likewise, practitioners can develop packages for their own projects. Their packages become intellectual property that can be reused in the future (they don’t necessarily need to publish their packages). Although open-source software often faces issues such as lack of maintenance or security, it makes the flow of knowledge seamless.

It is also easy to learn how to use a package in R. For example, here is the tutorial for the Shiny package: https://shiny.rstudio.com/tutorial/

1.2 Extremely Strong IDE: R Studio

R Studio is backed by a very active community that keeps enriching the functionality of R. There are several great features I want to emphasize and recommend.

  1. R Markdown or R Notebook

Markdown is an elegant syntax for writers, from novelists to researchers in the social and natural sciences. It also supports sophisticated notation and graphical or numerical results. Its simple syntax lets writers focus on the essential part of knowledge sharing: the content and the ideas themselves. LaTeX is an excellent tool for generating beautiful documents, but unlike LaTeX users, Markdown users don’t need to know much about formatting, typesetting, referencing, or precisely positioning statistical tables and figures. Because a smooth writing process leads to fluent communication with readers, the advantage of Markdown is evident.

Markdown Basics is available here: http://rmarkdown.rstudio.com/authoring_basics.html (many other tutorials are available on the same website).

The new version of R Studio provides an R Notebook feature similar to R Markdown. Both offer an accessible environment to test and iterate when writing an article with code. You can run code inside the manuscript and display whichever results you want to insert.

  2. Shiny Web App

Shiny Web App provides an essential channel for statisticians and data scientists to interact with audiences or customers. With this simple and user-friendly tool, data analysts can present their findings and visualize their results directly to an audience. This process used to be complicated and required professionals in software or web development. The Shiny App simplifies the process and provides an intuitive way to build interactive applications on top of statistical results. It is therefore an excellent tool, well worth learning, for turning your idea into a real product smoothly.

2 Preparations

2.1 Download and Installation

R

R Studio

Shiny App

https://shiny.rstudio.com/

R Markdown

http://rmarkdown.rstudio.com/

The download and installation should be straightforward. In case you encounter problems, you can check the following video tutorials.

Install R: http://www.youtube.com/watch?v=SJ9sVyqWJn8&hd=1

Install R Studio: http://www.youtube.com/watch?v=6aTRbo7kdGk&hd=1


2.2 R Studio

RStudio runs on top of R. It is an IDE (Integrated Development Environment) with many advanced features. These lab notes were created with R Markdown, a very nice and useful tool from RStudio.

After you open RStudio, it should look like this:

There are three panels showing. However, you also need the fourth one, which is the editor window. Click the green-plus icon in the top-left corner and select R Script. You write all your code in this editor window, and remember to save it!

Other Panels

  • Console: It shows any command you have run and corresponding output.
  • Environment: It shows what you currently have: data you have loaded, functions that have been defined, and other R objects.
  • File/Plot…: Files in current working directory, latest plot you have generated…

2.3 Packages

R is open-source software. That means everyone can contribute to it by writing R packages and sharing them with the community. A package usually consists of several R functions and datasets that are designed for specific tasks. There are over 10,000 packages on CRAN.

You need to download (install) a package once and then load it into your working environment before using its functions. We will see this later.
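
The install-then-load workflow described above can be sketched as follows. This is only an illustration: the install line is commented out because it needs an internet connection, and the tools package is used for the loading step simply because it ships with every R installation.

```r
# One-time download from CRAN (needs internet), shown here for the readxl package:
# install.packages("readxl")

# Load an installed package into the current session.
# 'tools' is part of base R, so it is always available.
library(tools)

# Functions from the loaded package can now be called directly.
file_ext("storks.csv")

# A function can also be called without loading its package, using pkg::fun.
tools::file_ext("storks.txt")
```

Installation is needed only once per computer, while library() has to be called again in every new R session.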

You may call yourself a software developer if you can write R packages. If you are interested in writing packages, here is a good book to read: http://r-pkgs.had.co.nz/.

2.4 Learning Resource

  • Google: simply search “how to … with R”.
  • Stack Overflow: a searchable Q&A site oriented toward programming issues.
  • Cross Validated: a searchable Q&A site oriented toward statistical analysis.
  • R-bloggers: a central hub of content from over 500 bloggers who provide news and tutorials about R.
  • Use help() or the question mark in the R console: This is the most convenient way to learn R functions. I spend more than 80% of my programming time looking at the help documents in R.
    • Please try “?lm” (type it in your console), or help(lm)

3 Fundamentals

3.1 Before Coding

3.1.1 Set Working Directory

Always set the working directory before you start coding. The working directory is where you read external data, write data, and save your code.

  • Look at current working directory: type getwd() in console.
  • Set working directory: use setwd("the path")

Or

Click Session -> Set Working Directory -> Choose Directory, then choose the folder in which you wish to save your work. This folder becomes the default location for reading and saving files. Next, create an R Script, name it, and save it. Then you can start writing code in the editor window.
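
As a minimal sketch of the console route, the code below checks the current working directory, switches to another folder, and switches back. The temporary folder returned by tempdir() is only a stand-in (an assumption for this example); in practice you would pass the path of your own project folder to setwd().

```r
# Remember where we are now.
old_dir <- getwd()
old_dir

# Stand-in for your own project folder, e.g. "C:/myproject" or "~/myproject".
work_dir <- tempdir()

# Change the working directory and confirm the change.
setwd(work_dir)
getwd()

# Files are now read from and written to this folder by default.
list.files()

# Go back to the original folder.
setwd(old_dir)
```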

Your objects (loaded datasets, variables, functions, etc.) are contained in your “current workspace”, which can be saved at any time. In RStudio: Session -> Load Workspace/Save Workspace As….

Remember: Keep it tidy! Keep separate projects (code, data files) in separate workspaces/directories.

3.2 Use R as Calculator

You can assign numbers and lists of numbers (vectors) to a variable. Assignment is specified with the “<-” or “=” symbol. Both assign the value on the right-hand side (RHS) to the object on the left-hand side (LHS). There are some subtle differences between them, but most of the time they are equivalent. I highly suggest that you use “<-” for assignment and use “=” only for arguments inside a function call (this may be explained later).
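
The suggested convention can be sketched as follows. The function add_markup and its argument names are hypothetical, defined only for this illustration.

```r
# '<-' creates (or overwrites) an object in the workspace.
price <- 100

# A hypothetical helper function, defined only for this illustration.
add_markup <- function(amount, rate) amount * (1 + rate)

# Inside a function call, '=' names the arguments; it does not create objects.
total <- add_markup(amount = price, rate = 0.5)
total

# 'rate' was only an argument name, so no object called 'rate' was created here.
exists("rate", inherits = FALSE)
```

Keeping “<-” for objects and “=” for arguments makes it easy to see at a glance which names end up in the workspace.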

Here we define two variables \(x = 10\) and \(y = 5\), then we calculate \(x+y\). Type the following code in the editor and run it line by line. To run a line of code, move the cursor to that line and press Ctrl+Enter (Command+Enter on Mac). If you want to run multiple lines of code, simply highlight those lines and use the same shortcut. (Note that you can put a # in front of a line to write a comment in code.)

x <- 10
y = 5
x+y
## [1] 15

After you run the code, what did you find in the Global Environment (Workspace) window?

In RStudio, you can view every variable you have defined, along with other objects such as imported datasets, in the Global Environment (Workspace) window. You can use R as an over-qualified calculator. Try the following commands. You already have \(x, y\) defined, so you can calculate \(log(x)=\)

log(x)
## [1] 2.302585

\(exp(y)=\)

exp(y)
## [1] 148.4132

\(cos(x)=\)

cos(x)
## [1] -0.8390715

The log, exp, and cos operators are functions in R. They take inputs (also called arguments) in parentheses and return outputs.

You can also run logical operations, such as \(x == y, x > y\):

x == y
## [1] FALSE
x > y
## [1] TRUE

Exercise:

Economic Order Quantity Model: \(Q= \sqrt{2DK/h}\)

  • D=5000: annual demand quantity
  • K=$4: fixed cost per order
  • h=$0.5: holding cost per unit
  • Q=?
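
A formula like this can be wrapped in a small R function. The sketch below uses made-up inputs (D = 1000, K = 5, h = 0.25) rather than the exercise values, and the function name eoq is hypothetical.

```r
# Economic Order Quantity: Q = sqrt(2 * D * K / h)
# D: annual demand, K: fixed cost per order, h: holding cost per unit
eoq <- function(D, K, h) {
  sqrt(2 * D * K / h)
}

# Illustrative inputs only (not the exercise values).
eoq(D = 1000, K = 5, h = 0.25)
## [1] 200
```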

3.3 Data Structure: Vector, Matrix, Data frame, and List

This section covers four common data structures in R: vector, matrix, data frame, and list.

3.3.1 Vector

  • A vector is a list of numbers (or strings), such as the \(z\) defined below, a vector with elements \([3,5,7,9]\). Several basic calculations can be applied to a list of numbers.

To assign a list of numbers (vector) to a variable, put the numbers inside the c() function, separated by commas. As an example, we can create a new variable called “z” that contains the numbers 3, 5, 7, and 9.

# Define numerical vector z
z<- c(3,5,7,9)
# Define character vector zz, where numerical operations cannot be directly applied.
zz<- c("cup", "plate", "pen", "paper")

3.3.1.1 Average

#Average
mean(z)
## [1] 6

3.3.1.2 Standard deviation

#Standard deviation
sd(z)
## [1] 2.581989

3.3.1.3 Median

#Median
median(z)
## [1] 6

3.3.1.4 Maximum

#Max
max(z)
## [1] 9

3.3.1.5 Minimum

#Min
min(z)
## [1] 3

3.3.1.6 Summary statistics

#Summary Stats
summary(z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     4.5     6.0     6.0     7.5     9.0

3.3.1.7 Calculation

Elementwise operations for single vector or vectors:

z
## [1] 3 5 7 9
z+2
## [1]  5  7  9 11
z/10
## [1] 0.3 0.5 0.7 0.9
# define vector z1
z1 <- c(2,4,6,8)
# Elementwise operations (use vectors of the same length; otherwise R recycles the shorter one)
z+z1
## [1]  5  9 13 17
z*z1
## [1]  6 20 42 72

A vector built from multiple vectors is still a vector, such as z2 below.

# define vector z2
z2 <- c(z, z1)
z2
## [1] 3 5 7 9 2 4 6 8

3.3.1.8 Indexing

How to extract the second entry of vector z2=(3,5,7,9,2,4,6,8)?

z2[2]
## [1] 5

How to extract all elements greater than 3 from vector z2?

z2[z2>3]
## [1] 5 7 9 4 6 8

How to extract all elements greater than 3 and smaller than 6 from vector z2?

z2[z2>3 & z2<6]
## [1] 5 4

How to order the vector z2 from smallest to largest?

z2[order(z2)]
## [1] 2 3 4 5 6 7 8 9
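
A few related indexing tools are worth knowing alongside the examples above. The sketch below redefines z2 so that it runs on its own; sort(), which(), and negative indices are base R features that the text above does not use.

```r
# Same vector as above, redefined so this block is self-contained.
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)

# order() returns positions; sort() returns the sorted values directly.
order(z2)
sort(z2)

# which() gives the positions (not the values) where a condition is TRUE.
which(z2 > 6)

# Several positions can be extracted at once with a vector of indices.
z2[c(1, 3)]

# A negative index drops that position instead of selecting it.
z2[-1]
```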

Exercise:

  • What is the dot product (inner product) of z and z1?
  • Find the elements of z2 that are smaller than 3 or greater than 7.

3.3.2 Matrix

A matrix is a table of numbers (or strings). The matrix \(A\) created below has 2 rows and 3 columns.

3.3.2.1 Creating

z = c(3,5,7,9)

A = matrix(data = c(1,2,3,4,5,6), nrow = 2)

matrix() is a function that creates a matrix from a given vector. Some of the arguments of a function can be optional. For example, you can also add the ncol argument, which is unnecessary in this situation.

A <- matrix(data = c(1,2,3,4,5,6), nrow = 2, ncol = 3)

Another way to write the function is to ignore the argument names and just put arguments in the right order, but this may cause confusion for readers.

A <- matrix(c(1,2,3,4,5,6), 2, 3)

Question: Think about what would happen if you specify ncol=2 or ncol=4.

By default, the numbers of a vector fill the matrix by column (from top to bottom), but you can fill it by row using the additional argument byrow=TRUE.

# Stored under a new name so that the 2*3 matrix A is kept for the examples below
A_byrow <- matrix(data = z2, nrow = 4, ncol = 2, byrow = TRUE)
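
The difference between the two fill orders is easiest to see side by side. In this sketch the names v, by_col, and by_row are made up for illustration.

```r
v <- c(1, 2, 3, 4, 5, 6)

# Default: fill column by column (top to bottom, then the next column).
by_col <- matrix(v, nrow = 2)
by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE: fill row by row (left to right, then the next row).
by_row <- matrix(v, nrow = 2, byrow = TRUE)
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```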

3.3.2.2 Calculation

Elementwise operations for matrices:

A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
A+2
##      [,1] [,2] [,3]
## [1,]    3    5    7
## [2,]    4    6    8

Dimension

# Dimensions of A
dim(A)
## [1] 2 3

Transpose and Multiplication

# Transpose
t(A)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
# Matrix multiplication is defined if and only if the number of columns of the first matrix equals the number of rows of the second
t(A) %*% A
##      [,1] [,2] [,3]
## [1,]    5   11   17
## [2,]   11   25   39
## [3,]   17   39   61
# New matrix A2 with the same dimension as A (2*3)
A2 <- A * 2
# Matrix calculation should satisfy the rules of matrix algebra
A + A2
##      [,1] [,2] [,3]
## [1,]    3    9   15
## [2,]    6   12   18

Question: What would happen if you run A %*% A2?

3.3.2.3 Indexing

How to extract the second entry of second row from matrix A?

A[2,2]
## [1] 4

How to extract the first row from matrix A?

A[1, ]
## [1] 1 3 5

How to extract the first two columns from matrix A?

A[,1:2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Exercise:

  • What are the diagonal elements of t(A) %*% A?

3.3.3 Data frames

Data frames in R are the “datasets”, that is, tables of data with each row as an observation and each column representing a variable. Data frames have column names (variable names) and row names.

3.3.3.1 Creating

  • Convert a matrix to data frame

You can use data.frame() to transform a matrix into a data frame. Most of the time, however, you will import a text file as a data frame or use one of the example datasets that come with R.

mydf <- data.frame(A) 
class(mydf)
## [1] "data.frame"

  • Read external data file (.txt and .csv files)

Use the read.table or read.csv function to import comma/space/tab delimited text files. You can also use the Import Dataset Wizard in RStudio. The package “readxl” allows you to read xls/xlsx files. First, download the storks datasets (storks.csv and storks.txt files) and save them into your working directory.

mydata_csv <- read.csv("storks.csv", header=TRUE)
mydata_txt <- read.table("storks.txt", header=TRUE, sep = "\t")
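
If you do not have the storks files at hand, the same import step can be tried on a file you create yourself. The sketch below writes a small made-up data frame (the names toy, tmp_csv, and toy_back are hypothetical) to a temporary CSV file and reads it back.

```r
# A small made-up dataset, used only for this illustration.
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write it to a temporary CSV file (a stand-in for a file in your working directory).
tmp_csv <- tempfile(fileext = ".csv")
write.csv(toy, tmp_csv, row.names = FALSE)

# Read it back; header = TRUE treats the first line as variable names.
toy_back <- read.csv(tmp_csv, header = TRUE)
str(toy_back)

# Clean up the temporary file.
unlink(tmp_csv)
```
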

  • Load built-in dataset

#Load cars dataset that comes with R (50 obs, 2 variables)
data(cars)

  • Summary of a dataset

#Dimension 
dim(cars)
#Preview the first few rows
head(cars)
#Variable names
names(cars)
#Summary
summary(cars)
#Structure
str(cars)

3.3.3.2 Manipulating

Subsetting elements from data frames is similar to subsetting from matrices. In addition, since data frames have variable names (a label for each column), you can also use the following two ways to refer to variables of a data frame:

  • df$var will select var from df
  • df[, c('var1', 'var2')] will select var1 and var2 from df

In RStudio, hitting tab after df$ allows you to select/autocomplete variable names in df
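
Applied to the built-in cars dataset, the two ways of selecting variables look like this. The object names speed_vec, both_cols, and fast are made up for this sketch, and the last line adds row filtering by a condition, which works just as it does for vectors.

```r
data(cars)

# df$var: a single variable, returned as a vector.
speed_vec <- cars$speed
head(speed_vec)

# df[, c('var1', 'var2')]: several variables, returned as a data frame.
both_cols <- cars[, c("speed", "dist")]
dim(both_cols)

# Rows can be selected with a logical condition before the comma.
fast <- cars[cars$speed > 20, ]
nrow(fast)
```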

  • Adding and dropping variables

Add new variable to the data

#First 2 obs of the variable dist in cars
cars$dist[1:2]
## [1]  2 10
cars1<- cars
cars1$time<- cars$dist/cars$speed

Drop variable time

# since "time" is the third column, we can do
cars2<- cars1[,-3]
# we can also drop "time" by keeping the other two variables
cars3<- cars1[c("speed", "dist")]

3.3.4 List

A list is a container. You can put different types of objects into a list to collect everything you have in hand.

mylist<- list(myvector=z, mymatrix=A, mydata=cars)

The output of many R functions is a list that contains several objects.

# Load cars dataset that comes with R
data(cars)
#fit a simple linear regression between braking distance and speed
lm(dist~speed, data=cars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

There are three ways to get an element from a list:

  • use listname[[i]] to get the ith element of the list;
  • use listname[["elementname"]];
  • use listname$elementname.

Note that you use double square brackets for indexing a list.

reg <- lm(dist~speed, data = cars)
reg[[1]]
reg[["coefficients"]]
reg$coefficients

If you have done object-oriented programming before, the list “reg” is actually an object that belongs to class “lm”. The element names such as “coefficients” are fields of the “lm” class.
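
The three access styles, and the reason for the double brackets, can be sketched on a small made-up list (toy_list is a hypothetical name). Single brackets return a shorter list, while double brackets return the element itself.

```r
# A small list with two named elements, used only for this illustration.
toy_list <- list(numbers = c(3, 5, 7, 9), label = "demo")

# Three equivalent ways to reach an element.
toy_list[[1]]
toy_list[["label"]]
toy_list$numbers

# Single brackets keep the list wrapper; double brackets unwrap the element.
class(toy_list[1])
## [1] "list"
class(toy_list[[1]])
## [1] "numeric"

# names() lists the available element names, e.g. for a fitted regression.
reg <- lm(dist ~ speed, data = cars)
names(reg)
```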

3.3.5 Exercise

  1. Define a vector with values (5, 2, 11, 19, 3, -9, 8, 20, 1). Calculate the sum, mean, and standard deviation.

  2. Re-order the vector from largest to smallest, and make it a new vector.

  3. Convert the vector to a 3*3 matrix ordered by column. What is the sum of first column? What is the number in column 2 row 3? What is the column sum?

  4. Use the following code to load the CustomerData to your R.

customer <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/CustomerData.csv")

  • How many rows and columns are there?
  • Extract all variable names.
  • What is the average “Debt to Income Ratio”?
  • What is the proportion of “Married” customers?

3.4 Basic Plotting

A Simple Scatter Plot

plot(cars)
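
The same scatter plot can be drawn with explicit axis labels and a title. This is a sketch of a few common plot() arguments; the fitted line added with abline() reuses the simple regression of dist on speed shown earlier, and the title text is made up.

```r
data(cars)

# Scatter plot with explicit x and y variables, axis labels, and a title.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "cars: stopping distance vs. speed")

# Add the fitted regression line on top of the points.
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red")
```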

3.5 Probability Distributions

Types of distributions: norm, binom, beta, cauchy, chisq, exp, f, gamma, geom, hyper, lnorm, logis, nbinom, t, unif, weibull, wilcox

Four prefixes:

  1. ‘d’ for density (PDF)

  2. ‘p’ for distribution (CDF)

  3. ‘q’ for quantile (percentiles)

  4. ‘r’ for random generation (simulation)

dbinom(x=4,size=10,prob=0.5)
## [1] 0.2050781
pnorm(1.86)
## [1] 0.9685572
qnorm(0.975)
## [1] 1.959964
rnorm(10)
##  [1]  0.5558683 -0.2287420 -1.2203218 -0.7468588 -0.2876556 -0.9365588
##  [7]  0.3034647  2.0568812  0.5633708 -2.2570392
rnorm(n=10,mean=100,sd=20)
##  [1]  93.13600  82.69666  62.08860  82.06300 104.84185 122.54211 115.04454
##  [8]  90.35712  94.64884 109.88496
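
The four prefixes are linked to one another, which the short sketch below illustrates. The seed value and the object names draws and q are arbitrary choices for this example. set.seed() makes the random draws reproducible, so the same numbers appear on every run.

```r
# Fix the random seed so that simulated values are reproducible.
set.seed(2024)
draws <- rnorm(n = 1000, mean = 100, sd = 20)

# The sample mean and sd should be close to 100 and 20.
mean(draws)
sd(draws)

# 'q' and 'p' are inverses: the 97.5% quantile maps back to probability 0.975.
q <- qnorm(0.975)
pnorm(q)

# For a discrete distribution, the CDF is the running sum of the density.
sum(dbinom(0:4, size = 10, prob = 0.5))
pbinom(4, size = 10, prob = 0.5)
```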

4 Summary

4.1 Do you like R?

4.1.1 YES!

That is great! Go R!

4.1.2 NO!

You may have trouble in the rest of the semester…, so please try to get used to it!

4.2 What you need to remember

  • Set working directory.
  • How to create a vector?
  • How to load an external data file?
  • How to subset/index a vector, a matrix, and a data frame?
  • Basic calculations and syntax.
