Contents & Purposes
This introductory lecture is related to:
- fundamental math concepts
- programming with R
- basic statistics
The lecture aims at:
- providing the basis of R & RStudio
- getting an idea of what is programming with R
- reviewing some statistical topics
Math Essentials
Matrices and Vectors
“A matrix is a rectangular array[1] of numbers, symbols, or expressions, arranged in rows and columns.” (Wikipedia)
In particular, matrices can be divided by the following common types
Matrix Algebra
Matrix
Algebra
The link send to a webpage containing a detailed explanation of all
matrix operations.
Functions
“A function is a process or a relation that associates each element \(x\) of a set \(X\), the domain of the function, to a single element \(y\) of another set \(Y\) (possibly the same set), the codomain of the function.” (Wikipedia)
- the element \(x\) is the argument,
the input or the independent variable of the function;
- \(f(.)\) is the function or the
program that maps \(x\) to \(y\) (\(x \to f(.)
\to y\));
- \(y\) is the value of the function, the output, the dependent variable or the image of \(x\);
Thus, a function describes the relation between input and output, the independent variable and the dependent variable, and the relation is denoted by \(y = f(x)\).
Functions can have one or more arguments or variables. The latter is generally defined as \(y = f(x_1, x_2, ..., x_n)\), which is a generic function \(f\) taking \(n\) inputs or arguments \((x_1, x_2, ..., x_n)\) and producing a single output \(y\).
A simple example of a function
Programming
What is a Programming Language?
A programming language is …
“a formal language, which comprises a set of instructions that produce various kinds of output. Programming languages are used in computer programming to implement algorithms.” (Wikipedia)
“a computer language engineered to create a standard form of commands. These commands can be interpreted into a code understood by a machine.” (Techopedia)
“a computer language programmers use to develop software programs, scripts, or other sets of instructions for computers to execute.” (Computerhope)
Summarizing, a programming language is:
- a formal language
- a standard form of commands
- a set of instructions to communicate with machines and to produce
output
- commands used to develop software programs (etc.).
Moreover, programming languages may be classified according to various criteria:
- high level (more like human language) / low level (close to machine language)
- imperative (as a sequence of operations to perform, that is the how to) / declarative (as what should be achieved, not how to do that)
- compiled (the machine understand the code just compiling it) / interpreted (the machine needs an interpreter that translate the code to understand it)
- general purpose (capable to create all types of programs) / domain specific
- object-oriented or oop (code can be structured as reusable components, some of which may share properties or behaviors)
Actually, have been created more than 2500 “programming languages”. Here are just few of the most common.
Programming & Natural Language: Similarities and Differences
First of all, the main function of languages, be it a programming language or a natural language, is communication. This is the most important similarity between them, and one of the main reasons we refer to both of them as languages.
Second, another important feature that they have in common is structure. I would like you to focus for a moment on the notion of formal language. A formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. This is actually what we learned in primary school when we started to learn our country’s natural language (or spoken language). Indeed, every natural language of the world is conceived exactly as a formal language, that is a combination of:
- an alphabet (with letters and symbols)
- a list of well-formed words
- a set of rules, that is grammar, semantics and linguistic rules.
When we talked about programming language, we saw that it is actually defined as a formal language. The reason is that each programming language has its own words and combines them to create sentences and “novel” following specific grammar, syntax and semantic rules. In this regard, a programming language is very similar to a natural language.
In particular, I like to think to programming in this way:
- objects (in oop languages) are nouns
- functions are verbs
- functions’ arguments are adjectives
The combination of these components in a formal way (that is following the specific rules of the programming language in use) creates a program (instead of a poems or a novel), which may be seen as a big function that performs some computer task. Moreover, in computer science, the practice combining words in sentences with the aim of writing computer programs is called coding, and code is written with a certain idea or intention in mind (semantics) while following the set of rules around the use of variables, functions, different kinds of parenthesis, colons, etc. (syntax).
The main difference between a programming language and a natural language is that the former is created to communicate with machines, while the latter to humans. This is obvious, of course, but it means that what is written in a programming language is not free to interpretation. While in natural language context may exists many possible interpretations of single words and even more of sentences, in programming language this is not the case. A sentence, which is the combination of nouns (objects), verbs (functions) and adjectives (arguments), has just one possible meaning: which is what the code actually does. The lack of free interpretability may be seen as link to the mathematical definition of function: computer programs, written in a programming language, are actually functions that take some input (one or many), perform some task and return AN output.
R
What is the R Language?
According to the TIOBE Index, and many other programming language surveys, R is not among the most important programming languages. Why? and why is it so popular nonetheless?
R is a free, open-source software and programming language developed in 1995 at the University of Auckland as an environment for statistical computing and graphics. R is actually the evolution of the S programming language that was conceived in the Bell Laboratories around the ’80s and it was developed by Robert Gentleman & John Chambers. Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines, including social sciences, biology and bioinformatics, medicine, geoinformatics, finance and economics. R is particularly powerful for its graphical capabilities, its enormous amount of libraries regarding statistical modeling and it is also prized for its GIS capabilities which make it relatively easy to generate raster-based models.
R is an interpreted language, not a compiled one, meaning that all commands typed on the keyboard are directly executed without requiring to build a complete program like in most computer languages (C, FORTRAN, Pascal, …). Moreover, R is also in the domain of Object-Oriented Programming (OOP) languages since it uses objects and method applied to them as fundamental units.
All you need to know about R can be found on the R-Project website.
Installing R
Installing R is very easy, all you need to do is to pick up the right solution based on your laptop’s OS and configuration from the R Project website above or from the CRAN website below.
CRAN & Packages
On the Comprehensive R Archive Network (CRAN), instead, you can find all the R packages or libraries, coupled with their documentation. Indeed, the CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.
An R package (or library) is the fundamental unit of reproducible R code. It includes reusable R functions, the documentation that describes how to use them, and sample data.
Objects and Functions
The R world
- everything is an object
- everything that happen is a call to a function (function call)
In R
- every object belongs to a class
- functions, whenever called, recognize the object’s class and apply
the corresponding method (dispatching)
- the method performs ad hoc operations on the object and return an output
When R is running, variables, data, functions, results, etc, are stored in the active memory of the computer in the form of objects which have a name. The user can do actions on these objects with operators (arithmetic, logical, comparison, …) and functions (which are themselves objects). All the actions of R are done on objects stored in the active memory of the computer.
Example
my_object <- "Hello World" # here an object called my_object is created
print(my_object) # here the print function is called on the object my_object and the output is shown below
## [1] "Hello World"
R Environment
Environment can be thought of as a collection of objects (functions, variables etc.). An environment is created when we first fire up the R interpreter. Any variable we define, is now in this environment.
The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names: each name points to an object stored elsewhere in memory; the objects don’t live in the environment so multiple names can point to the same object.
R messages
Messages in R are displayed by functions call into the R Console and can be of three main types:
- errors which stop the function because of some
unexpected problem. They need to be fixed in order to be able to
proceed.
- warnings are weaker than errors. They signal that
something has gone wrong, but the code has been able to recover and
continue. Unlike errors, you can have multiple warnings from a single
function call.
- messages are just informational.
All these kind of messages are very important because they allow the user to understand if a function call is going well or not, and they help him to eventually recover problems.
The R GUI
The R GUI (Graphical User Interface) is the simplest interface which comes with the installation of the R software.
The RStudio IDE
RStudio is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is similar to the standard R GUI, but is considerably more user friendly. It has more drop-down menus, windows with multiple tabs, and many customization options. The first time you open RStudio, you will see three windows. A forth window is hidden by default, but can be opened by clicking the File drop-down menu, then New File, and then R Script. Detailed information on using RStudio can be found at RStudio’s Website.
Installing RStudio
RStudio can be easily installed by downloading the correct file at RStudio Download.
Working with the RStudio IDE
RStudio remarkably ease working with R, since it allows the user to work through a variety of user friendly panes and tabs. Indeed, you can easily:
- write, modify and save R scripts
- interact with the code by means of the console
- constantly control the working environment
- navigate through files in your PC
- readily install R packages
- look at various types of graphs by the Plot tab
- look for help in the Help menu
and many other options.
How to write R code: style guidelines
When writing code, it is very important to keep in mind that it should be easy to read. Hence, everyone should follow some very simple guidelines regarding the code style, that is:
- pick up a case and be consistent when naming object (use either
camelCase or snake_case);
- maximum 80 characters per line;
- use
<-
not=
for assignment;
- separate functions and arguments by single space;
- add comments to scripts with
#
.
To see more look at Google guidelines.
Basic R
Assignment Operator and Variables
We have seen that R works with objects which are, of course,
characterized by their names and their content, but also by attributes
which specify the kind of data represented by an object.
However, how are objects created in R? Here comes the assignment
operators.
In R you can create an object (or a variable) by means of
<-
or =
(leftward assignment) and
->
(rightward assignment).
Essentially, the object creation is done as follows:
variable_name <- "content of the variable" # character
another_variable <- 2019 # numeric
Usually the assignment is always on the left, meaning that on the left hand side there is the name of the variable (that will appear in the environment), while on the right hand side there is the value or content of the variable. The rightward assignments, although available, are rarely used.
Some fundamental R commands
In the base
R (which are the base functions that come
with the R installation), there exist some fundamental commands (or
functions) that everyone need to know before starting to program with
R.
Here are listed some of the most used:
getwd()
&setwd()
gets or sets the working directoryinstall.packages()
installs R packages
library()
load and attach R packages (so that you can use their functions)help()
or?
brings up a help pagels()
lists all the R objects in your workspace
rm()
removes objects from the workspace
print()
prints the entire object (avoid with large tables)class()
extracts the class of an object:
,seq()
&rep()
allow to create sequenceshead()
(tail()
) prints the first (last) 6 lines of your data
names()
lists the column names (i.e., headers) of your data
str()
shows the data structure of an R object
dim()
shows the dimension of your data
nrow()
&ncol()
get number of rows and columns of an object
length()
extracts the length of an object
View()
send the output to the viewer pane
plot()
base graphic function of R
# Run the following codes to see the output
getwd() # get the current working directory
setwd("your path") # set a specific working directory
install.packages("readxl") # install package readxl
library(readxl) # load package readxl
help("print") # get the documentation for the print function
?print
ls() # the environment is empty
# creating objects
x <- 1
y <- FALSE
z <- "Hello World"
ls() # the environment contains 3 variables
rm(x) # removes x from the workspace
ls() # x is no more in the environment
rm(list = ls()) # clean all the workspace
print(iris) # prints iris objects
class(iris) # check class of iris object
typeof(iris) # check type of iris object
head(iris) # top 6 rows
tail(iris) # last 6 rows
str(iris) # structure of iris
dim(iris) # dimensions of iris
names(iris) # column names of iris
nrow(iris) # number of rows
ncol(iris) # number of columns
length(iris) # get the length of the iris dataframe
View(iris) # send iris to the Viewer
plot(iris) # plot iris
1:10 # sequence from 1 to 10
seq(10) # same
seq(from = 1, to = 10, by = 0.5) # sequence from 1 to 10 by 0.5
sequence <- 1:10 # assign the sequence to an object named sequence
rep(1, 10) # repeates 1 ten times
Arithmetic with R
Performing arithmetic operations in R is as easy as in simple
calculators. The main difference is that you can perform arithmetic on
objects.
The fundamental arithmetic operators are:
+
addition
-
subtraction
*
multiplication
/
division
%/%
integer division
^
exponential
%%
modulus
# Run the following codes to see the output
# arithmetics
20 + 19
20 - 19
20 * 19
20 / 19
20 %/% 19
20 ^ 19
20 %% 19
# with objects
a <- 4
b <- 5
a + b
a - b
a * b
a / b
a %/% b
a ^ b
a %% b
Data Types
In R there exists the following 5 main data types
All numbers are numeric or double by default unless otherwise
specified. Indeed, integer numbers are obtained adding an L
after the last digit, while complex numbers are given by adding an
i
. Characters or strings are created surrounding a letter,
a word or a sentence by single '...'
or double
"..."
quotes. Logical instead may assume just two values,
that is TRUE
or FALSE
.
# Run the following codes to see the output
# Numeric
class(2019)
a <- 2019
class(a)
# Integer
class(2019L)
a <- 2019L
class(a)
# Complex
class(2019i)
a <- 2019i
class(a)
# Character
class('Hello World')
a <- "Hello World"
class(a)
# Logical
class(TRUE)
class(FALSE)
a <- TRUE
class(a)
# note that in this example the object a is overwritten every time
Finally, there are few special value types:
NA
is used to represent missing values. (NA
stands for “not available.”) You may encounterNA
values in text loaded into R (to represent missing values) or in data loaded from databases (to replaceNULL
values).Inf
and-Inf
, if a computation results in a number that is too big, R will returnInf
for a positive number and-Inf
for a negative number (meaning positive and negative infinity, respectively).NaN
, sometimes a computation will produce a result that makes little sense. In these cases, R will often returnNaN
(meaning “not a number”).NULL
, additionally, there is a null object in R, represented by the symbolNULL
. (The symbolNULL
always points to the same object.)NULL
is often used as an argument in functions to mean that no value was assigned to the argument. Additionally, some functions may returnNULL
. Note thatNULL
is not the same asNA
,Inf
,-Inf
, orNaN
.
Data Structures
Vectors
Vectors are one-dimension arrays
that can hold numeric data, character data, or logical data. In other
words, a vector is a simple tool to store data of the same
type.
In R, you create a vector with the combine function c()
.
You place the vector elements separated by a comma between the
parentheses. For example:
numeric_vector <- c(1, 2, 3)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
Empty vectors may also be created by means of the generic function
vector()
specifying the mode of the vector you want to
create.
numeric_empty_vector <- vector("numeric", 10)
character_empty_vector <- vector("character", 10)
An easy way to create numeric vectors is by creating sequences or repeating numbers.
numeric_vector <- 1:10
numeric_vector <- seq(from = 1, to = 20, by = 2)
numeric_vector <- rep(1, 10)
You may want to perform operations on vectors or with vectors, and
also apply functions on vectors, or combine vectors using
c()
.
x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
x * y
mean(x) # mean of x
sd(y) # standard deviation of y
z <- c(x, y) # combine x and y into a new object z
Factors
The term factor refers to a statistical data type
used to store categorical variables. The difference between a
categorical variable and a continuous
variable is that a categorical variable can belong to a limited number
of categories. A continuous variable, on the other hand, can correspond
to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or
a categorical variable, as the statistical models you will develop in
the future treat both types differently.
A good example of a categorical variable is gender. In many
circumstances you can limit the sex categories to “Male” or
“Female”.
To create factors in R, you make use of the function
factor()
. First thing that you have to do is create a
vector that contains all the observations that belong to a limited
number of categories. For example, sex_vector contains the sex of 5
different individuals:
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
It is clear that there are two categories, or in R-terms
‘factor levels’, at work here: “Male” and
“Female”.
The function factor()
will encode the vector as a
factor:
factor_sex_vector <- factor(sex_vector)
factor_sex_vector
Moreover, there are two types of categorical variables: a
nominal categorical variable and an ordinal
categorical variable.
A nominal variable is a categorical variable without an implied
order. This means that it is impossible to say that ‘one is
worth more than the other’.
In contrast, ordinal variables do have a natural
ordering. For instance the temperature variable is an ordered
factor and is created by specifying the argument
ordered = TRUE
and the levels.
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
Matrices
A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.
You can construct a matrix in R with the matrix()
function
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
The option byrow indicates whether the values given
by data must fill successively the columns (the default) or the rows (if
TRUE
). The option dimnames allows to give
names to the rows and columns.
Here is an example of matrix creation
mat <- matrix(c(1:9), nrow = 3, ncol = 3)
mat
mat <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = TRUE,
dimnames = list(c("row1", "row2", "row3"), c("col1", "col2", "col3")))
mat
Data Frames
A dataframe is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. It is not a matrix because in matrices all the elements have to be of the same type. In a dataframe, instead, you can store different data types among different columns. Type consistency, however, must be preserved within each column.
In R one can create a dataframe with the data.frame()
function
df <- data.frame(ID = c(1:5),
Name = c("Rick","Dan","Michelle","Ryan","Gary"),
Grade = c(18, 20, 23, 21, 29),
stringsAsFactors = FALSE) # specify this argument to keep characters as is
df
One can apply some useful function to quickly explore a dataframe (or a matrix).
head(df, 2) # look the first 2 row of dataframe df
tail(df, 2) # look the last 2 row of dataframe df
str(df) # check the structure of dataframe df
names(df) # check columns' names
Lists
Summarizing:
- Vectors (one dimensional array): can hold numeric,
character or logical values. The elements in a vector all have the same
data type.
- Matrices (two dimensional array): can hold numeric,
character or logical values. The elements in a matrix all have the same
data type.
- Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.
A list object is just a container where you can put
every object you like, with no bindings on types and lengths.
A list in R is similar to your to-do list at work or school: the
different items on that list most likely differ in length,
characteristic, and type of activity that has to be done. It allows you
to gather a variety of objects under one name (that is, the name of the
list) in an ordered way. These objects can be matrices, vectors, data
frames, even other lists, etc. It is not even required that these
objects are related to each other in any way. You could say that a list
is some kind super data type: you can store practically any piece of
information in it.
To construct a list you use the function list()
:
l <- list(comp1, comp2 ...)
The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …
Following there are few examples:
l <- list(1, "a", TRUE, 1L) # unnamed list
l
l <- list(x = 1, y = "a", w = TRUE, z = 1L) # named list
l
my_vector <- 1:10 # vector with numerics from 1 up to 10
my_matrix <- matrix(1:9, ncol = 3) # matrix with numerics from 1 up to 9
my_df <- head(mtcars, 10) # first 10 elements of the built-in dataframe mtcars
my_list <- list(my_vector, my_matrix, my_df) # construct list with these different elements
my_list
as.() and is.() family functions
There exists some very useful functions that test whether an R object
is of a certain type or structure. These are the is.()
function.
x <- c(0:5)
is.numeric(x)
is.character(x)
is.logical(x)
is.vector(x)
is.matrix(x)
is.factor(x)
is.data.frame(x)
is.list(x)
Furthermore, there is another important family of R functions that
allow you to convert an object from one type or structure to another.
These are the as.()
functions.
x <- c(0:5)
as.numeric(x)
as.character(x)
as.logical(x)
as.vector(x)
as.matrix(x)
as.factor(x)
as.data.frame(x)
as.list(x)
and here are summarized some conversion rules
Before going to the next section, it is important to know that there
exists also a special function to check if a value is or not a missing
value, that is the is.na()
function.
Accessing values & Subsetting
It may happen you don’t need to work with the entire objects you’ve just created, but only some parts or elements that objects contain.
In R one can access the values of an object by indexing or
naming.
The indexing system is an efficient and flexible way to
access selectively the elements of an object; it can be either
numeric or logical indexing.
The most common way of extracting elements from an object (or
sub-setting it) is to use the square bracket operator [
according to the type structure of the object.
For instance, one could do the following on vectors, matrices, dataframes and lists
# Run the following codes to see the output
help("[")
# Vectors
x <- 11:19 # creates a numeric vector whose elements are the values from 11 to 20
x
x[4] # extracts the fourth element
x[-4] # extracts all but the fourth element
x[2:4] # extracts elements from two to four
x[-(2:4)] # extracts all elements except from two to four
x[c(1, 5)] # extracts elements one and five
# Matrices
m <- matrix(x, 3, 3, byrow = TRUE)
m
m[2, ] # selects all row two
m[, 1] # selects all column one
m[2, 3] # selects the element on row two and column three (a23)
m[c(1, 3), c(1, 3)] # elements a11, a13, a31 and a33
m[, -2] # selects all except column two
# Lists
l <- list(x = 1:5, y = c('a', 'b'), z = c(TRUE, FALSE, FALSE))
l
l[1] # extracts the first element of l (a new list with only x)
l[[2]] # extracts the values of the second element of l (the character vector y)
l[[3]][1] # extracts the first value of the third element of l (the logical vector z)
# Dataframes
df <- as.data.frame(m, stringsAsFactors = FALSE)
df
# df allow both matrix and list subsetting
df[, 2] # extracts the second column
df[1, ] # extracts the first row
df[2, 3] # extracts the element a23
df[2] # extracts the second column (with column name)
df[[3]] # extracts the elements of the third column
The second way of extracting values in R objects is by using the naming system. The names are labels of the elements of an object, and thus of mode character. They are generally optional attributes. There are several kinds of names (names, colnames, rownames, dimnames).
Hence you can easily access the elements of an objects by passing its
name attribute to the [
operator or through the
$
operator.
# Run the following codes to see the output
help("[")
help("$")
# Vectors
x <- 11:19 # creates a numeric vector
names(x) # with no names
names(x) <- letters[1:length(x)] # assign names to the names of x
names(x)
x # now x is a named vector
x["a"] # extracts the element "a"
x[names(x)[3:5]] # extracts the elements with names in position 3, 4 and 5
# Matrices
m <- matrix(x, 3, 3, byrow = TRUE)
colnames(m)
colnames(m) <- c("X1", "X2", "X3") # assign names to columns
colnames(m)
rownames(m)
rownames(m) <- c("row1", "row2", "row3") # assign names to rows
rownames(m)
m
m["row3",] # extracts third row
m[, "X1"] # extracts first column
# Lists
l <- list(x = 1:5, y = c('a', 'b'), z = c(TRUE, FALSE, FALSE))
names(l) # the list is already named
l
l["y"] # extracts "y" object within the list l
l[["x"]] # extracts the values of x object within the list l
l$x # same but with $ operator
l[["z"]][1] # extracts the first value of z object within the list l
l$z[1] # same but with $ operator
# Dataframes
df <- as.data.frame(m, stringsAsFactors = FALSE)
df
names(df)
colnames(df)
rownames(df)
# same as with lists
df["X1"]
df[["X2"]]
df$X3
Finally, another common thing that can be done by indexing or naming
is to replace values directly within an R object simply by using the
well-known assignment operator <-
.
# Run the following codes to see the output
x <- 11:19 # creates a numeric vector
x[4] <- 0 # replace the fourth element of x with value 0
x[6:8] <- 9999 # replace six to eigth elements of x with value 9999
x
# Replacement can be applied to any object be it a
# vector, matrix, dataframe or list.
Intermediate R
Relational Operators
Relational operators are used to compare between values. Here is a list of relational operators available in R:
<
less than
>
greater than
<=
less than or equal to
>=
greater than or equal to
==
equal to
!=
not equal to
# Run the following codes to see the output
# comparisons
a <- 5
b <- 16
a < b
a > b
a <= 5
b >= 20
b == 16
a != 5
# on vectors
x <- c(2, 8, 3)
y <- c(6, 4, 1)
x > y
Logical Operators
Logical operators are used to carry out Boolean operations like AND, OR etc.
!
logical NOT (the opposite of)
&
element-wise logical AND
|
element-wise logical OR
&&
logical AND
||
logical OR
Operators &
and |
perform element-wise
operation producing result having length of the longer operand. Zero is
considered FALSE
and non-zero numbers are taken as
TRUE
. But &&
and ||
examines only the first element of the operands resulting into a single
length logical vector.
# Run the following codes to see the output
# logical operations
x <- c(TRUE, FALSE, 0, 1)
y <- c(FALSE, TRUE, FALSE, TRUE)
!x
x & y
x && y
x | y
x || y
# combine all together
a <- 5
b <- 16
c <- 8
d <- 29
(a * b >= c) & (d %/% b == a)
# 80 >= 8 & 1 == 5
# TRUE & FALSE = FALSE
(a * b >= c) | (d %/% b == a)
# 80 >= 8 | 1 == 5
# TRUE | FALSE = TRUE
Conditional subsetting
Value extraction and subsetting in R can be performed also by means
of logical values. Obviously R will keep only those values that are in
the position of TRUE
s, for instance
# Run the codes to see the output
x <- 11:15
x[c(TRUE, TRUE, FALSE, FALSE, TRUE)]
# It can be done also on matrices, dataframes and lists.
This implies that you can perform extractions and/or subsetting based on conditions using relational and logical operators. This is called conditional subsetting.
Let’s see some examples
# Run the codes to see the output
x <- 20:40
x[x == 10] # the elements of x whose value is equal to 10
x[x == 30] # the elements of x whose value is equal to 30
x[x < 35] # the elements of x whose values are less than 35
x[x == 22 | x >= 38] # the elements of x whose values are equal to 22 or greater and equal to 38
x[x != 20 & x != 30 & x != 40] # the elements of x whose values are not equal to 20, 30 and 40
# It can be done also on matrices, dataframes and lists.
In general, relational and logical operators returns either
TRUE
or FALSE
. Sometimes however one could
need the indices related to the elements that satisfy the condition
tested. The R function that allows to do this is the
which()
function.
# Run the codes to see the output
x <- 20:40
x == 22 | x >= 38 # the condition returns TRUE or FALSE
which(x == 22 | x >= 38) # the which() function returns the indices
x[which(x == 22 | x >= 38)] # then subsetting is as usual
# It can be done also on matrices, dataframes and lists.
Conditional Statements
Conditional statements play a very important role in the Control Flow
process as they allow you to perform actions based on conditions. For
instance, if some condition is respected you may want your code to
produce some output and something else otherwise. Every time this is the
case, you end up using if
or if ... else
statements.
if
control statement: it evaluates a single condition. It is quite easy as it just has a single keyword “if” followed by the condition and then certain set of statements that needs to get executed in case it is true.
In this flowchart, the code will respond in the following way:
- the condition is checked
- if the condition is true, conditional code or the statements written
will be executed
- if the condition is false, the statements gets ignored.
In R the if
control statement has the following
syntax
if (condition) {
Statement 1 # perform some actions if condition is met
}
and here is an example
# Run the following codes to see the output
# Check if a number is even:
number <- 10
# 1) if the modulo is 0 then it is even
number %% 2
# 2) we need to create a condition that returns to TRUE or FALSE
(number %% 2) == 0 # this is our condition
is_even <- (number %% 2) == 0
# 3) create the if control statement
if (is_even) {
print("Even") # this is the statement which prints "Even" only if the condition is met
}
# Try to change the value of the number object and see what happens.
# In the case of odd numbers the print function is not executed!
if else
control statement: it evaluates a group of conditions and selects the statements.
In this flowcharts, the code will respond in the following way:
- the first
if
condition is checked
- if the condition is true, the first ‘if’ statement will get
executed
- if the condition is false, then it goes to the
else if
condition and if it is true, the ‘else if’ code will be executed (right flowchart)
- finally, if the ‘else if’ code is also false, then it will go to
else
code and it gets executed. This means if none of these conditions are true, then the ‘else’ statement gets executed.
In R the if else
control statement has the following
syntax
if (condition) {
Statement 1 # perform some actions if condition is met
} else {
Statement 2 # perform other actions if condition is not met
}
if (condition1) {
Statement 1 # perform some actions if condition 1 is met
} else if (condition2) {
Statement 2 # perform some other actions if condition 1 is not met but condition 2 does
} else {
Statement 3 # perform some actions if none of the conditions is met
}
# You can expand to as many condition and statements you need.
and here are few examples
# Run the following codes to see the output
# Check if a number is even or odd
number <- 11
# 1) if the modulo is 0 then it is even
number %% 2
# 2) we need to create a condition that returns to TRUE or FALSE
(number %% 2) == 0 # this is our condition
is_even <- (number %% 2) == 0
# 3) create the if else control statement
if (is_even) {
print("Even") # this is the statement which prints "Even" only if the condition is met
} else {
print("Odd") # this is the statement which prints "Odd" when the condition is not met
}
# Try to change the value of the number object and see what happens.
# In the case of odd numbers the second print function is executed!
# Run the following codes to see the output
# Check if a number is either divisible by 3 or 5 or none
number <- 12
# create the if else control statement
if ((number %% 5) == 0) {
print("Divisible by 5")
} else if ((number %% 3) == 0) {
print("Divisible by 3")
} else {
print("None")
}
# Try to change the value of the number object and see what happens.
# Oops! Try to set the value of number to 15. What's missing? Try to fix the problem by yourself, then check the solutions.
It is also useful to know that there exists the ifelse()
function which acts exactly as the “IF” function of MS Excel. This is
defined as
ifelse(test, yes, no)
the first argument being the condition, the second is the output if the condition is TRUE while the third is the value if the condition is FALSE.
Now try to solve the following exercises.
Happy Birthday
Write a simple control statement that prints “Happy Birthday!” if its
your birth date.
Hint: make use of the function Sys.Date()
in the condition
to extract today’s date.
Solution
birth_date <- "yyyy-mm-dd" # write here your birth date
if (birth_date == Sys.Date()) {
print("Happy Birthday!")
} else {
print("Today is not your birthday.")
}
How much did you sell 1?
Examine whether a variable stored as “quantity” is above 20. If quantity is greater than 20, the code will print “You sold a lot!” otherwise Not enough for today.
Solution
quantity <- 25
if (quantity > 20) {
print("You sold a lot!")
} else {
print("Not enough for today")
}
How much did you sell 2?
We are interested to know if we sold quantities between 20 and 30. If we do, then the pint “Average day”. If quantity is greater than 30 we print “What a great day!”, otherwise “Not enough for today”.
Solution
quantity <- 10
if (quantity < 20) {
print("Not enough for today.")
} else if (quantity > 20 & quantity <= 30) {
print("Average day.")
} else {
print("What a great day!")
}
Divisible by 5 & 3
Check by means of a control statement if a number is divisible by 5,
3, both or none.
Hint: make use of the &
logical operator if you don’t
like math.
Solution
number <- 15
# if you don't like math
if ((number %% 5) == 0 & (number %% 3) == 0) {
print("Divisible by 5 and 3")
} else if ((number %% 5) == 0) {
print("Divisible by 5")
} else if ((number %% 3) == 0) {
print("Divisible by 3")
} else {
print("None")
}
# if you like math
if ((number %% 15) == 0) {
print("Divisible by 5 and 3")
} else if ((number %% 5) == 0) {
print("Divisible by 5")
} else if ((number %% 3) == 0) {
print("Divisible by 3")
} else {
print("None")
}
Take Control!
In this exercise, you will combine everything that you’ve learned so far: relational operators, logical operators and control constructs. You’ll need it all!
Code two values beforehand: li
and fb
,
denoting the number of profile views your LinkedIn and Facebook profile
had on the last day of recordings. Go through the instructions to create
R code that generates a “social media score”, sms
, based on
the values of li
and fb
.
Create the control-flow construct with the following behavior:
- initialize the li and fb variable to 9 and 15.
- if both li and fb are 15 or higher, set sms equal to double the sum
of li and fb.
- if both li and fb are strictly below 10, set sms equal to half the
sum of li and fb.
- in all other cases, set sms equal to li + fb.
- finally, print the resulting sms variable to the console.
Solution
# Variables related to your last day of recordings
li <- 15
fb <- 9
# Code the control-flow construct
if (li >= 15 & fb >= 15) {
sms <- 2 * (li + fb)
} else if (li < 10 & fb < 10) {
sms <- 0.5 * (li + fb)
} else {
sms <- li + fb
}
# Print the resulting sms to the console
sms
Looping Together
In the Control Flow process, Loops help you to repeat certain set of actions so that you don’t have to perform them repeatedly. Imagine you need to perform an operation 10 times, if you start writing the code for each time, the length of the program increases and it would be difficult for you to understand it later. But at the same time by using a loop, if I write the same statement inside a loop, it saves time and makes easier for code readability. It also gets more optimized with respect to code efficiency.
In the above image, repeat
and while
loops
help you to execute a certain set of rules until the condition is true.
A for
instead is a loop that is used when you know how many
times you want to repeat a block of statements. Now, if you know that
you want to repeat it for 10 times, then you go with for
loop but if you are not sure about how many times you want the code to
be repeated, you will go with repeat
or while
loop.
repeat
loop: helps to execute the same set of code again and again until a stop condition is met.
In the above flowchart, the code will respond in the following steps:
- first it will enter and execute a set of code
- next it will check the condition, if it is true it will go back and
execute the same set of code again until it is meant to be false
- if it is found to be false, it will directly exit the loop.
In R the repeat
loop statement has the following
syntax
repeat {
statement # executes some code
if (condition) { # check the condition
break # if the condition is met, exit the loop with break
}
}
and here is an example
# Run the following codes to see the output
x <- 1 # initialize a variable
repeat {
print(x) # execute some code
x <- x + 1 # increment the variable
if (x == 6) { # check the condition
break # if TRUE exit the loop
}
}
While
loop: helps to execute the same set of code again and again until a stop condition is met.
In the above flowchart, the code will respond in the following steps:
- first it will check the condition
- if it is found to be true, it will execute the set of code
- next, it again checks the condition, if its true it will execute the same code again. As soon as the condition is found to be false, it immediately exits the loop.
In R the while
loop statement has the following
sintax
while (condition) { # check the condition
statement # if TRUE executes code otherwise exit the loop
}
and here is an example
# Run the following codes to see the output
x <- 2 # initialize a variable
while (x < 1000) { # check the condition
x <- x^2 # if TRUE do something: square the variable
print(x) # and print its value
}
for
loop: is used when you need to execute a block of code several number of times (you must know how many times).
In the above flowchart, the code will respond in the following steps:
- first of all there is initialization where you specify how many
times you want the loop to repeat
- next it will execute the set of code for the specified number of times.
In R the for
loop statement has the following syntax
for (i in vector) {
statements
}
and here is an example
# Run the following codes to see the output
# Type 1
for (i in 1:10) {
print(paste("User", i))
}
# Type 2
x <- c(2, 4, 6, 8, 10)
for (i in x) {
print(paste("User", i))
}
# Type 3
x <- c("User 1", "User 2", "User 3", "User 4", "User 5")
for (i in 1:length(x)) {
print(x[i])
}
Now try to solve the following exercises.
Fill objects with loops
Creates a non-linear function by using the polynomial of x between 1 and 6 and store it in a list.
Solution
# Create an empty list
l <- list()
# Create a for statement to populate the list
for (i in seq(1, 6, by = 1)) {
list[[i]] <- i * i
}
print(list)
Loop over a list
Looping over a list is just as easy and convenient as looping over a vector. Create a list containing few elements and try to extract its element sequentially.
Solution
# Create a list with three vectors
fruit <- list(Basket = c('Apple', 'Orange', 'Passion fruit', 'Banana'),
Money = c(10, 12, 15),
purchase = FALSE)
for (p in fruit) {
print(p)
}
Loop over a matrix
A matrix has 2-dimension, rows and columns. To iterate over a matrix, we have to define two for loop, namely one for the rows and another for the column. Create a matrix and use for loops to extract every value sequentially.
Solution
# Create a matrix
mat <- matrix(data = seq(11, 22, by = 1), nrow = 6, ncol = 2)
mat
# Create the loop with r and c to iterate over the matrix
for (r in 1:nrow(mat)) {
for (c in 1:ncol(mat)) {
print(paste("Row", r, "and column", c, "have values of", mat[r, c]))
}
}
Combine all together: The “Bizz-Buzz” Game
Create a numeric vector of length 100 and fill it with numbers from 1 to 100. Loops on this vector and print the numbers it contains except that:
- print “Bizz” instead of numbers divisible by 3;
- print “Buzz” instead of numbers divisible by 5;
- print “Bizz-Buzz” instead of numbers divisible by 15.
Solution
x <- 1:100
for (i in x) {
if (i %% 15 == 0) {
print("Bizz-Buzz")
} else if (i %% 5 == 0) {
print("Buzz")
} else if (i %% 3 == 0) {
print("Bizz")
} else {
print(i)
}
}
The map() functions
A very powerful tool of some programming languages (not all) and of R
is the so called Functional Programming. The idea of passing a function
to another function is extremely powerful idea, and it’s one of the
behaviors that makes R a functional programming language.
Here, we’ll focus on the purrr
package and the
map()
function which eliminates the need for many common
for loops.
The apply family of functions in base R
(apply()
, lapply()
, tapply()
,
etc) solve a similar problem, but purrr
is more consistent
and easier to learn.
install.packages("purrr")
library(purrr)
help("map")
The simplest map()
function allows essentially to apply
a function over vectors, lists, dataframes and matrices. It transforms
its input by applying a function to each element and returning a vector
(or list) the same length as the input. Here are few examples
# Map over list
l <- list(a = rnorm(100, 0, 1),
b = rnorm(100, 20, 1),
c = rnorm(100, -10, 2))
map(l, mean) # apply the function mean to all the element of list l
map(l, median) # apply the function median to all the element of list l
map(l, sd) # apply the function standard deviation to all the element of list l
# Map over dataframe
df <- as.data.frame(l) # transform l to a dataframe
map(df, mean) # apply the function mean to all columns of dataframe df
map(df, median) # apply the function median to all columns of dataframe df
map(df, sd) # apply the function standard deviation to all columns of dataframe df
# Map over matrix
mat <- as.matrix(df)
map(mat, mean) # apply the function mean to all columns of matrix mat
map(mat, median) # apply the function median to all columns of matrix mat
map(mat, sd) # apply the function standard deviation to all columns of matrix mat
Functions
Most of R’s work is done with functions which arguments are given within parentheses. Users can write their own functions, and these will have exactly the same properties than other functions in R. Writing your own functions allows an efficient, flexible, and rational use of R.
Define a function
As in other programming language, in R the concept of function is
directly built on that of mathematical functions, that is “a
relation between some inputs and an output”.
In the context of computer programming we can state this principle as
the black-box rule.
Hence, we can imagine functions as black-box taking inputs and giving back an output. They essentially transforms inputs into an output.
In R a function is defined in the following way
# Function of one variable
my_fun <- function(input) { # the function takes an input (or argument)
body # the body is code to be executed when the function is called
return(output) # the function return an output using the return function
}
# Function of more than one variable
my_fun <- function(input1, input2) {
body
return(output)
}
Notice that this recipe uses the assignment operator
(<-
) just as if you were assigning a vector to a
variable for example. This is not a coincidence. Creating a function in
R basically is the assignment of a function object to a
variable! In the recipe above, you’re creating a new R variable
my_fun
, that becomes available in the workspace as soon as
you execute the definition. From now on, you can use the
my_fun
as a function.
The second important rule regarding R functions is related to how arguments are matched. Argument matching in R may be summarized in the following way:
Consider for example the function mean()
, which computes
the standard mean of a vector of numbers
arguments may be matched by name, meaning that you
explicit write the name of the arguments you want to use and assign them
their values, or by position, you just impute the
values in the order of the arguments and let R assigning them. You can
check the arguments of a function through the help()
or the
args()
commands.
Finally, the third fundamental rule of R function is that arguments may have default values.
Now it’s time to create your own functions!
Create functions
Try to solve the following exercises that teach you how to build
functions in R and compare your results with the proposed
solutions.
Remember: most of the time, there is no unique way to code something but
the results must be the same.
Fun Hello World
Create a function that prints to the console a character string you want and use it with “Hello World, this is my first function!”
Solution
new_print <- function(string) {
return(print(string)) # with return
}
# or
new_print <- function(string) {
print(string) # without return
}
new_print("Hello World, this is my first function!")
# Try it with numbers!
# As you saw, this is simply not necessary at all since you just need to call
# the print() function to do this.
Fun Hello World 2
Now create a new version of the previous function. This time it
should take two arguments and combine them together in a single string
to be printed out.
Hint: make use of the function paste()
.
Solution
new_print2 <- function(string1, string2) {
string_complete <- paste(string1, string2)
return(print(string_complete)) # with return
}
new_print2("Hello World", "this is my second function!")
Square Root
Create a function that performs the square root of a number. Hint: the square root is just an fractional exponential.
Solution
new_sqrt <- function(x) {
x_square_root <- x ^ (1/2)
return(x_square_root)
}
sqrt(4)
new_sqrt(4)
# Try it on vector
sqrt(c(4, 9, 16))
new_sqrt(c(4, 9, 16))
# it works perfectly since the ^ (exp) operator in R is vectorised!
(E)very Power
Now modify the previous function to be able to apply any kind of
power you want, with default to 2.
Hint: this new function should have two arguments.
Solution
power <- function(x, exp = 2) {
x_power <- x ^ exp
return(x_power)
}
power(4) # if you call it without specifying the argument exp it apply the square
power(4, 1/2)
power(3, 3)
power(27, 1/3)
Is 2020 a leap year?
In this exercise you have to combine many topics that you have learnt
so far.
Follow the instruction to create a function that tests if an year is or
not a leap year.
- if a year is divisible by 4, 100 and 400, it’s a leap year;
- if a year is divisible by 4 and 100 but not divisible by 400, it’s
not a leap year;
- if a year is divisible by 4 but not divisible by 100, it’s a leap
year;
- if a year is not divisible by 4, it’s not a leap year.
Moreover, control that the year passed to the function is numeric
with is.numeric()
. In the case it doesn’t exit the function
printing an error message by means of the stop()
function.
Solution
leap_year <- function(year) {
if (!is.numeric(year)) { # is.numeric returns TRUE if it numeric, FALSE otherwise, hence use ! to negate
stop("The year argument must be numeric!") # stop the execution
}
if ((year %% 4) == 0) {
if ((year %% 100) == 0) {
if ((year %% 400) == 0) {
print(paste(year, "is a leap year.")) # divisible by 4, 100 and 400
} else { print(paste(year, "is not a leap year.")) } # divisible by 4 and 100 but not by 400
} else { print(paste(year, "is a leap year.")) } # divisible by 4 but not by 100
} else { print(paste(year, "is not a leap year.")) } # not divisible by 4
} # close the function
leap_year(2020)
leap_year("2020")
Extra Topics
Tidy Data
It is often said that 80% of data analysis is spent on the cleaning
and preparing data. To get a handle on the problem, there are some
aspects of data cleaning that are called data tidying:
structuring datasets to facilitate analysis.
The principles of tidy data provide a standard way to organize data
values within a dataset. A dataset is a collection of
values, usually either numbers (if quantitative) or strings (if
qualitative). Values are organised in two ways. Every
value belongs to a variable and an observation. A
variable contains all values that measure the same
underlying attribute (like height, temperature, duration) across units.
An observation contains all values measured on the same
unit (like a person, or a day, or a race) across attributes.
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Most of these concepts are taken from Hadley Wickham’s Tidy Data paper.
Importing Data with R
Nowadays data can be stored everywhere: on databases, on cloud storage containers, on web pages and also on local files. Nonetheless one of the most common ways of storing data is on csv (comma separated value) or xlsx (Excel) files. Hence, it is fundamental to be able to import data in R from those messy format.
In R there exist many packages that include functions to work with
data stored in different file formats. The readr
and
readxl
package belongs to the Tidyverse and are very useful
libraries to import data from csv, the former, and xlsx, the latter.
This packages include the functions read_csv() and read_excel() that allow exactly to import data in R in a common R structure (usually a dataframe, possibly a tidy one).
install.packages(c("readr", "readxl"))
library(readr)
library(readxl)
read_csv(file)
read_excel(path, sheet)
RStudio allows also to import data in a more user friendly and less programmatic way. Indeed, you can navigate through the File pane on bottom right until you reach the csv or Excel file, click on the file name and select “Import Dataset”.
Statistics
“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” Aaron Levenstein
“Statistics is a branch of mathematics working with data collection, organization, analysis, interpretation and presentation.” Wikipedia It is the science of data, and uses probability to make conclusions regarding reality.
There are two main branches of statistics:
- descriptive statistics which is only concerned with
properties of the observed data, that is it is just the analysis of the
data at hand.
- inferential statistics which uses data analysis to deduce properties of a population and build general models.
As already said, R is a statistical programming language and it means that it can be used to do as many statistical analysis as you can imagine. Almost every model and statistical technique is implemented in R, which means that there exists at least one R package doing the analysis you are looking for.
The aim of this section is just to give a very brief introduction on some basic tools of descriptive statistics and a light presentation of the most known inferential method, the linear regression model, with R.
If you want more details on these topics you can use the following free online material
Descriptive Statistics
# Run the following codes to see the output
# Create a population (uniformly distributed)
p <- rep(-5:5, 100000) # population
# Univariate Case
set.seed(2507)
x <- sample(p, 1000) # sample 1
min(x)
max(x)
mean(x) # mean
var(x) # variance
sd(x) # standard deviation
table(x) # absolute frequencies
prop.table(table(x))*100 # relative frequencies
median(x) # median
quantile(x, probs = c(0.25, 0.5, 0.75)) # quartiles
summary(x) # some descriptive statistics
Hmisc::describe(x)
hist(x)
boxplot(x)
# Bivariate Case
set.seed(2607)
y <- sample(p, 1000) # sample 2
summary(y)
cov(x, y, method = "pearson") # low covariance
cor(x, y, method = "pearson") # zero correlation
cor(sort(x), sort(y), method = "pearson") # perfect positive correlation
cor(sort(x, decreasing = TRUE), sort(y), method = "pearson") # perfect negative correlation
library(PerformanceAnalytics)
chart.Correlation(cbind(x, y), histogram = TRUE, pch = 19)
chart.Correlation(cbind(sort(x), sort(y)), histogram = TRUE, pch = 19)
chart.Correlation(cbind(sort(x, decreasing = TRUE), sort(y)), histogram = TRUE, pch = 19)
my_data <- mtcars[, c(1, 3, 4, 5, 6, 7)]
chart.Correlation(my_data, histogram = TRUE, pch = 19)
Inference: the Linear Regression model
“All models are wrong, but some are useful.” George E. P. Box
data(marketing, package = "datarium")
summary(marketing)
table(marketing == 0)
marketing <- marketing + 0.01 # to avoid logaritmic problems
chart.Correlation(marketing, histogram = TRUE, pch = 19)
# Estimate a multiple linear regression model
mod <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(mod)
marketing <- round(log1p(marketing), 2) # transform variables to log scale
names(marketing) <- paste0("log_", names(marketing)) # change names
# Estimate a multiple log-linear regression model
mod_log <- lm(log_sales ~ log_youtube + log_facebook + log_newspaper, data = marketing)
summary(mod_log)
JOINS in R
There exists at least five possible joining operations that you can do when working with data:
- left join
- right join
- inner join
- anti join
- full join
These can be easily performed in R by means of the base
merge
function. Otherwise, the package dplyr
from the Tidyverse contains the very same SQL functions:
left_join()
, right_join()
, etc.
Conclusions
Don’t be scared, R does not eat humans and remember:
Google is always your best friend, but…
questions need to be well defined otherwise the answer will be 42!