Brief Information
 Name : Exploratory Data Analysis (the 4th course of Data Science Specialization in Coursera)
 Lecturer : Roger D. Peng
 Duration : 2015-08-05 ~ 2015-08-30 (4 weeks)
 Course : Data Science Specialization in Coursera
 Syllabus : [Syllabus] Coursera) DSS) #4) Exploratory Data Analysis
 Course Content
 Making exploratory graphs
 Principles of analytic graphics
 Plotting systems and graphics devices in R
 The base, lattice, and ggplot2 plotting systems in R
 Clustering methods
 Dimension reduction techniques
 In short
 This course covers the essential exploratory techniques for summarizing data. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.
Lectures
Week 1
#1 Principles of Analytic Graphics
 Show comparisons
 Show causality, mechanism, explanation, systematic structure
 Show multivariate data
 Integrate multiple modes of evidence
 Describe and document the evidence with appropriate labels, scales, sources, etc
 Content is king
#2 Exploratory Graphs
Exploratory graphs are made for personal understanding, not for communicating results.
 Simple Summaries of Data
 1 Dimension
 five-number summary plus the mean (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) – summary()
 boxplot – boxplot()
 histogram – hist()
 overlaying features – abline()
 density plot – plot(density()); rug() adds a tick for each data point under a histogram
 barplot – barplot()
 2 Dimensions
 Multiple/overlaid 1D plots (Lattice/ggplot2)
 Multiple boxplot
 multiple histogram
 scatterplots – with( , plot())
 smooth scatterplots
 >2 Dimensions
 Overlaid/multiple 2D plots; coplots
 Use color, size, shape to add dimensions
 Spinning plots
 Actual 3D plots (not that useful)
 The List of Functions used in this lecture
 Main functions
 summary, boxplot, hist, barplot, plot
 Setting the graphic device
 par (its parameters are important)
 with
 Plot
 plot, boxplot, hist, barplot
 Plot additional information
 rug, abline
 Etc.
 summary, subset
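The functions listed for this lecture can be tried together on a built-in dataset. This is a minimal sketch, assuming the airquality data from the datasets package (my choice of example, not necessarily the lecture's data); the colors, the breaks value, and the Temp > 80 cutoff are arbitrary.

```r
# airquality ships with R's datasets package
data(airquality)

# Drop missing values so every 1D summary works on the same vector
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# summary() prints Min., 1st Qu., Median, Mean, 3rd Qu., Max.
ozone_summary <- summary(ozone)
print(ozone_summary)

# One dimension
boxplot(ozone, col = "blue")
hist(ozone, col = "green", breaks = 20)
rug(ozone)                                           # one tick per observation
abline(v = median(ozone), col = "magenta", lwd = 4)  # overlay the median
plot(density(ozone), main = "Density of Ozone")
barplot(table(airquality$Month), main = "Observations per month")

# Two dimensions
boxplot(Ozone ~ Month, data = airquality, col = "red")  # multiple boxplots
with(airquality, plot(Wind, Ozone))                     # scatterplot

# subset() keeps only the rows that match a condition
hot <- subset(airquality, Temp > 80)
```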
#3 Plotting Systems in R
 The base plotting system
 “artist’s palette” model
 plot, with, data
 The lattice system
 Entire plot specified by one function; conditioning
 xyplot
 The ggplot2 system
 Mixes elements of Base and Lattice
 qplot
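The difference between the "artist's palette" model and the one-call model is easiest to see side by side. A small sketch, assuming the airquality dataset as example data; ggplot2 is left out here because it is not part of a default R install.

```r
library(lattice)   # lattice is a recommended package shipped with R
data(airquality)

# Base system: start with a plot, then add to it one call at a time
with(airquality, plot(Wind, Ozone))
abline(lm(Ozone ~ Wind, data = airquality), col = "red")
title(main = "Ozone and Wind (base)")

# Lattice system: the whole plot, including the conditioning on Month,
# is specified in a single function call
p <- xyplot(Ozone ~ Wind | factor(Month), data = airquality, layout = c(5, 1))
print(p)
```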
#4 The Base Plotting System in R
 the datasets package
 Getting some important base graphics parameters
 par("pch"), par("lty"), par("lwd"), par("col"), par("mfrow"), par("las"), par("bg")
 Base plot with annotation
 points()
 with(data_frame, points(x, y, col = "red"))
 title()
 title(main = "title name")
 legend()
 legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
 mtext()
 Multiple Base plots
 with(data_frame, {
plot()
plot()
mtext()
})
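Putting the annotation functions and the multiple-plot pattern together might look like the sketch below. It assumes the airquality dataset (the "May" / "Other Months" legend above suggests it, but that is my reading); the titles and the mfrow/oma values are illustrative.

```r
data(airquality)

# Base plot with annotation: draw empty axes first, then add pieces
with(airquality, plot(Wind, Ozone, type = "n"))
title(main = "Ozone and Wind in New York City")
may   <- subset(airquality, Month == 5)
other <- subset(airquality, Month != 5)
with(may,   points(Wind, Ozone, col = "blue"))
with(other, points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"),
       legend = c("May", "Other Months"))

# Multiple base plots: par(mfrow) sets a 1 x 2 grid, oma leaves room
# for an overall title written by mtext(outer = TRUE)
old_par <- par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
with(airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")
  plot(Temp, Ozone, main = "Ozone and Temperature")
  mtext("Ozone and Weather in New York City", outer = TRUE)
})
par(old_par)   # restore the previous settings
```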
#5 Graphics Devices in R
 Two types of graphics devices in R
 Screen device – RStudioGD, windows
 File device
 Vector format – pdf, svg, win.metafile, postscript
 Bitmap format – png, jpeg, tiff, bmp
 Functions of the graphics devices
 dev.cur, dev.set, dev.copy, dev.off
 Two approaches to plotting
 Plotting on a screen device
 Plot – plot, xyplot, …
 Annotate – title, abline, legend, points, …
 Plotting on a file device
 Open a file device – pdf( file = file_name ), png( file = file_name), …
 Plot – same as the first approach
 Annotate – same as the first approach
 Close the file device – dev.off()
 Copying Plots
 plot → annotate → dev.copy → dev.off
 plot → annotate → dev.copy2pdf (it closes the PDF device itself, so no dev.off is needed)
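The file-device approach (open, plot, annotate, close) can be sketched as follows. The faithful dataset and the file name are assumptions chosen for illustration; the file goes to a temporary directory so nothing is left behind.

```r
data(faithful)

# Hypothetical output path inside the session's temporary directory
out_file <- file.path(tempdir(), "faithful.pdf")

pdf(file = out_file)                      # 1. open a file device
with(faithful, plot(eruptions, waiting))  # 2. plot (nothing shows on screen)
title(main = "Old Faithful Geyser data")  # 3. annotate
dev.off()                                 # 4. close the device to write the file

file.exists(out_file)

# Copying from a screen device instead (interactive sessions only):
#   with(faithful, plot(eruptions, waiting))
#   dev.copy(pdf, file = "geyser.pdf")
#   dev.off()
```

A file is only complete after the device is closed, so forgetting dev.off() is the usual reason for an empty or unreadable PDF.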
Week 2
#6 The Lattice Plotting System in R
 The lattice plotting system
 Plotting and annotation are constructed with a single function call
 Good for the same kind of plot under many different conditions
 Using panel functions, a lattice plot function can be specified/customized.
 packages: lattice, grid
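A panel function is called once per panel with that panel's x and y values. The sketch below uses simulated data (the seed, group labels, and formula for y are made up for illustration) and adds a median line and a regression line to every panel.

```r
library(lattice)

set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)   # slope differs between groups
f <- factor(f, labels = c("Group 1", "Group 2"))

p <- xyplot(y ~ x | f, layout = c(2, 1),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)               # draw the points first
              panel.abline(h = median(y), lty = 2)  # horizontal line at the median
              panel.lmline(x, y, col = 2)           # simple linear regression line
            })

class(p)   # lattice functions return an object of class "trellis"
print(p)   # printing the object is what draws the plot
```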
#7 Plotting with ggplot2
 qplot
 ggplot
 how to print ggplot
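A ggplot object is built up with + and only drawn when it is printed. The sketch below assumes ggplot2 is installed (the requireNamespace guard skips the example otherwise) and uses the airquality dataset as example data. It uses ggplot() rather than qplot(), since recent ggplot2 releases mark qplot() as deprecated.

```r
# Assumption: ggplot2 is installed; the guard skips the example if it is not
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  data(airquality)

  # Month must be a factor to be used as a facetting variable with clear labels
  aq <- transform(airquality, Month = factor(Month))

  g <- ggplot(aq, aes(Wind, Ozone)) +           # data and aesthetics
    geom_point(na.rm = TRUE) +                  # geom: points
    geom_smooth(method = "lm", na.rm = TRUE) +  # add a smoother
    facet_wrap(~ Month)                         # one panel per month

  print(g)   # nothing is drawn until the object is printed
}
```

At the console, typing g alone autoprints the plot; inside a function or a loop the explicit print(g) is needed.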
Week 3
#8 Hierarchical Clustering
 Clustering Analysis
 Steps
 1. Find the two closest things.
 2. Group the two things as one dot and define the location of the group.
 3. If more than one dot remains, go to step 1. If not, go to step 4.
 4. Make a hierarchical tree graph (dendrogram).
 Functions
 dist
 hclust
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
plot(hClustering)
 myplclust
 heatmap
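A complete run of the steps above, as a sketch on simulated data. The seed, the three simulated groups, and the choice of k = 3 in cutree are assumptions for illustration.

```r
# Simulate 12 points in three loose groups
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame)        # pairwise Euclidean distances (the default)
hClustering <- hclust(distxy)    # complete linkage by default
plot(hClustering)                # the hierarchical tree (dendrogram)

# Cut the tree into 3 groups to get a cluster label per point
groups <- cutree(hClustering, k = 3)
table(groups)

# heatmap() clusters both rows and columns of a matrix
heatmap(as.matrix(dataFrame))
```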
#9 K-means Clustering
 Clustering Analysis
 Steps
 1. Fix the number of clusters.
 2. Get the starting "centroids" of each cluster.
 3. Assign things to the closest centroid.
 4. Recalculate the centroids.
 5. If the recalculated centroids differ from the previous ones, go to step 3. If not, go to step 6.
 6. Plot the clusters.
 Functions
 kmeans

dataFrame <- data.frame(x, y)  # x : the x-axis coordinates, y : the y-axis coordinates
kmeansObj <- kmeans(dataFrame, centers = 3)  # centers : the number of centroids
names(kmeansObj)
kmeansObj$cluster
 heatmap
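The same kind of simulated data can be used to try kmeans end to end. This is a sketch: the seed, the simulated groups, and nstart = 10 (several random starts, which the notes above do not mention) are my assumptions.

```r
# Simulate 12 points in three loose groups
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

dataFrame <- data.frame(x, y)

# centers = 3 fixes the number of clusters; nstart reruns the algorithm
# from several random starting centroids and keeps the best result
kmeansObj <- kmeans(dataFrame, centers = 3, nstart = 10)

names(kmeansObj)     # components of the result
kmeansObj$cluster    # cluster label assigned to each point

# Plot the points colored by cluster, with the final centroids as crosses
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```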
#10 Principal Components Analysis and Singular Value Decomposition
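 No notes were taken for this lecture, so here is only a small sketch of how the two techniques relate: the principal components from prcomp match the right singular vectors of the scaled data matrix, up to sign. The simulated matrix and the added shift are assumptions for illustration.

```r
# Simulated 40 x 10 matrix with a pattern: the first 20 rows are
# shifted upward in columns 6 to 10
set.seed(678910)
dataMatrix <- matrix(rnorm(400), nrow = 40)
dataMatrix[1:20, 6:10] <- dataMatrix[1:20, 6:10] + 3

# SVD of the centered and scaled matrix
svd1 <- svd(scale(dataMatrix))

# PCA on the same data, with scaling
pca1 <- prcomp(dataMatrix, scale. = TRUE)

# Share of the total variance explained by each component
variance_explained <- svd1$d^2 / sum(svd1$d^2)
plot(variance_explained, pch = 19,
     xlab = "Component", ylab = "Proportion of variance explained")

# First principal component vs first right singular vector:
# the points fall on a line through the origin (slope 1 or -1)
plot(pca1$rotation[, 1], svd1$v[, 1], pch = 19,
     xlab = "Principal component 1", ylab = "Right singular vector 1")
```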
#11 Plotting and Color in R
 Summary
 Careful use of colors in plots/maps/etc. can make it easier for the reader to get what you’re trying to say.
 The RColorBrewer package is an R package that provides color palettes for sequential, categorical, and diverging data.
 The colorRamp and colorRampPalette functions can be used in conjunction with color palettes to connect data to colors.
 Transparency can sometimes be used to clarify plots with many points.
 grDevices package (imported by default)
 colorRamp
 colorRampPalette
 colors : this function returns the names of colors in R.
 RColorBrewer package
 brewer.pal
 others
 rgb function : produces a color from red, green, and blue components (RGB system)
 colorspace package : gives finer control over colors and converts between color spaces.
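The grDevices functions above can be tried without installing anything. A minimal sketch; the color names, the palette size, and the alpha value of 0.2 are arbitrary choices, and the RColorBrewer lines are left as comments because that package is not part of a default install.

```r
# colorRamp returns a function mapping values in [0, 1] to RGB rows
pal <- colorRamp(c("red", "blue"))
pal(0)      # red end of the ramp
pal(0.5)    # halfway between red and blue
pal(1)      # blue end of the ramp

# colorRampPalette returns a function mapping an integer n to n hex colors
pal_fun <- colorRampPalette(c("red", "yellow"))
pal_fun(2)  # "#FF0000" "#FFFF00"
pal_fun(10) # ten colors from red to yellow

# rgb builds a color from components; alpha adds transparency
semi <- rgb(0, 0, 0, alpha = 0.2)   # black at 20% opacity

# Transparency shows where many points overlap
set.seed(1)
x <- rnorm(2000)
y <- rnorm(2000)
plot(x, y, pch = 19, col = semi)

# colors() returns the names of the built-in colors
head(colors())

# With RColorBrewer installed, a palette can feed colorRampPalette:
#   library(RColorBrewer)
#   cols <- brewer.pal(3, "BuGn")
#   pal_bugn <- colorRampPalette(cols)
```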
Week 4
#12 EDA Case Study – Understanding Human Activity with Smart Phones
 A case study: follow the instructions to work through the process of exploratory data analysis.
Quiz
Quiz 1
 principles of analytic graphics
 the role of exploratory graphs in data analysis
 the base plotting system
 an example of a valid graphics device in R
 an example of a vector graphics device in R
 Bitmapped file formats can be most useful for
 functions used to annotate a plot in the base graphics system
 the function opens the screen graphics device on Windows – windows()
 the options of par()
 If I want to save a plot to a PDF file, …
Quiz 2
 Under the lattice graphics system, the primary plotting functions return an object of class trellis.
 Predict the result that is produced by a particular lattice plot – xyplot()
 Which of the following functions can be used to annotate the panels in a multipanel lattice plot? – llines()
 The autoprint process in the lattice graphics system
 trellis.par.set() can be used to finely control the appearance of all lattice plots in the lattice system.
 ggplot2 is an implementation of the Grammar of Graphics developed by Leland Wilkinson.
 When examining how the relationship between ozone and wind speed varies across each month, what would be the appropriate code to visualize that using ggplot2?
 A geom in the ggplot2 system
 The process to make a plot appear using ggplot()
 How to modify the given qplot() code to add a smoother to the scatterplot – add geom_smooth()
Course Project
Course Project 1
[Course Project 1] Exploratory Data Analysis – Coursera
Course Project 2
[Course Project 2] Exploratory Data Analysis – Coursera
swirl Programming Assignment
Result
Scores
 Total = ?/100 points
 Quiz 1 = 20/20 points
 Quiz 2 = 20/20 points
 Course Project 1 = 25/25 points
 Course Project 2 = ?/35 points
Certificate
 append the link