Exploratory Data Analysis | Data Science Specialization | Coursera | Course

      Brief Information
  • Name : Exploratory Data Analysis (the 4th course of the Data Science Specialization on Coursera)
  • Lecturer : Roger D. Peng
  • Duration: 2015-08-05 ~ 08-30 (4 weeks)
  • Course : Data Science Specialization in Coursera
  • Syllabus : [Syllabus] Coursera) DSS) #4) Exploratory Data Analysis
  • Course Content
    • Making exploratory graphs
    • Principles of analytic graphics
    • Plotting systems and graphics devices in R
    • The base, lattice, and ggplot2 plotting systems in R
    • Clustering methods
    • Dimension reduction techniques
  • In short
    • This course covers the essential exploratory techniques for summarizing data. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.

Lectures

Week 1
#1 Principles of Analytic Graphics
  1. Show comparisons
  2. Show causality, mechanism, explanation, systematic structure
  3. Show multivariate data
  4. Integrate multiple modes of evidence
  5. Describe and document the evidence with appropriate labels, scales, sources, etc.
  6. Content is king
#2 Exploratory Graphs

Exploratory graphs are made for personal understanding of the data, not for communicating results to others.

  • Simple Summaries of Data
    • 1 Dimension
      • five-number summary (Min, 1st Qu., Median, 3rd Qu., Max) – summary(), which also reports the Mean
      • boxplot – boxplot()
      • histogram – hist()
        • overlaying features – abline()
      • density plot – plot(density()); rug() adds one tick mark per observation beneath a plot
      • barplot – barplot()
    • 2 Dimensions
      • Multiple/overlaid 1-D plots (Lattice/ggplot2)
        • Multiple boxplot
        • multiple histogram
      • scatterplots – with( , plot())
      • smooth scatterplots
    • >2 Dimensions
      • Overlaid/multiple 2-D plots; coplots
      • Use color, size, shape to add dimensions
      • Spinning plots
      • Actual 3-D plots (not that useful)
  • The List of Functions used in this lecture
    • Main functions
      • summary, boxplot, hist, barplot, plot
    • Setting the graphic device
      • par (its parameters are important)
      • with
    •  Plot
      • plot, boxplot, hist, barplot
    • Plot additional information
      • rug, abline
    •  Etc.
      • summary, subset
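
The functions listed above can be tried on any numeric column. The following is a minimal sketch that assumes the built-in airquality data set as a stand-in for the course data; the colors and the number of breaks are arbitrary choices, not values from the lecture.

```r
# Assumption: the built-in airquality data set (datasets package) stands in
# for the course data. Missing ozone readings are dropped first.
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# summary() reports Min, 1st Qu., Median, Mean, 3rd Qu., Max
ozone_summary <- summary(ozone)
print(ozone_summary)

# One-dimensional plots
boxplot(ozone, col = "lightblue", main = "Ozone")

hist(ozone, col = "green", breaks = 20, main = "Ozone")
rug(ozone)                                           # one tick per observation
abline(v = median(ozone), col = "magenta", lwd = 3)  # overlay the median

# Bar plot of a categorical variable: number of rows per month
month_counts <- table(airquality$Month)
barplot(month_counts, col = "wheat", main = "Observations per month")

# Two dimensions: scatterplot of ozone against wind speed
with(airquality, plot(Wind, Ozone))
```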
#3 Plotting Systems in R
  1. The base plotting system
    1. “artist’s palette” model
    2. plot, with, data
  2. The lattice system
    1. Entire plot specified by one function; conditioning
    2. xyplot
  3. The ggplot2 system
    1. Mixes elements of Base and Lattice
    2. qplot
#4 The Base Plotting System in R
  • the datasets package
  • Getting some important base graphics parameters
    • par("pch"), par("lty"), par("lwd"), par("col"), par("mfrow"), par("las"), par("bg")
  • Base plot with annotation
    • points()
      • with(data_frame, points(x, y, col = "red"))
    • title()
      • title(main = "title name")
    • legend()
      • legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
    • mtext()
  • Multiple Base plots
    • with(data_frame, {
      plot()
      plot()
      mtext()
      })
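
The annotation functions and the multiple-plot pattern above can be combined as in the sketch below. It assumes the built-in airquality data set and uses par(mfrow = ...) to lay out two panels; the titles, margins, and colors are illustrative choices.

```r
# Assumption: built-in airquality data; layout and margins chosen for illustration.
# par() returns the previous values, so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

with(airquality, {
  # Left panel: set up an empty plot, then add the points group by group
  plot(Wind, Ozone, type = "n", main = "Ozone and Wind")
  points(Wind[Month == 5], Ozone[Month == 5], col = "blue")
  points(Wind[Month != 5], Ozone[Month != 5], col = "red")
  legend("topright", pch = 1, col = c("blue", "red"),
         legend = c("May", "Other Months"))

  # Right panel: the next plot() call fills the next cell of the mfrow grid
  plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")

  # Overall title written into the outer margin
  mtext("Ozone and Weather in New York City", outer = TRUE)
})

par(old_par)  # restore the previous settings
```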
#5 Graphics Devices in R
  • Two types of graphics devices in R
    • Screen device – RStudioGD, windows
    • File device
      1. Vector format – pdf, svg, win.metafile, postscript
      2. Bitmap format – png, jpeg, tiff, bmp
  • Functions of the graphics devices
    • dev.cur, dev.set, dev.copy, dev.off
  • Two approaches to plotting
    • Plotting on a screen device
      1. Plot – plot, xyplot, …
      2. Annotate – title, abline, legend, points, …
    • Plotting on a file device
      1. Open a file device – pdf( file = file_name ), png( file = file_name), …
      2. Plot – same as the first approach
      3. Annotate – same as the first approach
      4. Close the file device – dev.off()
  • Copying Plots
    • plot → annotate → dev.copy → dev.off
    • plot → annotate → dev.copy2pdf (it closes the PDF device itself, so no dev.off is needed)
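
The file-device approach can be sketched as follows. The example assumes the built-in faithful data set and writes to a temporary directory; the file name is hypothetical.

```r
# Assumption: output goes to a temporary directory; the file name is made up.
pdf_path <- file.path(tempdir(), "faithful_plot.pdf")
devices_before <- length(dev.list())      # number of devices open beforehand

pdf(file = pdf_path)                      # 1. open the file device
with(faithful, plot(eruptions, waiting))  # 2. plot (nothing appears on screen)
title(main = "Old Faithful Geyser data")  # 3. annotate
opened <- names(dev.cur())                # name of the active device
dev.off()                                 # 4. close the device to finish the file

file.exists(pdf_path)
```

The file is only complete once dev.off() has been called, so forgetting step 4 is a common reason for an empty or unreadable output file.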
Week 2
#6 The Lattice Plotting System in R
  • The lattice plotting system
    • Plotting and annotation are constructed with a single function call
    • Good for the same kind of plot under many different conditions
    • Using panel functions, a lattice plot function can be specified/customized.
    • packages: lattice, grid
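
A minimal sketch of a conditioned lattice plot with a custom panel function is shown below. It assumes the built-in airquality data set; the median line is only an example of a panel customization.

```r
library(lattice)

# Assumption: built-in airquality data. Month becomes a factor so that it
# can serve as the conditioning variable.
aq <- transform(airquality, Month = factor(Month))

# A single call specifies the whole plot: Ozone against Wind, one panel per month.
# The panel function controls what is drawn inside every panel.
p <- xyplot(Ozone ~ Wind | Month, data = aq, layout = c(5, 1),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)                             # default scatterplot
              panel.abline(h = median(y, na.rm = TRUE), lty = 2)  # per-panel median
            })

class(p)   # lattice functions return an object of class "trellis"
print(p)   # printing the object is what draws the plot
```

At the console the object is auto-printed; inside a function, loop, or sourced script an explicit print() is needed.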
#7 Plotting with ggplot2
  • qplot
  • ggplot
    • how to print ggplot
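
A ggplot object is built up in layers and drawn only when it is printed. The sketch below assumes that the ggplot2 package is installed (it is skipped otherwise) and uses the built-in airquality data set; the choice of a linear smoother is illustrative.

```r
# Assumption: ggplot2 is installed; the block does nothing if it is not.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Drop missing ozone readings and treat Month as a factor for faceting
  aq <- subset(airquality, !is.na(Ozone))
  aq$Month <- factor(aq$Month)

  # Build the plot object layer by layer; nothing is drawn yet
  g <- ggplot(aq, aes(x = Wind, y = Ozone)) +
    geom_point() +                                # scatterplot layer
    geom_smooth(method = "lm", formula = y ~ x) + # smoother layer
    facet_wrap(~ Month)                           # one panel per month

  print(g)  # explicit print() draws the plot, also inside scripts and functions
}
```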
Week 3
#8 Hierarchical Clustering
  • Clustering Analysis
  • Steps
    1. Find the two closest things (the pair with the smallest distance).
    2. Merge the two things into one dot and define the location of the merged group.
    3. If more than one dot remains, go to step 1. If not, go to step 4.
    4. Make a hierarchical tree graph.
  • Functions
    • dist
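    • hclust
      dataFrame <- data.frame(x = x, y = y)  # x, y : numeric coordinate vectors
      distxy <- dist(dataFrame)              # pairwise distances (Euclidean by default)
      hClustering <- hclust(distxy)          # complete linkage by default
      plot(hClustering)                      # draw the dendrogram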
      
    • myplclust
    • heatmap
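
The snippet above needs x and y to exist. A self-contained sketch follows, assuming simulated data with three groups of four points; the seed, group means, and spread are illustrative assumptions.

```r
# Assumption: simulated data, three groups of four points each.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

distxy <- dist(dataFrame)        # pairwise Euclidean distances
hClustering <- hclust(distxy)    # agglomerative clustering, complete linkage
plot(hClustering)                # dendrogram

# Cut the tree into three groups and count the members of each
groups <- cutree(hClustering, k = 3)
table(groups)

# heatmap() clusters rows and columns and reorders the matrix accordingly
heatmap(as.matrix(dataFrame))
```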
#9 K-means Clustering
  • Clustering Analysis
  • Steps
    1. Fix the number of clusters
    2. Pick an initial “centroid” for each cluster
    3. Assign things to the closest centroid
    4. Recalculate the centroids
    5. If the recalculated centroids differ from the previous ones, go to step 3. If not, go to step 6.
    6. Plot the clusters
  • Functions
    • kmeans
      dataFrame <- data.frame(x, y) # x : x-coordinates, y : y-coordinates of the points
      kmeansObj <- kmeans(dataFrame, centers = 3) # centers : the number of clusters (or a matrix of starting centroids)
      names(kmeansObj)    # components of the result
      kmeansObj$cluster   # cluster assignment of each point
      
    • heatmap
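
The steps above can be written out by hand to see what the iteration does. This is a rough sketch, not how kmeans() is implemented: it assumes simulated data and hand-picked starting centroids (points 1, 5, and 9), and it does not handle empty clusters.

```r
# Assumption: simulated data, three groups of four points each.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
pts <- cbind(x, y)

k <- 3                                        # step 1: fix the number of clusters
centroids <- pts[c(1, 5, 9), , drop = FALSE]  # step 2: hand-picked starting centroids

for (iter in 1:100) {
  # step 3: distance from every point to every centroid, then pick the closest
  d <- as.matrix(dist(rbind(centroids, pts)))[-(1:k), 1:k]
  cluster <- unname(apply(d, 1, which.min))

  # step 4: recompute each centroid as the mean of its assigned points
  new_centroids <- t(sapply(1:k, function(j) {
    colMeans(pts[cluster == j, , drop = FALSE])
  }))

  # step 5: stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-12)) break
  centroids <- new_centroids
}

# step 6: plot the points colored by cluster, with centroids marked as crosses
plot(x, y, col = cluster, pch = 19)
points(centroids, col = 1:k, pch = 3, cex = 2, lwd = 2)
```

kmeans(pts, centers = 3) performs the same kind of iteration, but by default it uses a different algorithm (Hartigan-Wong) and random starting centroids, so results can vary between runs unless a seed is set.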
#10 Principal Components Analysis and Singular Value Decomposition
#11 Plotting and Color in R
  • Summary
    • Careful use of colors in plots/maps/etc. can make it easier for the reader to get what you’re trying to say.
    • The RColorBrewer package is an R package that provides color palettes for sequential, categorical, and diverging data.
    • The colorRamp and colorRampPalette functions can be used in conjunction with color palettes to connect data to colors.
    • Transparency can sometimes be used to clarify plots with many points.
  • grDevices package (imported by default)
    • colorRamp
    • colorRampPalette
    • colors : this function returns the names of colors in R.
  • RColorBrewer package
    • brewer.pal
  • others
    • rgb function : produces a color from red, green, and blue components (with optional transparency)
    • colorspace package : offers finer control over colors and converts between different color spaces.
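
The grDevices functions listed above can be tried as in the following sketch. It uses only functions that ship with R; the chosen colors, the alpha value, and the random data are illustrative.

```r
# colorRamp() returns a function that maps values in [0, 1] to RGB rows (0-255)
pal <- colorRamp(c("red", "blue"))
pal(0)      # the first color: red
pal(1)      # the last color: blue
pal(0.5)    # halfway between the two

# colorRampPalette() returns a function that gives n hex color strings
pal2 <- colorRampPalette(c("red", "yellow"))
two_cols <- pal2(2)     # the two end points
ten_cols <- pal2(10)    # ten colors interpolated from red to yellow

# rgb() builds a color from components; alpha adds transparency
semi <- rgb(0, 0, 0, alpha = 0.2)

# Transparency makes dense regions visible in a plot with many points
set.seed(1)
plot(rnorm(2000), rnorm(2000), col = semi, pch = 19)

# colors() returns the names of the built-in colors
head(colors())
```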
Week 4
#12 EDA Case Study – Understanding Human Activity with Smart Phones
  • A case study: follow the instructions and work through the whole process of exploratory data analysis.

Quiz
Quiz 1
  1. principles of analytic graphics
  2. the role of exploratory graphs in data analysis
  3. the base plotting system
  4. an example of a valid graphics device in R
  5. an example of a vector graphics device in R
  6. Bitmapped file formats can be most useful for
  7. functions used to annotate a plot in the base graphics system
  8. the function that opens the screen graphics device on Windows – windows()
  9. the options of par()
  10. If I want to save a plot to a PDF file, …
Quiz 2
  1. Under the lattice graphics system, the primary plotting functions return an object of class trellis.
  2. Predict the result that is produced by a particular lattice plot – xyplot()
  3. Which of the following functions can be used to annotate the panels in a multi-panel lattice plot? – llines()
  4. The auto-print process in the lattice graphics system
  5. trellis.par.set() can be used to finely control the appearance of all lattice plots in the lattice system.
  6. ggplot2 is an implementation of the Grammar of Graphics developed by Leland Wilkinson.
  7. When examining how the relationship between ozone and wind speed varies across each month, what would be the appropriate code to visualize that using ggplot2?
  8. geom in the ggplot2 system
  9. The process to make a plot appear using ggplot()
  10. How to modify the given qplot() code to add a smoother to the scatterplot – add geom_smooth()

Course Project
Course Project 1

[Course Project 1] Exploratory Data Analysis – Coursera

Course Project 2

[Course Project 2] Exploratory Data Analysis – Coursera

swirl Programming Assignment

Result
Scores
  • Total = ?/100 points
    • Quiz 1 = 20/20 points
    • Quiz 2 = 20/20 points
    • Course Project 1 = 25/25 points
    • Course Project 2 = ?/35 points
Certificate
  • append the link

References
References Provided in this Course
Useful References
