Getting and Cleaning Data | Data Science Specialization | Coursera | Course

Brief Information
  • Name : Getting and Cleaning Data (the 3rd course of Data Science Specialization in Coursera)
  • Lecturer : Jeff Leek
  • Duration: 2015-07-06 ~ 08-02 (4 weeks)
  • Course : Data Science Specialization in Coursera
  • Syllabus : Getting and Cleaning Data – Coursera
  • In short
    • This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data “tidy”. Tidy data dramatically speed downstream data analysis tasks. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.

 Lectures

Please provide a one-line summary.

Week 1
  1. Obtaining Data Motivation
  2. Raw and Processed Data
  3. Components of Tidy Data
    1. The 4 things you should have
      1. raw data
      2. tidy data
      3. code book
      4. instruction list
  4. Downloading Files
    1. setwd, file.exists, file.create, dir.exists, dir.create, download.file, list.files
  5. Reading Local Files
    1. read.table, read.csv
  6. Reading Excel Files
    1. xlsx package
      1. read.xlsx, read.xlsx2, write.xlsx
    2. XLConnect package
      1. more options than xlsx.
      2. XLConnect vignette: a good place to start
  7. Reading XML
    1. XML package
      1. xmlTreeParse, xmlRoot, xmlName, xmlSApply, xpathSApply, htmlTreeParse
  8. Reading JSON
    1. jsonlite package
      1. toJSON, fromJSON
    2. etc
      1. cat
  9. The data.table Package
    1. data.table package (fast, more functional than data.frame)
      1. data.table, tables, by, join, .N, setkey, merge, read.table
    2. etc.
      1. { } curly brackets, system.time, fread, melt, dcast
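The file-handling functions from lesson 4 and read.csv from lesson 5 can be combined roughly as in the sketch below. It is a minimal illustration, not the course's own code: to stay runnable offline it writes a small made-up CSV into a temporary directory instead of calling download.file, and the names data_dir, csv_path and cameras are assumptions made for this example.

```r
# Create a data directory only if it does not exist yet
# (data_dir is a hypothetical location under the session's temp dir).
data_dir <- file.path(tempdir(), "data")
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# In the course this file would come from
# download.file(url, destfile = csv_path); here a tiny made-up CSV
# is written instead so the sketch runs without a network connection.
csv_path <- file.path(data_dir, "cameras.csv")
writeLines(c("address,direction",
             "Main St,N/B",
             "Oak Ave,S/B",
             "Elm Rd,N/B"),
           csv_path)

list.files(data_dir)                      # shows what is in the directory
cameras <- read.csv(csv_path)             # header = TRUE and sep = "," by default
n_north <- sum(cameras$direction == "N/B")  # count rows with a particular value
n_north
```

Checking dir.exists before dir.create keeps the script safe to rerun, which matters when the same script is shared as part of the instruction list.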
Week 2
  1. Reading from MySQL
    1. RMySQL package
      1. dbConnect, dbDisconnect, dbListTables, dbReadTable, dbGetQuery, dbSendQuery, fetch, dbListFields
    2. concepts
      1. connection, database, table, field, query, fetch
    3. etc.
      1. dim – returns the number of rows and columns of a table
  2. Reading from HDF5
    1. source – executes an R script (i.e., runs the list of R instructions in the script)
  3. Reading from The Web
    1. Themes
      1. What is Web Scraping
      2. Parsing with XML
      3. GET from the httr package
      4. Accessing websites with passwords
    2. base package
      1. url, readLines, close
    3. XML package
      1. htmlTreeParse, xpathSApply
    4. httr package
      1. GET, content, htmlParse
  4. Reading from APIs
  5. Reading from Other Sources
Week 3
  1. Subsetting and Sorting
    1. subsetting
      1. [], [,], [number], [name]
    2. Dealing with missing values (NAs)
      1. which
    3. Sorting
      1. sort, order, arrange (plyr)
    4. Adding rows and columns
      1. table$column <- new_column, cbind, rbind
  2. Summarizing Data
  3. Creating New Variables
  4. Reshaping Data
  5. Managing Data Frames with dplyr - Introduction
  6. Managing Data Frames with dplyr - Basic Tools
  7. Merging Data
Week 4
  1. Editing Text Variables
  2. Regular Expression I
    1. Key
      1. literals, metacharacters
    2. Metacharacters
      1. ^, $, [aA], [a-z], [a-zA-Z], “”
  3. Regular Expression II
  4. Working With Dates
  5. Data Resources
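The metacharacters listed under Regular Expression I can be tried out with grepl and grep from base R. The sketch below is only an illustration; the strings in x are made up for this example and do not come from the lectures.

```r
# Hypothetical example strings, not taken from the course.
x <- c("i think we all rule", "Good morning", "bucket", "the end")

starts_with  <- grepl("^i think", x)   # ^ anchors the match at the start of the line
ends_with    <- grepl("end$", x)       # $ anchors the match at the end of the line
either_case  <- grepl("[Gg]ood", x)    # [Gg] matches any one of the listed characters

# [a-z] is a character range; combined with ^ it selects the lines
# that begin with a lowercase letter.
lower_start <- grep("^[a-z]", x, value = TRUE)

starts_with
ends_with
either_case
lower_start
```

grepl returns one logical value per element, while grep with value = TRUE returns the matching strings themselves.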

Quiz
Quiz 1
  1. Download and read a csv file and count rows that have a particular value.
  2. The principles of tidy data
  3. Download and read an Excel file and manipulate the data as given.
  4. Download and read an XML file and count rows that have a particular value.
  5. Download and read a csv file and find the fastest way to calculate the average value of a particular variable broken down by another variable using the data.table package.
Quiz 2
  1. Read a JSON file and extract the information the question asks for.
  2. sqldf package. Read a csv file and retrieve the table the question asks for using the sqldf function.
    1. Basic SQL understanding (how to use select, from and where) is needed.
  3. sqldf package. Same as question 2, except that the result must contain only unique values.
    1. select, distinct and from
  4. httr package. Read an HTML document and find the length of particular lines using nchar.
  5. Read a fixed-width formatted data file. The read.fwf function is used to get the required table and data.
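Reading a fixed-width file with read.fwf, as in question 5, can be sketched as follows. The two records, the column widths and the column names are invented for this illustration and are not the quiz data.

```r
# Two hypothetical fixed-width records: a 4-character year, a 3-character
# station code and a 5-character reading, with no delimiters between them.
fwf_path <- tempfile(fileext = ".for")
writeLines(c("2015ABC 12.5",
             "2016XYZ  7.0"),
           fwf_path)

# widths gives the number of characters in each column;
# a negative width would skip that many characters instead.
readings <- read.fwf(fwf_path, widths = c(4, 3, 5),
                     col.names = c("year", "station", "value"))
total <- sum(readings$value)
total
```

The widths have to be counted from the file itself, so it is usually worth printing the first few raw lines with readLines before choosing them.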
Quiz 3
  1. Reading a csv file. Understanding a code book. Using subsetting, select rows that satisfy certain conditions. Remove NA data from a data set using the which function.
  2. Reading a jpeg file. Find the 30th and 80th percentiles using the quantile function.
  3. Reading a csv file. Remove rows and columns from a data frame. Convert factor data to numeric data. Join two data frames using the merge function. Reorder a data frame using the arrange function in the plyr package. (Questions 3, 4 and 5 use the same data sources.)
  4. Find the average of particular groups in a data frame.
  5. Reorder a data frame using the arrange function in the plyr package. Count rows in a particular group.
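Most of the Quiz 3 steps can be reproduced with base R alone. The sketch below is a rough illustration on two made-up data frames (gdp and edu are hypothetical names, not the quiz files), and it uses order in place of plyr's arrange so that it runs without extra packages.

```r
# Two small made-up data frames sharing the key column "code".
gdp <- data.frame(code = c("AAA", "BBB", "CCC", "DDD"),
                  rank = factor(c("3", "1", "2", NA)))
edu <- data.frame(code  = c("AAA", "BBB", "CCC", "EEE"),
                  group = c("high", "low", "high", "low"))

# Convert factor to numeric via character; as.numeric() on the factor
# alone would return the internal level codes instead of the values.
gdp$rank <- as.numeric(as.character(gdp$rank))

# which() returns the indices where the condition is TRUE, so rows
# with a missing rank are dropped.
gdp <- gdp[which(!is.na(gdp$rank)), ]

# Inner join on "code": only codes present in both data frames survive.
merged <- merge(gdp, edu, by = "code")

# Base-R analogue of plyr::arrange(merged, desc(rank)).
merged <- merged[order(merged$rank, decreasing = TRUE), ]

group_means <- tapply(merged$rank, merged$group, mean)  # average rank per group
pct <- quantile(merged$rank, probs = c(0.3, 0.8))       # 30th and 80th percentiles
group_means
pct
```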
Quiz 4
  1. the strsplit function.
  2. Remove a character in a string using the gsub function (base R).
  3. Select names from a character vector using the grep function and a regular expression.
  4. Select rows from a column using grep and a regular expression.
  5. Extract the year from a Date object using the format function. Extract the weekday from a Date object using the as.POSIXlt function.
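The Quiz 4 functions are all available in base R, and a minimal sketch of each is shown below. Every input value here (the variable names, the figures, the country names and the dates) is made up for illustration and is not the quiz data.

```r
# strsplit() returns a list with one character vector per input string.
var_names <- c("wgtp10", "wgtp15")          # hypothetical column names
parts <- strsplit(var_names, "wgtp")
second <- parts[[2]]                        # the text before "wgtp" is empty

# gsub() replaces every match; here it strips the thousands separators
# so that the strings can be converted to numbers.
gdp_raw <- c("16,244,600", "8,227,103")
gdp_num <- as.numeric(gsub(",", "", gdp_raw))
gdp_mean <- mean(gdp_num)

# grep() with value = TRUE returns the matching elements themselves.
countries <- c("United States", "Unified Land", "Tunisia")
united <- grep("^United", countries, value = TRUE)

# format() extracts the year as text; as.POSIXlt()$wday gives the
# weekday as a number where 0 is Sunday.
dates <- as.Date(c("2012-01-02", "2013-06-15", "2012-12-31"))
years <- format(dates, "%Y")
wdays <- as.POSIXlt(dates)$wday

second
gdp_mean
united
table(years)
wdays
```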

Course project

[Blank]


swirl Programming Assignment
  1. Manipulating Data with dplyr
    1. Purpose: how to manipulate data using dplyr’s five main functions
    2. select(), filter(), arrange(), mutate(), summarize()
      1. select(data, column) : select columns
      2. filter(data, column > 10) : select rows
      3. arrange(data, desc(column)) : order rows
      4. mutate(data, new_column = column * 100) : add columns
      5. summarize(data, mean(column)) : collapses the data set to a single row
  2. Grouping and Chaining with dplyr
    1. group_by(), quantile(), View(), %>%
      1. group_by() : break up your dataset into groups of rows based on the values of one or more variables
      2. quantile(variable, probs = ) : return a quantile of the given probability.
      3. %>% : a chaining operator (a binary operator). See ?chain for details.
  3. Tidying Data with tidyr
    1. 3 conditions of tidy data
      1. Each variable forms a column
      2. Each observation forms a row
      3. Each type of observational unit forms a table
    2. 5 characteristics of messy data
      1. Column headers are values, not variable names
      2. Variables are stored in both rows and columns
      3. A single observational unit is stored in multiple tables
      4. Multiple types of observational units are stored in the same table
      5. Multiple variables are stored in one column
    3. the tidyr package
      1. gather
      2. spread (the opposite operation of gather)
      3. separate
  4. Dates and Times with lubridate
    1. the lubridate package
      1. today, now
      2. year, month, day, wday, hour, minute, second
      3. years, months, days, hours, minutes, seconds
      4. ymd, ydm, mdy, myd, dmy, dym
      5. hms
      6. ymd_hms, ydm_hms, mdy_hms, dmy_hms
      7. update
      8. with_tz
      9. new_interval
      10. as.period
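The five dplyr verbs from the first swirl lesson, together with group_by and %>% from the second, can be chained as in the sketch below. It assumes the dplyr package is installed; the logs data frame and its column names are invented for this example and are not the data used in swirl.

```r
library(dplyr)

# Made-up download log; the column names are assumptions for this example.
logs <- data.frame(package = c("dplyr", "tidyr", "dplyr", "swirl", "tidyr"),
                   size    = c(120, 80, 130, 40, 90),
                   stringsAsFactors = FALSE)

result <- logs %>%
  select(package, size) %>%               # keep only these columns
  filter(size > 50) %>%                   # keep rows that satisfy the condition
  mutate(size_doubled = size * 2) %>%     # add a derived column
  group_by(package) %>%                   # split rows into groups by package
  summarize(n = n(),                      # one output row per group
            avg_doubled = mean(size_doubled)) %>%
  arrange(desc(avg_doubled))              # order rows, largest average first

result
```

Without group_by, summarize would collapse the whole data set to a single row; with it, the same call returns one row per package.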

Result
Scores
  • Total = 104/100 points
    • Quiz 1 = 15/15 points
    • Quiz 2 = 15/15 points
    • Quiz 3 = 15/15 points
    • Quiz 4 = 15/15 points
    • Course project = 41/40 points
    • swirl Programming Assignment = 3/3 points
      • swirl Programming Assignment 1 = 1/1 point
      • swirl Programming Assignment 2 = 1/1 point
      • swirl Programming Assignment 3 = 1/1 point
      • swirl Programming Assignment 4 = 1/1 point
Certificate

References
References Provided in this Course
Useful References
