Getting and Cleaning Data | Data Science Specialization | Coursera | Course

Brief Information

Name : Getting and Cleaning Data (the 3rd course of Data Science Specialization in Coursera)
Lecturer : Jeff Leek
Duration: 2015-07-06 ~ 08-02 (4 weeks)
Course : Data Science Specialization in Coursera
Syllabus : Syllabus__Getting and Cleaning Data – Coursera
In short
- This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data “tidy”. Tidy data dramatically speed downstream data analysis tasks. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.

Lectures

Please provide a one-line summary.

Week 1

Obtaining Data Motivation
Raw and Processed Data
Components of Tidy Data
1. The 4 things you should have
  1. raw data
  2. tidy data
  3. code book
  4. instruction list
Downloading Files
1. setwd, file.exists, file.create, dir.exists, dir.create, download.file, list.files
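The functions listed above are usually combined into a "download only if missing" pattern. The following is a minimal sketch, not code from the course: the helper name `fetch_if_missing` is hypothetical, and a temporary local file addressed with a `file://` URL stands in for a real remote source (this assumes a Unix-like path layout).

```r
# Hypothetical helper: download `url` into `dest` unless the file already exists.
fetch_if_missing <- function(url, dest) {
  dest_dir <- dirname(dest)
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, quiet = TRUE)
  }
  invisible(dest)
}

# Stand-in for a remote file: a small CSV written to a temporary location.
src <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), src)
src_url <- paste0("file://", normalizePath(src, winslash = "/"))

dest <- file.path(tempdir(), "data", "copy.csv")
fetch_if_missing(src_url, dest)

# list.files shows what is now in the target folder.
downloaded <- list.files(dirname(dest))
```

With a real source, `src_url` would simply be the `http(s)://` address of the file.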
Reading Local Files
1. read.table, read.csv
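As a small illustration of how the two readers relate, the sketch below reads the same file both ways. The file contents are made up for the example.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,90", "bob,NA", "cho,75"), path)

# read.csv wraps read.table with CSV-friendly defaults (header = TRUE, sep = ",").
scores_csv <- read.csv(path, stringsAsFactors = FALSE)

# With read.table the separator and header have to be given explicitly.
scores_tbl <- read.table(path, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# The literal text "NA" is read as a missing value in both cases.
missing_rows <- which(is.na(scores_csv$score))
```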
Reading Excel Files
1. xlsx package
  1. read.xlsx, read.xlsx2, write.xlsx
2. XLConnect package
  1. more options than xlsx.
  2. XLConnect vignette: a good place to start
Reading XML
1. XML package
  1. xmlTreeParse, rootNode, xmlName, xmlSApply, xpathSApply, htmlTreeParse
Reading JSON
1. jsonlite package
  1. toJSON, fromJSON
2. etc
  1. cat
The data.table Package
1. data.table package (fast, more functional than data.frame)
  1. data.table, tables, by, join, .N, setkey, merge, read.table
2. etc.
  1. { } curly brackets, system.time, fread, melt, dcast
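A short sketch of the data.table idioms named above (`by`, `.N`, `setkey`, and curly brackets inside `j`). The table contents are invented for illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

dt <- data.table(group = c("a", "b", "a", "b", "a"),
                 value = c(1, 2, 3, 4, 5))

# .N counts rows; `by` evaluates the expression once per group.
counts <- dt[, .N, by = group]

# Several statements can be wrapped in { } in j; the last value is returned.
centered_sum <- dt[, {
  m <- mean(value)
  sum(value - m)
}]

# setkey sorts the table by the key column and enables key-based subsetting.
setkey(dt, group)
only_a <- dt["a"]

# Grouped average, the kind of calculation Quiz 1 asks to do quickly.
means <- dt[, .(mean_value = mean(value)), by = group]
```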

Week 2

Reading from MySQL
1. RMySQL package
  1. dbConnect, dbDisconnect, dbListTables, dbReadTable, dbGetQuery, dbSendQuery, fetch, dbListFields
2. concepts
  1. connection, database, table, field, query, fetch
3. etc.
  1. dim – returns the number of rows and columns of a table
Reading from HDF5
1. source – execute an R script (= run the list of R instructions in the script)
Reading from The Web
1. Themes
  1. What is Web Scraping
  2. Parsing with XML
  3. GET from the httr package
  4. Accessing websites with passwords
2. base package
  1. url, readLines, close
3. XML package
  1. htmlTreeParse, xpathSApply
4. httr package
  1. GET, content, htmlParse (htmlParse comes from the XML package and is used to parse the content returned by GET)
Reading from APIs
Reading from Other Sources

Week 3

Subsetting and Sorting
1. subsetting
  1. [], [,], [number], [name]
2. Dealing with missing values (NAs)
  1. which
3. Sorting
  1. sort, order, arrange(plyr)
4. Adding rows and columns
  1. table$column <- new_column, cbind, rbind
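The subsetting, sorting, and row/column operations listed above can be sketched in base R as follows. The data frame and its column names are invented for the example.

```r
df <- data.frame(var1 = c(3, 1, NA, 2),
                 var2 = c("d", "a", "c", "b"),
                 stringsAsFactors = FALSE)

first_col <- df[, 1]        # select a column by number
named_col <- df[, "var2"]   # select a column by name

# which() returns only the positions where the test is TRUE, so NA rows are dropped.
big_rows <- df[which(df$var1 > 1), ]

sorted_vals <- sort(df$var1, na.last = TRUE)  # keep the NA and put it last
ordered_df  <- df[order(df$var1), ]           # order() places NA last by default

# Adding columns and rows.
df$var3 <- df$var1 * 10
wider   <- cbind(df, flag = !is.na(df$var1))
longer  <- rbind(df, data.frame(var1 = 9, var2 = "e", var3 = 90))
```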
Summarizing Data
Creating New Variables
Reshaping Data
Managing Data Frames with dplyr – Introduction
Managing Data Frames with dplyr – Basic Tools
Merging Data

Week 4

Editing Text Variables
Regular Expression I
1. Key
  1. literals, metacharacters
2. Metacharacters
  1. ^, $, [aA], [a-z], [a-zA-Z], “”
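To make the metacharacters above concrete, here is a brief sketch using base R's `grepl`. The sample strings are made up for illustration.

```r
lines <- c("i think we all rule", "Bush wants peace", "good morning", "I am 7 today")

# ^ anchors the pattern at the start of the string; [iI] matches either case.
starts_i <- grepl("^[iI]", lines)

# $ anchors the pattern at the end of the string.
ends_ing <- grepl("ing$", lines)

# A character class with ranges: the whole string must be letters or spaces.
only_letters <- grepl("^[a-zA-Z ]+$", lines)
```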
Regular Expression II
Working With Dates
Data Resources

Quiz

Quiz 1

Download and read a csv file and count rows that have a particular value.
The principles of tidy data
Download and read an Excel file and manipulate the data as given.
Download and read an XML file and count rows that have a particular value.
Download and read a csv file and find the fastest way to calculate the average value of a particular variable broken down by another variable using the data.table package.

Quiz 2

Read a JSON file and extract the particular information the question asks for.
sqldf package. Read a csv file and retrieve the table the question asks for using the sqldf function of the sqldf package.
1. Basic SQL understanding (how to use select, from and where) is needed.
sqldf package. Same as question 2, but the required result is the set of unique values.
1. select, distinct and from
httr package. Read an HTML document and find the length of particular lines using nchar.
Read a fixed-width formatted data file. The read.fwf function is used to get the required table and data.

Quiz 3

Reading a csv file. Understanding a code book. Using subsetting, select rows which satisfy certain conditions. Remove NA data from a data set using the which function.
Reading a jpeg file. Find the 30th and 80th percentiles using the quantile function.
Reading a csv file. Remove rows and columns from a data frame. Convert factor-type data to numeric-type data. Join two data frames using the merge function. Reorder a data frame using the arrange function in the plyr package. (Questions 3, 4 and 5 use the same data sources.)
Find the average of particular groups in a data frame.
Reorder a data frame using the arrange function in the plyr package. Count rows in a particular group.

Quiz 4

Split strings using the strsplit function.
Remove a character from a string using the gsub function (gsub is part of base R, not the stringr package).
Select names from a character vector using the grep function and a regular expression.
Select rows from a column using grep and a regular expression.
Extract the year from a date class using the format function. Extract the weekday from a date class using the as.POSIXlt function.
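The functions named in the Quiz 4 items can be sketched as below. This is not the quiz solution; every input value is a made-up stand-in.

```r
# strsplit returns a list with one character vector per input string.
parts <- strsplit("wgtp15", "wgtp")[[1]]

# gsub replaces every match; here it strips thousands separators before conversion.
amount <- as.numeric(gsub(",", "", "1,234,567"))

# grep with value = TRUE returns the matching elements instead of their positions.
countries <- c("United States", "Korea", "United Kingdom")
united <- grep("^United", countries, value = TRUE)

# format extracts the year as text; as.POSIXlt exposes the weekday as a number.
d  <- as.Date("2015-07-06")
yr <- format(d, "%Y")
wd <- as.POSIXlt(d)$wday  # 0 = Sunday, 1 = Monday, ...
```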

Course project

[Blank]

swirl Programming Assignment

Manipulating Data with dplyr
1. Purpose: how to manipulate data using dplyr’s five main functions
2. select(), filter(), arrange(), mutate(), summarize()
  1. select(data, column) : select columns
  2. filter(data, column > 10) : select rows
  3. arrange(data, desc(column)) : order rows
  4. mutate(data, new_column = column * 100) : add columns
  5. summarize(data, mean(column)) : collapses the data set to a single row
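The five verbs above can be tried on a small data frame. This is a minimal sketch with invented data and column names, assuming the dplyr package is installed.

```r
library(dplyr)

df <- data.frame(id = 1:5,
                 size = c(5, 20, 15, 8, 30),
                 country = c("KR", "US", "KR", "US", "KR"),
                 stringsAsFactors = FALSE)

picked  <- select(df, id, size)               # keep only some columns
large   <- filter(df, size > 10)              # keep only rows passing the test
by_size <- arrange(df, desc(size))            # order rows, largest first
scaled  <- mutate(df, size_x100 = size * 100) # add a derived column
overall <- summarize(df, avg_size = mean(size)) # collapse to one row
```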
Grouping and Chaining with dplyr
1. group_by(), quantile(), View(), %>%
  1. group_by() : break up your dataset into groups of rows based on the values of one or more variables
  2. quantile(variable, probs = ) : return a quantile of the given probability.
  3. %>% : a chaining operator (a binary operator). See details using ?chain.
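Grouping and chaining are typically used together. The sketch below uses invented data and assumes dplyr is installed; each step passes its result to the next through `%>%`.

```r
library(dplyr)

downloads <- data.frame(country = c("KR", "US", "KR", "US", "KR"),
                        size = c(5, 20, 15, 8, 30),
                        stringsAsFactors = FALSE)

# Group rows by country, summarize each group, then sort the summary.
by_country <- downloads %>%
  group_by(country) %>%
  summarize(n = n(),
            avg = mean(size),
            q50 = quantile(size, probs = 0.5, names = FALSE)) %>%
  arrange(desc(avg))
```

Without the chaining operator the same steps would be written as nested calls or with intermediate variables, which is harder to read from top to bottom.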
Tidying Data with tidyr
1. 3 conditions of tidy data
  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table
2. 5 characteristics of messy data
  1. Column headers are values, not variable names
  2. Variables are stored in both rows and columns
  3. A single observational unit is stored in multiple tables
  4. Multiple types of observational units are stored in the same table
  5. Multiple variables are stored in one column
3. the tidyr package
  1. gather
  2. spread (the opposite operation of gather)
  3. separate
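As an illustration of the three functions, the sketch below tidies a table whose column headers hold two variables at once. The data are invented, and the tidyr package is assumed to be installed. Newer tidyr releases supersede `gather` and `spread` with `pivot_longer` and `pivot_wider`, though the older functions remain available.

```r
library(tidyr)

# Messy input: headers such as "male_1" encode both sex and class.
students <- data.frame(grade = c("A", "B"),
                       male_1 = c(3, 6), female_1 = c(4, 4),
                       male_2 = c(3, 3), female_2 = c(4, 5),
                       stringsAsFactors = FALSE)

# gather: stack every column except grade into key/value pairs.
long <- gather(students, key = "sex_class", value = "count", -grade)

# separate: split the combined key into two variables.
tidy <- separate(long, sex_class, into = c("sex", "class"), sep = "_")

# spread: the opposite of gather, turning the values of `sex` back into columns.
wide <- spread(tidy, key = sex, value = count)
```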
Dates and Times with lubridate
1. the lubridate package
  1. today, now
  2. year, month, day, wday, hour, minute, second
  3. years, months, days, hours, minutes, seconds
  4. ymd, ydm, mdy, myd, dmy, dym
  5. hms
  6. ymd_hms, ydm_hms, mdy_hms, dmy_hms
  7. update
  8. with_tz
  9. new_interval
  10. as.period
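A brief sketch of how these lubridate functions fit together, assuming the package is installed. The dates are the course dates from the top of this post; `interval()` is used here as an assumption in place of `new_interval`, since that older name is not available in recent lubridate releases.

```r
library(lubridate)

# Parsers are named after the order of the date components in the input.
start <- ymd("2015-07-06")
end   <- mdy("08/02/2015")
stamp <- ymd_hms("2015-07-06 09:30:00", tz = "UTC")

# Accessors extract single components.
start_year <- year(start)
start_wday <- wday(start)  # 1 = Sunday with the default week start

# Plural functions build periods that can be added to dates.
later <- start + days(27)

# update changes components; with_tz shows the same instant in another zone.
noon  <- update(stamp, hours = 12, minutes = 0)
seoul <- with_tz(stamp, tzone = "Asia/Seoul")

# An interval between two dates, expressed as a period.
span <- as.period(interval(start, end))
```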

Result

Scores

Total = 104/100 points
- Quiz 1 = 15/15 points
- Quiz 2 = 15/15 points
- Quiz 3 = 15/15 points
- Quiz 4 = 15/15 points
- Course project = 41/40 points
- swirl Programming Assignment = 3/3 points
  - swirl Programming Assignment 1 = 1/1 point
  - swirl Programming Assignment 2 = 1/1 point
  - swirl Programming Assignment 3 = 1/1 point
  - swirl Programming Assignment 4 = 1/1 point

Certificate

References

References Provided in this Course

1-6
- the XLConnect package
- XLConnect vignette
1-9
- a list of differences between data.table and data.frame | Stackoverflow
2-1
2-2
2-3
- How Netflix Reverse Engineered Hollywood
- Webscraping – Wikipedia
- Package ‘httr’ | R-Project
- A number of examples of web scraping
- An example html document
  - https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en
2-4
- GET blocks/list | Twitter Developers
- How did I know what url to use?
- In general look at the documentation
- Personal references
  - oauth – Wikipedia
  - HTTP Methods: GET vs. POST
2-5
- Package ‘foreign’ | R-Project
- Functions to Manipulate Connections | R document
- RPostgreSQL – Tutorial
- RPostgreSQL – Help File
- RODBC – Tutorial
- RODBC – Help File
- RMongo – Help File
- rmongodb – Help File
- Reading images
  - jpeg
  - readbitmap
  - png
  - EBImage (Bioconductor)
- Reading GIS data
  - rgdal
  - rgeos
  - raster
- Reading music data
  - tuneR
  - seewave
3-1
- Andrew Jaffe’s lecture notes – Data Summarization and Manipulation
