Brief Information
- Name : Getting and Cleaning Data (the 3rd course of Data Science Specialization in Coursera)
- Lecturer : Jeff Leek
- Duration: 2015-07-06 ~ 08-02 (4 weeks)
- Course : Data Science Specialization in Coursera
- Syllabus : Syllabus__Getting and Cleaning Data – Coursera
- In short
- This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data "tidy". Tidy data dramatically speed downstream data analysis tasks. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.
Lectures
One-line summaries, please.
Week 1
- Obtaining Data Motivation
- Raw and Processed Data
- Components of Tidy Data
- The 4 things you should have
- raw data
- tidy data
- code book
- instruction list
- Downloading Files
- setwd, file.exists, file.create, dir.exists, dir.create, download.file, list.files
- Reading Local Files
- read.table, read.csv
- Reading Excel Files
- Reading XML
- XML package
- xmlTreeParse, xmlRoot, xmlName, xmlSApply, xpathSApply, htmlTreeParse
- Reading JSON
- jsonlite package
- toJSON, fromJSON
- etc
- cat
- The data.table Package
- data.table package (fast, more functional than data.frame)
- data.table, tables, by, join, .N, setkey, merge, read.table
- etc.
- { } curly brackets, system.time, fread, melt, dcast
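The file-handling functions listed under Downloading Files and Reading Local Files can be tried together in a minimal base R sketch. The folder name, file name, and data below are made up for illustration, and a locally written CSV stands in for a real download.file call so that the example runs without a network connection.

```r
# Work inside a temporary directory so nothing outside it is touched.
data_dir <- file.path(tempdir(), "data")   # hypothetical folder name

# Create the folder only if it is missing.
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# Stand-in for download.file(url, destfile = csv_path): write a small CSV locally.
csv_path <- file.path(data_dir, "sample.csv")   # hypothetical file name
sample_rows <- data.frame(id = 1:3, value = c(10, 20, 30))
write.csv(sample_rows, csv_path, row.names = FALSE)

# Check that the file is there, list the folder, and read the file back.
stopifnot(file.exists(csv_path))
files_found <- list.files(data_dir)
loaded <- read.csv(csv_path)
```

With a real source, the write.csv step would be replaced by download.file pointing at the remote URL.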
Week 2
- Reading from MySQL
- RMySQL package
- dbConnect, dbDisconnect, dbListTables, dbReadTable, dbGetQuery, dbSendQuery, fetch, dbListFields
- concepts
- connection, database, table, field, query, fetch
- etc.
- dim – returns the number of rows and columns of a table
- Reading from HDF5
- source – execute an R script (= run the list of R instructions in the script)
- Reading from The Web
- Themes
- What is Web Scraping
- Parsing with XML
- GET from the httr package
- Accessing websites with passwords
- base package
- url, readLines, close
- XML package
- htmlTreeParse, xpathSApply
- httr package
- GET, content, htmlParse
- Reading from APIs
- Reading from Other Sources
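As a small illustration of the source function noted under Reading from HDF5, the sketch below writes a two-line script to a temporary file and runs it. The script contents and variable names are made up; running it in a separate environment is just one way to keep the script's variables apart from the rest of the session.

```r
# Write a two-line script to a temporary file (the contents are made up).
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "total <- sum(x)"), script_path)

# Run the script in its own environment so its variables stay contained.
script_env <- new.env()
source(script_path, local = script_env)

# Pull one result back out of that environment.
total_from_script <- get("total", envir = script_env)
```

Calling source(script_path) without the local argument would instead create x and total in the global environment.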
Week 3
- Subsetting and Sorting
- subsetting
- [], [,], [number], [name]
- Dealing with missing values (NAs)
- which
- Sorting
- sort, order, arrange(plyr)
- Adding rows and columns
- table$column <- new_column, cbind, rbind
- Summarizing Data
- Creating New Variables
- Reshaping Data
- Managing Data Frames with dplyr – Introduction
- Managing Data Frames with dplyr – Basic Tools
- Merging Data
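The Subsetting and Sorting items above can be sketched in base R as follows. The data frame is invented for illustration, and order stands in for plyr's arrange so that no extra package is needed.

```r
# Small made-up data frame with one missing value.
scores <- data.frame(name = c("a", "b", "c", "d"),
                     points = c(3, NA, 1, 2),
                     stringsAsFactors = FALSE)

# Logical subsetting keeps the NA row (as a row of NAs); which() drops it.
with_na    <- scores[scores$points > 1, ]
without_na <- scores[which(scores$points > 1), ]

# order() returns row positions; NAs are placed last by default.
sorted <- scores[order(scores$points), ]

# Add a column with $<- and a row with rbind().
scores$passed <- scores$points >= 2
scores <- rbind(scores,
                data.frame(name = "e", points = 5, passed = TRUE,
                           stringsAsFactors = FALSE))
```

The difference between with_na and without_na is the reason which appears under missing values: a comparison against NA yields NA, and which only returns the positions that are TRUE.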
Week 4
- Editing Text Variables
- Regular Expression I
- Key
- literals, metacharacters
- Metacharacters
- ^, $, [aA], [a-z], [a-zA-Z], “”
- Regular Expression II
- Working With Dates
- Data Resources
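A few of the metacharacters from Regular Expression I can be tried with grepl and grep in base R. The example strings are made up for illustration.

```r
# Made-up lines of text to match against.
lines <- c("i think so", "Bush said", "bush fire", "9th street", "so i think")

# ^ anchors the match to the start of a line, $ to the end.
starts_with_i_think <- grepl("^i think", lines)
ends_with_think     <- grepl("think$", lines)

# [Bb] matches either case of one letter; [a-zA-Z] matches any single letter.
bush_any_case      <- grep("[Bb]ush", lines, value = TRUE)
starts_with_letter <- grepl("^[a-zA-Z]", lines)
```

Everything outside the metacharacters is a literal, so "think" only matches the exact characters t-h-i-n-k.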
Quiz
Quiz 1
- Download and read a csv file and count rows that have a particular value.
- The principles of tidy data
- Download and read an Excel file and manipulate the data as given.
- Download and read an XML file and count rows that have a particular value.
- Download and read a csv file and find the fastest way to calculate the average value of a particular variable broken down by another variable using the data.table package
Quiz 2
- Read a JSON file and extract the particular information the question asks for.
- sqldf package. Read a csv file and retrieve the table the question asks for using the sqldf function of the sqldf package.
- Basic SQL understanding (how to use select, from and where) is needed.
- sqldf package. Same as question 2, except that the result must contain only unique values.
- select, distinct and from
- httr package. Read an HTML document and find the length of particular lines using nchar
- Read a table from a fixed-width formatted data file. The read.fwf function is used to get the required table and data.
Quiz 3
- Reading a csv file. Understanding a code book. Using subsetting, select rows which satisfy certain conditions. Remove NA data from a data set using the which function.
- Reading a jpeg file. Find the 30th and 80th percentiles using the quantile function.
- Reading a csv file. Remove rows and columns from a data frame. Convert factor-type data to numeric-type data. Join two data frames using the merge function. Reorder a data frame using the arrange function in the plyr package. (Questions 3, 4 and 5 use the same data sources.)
- Find the average of particular groups in a data frame.
- Reorder a data frame using the arrange function in the plyr package. Count rows in a particular group.
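Several of the Quiz 3 techniques can be sketched in base R. All data below is invented and is not the quiz data; tapply is used here as one possible way to average by group.

```r
# A factor stores integer codes; as.numeric() alone returns the codes.
ranks_factor <- factor(c("10", "2", "30"))
codes  <- as.numeric(ranks_factor)                 # level codes, not the values
values <- as.numeric(as.character(ranks_factor))   # the intended numbers

# 30th and 80th percentiles (default quantile algorithm).
pct <- quantile(1:11, probs = c(0.3, 0.8))

# Join two data frames on a shared key column; unmatched rows are dropped.
gdp <- data.frame(code = c("A", "B", "C"), gdp = c(100, 200, 300))
edu <- data.frame(code = c("B", "C", "D"), income = c("high", "low", "low"))
joined <- merge(gdp, edu, by = "code")

# Average of a variable within each group.
avg_by_income <- tapply(joined$gdp, joined$income, mean)
```

Converting through as.character first is the step that is easy to miss: the factor levels are sorted as text, so the codes bear no relation to the numbers they label.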
Quiz 4
- the strsplit function.
- Remove a character in a string using the gsub function (gsub is part of base R, not the stringr package).
- Select names from a character vector using the grep function and a regular expression.
- Select rows from a column using grep and a regular expression.
- Extract the year from a Date object using the format function. Extract the weekday from a Date object using the as.POSIXlt function.
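The text and date functions named in Quiz 4 can be sketched as below. The strings are made up and are not the quiz data; the date is the course start date given at the top of these notes.

```r
# strsplit() returns a list with one character vector per input string.
parts <- strsplit("col.name", ".", fixed = TRUE)[[1]]

# gsub() replaces every match; fixed = TRUE treats the pattern literally.
cleaned   <- gsub(",", "", "1,234,567", fixed = TRUE)
as_number <- as.numeric(cleaned)

# format() pulls pieces out of a Date; as.POSIXlt() exposes them as fields.
d <- as.Date("2015-07-06")
year_text     <- format(d, "%Y")
weekday_index <- as.POSIXlt(d)$wday   # 0 = Sunday, so 1 = Monday
```

The numeric wday field is used instead of weekdays or format(d, "%A") because those return locale-dependent names.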
Course project
[Blank]
swirl Programming Assignment
- Manipulating Data with dplyr
- Purpose: how to manipulate data using dplyr’s five main functions
- select(), filter(), arrange(), mutate(), summarize()
- Grouping and Chaining with dplyr
- group_by(), quantile(), View(), %>%
- group_by() : break up your dataset into groups of rows based on the values of one or more variables
- quantile(variable, probs = ) : return the quantile for the given probability.
- %>% : a chaining operator (a binary operator). See details using ?chain.
- Tidying Data with tidyr
- 3 conditions of tidy data
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
- 5 characteristics of messy data
- Column headers are values, not variable names
- Variables are stored in both rows and columns
- A single observational unit is stored in multiple tables
- Multiple types of observational units are stored in the same table
- Multiple variables are stored in one column
- the tidyr package
- Dates and Times with lubridate
- the lubridate package
- today, now
- year, month, day, wday, hour, minute, second
- years, months, days, hours, minutes, seconds
- ymd, ydm, mdy, myd, dmy, dym
- hms
- ymd_hms, ydm_hms, mdy_hms, dmy_hms
- update
- with_tz
- new_interval
- as.period
Result
Scores
- Total = 104/100 points
- Quiz 1 = 15/15 points
- Quiz 2 = 15/15 points
- Quiz 3 = 15/15 points
- Quiz 4 = 15/15 points
- Course project = 41/40 points
- swirl Programming Assignment = 3/3 points
- swirl Programming Assignment 1 = 1/1 point
- swirl Programming Assignment 2 = 1/1 point
- swirl Programming Assignment 3 = 1/1 point
- swirl Programming Assignment 4 = 1/1 point
Certificate
References
References Provided in this Course
- 1-6
- 1-9
- 2-1
- 2-2
- 2-3
- How Netflix Reverse Engineered Hollywood
- Webscraping – Wikipedia
- Package ‘httr’ | R-Project
- A number of examples of web scraping
- An example html document
- https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en
- 2-4
- 2-5
- Package 'foreign' | R-Project
- Functions to Manipulate Connections | R document
- RPostgreSQL – Tutorial
- RPostgreSQL – Help File
- RODBC – Tutorial
- RODBC – Help File
- RMongo – Help File
- rmongodb – Help File
- Reading images
- jpeg
- readbitmap
- png
- EBImage (Bioconductor)
- Reading GIS data
- Reading music data
- 3-1