12 Web scraping
The web is a major source of data these days. While some data is available in easy-to-download formats such as CSV, we often need to go through many web pages to gather the data for our analysis. This week we discuss the fundamentals of web scraping, that is, collecting data from online sources automatically.
Readings
- Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, UK: Wiley & Sons. Chapters 2-4, 9
- Nolan, Deborah, and Duncan Temple Lang. 2014. XML and Web Technologies for Data Sciences with R. New York: Springer. Chapters 1-4
Lab
- Handling basic XML/HTML
- Scraping web tables
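As a rough illustration of the two lab topics, the sketch below parses a small HTML table using only Python's standard-library `html.parser`. It assumes the lab is done in Python (suggested by the `.ipynb` project file, but not stated above). The names `TableParser`, `parse_table`, and `SAMPLE_HTML` are hypothetical, and the table contents are made up for the example. It is not the lab's actual code.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page; a real scraper would download the HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Ireland</td><td>Dublin</td></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
header, *records = rows
print(header)
print(records)
```

This sketch does not handle nested tables or cells that span several rows or columns. For those cases a dedicated parser such as `pandas.read_html` or Beautiful Soup is usually more convenient.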
Final project
- K-means clustering
- Due at 23:59 on Monday, 20th December (submission on Blackboard)
- Rename the file from final_project.ipynb to final_project_firstname_lastname.ipynb before submission