12 Web scraping


The web is one of the main sources of data these days. While some data is available in an easy-to-download format such as CSV, we often need to go through many webpages to gather data for our analysis. This week we discuss the fundamentals of web scraping, that is, collecting data from online sources.

Readings

  • Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, UK: Wiley & Sons. Chapters 2-4, 9.
  • Nolan, Deborah, and Duncan Temple Lang. 2014. XML and Web Technologies for Data Sciences with R. New York: Springer. Chapters 1-4.

Lab

  • Handling basic XML/HTML
  • Scraping web tables
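
As a rough illustration of the two lab topics, the sketch below walks the tags of a small HTML snippet and collects the cells of a table into rows. It is written in Python with only the standard library, which is an assumption on our part: the readings use R, and the lab itself may rely on different tools. The names TableParser, parse_table and SAMPLE_HTML, and the table contents, are made up for the example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; the values are invented for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>1</td></tr>
  <tr><td>bob</td><td>2</td></tr>
</table>
"""


def parse_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
print(header)
print(records)
```

This sketch assumes a single, well-formed table without nested tables or cells that span several rows or columns; real pages often need a dedicated parsing library and some cleaning of the extracted text.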

Final project

  • K-means clustering
  • Due at 23:59 on Monday, 20th December (submission on Blackboard)
  • Rename the file from final_project.ipynb to final_project_firstname_lastname.ipynb before submission