Week 12 Tutorial: Web Scraping

POP77001 Computer Programming for Social Scientists

Module website: bit.ly/POP77001

Webpage elements and source code inspection

  • Modern browsers come with very powerful built-in tools that help in web scraping
  • The two most useful for us are:
    • Element inspection (move the cursor over an element on the page, right-click, and pick 'Inspect' from the menu)
    • Viewing page source (Ctrl + U on Windows/Linux; on Mac, Option + Cmd + U in Chrome and Cmd + U in Firefox)
  • Knowing the source code and element attributes helps build XPaths for selection
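
As a minimal sketch of how element attributes translate into XPaths, the snippet below parses a made-up HTML fragment and selects elements by their attributes. The `id` and `class` values are illustrative and do not come from any real page. It assumes rvest 1.0 or later, where `html_element()`, `html_elements()` and `html_text2()` are available.

```r
library("rvest")

# Made-up HTML fragment; the id and class values are purely illustrative
page <- read_html("
<html>
  <body>
    <div id='main'>
      <p class='intro'>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <div id='footer'>
      <p>Footer text</p>
    </div>
  </body>
</html>")

# Select a paragraph by its class attribute
intro <- html_element(page, xpath = "//p[@class='intro']")
intro_text <- html_text2(intro)

# Select all paragraphs that are children of the <div> with id='main'
main_pars <- html_elements(page, xpath = "//div[@id='main']/p")
main_text <- html_text2(main_pars)

print(intro_text)
print(main_text)
```

The attribute values used in the predicates (`@class`, `@id`) are exactly what element inspection or the page source would show for a real page.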

Webpage element inspection (Firefox)

Webpage element inspection (Chrome)

Webpage source code

Exercise 1: Working with HTML

  • Consider the simple HTML code below
  • Read it into R using rvest's read_html() function
  • How many children does the <body> element have?
  • Build an XPath to extract the first paragraph
  • Build an XPath to extract the second sub-heading under the green heading
In [2]:
library("rvest")
In [3]:
html_txt <- "
<!DOCTYPE html> 
<html>
    <head>
        <title>A title</title> 
    </head>
    <body>
        <h1 style='color:Red;'>
        Heading 1
        <h2>Subheading</h2>
        </h1>
        <h1 style='color:Green;'>
        Heading 2
        <h2>Subheading 1</h2>
        <h2>Subheading 2</h2>
        </h1> 
        <p>A paragraph.</p>
        <p>Another paragraph.</p>
    </body>
</html>"
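
One possible way to approach these tasks is sketched below; it is not the only valid solution. The block repeats a compact copy of the HTML so that it runs on its own, and it assumes rvest 1.0 or later. The variable names (`body_children`, `first_par`, `green_sub2`) are illustrative. The last XPath also assumes that the HTML parser closes an open `<h1>` when it meets `<h2>`, so that the sub-headings end up as siblings of the headings; printing the tag names of the children of `<body>` is a quick way to check this.

```r
library("rvest")

# Compact copy of the HTML above, repeated so that this block runs on its own
html_txt <- "<!DOCTYPE html>
<html>
  <head><title>A title</title></head>
  <body>
    <h1 style='color:Red;'>
    Heading 1
    <h2>Subheading</h2>
    </h1>
    <h1 style='color:Green;'>
    Heading 2
    <h2>Subheading 1</h2>
    <h2>Subheading 2</h2>
    </h1>
    <p>A paragraph.</p>
    <p>Another paragraph.</p>
  </body>
</html>"

html <- read_html(html_txt)

# Element children of <body>; their tag names show how the parser
# handled the <h2> tags written inside <h1>
body_children <- html_children(html_element(html, xpath = "//body"))
n_children <- length(body_children)
child_tags <- html_name(body_children)
print(n_children)
print(child_tags)

# First paragraph: the first <p> child of <body>
first_par <- html_text2(html_element(html, xpath = "//body/p[1]"))

# Second sub-heading after the green heading; assumes the sub-headings
# were parsed as siblings of <h1> rather than as its children
green_sub2 <- html_text2(html_element(
  html,
  xpath = "//h1[contains(@style, 'Green')]/following-sibling::h2[2]"
))

print(first_par)
print(green_sub2)
```

Using `contains(@style, 'Green')` rather than an exact match keeps the XPath working if further style properties are added to the heading.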

Exercise 2: Working with Webpages

  • Now let's turn to a real website
  • Here we will extract the table of countries with their GDP from a Wikipedia article
  • Start by loading in the webpage using rvest's read_html() function
  • Go to the webpage of the article and locate the elements that would be helpful for table extraction
  • Extract the <table> node that corresponds to the main table
  • Extract the <tbody> element as a child of this node
  • Extract the table with data using rvest's html_table() function
  • Tidy up the extracted table
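
The steps above can be sketched on a small inline page that stands in for the article, so that the example runs without a network connection. The table class, column headers and rows are invented placeholders, not real GDP figures, and the object names (`tbl_node`, `tbody_node`, `gdp`) are illustrative. For the real article, the page would be read by passing its URL to `read_html()`, and the XPath would be adapted to what element inspection shows. The sketch assumes rvest 1.0 or later.

```r
library("rvest")

# Stand-in for the article: a tiny page with invented placeholder rows
page <- read_html("
<html><body>
  <table class='wikitable sortable'>
    <tbody>
      <tr><th>Country/Territory</th><th>GDP (US$ million)</th></tr>
      <tr><td>Country A</td><td>1,000</td></tr>
      <tr><td>Country B</td><td>2,500</td></tr>
    </tbody>
  </table>
</body></html>")

# Extract the <table> node using its class attribute
tbl_node <- html_element(page, xpath = "//table[contains(@class, 'wikitable')]")

# Extract <tbody> as a child of the table node
tbody_node <- html_element(tbl_node, xpath = "./tbody")

# Convert the table node to a data frame
gdp <- as.data.frame(html_table(tbl_node))

# Tidy up: simpler column names, thousands separators removed,
# GDP stored as a number
names(gdp) <- c("country", "gdp_usd_million")
gdp$gdp_usd_million <- as.numeric(gsub(",", "", gdp$gdp_usd_million))

print(gdp)
```

On a real page the tidying step usually needs more work, for example dropping footnote markers or aggregate rows, which depends on how the table is laid out.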

Final project

  • K-means clustering with Airbnb Dublin data
  • Due at 23:59 on Monday, 20th December (submission on Blackboard)