Week 12: Web scraping¶

POP77001 Computer Programming for Social Scientists¶

Tom Paskhalis¶

29 November 2021¶
Module website: bit.ly/POP77001¶

Overview¶

  • Online data sources
  • Data collection
  • Web technologies
  • HTML fundamentals
  • XPath

Online data sources¶

  • Data downloadable in tabular format (E.g. CSV/TSV, XLS, DTA, etc.)
  • Data available online as a table (E.g. webpages with rendered tables)
  • Unstructured data available online (E.g. simple webpages)
  • Interactive webpages with user-input (E.g. webpages with logins, dropdown menus)
  • Web APIs (special interfaces for querying, e.g. Twitter, Google)

Online data collection¶

  • Tabular format: download single or multiple files (automate with download.file() in R, wget in Python/Terminal; see the sketch after this list)
  • Online tables and unstructured data: simple web scraping (HTML with XPath, rvest in R, beautifulsoup in Python)
  • Interactive webpages: web scraping with headless browser (Selenium, RSelenium in R, selenium in Python)
  • Web API: sending requests and processing responses (HTTP queries, httr in R, requests in Python)
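A minimal sketch of the first approach in R (the URL and file name below are hypothetical placeholders):

# Download a single tabular file and read it into R
url <- "https://example.com/data/elections.csv"
download.file(url, destfile = "elections.csv", mode = "wb")
dat <- read.csv("elections.csv")
head(dat)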

Web tables¶

Source: Wikipedia

Unstructured data¶

Source: Eur-Lex

Interactive webpages¶

Source: Izbori.ba

Automated data collection¶

  • Manual scraping (copy-pasting) can be:
    • Extremely laborious and time-consuming
    • Very error-prone
    • Often impossible to reproduce exactly
  • Automated data collection
    • Easy to scale up (computer time is cheap)
    • Less error-prone
    • Usually, perfectly reproducible
  • There is a trade-off (time invested in automation vs time saved)
    • However, it is good to err on the side of automation

Commercial solutions¶

Web technologies¶

  • Key technologies used to disseminate content on the Web:
    • XML/HTML (Extensible Markup Language/Hypertext Markup Language)
    • CSS (Cascading Style Sheets)
    • JavaScript
    • API (Application Programming Interface)
    • JSON (JavaScript Object Notation; a short example follows after this list)
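JSON comes up again in the context of web APIs. A minimal sketch of what it looks like and how it can be read into R, using the jsonlite package (an assumption here, not otherwise used in these slides) as one common option:

# Parse a small, made-up JSON string into an R list
library("jsonlite")

txt <- '{"course": "POP77001", "year": 2021, "languages": ["Python", "R"]}'
parsed <- fromJSON(txt)
str(parsed)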

Static vs dynamic websites¶

  • The critical feature of a website that determines the approach to scraping its content
  • Static websites have prebuilt source code which is served at the user's request
    • No real-time processing of user input
    • They can still contain elements that change the appearance of a website
    • Example: POP77001 course website
  • Dynamic websites render content in real time in response to user input
    • They can use a range of technologies to achieve this (JavaScript, Python Django, PHP)
    • Example: Google Maps

HTML: Hypertext Markup Language¶

  • HTML (Hypertext Markup Language) is a markup language for webpages
  • Forms the basis of static websites
  • Your browser renders (interprets) HTML for viewing
  • Current version is HTML5
<!DOCTYPE html> 
<html>
    <head>
        <title>A title</title> 
    </head>
    <body>
        <h1 style="color:Red;">A heading</h1> 
        <p>A paragraph.</p> 
    </body>
</html>

Extra: W3Schools: Try HTML

HTML basics¶

  • The basic unit of HTML is an element (aka a node)
  • Elements typically begin with a start tag (e.g. <h1>)
  • And finish with an end tag (e.g. </h1>)
  • The content of an element is found between the start and end tags
  • Attributes are special words used within a start tag to control the element's behaviour (e.g. style="color:Red;")
  • Some HTML tag examples:
    • Document structure: <html>, <body>, <header>
    • Document components: <h1>, <title>, <div>
    • Text style: <b>, <i>
    • Hyperlinks: <a>

HTML tree¶

HTML tree relationships¶

  • All elements (nodes) in an HTML tree are connected by relationships
  • These relationships can be of the following types (see the sketch after this list):
    • Ancestors (parents)
    • Descendants (children)
    • Siblings
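A minimal sketch of these relationships, using functions from the xml2 package (introduced more fully below) on the small HTML document from the earlier slides:

# Navigate parent, children and siblings of the <p> element
library("xml2")

doc <- read_html("<html><body><h1>A heading</h1><p>A paragraph.</p></body></html>")
p <- xml_find_first(doc, "//p")

xml_parent(p)                # parent/ancestor of <p>: the <body> element
xml_children(xml_parent(p))  # children of <body>: <h1> and <p>
xml_siblings(p)              # siblings of <p>: the <h1> element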

HTML parent/ancestor¶

HTML children/descendants¶

HTML siblings¶

Parsing HTML tree example¶

In [2]:
library("rvest")
In [3]:
html_txt <- "
<!DOCTYPE html> 
<html>
    <head>
        <title>A title</title> 
    </head>
    <body>
        <h1 style='color:Red;'>A heading</h1> 
        <p>A paragraph.</p> 
    </body>
</html>"
In [4]:
html <- rvest::read_html(html_txt)
In [5]:
str(html)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Parsing HTML tree example continued¶

In [6]:
children <- rvest::html_children(html)
children
{xml_nodeset (2)}
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n        <h1 style="color:Red;">A heading</h1> \n        <p>A para ...
In [7]:
body <- children[2]
rvest::html_name(body)
[1] "body"
In [8]:
children2 <- rvest::html_children(body)
children2
{xml_nodeset (2)}
[1] <h1 style="color:Red;">A heading</h1>
[2] <p>A paragraph.</p>
In [9]:
rvest::html_attrs(children2[1])
[[1]]
       style 
"color:Red;" 
In [10]:
rvest::html_text(children2[1])
[1] "A heading"

XML: Extensible Markup Language¶

  • XML (Extensible Markup Language) is a more general form of markup language
  • Allows sharing structured data of tree-like form
  • Relative to HTML:
    • Tags are user-defined
    • End tags are always required
    • Stricter (no inconsistencies permitted)
<?xml version="1.0" encoding="UTF-8" ?>
<courses> 
    <course> 
        <title>Computer Programming for Social Scientists</title> 
        <code>POP77001</code> 
        <year>2021</year> 
        <term>Michaelmas</term> 
        <description>Course on computer programming in Python and R.</description> 
    </course> 
    <course> 
        <title>Applied Statistical Analysis I</title> 
        <code>POP77003</code> 
        <year>2021</year> 
        <term>Michaelmas</term> 
        <description>Introduction to statistical inference.</description> 
    </course> 
</courses>

Parsing XML tree example¶

In [11]:
library("xml2")
In [12]:
xml_txt <- 
'<?xml version="1.0" encoding="UTF-8" ?>
<courses> 
    <course> 
        <title>Computer Programming for Social Scientists</title> 
        <code>POP77001</code> 
        <year>2021</year> 
        <term>Michaelmas</term> 
        <description>Course on computer programming in Python and R.</description> 
    </course> 
    <course> 
        <title>Applied Statistical Analysis I</title> 
        <code>POP77003</code> 
        <year>2021</year> 
        <term>Michaelmas</term> 
        <description>Introduction to statistical inference.</description> 
    </course> 
</courses>'
In [13]:
xml <- xml2::read_xml(xml_txt)
In [14]:
str(xml)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Parsing XML tree example continued¶

In [15]:
children3 <- xml2::xml_children(xml)
children3
{xml_nodeset (2)}
[1] <course>\n  <title>Computer Programming for Social Scientists</title>\n   ...
[2] <course>\n  <title>Applied Statistical Analysis I</title>\n  <code>POP770 ...
In [16]:
pop77001 <- children3[1]
xml2::xml_children(pop77001)
{xml_nodeset (5)}
[1] <title>Computer Programming for Social Scientists</title>
[2] <code>POP77001</code>
[3] <year>2021</year>
[4] <term>Michaelmas</term>
[5] <description>Course on computer programming in Python and R.</description>
In [17]:
xml2::xml_text(xml_children(children3[1]))
[1] "Computer Programming for Social Scientists"     
[2] "POP77001"                                       
[3] "2021"                                           
[4] "Michaelmas"                                     
[5] "Course on computer programming in Python and R."

Examples of XML¶

  • RSS (Really Simple Syndication) feeds (a short parsing sketch follows after this list)
  • SVG (Scalable Vector Graphics) images
  • Modern office documents (Microsoft Office .docx, .xlsx, .pptx, OpenOffice/LibreOffice)
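A minimal sketch of reading an RSS feed (the feed URL is a hypothetical placeholder); since a feed is plain XML, the same xml2 functions apply (//item/title is an XPath expression, covered next):

# Extract item titles from an RSS feed
rss <- xml2::read_xml("https://example.com/news/feed.xml")
titles <- xml2::xml_find_all(rss, "//item/title")
xml2::xml_text(titles)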

Parsing XML/HTML with XPath¶

  • XPath (XML Path Language) is a language for selecting parts of XML/HTML tree
  • Basic syntax (two of these patterns are sketched after this list, the rest in the examples that follow):
    • / - select an element at the root node (e.g. /html/body)
    • // - select an element at any depth (e.g. //h1)
    • //<tag>/* - select all children of <tag> (e.g. //body/*)
    • //<tag>[@<attr>] - select all <tag> elements that have the given attribute (e.g. //h1[@style])
    • //<tag>[@<attr>='<value>'] - select all elements whose attribute has the given value (e.g. //h1[@style='color:Red;'])

Extra: XPath syntax
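Two patterns from the list above are not covered by the examples that follow; a minimal sketch using the same small html object parsed earlier:

# Root-path and wildcard selection
rvest::html_elements(html, xpath = "/html/body")  # path from the root node
rvest::html_elements(html, xpath = "//body/*")    # all children of <body>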

Parsing XML/HTML with XPath examples¶

In [18]:
rvest::html_elements(html, xpath = "//p")
{xml_nodeset (1)}
[1] <p>A paragraph.</p>
In [19]:
rvest::html_elements(html, xpath = "//h1[@style='color:Red;']")
{xml_nodeset (1)}
[1] <h1 style="color:Red;">A heading</h1>
In [20]:
xml2::xml_find_all(xml, xpath = "//code")
{xml_nodeset (2)}
[1] <code>POP77001</code>
[2] <code>POP77003</code>
In [21]:
# We can also find elements by text
xml2::xml_find_all(xml, xpath = "//code[text()='POP77001']")
{xml_nodeset (1)}
[1] <code>POP77001</code>

Scraping webpage¶

Source: Wikipedia

Scraping webpage with XPath example¶

In [22]:
html <- rvest::read_html("https://en.wikipedia.org/wiki/Members_of_the_1st_D%C3%A1il")
In [23]:
tables <- rvest::html_elements(html, xpath = "//table")
tables
{xml_nodeset (8)}
[1] <table class="box-More_citations_needed plainlinks metadata ambox ambox-c ...
[2] <table class="infobox vevent"><tbody>\n<tr><th colspan="2" class="infobox ...
[3] <table style="width:100%; border-collapse:collapse"><tbody><tr style="ver ...
[4] <table class="wikitable" style="font-size: 95%;"><tbody>\n<tr style="back ...
[5] <table class="wikitable" style="margin: 1em 1em 1em 0; background: #f9f9f ...
[6] <table class="wikitable"><tbody>\n<tr>\n<th>Constituency\n</th>\n<th>Outg ...
[7] <table class="wikitable"><tbody>\n<tr>\n<th>Winner\n</th>\n<th colspan="2 ...
[8] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style ...
In [24]:
tbody <- rvest::html_children(tables[5])
tbody
{xml_nodeset (1)}
[1] <tbody>\n<tr style="background-color:#E9E9E9;"><th colspan="4">Members of ...
In [25]:
tds <- rvest::html_table(tbody)
tds
[[1]]
# A tibble: 106 × 4
   `Members of the 1s… `Members of the 1s… `Members of the 1… `Members of the 1…
   <chr>               <chr>               <chr>              <chr>             
 1 Constituency        Name                "Party"            Party             
 2 Antrim East         Robert McCalmont    ""                 Irish Unionist    
 3 Antrim Mid          Hugh O'Neill        ""                 Irish Unionist    
 4 Antrim North        Peter Kerr-Smiley   ""                 Irish Unionist    
 5 Antrim South        Charles Curtis Cra… ""                 Irish Unionist    
 6 Armagh Mid          James Rolston Lons… ""                 Irish Unionist    
 7 Armagh North        William Allen       ""                 Irish Unionist    
 8 Armagh South        Patrick Donnelly    ""                 Irish Parliamenta…
 9 Belfast Cromac      William Arthur Lin… ""                 Irish Unionist    
10 Belfast Duncairn    Edward Carson       ""                 Irish Unionist    
# … with 96 more rows

Scraping webpage with XPath example continued¶

In [26]:
str(tds)
List of 1
 $ : tibble [106 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ Members of the 1st Dáil[4]: chr [1:106] "Constituency" "Antrim East" "Antrim Mid" "Antrim North" ...
  ..$ Members of the 1st Dáil[4]: chr [1:106] "Name" "Robert McCalmont" "Hugh O'Neill" "Peter Kerr-Smiley" ...
  ..$ Members of the 1st Dáil[4]: chr [1:106] "Party" "" "" "" ...
  ..$ Members of the 1st Dáil[4]: chr [1:106] "Party" "Irish Unionist" "Irish Unionist" "Irish Unionist" ...
In [27]:
tds <- tds[[1]]
head(tds)
  Members of the 1st Dáil[4] Members of the 1st Dáil[4]
1 Constituency               Name                      
2 Antrim East                Robert McCalmont          
3 Antrim Mid                 Hugh O'Neill              
4 Antrim North               Peter Kerr-Smiley         
5 Antrim South               Charles Curtis Craig      
6 Armagh Mid                 James Rolston Lonsdale    
  Members of the 1st Dáil[4] Members of the 1st Dáil[4]
1 Party                      Party                     
2                            Irish Unionist            
3                            Irish Unionist            
4                            Irish Unionist            
5                            Irish Unionist            
6                            Irish Unionist            
In [28]:
colnames(tds) <- tds[1,]
tds <- tds[-1,]
head(tds)
  Constituency Name                   Party Party         
1 Antrim East  Robert McCalmont             Irish Unionist
2 Antrim Mid   Hugh O'Neill                 Irish Unionist
3 Antrim North Peter Kerr-Smiley            Irish Unionist
4 Antrim South Charles Curtis Craig         Irish Unionist
5 Armagh Mid   James Rolston Lonsdale       Irish Unionist
6 Armagh North William Allen                Irish Unionist
In [29]:
tds <- tds[,-3]
str(tds)
tibble [105 × 3] (S3: tbl_df/tbl/data.frame)
 $ Constituency: chr [1:105] "Antrim East" "Antrim Mid" "Antrim North" "Antrim South" ...
 $ Name        : chr [1:105] "Robert McCalmont" "Hugh O'Neill" "Peter Kerr-Smiley" "Charles Curtis Craig" ...
 $ Party       : chr [1:105] "Irish Unionist" "Irish Unionist" "Irish Unionist" "Irish Unionist" ...

Web scraping in practice¶

  • Always check first whether an API for querying exists
  • It is the most robust (and sanctioned) way of obtaining data
  • Check copyrights and respect those when using scraped data
  • Limit your scraping bandwidth (introduce waiting times between queries; see the sketch after this list)
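A minimal sketch of the last point (the vector of URLs is a hypothetical placeholder):

# Pause between requests to limit scraping bandwidth
library("rvest")

urls <- c("https://example.com/page1", "https://example.com/page2")
pages <- vector("list", length(urls))

for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  Sys.sleep(2)  # wait 2 seconds before sending the next request
}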

Next¶

  • Tutorial: handling basic HTML and scraping web tables
  • Final project: Due at 23:59 on Monday, 20th December (submission on Blackboard)