Step-by-Step Guide: Scraping Empleos Publicos with R and Selenium

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers].

Because of delays with my scholarship payment, if this post is useful to you I kindly ask for a minimal donation on Buy Me a Coffee. It will be used to continue my open source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.

Motivation

My friend Nicolas Didier asked me about reading Empleos Publicos with R or Python. Here is a short example for him and anybody who may benefit from reading this.

The following steps were adapted from a tutorial I taught at the University of Michigan (GO BLUE!) in 2023.

Required R packages

RSelenium: R-Selenium integration
rvest: HTML processing
dplyr: to load the pipe operator (can be used later for data cleaning)
purrr: iteration (i.e., repeated operations)

I installed RSelenium from the R console:

if (!require(RSelenium)) install.packages("RSelenium")

# or

remotes::install_github("ropensci/RSelenium")

For the rest of the packages:

if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(purrr)) install.packages("purrr")
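
As an alternative to the line-by-line checks above, the following is a minimal sketch that reports which of the four packages are missing without attaching them. The helper name missing_packages() is my own and not part of any package; the install call is left commented out so that nothing is installed by accident.

```r
# Packages used in this post
pkgs <- c("RSelenium", "rvest", "dplyr", "purrr")

# Hypothetical helper: returns the subset of `pkgs` that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

requireNamespace() only checks that a package can be loaded, so the library() calls later in the post are still needed.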

Installing Selenium and Chrome/Chromium

Note for Ubuntu/Debian users: we need to check that Chrome or Chromium is installed on the system. One of the many options is to install it from the bash console:

sudo add-apt-repository ppa:savoury1/chromium
sudo apt update
sudo apt install chromium-browser
sudo apt install chromium-chromedriver

Not using the PPA will install the snap version of Chromium, which is not compatible with Selenium.

I tried to start Selenium as described in the official guide, and it did not work.

I had to install Chromium. I am on Manjaro and I ran sudo pacman -S chromium. Windows/Mac users can use Google Chrome.
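
Before going further, it can help to confirm from R that the browser, the driver, and Java are on the PATH. This is a rough sketch: the helper check_programs() is my own, and the executable names are assumptions that vary by operating system and distribution, so adjust them to your setup.

```r
# Hypothetical helper: looks up each program on the PATH with Sys.which()
check_programs <- function(programs) {
  found <- Sys.which(programs)
  data.frame(
    program = programs,
    path = unname(found),
    available = nzchar(unname(found)),  # "" means the program was not found
    row.names = NULL
  )
}

# Assumed executable names; they may differ on your system
check_programs(c("chromium", "chromium-browser", "google-chrome", "chromedriver", "java"))
```

Only one of the browser names needs to be available; Java is needed to run the Selenium Server JAR in the next step.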

An extra requirement was to download Selenium Server. Based on this, I started by creating a directory to store the data for this post by typing this in the VS Code terminal:

mkdir -p /tmp/didier-example
cd /tmp/didier-example

Then I opened R by typing R in the same terminal and downloaded the JAR file:

url_jar <- "https://github.com/SeleniumHQ/selenium/releases/download/selenium-3.9.1/selenium-server-standalone-3.9.1.jar"
sel_jar <- "selenium-server-standalone-3.9.1.jar"

if (!file.exists(sel_jar)) {
  download.file(url_jar, sel_jar)
}

I had to run Selenium from a new terminal:

cd /tmp/didier-example
java -jar selenium-server-standalone-3.9.1.jar
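
The server needs a few seconds before it accepts connections, so opening the session too early fails. The following base R sketch polls the port until something is listening. It assumes the server runs on localhost with the default port 4444; both helper names are my own.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened
port_is_open <- function(host = "localhost", port = 4444L, timeout = 1) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL
  )
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

# Hypothetical helper: retries once per second until max_wait seconds have passed
wait_for_selenium <- function(host = "localhost", port = 4444L, max_wait = 30) {
  deadline <- Sys.time() + max_wait
  while (Sys.time() < deadline) {
    if (port_is_open(host, port)) return(TRUE)
    Sys.sleep(1)
  }
  FALSE
}

# Usage (returns FALSE if the server did not come up in time):
# wait_for_selenium()
```

An open port only shows that something is listening; it does not prove that the process is Selenium or that Chrome can be launched.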

Back in the R terminal, I was finally able to control the browser from R:

library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")

rmDr$open(silent = TRUE)

url <- "https://www.empleospublicos.cl"

rmDr$navigate(url)

This should display a new Chrome/Chromium window that says “Chrome is being controlled by automated test software”.

Scraping the data

Using the browser’s inspector (Ctrl + Shift + I), I explored the page and found that the search bar corresponds to:

<input class="buscador-principal search form-control buscador-movil" name="q" type="search" autocomplete="off" placeholder="Ingresa el cargo, comuna o institución" id="buscadorprincipal">

For example, I can search for “Ministerio de Salud” because there were many posts by that organization on the landing page:

search_box <- rmDr$findElement(using = "id", value = "buscadorprincipal")
search_box$sendKeysToElement(list("Ministerio de Salud", key = "enter"))

That typed “Ministerio de Salud” and pressed Enter on my behalf. Inspecting the results, I see that each job offer starts with:

<div class="items col-md-4 col-lg-4 postulacion ...

The first offer listed is this:

<div class="items col-md-4 col-lg-4 postulacion otro otro eepp region7renta3calidad2 busqueda "><div class="item"><div class="top"><div class="label label-estado"><i class="fa fa-circle circulo-status1" aria-hidden="true"></i> Postulación hasta 30/09/2025 23:59:00</div><h3><a target="_blank" href="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo" onclick="ga('send', 'event', 'convocatorias', 'Medico (a) especialista en Anestesiología 44 horas | Servicio de Salud Maule / Hospital de Constitución', 'eepp');">Medico (a) especialista en Anestesiología 44 horas</a></h3><p>Servicio de Salud Maule / Hospital de Constitución</p></div><hr><div class="cnt"><p>Ministerio de Salud</p><p>Constitución</p><br><div class="alert alert-primer"><i class="fa fa-address-card" aria-hidden="true"></i>  No pide experiencia</div><div class="row card-footer"><div class="col-xs-9 col-md-8 text-left"><a class="cronograma btn " url="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo" onclick="return false;" href="#" title="Ver Cronograma de la Convocatoria"><i class="fa fa-calendar-days"></i> Calendarización</a>
        <div class="compartir-social">
            <div class="row">
                <div class="col-xs-3 col-md-4">
                    <a class="btn" onclick="enviarRS('t', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Twitter"><i class="fa-brands fa-square-x-twitter fa-xl" aria-hidden="true"></i></a>
                </div>
                <div class="col-xs-3 col-md-4">
                    <a class="btn" onclick="enviarRS('f', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Facebook"><i class="fa-brands fa-square-facebook fa-xl" aria-hidden="true"></i></a>
                </div>
                <div class="col-xs-3 col-md-4">
                    <a class="btn" onclick="enviarRS('l', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Linkedin"><i class="fa-brands fa-linkedin fa-xl" aria-hidden="true"></i></a>
                </div>
                <div class="col-xs-3 col-md-4">
                    <a class="btn whatsapp-link visible-xs visible-sm" title="Compartir en Whatsapp" onclick="enviarRS('w', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" data-action="share/whatsapp/share"><i class="fa-brands fa-square-whatsapp fa-xl" aria-hidden="true"></i></a>
                </div>
            </div>
        </div>
    <div class="row"><div class="col-md-12 card-footer-contenido "></div></div></div></div></div></div></div>
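
The results are rendered by the browser after the search is submitted, so reading the page source immediately may return an incomplete page. A simple polling helper can wait until the offers appear. This is a sketch under assumptions: wait_until() is my own helper, and the commented usage relies on the rmDr session and the div.items selector from this post.

```r
# Hypothetical helper: calls `condition` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Errors inside `condition` count as FALSE.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Usage with the browser session (needs a live Selenium session):
# wait_until(function() {
#   length(rmDr$findElements(using = "css selector", value = "div.items")) > 0
# })
```

A fixed Sys.sleep() would also work, but polling returns as soon as the condition holds and fails explicitly when it never does.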
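
I then read the rendered page source with rvest and extracted the position, organization, and city of each offer:

html <- read_html(rmDr$getPageSource()[[1]])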

offers <- html %>%
  html_nodes("div.items")

offers_tbl <- map_df(offers, function(offer) {
  # Extract position (job title)
  position <- offer %>%
    html_node("h3 a") %>%
    html_text(trim = TRUE)

  # Extract organization (usually the first <p> inside .top)
  organization <- offer %>%
    html_node(".top p") %>%
    html_text(trim = TRUE)

  # Extract city (the second <p> inside .cnt)
  city <- offer %>%
    html_nodes(".cnt p") %>%
    .[2] %>%
    html_text(trim = TRUE)

  tibble(
    position = position,
    organization = organization,
    city = city
  )
})

The result has the following structure:

offers_tbl
# A tibble: 552 × 3
   position                                                   organization city
   <chr>                                                      <chr>        <chr>
 1 Medico (a) especialista en Anestesiología 44 horas         Servicio de… Cons…
 2 Titulares de la Planta Profesional Ley 18.834              Servicio de… Valp…
 3 ENFERMERA-O, JORNADA DIURNA, GRADO 12, PARA SERVICIO CLÍN… Servicio de… Reco…
 4 Psiquiatra infanto-juvenil sistema de atención intersecto… Servicio de… La P…
 5 Neurólogo(a) adulto GES Alzheimer y otras demencias        Servicio de… Puen…
 6 Médico(a) especialista en Neurología Infantil Hospital de… Servicio de… Cast…
 7 Arquitecto de Software                                     Central de … Ñuñoa
 8 TENS OPERADOR DE EQUIPOS DE ESTERILIZACIÓN                 Servicio de… Peña…
 9 (850-2892) Médico Especialista Broncopulmonar o Internist… Servicio de… Talc…
10 Enfermero(a) Clínico(a) Atención Abierta y Cerrada         Servicio de… Huas…

glimpse(offers_tbl)
Rows: 552
Columns: 3
$ position     <chr> "Medico (a) especialista en Anestesiología 44 horas", "Ti…
$ organization <chr> "Servicio de Salud Maule / Hospital de Constitución", "Se…
$ city         <chr> "Constitución", "Valparaíso", "Recoleta", "La Pintana", "…
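
The same HTML also carries the link to each offer and the application deadline. The sketch below extracts both from a trimmed copy of the first offer shown earlier (share buttons removed), so it runs without a browser. The variable names are my own, and the date pattern assumes the "Postulación hasta dd/mm/yyyy" label format seen in that example.

```r
library(rvest)

# Trimmed copy of the first offer from the search results
snippet <- paste0(
  '<div class="items postulacion"><div class="item"><div class="top">',
  '<div class="label label-estado">Postulación hasta 30/09/2025 23:59:00</div>',
  '<h3><a href="https://www.empleospublicos.cl/pub/convocatorias/',
  'convpostularavisoTrabajo.aspx?i=130648&amp;c=0&amp;j=0&amp;',
  'tipo=convpostularavisoTrabajo">',
  'Medico (a) especialista en Anestesiología 44 horas</a></h3>',
  '<p>Servicio de Salud Maule / Hospital de Constitución</p></div>',
  '<div class="cnt"><p>Ministerio de Salud</p><p>Constitución</p></div>',
  '</div></div>'
)

offer <- read_html(snippet) %>%
  html_node("div.items")

# Link to the offer (href of the title anchor)
offer_url <- offer %>%
  html_node("h3 a") %>%
  html_attr("href")

# Deadline label, then the dd/mm/yyyy part converted to a Date
deadline_text <- offer %>%
  html_node(".label-estado") %>%
  html_text(trim = TRUE)

deadline <- as.Date(
  regmatches(deadline_text, regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", deadline_text)),
  format = "%d/%m/%Y"
)

offer_url
deadline
```

Inside the map_df() call, the same two extractions could be added as extra columns of the tibble.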

I know this is a simple example, but it should allow different kinds of exploration and data extraction. I hope it helps.
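
As one example of such exploration, dplyr can count the offers per city. The sketch below uses a small toy table with placeholder positions instead of the real offers_tbl, so it runs without a browser; replace offers_demo with offers_tbl to apply it to the scraped data.

```r
library(dplyr)

# Toy stand-in for offers_tbl (placeholder positions, not scraped data)
offers_demo <- tibble(
  position = c("Offer A", "Offer B", "Offer C"),
  city = c("Recoleta", "La Pintana", "Recoleta")
)

# Number of offers per city, most frequent first
offers_by_city <- offers_demo %>%
  count(city, sort = TRUE, name = "offers")

offers_by_city
```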

Key points of the article

The article demonstrates how to use R together with Selenium to scrape data from a public job catalog, specifically the website Empleos Publicos. The author lists the required R packages and the steps to install them, describes how to install Selenium and a compatible browser (in this case Chromium), and shows how to use them in tandem. The final part of the tutorial shows how to scrape the data and read it into a table for easy inspection.

Long-term implications and possible future developments

The demonstration provided in the article indicates there is potential for further simplification and automation in the process of extracting, processing, and visualizing data from various web sources. This application of R and Selenium opens possibilities to build more comprehensive and automated data scraping workflows in the future. Further developments could include the creation of functions or packages that allow for easy implementation of similar scraping tasks on other platforms, or even general-purpose utilities that simplify many common scraping tasks.

Actionable advice

Although the article provided an excellent step-by-step guide on using Selenium and R to scrape a specific website, generalizing this methodology could be useful in many other contexts. Consequently, readers and aspiring data analysts or data scientists could widen their skillsets by applying the provided code and understanding to other websites and use-cases. However, it’s necessary to understand and respect the terms of service of each website before initiating web scraping activities since not all websites permit this.

Regarding possible improvements, learning how to handle possible errors (e.g., the website undergoing changes, causing the scraper to break) and creating more dynamic, reusable code might be beneficial. Also, considering privacy and ethical issues while scraping personal data is crucial.
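
One simple way to notice that the website has changed is to validate the scraped table before using it. The following is a sketch in base R: check_offers() is a hypothetical helper, and the column names are the ones used in the article's offers_tbl.

```r
# Hypothetical helper: stops with a clear message when the scraped table
# looks wrong, which usually means that the CSS selectors are outdated
check_offers <- function(tbl, required = c("position", "organization", "city")) {
  missing_cols <- setdiff(required, names(tbl))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (nrow(tbl) == 0) {
    stop("No offers found; the 'div.items' selector may be outdated.")
  }
  all_na <- required[vapply(tbl[required], function(x) all(is.na(x)), logical(1))]
  if (length(all_na) > 0) {
    stop("Columns with only missing values: ", paste(all_na, collapse = ", "))
  }
  invisible(tbl)
}

# Usage after scraping:
# check_offers(offers_tbl)
```

Failing early with a message is easier to debug than an empty or all-NA table discovered later in the analysis.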

Finally, exploring technologies similar to R and Selenium, understanding their strengths and weaknesses, could be valuable since it may offer more efficient or easier ways to achieve the same results.