Creating text data from IATF resolutions for use in text analysis
Development of an R package for text data mining
In line with our proposed project on the use of text analysis and topic modelling for policy analysis of the COVID-19 pandemic response in the Philippines, we first had to identify which sources of text were most appropriate and then assess how we could extract text from them. We knew there would always be a trade-off between the ideal sources of text data for our purpose and the sources whose text could be mined in a reasonably efficient way.
In our search for sources, we learned about the Inter-Agency Task Force for the Management of Emerging Infectious Diseases (IATF) and its mandate to respond to issues concerning emerging infectious diseases in the Philippines. The IATF was convened in early January 2020 in response to the growing outbreak in Wuhan, China. Since then, the task force has released several resolutions that reflect the national government’s ongoing assessment of the evolution and spread of COVID-19 and its intended response. Given this, we felt that the IATF resolutions were the key manifestation of the national government’s overarching policy with regard to the COVID-19 pandemic, from which other more detailed and specific departmental and sub-national policies were derived.
We noted that the Department of Health (DoH), which chairs the IATF, has routinely uploaded copies of resolutions to its website once they have been formally endorsed and signed by all departmental members. We later found out that IATF resolutions are also released online via the website of the country’s official journal, The Official Gazette. Based on these considerations, we decided to focus on using the IATF resolutions for our policy analysis project.
The next challenge was how to extract the text from these resolutions into a data structure and format useful for text analysis and topic modelling. In this post, we describe how we used the R language for statistical computing to develop a supervised text extraction workflow described in the diagram below:
Step 1: Downloading the resolutions
The resolutions are available either from the DoH website or from The Official Gazette website. There are two key differences between these two sources. First, the DoH only provides resolutions starting from March 2020 at Resolution No. 9. Second, the DoH uploads the most recent resolutions a little bit later than The Official Gazette.
We automated the bulk download of all available resolutions from either source in R by first creating a function that reads the HTML page/s for the resolutions and extracts the download link/URL for each available resolution. Then, using these links, we created another function that downloads the PDFs into a temporary directory, from which they can be accessed in the next step of the workflow.
Extracting download links/URLs of resolutions
To extract download links/URLs of resolutions from either source, we created wrappers around web scraping functions in the rvest package.
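As a rough illustration of the kind of rvest calls such a wrapper might rely on, the sketch below collects PDF links from a small, made-up HTML page. It is not the code of the functions described in this post: the helper name extract_pdf_links, the example.org URLs and the rule of keeping only links ending in .pdf are all assumptions made for the example.

```r
library(rvest)

# Hypothetical helper (not the package's actual code):
# collect the text and target of every link that points at a PDF
extract_pdf_links <- function(page) {
  anchors <- html_elements(page, "a")
  links <- data.frame(
    title = html_text(anchors, trim = TRUE),
    url = html_attr(anchors, "href"),
    stringsAsFactors = FALSE
  )
  # Keep only anchors whose href ends in .pdf (assumed filtering rule)
  links[!is.na(links$url) & grepl("\\.pdf$", links$url, ignore.case = TRUE), ]
}

# A made-up page standing in for a real listing of resolutions
page <- read_html('<html><body>
  <a href="https://example.org/files/resolution-9.pdf">Resolution No. 9</a>
  <a href="https://example.org/about">About</a>
  <a href="https://example.org/files/resolution-10.pdf">Resolution No. 10</a>
</body></html>')

pdf_links <- extract_pdf_links(page)
```

On a live page, read_html() would be given the page URL instead of an HTML string, and the selector would need to match that page's actual markup.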
For the resolutions on the DoH website, we created the function get_iatf_links. This function requires only a single argument, base, which is the URL of the DoH webpage for IATF resolutions. The current URL is https://www.doh.gov.ph/COVID-19/IATF-Resolutions, and this is used as the default value for base. So, to get the links/URLs for the resolutions found on the DoH website, we use:
get_iatf_links()
which gives the following:
## # A tibble: 41 x 7
## id title date source type url checked
## <dbl> <chr> <date> <chr> <chr> <chr> <date>
## 1 9 Recommendations f… 2020-03-03 IATF reso… https://doh.gov.… 2020-06-17
## 2 10 Recommendations f… 2020-03-09 IATF reso… https://doh.gov.… 2020-06-17
## 3 11 Recommendations f… 2020-03-12 IATF reso… https://doh.gov.… 2020-06-17
## 4 12 Recommendations f… 2020-03-13 IATF reso… https://doh.gov.… 2020-06-17
## 5 13 Recommendations f… 2020-03-17 IATF reso… https://doh.gov.… 2020-06-17
## 6 14 Resolutions Relat… 2020-03-20 IATF reso… https://doh.gov.… 2020-06-17
## 7 15 Resolutions Relat… 2020-03-25 IATF reso… https://doh.gov.… 2020-06-17
## 8 16 Additional Guidel… 2020-03-30 IATF reso… https://doh.gov.… 2020-06-17
## 9 17 Recommendations R… 2020-03-30 IATF reso… https://doh.gov.… 2020-06-17
## 10 18 Recommendations R… 2020-04-01 IATF reso… https://doh.gov.… 2020-06-17
## # … with 31 more rows
The result is a tibble with 7 columns and a number of rows equal to the number of resolutions available from the DoH website on the day the output was produced. The column named checked in the output indicates when it was produced. The number of rows may change between two calls to this function, as newer resolutions may have been uploaded between the first call and the second.
For the resolutions in The Official Gazette, we created the get_iatf_gazette function. As with the get_iatf_links function, this requires the URL of the IATF resolutions webpage in The Official Gazette, which is set at https://www.officialgazette.gov.ph/section/laws/other-issuances/inter-agency-task-force-for-the-management-of-emerging-infectious-diseases-resolutions/. However, a second argument named pages is required. Unlike the DoH page, which has the download links to all the resolutions on a single page, The Official Gazette uses pagination to split the download links across multiple pages. To get the links of all the resolutions, we need to supply a vector of the page numbers of all the pages that contain download links. So, to get the links/URLs for all available resolutions found on The Official Gazette website, we use:
get_iatf_gazette(pages = 1:6)
after determining from the website that the download links are contained in 6 pages.
This gives the following result:
## # A tibble: 53 x 7
## id title date source type url checked
## <dbl> <chr> <date> <chr> <chr> <chr> <date>
## 1 46 RESOLUTION NO. 46… 2020-06-15 IATF reso… https://www.offi… 2020-06-17
## 2 46 RESOLUTION NO. 46… 2020-06-15 IATF reso… https://www.offi… 2020-06-17
## 3 45 RESOLUTION NO. 45… 2020-06-10 IATF reso… https://www.offi… 2020-06-17
## 4 44 RESOLUTION NO. 44… 2020-06-08 IATF reso… https://www.offi… 2020-06-17
## 5 NA OMNIBUS GUIDELINE… 2020-06-03 IATF reso… https://www.offi… 2020-06-17
## 6 43 RESOLUTION NO. 43… 2020-06-03 IATF reso… https://www.offi… 2020-06-17
## 7 42 RESOLUTION NO. 42… 2020-06-01 IATF reso… https://www.offi… 2020-06-17
## 8 41 RESOLUTION NO. 41… 2020-05-29 IATF reso… https://www.offi… 2020-06-17
## 9 40 RESOLUTION NO. 40… 2020-05-27 IATF reso… https://www.offi… 2020-06-17
## 10 NA OMNIBUS GUIDELINE… 2020-05-22 IATF reso… https://www.offi… 2020-06-17
## # … with 43 more rows
Notice that the output has the same class and structure as the output of the previous function; the only difference is that it has more rows. This is because, as mentioned earlier, The Official Gazette includes the first 8 resolutions issued in January and February 2020 and is updated sooner than the DoH website after resolutions are endorsed and signed by the IATF.
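The idea of looping over a vector of page numbers can be sketched as follows. This is only an illustration under stated assumptions: the helpers build_page_urls and scrape_pages are hypothetical, the "page/<n>/" URL suffix is an assumed pattern rather than something confirmed here, and a stand-in scraper is used so that the example runs without network access.

```r
# Hypothetical helper: build one URL per listing page.
# The "/page/<n>/" suffix is an assumed pattern for illustration only.
build_page_urls <- function(base, pages) {
  base <- sub("/+$", "", base)          # drop any trailing slash
  paste0(base, "/page/", pages, "/")
}

# Hypothetical helper: apply a scraping function to each page
# and stack the per-page results into a single data.frame
scrape_pages <- function(urls, scrape_one) {
  results <- lapply(urls, scrape_one)
  do.call(rbind, results)
}

urls <- build_page_urls("https://example.org/resolutions/", pages = 1:3)

# Stand-in scraper returning one row per page, so the sketch runs offline
fake_scraper <- function(url) data.frame(url = url, stringsAsFactors = FALSE)

all_links <- scrape_pages(urls, fake_scraper)
```

In practice, the stand-in scraper would be replaced by a function that reads each page and extracts its download links.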
Downloading the resolutions
The most important information in the output of either function is the column named url, as this contains the links/URLs to the PDFs of the various resolutions.
We automated the download of the resolution PDFs from the DoH site using a convenience function called get_iatf_pdf. In this function, the user can choose which specific resolution to download by specifying the argument id. If the user wants to download the PDF for Resolution No. 29, the following call is made, with iatfLinks holding the table of links obtained in the previous step:
iatfLinks %>%
  get_iatf_pdf(id = 29)
which gives the following output:
## [1] "/var/folders/rx/nr32tl5n6f3d_86tn0tc7kc00000gp/T//RtmpEQY0KS/iatfResolution29.pdf"
The output is a character vector giving the path, within a temporary directory, where the PDF for Resolution No. 29 has been stored under the filename iatfResolution29.pdf.
It is considered good practice for a function that downloads files from the internet to save them into a temporary directory, as this avoids writing into the user's own folders without being asked. It is also convenient for the user, as the PDFs do not have to be kept permanently on the local computer but can instead be read and used in R immediately.
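A minimal sketch of this download-to-temporary-directory pattern is shown below. The helper download_to_temp is hypothetical and is not the get_iatf_pdf function itself. So that the example runs offline, a local file addressed through a file:// URL stands in for a remote PDF; this assumes a Unix-like system where temporary paths start with "/".

```r
# Hypothetical helper: download a file into the session's temporary
# directory and return the path where it was saved
download_to_temp <- function(url, filename) {
  destfile <- file.path(tempdir(), filename)
  # mode = "wb" keeps binary files such as PDFs intact
  download.file(url = url, destfile = destfile, mode = "wb", quiet = TRUE)
  destfile
}

# Offline demonstration: a local file stands in for a remote PDF
src <- tempfile(fileext = ".txt")
writeLines("placeholder content", src)

path <- download_to_temp(paste0("file://", src), "example.txt")
```

Because the file lives under tempdir(), it is removed when the R session ends, so anything that must persist has to be copied elsewhere explicitly.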
Steps 2, 3 and 4: Reading, cleaning and structuring text data from the resolutions
These steps form the supervised component of the workflow.
Using the output from the previous step, the PDF/s can be read into R using the pdftools package. Typically, reading a PDF with pdftools would be as straightforward as:
pdftools::pdf_text(pdf = "PATH/TO/PDF")
However, the IATF resolutions have been released as downloadable PDF documents produced by scanning the hard copy, signed versions of the resolutions. This type of PDF requires a different function from the pdftools package to extract the data:
pdftools::pdf_ocr_text(pdf = "PATH/TO/SCANNED/PDF")
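When it is not known in advance whether a PDF is scanned, one possible approach is to try plain text extraction first and fall back to OCR only if almost nothing comes back. The sketch below illustrates this; the helpers needs_ocr and read_pdf_text and the 20-character threshold are assumptions made for the example, not part of pdftools or of the workflow described here.

```r
# Hypothetical check: treat a document as scanned when every page
# yields fewer than min_chars characters of extractable text
needs_ocr <- function(page_text, min_chars = 20) {
  all(nchar(trimws(page_text)) < min_chars)
}

# Hypothetical wrapper: plain extraction first, OCR as a fallback
# (requires pdftools, and tesseract for the OCR branch, when called)
read_pdf_text <- function(path) {
  txt <- pdftools::pdf_text(pdf = path)
  if (needs_ocr(txt)) {
    txt <- pdftools::pdf_ocr_text(pdf = path)
  }
  txt
}

# The check applied to made-up page contents
scanned_pages <- c("", "  \n")
digital_pages <- "WHEREAS, on March 16, 2020, to prevent the sharp rise of COVID-19 cases"

scanned_flag <- needs_ocr(scanned_pages)
digital_flag <- needs_ocr(digital_pages)
```

Since all the IATF resolutions are scanned documents, calling pdf_ocr_text directly, as above, is the simpler choice for this project.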
Whilst this type of document can be read by the pdftools package, the extracted text often requires a lot of manual cleaning, and some words or phrases need to be edited. Below is an example of the raw output after reading the PDF of IATF Resolution No. 29 and splitting the text at line breaks.
iatfLinks %>%                               ## Reference iatfLinks table
  get_iatf_pdf(id = 29) %>%                 ## Download PDF of Resolution 29
  pdftools::pdf_ocr_text() %>%              ## Extract text using OCR
  stringr::str_split(pattern = "\n") %>%    ## Split lines of text
  unlist() %>%                              ## Flatten list into a character vector
  head(n = 30)                              ## Show first 30 lines of text
## Converting page 1 to iatfResolution29_1.png... done!
## Converting page 2 to iatfResolution29_2.png... done!
## Converting page 3 to iatfResolution29_3.png... done!
## Converting page 4 to iatfResolution29_4.png... done!
## [1] "g \\ REPUBLIC OF THE PHILIPPINES"
## [2] ""
## [3] "; ay ; INTER-AGENCY TASK FORCE"
## [4] ""
## [5] "%, - FOR THE MANAGEMENT OF EMERGING INFECTIOUS DISEASES"
## [6] "“it Ore on Emer"
## [7] ""
## [8] "RESOLUTION NO. 29"
## [9] "Series of 2020"
## [10] "April 27, 2020"
## [11] "RECOMMENDATIONS RELATIVE TO THE MANAGEMENT"
## [12] "OF THE CORONAVIRUS DISEASE 2019 (COVID-19) SITUATION"
## [13] ""
## [14] "WHEREAS, on March 16, 2020, to prevent the sharp rise of COVID-19 cases in the"
## [15] "country, the President placed the entirety of Luzon under Enhanced Community Quarantine until"
## [16] "April 14, 2020;"
## [17] ""
## [18] "WHEREAS, on March 30, 2020, to develop a science-based approach in determining"
## [19] "whether the Enhanced Community Quarantine in Luzon should be totally or partially lifted,"
## [20] "extended, or expanded to other areas, the Inter-Agency Task Force (IATF) convened a"
## [21] "sub-Technical Working Group tasked to define parameters in assessing recent developments in"
## [22] "the Philippine COVID-19 situation;"
## [23] ""
## [24] "WHEREAS, on April 3, 2020, the [ATF finalized the parameters for deciding on the"
## [25] "lifting or extension of the Enhanced Community Quarantine in Luzon, which include trends on"
## [26] "the COVID-19 epidemiological curve, the health capacity of the country, social factors,"
## [27] "economic factors, and security factors;"
## [28] ""
## [29] "WHEREAS, on April 7, 2020, upon the recommendation of the IATF, the President"
## [30] "extended the implementation of the Enhanced Community Quarantine over the entirety of Luzon"
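Output like this, with stray characters at the top of the page (elements 1 to 6) and misread letters such as "[ATF" for "IATF" (element 24), is the reason why Steps 2, 3 and 4 cannot be fully automated or performed in an unsupervised way.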
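Recurring misreadings can still be corrected in bulk before the line-by-line review. The sketch below shows one way this might be done with a lookup of known errors; the helper fix_ocr and the lookup ocr_fixes are hypothetical, and base R's gsub is used here in place of stringr. It does not remove the need to check the text by eye.

```r
# Hypothetical lookup of recurring OCR misreadings and their corrections
ocr_fixes <- c("[ATF" = "IATF")

# Hypothetical helper: apply each literal replacement, then trim whitespace
fix_ocr <- function(lines, fixes) {
  for (wrong in names(fixes)) {
    # fixed = TRUE treats "[" literally rather than as regex syntax
    lines <- gsub(wrong, fixes[[wrong]], lines, fixed = TRUE)
  }
  trimws(lines)
}

raw <- c(
  "WHEREAS, on April 3, 2020, the [ATF finalized the parameters",
  "  Series of 2020  "
)

clean <- fix_ocr(raw, ocr_fixes)
```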
Following is the complete script used to read, clean and structure Resolution No. 29 in a manner that is amenable for text analysis:
## Read and re-structure text data
y <- iatfLinks %>%                          ## Reference iatfLinks table
  get_iatf_pdf(id = 29) %>%                 ## Download PDF of Resolution 29
  pdftools::pdf_ocr_text() %>%              ## Extract text using OCR
  stringr::str_split(pattern = "\n") %>%    ## Split lines of text
  unlist()                                  ## Flatten list into a character vector
## Restructure text and clean
y <- y[c(14:48, 55:90, 98:109, 115:149)]
y <- y[y != ""]
y[69] <- "Francisco T. Duque III Karlo Alexei B. Nograles"
y[70] <- "Secretary, Department of Health Cabinet Secretary, Office of the Cabinet Secretary"
y[71] <- "IATF Chairperson IATF Co-Chairperson"
y[74] <- "1. I am presently an Assistant Secretary of the Department of Health;"
y[83] <- stringr::str_replace(string = y[83], pattern = "\\[", replacement = "I")
y[84] <- stringr::str_replace(string = y[84], pattern = "\\[", replacement = "I")
y[87] <- stringr::str_replace(string = y[87], pattern = "\\[", replacement = "I")
y[93] <- stringr::str_replace(string = y[93], pattern = "1g!", replacement = "28th")
y[95] <- "Kenneth G. Ronquillo, MD, MPHM"
y[97] <- stringr::str_replace(string = y[97], pattern = "\\[", replacement = "I")
y <- stringr::str_trim(string = y, side = "both")
## Add heading
y <- c(c("Republic of the Philippines",
"Inter-Agency Task Force",
"for the Management of Emerging Infectious Diseases",
"Resolution No. 29",
"Series of 2020",
"27 April 2020",
"Recommendations relative to the management",
"of the coronavirus disease 2019 (COVID-19) situation"), y)
## Add section
pStart <- which(stringr::str_detect(string = y, pattern = "WHEREAS"))[1]
oStart <- which(stringr::str_detect(string = y, pattern = "RESOLVED"))[1]
pEnd <- oStart - 1
eStart <- which(stringr::str_detect(string = y, pattern = "APPROVED"))[1]
oEnd <- eStart - 1
cStart <- which(stringr::str_detect(string = y, pattern = "CERTIFICATE|CERTIFICATION"))[1]
eEnd <- cStart - 1
section <- NULL
section[1:8] <- "heading"
section[pStart:pEnd] <- "preamble"
section[oStart:oEnd] <- "operative"
section[eStart:eEnd] <- "endorsement"
section[cStart:length(y)] <- "certification"
## Re-structure text data.frame
y <- data.frame(linenumber = seq_along(y),
                text = y,
                source = "IATF",
                type = "resolution",
                id = 29,
                section = section,
                date = as.Date("27/04/2020", format = "%d/%m/%Y"),  ## %Y for 4-digit year
                stringsAsFactors = FALSE)
## Convert data.frame to tibble
iatfResolution29 <- tibble::as_tibble(y)
This results in the following text data:
## # A tibble: 105 x 7
## linenumber text source type id section date
## <int> <chr> <chr> <chr> <dbl> <chr> <date>
## 1 1 Republic of the Philippines IATF resol… 29 heading 2020-04-27
## 2 2 Inter-Agency Task Force IATF resol… 29 heading 2020-04-27
## 3 3 for the Management of Emer… IATF resol… 29 heading 2020-04-27
## 4 4 Resolution No. 29 IATF resol… 29 heading 2020-04-27
## 5 5 Series of 2020 IATF resol… 29 heading 2020-04-27
## 6 6 27 April 2020 IATF resol… 29 heading 2020-04-27
## 7 7 Recommendations relative t… IATF resol… 29 heading 2020-04-27
## 8 8 of the coronavirus disease… IATF resol… 29 heading 2020-04-27
## 9 9 WHEREAS, on March 16, 2020… IATF resol… 29 preamb… 2020-04-27
## 10 10 country, the President pla… IATF resol… 29 preamb… 2020-04-27
## # … with 95 more rows
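The section labelling used in the script can also be expressed as a small reusable function. The sketch below is one possible way to do it, not the code used in the package: the helper label_sections is hypothetical, it assumes that everything before the first "WHEREAS" is the heading, and it assumes that the marker words appear in the expected order. The input lines are made up for the example.

```r
# Hypothetical helper: label each line with the section it belongs to,
# based on the first line on which each marker word appears
label_sections <- function(lines) {
  first_match <- function(pattern) grep(pattern, lines)[1]
  starts <- c(
    heading       = 1,
    preamble      = first_match("WHEREAS"),
    operative     = first_match("RESOLVED"),
    endorsement   = first_match("APPROVED"),
    certification = first_match("CERTIFICATE|CERTIFICATION")
  )
  # Every marker must be found, and in increasing order
  stopifnot(!anyNA(starts), !is.unsorted(starts, strictly = TRUE))
  # findInterval() maps each line number to the latest section start before it
  names(starts)[findInterval(seq_along(lines), starts)]
}

# Made-up lines standing in for a cleaned resolution
lines <- c(
  "Republic of the Philippines",
  "Resolution No. 0",
  "WHEREAS, a first consideration;",
  "WHEREAS, a second consideration;",
  "NOW, THEREFORE, BE IT RESOLVED, as it is hereby resolved:",
  "RESOLVED FURTHER, that the following apply:",
  "APPROVED during the meeting of the task force.",
  "Name of signatory",
  "CERTIFICATION",
  "This is to certify that the above is accurate."
)

sections <- label_sections(lines)
```

A check of this kind fails loudly when a marker is missing or out of order, which is useful given that OCR errors can garble the marker words themselves.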
Step 5: Save the data
Once the data have been read, processed and structured, they are saved in .rda format for further use in text analysis or topic modelling in R. To allow for replicability of the whole data extraction, processing and structuring process, and to facilitate distribution of the text data outputs, we packaged the functions we’ve developed and the datasets we produced into a standard, portable R package named covidphtext. We then made the package available via GitHub, from which those who need the functions and the text datasets can install it. A detailed description of the package, including installation and use, can be found here.
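Saving and restoring a dataset in .rda format can be sketched with base R as follows. The object exampleResolution is a small stand-in created for the example, and the file is written to a temporary directory; this is not the package's own build script.

```r
# Small stand-in for a structured resolution dataset
exampleResolution <- data.frame(
  linenumber = 1:2,
  text = c("Republic of the Philippines", "Inter-Agency Task Force"),
  stringsAsFactors = FALSE
)

# Save the object under its own name in .rda format
rda_path <- file.path(tempdir(), "exampleResolution.rda")
save(exampleResolution, file = rda_path)

# Load into a fresh environment to confirm the object round-trips
restored <- new.env()
load(rda_path, envir = restored)
```

Unlike .rds files, an .rda file restores objects under the names they were saved with, which is why datasets shipped in R packages are commonly stored this way.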