Mining Text from PDF Files, Part 2: PDF with Tables
Intro
I wanted to find out how to mine text from PDF files with R. Last week I tried to extract text from a PDF file with just text in it. This week I will try extracting text from a PDF file with a table. Next week, I will try extracting it from a picture inside a PDF file.
I’m assuming you’re using RStudio as your IDE (Integrated Development Environment). Most of this should work just as well in another editor.
tabulizer in action
For this experiment, I’m using another cool package called tabulizer. If you’d like to know more about it, you should check out the package’s documentation.
1. Let’s get ready
Before we go any further, I’m going to load the packages we’ll be needing today:
library(tabulizer)
# The main package for this operation
library(tidyverse)
# Prerequisite to everything
library(writexl)
# My go-to package for writing Excel files
I’ll also show you the raw material. If you’d like to try this at home, you can save the PDF file shown below. What we’re looking at here is Spotify’s weekly top 100 chart for Finland (2021-05-14 - 2021-05-21), but in text form.
2. Read in the PDF file
Next, let’s read in the PDF file with the table inside, using extract_tables() from the tabulizer package:
pdf_with_table <- extract_tables("index_files/pdf_with_table.pdf", method = "decide")
# "decide" is the default method here; the others are "lattice" and "stream"
# Depending on the raw material, switching between them can give a neater outcome (see the documentation for more info)
pdf_with_table
## [[1]]
## [,1]
## [1,] "artist"
## [2,] "Lil Nas X"
## [3,] "Blind Channel"
## [4,] "SANNI"
## [5,] "Haloo Helsinki!"
## [6,] "Doja Cat"
## [7,] "Cledos, BEHM"
## [8,] "Masked Wolf"
## [9,] "BEHM"
## [10,] "Pyhimys, Eino Grön"
## [11,] "costee"
## [12,] "Riton, Nightcrawlers"
## [13,] "Tion Wayne, Russ Millions"
## [14,] "Nathan Evans"
## [15,] "Dua Lipa"
## [16,] "Etta"
## [17,] "BEHM"
## [18,] "Justin Bieber"
## [19,] "Keko Salata"
## [20,] "YB026, Nuteh Jonez"
## [21,] "ibe"
## [22,] "Billie Eilish"
## [23,] "The Weeknd"
## [24,] "william"
## [25,] "Polo G"
## [26,] "Olivia Rodrigo"
## [27,] "ATB, Topic, A7S"
## [28,] "Majestic, Boney M."
## [29,] "P!nk, Willow Sage Hart"
## [30,] "ABREU"
## [31,] "william"
## [32,] "Brädi"
## [33,] "Aurora"
## [34,] "Teflon Brothers, Pandora"
## [35,] "Portion Boys"
## [36,] "Lauri Tähkä"
## [37,] "Tiësto"
## [38,] "Pihlaja"
## [39,] "ibe, Blacflaco, Elastinen"
## [40,] "BEHM"
## [41,] "BEHM"
## [42,] "Ilta"
## [43,] "The Weeknd"
## [44,] "Alan Walker, Conor Maynard"
## [45,] "Kaija Koo"
## [46,] "BEHM"
## [47,] "Coldplay"
## [48,] "Tom Odell"
## [49,] "Ava Max"
## [50,] "Bella Poarch"
## [51,] "Pyrythekid"
## [,2]
## [1,] "track"
## [2,] "MONTERO (Call Me By Your Name)"
## [3,] "Dark Side"
## [4,] "Pettäjä"
## [5,] "Piilotan mun kyyneleet"
## [6,] "Kiss Me More (feat. SZA)"
## [7,] "Life (Sun luo)"
## [8,] "Astronaut In The Ocean"
## [9,] "Frida"
## [10,] "Hyvät hautajaiset"
## [11,] "Ne voi liittyy (feat. BIZI)"
## [12,] "Friday (feat. Mufasa & Hypeman) - Dopamine Re-Edit"
## [13,] "Body (Remix) [feat. ArrDee, E1 (3x3), ZT (3x3), Bugzy Malone, Buni, Fivio Foreign & Darkoo]"
## [14,] "Wellerman - Sea Shanty / 220 KID x Billen Ted Remix"
## [15,] "Levitating (feat. DaBaby)"
## [16,] "Prinsessa"
## [17,] "Hei rakas"
## [18,] "Peaches (feat. Daniel Caesar & Giveon)"
## [19,] "Kaipaan sua (feat. Boyat & Samuli Heimo)"
## [20,] "Steppasin Partyy"
## [21,] "Tunteet"
## [22,] "Your Power"
## [23,] "Save Your Tears (with Ariana Grande) (Remix)"
## [24,] "Penelope (feat. Clever)"
## [25,] "RAPSTAR"
## [26,] "good 4 u"
## [27,] "Your Love (9PM)"
## [28,] "Rasputin"
## [29,] "Cover Me In Sunshine"
## [30,] "20 Ave Mariaa"
## [31,] "Flyys"
## [32,] "Keväät"
## [33,] "Vettä kaivoon (feat. Keko Salata)"
## [34,] "I Love You"
## [35,] "Kyläbaari"
## [36,] "Aavikko"
## [37,] "The Business"
## [38,] "Paha Barbi"
## [39,] "WEST SIDE BABY"
## [40,] "Päästä varpaisiin"
## [41,] "Lupaan"
## [42,] "Kelle mä soitan"
## [43,] "Blinding Lights"
## [44,] "Believers"
## [45,] "Sateenkaari pimeessä"
## [46,] "Tivolit"
## [47,] "Higher Power"
## [48,] "Another Love"
## [49,] "My Head & My Heart"
## [50,] "Build a Bitch"
## [51,] "Epäkohtelias (feat. Axel Kala & Gettomasa)"
## [,3] [,4]
## [1,] "rank" "streams"
## [2,] "1" "275091"
## [3,] "2" "260403"
## [4,] "3" "255770"
## [5,] "4" "238089"
## [6,] "5" "236820"
## [7,] "6" "224839"
## [8,] "7" "202630"
## [9,] "8" "200910"
## [10,] "9" "196586"
## [11,] "10" "194852"
## [12,] "11" "186470"
## [13,] "12" "182369"
## [14,] "13" "182353"
## [15,] "14" "180922"
## [16,] "15" "175544"
## [17,] "16" "168578"
## [18,] "17" "167833"
## [19,] "18" "158543"
## [20,] "19" "154737"
## [21,] "20" "147176"
## [22,] "21" "143926"
## [23,] "22" "142865"
## [24,] "23" "142381"
## [25,] "24" "135300"
## [26,] "25" "132459"
## [27,] "26" "132176"
## [28,] "27" "129677"
## [29,] "28" "126374"
## [30,] "29" "125031"
## [31,] "30" "121863"
## [32,] "31" "121534"
## [33,] "32" "119095"
## [34,] "33" "117764"
## [35,] "34" "116293"
## [36,] "35" "114680"
## [37,] "36" "111908"
## [38,] "37" "110442"
## [39,] "38" "110222"
## [40,] "39" "110066"
## [41,] "40" "109843"
## [42,] "41" "108466"
## [43,] "42" "106940"
## [44,] "43" "106449"
## [45,] "44" "104221"
## [46,] "45" "103219"
## [47,] "46" "102244"
## [48,] "47" "99598"
## [49,] "48" "96021"
## [50,] "49" "95933"
## [51,] "50" "93762"
##
## [[2]]
## [,1]
## [1,] "The Weeknd"
## [2,] "Nightshift, Pyhimys, ibe, Dreas"
## [3,] "JVG"
## [4,] "AURORA"
## [5,] "Olivia Rodrigo"
## [6,] "Ellinoora"
## [7,] "Kube"
## [8,] "Janna"
## [9,] "Duncan Laurence"
## [10,] "Gasellit, Karri Koira"
## [11,] "Arttu Wiskari"
## [12,] "J. Cole"
## [13,] "Olivia Rodrigo"
## [14,] "Bruno Mars, Anderson .Paak, Silk Sonic"
## [15,] "Happoradio"
## [16,] "Ofenbach"
## [17,] "ibe"
## [18,] "J. Cole"
## [19,] "BEHM"
## [20,] "Haloo Helsinki!"
## [21,] "Erika Vikman"
## [22,] "ScurtDae"
## [23,] "J. Cole"
## [24,] "The Kid LAROI"
## [25,] "Imagine Dragons"
## [26,] "Erin"
## [27,] "Elias Kaskinen"
## [28,] "VIIVI"
## [29,] "J. Cole"
## [30,] "Kuningasidea"
## [31,] "Chebaleba"
## [32,] "Cardi B"
## [33,] "Studio Killers"
## [34,] "Elastinen"
## [35,] "Gettomasa"
## [36,] "Mikael Gabriel"
## [37,] "OneRepublic"
## [38,] "Kymppilinja"
## [39,] "Samu Haber"
## [40,] "Surf Curse"
## [41,] "Martin Garrix"
## [42,] "Glass Animals"
## [43,] "24kGoldn"
## [44,] "Axel Kala"
## [45,] "Poju"
## [46,] "DMNDS, Strange Fruits Music, Fallen Roses"
## [47,] "Aleksanteri Hakaniemi"
## [48,] "DJ Khaled"
## [49,] "Keko Salata"
## [50,] "Klamydia"
## [,2]
## [1,] "Save Your Tears"
## [2,] "Kivullisii"
## [3,] "Ikuinen vappu"
## [4,] "Runaway"
## [5,] "drivers license"
## [6,] "Dinosauruksii"
## [7,] "100"
## [8,] "Maailma meidän jälkeen"
## [9,] "Arcade"
## [10,] "Me ei mennä rikki"
## [11,] "Tässäkö tää oli? (feat. Leavings-Orkesteri)"
## [12,] "amari"
## [13,] "deja vu"
## [14,] "Leave The Door Open"
## [15,] "Jos et olis siinä"
## [16,] "Wasted Love (feat. Lagique)"
## [17,] "molemmat"
## [18,] "p r i d e . i s . t h e . d e v i l (with Lil Baby)"
## [19,] "Minä vai maailma (feat. Keko Salata)"
## [20,] "Lady Domina"
## [21,] "Syntisten pöytä"
## [22,] "Back to Life (Birthdae)"
## [23,] "m y . l i f e (with 21 Savage & Morray)"
## [24,] "WITHOUT YOU"
## [25,] "Follow You"
## [26,] "Niinku koko ajan"
## [27,] "Kerran elämässä"
## [28,] "Lääke"
## [29,] "i n t e r l u d e"
## [30,] "Pohjolan tuulet"
## [31,] "Kesäfiilistelyy (feat. RicoWamos)"
## [32,] "Up"
## [33,] "Jenny (I Wanna Ruin Our Friendship)"
## [34,] "Epäröimättä hetkeekään (feat. Jenni Vartiainen)"
## [35,] "Silmät"
## [36,] "Intiaanikesä"
## [37,] "Run"
## [38,] "Minä (feat. Mariska)"
## [39,] "Täältä tullaan"
## [40,] "Freaks"
## [41,] "We Are The People (feat. Bono & The Edge) - Official UEFA EURO 2020 Song"
## [42,] "Heat Waves"
## [43,] "Mood (feat. iann dior)"
## [44,] "Moni meist"
## [45,] "Esson baariin"
## [46,] "Calabria (feat. Lujavo & Nito-Onna)"
## [47,] "Bonsaipuu"
## [48,] "I DID IT (feat. Post Malone, Megan Thee Stallion, Lil Baby & DaBaby)"
## [49,] "Vanha (feat. BEHM)"
## [50,] "Pienen pojan elämää"
## [,3] [,4]
## [1,] "51" "93498"
## [2,] "52" "90460"
## [3,] "53" "89440"
## [4,] "54" "83727"
## [5,] "55" "83344"
## [6,] "56" "82398"
## [7,] "57" "82145"
## [8,] "58" "81706"
## [9,] "59" "81427"
## [10,] "60" "78202"
## [11,] "61" "77709"
## [12,] "62" "76892"
## [13,] "63" "75942"
## [14,] "64" "75299"
## [15,] "65" "74306"
## [16,] "66" "74285"
## [17,] "67" "73492"
## [18,] "68" "72613"
## [19,] "69" "71289"
## [20,] "70" "70890"
## [21,] "71" "70638"
## [22,] "72" "70592"
## [23,] "73" "70345"
## [24,] "74" "68620"
## [25,] "75" "67793"
## [26,] "76" "67644"
## [27,] "77" "67296"
## [28,] "78" "67076"
## [29,] "79" "66523"
## [30,] "80" "66243"
## [31,] "81" "65072"
## [32,] "82" "64987"
## [33,] "83" "64202"
## [34,] "84" "63930"
## [35,] "85" "63448"
## [36,] "86" "62128"
## [37,] "87" "62096"
## [38,] "88" "61114"
## [39,] "89" "60235"
## [40,] "90" "59496"
## [41,] "91" "58058"
## [42,] "92" "57694"
## [43,] "93" "57086"
## [44,] "94" "55938"
## [45,] "95" "55892"
## [46,] "96" "55805"
## [47,] "97" "55657"
## [48,] "98" "55644"
## [49,] "99" "55451"
## [50,] "100" "54190"
Okay, looks like we ended up with a list again. Let’s next turn it into a tibble by first using rbind to combine the list into a matrix.
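Before doing that with the real data, here is a minimal sketch of the idea on a toy list. The two small matrices and their values are made up for illustration; they only stand in for the page-level matrices that extract_tables() returned above. The column count check is my own addition, since rbind() needs every piece to have the same number of columns.

```r
# Toy stand-ins for the two page-level matrices (values are made up)
page_1 <- matrix(
  c("artist", "track",
    "Artist A", "Song A"),
  ncol = 2, byrow = TRUE
)
page_2 <- matrix(
  c("Artist B", "Song B",
    "Artist C", "Song C"),
  ncol = 2, byrow = TRUE
)
toy_list <- list(page_1, page_2)

# Check that every element has the same number of columns before stacking
col_counts <- vapply(toy_list, ncol, integer(1))
stopifnot(length(unique(col_counts)) == 1)

# do.call() passes the list elements as separate arguments,
# so this is equivalent to rbind(page_1, page_2)
combined <- do.call(rbind, toy_list)
dim(combined)
```

The nice thing about do.call() is that it works the same way whether the list has two elements or twenty, so you don't need to know the number of pages in advance.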
3. Turn the list into a matrix and then to a tibble
First, using the do.call() function, let’s tell R to use rbind() to stack the rows of the two matrices inside the list pdf_with_table. Then, using the pipe operator (%>%), one of my favorite things about the tidyverse, let’s feed the result to the as_tibble() function to create a tibble.
pdf_with_table_tbl <- do.call(rbind, pdf_with_table) %>%
  as_tibble()
pdf_with_table_tbl
## # A tibble: 101 x 4
## V1 V2 V3 V4
## <chr> <chr> <chr> <chr>
## 1 artist track rank streams
## 2 Lil Nas X MONTERO (Call Me By Your Name) 1 275091
## 3 Blind Channel Dark Side 2 260403
## 4 SANNI Pettäjä 3 255770
## 5 Haloo Helsinki! Piilotan mun kyyneleet 4 238089
## 6 Doja Cat Kiss Me More (feat. SZA) 5 236820
## 7 Cledos, BEHM Life (Sun luo) 6 224839
## 8 Masked Wolf Astronaut In The Ocean 7 202630
## 9 BEHM Frida 8 200910
## 10 Pyhimys, Eino Grön Hyvät hautajaiset 9 196586
## # ... with 91 more rows
Now, that looks better already, but we still need to turn the first row into column names. Let’s use a new package for that.
4. Turn first row into column names with janitor
janitor is a nice little package for cleaning data. We’ll be using its row_to_names() function. And since we only need it this once, we might as well call it as package::function() instead of loading the package with library(package) and then calling function() separately.
pdf_with_table_named_tbl <- pdf_with_table_tbl %>%
  janitor::row_to_names(row_number = 1)
pdf_with_table_named_tbl
## # A tibble: 100 x 4
## artist track rank streams
## <chr> <chr> <chr> <chr>
## 1 Lil Nas X MONTERO (Call Me By Your Name) 1 275091
## 2 Blind Channel Dark Side 2 260403
## 3 SANNI Pettäjä 3 255770
## 4 Haloo Helsinki! Piilotan mun kyyneleet 4 238089
## 5 Doja Cat Kiss Me More (feat. SZA) 5 236820
## 6 Cledos, BEHM Life (Sun luo) 6 224839
## 7 Masked Wolf Astronaut In The Ocean 7 202630
## 8 BEHM Frida 8 200910
## 9 Pyhimys, Eino Grön Hyvät hautajaiset 9 196586
## 10 costee Ne voi liittyy (feat. BIZI) 10 194852
## # ... with 90 more rows
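If you’d rather not install janitor for a single call, roughly the same result can be sketched in base R. This is only an approximation of what row_to_names() does, not its actual implementation, and the toy data frame below is made up for illustration.

```r
# Toy data frame with the header stuck in the first row (values are made up)
toy_tbl <- data.frame(
  V1 = c("artist", "Artist A", "Artist B"),
  V2 = c("track", "Song A", "Song B"),
  stringsAsFactors = FALSE
)

# Use the values of the first row as column names...
names(toy_tbl) <- as.character(unlist(toy_tbl[1, ]))

# ...then drop that row and reset the row numbering
toy_named <- toy_tbl[-1, , drop = FALSE]
rownames(toy_named) <- NULL
toy_named
```

The janitor version is shorter and fits into a pipe, which is why I went with it above.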
Nice! We’re almost there. Now we’ll just change the column types for rank and streams.
5. Mutate rank and streams to numeric
Let’s now mutate the two numeric columns from character type to numeric.
pdf_with_table_final_tbl <- pdf_with_table_named_tbl %>%
  mutate(
    rank = as.numeric(rank),
    streams = as.numeric(streams)
  )
pdf_with_table_final_tbl
## # A tibble: 100 x 4
## artist track rank streams
## <chr> <chr> <dbl> <dbl>
## 1 Lil Nas X MONTERO (Call Me By Your Name) 1 275091
## 2 Blind Channel Dark Side 2 260403
## 3 SANNI Pettäjä 3 255770
## 4 Haloo Helsinki! Piilotan mun kyyneleet 4 238089
## 5 Doja Cat Kiss Me More (feat. SZA) 5 236820
## 6 Cledos, BEHM Life (Sun luo) 6 224839
## 7 Masked Wolf Astronaut In The Ocean 7 202630
## 8 BEHM Frida 8 200910
## 9 Pyhimys, Eino Grön Hyvät hautajaiset 9 196586
## 10 costee Ne voi liittyy (feat. BIZI) 10 194852
## # ... with 90 more rows
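The values in this chart are plain digits, so the conversion goes through cleanly. With messier PDFs that is not guaranteed, so here is a small sketch of how you could check which values fail to convert. The example strings are made up, including two that an extraction could plausibly garble.

```r
# Made-up values: two clean ones, one with a stray space, one placeholder
streams_raw <- c("275091", "260403", "93 762", "n/a")

# as.numeric() returns NA (with a warning) for anything it cannot parse
streams_num <- suppressWarnings(as.numeric(streams_raw))

# List the original strings that failed to convert
failed <- streams_raw[is.na(streams_num)]
failed
```

If failed comes back empty, every value was converted; otherwise you know exactly which strings to clean up before mutating.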
That’s it! Now we just need to create the Excel file again.
6. Create Excel file with writexl
write_xlsx(pdf_with_table_final_tbl, "index_files/excel_from_table.xlsx")
# You should change the file path to suit your needs
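One thing to watch out for: as far as I know, write_xlsx() does not create missing folders for you, so the target folder needs to exist before writing. Here is a minimal sketch of making sure it does. The temporary directory and the toy data frame are assumptions for illustration only, so nothing gets written into your project.

```r
# Toy data frame standing in for the final tibble (values are made up)
toy_tbl <- data.frame(artist = "Artist A", streams = 123)

# Build the output path inside a temporary directory for this demo
out_path <- file.path(tempdir(), "index_files", "excel_from_table.xlsx")

# Create the target folder if it is missing (no warning if it already exists)
dir.create(dirname(out_path), recursive = TRUE, showWarnings = FALSE)

# Only write the file if writexl is installed
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(toy_tbl, out_path)
}
```

In a real project you would swap tempdir() for your own folder and toy_tbl for pdf_with_table_final_tbl.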
Outro
We took a different path because of the slightly different starting point, but we ended up with the same end result as last week. So, if you have tabular data inside a PDF, tabulizer is definitely worth checking out!
Thanks for reading this far. If you’re curious to see how the same data behaves when it’s in a picture inside a PDF (like a matryoshka doll), tune in next week for part 3 of this PDF trilogy! Until then, happy text mining!
P.S. I’m more than happy to chat about all things data. Just send me a message on LinkedIn if you wish to do so!
Updated: 29 May, 2021
Created: 29 May, 2021