Two months ago I found the trove of papers on Nancy Leveson’s MIT homepage1, which planted the seed of a thought. Academic homepages are full of interesting reading (unpublished rambles; administrivia; pre-prints, the published versions of which are inaccessible behind paywalls). Academic homepages are also woefully ill-maintained. Why not scrape them?
They’re good candidates. Most are hand-written static sites with simple DOMs. Most are reachable from central departmental indices.
In the days after I found Dr. Leveson’s homepage, in the breaks between her papers, I threw together a scrapy
spider for pulling PDF URLs a department at a time, and for jotting those URLs down into text files… but that’s where the project stalled out.2
https://arxiv.org/pdf/1711.02226.pdf
http://proceedings.mlr.press/v70/carmon17a/carmon17a.pdf
http://proceedings.mlr.press/v70/namkoong17a/namkoong17a-supp.pdf
https://web.stanford.edu/~jduchi/projects/SinhaDu16.pdf
...
My scraper was fast, but I had no interest in actually downloading tens of gigabytes of PDFs––tens of thousands of files!––that I would, realistically, never read. The code would be annoying to write and molassify my home network. I moved on to other work.
But what is cloud computing for, if not the senseless accumulation of online stuff? Google Cloud Platform (or the competing cloud platform of your choice) has everything we need to store and index the papers my scraper discovers: Cloud Storage to hold the documents, and Cloud Functions to process them.
I haven’t used Cloud Functions much in the past (I often reach for App Engine out of habit, even when Cloud Functions will suffice). My impressions from this project:
Advantages | Disadvantages |
---|---|
Concurrency without code | Awkward local debugging |
Easy IAM permissioning | Awkward remote debugging |
Built-in event triggers | Multi-minute deployments |
Multi-language pipelines | |
Instead of focusing on their shortcomings, we’ll walk through what they handled well: I’ve indexed more than 20,000 PDFs––just over 20 gigabytes––and extracted 500 megabytes of plain text.
$ gsutil ls "gs://documents-repository/**" | grep ".pdf$" | wc -l
20243
$ gsutil du -e "*.txt" -sh "gs://documents-repository/"
20.07 GiB gs://documents-repository
$ gsutil du -e "*.pdf" -sh "gs://documents-repository/"
498.76 MiB gs://documents-repository
All this without using ssh
to connect to a server, configuring a network, sharing a client among concurrent routines, or spending more than $1.00.
Our first challenge: process the old files of jotted-down URLs and persist the PDFs. Several properties of this problem make Cloud Functions attractive.
- I can pipe the PDFs straight from the network into Cloud Storage without holding an entire file in memory.
- Since there are so many PDFs, I'd like to download as many of them concurrently as possible. A highly concurrent program on a single device might end up I/O-bottlenecked, but each Cloud Function has its own network resources! Need more concurrent uploads? GCP will automatically provision more Cloud Function instances.
- Since invoking a Function is lightweight––a POST request with a URL––we can eventually invoke it directly from my scraper instead of writing PDF URLs to an intermediate file. If the scraper code had to download and upload files itself, it'd be unworkably slow.
- The scraper doesn't validate the URLs; they end with .pdf, but they could point at anything. Some may lead to sites that no longer exist. Some may point at redirects. Since we have few guarantees about the quality of our input, we don't have to bother managing failures in individual uploads. Let them fail!
We can split uploading documents into two scripts: a Cloud Function which takes a single URL and pipes the PDF to Cloud Storage, and a script that turns my scraped files into invocations of that Cloud Function.
Piping PDF data from a response body into GCS reminded me of Chris Roche’s post on splitting io.Readers. Since this Cloud Function is self-contained and provides a neatly defined interface ((url) => pdf), we don’t have to worry about our upload language being suitable for analyzing/indexing the PDFs. We’re free to choose whatever language fits our upload style. In Go:
// Package nmt contains a Cloud Function for streaming documents to GCS.
package nmt
import (
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"net/url"
"cloud.google.com/go/storage"
)
// message is the data to parse from an invocation body. At this stage, there
// is no metadata besides a PDF URL.
type message struct {
URL string `json:"url"`
}
// Name for the bucket to which documents should be written. This Cloud
// Function assumes the bucket is in the same GCP project.
const bucketName = "documents-repository"
// Ingest streams the file at a URL to Cloud Storage.
func Ingest(w http.ResponseWriter, r *http.Request) {
// Parse invocation body for PDF URL.
var decoded message
if err := json.NewDecoder(r.Body).Decode(&decoded); err != nil {
http.Error(w, "Failed to decode message", 400)
return
}
filename, url, err := parseURL(decoded.URL)
if err != nil {
http.Error(w, "Failed to parse URL", 400)
return
}
resp, err := http.Get(url.String())
if err != nil {
http.Error(w, "Failed to request URL", 404)
return
}
defer resp.Body.Close()
if resp.Header.Get("Content-Type") != "application/pdf" {
http.Error(w, "Not a PDF", 404)
return
}
// Create a storage client.
ctx := context.Background()
storageClient, err := storage.NewClient(ctx)
if err != nil {
http.Error(w, "Failed to create storage client.", 500)
return
}
defer storageClient.Close()
// Initialize a storage object by creating a writer for it.
writer := storageClient.Bucket(bucketName).Object(filename).NewWriter(ctx)
defer writer.Close()
	// Return a 202 Accepted status before streaming the body to Cloud Storage.
	w.WriteHeader(http.StatusAccepted)
// Pipe PDF body to Cloud Storage.
writer.ContentType = resp.Header.Get("Content-Type")
written, err := io.Copy(writer, resp.Body)
if err != nil {
fmt.Println("Error encountered when piping to Cloud Storage", err)
}
fmt.Printf("Wrote %d to Cloud Storage\n", written)
}
// parseURL parses a URL. The returned filename is the URL host and URL path,
// concatenated. GCS treats the `/` tokens as folder separators.
func parseURL(input string) (filename string, parsed *url.URL, err error) {
	parsed, err = url.Parse(input)
	if err != nil {
		return "", nil, err
	}
	filename = fmt.Sprintf("%s%s", parsed.Host, parsed.Path)
	return filename, parsed, nil
}
Less than a hundred lines, and not a buffer in sight! That io.Copy()
call––from the HTTP response to the newly initialized GCS object––is doing the bulk of the work. Almost everything else is error handling.3 There’s pleasantly little tool-specific boilerplate, so this could be refactored into an HTTP handler function on a webserver if we saw fit.
Once this Cloud Function’s deployed, we can invoke it in a fairly tight loop: additional Cloud Function instances are automatically provisioned to manage the load. I wrote a short aiohttp
Python script for this. Uploading 20 GB of PDFs––hundreds of PDFs at a time––takes just minutes. Conveniently, the object names are their URLs.
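That invocation script isn’t reproduced here, but a minimal sketch of the idea, assuming the function is deployed at a placeholder HTTPS trigger URL and the scraped URLs live in jots.txt, looks something like this (the JSON body matches the message struct the Go function decodes):

import asyncio

import aiohttp

# Placeholder HTTPS trigger URL for the deployed Ingest function.
FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/ingest"

async def invoke(session, semaphore, pdf_url):
    # Bound concurrency so we don't open thousands of sockets at once.
    async with semaphore:
        async with session.post(FUNCTION_URL, json={"url": pdf_url}) as resp:
            print(resp.status, pdf_url)

async def main():
    with open("jots.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    semaphore = asyncio.Semaphore(100)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(invoke(session, semaphore, u) for u in urls))

asyncio.run(main())

Failed invocations just print a non-2xx status and move on, in keeping with the “let them fail” approach above.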
Extracting text from PDFs is hard. PDFs are essentially visual documents; they’re meant to be read visually rather than parsed programmatically. There are, broadly, two ways of turning them into plain text:
- Optical Character Recognition (OCR) programs like Tesseract trade off efficiency and ease of configuration to play by this expectation: they extract text by considering a PDF visually. With intense computational requirements and long runtimes, OCR programs suit persistent compute resources.
- Some PDFs (the ones wherein you can highlight text, copy/paste, etc.) have text embedded in them; we can use programs to extract that text directly. Unfortunately, this just doesn’t work if a PDF doesn’t include encoded text.
OCR is overkill for our PDF-indexing use case; extracting a bag of words will do just fine, and there’s such a wealth of PDFs on academic homepages that we can satisfy ourselves with indexing most of them. That’s not to say this text extraction is a simple thing to build yourself, or even easy to solve with libraries in a variety of languages: I struggled for hours with a Go module before giving up and switching to Python.
This is a sweet feature of a radically modular infrastructure: for a given stage of our data pipeline, we’re free to pick the language with the best support. When we need to pull text from PDFs, we can pick the language with the most effective published tools (just as we picked a language that suited our ingestion strategy). Python has plenty; pdfminer.six is decent. In a Cloud Function:
from google.cloud import storage
import tempfile
from pdfminer.high_level import extract_text
from flask import abort
def main(request):
"""main is theCloud Function entry point. It downloads the specified
existing PDF from GCS to a temporary location, extracts the text from that
PDF, then uploads that text to a new GCS object.
request -- the Flask request that invoked this Cloud Function.
"""
# Parse object name from request.
request_json = request.get_json()
objectName = request_json['object']
assert objectName.endswith(".pdf")
print("Got request to handle", objectName)
# Connect to GCS bucket.
client = storage.Client()
bucket = client.bucket("documents-repository")
# Initialize temporary file for downloaded PDF.
pdf = tempfile.NamedTemporaryFile()
try:
        # Download blob into temporary file, extract, and upload.
bucket.blob(objectName).download_to_filename(pdf.name)
return extract(bucket, objectName, pdf.name)
except Exception as err:
        return abort(500, f"Exception while extracting text: {err}")
def extract(bucket: storage.Bucket, objectName: str, pdf: str):
"""extract pulls text out of a downloaded PDF and uploads the result to a
new GCS object in the same bucket.
bucket -- the GCS bucket to which the resulting text will be uploaded.
objectName -- the prefix for the uploaded text object. Usually this is the
object name for the processed PDF.
pdf -- the filename for a downloaded PDF from which to extract text.
"""
# TODO: silence pdfminer noisy logging.
text = extract_text(pdf)
# Upload extracted text to new GCS object.
dest_blob = bucket.blob(objectName + ".txt")
dest_blob.upload_from_string(text)
return text
Downloading a PDF to a temporary file is clumsy, but the published tools expect filenames (and, as I said, they’re complex enough to dictate my implementation). Like the upload Cloud Function before it, we can invoke this one in a tight loop using the scraped URLs from before… et voilà! After a few minutes, the pdfminer
output for each stored PDF is tucked next to its corresponding PDF in the bucket.
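To sanity-check that pairing, here’s a quick sketch (assuming application-default credentials and the bucket name used throughout) that lists the bucket and flags any PDF without its .txt sibling:

from google.cloud import storage

# Every .pdf object should have a sibling text object named "<object>.pdf.txt".
client = storage.Client()
names = {blob.name for blob in client.list_blobs("documents-repository")}
pdfs = [n for n in names if n.endswith(".pdf")]
missing = [n for n in pdfs if n + ".txt" not in names]
print(f"{len(pdfs)} PDFs, {len(missing)} missing extracted text")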
This extraction strategy works, but it involves manually triggering both stages: first we trigger PDF ingestion, then, once the upload is finished, we separately trigger our text extraction function. Instead, we can trigger only the ingestion and let the “after the upload is completed” event invoke our extraction Function automatically.
Google Cloud Storage Triggers are a neat Pub/Sub interface for invoking Cloud Functions. Instead of manually announcing “hey, I uploaded this object, it’s ready for processing,” the Cloud Function can consume the finalize event that GCS publishes whenever a Storage object is created or overwritten. Refactoring our code from before:
def on_finalized(event, _):
"""on_finalized is the Cloud Function entry point for handling GCS object
finalized events. It downloads the specified PDF from GCS into a temporary
file, extracts the text from that PDF, then uploads that text to a new GCS
object.
event -- the received GCS event. Includes the bucket name and the name of
the finalized object.
"""
bucket = event['bucket']
objectName = event['name']
# Skip non-PDF files: this function writes to the bucket it watches.
if not objectName.endswith(".pdf"):
print("Skipping request to handle", objectName)
return
print("Extracting text from", objectName)
# Connect to GCS bucket.
client = storage.Client()
bucket = client.bucket(bucket)
# Initialize temporary file for downloaded PDF.
pdf = tempfile.NamedTemporaryFile()
try:
        # Download blob into temporary file, extract, and upload.
bucket.blob(objectName).download_to_filename(pdf.name)
extracted = extract(bucket, objectName, pdf.name)
print("Success: extracted {} characters".format(len(extracted)))
except Exception as err:
print("Exception while extracting text", err)
The extract
function––and, indeed, everything but the code to pull the objectName
from the Pub/Sub message––is unchanged. With this update, there’s no need to invoke anything but the ingestion function: once a PDF has been streamed into a GCS object, text extraction kicks off automatically; we don’t need any second pass over scraped URLs.
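One small consolation for the awkward local debugging mentioned earlier: the trigger payload is just a dictionary, so the handler can be exercised locally against an object already in the bucket (assuming application-default credentials; the object name below is one of the scraped URLs from before):

# Hand the handler a minimal, hand-built finalize event.
on_finalized({"bucket": "documents-repository",
              "name": "web.stanford.edu/~jduchi/projects/SinhaDu16.pdf"}, None)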
We can daisy-chain this further by writing Cloud Functions invoked by finalize
events on .txt objects, or publish events to Pub/Sub topics linking successive pipeline stages. TFIDF? Elasticsearch? Write the next stage in TypeScript? The sky’s the limit.
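As one concrete (and hypothetical) sketch of the next link in the chain, a Function triggered by finalize events on the .txt objects might look like this, with word counts standing in for real indexing:

from collections import Counter

from google.cloud import storage

def on_text_finalized(event, _):
    """Hypothetical next stage: when a .txt object is finalized, log its most
    common terms. A real stage might feed TF-IDF scoring or an Elasticsearch
    index instead.

    event -- the received GCS event, with the bucket and object name.
    """
    name = event["name"]
    if not name.endswith(".txt"):
        return
    client = storage.Client()
    text = client.bucket(event["bucket"]).blob(name).download_as_text()
    counts = Counter(text.lower().split())
    print(name, counts.most_common(10))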
I skimmed most of the PDFs listed on Dr. Leveson’s site. If you’re interested in software-system safety analysis, my personal favorites were:
- “Systems Theoretic Process Analysis (STPA) of an Offshore Supply Vessel Dynamic Positioning System.” I skipped a lot of the topic-specific detail, but its STPA overview is one of the best I read and its example (maintaining the relative positions of two ships without running either aground) is memorable.
- “Software Deviation Analysis: A ‘Safeware’ Technique.” Discusses modeling a software system and, well, kind of Chaos Monkeying it: see what happens when different combinations of the software’s controls are violated.
- “Inside Risks: An Integrated Approach to Safety and Security Based on Systems Theory.” A strong case for cross-applying principles from system safety (Leveson’s primary focus) into information security (my primary focus these days).
Leveson doesn’t index them by title, but they’re in there!↩︎
The scraper that produces the lists of PDF URLs is pretty janky. It requires some tweaking for each academic department. The abridged code:
import scrapy
from scrapy.linkextractors import LinkExtractor
class CustomLinkExtractor(LinkExtractor):
def __init__(self, *args, **kwargs):
super(CustomLinkExtractor, self).__init__(*args, **kwargs)
# Keep the default values in "deny_extensions" *except* for PDFs.
self.deny_extensions = [ext for ext in self.deny_extensions if ext != ".pdf"]
class Spider(scrapy.Spider):
name = "nmt-spider"
# allowed_domains limit the domains to be scraped rather than the PDF links
# to be extracted.
allowed_domains = ['department.domain.edu', 'personal.sites.domain.edu']
def start_requests(self):
        # An initial entry point; usually a faculty index.
        yield scrapy.Request(url="https://department.domain.edu/faculty")
    def parse(self, response):
        pdf_extractor = CustomLinkExtractor(allow=r'.*\.pdf$')
        with open("jots.txt", "a") as f:
            for pdf_link in pdf_extractor.extract_links(response):
                f.write(pdf_link.url + "\n")
        for link in LinkExtractor().extract_links(response):
            # ...filter out problematic links here before yielding.
            yield response.follow(link, callback=self.parse)
Running scrapy runspider scrape.py
yields a file jots.txt
full of PDF URLs.↩︎
I made a conscious effort to stick to “The Kyoto School of Go Nihilism”––every err
checked, every defer
red function executed.↩︎