Downloading Academic Papers Automatically

I’ve been using ebib as my bibliography manager for the last three years of my PhD, and have loved how integrated it is into Emacs. Whether writing in org-mode, LaTeX or ConTeXt, I can get autocompletion for all of my references from my main bibliography file, and insert native citation commands for the language that I am currently writing in. It even supports creating a sub-bibliography file that contains only the references used by the current project but stays linked to my main bibliography file, so that changes propagate in both directions. It also has powerful filtering options that make it easy to group and find related papers. However, the main reason I initially wanted to learn it was the extensibility that is inherent to an Emacs-based application.

** Automatic ID Generation

The first useful feature that ebib provides is automatic ID generation, which reuses the BibTeX key generation built into Emacs (bibtex-generate-autokey). I had already used this with org-ref, where I had changed the generation function slightly so that it did not produce colons in the key name, and had already applied it to my bibliography file. The change relies on advice, a great feature of Emacs and Lisp that lets you wrap an existing function with extra code that gets executed every time the original function is called. In this case, the advice does a string replacement to remove any colons from the output of the original function.

(advice-add 'bibtex-generate-autokey :around
            (lambda (orig-func &rest args)
              (replace-regexp-in-string ":" "" (apply orig-func args))))

As ebib reuses this function, my advice automatically applies to every key that ebib generates, so I did not need to configure anything else for it to behave properly.
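
To see what the advice does without touching a real bibliography, here is a minimal sketch that applies the same wrapper to a stand-in function. The names my/demo-autokey and my/strip-colons are hypothetical and exist only for this illustration. Using a named function instead of a lambda also makes the advice easy to remove again.

```elisp
;; Hypothetical stand-in for `bibtex-generate-autokey': it always returns
;; the same key, colons included, so the effect of the advice is visible.
(defun my/demo-autokey ()
  "Return a fixed key containing colons, for demonstration only."
  "doe:2020:title")

(defun my/strip-colons (orig-func &rest args)
  "Call ORIG-FUNC with ARGS and remove any colons from its result."
  (replace-regexp-in-string ":" "" (apply orig-func args)))

;; Wrap the stand-in function; every call now goes through the wrapper.
(advice-add 'my/demo-autokey :around #'my/strip-colons)

(message "%s" (my/demo-autokey)) ; prints doe2020title

;; Named advice can be removed again, restoring the original behaviour.
(advice-remove 'my/demo-autokey #'my/strip-colons)
```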

** Automatic Paper Downloads

Ebib allows for a lot of extra content to be stored together with your bibliography entry. It handles this extra information nicely because it always uses the ID of the entry to locate it, without having to create additional entries inside of the bib file. I use this mainly to store notes associated with papers as well as their PDF versions. This allows me to go to any entry in ebib and just press ‘N’ to view the associated notes, or ‘f’ to open the PDF (inside of Emacs of course). However, the latter assumes that you have manually downloaded the PDF associated with that bib entry into the right folder and named it after the key of the entry in the bib file. I used to do this manually, but it took quite a bit of work and seemed like something I should automate.

The first step is figuring out how to get the ID of the current entry when in the ebib index buffer (the one that lists all the bib entries). I knew of a command, bound to C k, that copies the key of the entry at point. We can find out which function is executed when pressing these keys using the lovely built-in describe-key function, and then look at how it is implemented using describe-function, which also links to the source code for the function (which you can obviously modify as you want and re-evaluate to change the behaviour at runtime). This shows that we can use the following function to retrieve the key of the entry: ebib--get-key-at-point. For example, if we want to create a function that will check if a file exists for the current entry, we can write the following:

(defun ebib-check-file ()
  "Check if current entry has a file associated with it."
  (interactive)
  (let ((key (ebib--get-key-at-point)))
    (unless (file-exists-p (concat (car ebib-file-search-dirs) "/" key ".pdf"))
      (error "[Ebib] No PDF found."))))

When executing this function in the ebib index buffer, we will get an error if the file is not present, and nothing at all otherwise. ebib-file-search-dirs contains a list of directories that should be searched for a file associated with the current entry (and we only care about the first one in this case).
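
Since the same path has to be built wherever a PDF is checked or saved, it can be handy to factor it out. The following is only a sketch: my/ebib-pdf-path is a hypothetical helper name, and it uses expand-file-name rather than concat so that a trailing slash in the directory does not matter.

```elisp
;; `ebib-file-search-dirs' is provided by ebib; it is declared here only so
;; that this snippet loads on its own.
(defvar ebib-file-search-dirs)

(defun my/ebib-pdf-path (key &optional dir)
  "Return the expected PDF path for the entry KEY.
DIR defaults to the first entry of `ebib-file-search-dirs'."
  (expand-file-name (concat key ".pdf")
                    (or dir (car ebib-file-search-dirs))))

;; On a Unix-like system:
;; (my/ebib-pdf-path "doe2020" "/tmp/papers") returns "/tmp/papers/doe2020.pdf"
```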

Then, if the file is not present, we want to download the PDF, so we now need a simple download function. Let’s focus on getting papers from the ACM first. In Emacs we can download a file from a URL using the url-copy-file function, so all we need is to generate a URL to pass to that function. To do that we can look at a few PDFs in the ACM Digital Library and see what their URLs look like. Luckily, it seems like it’s based on the DOI for the paper, which should be available in the bib entry, so we can write the following function:

(defun acm-pdf-url (doi)
  "Generate the URL for a paper from the ACM based on the DOI."
  (concat "https://dl.acm.org/doi/pdf/" doi))

This of course assumes that you have access to the paper, either because it’s open access or because you have access through your university. We can then download it from there using the following:

(defun download-pdf-from-doi (key doi)
  "Download pdf from doi with KEY name."
  (url-copy-file (acm-pdf-url doi) (concat (car ebib-file-search-dirs) "/" key ".pdf")))

We then wrap it in a top-level function that can be called interactively and retrieves all the important information from the current bib entry in ebib.

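(defun ebib-download-pdf-from-doi ()
  "Download a PDF for the current entry."
  (interactive)
  (let ((key (ebib--get-key-at-point)))
    ;; Check the key before using it to look up any fields.
    (unless key (error "[Ebib] No key assigned to entry"))
    (let ((doi (ebib-get-field-value "doi" key ebib--cur-db 'noerror 'unbraced 'xref)))
      ;; Without a DOI there is nothing to build the URL from.
      (unless doi (error "[Ebib] No DOI found for entry %s" key))
      (download-pdf-from-doi key doi))))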

As you can see, we can get values for arbitrary fields using the ebib-get-field-value function, which I found using the same describe-key and describe-function trick that led me to the key lookup above.
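
One caveat with this approach is that a publisher may answer with an HTML login or landing page instead of the paper, in which case the saved file will not be a real PDF. A minimal sketch of a sanity check, assuming it is enough to look for the "%PDF-" marker that PDF files start with (my/pdf-file-p is a hypothetical helper name):

```elisp
(defun my/pdf-file-p (file)
  "Return non-nil if FILE starts with the PDF marker \"%PDF-\"."
  (and (file-readable-p file)
       (with-temp-buffer
         (set-buffer-multibyte nil)
         ;; Read only the first five bytes of FILE, without any decoding.
         (insert-file-contents-literally file nil 0 5)
         (string= (buffer-string) "%PDF-"))))

;; Possible use after a download to PATH:
;; (unless (my/pdf-file-p path)
;;   (delete-file path)
;;   (error "[Ebib] Downloaded file is not a PDF"))
```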

This will only work with papers from the ACM, but we can easily add support for other publishers such as Springer, IEEE and arXiv. This is mostly straightforward, except for the IEEE, where I needed to realise that in most cases they use the last few numbers of the DOI as their indexing number, so I had to implement the function as follows:

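(defun ieee-pdf-url (doi)
  "Generate the URL for a paper from the IEEE based on the DOI.
Return nil if DOI does not end in a numeric component."
  ;; In most cases the IEEE indexing number is the final numeric
  ;; component of the DOI, so extract the digits after the last dot.
  (when (string-match "\\.\\([0-9]+\\)$" doi)
    (let ((doi-bit (match-string 1 doi)))
      (concat "https://ieeexplore.ieee.org/stampPDF/getPDF.jsp?tp=&arnumber=" doi-bit "&ref="))))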

arXiv is also a bit special because its own unique identifier is normally stored in the ‘eprint’ field, so the URL is built from that field rather than from the DOI.
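
The two remaining URL builders are not shown in this post, so the following is only a sketch of what they might look like. The URL patterns are assumptions based on how arXiv and Springer links commonly look, not something taken from the code above, and they may need adjusting.

```elisp
(defun arxiv-pdf-url (eprint)
  "Generate the URL for a paper from arXiv based on its EPRINT identifier.
Assumes PDFs are served under https://arxiv.org/pdf/."
  (concat "https://arxiv.org/pdf/" eprint))

(defun springer-pdf-url (doi)
  "Generate the URL for a paper from Springer based on the DOI.
Assumes PDFs are served under https://link.springer.com/content/pdf/."
  (concat "https://link.springer.com/content/pdf/" doi ".pdf"))
```

Both follow the same shape as acm-pdf-url: a pure function from an identifier to a URL, which keeps the download logic itself independent of the publisher.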

*** A More Robust Downloader

We now have all these functions that can download PDFs from various sources, but we just need a way to decide which URL to use. We could ask the user to choose when they want to download the PDF, but I argue that there is normally enough information in the bib entry to automatically choose. The final heuristic I came up with, which seems to mostly work well, is the following:

(defun download-pdf-from-doi (key &optional doi publisher eprint journal organization url)
  "Download the PDF for KEY, choosing the source from the entry's fields."
  ;; Default missing fields to "" so the string functions never see nil.
  (let ((pub  (or publisher ""))
        (epr  (or eprint ""))
        (jour (or journal ""))
        (org  (or organization ""))
        (link (or url "")))
    (url-copy-file (cond
                    ;; No DOI: fall back to the URL field.
                    ((not doi) link)
                    ((or (string-match "ACM" (upcase pub))
                         (string-match "association for computing machinery" (downcase pub)))
                     (acm-pdf-url doi))
                    ((string-match "arxiv" (downcase pub))
                     (arxiv-pdf-url epr))
                    ;; IEEE can show up in several different fields.
                    ((or (string-match "IEEE" (upcase pub))
                         (string-match "IEEE" (upcase jour))
                         (string-match "IEEE" (upcase org)))
                     (ieee-pdf-url doi))
                    ((string-match "springer" (downcase pub))
                     (springer-pdf-url doi))
                    (t (error "Cannot possibly find the PDF any other way")))
                   (concat (car ebib-file-search-dirs) "/" key ".pdf"))))

It looks at the DOI, publisher, eprint, journal, organization and URL fields. It first checks whether it got a DOI; if it didn’t, the URL is used directly. Then it checks if the publisher is the ACM using different possible spellings, and if so uses the ACM link to download the PDF. Then it checks if the publisher is arXiv, and uses the eprint field to download it. IEEE is the trickiest, as it can appear in various fields depending on the conference or journal of the original entry. We therefore check the publisher field, journal field and organization field. Finally, we check if the publisher is Springer and download it from there.
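
The interactive wrapper also has to pass these extra fields along. A minimal sketch of how that could look, reusing ebib-get-field-value as before (the name my/ebib-download-pdf is hypothetical, and the field names are assumed to be the standard BibTeX ones):

```elisp
;; `ebib--cur-db' is defined by ebib; declared here so the snippet loads alone.
(defvar ebib--cur-db)

(defun my/ebib-download-pdf ()
  "Download a PDF for the entry at point, choosing the source heuristically."
  (interactive)
  (let ((key (ebib--get-key-at-point)))
    (unless key (error "[Ebib] No key assigned to entry"))
    ;; Look up each field in the order `download-pdf-from-doi' expects;
    ;; fields that are missing from the entry come back as nil.
    (apply #'download-pdf-from-doi key
           (mapcar (lambda (field)
                     (ebib-get-field-value field key ebib--cur-db
                                           'noerror 'unbraced 'xref))
                   '("doi" "publisher" "eprint" "journal" "organization" "url")))))
```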

The complete code is available.

** Automatic Syncing of Papers to a Remarkable Tablet

Finally, reading papers on your laptop or on the desktop is not a great experience. I therefore got myself a Remarkable tablet, which has served me well for taking notes as well as reading papers. The main selling point of the tablet is the extremely low latency for drawing and writing compared to other E Ink tablets. However, it also has a nifty feature which makes it ideal for reading papers even though it’s essentially an A5 piece of paper: you can crop PDF margins, which makes papers much more readable without having to zoom in and move around the PDF, and this cropping is preserved when turning pages as well as when closing and reopening the PDF. I also love that it runs Linux, whereas other tablets usually run Android.

One downside is that it has a fairly closed-source ecosystem with respect to the applications used for syncing files to the tablet. Fortunately, there is a great community around the Remarkable to counteract this, which has produced, for example, the rmapi tool for downloading and uploading files from the command line, and the lines-are-rusty tool, which produces SVG from Remarkable lines files.

Therefore, we can use rmapi to sync all the files in my bibliography to the Remarkable by just running:

ls *.pdf | xargs -n1 rmapi put

which will try to upload all my files every time I call it, but nicely enough it fails quickly whenever the file already exists on the Remarkable.
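
The same upload could also be triggered from inside Emacs. This is only a sketch under the assumption that rmapi is on the PATH and already authenticated; my/rmapi-program and my/rmapi-upload-pdfs are hypothetical names introduced for this example.

```elisp
(defvar my/rmapi-program "rmapi"
  "Name of the rmapi executable (hypothetical variable for this sketch).")

(defun my/rmapi-upload-pdfs (dir)
  "Run `rmapi put' on every PDF file in DIR.
Return an alist of (FILE . EXIT-STATUS); output is collected in the
*rmapi* buffer."
  (mapcar (lambda (pdf)
            ;; Uploads run one at a time and block Emacs until they finish.
            (cons pdf (call-process my/rmapi-program nil "*rmapi*" nil "put" pdf)))
          (directory-files dir t "\\.pdf\\'")))
```

Calling (my/rmapi-upload-pdfs (car ebib-file-search-dirs)) would then mirror the shell one-liner, with a non-zero exit status expected for files that are already on the tablet.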