arxiv

arxiv.py

Python wrapper for the arXiv API.

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Installation

$ pip install arxiv

In your Python script, include the line

import arxiv

Examples

Fetching results

import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum".
search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
  print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
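
Because `results` is a generator, it can be consumed only once, which is why the example above calls `client.results(search)` again for the loop instead of reusing `results`. The following is a minimal offline sketch of that behavior; `fake_results` is a hypothetical stand-in for a real API call, not part of this package.

```python
from typing import Generator, List


def fake_results(titles: List[str]) -> Generator[str, None, None]:
    """Hypothetical stand-in for Client.results: yields one title at a time."""
    for title in titles:
        yield title


results = fake_results(["Paper A", "Paper B", "Paper C"])

# The first pass consumes the generator...
first_pass = list(results)
# ...so a second pass over the same object yields nothing.
second_pass = list(results)

print(first_pass)   # ['Paper A', 'Paper B', 'Paper C']
print(second_pass)  # []
```

If you need to traverse the results more than once, either keep the list or request a fresh generator from the client.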

Downloading papers

To download a PDF of the paper with ID "1605.08386v1", run a Search and then use Result.download_pdf():

import arxiv

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")

The same interface is available for downloading .tar.gz files of the paper source:

import arxiv

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")
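
According to the source listing further down this page, the source archive URL is derived from the PDF URL by swapping the download domain and replacing the `/pdf/` path segment with `/src/`. The sketch below reproduces that string manipulation with the standard library only; the function names and the sample PDF URL are illustrative assumptions, not the package's public API.

```python
from urllib.parse import urlparse


def substitute_domain(url: str, domain: str) -> str:
    """Replace the host part of `url` with `domain`."""
    return urlparse(url)._replace(netloc=domain).geturl()


def source_url_from_pdf_url(pdf_url: str, download_domain: str = "export.arxiv.org") -> str:
    """Derive the source-archive URL by swapping the /pdf/ segment for /src/."""
    return substitute_domain(pdf_url, download_domain).replace("/pdf/", "/src/")


# Illustrative PDF URL (assumed shape; real values come from Result.pdf_url).
pdf_url = "https://arxiv.org/pdf/1605.08386v1"
print(source_url_from_pdf_url(pdf_url))
# https://export.arxiv.org/src/1605.08386v1
```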

Fetching results with a custom client

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10.0,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)
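
To get a feel for the trade-off between `page_size` and `delay_seconds`, the arithmetic can be sketched as follows. This is a rough estimate that ignores retries and network time; the helper names are hypothetical, and the default values (`page_size=100`, `delay_seconds=3.0`) are taken from the `Client.__init__` signature in the source listing.

```python
import math


def requests_needed(total_results: int, page_size: int) -> int:
    """Number of page requests needed to fetch `total_results` results."""
    return math.ceil(total_results / page_size)


def maximum_sleep_seconds(total_results: int, page_size: int, delay_seconds: float) -> float:
    """Upper bound on time spent sleeping between requests (none before the first)."""
    return max(requests_needed(total_results, page_size) - 1, 0) * delay_seconds


# 5000 results with the big, slow client above: 5 requests, up to 4 waits of 10 s.
print(requests_needed(5000, 1000))              # 5
print(maximum_sleep_seconds(5000, 1000, 10.0))  # 40.0
# The same 5000 results with the default client settings.
print(requests_needed(5000, 100))               # 50
print(maximum_sleep_seconds(5000, 100, 3.0))    # 147.0
```

The actual sleep time can be shorter, because the client only sleeps for whatever remains of the delay when the next page is requested.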

Logging

To inspect this package's network behavior and API logic, configure a DEBUG-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
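
`logging.basicConfig` configures the root logger, so the session above also prints records from `urllib3`. If you only want this package's records, one option is to configure its logger directly. The sketch below assumes the package logs under the `arxiv` name (the source creates its logger with `logging.getLogger(__name__)`); the handler and format string are illustrative choices.

```python
import logging

# Assumption: the package's loggers live under the "arxiv" name.
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.DEBUG)

# Attach a dedicated handler so this logger's records are printed
# without changing the root logger's configuration.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
arxiv_logger.addHandler(handler)

# Child loggers such as "arxiv.arxiv" inherit the DEBUG level from "arxiv".
print(logging.getLogger("arxiv.arxiv").getEffectiveLevel() == logging.DEBUG)  # True
```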

Types

Client

A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.

Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
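
The rate limit works by remembering when the client last made a request and sleeping for whatever remains of the delay. The sketch below isolates that calculation as a pure function so it can be checked without network access; `seconds_to_sleep` is a hypothetical name, and the logic is modeled on the source listing further down this page rather than taken from the public API.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(last_request: Optional[datetime], now: datetime, delay_seconds: float) -> float:
    """Seconds to wait so that `delay_seconds` elapse between consecutive requests."""
    if last_request is None:
        return 0.0  # No previous request: nothing to wait for.
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed >= required:
        return 0.0  # Enough time has already passed.
    return (required - elapsed).total_seconds()


start = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_sleep(None, start, 3.0))                          # 0.0
print(seconds_to_sleep(start, start + timedelta(seconds=1), 3.0))  # 2.0
print(seconds_to_sleep(start, start + timedelta(seconds=5), 3.0))  # 0.0
```

Because the timestamp is stored on the client, the limit only holds across calls that share one client instance.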

Search

A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
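
A search is ultimately turned into query-string parameters for the arXiv API, which is why query strings should be passed unencoded. The following sketch shows roughly how such a URL can be assembled with `urllib.parse.urlencode`; `format_url` and `QUERY_URL_FORMAT` are hypothetical names modeled on the source listing further down this page.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def format_url(query: str, id_list: List[str], start: int, page_size: int,
               sort_by: str = "relevance", sort_order: str = "descending") -> str:
    """Build a query URL for one page of results; urlencode handles escaping."""
    url_args: Dict[str, object] = {
        "search_query": query,
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    return QUERY_URL_FORMAT.format(urlencode(url_args))


url = format_url("au:del_maestro AND ti:checkerboard", [], start=0, page_size=100)
print(url)

# Decoding the query string recovers the original, unencoded query.
decoded = parse_qs(urlparse(url).query, keep_blank_values=True)
print(decoded["search_query"])  # ['au:del_maestro AND ti:checkerboard']
```

Passing an already-encoded query such as `au:del_maestro+AND+ti:checkerboard` would be escaped a second time, so the API would see literal plus signs.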

Result

The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.

Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.
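
Default download filenames are built from a result's short ID and title. The sketch below mirrors that logic as standalone functions so it can be tried without fetching anything; the function names and the sample title are illustrative assumptions, while the two entry URLs come from the `get_short_id` docstring in the source listing.

```python
import re


def short_id(entry_id: str) -> str:
    """Everything after 'arxiv.org/abs/' in the entry URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """<short id>.<sanitized title>.<extension>, usable as a file name."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join([
        short_id(entry_id).replace("/", "_"),   # legacy IDs contain a slash
        re.sub(r"[^\w]", "_", nonempty_title),  # non-word characters become "_"
        extension,
    ])


print(short_id("https://arxiv.org/abs/2107.05580v1"))        # 2107.05580v1
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))  # quant-ph/0201082v1
print(default_filename("https://arxiv.org/abs/quant-ph/0201082v1", "A hypothetical title"))
# quant-ph_0201082v1.A_hypothetical_title.pdf
```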

""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode, urlparse
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List, Optional

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Returns a default filename for this result with the given extension.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default is generated by
        `_get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default is generated by
        `_get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        # Bodge: construct the source URL from the PDF URL.
        src_url = pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(src_url, path)
        return written_path

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, pass it to `Client.results`. `Search.results`, which
    uses a default client, is deprecated.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs a request URL for `search` that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.2.0"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status code of the response."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o):
    """A helper function for use in __repr__ methods, e.g. "arxiv.Result.Link"."""
    return "arxiv.{}".format(o.__class__.__qualname__)
logger = <Logger arxiv (WARNING)>
class Result:
class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default is generated by
        `self._get_default_filename()`. Returns the path of the written file.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default is generated by
        `self._get_default_filename("tar.gz")`. Returns the path of the
        written file.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        # Bodge: construct the source URL from the PDF URL.
        src_url = pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(src_url, path)
        return written_path

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))

An entry in an arXiv query results feed.

See the arXiv API User's Manual: Details of Atom Results Returned.

Result( entry_id: str, updated: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), published: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), title: str = '', authors: List[Result.Author] = [], summary: str = '', comment: str = '', journal_ref: str = '', doi: str = '', primary_category: str = '', categories: List[str] = [], links: List[Result.Link] = [], _raw: feedparser.util.FeedParserDict = None)

Constructs an arXiv search result item.

In most cases, prefer using Result._from_feed_entry to parsing and constructing Results yourself.

entry_id: str

A url of the form https://arxiv.org/abs/{id}.

updated: datetime.datetime

When the result was last updated.

published: datetime.datetime

When the result was originally published.
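
Both `updated` and `published` are time-zone-aware UTC datetimes. The following is a minimal standard-library sketch of the conversion that `Result._to_datetime` applies to feedparser's parsed timestamps; the function name `to_utc_datetime` is illustrative and not part of the package.

```python
import time
from calendar import timegm
from datetime import datetime, timezone


def to_utc_datetime(ts: time.struct_time) -> datetime:
    """Interpret a UTC struct_time as a time-zone-aware datetime."""
    # timegm reads the struct_time as UTC; time.mktime would assume local time.
    return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)


epoch = to_utc_datetime(time.gmtime(0))
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00
```

Because the values carry a time zone, compare them against other aware datetimes rather than naive ones.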

title: str

The title of the result.

authors: List[Result.Author]

The result's authors.

summary: str

The result abstract.

comment: Optional[str]

The authors' comment if present.

journal_ref: Optional[str]

A journal reference if present.

doi: Optional[str]

A URL for the resolved DOI to an external resource if present.

primary_category: str

The result's primary arXiv category. See arXiv: Category Taxonomy.

categories: List[str]

All of the result's categories. See arXiv: Category Taxonomy.

pdf_url: Optional[str]

The URL of a PDF version of this result if present among links.
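
As a rough sketch of how `pdf_url` is derived, the snippet below mirrors the selection logic of `Result._get_pdf_url` with a hypothetical stand-in `Link` dataclass (only `href` and `title` are modeled) and made-up example URLs. It omits the warning the package logs when several PDF links are present.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    """Hypothetical stand-in for arxiv.Result.Link; only the fields used here."""

    href: str
    title: Optional[str] = None


def find_pdf_url(links: List[Link]) -> Optional[str]:
    """Return the href of the first link titled "pdf", or None if there is none."""
    pdf_urls = [link.href for link in links if link.title == "pdf"]
    return pdf_urls[0] if pdf_urls else None


links = [
    Link("https://arxiv.org/abs/1605.08386v1"),
    Link("https://arxiv.org/pdf/1605.08386v1", title="pdf"),
]
print(find_pdf_url(links))      # https://arxiv.org/pdf/1605.08386v1
print(find_pdf_url(links[:1]))  # None
```

Since `pdf_url` may be `None`, it is worth checking it before attempting a download.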

def get_short_id(self) -> str:

Returns the short ID for this result.

  • If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns 2107.05580v1.

  • If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).

For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
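
The two cases above can be reproduced without network access. This sketch applies the same string split as `get_short_id` to a bare URL; the free function `short_id` is illustrative only.

```python
def short_id(entry_id: str) -> str:
    """Return everything after "arxiv.org/abs/" in an entry URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


print(short_id("https://arxiv.org/abs/2107.05580v1"))        # 2107.05580v1
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))  # quant-ph/0201082v1
```

A string that does not contain `arxiv.org/abs/` is returned unchanged, because the split then yields a single element.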

def download_pdf( self, dirpath: str = './', filename: str = '', download_domain: str = 'export.arxiv.org') -> str:

Downloads the PDF for this result to the specified directory.

If filename is empty, a default filename is generated by _get_default_filename(). Returns the path of the written file.
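
To show what path and URL a call would use, here is a small sketch that mirrors the private helpers `_get_default_filename` and `_substitute_domain` as free functions. The function names, the sample title, and the sample URL are assumptions for illustration; nothing is downloaded.

```python
import os
import re
from urllib.parse import urlparse


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build "<short id>.<sanitized title>.<extension>"."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),             # legacy IDs contain a slash
            re.sub(r"[^\w]", "_", nonempty_title),  # non-word characters become "_"
            extension,
        ]
    )


def substitute_domain(url: str, domain: str) -> str:
    """Swap the host of `url` for `domain`, keeping scheme and path."""
    return urlparse(url)._replace(netloc=domain).geturl()


name = default_filename("quant-ph/0201082v1", "A sample title")
print(name)  # quant-ph_0201082v1.A_sample_title.pdf
print(os.path.join("./mydir", name))
print(substitute_domain("https://arxiv.org/pdf/2107.05580v1", "export.arxiv.org"))
# https://export.arxiv.org/pdf/2107.05580v1
```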

def download_source( self, dirpath: str = './', filename: str = '', download_domain: str = 'export.arxiv.org') -> str:

Downloads the source tarfile for this result to the specified directory.

If filename is empty, a default filename is generated by _get_default_filename("tar.gz"). Returns the path of the written file.
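
The source URL is not taken from the feed; the method rewrites the PDF URL. A minimal sketch of that rewrite follows. The function name and the explicit `None` check are additions for illustration and are not part of the package.

```python
from typing import Optional


def source_url_from_pdf_url(pdf_url: Optional[str]) -> str:
    """Derive the source-archive URL by swapping the "/pdf/" path segment."""
    if pdf_url is None:
        # Added guard (an assumption): a result without a PDF link has no
        # URL to rewrite.
        raise ValueError("result has no PDF link")
    return pdf_url.replace("/pdf/", "/src/")


print(source_url_from_pdf_url("https://export.arxiv.org/pdf/2107.05580v1"))
# https://export.arxiv.org/src/2107.05580v1
```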

class Result.Author:

A light inner class for representing a result's authors.

Result.Author(name: str)

Constructs an Author with the specified name.

In most cases, prefer using Author._from_feed_author to parsing and constructing Authors yourself.

name: str

The author's name.

class Result.MissingFieldError(builtins.Exception):

An error indicating an entry is unparseable because it lacks required fields.

Result.MissingFieldError(missing_field)
missing_field: str

The required field missing from the would-be entry.

message: str

Message describing what caused this error.

Inherited Members
builtins.BaseException
with_traceback
args
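
`Client` catches this error and skips the offending entry rather than aborting the whole iteration. The sketch below imitates that pattern with plain dictionaries in place of feedparser entries; `parse_entry`, `parse_all`, and the local `MissingFieldError` are stand-ins written for this example.

```python
import logging
from typing import Dict, Iterable, Iterator

logger = logging.getLogger("example")


class MissingFieldError(Exception):
    """Stand-in mirroring Result.MissingFieldError."""

    def __init__(self, missing_field: str):
        self.missing_field = missing_field
        self.message = "Entry from arXiv missing required info"

    def __repr__(self) -> str:
        return "MissingFieldError({!r})".format(self.missing_field)


def parse_entry(entry: Dict[str, str]) -> str:
    """Hypothetical parser: requires an "id" key, as _from_feed_entry does."""
    if "id" not in entry:
        raise MissingFieldError("id")
    return entry["id"]


def parse_all(entries: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield parsed entries, logging and skipping the unparseable ones."""
    for entry in entries:
        try:
            yield parse_entry(entry)
        except MissingFieldError as e:
            logger.warning("Skipping partial result: %r", e)


print(list(parse_all([{"id": "a"}, {"title": "no id"}, {"id": "b"}])))  # ['a', 'b']
```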
class SortCriterion(enum.Enum):
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

A SortCriterion identifies a property by which search results can be sorted.

See the arXiv API User's Manual: sort order for return results.

Relevance = <SortCriterion.Relevance: 'relevance'>
LastUpdatedDate = <SortCriterion.LastUpdatedDate: 'lastUpdatedDate'>
SubmittedDate = <SortCriterion.SubmittedDate: 'submittedDate'>
Inherited Members
enum.Enum
name
value
class SortOrder(enum.Enum):
class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"

A SortOrder indicates order in which search results are sorted according to the specified SortCriterion.

See the arXiv API User's Manual: sort order for return results.

Ascending = <SortOrder.Ascending: 'ascending'>
Descending = <SortOrder.Descending: 'descending'>
Inherited Members
enum.Enum
name
value
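
The enum values match the strings the arXiv API accepts for its `sortBy` and `sortOrder` query parameters. As a self-contained sketch, the enums are redefined locally below and encoded into a query string. How `Search` assembles its arguments is not shown in this section, so the helper `sort_args` is an assumption.

```python
from enum import Enum
from typing import Dict
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def sort_args(sort_by: SortCriterion, sort_order: SortOrder) -> Dict[str, str]:
    """Map the enums onto the API's sortBy and sortOrder parameters."""
    return {"sortBy": sort_by.value, "sortOrder": sort_order.value}


args = {"search_query": "quantum"}
args.update(sort_args(SortCriterion.SubmittedDate, SortOrder.Descending))
print(urlencode(args))
# search_query=quantum&sortBy=submittedDate&sortOrder=descending
```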
class Client:
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.2.0"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed

Specifies a strategy for fetching results from arXiv's API.

This class hides pagination and retry logic from the caller, and exposes Client.results.

Client(page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3)
    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

Constructs an arXiv API client with the specified options.

Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

query_url_format = 'https://export.arxiv.org/api/query?{}'

The arXiv query API endpoint format.

page_size: int

Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.

The API's limit is 2000 results per page.
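The trade-off between page size and round-trips can be estimated with a little arithmetic. The following is a rough sketch, not part of the package: the helper names `estimate_requests` and `minimum_delay_seconds` are hypothetical, and the estimate assumes every page comes back full and no request is retried.

```python
import math


def estimate_requests(max_results: int, page_size: int = 100) -> int:
    """Estimate how many API requests are needed to fetch `max_results` results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(max_results / page_size)


def minimum_delay_seconds(
    max_results: int, page_size: int = 100, delay_seconds: float = 3.0
) -> float:
    """Estimate the total time spent waiting between requests (no wait before the first)."""
    requests_needed = estimate_requests(max_results, page_size)
    return max(requests_needed - 1, 0) * delay_seconds


print(estimate_requests(1000, page_size=100))       # 10
print(minimum_delay_seconds(1000, page_size=100))   # 27.0
print(minimum_delay_seconds(1000, page_size=2000))  # 0.0
```

Under these assumptions, fetching 1,000 results with the default page size spends at least 27 seconds waiting between requests, while a single 2,000-result page needs no waiting at all but puts the whole transfer in one response.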

delay_seconds: float

Number of seconds to wait between API requests.

arXiv's Terms of Use ask that you "make no more than one request every three seconds."
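The delay is measured from the time of the previous request, so only the remaining part of the interval is slept. Below is a minimal standalone sketch of that calculation, modeled on the source listing above; the function name `seconds_to_sleep` and the explicit `now` argument are illustrative assumptions, not part of the package.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime],
    now: datetime,
    delay_seconds: float = 3.0,
) -> float:
    """Return how long to wait before the next request may be sent."""
    if last_request is None:
        # No request has been made yet, so there is nothing to wait for.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed >= required:
        return 0.0
    return (required - elapsed).total_seconds()


start = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_sleep(None, start))                            # 0.0
print(seconds_to_sleep(start, start + timedelta(seconds=1)))    # 2.0
print(seconds_to_sleep(start, start + timedelta(seconds=5)))    # 0.0
```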

num_retries: int

Number of times to retry a failing API request before raising an Exception.
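Because the initial request is try 0 and retries are counted separately, `num_retries = 3` allows up to four requests for the same page. The sketch below mirrors the recursive retry pattern from the source listing in isolation; `FetchFailed`, `fetch_with_retries`, and `always_fails` are hypothetical names used only for illustration.

```python
from typing import Callable, List


class FetchFailed(Exception):
    """Hypothetical stand-in for a failed page request."""


def fetch_with_retries(
    fetch: Callable[[int], str], num_retries: int = 3, try_index: int = 0
) -> str:
    """Call `fetch`, retrying up to `num_retries` times after the initial try."""
    try:
        return fetch(try_index)
    except FetchFailed:
        if try_index < num_retries:
            return fetch_with_retries(fetch, num_retries, try_index + 1)
        # Out of retries: re-raise the most recent failure.
        raise


attempts: List[int] = []


def always_fails(try_index: int) -> str:
    attempts.append(try_index)
    raise FetchFailed(f"try {try_index} failed")


try:
    fetch_with_retries(always_fails, num_retries=3)
except FetchFailed as err:
    print(f"gave up after {len(attempts)} attempts: {err}")
    # gave up after 4 attempts: try 3 failed
```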

def results(self, search: Search, offset: int = 0) -> Generator[Result, NoneType, NoneType]:
    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.

If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.

Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.

For more on using generators, see Generators.
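The interaction between `offset` and `max_results` can be illustrated without touching the network. The sketch below reuses the limit calculation from the source listing, but replaces the paginated fetch with a hypothetical `fake_results` generator that yields integer record indices; the `total` parameter is likewise an assumption made for the example.

```python
import itertools
from typing import Iterator, Optional


def fake_results(total: int, offset: int) -> Iterator[int]:
    """Hypothetical stand-in for the paginated fetch: yields record indices from `offset`."""
    yield from range(offset, total)


def results(total: int, max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Apply the same offset/limit arithmetic as the listing above."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())
    return itertools.islice(fake_results(total, offset), limit)


print(list(results(total=20, max_results=10)))             # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(results(total=20, max_results=10, offset=7)))   # [7, 8, 9]
print(list(results(total=20, max_results=10, offset=10)))  # []
print(list(results(total=20, max_results=10, offset=15)))  # []
```

In this sketch, `offset` counts against `max_results` rather than shifting the window: with `max_results=10` and `offset=7`, only three records are yielded.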

class ArxivError(builtins.Exception):
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)

This package's base Exception class.

ArxivError(url: str, retry: int, message: str)
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

Constructs an ArxivError encountered while fetching the specified URL.

url: str

The feed URL that could not be fetched.

retry: int

The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.

message: str

Message describing what caused this error.

Inherited Members
builtins.BaseException: with_traceback, args

class UnexpectedEmptyPageError(ArxivError):
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )

An error raised when a page of results that should be non-empty is empty.

This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.

See Client.results for usage.

UnexpectedEmptyPageError(url: str, retry: int, raw_feed: feedparser.util.FeedParserDict)
    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.

raw_feed: feedparser.util.FeedParserDict

The raw output of feedparser.parse. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError: retry, message
builtins.BaseException: with_traceback, args

class HTTPError(ArxivError):
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )

A non-200 status encountered while fetching a page of results.

See Client.results for usage.

HTTPError(url: str, retry: int, status: int)
    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

Constructs an HTTPError for the specified status code, encountered for the specified API URL after retry tries.

status: int

The HTTP status code returned for the failed page request.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError: retry, message
builtins.BaseException: with_traceback, args
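When logging a failed fetch, the attributes documented above (`retry`, `status`, `raw_feed`) can be folded into a single diagnostic line. The following is one possible sketch: `describe_fetch_error` and `FakeHTTPError` are hypothetical names, and the helper relies on duck typing so that it runs without the package installed. In real code you would catch the package's own exception classes around the call that consumes `Client.results`.

```python
def describe_fetch_error(err: Exception) -> str:
    """Build a one-line summary from an exception with ArxivError-like attributes."""
    parts = [str(err)]
    retry = getattr(err, "retry", None)
    if retry is not None:
        parts.append(f"try={retry}")
    status = getattr(err, "status", None)
    if status is not None:
        parts.append(f"status={status}")
    raw_feed = getattr(err, "raw_feed", None)
    if raw_feed is not None and "bozo_exception" in raw_feed:
        parts.append(f"bozo={raw_feed['bozo_exception']!r}")
    return "; ".join(parts)


class FakeHTTPError(Exception):
    """Hypothetical stand-in that mimics the attributes of HTTPError described above."""

    def __init__(self, url: str, retry: int, status: int):
        self.url = url
        self.retry = retry
        self.status = status
        super().__init__("Page request resulted in HTTP {}".format(status))

    def __str__(self) -> str:
        return "{} ({})".format(self.args[0], self.url)


print(describe_fetch_error(FakeHTTPError("https://example.org/feed", 2, 503)))
# Page request resulted in HTTP 503 (https://example.org/feed); try=2; status=503
```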