arxiv

arxiv.py


Python wrapper for the arXiv API.

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Installation

$ pip install arxiv

In your Python script, include the line

import arxiv

Examples

Fetching results

import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
  print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
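Because `Client.results` returns a generator, results can be consumed lazily. The sketch below is a minimal illustration that runs offline: `fake_results` and `take` are hypothetical names, not part of the arxiv package, and `fake_results` merely stands in for `client.results(search)`. It shows how `itertools.islice` reads only the first few items and leaves the rest unread; the same pattern should apply to the real generator.

```python
import itertools
from typing import Iterator, List


def fake_results() -> Iterator[str]:
    """Hypothetical stand-in for client.results(search); yields titles lazily."""
    for i in itertools.count(1):
        yield f"Paper {i}"


def take(results: Iterator[str], n: int) -> List[str]:
    """Consume at most n items from a generator, leaving the rest unread."""
    return list(itertools.islice(results, n))


first_three = take(fake_results(), 3)
print(first_three)
```

With a real client, only the pages needed to produce those items would be requested, since each page is fetched as the generator advances.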

Downloading papers

To download a PDF of the paper with ID "1605.08386v1," run a Search and then use Result.download_pdf():

import arxiv

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")

The same interface is available for downloading .tar.gz files of the paper source:

import arxiv

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")
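Judging from the source on this page, `download_pdf` and `download_source` pass `os.path.join(dirpath, filename)` straight to `urlretrieve`, so the target directory probably has to exist before the call. The sketch below creates it first; `prepare_download_dir` is a hypothetical helper, not part of the arxiv package, and the demonstration uses a temporary directory so it runs without network access.

```python
import os
import tempfile


def prepare_download_dir(dirpath: str) -> str:
    """Create dirpath (and any parents) if missing, then return it unchanged.

    Hypothetical helper, not part of the arxiv package.
    """
    os.makedirs(dirpath, exist_ok=True)
    return dirpath


# Demonstrate against a temporary directory so the sketch runs offline.
base = tempfile.mkdtemp()
target = prepare_download_dir(os.path.join(base, "mydir"))
print(os.path.isdir(target))

# With a fetched result, the call would then look like:
# paper.download_pdf(dirpath=target, filename="downloaded-paper.pdf")
```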

Fetching results with a custom client

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10.0,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)
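To get a feel for the trade-off between `page_size` and `delay_seconds`, the following back-of-the-envelope sketch estimates how many requests a fetch needs and how long the enforced delays add up to. Both function names are hypothetical, and the estimate assumes a fresh client (no wait before the first request) and no retries.

```python
import math


def estimate_requests(total_results: int, page_size: int) -> int:
    """Number of page requests needed to fetch total_results items."""
    if total_results <= 0:
        return 0
    return math.ceil(total_results / page_size)


def estimate_min_wait(total_results: int, page_size: int, delay_seconds: float) -> float:
    """Lower bound on seconds spent waiting between requests.

    The delay applies between consecutive requests, so the first request
    incurs no wait. Retries and network time are ignored.
    """
    requests_needed = estimate_requests(total_results, page_size)
    return max(requests_needed - 1, 0) * delay_seconds


print(estimate_requests(2500, 1000))        # 3 requests
print(estimate_min_wait(2500, 1000, 10.0))  # 20.0 seconds of enforced delay
print(estimate_min_wait(2500, 100, 3.0))    # 72.0 seconds with the default settings
```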

Logging

To inspect this package's network behavior and API logic, configure a DEBUG-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
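If the `urllib3` lines are unwanted, one option is to raise verbosity only for this package's logger. The sketch below assumes the logger name is `"arxiv"` or a child of it, which matches the `arxiv.arxiv` prefix in the output above and the `logging.getLogger(__name__)` call in the source; it uses only the standard `logging` module.

```python
import logging

# Leave the root logger at WARNING so third-party libraries stay quiet.
logging.basicConfig(level=logging.WARNING)

# Raise verbosity only for the "arxiv" logger hierarchy (assumed name).
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.DEBUG)

# Child loggers such as "arxiv.arxiv" inherit the DEBUG level.
child_level = logging.getLogger("arxiv.arxiv").getEffectiveLevel()
print(child_level == logging.DEBUG)
```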

Types

Client

A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.

Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.

Search

A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
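As a rough illustration of what a search turns into on the wire, the sketch below rebuilds the request URL from the same parameters that appear in the logged request above. `build_query_url` is a hypothetical helper written for this example; it mirrors the parameter names seen in the source (`search_query`, `id_list`, `sortBy`, `sortOrder`, `start`, `max_results`) but is not the package's own function.

```python
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(query="", id_list=(), sort_by="relevance",
                    sort_order="descending", start=0, page_size=100):
    """Hypothetical helper showing how a search maps to URL parameters."""
    url_args = {
        "search_query": query,
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    # urlencode percent-encodes values, so queries are passed unencoded.
    return QUERY_URL_FORMAT.format(urlencode(url_args))


by_id_url = build_query_url(id_list=["1605.08386v1"])
by_query_url = build_query_url(query="au:del_maestro AND ti:checkerboard")
print(by_id_url)
print(by_query_url)
```

This is also why the query string should be given unencoded: spaces and colons are escaped when the URL is built.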

Result

The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.

Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.
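The default download filename is derived from the result's short ID and title. The standalone sketch below re-creates that derivation with plain functions so it can be tried without network access; `short_id` and `default_filename` are hypothetical names modeled on the logic in the source, and the title used is made up for illustration.

```python
import re


def short_id(entry_id: str) -> str:
    """Strip the abs-page URL prefix, leaving the arXiv identifier."""
    return entry_id.split("arxiv.org/abs/")[-1]


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Join a filesystem-safe ID, a sanitized title, and an extension."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join([
        short_id(entry_id).replace("/", "_"),
        re.sub(r"[^\w]", "_", nonempty_title),
        extension,
    ])


legacy_id = short_id("https://arxiv.org/abs/quant-ph/0201082v1")
filename = default_filename("https://arxiv.org/abs/2107.05580v1", "A Title: Part 2")
print(legacy_id)  # quant-ph/0201082v1
print(filename)   # 2107.05580v1.A_Title__Part_2.pdf
```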

""".. include:: ../README.md"""
from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str
    """The authors' comment if present."""
    journal_ref: str
    """A journal reference if present."""
    doi: str
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: str
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Builds a default filename for this result with the given extension,
        of the form `{short_id}.{sanitized_title}.{extension}`.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the PDF for this result to the specified directory and
        returns the path written.

        If `filename` is empty, a default filename is generated with
        `_get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path

    def download_source(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory and returns the path written.

        If `filename` is empty, a default filename is generated with
        `_get_default_filename("tar.gz")`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, pass it to `Client.results`. `Search.results` runs the
    search with a default client, but is deprecated.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, " "sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class hides pagination and retry logic behind a single method,
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs an API request URL for `search` that returns up to
        `page_size` results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Helper for `_parse_feed`, which retries recursively. Enforces
        `self.delay_seconds`: if that number of seconds has not passed since
        the last request was made, sleeps until it has.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.1.0"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status code of the response."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


764def _classname(o):
765    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
766    return "arxiv.{}".format(o.__class__.__qualname__)
class Result:
class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A URL of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result's abstract."""
    comment: str
    """The authors' comment if present."""
    journal_ref: str
    """A journal reference if present."""
    doi: str
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: str
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Builds a default filename for this result with the given extension,
        from its short ID and title.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the PDF for this result to the specified directory and
        returns the path of the written file.

        If `filename` is empty, a default filename is generated by
        `Result._get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path

    def download_source(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory and returns the path of the written file.

        If `filename` is empty, a default filename with a `tar.gz` extension
        is generated by `Result._get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))

An entry in an arXiv query results feed.

See the arXiv API User's Manual: Details of Atom Results Returned.

Result( entry_id: str, updated: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), published: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), title: str = '', authors: List[arxiv.Result.Author] = [], summary: str = '', comment: str = '', journal_ref: str = '', doi: str = '', primary_category: str = '', categories: List[str] = [], links: List[arxiv.Result.Link] = [], _raw: feedparser.util.FeedParserDict = None)

Constructs an arXiv search result item.

In most cases, prefer using Result._from_feed_entry to parsing and constructing Results yourself.

entry_id: str

A URL of the form https://arxiv.org/abs/{id}.

updated: datetime.datetime

When the result was last updated.

published: datetime.datetime

When the result was originally published.

title: str

The title of the result.

authors: List[arxiv.Result.Author]

The result's authors.

summary: str

The result's abstract.

comment: str

The authors' comment if present.

journal_ref: str

A journal reference if present.

doi: str

A URL for the resolved DOI to an external resource if present.

primary_category: str

The result's primary arXiv category. See arXiv: Category Taxonomy.

categories: List[str]

All of the result's categories. See arXiv: Category Taxonomy.

pdf_url: str

The URL of a PDF version of this result if present among links.
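
As a rough illustration of how `pdf_url` is derived from `links`, the sketch below mirrors the selection logic in `Result._get_pdf_url`: take the first link whose title is "pdf", or `None` when there is no such link. The `Link` tuple and `pick_pdf_url` are hypothetical stand-ins written for this example, not part of the package, and the URLs are illustrative.

```python
import logging
from typing import List, NamedTuple, Optional

logger = logging.getLogger(__name__)


class Link(NamedTuple):
    """Hypothetical stand-in for arxiv.Result.Link with only the fields used here."""

    href: str
    title: Optional[str] = None


def pick_pdf_url(links: List[Link]) -> Optional[str]:
    """Return the href of the first link titled "pdf", or None if there is none."""
    pdf_urls = [link.href for link in links if link.title == "pdf"]
    if not pdf_urls:
        return None
    if len(pdf_urls) > 1:
        logger.warning("Multiple PDF links; using %s", pdf_urls[0])
    return pdf_urls[0]


links = [
    Link("https://arxiv.org/abs/1605.08386v1"),
    Link("https://arxiv.org/pdf/1605.08386v1", title="pdf"),
]
print(pick_pdf_url(links))      # the link titled "pdf"
print(pick_pdf_url(links[:1]))  # None: no PDF link present
```

Because the value can be `None`, callers that go on to download a file may want to check `pdf_url` first.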

def get_short_id(self) -> str:

Returns the short ID for this result.

  • If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".

  • If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).

For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
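
The two cases above can be reproduced without the package. This is a minimal standalone sketch of the same string split that `get_short_id` performs; `short_id` is a hypothetical free function used only for illustration.

```python
def short_id(entry_id: str) -> str:
    """Return everything after "arxiv.org/abs/" in an entry URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


# Current identifier format.
print(short_id("https://arxiv.org/abs/2107.05580v1"))        # 2107.05580v1
# Legacy (pre-March 2007) identifier format keeps its archive prefix.
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))  # quant-ph/0201082v1
```

A string that does not contain the `arxiv.org/abs/` prefix is returned unchanged, since the split then yields a single element.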

def download_pdf(self, dirpath: str = './', filename: str = '') -> str:

Downloads the PDF for this result to the specified directory.

If filename is empty, a default filename is generated by Result._get_default_filename. Returns the path of the written file.
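
To show what a default filename looks like, here is a standalone sketch that follows the logic of `Result._get_default_filename`: the short ID with `/` replaced by `_`, then the title with every non-word character replaced by `_`, then the extension. `default_filename` is a hypothetical free function, and the title is made up for the example.

```python
import re


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Join a filesystem-safe short ID, a sanitized title, and an extension."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),
            re.sub(r"[^\w]", "_", nonempty_title),
            extension,
        ]
    )


print(default_filename("quant-ph/0201082v1", "A Title: With Spaces"))
# quant-ph_0201082v1.A_Title__With_Spaces.pdf
print(default_filename("2107.05580v1", ""))
# 2107.05580v1.UNTITLED.pdf
```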

def download_source(self, dirpath: str = './', filename: str = '') -> str:

Downloads the source tarfile for this result to the specified directory.

If filename is empty, a default filename with a tar.gz extension is generated by Result._get_default_filename. Returns the path of the written file.
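
The source URL is not read from the feed; the method derives it from the PDF URL by swapping one path segment. The sketch below isolates that step. `source_url_from_pdf_url` is a hypothetical helper, and the explicit `None` check is an addition for this example (the method itself calls `.replace` directly on `pdf_url`).

```python
from typing import Optional


def source_url_from_pdf_url(pdf_url: Optional[str]) -> str:
    """Derive a source-tarball URL by replacing "/pdf/" with "/src/"."""
    if pdf_url is None:
        # Assumption for this sketch: fail with a clear message when no PDF link exists.
        raise ValueError("result has no PDF URL to derive a source URL from")
    return pdf_url.replace("/pdf/", "/src/")


print(source_url_from_pdf_url("https://arxiv.org/pdf/1605.08386v1"))
# https://arxiv.org/src/1605.08386v1
```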

class Result.Author:

A light inner class for representing a result's authors.

Result.Author(name: str)

Constructs an Author with the specified name.

In most cases, prefer using Author._from_feed_author to parsing and constructing Authors yourself.

name: str

The author's name.

class Result.MissingFieldError(builtins.Exception):

An error indicating an entry is unparseable because it lacks required fields.

Result.MissingFieldError(missing_field)
missing_field: str

The required field missing from the would-be entry.

message: str

Message describing what caused this error.
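
In the client's page loop, an entry that raises this error is logged and skipped rather than aborting the whole iteration. The following sketch shows that pattern in isolation, under simplified assumptions: entries are plain dicts, and `MissingFieldError`, `parse_entry`, and `parse_all` are hypothetical stand-ins defined only for this example.

```python
from typing import Dict, Iterable, Iterator


class MissingFieldError(Exception):
    """Hypothetical stand-in for arxiv.Result.MissingFieldError."""

    def __init__(self, missing_field: str):
        super().__init__(missing_field)
        self.missing_field = missing_field


def parse_entry(entry: Dict[str, str]) -> str:
    """Return the entry's id, or raise if the required field is absent."""
    if "id" not in entry:
        raise MissingFieldError("id")
    return entry["id"]


def parse_all(entries: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield parsed entries, skipping any that lack a required field."""
    for entry in entries:
        try:
            yield parse_entry(entry)
        except MissingFieldError as err:
            print("Skipping partial result; missing field:", err.missing_field)


parsed = list(parse_all([{"id": "a"}, {"title": "no id"}, {"id": "b"}]))
print(parsed)  # ['a', 'b']
```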

Inherited Members
builtins.BaseException
with_traceback
class SortCriterion(enum.Enum):
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

A SortCriterion identifies a property by which search results can be sorted.

See the arXiv API User's Manual: sort order for return results.

Relevance = <SortCriterion.Relevance: 'relevance'>
LastUpdatedDate = <SortCriterion.LastUpdatedDate: 'lastUpdatedDate'>
SubmittedDate = <SortCriterion.SubmittedDate: 'submittedDate'>
Inherited Members
enum.Enum
name
value
class SortOrder(enum.Enum):
class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"

A SortOrder indicates the order in which search results are sorted according to the specified arxiv.SortCriterion.

See the arXiv API User's Manual: sort order for return results.

Ascending = <SortOrder.Ascending: 'ascending'>
Descending = <SortOrder.Descending: 'descending'>
Inherited Members
enum.Enum
name
value
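
The values of both enums are the literal strings the arXiv API accepts for its `sortBy` and `sortOrder` query parameters (names taken from the arXiv API User's Manual). The sketch below re-declares the two enums so it runs on its own and shows one plausible way to turn them into a query string. `sort_args` is a hypothetical helper; how the package assembles its query arguments is not shown in this section.

```python
from enum import Enum
from typing import Dict
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def sort_args(sort_by: SortCriterion, sort_order: SortOrder) -> Dict[str, str]:
    """Map the enums to arXiv API query parameters (hypothetical helper)."""
    return {"sortBy": sort_by.value, "sortOrder": sort_order.value}


query = urlencode(
    {
        "search_query": "quantum",
        **sort_args(SortCriterion.SubmittedDate, SortOrder.Descending),
    }
)
print(query)  # search_query=quantum&sortBy=submittedDate&sortOrder=descending
```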
class Client:
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs an API request URL for `search` that returns up to
        `page_size` results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Helper for the recursive `_parse_feed`; makes a single attempt.
        Enforces `self.delay_seconds`: if fewer than that many seconds have
        passed since the last request, sleeps for the remainder first.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.1.0"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed

Specifies a strategy for fetching results from arXiv's API.

This class obscures pagination and retry logic, and exposes Client.results.
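
The pagination that the client hides can be sketched in a few lines. The loop below follows the shape of `Client._results`: fetch a page, yield its entries, advance the offset by the number of entries received, and stop once the offset reaches the total reported by the first page. `paginate` and `fake_fetch` are hypothetical names, and the in-memory list stands in for the remote API.

```python
from typing import Callable, Iterator, List, Tuple

# (entries on this page, total number of results reported by the server)
Page = Tuple[List[str], int]


def paginate(
    fetch_page: Callable[[int, int], Page], page_size: int, offset: int = 0
) -> Iterator[str]:
    """Yield entries page by page until the reported total is reached."""
    entries, total = fetch_page(offset, page_size)
    while entries:
        yield from entries
        offset += len(entries)
        if offset >= total:
            break
        entries, _ = fetch_page(offset, page_size)


DATA = ["paper-{}".format(i) for i in range(7)]
requested_offsets: List[int] = []


def fake_fetch(start: int, page_size: int) -> Page:
    """Serve slices of DATA and record which offsets were requested."""
    requested_offsets.append(start)
    return DATA[start : start + page_size], len(DATA)


fetched = list(paginate(fake_fetch, page_size=3))
print(fetched == DATA, requested_offsets)  # True [0, 3, 6]
```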

Client( page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3)

Constructs an arXiv API client with the specified options.

Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

query_url_format = 'https://export.arxiv.org/api/query?{}'

The arXiv query API endpoint format.
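
For a sense of what a formatted request URL looks like, this sketch fills the endpoint format the same way `Client._format_url` does: it adds `start` and `max_results` to the search's arguments and URL-encodes the result. `format_url` is a hypothetical free function, and the `url_args` dict stands in for whatever the search object supplies.

```python
from typing import Dict, Union
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def format_url(url_args: Dict[str, Union[str, int]], start: int, page_size: int) -> str:
    """Build a request URL for one page of results."""
    args = dict(url_args)  # copy so the caller's dict is not mutated
    args.update({"start": start, "max_results": page_size})
    return QUERY_URL_FORMAT.format(urlencode(args))


url = format_url({"search_query": "quantum"}, start=100, page_size=100)
print(url)
# https://export.arxiv.org/api/query?search_query=quantum&start=100&max_results=100
```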

page_size: int

Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.

The API's limit is 2000 results per page.
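
As a rough illustration of the page-size trade-off, the number of round-trips for a given result count can be estimated with a ceiling division. The helper name `estimate_requests` is hypothetical and not part of this package, and the sketch assumes every page except the last comes back full.

```python
import math

API_MAX_PAGE_SIZE = 2000  # per-page limit stated in the documentation above


def estimate_requests(max_results: int, page_size: int = 100) -> int:
    """Estimate how many API requests are needed to fetch max_results results.

    Hypothetical helper for illustration only; assumes every page but the
    last one is returned full.
    """
    if page_size < 1 or page_size > API_MAX_PAGE_SIZE:
        raise ValueError("page_size must be between 1 and 2000")
    return math.ceil(max_results / page_size)


print(estimate_requests(1000))        # 10 requests at the default page size
print(estimate_requests(1000, 2000))  # 1 request at the largest page size
print(estimate_requests(250, 100))    # 3 requests; the last page is partial
```

Fewer, larger pages reduce the number of delayed requests, at the cost of slower individual responses.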

delay_seconds: float

Number of seconds to wait between API requests.

arXiv's Terms of Use ask that you "make no more than one request every three seconds."
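
The source listings show the client recording `_last_request_dt` after each request. One way a delay like this could be enforced from such a timestamp is sketched below. The function `seconds_to_wait` is a hypothetical helper written for this example, and it is not necessarily how the client computes its own sleep time.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_wait(
    last_request: Optional[datetime], now: datetime, delay_seconds: float = 3.0
) -> float:
    """Return how long to sleep before the next request (hypothetical helper)."""
    if last_request is None:
        return 0.0  # no earlier request, so no wait is needed
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    # Never return a negative wait when the delay has already passed.
    return max(0.0, (required - elapsed).total_seconds())


start = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_wait(None, start))                          # 0.0
print(seconds_to_wait(start, start + timedelta(seconds=1)))  # 2.0
print(seconds_to_wait(start, start + timedelta(seconds=5)))  # 0.0
```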

num_retries: int

Number of times to retry a failing API request before raising an Exception.
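
A minimal sketch of what "retry before raising" can look like is shown here. `fetch_with_retries` and `flaky` are hypothetical names for this example, and the loop assumes `num_retries` counts retries after the initial try, so up to `num_retries + 1` requests are made in total.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[int], T], num_retries: int = 3) -> T:
    """Call fetch(try_index) until it succeeds or the retries are exhausted.

    Hypothetical helper: try_index is 0 for the initial try, 1 for the first
    retry, and so on. The last error is re-raised if every try fails.
    """
    if num_retries < 0:
        raise ValueError("num_retries must be non-negative")
    last_error = None
    for try_index in range(num_retries + 1):
        try:
            return fetch(try_index)
        except Exception as err:  # a real client would catch narrower types
            last_error = err
    raise last_error


def flaky(try_index: int) -> str:
    # Fails on the initial try and the first retry, then succeeds.
    if try_index < 2:
        raise RuntimeError("attempt {} failed".format(try_index))
    return "ok on try {}".format(try_index)


print(fetch_with_retries(flaky))  # ok on try 2
```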

def results(self, search: arxiv.Search, offset: int = 0) -> Generator[arxiv.Result, NoneType, NoneType]:
    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.

If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.

Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.

For more on using generators, see Generators.
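
The interaction between `offset` and `max_results` can be tried without network access. The sketch below reuses the limit arithmetic from the source listing, but replaces the paged fetch with a stand-in generator, `fake_results`, that is an assumption of this example and simply yields result indices.

```python
import itertools
from typing import Iterator, Optional


def fake_results(offset: int) -> Iterator[int]:
    """Stand-in for the paged fetch: yields result indices starting at offset."""
    return itertools.count(offset)


def results(max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    # Same limit arithmetic as Client.results in the listing above.
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())
    return itertools.islice(fake_results(offset), limit)


print(list(results(max_results=5)))            # [0, 1, 2, 3, 4]
print(list(results(max_results=5, offset=3)))  # [3, 4]
print(list(results(max_results=5, offset=5)))  # []
print(list(results(max_results=5, offset=9)))  # []
```

Note that the stand-in never runs out of results, so calling it with `max_results=None` would yield indefinitely; the real client stops when the search results are exhausted.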

class ArxivError(builtins.Exception):
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)

This package's base Exception class.

ArxivError(url: str, retry: int, message: str)
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

Constructs an ArxivError encountered while fetching the specified URL.

url: str

The feed URL that could not be fetched.

retry: int

The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.

message: str

Message describing what caused this error.

Inherited Members
builtins.BaseException
with_traceback
class UnexpectedEmptyPageError(ArxivError):
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )

An error raised when a page of results that should be non-empty is empty.

This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.

See Client.results for usage.

UnexpectedEmptyPageError(url: str, retry: int, raw_feed: feedparser.util.FeedParserDict)
    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.

raw_feed: feedparser.util.FeedParserDict

The raw output of feedparser.parse. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError
retry
message
builtins.BaseException
with_traceback
class HTTPError(ArxivError):
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status code of the failed response."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )

A non-200 status encountered while fetching a page of results.

See Client.results for usage.

HTTPError(url: str, retry: int, status: int)
    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

Constructs an HTTPError for the specified status code, encountered for the specified API URL after retry tries.

status: int

The HTTP status code of the failed response.
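
Because `HTTPError` derives from `ArxivError`, a single `except ArxivError` clause can handle both. The sketch below uses trimmed stand-in classes copied from the listings so that it runs without the package installed. The `describe` helper and the example URL are assumptions made for illustration.

```python
class ArxivError(Exception):
    """Stand-in mirroring the base class listed above."""

    def __init__(self, url: str, retry: int, message: str):
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class HTTPError(ArxivError):
    """Stand-in mirroring the subclass listed above."""

    def __init__(self, url: str, retry: int, status: int):
        self.status = status
        super().__init__(url, retry, "Page request resulted in HTTP {}".format(status))


def describe(err: ArxivError) -> str:
    """Hypothetical helper: summarize an error for a log line."""
    detail = "status {}".format(err.status) if isinstance(err, HTTPError) else "no status"
    return "try {}: {} [{}]".format(err.retry, err, detail)


try:
    # Illustrative URL only; the failing request is simulated.
    raise HTTPError("https://export.arxiv.org/api/query?search_query=quantum", 3, 503)
except ArxivError as err:  # the base class also catches HTTPError
    summary = describe(err)
    print(summary)
```

With the real package, the same pattern would wrap iteration over `Client.results`, since errors surface while the generator is being consumed.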

url

The feed URL that could not be fetched.

Inherited Members
ArxivError
retry
message
builtins.BaseException
with_traceback