arxiv

arxiv.py


Python wrapper for the arXiv API.

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Installation

$ pip install arxiv

In your Python script, include the line

import arxiv

Examples

Fetching results

import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
  print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
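Because the results come back as a generator, each one can be consumed only once. The sketch below shows this one-pass behavior without touching the network. `fake_results` is a hypothetical stand-in for `client.results(search)`, not part of this package.

```python
import itertools


def fake_results(n):
    """Hypothetical stand-in for client.results(search): lazily yields n titles."""
    for i in range(n):
        yield f"paper-{i}"


results = fake_results(5)

# islice pulls only the items it needs from the generator.
first_two = list(itertools.islice(results, 2))
# The same generator continues where islice stopped.
remaining = list(results)
# Once exhausted, a generator yields nothing more.
leftover = list(results)

print(first_two, remaining, leftover)
```

To iterate over the same results twice, either call `client.results(search)` again (which issues new requests) or keep the list.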

Fetching results with a custom client

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10.0,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)
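A rough way to reason about `page_size` and `delay_seconds` is to estimate how many page requests a result set needs and how long the client must wait between them. The helper below, `request_plan`, is a hypothetical name used only for illustration. It assumes every page comes back full and no request is retried, so treat its output as a lower bound.

```python
import math
from typing import Tuple


def request_plan(total_results: int, page_size: int, delay_seconds: float) -> Tuple[int, float]:
    """Return (page requests needed, minimum seconds spent waiting between them)."""
    # At least one request is always made, even for an empty result set.
    pages = max(1, math.ceil(total_results / page_size))
    # The delay applies between consecutive requests, not before the first one.
    min_wait = (pages - 1) * delay_seconds
    return pages, min_wait


big_slow = request_plan(2500, page_size=1000, delay_seconds=10.0)
default = request_plan(2500, page_size=100, delay_seconds=3.0)
print(big_slow, default)
```

Under these assumptions, 2500 results need 3 requests with the large-page client and 25 with the default settings.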

Logging

To inspect this package's network behavior and API logic, configure a DEBUG-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
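If the `urllib3` lines are noise, one option is to raise the level on this package's logger only, using the standard `logging` module. The sketch below assumes the logger is named `"arxiv"`; child loggers such as `arxiv.arxiv` inherit from it. The two log calls at the end are simulated so the example runs without network access.

```python
import io
import logging

# Capture output in memory; in a real script, omit the stream argument
# so the handler writes to stderr.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

# Assumption: the package logs under the "arxiv" logger name.
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.DEBUG)
arxiv_logger.addHandler(handler)
arxiv_logger.propagate = False  # keep messages away from root handlers

# Simulated messages standing in for a real request.
logging.getLogger("arxiv").info("Requesting page (first: %r, try: %d)", True, 0)
logging.getLogger("urllib3.connectionpool").debug("Starting new HTTPS connection")

captured = stream.getvalue()
print(captured)
```

Only the message sent through the `arxiv` logger reaches the handler; the `urllib3` message is dropped because that logger was left at its default level.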

Types

Client

A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.

Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
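The rate limit can be pictured as a small throttle that remembers when the last request happened and sleeps for the remainder of the delay. This is a minimal sketch of that idea, not the package's code: `Throttle` and `FakeClock` are hypothetical names, and the clock is injected so the example runs instantly.

```python
import time
from datetime import datetime, timedelta


class Throttle:
    """Hypothetical helper: enforce a minimum delay between calls to wait()."""

    def __init__(self, delay_seconds, now=datetime.now, sleep=time.sleep):
        self.delay = timedelta(seconds=delay_seconds)
        self._now = now
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            elapsed = self._now() - self._last
            if elapsed < self.delay:
                slept = (self.delay - elapsed).total_seconds()
                self._sleep(slept)
        self._last = self._now()
        return slept


class FakeClock:
    """Test double: sleeping just advances a stored timestamp."""

    def __init__(self):
        self.t = datetime(2024, 1, 1)

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += timedelta(seconds=seconds)


clock = FakeClock()
throttle = Throttle(3.0, now=clock.now, sleep=clock.sleep)

first = throttle.wait()   # no previous request, so no sleep
clock.sleep(1.0)          # one second of other work passes
second = throttle.wait()  # must sleep the remaining two seconds
clock.sleep(5.0)          # more than the delay passes
third = throttle.wait()   # no sleep needed

print(first, second, third)
```

Because the timestamp lives on the object, the delay is honored only when the same instance is reused, which is the reason to share one client across searches.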

Search

A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
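To see how search parameters turn into a request, the sketch below builds a query URL with `urllib.parse.urlencode`. `build_query_url` is a hypothetical helper written for this illustration; its parameter names and defaults follow the URL shown in the logging example.

```python
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(query="", id_list=(), sort_by="relevance",
                    sort_order="descending", start=0, page_size=100):
    """Hypothetical helper: encode search parameters into an API query URL."""
    args = {
        "search_query": query,           # pass the query unencoded
        "id_list": ",".join(id_list),    # comma-separated article IDs
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,                  # index of the first result on the page
        "max_results": page_size,        # results per page, not per search
    }
    return QUERY_URL_FORMAT.format(urlencode(args))


by_id = build_query_url(id_list=["1605.08386v1"])
by_query = build_query_url(query="au:del_maestro AND ti:checkerboard")
print(by_id)
print(by_query)
```

`urlencode` takes care of escaping, which is why the query string should be written as `au:del_maestro AND ti:checkerboard` rather than pre-encoded with `+` signs.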

Result

The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.

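Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source. The docstrings for both mark them as deprecated and recommend using Result.pdf_url and Result.source_url directly.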
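The ID and filename helpers on a Result are plain string manipulation, so their behavior can be sketched with standalone functions. The functions below are illustrative rewrites rather than the package's API, and the example PDF URL is an assumed format.

```python
import re


def short_id(entry_id: str) -> str:
    """Strip everything up to and including 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Join the short ID, a sanitized title, and the extension with dots."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join([
        short_id(entry_id).replace("/", "_"),    # legacy IDs contain a slash
        re.sub(r"[^\w]", "_", nonempty_title),   # non-word characters become "_"
        extension,
    ])


def source_url(pdf_url: str) -> str:
    """Derive a source URL by swapping the /pdf/ path segment for /src/."""
    return pdf_url.replace("/pdf/", "/src/")


new_style = short_id("https://arxiv.org/abs/2107.05580v1")
legacy = short_id("https://arxiv.org/abs/quant-ph/0201082v1")
pdf_name = default_filename("https://arxiv.org/abs/2107.05580v1", "A Title")
src_name = default_filename("https://arxiv.org/abs/quant-ph/0201082v1", "", "tar.gz")
# Assumed PDF URL shape, used only to demonstrate the substitution.
src = source_url("https://arxiv.org/pdf/2107.05580v1")

print(new_style, legacy, pdf_name, src_name, src)
```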

""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode, urlparse
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List, Optional

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url = Result._substitute_domain(self.source_url(), download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def source_url(self) -> str:
        """
        Derives a URL for the source tarfile for this result.
        """
        return self.pdf_url.replace("/pdf/", "/src/")

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Client.results` with a specific client. The
    deprecated `Search.results` runs the search with a default client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.1"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o):
    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
    return "arxiv.{}".format(o.__class__.__qualname__)
logger = <Logger arxiv (WARNING)>
class Result:
class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A URL of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default filename is generated by
        `self._get_default_filename()`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default filename is generated by
        `self._get_default_filename("tar.gz")`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url()` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url = Result._substitute_domain(self.source_url(), download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def source_url(self) -> str:
        """
        Derives a URL for the source tarfile for this result.
        """
        return self.pdf_url.replace("/pdf/", "/src/")

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))

An entry in an arXiv query results feed.

See the arXiv API User's Manual: Details of Atom Results Returned.

Result( entry_id: str, updated: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), published: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), title: str = '', authors: List[Result.Author] = [], summary: str = '', comment: str = '', journal_ref: str = '', doi: str = '', primary_category: str = '', categories: List[str] = [], links: List[Result.Link] = [], _raw: feedparser.util.FeedParserDict = None)

Constructs an arXiv search result item.

In most cases, prefer using Result._from_feed_entry to parsing and constructing Results yourself.

entry_id: str

A URL of the form https://arxiv.org/abs/{id}.

updated: datetime.datetime

When the result was last updated.

published: datetime.datetime

When the result was originally published.
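Both `updated` and `published` are time-zone-aware UTC datetimes. The following is a minimal sketch of the conversion that `Result._to_datetime` performs in the source listing, written as a standalone function; the name `to_utc_datetime` is illustrative and not part of the package.

```python
import time
from calendar import timegm
from datetime import datetime, timezone


def to_utc_datetime(ts: time.struct_time) -> datetime:
    """Interpret a UTC struct_time as a time-zone-aware datetime."""
    # timegm treats the struct_time as UTC (time.mktime would assume local
    # time), so the result does not depend on the host's time zone.
    return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)


# time.gmtime(0) is the Unix epoch expressed as a UTC struct_time.
epoch = to_utc_datetime(time.gmtime(0))
print(epoch.isoformat())
```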

title: str

The title of the result.

authors: List[Result.Author]

The result's authors.

summary: str

The result abstract.

comment: Optional[str]

The authors' comment if present.

journal_ref: Optional[str]

A journal reference if present.

doi: Optional[str]

A URL for the resolved DOI to an external resource if present.

primary_category: str

The result's primary arXiv category. See arXiv: Category Taxonomy.

categories: List[str]

All of the result's categories. See arXiv: Category Taxonomy.

pdf_url: Optional[str]

The URL of a PDF version of this result if present among links.
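As a rough illustration of how `pdf_url` is derived from `links`, the sketch below mirrors the selection rule in `Result._get_pdf_url` (pick the link whose title is `"pdf"`). The `Link` dataclass and the function name `find_pdf_url` are stand-ins invented for this example, not the package's own classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    """Illustrative stand-in for Result.Link (href and title only)."""
    href: str
    title: Optional[str] = None


def find_pdf_url(links: List[Link]) -> Optional[str]:
    """Return the href of the first link titled "pdf", or None if absent."""
    pdf_urls = [link.href for link in links if link.title == "pdf"]
    return pdf_urls[0] if pdf_urls else None


links = [
    Link("http://arxiv.org/abs/1605.08386v1"),
    Link("http://arxiv.org/pdf/1605.08386v1", title="pdf"),
]
print(find_pdf_url(links))      # the PDF link's href
print(find_pdf_url(links[:1]))  # no PDF link among these, so None
```

Because the result can be `None`, callers that go on to use `pdf_url` may want to check for that case first.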

def get_short_id(self) -> str:

Returns the short ID for this result.

  • If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".

  • If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).

For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
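The two cases above can be reproduced without the package. This sketch applies the same string split that `get_short_id` uses in the source listing; the free function `short_id` is a hypothetical name used only here.

```python
def short_id(entry_id: str) -> str:
    """Return everything after "arxiv.org/abs/" in an entry URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


current = short_id("https://arxiv.org/abs/2107.05580v1")
legacy = short_id("https://arxiv.org/abs/quant-ph/0201082v1")
print(current)
print(legacy)
```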

def download_pdf( self, dirpath: str = './', filename: str = '', download_domain: str = 'export.arxiv.org') -> str:

Downloads the PDF for this result to the specified directory.

If filename is empty, a default filename is generated from the result's short ID and title.

Deprecated: future versions of this client library will not provide download helpers (out of scope). Use result.pdf_url directly.
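Since the download helpers are deprecated, one possible replacement is to fetch `result.pdf_url` yourself. The sketch below reproduces the default-filename rule from `_get_default_filename` and pairs it with a small download function built on the standard library's `urlretrieve`. The names `default_filename` and `download` are assumptions made for this example, and `download` performs a network request, so it is defined here but not called.

```python
import os
import re
from urllib.request import urlretrieve


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build "<short id>.<sanitized title>.<extension>"."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),             # legacy IDs contain "/"
            re.sub(r"[^\w]", "_", nonempty_title),  # non-word characters become "_"
            extension,
        ]
    )


def download(url: str, dirpath: str, filename: str) -> str:
    """Fetch `url` into dirpath/filename and return the written path."""
    written_path, _ = urlretrieve(url, os.path.join(dirpath, filename))
    return written_path


name = default_filename("quant-ph/0201082v1", "A title!")
print(name)
```

With a real result, a call along the lines of `download(result.pdf_url, "./", default_filename(result.get_short_id(), result.title))` would approximate what `download_pdf` does, minus the domain substitution.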

def download_source( self, dirpath: str = './', filename: str = '', download_domain: str = 'export.arxiv.org') -> str:

Downloads the source tarfile for this result to the specified directory.

If filename is empty, a default filename with the tar.gz extension is generated from the result's short ID and title.

Deprecated: future versions of this client library will not provide download helpers (out of scope). Use result.source_url() directly.

def source_url(self) -> str:

Derives a URL for the source tarfile for this result.
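To make the derivation concrete, here is a small sketch of the two string operations involved: swapping `/pdf/` for `/src/` (as `source_url` does) and replacing the host (as `_substitute_domain` does for downloads). The function names are illustrative, and the example URL simply follows the `/pdf/` pattern.

```python
from urllib.parse import urlparse


def derive_source_url(pdf_url: str) -> str:
    """Swap the /pdf/ path segment for /src/."""
    return pdf_url.replace("/pdf/", "/src/")


def substitute_domain(url: str, domain: str) -> str:
    """Return `url` with its network location replaced by `domain`."""
    return urlparse(url)._replace(netloc=domain).geturl()


pdf_url = "http://arxiv.org/pdf/1605.08386v1"
source = derive_source_url(pdf_url)
mirrored = substitute_domain(source, "export.arxiv.org")
print(source)
print(mirrored)
```

Note that `pdf_url` is documented as optional; when it is `None`, the string replacement has nothing to operate on, so a caller may want to guard against that before deriving a source URL.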

class Result.Author:

A light inner class for representing a result's authors.

Result.Author(name: str)

Constructs an Author with the specified name.

In most cases, prefer using Author._from_feed_author to parsing and constructing Authors yourself.

name: str

The author's name.

class Result.MissingFieldError(builtins.Exception):

An error indicating an entry is unparseable because it lacks required fields.

Result.MissingFieldError(missing_field)
missing_field: str

The required field missing from the would-be entry.

message: str

Message describing what caused this error.
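In the `Client` source, entries that raise this error are logged and skipped rather than aborting the whole iteration. The following sketch shows that pattern in isolation, under simplifying assumptions: entries are plain dictionaries, and `MissingFieldError`, `parse_entry`, and `parse_all` are stand-ins defined only for this example.

```python
from typing import Dict, Iterable, Iterator


class MissingFieldError(Exception):
    """Illustrative stand-in for Result.MissingFieldError."""

    def __init__(self, missing_field: str):
        super().__init__("Entry missing required field: {}".format(missing_field))
        self.missing_field = missing_field


def parse_entry(entry: Dict[str, str]) -> str:
    """Return the entry's id, or raise if the required field is absent."""
    if "id" not in entry:
        raise MissingFieldError("id")
    return entry["id"]


def parse_all(entries: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield parsed entries, skipping any that lack required fields."""
    for entry in entries:
        try:
            yield parse_entry(entry)
        except MissingFieldError as err:
            print("Skipping partial result; missing field:", err.missing_field)


parsed = list(parse_all([{"id": "a"}, {"title": "no id"}, {"id": "b"}]))
print(parsed)
```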

Inherited Members
builtins.BaseException
with_traceback
args
class SortCriterion(enum.Enum):
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

A SortCriterion identifies a property by which search results can be sorted.

See the arXiv API User's Manual: sort order for return results.

Relevance = <SortCriterion.Relevance: 'relevance'>
LastUpdatedDate = <SortCriterion.LastUpdatedDate: 'lastUpdatedDate'>
SubmittedDate = <SortCriterion.SubmittedDate: 'submittedDate'>
Inherited Members
enum.Enum
name
value
class SortOrder(enum.Enum):
class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"

A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.

See the arXiv API User's Manual: sort order for return results.

Ascending = <SortOrder.Ascending: 'ascending'>
Descending = <SortOrder.Descending: 'descending'>
Inherited Members
enum.Enum
name
value
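The enum values for `SortCriterion` and `SortOrder` are the strings the arXiv API expects in a query string. The sketch below redefines the two enums locally and encodes them with `urlencode`. The parameter names `search_query`, `sortBy`, and `sortOrder` come from the arXiv API User's Manual; how this package assembles them internally is not shown in this section, so treat `sort_query` as a hypothetical helper rather than the library's implementation.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def sort_query(query: str, sort_by: SortCriterion, sort_order: SortOrder) -> str:
    """Encode a query plus sort settings as URL query-string arguments."""
    return urlencode(
        {
            "search_query": query,
            "sortBy": sort_by.value,       # .value is the API-facing string
            "sortOrder": sort_order.value,
        }
    )


qs = sort_query("quantum", SortCriterion.SubmittedDate, SortOrder.Descending)
print(qs)
```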
class Client:
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class encapsulates pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request URL for `search` that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.1"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed

Specifies a strategy for fetching results from arXiv's API.

This class hides pagination and retry logic, and exposes Client.results.

Client( page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3)
    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

Constructs an arXiv API client with the specified options.

Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

query_url_format = 'https://export.arxiv.org/api/query?{}'

The arXiv query API endpoint format.
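
As a rough illustration of how this format string is filled in, the sketch below builds the URL for one page of results with the standard library's `urlencode`. The helper name `format_page_url` is hypothetical, and the only query arguments shown (`search_query`, `start`, `max_results`) are an assumption about what a search contributes; a real search may add others.

```python
from urllib.parse import urlencode

# The arXiv query endpoint, with a placeholder for the encoded query string.
QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def format_page_url(url_args: dict, start: int, page_size: int) -> str:
    """Return the URL for one page of results (hypothetical helper)."""
    args = dict(url_args)  # copy so the caller's dict is not mutated
    args.update({"start": start, "max_results": page_size})
    return QUERY_URL_FORMAT.format(urlencode(args))


# The query value is illustrative; urlencode percent-encodes ":" and
# turns spaces into "+".
url = format_page_url(
    {"search_query": "au:del_maestro AND ti:checkerboard"},
    start=0,
    page_size=100,
)
print(url)
```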

page_size: int

Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.

The API's limit is 2000 results per page.
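
To make the trade-off concrete, here is a small sketch of the arithmetic involved. Both function names are hypothetical, and the wait estimate assumes one delay between each pair of consecutive requests and ignores download time and retries.

```python
def page_starts(total_results: int, page_size: int) -> list:
    """Return the `start` index of each request needed to cover the results."""
    return list(range(0, total_results, page_size))


def min_wait_seconds(total_results: int, page_size: int, delay_seconds: float) -> float:
    """Lower bound on time spent sleeping between requests."""
    requests_needed = len(page_starts(total_results, page_size))
    return max(requests_needed - 1, 0) * delay_seconds


# 2500 results with 1000 per page: three requests, two delays in between.
print(page_starts(2500, 1000))
print(min_wait_seconds(2500, 1000, 3.0))
# The same 2500 results with 100 per page: 25 requests, 24 delays.
print(min_wait_seconds(2500, 100, 3.0))
```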

delay_seconds: float

Number of seconds to wait between API requests.

arXiv's Terms of Use ask that you "make no more than one request every three seconds."
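
The delay check can be sketched as a pure function that takes the time of the previous request and the current time. The name `seconds_to_sleep` is hypothetical; the logic follows the comparison in the client source, but returns the wait instead of sleeping so it can be checked without waiting.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime], now: datetime, delay_seconds: float
) -> float:
    """Return how long to wait before the next request may be sent."""
    if last_request is None:
        return 0.0  # no earlier request, so nothing to wait for
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed < required:
        return (required - elapsed).total_seconds()
    return 0.0


t0 = datetime(2024, 1, 1, 12, 0, 0)
# One second after the last request, with a three-second delay configured.
print(seconds_to_sleep(t0, t0 + timedelta(seconds=1), 3.0))
```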

num_retries: int

Number of times to retry a failing API request before raising an Exception.
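
A minimal sketch of what this setting implies: the initial try is number 0, so a request is attempted at most `num_retries + 1` times before the last error propagates. The helper `fetch_with_retries` is hypothetical and uses a loop rather than recursion; the built-in `ConnectionError` stands in for the errors the client actually retries on.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[int], T], num_retries: int) -> T:
    """Call fetch(try_index) until it succeeds or retries are exhausted."""
    try_index = 0
    while True:
        try:
            return fetch(try_index)
        except ConnectionError:
            if try_index >= num_retries:
                raise  # out of retries: surface the last error
            try_index += 1


calls = []


def flaky(try_index: int) -> str:
    """Simulated request that fails on tries 0 and 1, then succeeds."""
    calls.append(try_index)
    if try_index < 2:
        raise ConnectionError("simulated failure")
    return "feed"


print(fetch_with_retries(flaky, num_retries=3), calls)
```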

def results( self, search: Search, offset: int = 0) -> Generator[Result, NoneType, NoneType]:
    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.

If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.

Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.

For more on using generators, see Generators.
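
The interaction between `offset` and `max_results` can be seen in a small standalone sketch. Here `fake_results` is a hypothetical stand-in for the paged fetch (it yields integer indices instead of `Result` objects), and `results` takes `max_results` directly rather than a `Search`.

```python
import itertools
from typing import Iterator, Optional


def fake_results(offset: int) -> Iterator[int]:
    """Stand-in for the paged fetch: yields result indices from `offset` on."""
    return itertools.count(offset)


def results(max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Yield at most `max_results - offset` items, starting at `offset`."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())  # offset is past the end of the requested range
    return itertools.islice(fake_results(offset), limit)


# With max_results=10 and offset=3, the first three records are skipped
# and the remaining seven are yielded.
print(list(results(10, offset=3)))
# An offset at or beyond max_results yields nothing.
print(list(results(10, offset=10)))
```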

class ArxivError(builtins.Exception):
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)

This package's base Exception class.

ArxivError(url: str, retry: int, message: str)
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

Constructs an ArxivError encountered while fetching the specified URL.

url: str

The feed URL that could not be fetched.

retry: int

The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.

message: str

Message describing what caused this error.
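
The sketch below shows how a caller might read these attributes when handling an error. To stay self-contained it defines a stand-in `ArxivError` that mirrors the class documented here instead of importing the package; the `describe` helper, the URL, and the error message are illustrative assumptions.

```python
class ArxivError(Exception):
    """Stand-in mirroring the documented base class (not imported)."""

    def __init__(self, url: str, retry: int, message: str):
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


def describe(err: ArxivError) -> str:
    """Summarize an error; `retry` counts from 0, so add 1 for attempts made."""
    return "attempt {} failed: {}".format(err.retry + 1, err)


try:
    # Hypothetical failure on the second retry of an illustrative URL.
    raise ArxivError(
        "https://export.arxiv.org/api/query?start=0",
        2,
        "Page request resulted in HTTP 503",
    )
except ArxivError as err:
    summary = describe(err)
    print(summary)
```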

Inherited Members
builtins.BaseException
with_traceback
args
class UnexpectedEmptyPageError(ArxivError):
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )

An error raised when a page of results that should be non-empty is empty.

This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.

See Client.results for usage.

UnexpectedEmptyPageError(url: str, retry: int, raw_feed: feedparser.util.FeedParserDict)
    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.

raw_feed: feedparser.util.FeedParserDict

The raw output of feedparser.parse. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError
retry
message
builtins.BaseException
with_traceback
args
class HTTPError(ArxivError):
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status code of the failed response."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )

A non-200 status encountered while fetching a page of results.

See Client.results for usage.

HTTPError(url: str, retry: int, status: int)
    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

Constructs an HTTPError for the specified status code, encountered for the specified API URL after retry tries.

status: int

The HTTP status code of the failed response.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError
retry
message
builtins.BaseException
with_traceback
args