arxiv

arxiv.py

Python wrapper for the arXiv API.

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Install the package:

$ pip install arxiv   # Or `uv add arxiv` or similar.

In your Python code, include the line:

import arxiv

Examples

Tip: [arxivql](https://pypi.org/project/arxivql/) may simplify constructing complex query strings.

Fetching results

import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
  print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)

Fetching results with a custom client

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10.0,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)
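
As a rough illustration of how page size trades off against request count, the sketch below estimates how many requests a search needs and the minimum time spent waiting between them. `estimate_requests` is a hypothetical helper, not part of the package; it assumes every page comes back full and ignores retries and network latency.

```python
from __future__ import annotations

import math


def estimate_requests(total_results: int, page_size: int, delay_seconds: float) -> tuple[int, float]:
    """Hypothetical helper: estimate request count and minimum total delay.

    Assumes every page is full and no request is retried.
    """
    if total_results <= 0:
        return 0, 0.0
    pages = math.ceil(total_results / page_size)
    # The delay applies between consecutive requests, so none precedes the first.
    return pages, (pages - 1) * delay_seconds


print(estimate_requests(5000, 1000, 10.0))  # (5, 40.0)
print(estimate_requests(5000, 100, 3.0))    # (50, 147.0)
```

Under these assumptions, larger pages mean fewer round-trips, at the cost of slower individual responses.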

Logging

To inspect this package's network behavior and API logic, configure a DEBUG-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
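
`logging.basicConfig(level=logging.DEBUG)` also enables debug output from every other library, such as the `urllib3` lines above. The following is a minimal sketch, using only the standard `logging` module, that raises verbosity for this package alone. It assumes the package's loggers live under the `arxiv` name, as the `arxiv.arxiv` prefix in the output above suggests.

```python
import logging

# Keep the root logger, and therefore other libraries, at WARNING...
logging.basicConfig(level=logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

# ...and enable DEBUG only for loggers under the assumed "arxiv" namespace.
# Child loggers such as "arxiv.arxiv" inherit this level.
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.DEBUG)
```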

Types

Client

A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.

Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
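
The rate limit can be pictured as a small calculation made before each request. The sketch below is modeled on the delay check in `Client.__try_parse_feed` from the source listing; `seconds_to_wait` is a hypothetical standalone function, not part of the package's API.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def seconds_to_wait(last_request: datetime | None, now: datetime, delay_seconds: float) -> float:
    """Hypothetical helper: how long to sleep so requests stay `delay_seconds` apart."""
    if last_request is None:
        return 0.0  # No previous request, so there is nothing to wait for.
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed >= required:
        return 0.0
    return (required - elapsed).total_seconds()


t0 = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_wait(None, t0, 3.0))                       # 0.0
print(seconds_to_wait(t0, t0 + timedelta(seconds=1), 3.0))  # 2.0
print(seconds_to_wait(t0, t0 + timedelta(seconds=5), 3.0))  # 0.0
```

Because the timestamp of the last request is stored on the client, the spacing only holds across calls that share one client instance.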

Search

A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
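
To make the relationship between a search and the request it produces concrete, here is a sketch of how search parameters might be encoded into a query URL. It follows the shape of `Search._url_args` and `Client._format_url` in the source listing, but `build_query_url` is a hypothetical helper written for illustration, with the sort parameters fixed to their default values.

```python
from __future__ import annotations

from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(
    query: str = "",
    id_list: list[str] | None = None,
    start: int = 0,
    page_size: int = 100,
) -> str:
    """Hypothetical helper: encode search parameters into an API query URL."""
    args = {
        "search_query": query,
        "id_list": ",".join(id_list or []),
        "sortBy": "relevance",
        "sortOrder": "descending",
        "start": str(start),
        "max_results": str(page_size),
    }
    # urlencode escapes each value, so `query` must be passed unencoded.
    return QUERY_URL_FORMAT.format(urlencode(args))


print(build_query_url(id_list=["1605.08386v1"]))
# https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
```

Since the encoding step escapes the query, a string such as `au:del_maestro AND ti:checkerboard` should be passed as written; pre-encoding it would escape it twice.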

Result

The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.

Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.
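
Two of the derived values on a Result are plain string transformations. The sketch below reproduces them outside the class, following `Result.get_short_id` and `Result.source_url` in the source listing. The function names `short_id` and `source_url_from_pdf` are hypothetical, and the PDF URL is an illustrative value.

```python
def short_id(entry_id: str) -> str:
    """Hypothetical helper: strip everything up to and including 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def source_url_from_pdf(pdf_url: str) -> str:
    """Hypothetical helper: derive a source URL by swapping the /pdf/ path segment."""
    return pdf_url.replace("/pdf/", "/src/")


print(short_id("https://arxiv.org/abs/2107.05580v1"))             # 2107.05580v1
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))       # quant-ph/0201082v1
print(source_url_from_pdf("https://arxiv.org/pdf/2107.05580v1"))  # https://arxiv.org/src/2107.05580v1
```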

Development

This project uses UV for development, while maintaining compatibility with traditional pip installation for end users.

Development Setup

  1. Install UV (if you haven't already):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  2. Clone and setup:

    git clone https://github.com/lukasschwab/arxiv.py
    cd arxiv.py
    make dev-setup
    
""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode, urlparse
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import TYPE_CHECKING, Generator, Iterator

if TYPE_CHECKING:
    from typing_extensions import TypedDict
    import feedparser

    class FeedParserDict(TypedDict, total=False):
        id: str
        title: str
        summary: str
        authors: list[dict[str, str]]
        links: list[dict[str, str]]
        tags: list[dict[str, str]]
        updated_parsed: time.struct_time
        published_parsed: time.struct_time
        arxiv_comment: str
        arxiv_journal_ref: str
        arxiv_doi: str
        arxiv_primary_category: dict[str, str]


logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result:
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: list[Result.Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str | None
    """The authors' comment if present."""
    journal_ref: str | None
    """A journal reference if present."""
    doi: str | None
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: list[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: list[Result.Link]
    """Up to three URLs associated with this result."""
    pdf_url: str | None
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: list[Result.Author] | None = None,
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: list[str] | None = None,
        links: list[Result.Link] | None = None,
        _raw: feedparser.FeedParserDict | None = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors or []
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories or []
        self.links = links or []
        # Calculated members
        self.pdf_url = Result._get_pdf_url(self.links)
        # Debugging
        self._raw = _raw

    @classmethod
    def _from_feed_entry(cls, entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Returns a default filename for this result with the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default is generated by
        `Result._get_default_filename`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        if self.pdf_url is None:
            raise ValueError("No PDF URL available for this result")
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default is generated by
        `Result._get_default_filename`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url()` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url_str = self.source_url()
        if source_url_str is None:
            raise ValueError("No source URL available for this result")
        source_url = Result._substitute_domain(source_url_str, download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def source_url(self) -> str | None:
        """
        Derives a URL for the source tarfile for this result.
        """
        if self.pdf_url is None:
            return None
        return self.pdf_url.replace("/pdf/", "/src/")

    @staticmethod
    def _get_pdf_url(links: list[Result.Link]) -> str | None:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    @staticmethod
    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    @staticmethod
    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author:
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        @classmethod
        def _from_feed_author(cls, feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link:
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str | None
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str | None
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str | None = None,
            rel: str = "",
            content_type: str | None = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        @classmethod
        def _from_feed_link(cls, feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel") or "",
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field: str):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search:
    """
    A specification for a search of arXiv's database.

    To run a search, pass it to `Client.results`. The deprecated
    `Search.results` runs the search with a default client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: list[str] | None = None,
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list or []
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        if self.query and self.id_list:
            return f"Search(query='{self.query}', id_list={len(self.id_list)} items)"
        elif self.query:
            return f"Search(query='{self.query}')"
        elif self.id_list:
            return f"Search(id_list={len(self.id_list)} items)"
        else:
            return "Search(empty)"

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Iterator[Result]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client:
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime | None
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        return f"Client(page_size={self.page_size}, delay={self.delay_seconds}s, retries={self.num_retries})"

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs an API request URL for the search that returns up to
        `page_size` results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": str(start),
                "max_results": str(page_size),
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.2"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status code of the failed response."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
825        )
826
827    def __repr__(self) -> str:
828        return "{}({}, {}, {})".format(
829            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
830        )
831
832
833def _classname(o: object) -> str:
834    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
835    return "arxiv.{}".format(o.__class__.__qualname__)
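The rate-limiting rule enforced by `__try_parse_feed` in the listing above can be illustrated in isolation. This is a minimal sketch, not the package's implementation: the `Throttle` class and its method names are hypothetical and exist only for this example. It applies the same arithmetic (required delay minus time elapsed since the last request).

```python
import time
from datetime import datetime, timedelta
from typing import Optional


class Throttle:
    """Hypothetical helper: keeps calls at least `delay_seconds` apart."""

    def __init__(self, delay_seconds: float):
        self.delay_seconds = delay_seconds
        self._last_call: Optional[datetime] = None

    def seconds_to_wait(self, now: datetime) -> float:
        """Return how long to sleep at `now` to respect the delay (0.0 if none)."""
        if self._last_call is None:
            return 0.0
        required = timedelta(seconds=self.delay_seconds)
        elapsed = now - self._last_call
        return max((required - elapsed).total_seconds(), 0.0)

    def wait(self) -> None:
        """Sleep if needed, then record this call as the most recent one."""
        to_sleep = self.seconds_to_wait(datetime.now())
        if to_sleep > 0:
            time.sleep(to_sleep)
        self._last_call = datetime.now()


throttle = Throttle(delay_seconds=0.1)
throttle.wait()  # The first call returns immediately.
throttle.wait()  # The second call sleeps for roughly the remaining delay.
```

Separating `seconds_to_wait` from `wait` keeps the arithmetic testable without actually sleeping.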
logger = <Logger arxiv (WARNING)>
class Result:
class Result:
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A URL of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: list[Result.Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str | None
    """The authors' comment if present."""
    journal_ref: str | None
    """A journal reference if present."""
    doi: str | None
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: list[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: list[Result.Link]
    """Up to three URLs associated with this result."""
    pdf_url: str | None
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: list[Result.Author] | None = None,
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: list[str] | None = None,
        links: list[Result.Link] | None = None,
        _raw: feedparser.FeedParserDict | None = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors or []
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories or []
        self.links = links or []
        # Calculated members
        self.pdf_url = Result._get_pdf_url(self.links)
        # Debugging
        self._raw = _raw

    @classmethod
    def _from_feed_entry(cls, entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Builds a default filename for this result with the given extension.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default is generated by
        `self._get_default_filename()`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        if self.pdf_url is None:
            raise ValueError("No PDF URL available for this result")
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default is generated by
        `self._get_default_filename("tar.gz")`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url()` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url_str = self.source_url()
        if source_url_str is None:
            raise ValueError("No source URL available for this result")
        source_url = Result._substitute_domain(source_url_str, download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def source_url(self) -> str | None:
        """
        Derives a URL for the source tarfile for this result.
        """
        if self.pdf_url is None:
            return None
        return self.pdf_url.replace("/pdf/", "/src/")

    @staticmethod
    def _get_pdf_url(links: list[Result.Link]) -> str | None:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    @staticmethod
    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    @staticmethod
    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author:
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        @classmethod
        def _from_feed_author(cls, feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link:
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str | None
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str | None
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str | None = None,
            rel: str = "",
            content_type: str | None = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        @classmethod
        def _from_feed_link(cls, feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel") or "",
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field: str):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))

An entry in an arXiv query results feed.

See the arXiv API User's Manual: Details of Atom Results Returned.

Result( entry_id: str, updated: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), published: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), title: str = '', authors: list[Result.Author] | None = None, summary: str = '', comment: str = '', journal_ref: str = '', doi: str = '', primary_category: str = '', categories: list[str] | None = None, links: list[Result.Link] | None = None, _raw: feedparser.util.FeedParserDict | None = None)

Constructs an arXiv search result item.

In most cases, prefer using Result._from_feed_entry to parsing and constructing Results yourself.

entry_id: str

A URL of the form https://arxiv.org/abs/{id}.

updated: datetime.datetime

When the result was last updated.

published: datetime.datetime

When the result was originally published.

title: str

The title of the result.

authors: list[Result.Author]

The result's authors.

summary: str

The result abstract.

comment: str | None

The authors' comment if present.

journal_ref: str | None

A journal reference if present.

doi: str | None

A URL for the resolved DOI to an external resource if present.

primary_category: str

The result's primary arXiv category. See arXiv: Category Taxonomy.

categories: list[str]

All of the result's categories. See arXiv: Category Taxonomy.

pdf_url: str | None

The URL of a PDF version of this result if present among links.
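How `pdf_url` is chosen from a result's links can be shown with a small standalone sketch. The `Link` dataclass and `find_pdf_url` function are hypothetical stand-ins written for this example, not part of the package, and the example URLs are illustrative. They mirror the rule in the `Result._get_pdf_url` source: take the first link whose title is `"pdf"`, otherwise `None`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical stand-in for Result.Link with only the fields used here."""

    href: str
    title: str | None = None


def find_pdf_url(links: list[Link]) -> str | None:
    """Return the href of the first link titled "pdf", or None if there is none."""
    pdf_urls = [link.href for link in links if link.title == "pdf"]
    return pdf_urls[0] if pdf_urls else None


links = [
    Link("https://arxiv.org/abs/1605.08386v1"),
    Link("https://arxiv.org/pdf/1605.08386v1", title="pdf"),
]
print(find_pdf_url(links))  # https://arxiv.org/pdf/1605.08386v1
print(find_pdf_url([]))     # None
```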

def get_short_id(self) -> str:

Returns the short ID for this result.

  • If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".

  • If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).

For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
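The two cases above can be reproduced without the package. This is a sketch only: `short_id` is a hypothetical free function that applies the same split on `"arxiv.org/abs/"` that `get_short_id` uses.

```python
def short_id(entry_id: str) -> str:
    """Return everything after "arxiv.org/abs/" in an entry URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


print(short_id("https://arxiv.org/abs/2107.05580v1"))        # 2107.05580v1
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))  # quant-ph/0201082v1
```

Because the split keeps the last segment, a string that does not contain the prefix is returned unchanged.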

def download_pdf( self, dirpath: str = './', filename: str = '', download_domain: str = 'export.arxiv.org') -> str:

Downloads the PDF for this result to the specified directory.

If filename is empty, a default is generated by self._get_default_filename().

Deprecated: future versions of this client library will not provide download helpers (out of scope). Use result.pdf_url directly.
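Since the download helpers are deprecated in favor of using `result.pdf_url` directly, a caller-side replacement might look like the sketch below. The function names `default_filename`, `with_domain`, and `fetch_pdf` are hypothetical; the first two mirror the logic of `_get_default_filename` and `_substitute_domain` from the source listing, and `fetch_pdf` requires network access.

```python
import os
import re
from urllib.parse import urlparse
from urllib.request import urlretrieve


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build "<id>.<sanitized title>.<extension>" from a short ID and a title."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [short_id.replace("/", "_"), re.sub(r"[^\w]", "_", nonempty_title), extension]
    )


def with_domain(url: str, domain: str) -> str:
    """Swap the host of `url` for `domain`, leaving scheme and path intact."""
    return urlparse(url)._replace(netloc=domain).geturl()


def fetch_pdf(pdf_url: str, short_id: str, title: str, dirpath: str = "./") -> str:
    """Download `pdf_url` into `dirpath` and return the written path (needs network)."""
    path = os.path.join(dirpath, default_filename(short_id, title))
    written_path, _ = urlretrieve(with_domain(pdf_url, "export.arxiv.org"), path)
    return written_path


print(default_filename("quant-ph/0201082v1", "A title: with punctuation"))
# quant-ph_0201082v1.A_title__with_punctuation.pdf
print(with_domain("https://arxiv.org/pdf/2107.05580v1", "export.arxiv.org"))
# https://export.arxiv.org/pdf/2107.05580v1
```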

def download_source( self, dirpath: str = './', filename: str = '', download_domain: str = 'export.arxiv.org') -> str:

Downloads the source tarfile for this result to the specified directory.

If filename is empty, a default is generated by self._get_default_filename("tar.gz").

Deprecated: future versions of this client library will not provide download helpers (out of scope). Use result.source_url() directly.

def source_url(self) -> str | None:

Derives a URL for the source tarfile for this result.
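The derivation is a plain string substitution on the PDF URL. As a standalone sketch (the free function `derive_source_url` is hypothetical and the example URL is illustrative):

```python
from __future__ import annotations


def derive_source_url(pdf_url: str | None) -> str | None:
    """Swap the "/pdf/" path segment for "/src/"; pass None through unchanged."""
    if pdf_url is None:
        return None
    return pdf_url.replace("/pdf/", "/src/")


print(derive_source_url("https://arxiv.org/pdf/2107.05580v1"))
# https://arxiv.org/src/2107.05580v1
print(derive_source_url(None))  # None
```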

class Result.Author:

A light inner class for representing a result's authors.

Result.Author(name: str)

Constructs an Author with the specified name.

In most cases, prefer using Author._from_feed_author to parsing and constructing Authors yourself.

name: str

The author's name.

class Result.MissingFieldError(builtins.Exception):

An error indicating an entry is unparseable because it lacks required fields.

Result.MissingFieldError(missing_field: str)
missing_field: str

The required field missing from the would-be entry.

message: str

Message describing what caused this error.

class SortCriterion(enum.Enum):
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

A SortCriterion identifies a property by which search results can be sorted.

See the arXiv API User's Manual: sort order for return results.

Relevance = <SortCriterion.Relevance: 'relevance'>
LastUpdatedDate = <SortCriterion.LastUpdatedDate: 'lastUpdatedDate'>
SubmittedDate = <SortCriterion.SubmittedDate: 'submittedDate'>
class SortOrder(enum.Enum):
class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"

A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.

See the arXiv API User's Manual: sort order for return results.
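The enum values are the literal strings sent to the arXiv API, so a sort configuration maps onto query parameters roughly as follows. This is a sketch under stated assumptions: the enums are redefined locally so the snippet runs without the package, `sort_args` is a hypothetical helper, and the parameter names `search_query`, `sortBy`, and `sortOrder` are taken from the arXiv API User's Manual rather than from the source shown here.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def sort_args(query: str, sort_by: SortCriterion, sort_order: SortOrder) -> str:
    """Encode a query plus its sort settings as a URL query string."""
    return urlencode(
        {
            "search_query": query,
            "sortBy": sort_by.value,
            "sortOrder": sort_order.value,
        }
    )


print(sort_args("quantum", SortCriterion.SubmittedDate, SortOrder.Descending))
# search_query=quantum&sortBy=submittedDate&sortOrder=descending
```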

Ascending = <SortOrder.Ascending: 'ascending'>
Descending = <SortOrder.Descending: 'descending'>
class Client:
class Client:
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime | None
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        return f"Client(page_size={self.page_size}, delay={self.delay_seconds}s, retries={self.num_retries})"

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

672    def _format_url(self, search: Search, start: int, page_size: int) -> str:
673        """
674        Construct a request API for search that returns up to `page_size`
675        results starting with the result at index `start`.
676        """
677        url_args = search._url_args()
678        url_args.update(
679            {
680                "start": str(start),
681                "max_results": str(page_size),
682            }
683        )
684        return self.query_url_format.format(urlencode(url_args))
685
686    def _parse_feed(
687        self, url: str, first_page: bool = True, _try_index: int = 0
688    ) -> feedparser.FeedParserDict:
689        """
690        Fetches the specified URL and parses it with feedparser.
691
692        If a request fails or is unexpectedly empty, retries the request up to
693        `self.num_retries` times.
694        """
695        try:
696            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
697        except (
698            HTTPError,
699            UnexpectedEmptyPageError,
700            requests.exceptions.ConnectionError,
701        ) as err:
702            if _try_index < self.num_retries:
703                logger.debug("Got error (try %d): %s", _try_index, err)
704                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
705            logger.debug("Giving up (try %d): %s", _try_index, err)
706            raise err
707
708    def __try_parse_feed(
709        self,
710        url: str,
711        first_page: bool,
712        try_index: int,
713    ) -> feedparser.FeedParserDict:
714        """
715        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
716        number of seconds has not passed since `_parse_feed` was last called,
717        sleeps until delay_seconds seconds have passed.
718        """
719        # If this call would violate the rate limit, sleep until it doesn't.
720        if self._last_request_dt is not None:
721            required = timedelta(seconds=self.delay_seconds)
722            since_last_request = datetime.now() - self._last_request_dt
723            if since_last_request < required:
724                to_sleep = (required - since_last_request).total_seconds()
725                logger.info("Sleeping: %f seconds", to_sleep)
726                time.sleep(to_sleep)
727
728        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)
729
730        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.2"})
731        self._last_request_dt = datetime.now()
732        if resp.status_code != requests.codes.OK:
733            raise HTTPError(url, try_index, resp.status_code)
734
735        feed = feedparser.parse(resp.content)
736        if len(feed.entries) == 0 and not first_page:
737            raise UnexpectedEmptyPageError(url, try_index, feed)
738
739        if feed.bozo:
740            logger.warning(
741                "Bozo feed; consider handling: %s",
742                feed.bozo_exception if "bozo_exception" in feed else None,
743            )
744
745        return feed

Specifies a strategy for fetching results from arXiv's API.

This class hides pagination and retry logic behind a single public method, Client.results.

Client(page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3)

Constructs an arXiv API client with the specified options.

Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

query_url_format = 'https://export.arxiv.org/api/query?{}'

The arXiv query API endpoint format.
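
As a rough illustration of how a page URL might be assembled from this format string, the sketch below mirrors the `_format_url` logic shown in the class source. The helper name `format_page_url` and the example `search_query` value are assumptions made for illustration; in the package, the arguments come from `Search._url_args()`.

```python
from urllib.parse import urlencode

# Assumed endpoint format, matching Client.query_url_format.
QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def format_page_url(url_args: dict, start: int, page_size: int) -> str:
    """Build the URL for one page of results (illustrative helper)."""
    args = dict(url_args)  # copy so the caller's dict is not mutated
    args.update({"start": str(start), "max_results": str(page_size)})
    return QUERY_URL_FORMAT.format(urlencode(args))


# Hypothetical search arguments; the real ones come from Search._url_args().
url = format_page_url({"search_query": "quantum"}, start=100, page_size=100)
print(url)
```

Note that the `max_results` URL parameter here is the page size for a single request, not the overall `Search.max_results` cap.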

page_size: int

Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.

The API's limit is 2000 results per page.
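
To make the round-trip trade-off concrete, here is a small back-of-the-envelope sketch. The helper names are hypothetical, and the estimate ignores download time and retries; it only counts requests and the enforced delay between them.

```python
import math


def pages_needed(total_results: int, page_size: int) -> int:
    """Number of API requests needed to fetch `total_results` records."""
    return math.ceil(total_results / page_size)


def min_request_span_seconds(total_results: int, page_size: int, delay_seconds: float) -> float:
    """Lower bound on the time between the first and the last request."""
    # A delay applies between consecutive requests, so there is one fewer
    # delay than there are pages.
    return max(pages_needed(total_results, page_size) - 1, 0) * delay_seconds


# Compare two page sizes for a hypothetical 5000-record result set.
for size in (100, 1000):
    print(size, pages_needed(5000, size), min_request_span_seconds(5000, size, 3.0))
```

Under these assumptions, larger pages mean fewer enforced delays, at the cost of slower individual responses.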

delay_seconds: float

Number of seconds to wait between API requests.

arXiv's Terms of Use ask that you "make no more than one request every three seconds."
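
The delay check can be sketched in isolation. The function below is a simplified stand-in for the rate-limit logic in the class source, not the package's own code: it takes the current time as a parameter (an assumption made so the example is deterministic) and returns the wait instead of sleeping.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(last_request: Optional[datetime], delay_seconds: float, now: datetime) -> float:
    """Return how long to wait before the next request may be sent."""
    if last_request is None:
        return 0.0  # no previous request, so nothing to wait for
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed < required:
        return (required - elapsed).total_seconds()
    return 0.0


t0 = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary reference time
print(seconds_to_sleep(t0, 3.0, t0 + timedelta(seconds=1)))  # 2.0
print(seconds_to_sleep(t0, 3.0, t0 + timedelta(seconds=5)))  # 0.0
print(seconds_to_sleep(None, 3.0, t0))  # 0.0
```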

num_retries: int

Number of times to retry a failing API request before raising an Exception.
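
Judging from the `_parse_feed` source, `num_retries = 3` allows up to four requests in total: the initial try plus three retries. The sketch below illustrates that counting with a hypothetical `fetch_with_retries` helper. It uses a loop where the package uses recursion, and the built-in `ConnectionError` stands in for the package's own exception types.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[int], T], num_retries: int) -> T:
    """Call fetch(try_index) until it succeeds or retries are exhausted."""
    try_index = 0
    while True:
        try:
            return fetch(try_index)
        except ConnectionError:
            if try_index >= num_retries:
                raise  # give up: the initial try and all retries failed
            try_index += 1


attempts: List[int] = []


def flaky_fetch(try_index: int) -> str:
    """Simulated request that fails on its first two tries."""
    attempts.append(try_index)
    if try_index < 2:
        raise ConnectionError("simulated failure")
    return "page"


result = fetch_with_retries(flaky_fetch, num_retries=3)
print(result, attempts)
```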

def results(self, search: Search, offset: int = 0) -> Iterator[Result]:

Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.

If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.

Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.

For more on using generators, see Generators.
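
The interaction between `offset` and `max_results` can be easy to misread, so here is a minimal sketch of the same arithmetic on plain integers. The function `results_window` and its `total` parameter are hypothetical: a `range` stands in for the records the API would return starting at `offset`.

```python
import itertools
from typing import List, Optional


def results_window(total: int, max_results: Optional[int], offset: int = 0) -> List[int]:
    """Return the record indices that would be yielded (illustrative)."""
    # Stand-in for the API: records from `offset` up to the end of the set.
    upstream = iter(range(offset, total))
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return []
    return list(itertools.islice(upstream, limit))


print(results_window(10, 5))        # [0, 1, 2, 3, 4]
print(results_window(10, 5, 2))     # [2, 3, 4]
print(results_window(10, 5, 5))     # []
print(results_window(10, 5, 7))     # []
print(results_window(10, None, 8))  # [8, 9]
```

In this model, `max_results` bounds the result set as a whole, so a nonzero `offset` reduces the number of records yielded rather than shifting the window.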

class ArxivError(builtins.Exception):
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)

This package's base Exception class.

ArxivError(url: str, retry: int, message: str)

Constructs an ArxivError encountered while fetching the specified URL.

url: str

The feed URL that could not be fetched.

retry: int

The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.

message: str

Message describing what caused this error.
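
Because the package's errors share this base class, a single `except ArxivError` clause can handle any of them. The sketch below demonstrates the idea with local stand-in classes that copy the constructors shown in the source, so it runs without network access; the URL and status code are made up for illustration.

```python
class ArxivError(Exception):
    """Local stand-in mirroring the constructor documented above."""

    def __init__(self, url: str, retry: int, message: str):
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class HTTPError(ArxivError):
    """Local stand-in for the package's HTTPError."""

    def __init__(self, url: str, retry: int, status: int):
        self.status = status
        super().__init__(url, retry, "Page request resulted in HTTP {}".format(status))


def describe(err: ArxivError) -> str:
    """Summarize an error using only the base-class attributes."""
    return "try {} failed: {}".format(err.retry, err)


try:
    # Hypothetical URL and status, used only for illustration.
    raise HTTPError("https://export.arxiv.org/api/query?search_query=quantum", 3, 503)
except ArxivError as err:  # the base class also catches subclasses
    summary = describe(err)

print(summary)
```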

Inherited Members
builtins.BaseException
with_traceback
args
class UnexpectedEmptyPageError(ArxivError):
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )

An error raised when a page of results that should be non-empty is empty.

This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.

See Client.results for usage.

UnexpectedEmptyPageError(url: str, retry: int, raw_feed: feedparser.util.FeedParserDict)

Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.

raw_feed: feedparser.util.FeedParserDict

The raw output of feedparser.parse. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError
retry
message
builtins.BaseException
with_traceback
args
class HTTPError(ArxivError):
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status code of the failed response."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )

A non-200 status encountered while fetching a page of results.

See Client.results for usage.

HTTPError(url: str, retry: int, status: int)

Constructs an HTTPError for the specified status code, encountered for the specified API URL after retry tries.

status: int

The HTTP status code of the failed response.

url

The feed URL that could not be fetched.

Inherited Members
ArxivError
retry
message
builtins.BaseException
with_traceback
args