arxiv.arxiv

arxiv.py

Python wrapper for the arXiv API.

About arXiv

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Installation

$ pip install arxiv

In your Python script, include the line

import arxiv

A Search specifies a search of arXiv's database.

arxiv.Search(
  query: str = "",
  id_list: List[str] = [],
  max_results: float = float('inf'),
  sort_by: SortCriterion = SortCriterion.Relevance,
  sort_order: SortOrder = SortOrder.Descending
)
  • query: an arXiv query string. Advanced query formats are documented in the arXiv API User's Manual.
  • id_list: a list of arXiv record IDs (typically of the format "0710.5765v1"). See the arXiv API User's Manual for documentation of the interaction between query and id_list.
  • max_results: the maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=float('inf') (default); to fetch up to 10 results, set max_results=10. The API's limit is 300,000 results.
  • sort_by: the sort criterion for results: SortCriterion.Relevance, SortCriterion.LastUpdatedDate, or SortCriterion.SubmittedDate.
  • sort_order: the sort order for results: SortOrder.Descending or SortOrder.Ascending.
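As a rough illustration of how these fields reach the API: the module source further down builds its request URL by URL-encoding the search fields together with paging values. The sketch below reproduces that idea with the standard library only. build_query_url is a hypothetical helper name used for illustration; it is not part of this package.

```python
from urllib.parse import urlencode


def build_query_url(query, id_list, sort_by, sort_order, start, max_results):
    """Hypothetical helper: encode search fields into an arXiv API query URL."""
    args = {
        "search_query": query,
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": max_results,
    }
    # urlencode percent-encodes reserved characters and turns spaces into "+",
    # which is why the query string itself should be passed unencoded.
    return "http://export.arxiv.org/api/query?{}".format(urlencode(args))


url = build_query_url(
    "au:del_maestro AND ti:checkerboard", [], "submittedDate", "descending", 0, 10
)
print(url)
```

The printed URL contains search_query=au%3Adel_maestro+AND+ti%3Acheckerboard, so pre-encoding the query yourself would encode it twice.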

To fetch arXiv records matching a Search, use search.results() or (Client).results(search) to get a generator yielding Results.

Example: fetching results

Print the titles of the 10 most recent articles related to the keyword "quantum":

import arxiv

search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

for result in search.results():
  print(result.title)

Fetch and print the title of the paper with ID "1605.08386v1":

import arxiv

search = arxiv.Search(id_list=["1605.08386v1"])
paper = next(search.results())
print(paper.title)
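Because results() returns a generator, calling next() on it raises StopIteration when the search yields nothing. A minimal sketch of a defensive pattern follows; fake_results is a hypothetical stand-in for the generator, so the snippet runs without network access.

```python
def fake_results(titles):
    """Hypothetical stand-in for the generator returned by search.results()."""
    for title in titles:
        yield title


# Passing a default to next() avoids StopIteration on an empty generator.
first = next(fake_results([]), None)
if first is None:
    print("no matching paper")
else:
    print(first)
```

With the real package the same pattern would read next(search.results(), None).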

Result

The Result objects yielded by (Search).results() include metadata about each paper and some helper functions for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.

  • result.entry_id: A URL of the form http://arxiv.org/abs/{id}.
  • result.updated: When the result was last updated.
  • result.published: When the result was originally published.
  • result.title: The title of the result.
  • result.authors: The result's authors, as a list of arxiv.Result.Author objects.
  • result.summary: The result abstract.
  • result.comment: The authors' comment if present.
  • result.journal_ref: A journal reference if present.
  • result.doi: A URL for the resolved DOI to an external resource if present.
  • result.primary_category: The result's primary arXiv category. See arXiv: Category Taxonomy.
  • result.categories: All of the result's categories. See arXiv: Category Taxonomy.
  • result.links: Up to three URLs associated with this result, as a list of arxiv.Result.Link objects.
  • result.pdf_url: A URL for the result's PDF if present. Note: this URL also appears among result.links.
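The module source shown later converts the feed's UTC time.struct_time values for updated and published into time-zone-aware datetime objects. The sketch below shows that conversion in isolation; to_utc_datetime is an illustrative name and the timestamp is an arbitrary example, not data from a real record.

```python
import time
from calendar import timegm
from datetime import datetime, timedelta, timezone


def to_utc_datetime(ts: time.struct_time) -> datetime:
    """Convert a UTC struct_time into a time-zone-aware datetime."""
    # timegm interprets the struct_time as UTC, unlike time.mktime.
    return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)


# An arbitrary example timestamp, parsed into a struct_time.
parsed = time.strptime("2016-05-26 17:59:46", "%Y-%m-%d %H:%M:%S")
updated = to_utc_datetime(parsed)
print(updated.isoformat())
```

Aware datetimes can be compared and subtracted directly, for example to keep only results updated within the last week.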

They also expose helper methods for downloading papers: (Result).download_pdf() and (Result).download_source().

Example: downloading papers

To download a PDF of the paper with ID "1605.08386v1", run a Search and then use (Result).download_pdf():

import arxiv

paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")

The same interface is available for downloading .tar.gz files of the paper source:

import arxiv

paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")
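When no filename is given, the source listing below derives one from the result's short ID and title. The following sketch mirrors that logic with standalone functions; short_id and default_filename are illustrative names, and the title is made up.

```python
import re


def short_id(entry_id: str) -> str:
    """Return the part of an arXiv abs URL after 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Build '<short id>.<cleaned title>.<extension>'."""
    nonempty_title = title if title else "UNTITLED"
    # Keep only runs of word characters, joined by underscores.
    clean_title = "_".join(re.findall(r"\w+", nonempty_title))
    return "{}.{}.{}".format(short_id(entry_id), clean_title, extension)


print(default_filename("http://arxiv.org/abs/2107.05580v1", "A hypothetical title: part 1"))
print(default_filename("http://arxiv.org/abs/quant-ph/0201082v1", "", "tar.gz"))
```

Note that a legacy ID such as quant-ph/0201082v1 contains a slash, so the default name includes a path separator; passing an explicit filename sidesteps that.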

Client

A Client specifies a strategy for fetching results from arXiv's API; it hides pagination and retry logic.

For most use cases the default client should suffice. You can construct it explicitly with arxiv.Client(), or use it via the (Search).results() method.

arxiv.Client(
  page_size: int = 100,
  delay_seconds: int = 3,
  num_retries: int = 3
)
  • page_size: the number of papers to fetch from arXiv per page of results. Smaller pages can be retrieved faster, but may require more round-trips. The API's limit is 2000 results.
  • delay_seconds: the number of seconds to wait between requests for pages. arXiv's Terms of Use ask that you "make no more than one request every three seconds."
  • num_retries: The number of times the client will retry a request that fails, either with a non-200 HTTP status code or with an unexpected number of results given the search parameters.
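To make the interaction between page_size and max_results concrete, here is a simplified sketch of the offsets and sizes a paginating client would request. page_requests is a hypothetical helper, and it assumes every response returns as many entries as remain available, which the real client does not rely on.

```python
def page_requests(max_results, total_available, page_size=100):
    """Yield (offset, size) for each request a paginating client would make.

    Hypothetical helper: assumes each response contains every entry still
    available, up to the requested size.
    """
    total = min(max_results, total_available)
    offset = 0
    while offset < total:
        size = min(page_size, max_results - offset)
        yield offset, size
        # Advance by the number of entries assumed to have been returned.
        offset += min(size, total_available - offset)


# A search capped at 250 results, with plenty of matches available.
print(list(page_requests(max_results=250, total_available=1000)))
# An unbounded search over 120 available results.
print(list(page_requests(max_results=float("inf"), total_available=120)))
```

With the default delay_seconds of 3, the first case makes three requests and therefore spends at least six seconds waiting between them.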

Example: fetching results with a custom client

(Search).results() uses the default client settings. If you want to use a client you've defined instead of the defaults, use (Client).results(...):

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)

Example: logging

To inspect this package's network behavior and API logic, configure an INFO-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.INFO)
>>> paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page of results
INFO:arxiv.arxiv:Got first page; 1 of inf results available
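If INFO output from every library is too noisy, one option is to raise the level only for this package's logger. The sketch below uses the standard logging module and relies on the logger name arxiv.arxiv visible in the log prefix above; it does not import arxiv itself.

```python
import logging

# Keep the root logger at WARNING so other libraries stay quiet.
logging.basicConfig(level=logging.WARNING)
# The package logs under "arxiv.arxiv"; setting the level on its parent
# "arxiv" logger applies to it as well.
logging.getLogger("arxiv").setLevel(logging.INFO)

print(logging.getLogger("arxiv.arxiv").getEffectiveLevel() == logging.INFO)
```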
""".. include:: ../README.md"""
import logging
import time
import feedparser
import re
import os
import warnings

from urllib.parse import urlencode
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A URL of the form `http://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: list
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str
    """The authors' comment if present."""
    journal_ref: str
    """A journal reference if present."""
    doi: str
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: list
    """Up to three URLs associated with this result."""
    pdf_url: str
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List['Result.Author'] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List['Result.Link'] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    @staticmethod
    def _from_feed_entry(entry: feedparser.FeedParserDict) -> 'Result':
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning(
                "Result %s is missing title attribute; defaulting to '0'",
                entry.id
            )
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r'\s+', ' ', title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get('arxiv_comment'),
            journal_ref=entry.get('arxiv_journal_ref'),
            doi=entry.get('arxiv_doi'),
            primary_category=entry.arxiv_primary_category.get('term'),
            categories=[tag.get('term') for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            '{}(entry_id={}, updated={}, published={}, title={}, authors={}, '
            'summary={}, comment={}, journal_ref={}, doi={}, '
            'primary_category={}, categories={}, links={})'
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links)
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split('arxiv.org/abs/')[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Builds the default filename for this result with the given extension.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        # Remove disallowed characters.
        clean_title = '_'.join(re.findall(r'\w+', nonempty_title))
        return "{}.{}.{}".format(self.get_short_id(), clean_title, extension)

    def download_pdf(self, dirpath: str = './', filename: str = '') -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default is generated with
        `_get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path

    def download_source(self, dirpath: str = './', filename: str = '') -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default is generated with
        `_get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename('tar.gz')
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace('/pdf/', '/src/')
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    @staticmethod
    def _get_pdf_url(links: list) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == 'pdf']
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning(
                "Result has multiple PDF links; using %s",
                pdf_urls[0]
            )
        return pdf_urls[0]

    @staticmethod
    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        @staticmethod
        def _from_feed_author(
            feed_author: feedparser.FeedParserDict
        ) -> 'Result.Author':
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return '{}({})'.format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        @staticmethod
        def _from_feed_link(
            feed_link: feedparser.FeedParserDict
        ) -> 'Result.Link':
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get('title'),
                rel=feed_link.get('rel'),
                content_type=feed_link.get('content_type')
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return '{}({}, title={}, rel={}, content_type={})'.format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type)
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return '{}({})'.format(
                _classname(self),
                repr(self.missing_field)
            )


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """
    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.results` to use a default client or
    `Client.results` with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: float
    """
    The maximum number of results to be returned in an execution of this
    search.

    To fetch every result available, set `max_results=float('inf')`.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: float = float('inf'),
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        self.max_results = max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return (
            '{}(query={}, id_list={}, max_results={}, sort_by={}, '
            'sort_order={})'
        ).format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order)
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ','.join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value
        }

    def get(self) -> Generator[Result, None, None]:
        """
        **Deprecated** after 1.2.0; use `Search.results`.
        """
        warnings.warn(
            "The 'get' method is deprecated, use 'results' instead",
            DeprecationWarning,
            stacklevel=2
        )
        return self.results()

    def results(self) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client.

        For info on default behavior, see `Client.__init__` and `Client.results`.
        """
        return Client().results(self)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = 'http://export.arxiv.org/api/query?{}'
    """The arXiv query API endpoint format."""
    page_size: int
    """Maximum number of results fetched in a single API request."""
    delay_seconds: int
    """Number of seconds to wait between API requests."""
    num_retries: int
    """Number of times to retry a failing API request."""
    _last_request_dt: datetime

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: int = 3,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return '{}(page_size={}, delay_seconds={}, num_retries={})'.format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries)
        )

    def get(self, search: Search) -> Generator[Result, None, None]:
        """
        **Deprecated** after 1.2.0; use `Client.results`.
        """
        warnings.warn(
            "The 'get' method is deprecated, use 'results' instead",
            DeprecationWarning,
            stacklevel=2
        )
        return self.results(search)

    def results(self, search: Search) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        offset = 0
        # total_results may be reduced according to the feed's
        # opensearch:totalResults value.
        total_results = search.max_results
        first_page = True
        while offset < total_results:
            page_size = min(self.page_size, search.max_results - offset)
            logger.info("Requesting {} results at offset {}".format(
                page_size,
                offset,
            ))
            page_url = self._format_url(search, offset, page_size)
            feed = self._parse_feed(page_url, first_page)
            if first_page:
                # NOTE: this is an ugly fix for a known bug. The totalresults
                # value is set to 1 for results with zero entries. If that API
                # bug is fixed, we can remove this conditional and always set
                # `total_results = min(...)`.
                if len(feed.entries) == 0:
                    logger.info("Got empty results; stopping generation")
                    total_results = 0
                else:
                    total_results = min(
                        total_results,
                        int(feed.feed.opensearch_totalresults)
                    )
                    logger.info("Got first page; {} of {} results available".format(
                        total_results,
                        search.max_results
                    ))
                # Subsequent pages are not the first page.
                first_page = False
            # Update offset for next request: account for received results.
            offset += len(feed.entries)
            # Yield query results until page is exhausted.
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError:
                    logger.warning("Skipping partial result")
                    continue

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs a request URL for `search` that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update({
            "start": start,
            "max_results": page_size,
        })
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self,
        url: str,
        first_page: bool = True
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        # Invoke the recursive helper with initial available retries.
        return self.__try_parse_feed(
            url,
            first_page=first_page,
            retries_left=self.num_retries
        )

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        retries_left: int,
        last_err: Exception = None,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        retry = self.num_retries - retries_left
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping for %f seconds", to_sleep)
                time.sleep(to_sleep)
        logger.info("Requesting page of results", extra={
            'url': url,
            'first_page': first_page,
            'retry': retry,
            'last_err': last_err.message if last_err is not None else None,
        })
        feed = feedparser.parse(url)
        self._last_request_dt = datetime.now()
        err = None
        if feed.status != 200:
            err = HTTPError(url, retry, feed)
        elif len(feed.entries) == 0 and not first_page:
            err = UnexpectedEmptyPageError(url, retry)
        if err is not None:
            if retries_left > 0:
                return self.__try_parse_feed(
                    url,
                    first_page=first_page,
                    retries_left=retries_left-1,
                    last_err=err,
                )
            # Feed was never returned in self.num_retries tries. Raise the last
            # exception encountered.
            raise err
        return feed


class ArxivError(Exception):
697    """This package's base Exception class."""
698
699    url: str
700    """The feed URL that could not be fetched."""
701    retry: int
702    """
703    The request try number which encountered this error; 0 for the initial try,
704    1 for the first retry, and so on.
705    """
706    message: str
707    """Message describing what caused this error."""
708
709    def __init__(self, url: str, retry: int, message: str):
710        """
711        Constructs an `ArxivError` encountered while fetching the specified URL.
712        """
713        self.url = url
714        self.retry = retry
715        self.message = message
716        super().__init__(self.message)
717
718    def __str__(self) -> str:
719        return '{} ({})'.format(self.message, self.url)
720
721
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """
    def __init__(self, url: str, retry: int):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return '{}({}, {})'.format(
            _classname(self),
            repr(self.url),
            repr(self.retry)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""
    entry: feedparser.FeedParserDict
    """The feed entry describing the error, if present."""

    def __init__(self, url: str, retry: int, feed: feedparser.FeedParserDict):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = feed.status
        # If the feed is valid and includes a single entry, trust it's an
        # explanation.
        if not feed.bozo and len(feed.entries) == 1:
            self.entry = feed.entries[0]
        else:
            self.entry = None
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}: {}".format(
                self.status,
                self.entry.summary if self.entry else None,
            ),
        )

    def __repr__(self) -> str:
        return '{}({}, {}, {})'.format(
            _classname(self),
            repr(self.url),
            repr(self.retry),
            repr(self.status)
        )


def _classname(o):
    """
    A helper function for use in `__repr__` methods; returns a qualified
    class name such as `arxiv.Result.Link`.
    """
    return 'arxiv.{}'.format(o.__class__.__qualname__)
class Result:
class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A URL of the form `http://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List['Result.Author']
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str
    """The authors' comment if present."""
    journal_ref: str
    """A journal reference if present."""
    doi: str
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List['Result.Link']
    """Up to three URLs associated with this result."""
    pdf_url: str
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List['Result.Author'] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List['Result.Link'] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> 'Result':
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning(
                "Result %s is missing title attribute; defaulting to '0'",
                entry.id
            )
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r'\s+', ' ', title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get('arxiv_comment'),
            journal_ref=entry.get('arxiv_journal_ref'),
            doi=entry.get('arxiv_doi'),
            primary_category=entry.arxiv_primary_category.get('term'),
            categories=[tag.get('term') for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            '{}(entry_id={}, updated={}, published={}, title={}, authors={}, '
            'summary={}, comment={}, journal_ref={}, doi={}, '
            'primary_category={}, categories={}, links={})'
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links)
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"http://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `"2107.05580v1"`.

        + If the result URL is `"http://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split('arxiv.org/abs/')[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        Builds the default filename for the extension given, of the form
        `{short_id}.{title}.{extension}`.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        # Remove disallowed characters.
        clean_title = '_'.join(re.findall(r'\w+', nonempty_title))
        return "{}.{}.{}".format(self.get_short_id(), clean_title, extension)

    def download_pdf(self, dirpath: str = './', filename: str = '') -> str:
        """
        Downloads the PDF for this result to the specified directory.

        If `filename` is empty, a default filename is generated with
        `_get_default_filename`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path

    def download_source(self, dirpath: str = './', filename: str = '') -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        If `filename` is empty, a default filename is generated with
        `_get_default_filename('tar.gz')`.
        """
        if not filename:
            filename = self._get_default_filename('tar.gz')
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace('/pdf/', '/src/')
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def _get_pdf_url(links: list) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == 'pdf']
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning(
                "Result has multiple PDF links; using %s",
                pdf_urls[0]
            )
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(
            feed_author: feedparser.FeedParserDict
        ) -> 'Result.Author':
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return '{}({})'.format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(
            feed_link: feedparser.FeedParserDict
        ) -> 'Result.Link':
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get('title'),
                rel=feed_link.get('rel'),
                content_type=feed_link.get('content_type')
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return '{}({}, title={}, rel={}, content_type={})'.format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type)
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return '{}({})'.format(
                _classname(self),
                repr(self.missing_field)
            )

An entry in an arXiv query results feed.

See the arXiv API User's Manual: Details of Atom Results Returned.

Result( entry_id: str, updated: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), published: datetime.datetime = datetime.datetime(1, 1, 1, 0, 0), title: str = '', authors: List[arxiv.arxiv.Result.Author] = [], summary: str = '', comment: str = '', journal_ref: str = '', doi: str = '', primary_category: str = '', categories: List[str] = [], links: List[arxiv.arxiv.Result.Link] = [], _raw: feedparser.util.FeedParserDict = None)

Constructs an arXiv search result item.

In most cases, prefer using Result._from_feed_entry to parsing and constructing Results yourself.

entry_id: str

A URL of the form http://arxiv.org/abs/{id}.

updated: datetime.datetime

When the result was last updated.

published: datetime.datetime

When the result was originally published.

title: str

The title of the result.

authors: list

The result's authors.

summary: str

The result abstract.

comment: str

The authors' comment if present.

journal_ref: str

A journal reference if present.

doi: str

A URL for the resolved DOI to an external resource if present.

primary_category: str

The result's primary arXiv category. See arXiv: Category Taxonomy.

categories: List[str]

All of the result's categories. See arXiv: Category Taxonomy.

links: list

Up to three URLs associated with this result.

pdf_url: str

The URL of a PDF version of this result if present among links.
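
As a rough illustration of how `pdf_url` is derived, the sketch below mirrors the selection logic of `Result._get_pdf_url` shown in the class source. It is self-contained: `Link` here is a hypothetical stand-in for `Result.Link`, and `pick_pdf_url` is an illustrative name, not part of the package.

```python
from typing import List, NamedTuple, Optional


class Link(NamedTuple):
    """Hypothetical stand-in for arxiv.Result.Link (illustration only)."""
    href: str
    title: Optional[str] = None


def pick_pdf_url(links: List[Link]) -> Optional[str]:
    """Return the href of the first link titled 'pdf', or None if absent."""
    pdf_urls = [link.href for link in links if link.title == 'pdf']
    return pdf_urls[0] if pdf_urls else None


links = [
    Link("http://arxiv.org/abs/2107.05580v1"),
    Link("http://arxiv.org/pdf/2107.05580v1", title="pdf"),
]
print(pick_pdf_url(links))      # the link whose title is 'pdf'
print(pick_pdf_url(links[:1]))  # None: no PDF link among the links
```

Because `pdf_url` can be `None`, callers may want to check it before attempting a download.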

def get_short_id(self) -> str:

Returns the short ID for this result.

  • If the result URL is "http://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".

  • If the result URL is "http://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).

For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
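
The two cases above can be checked without any network access. The following minimal sketch applies the same string split that `get_short_id` uses to plain URL strings; `short_id` is a hypothetical free function written only for this illustration.

```python
def short_id(entry_id: str) -> str:
    """Apply the same split as Result.get_short_id to a plain URL string."""
    return entry_id.split('arxiv.org/abs/')[-1]


# Current identifier format.
print(short_id("http://arxiv.org/abs/2107.05580v1"))
# Legacy (pre-March 2007) identifier format; note the embedded slash.
print(short_id("http://arxiv.org/abs/quant-ph/0201082v1"))
```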

def download_pdf(self, dirpath: str = './', filename: str = '') -> str:

Downloads the PDF for this result to the specified directory.

If filename is empty, a default filename is generated from the result's short ID and title.
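
To make the default naming concrete, here is a small sketch that reproduces the logic of `Result._get_default_filename` from the class source as a free function. The name `default_filename` and the sample title are assumptions made for this example.

```python
import re


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build '{short_id}.{title}.{extension}', keeping only word characters."""
    nonempty_title = title if title else "UNTITLED"
    # Join runs of word characters with underscores; punctuation is dropped.
    clean_title = '_'.join(re.findall(r'\w+', nonempty_title))
    return "{}.{}.{}".format(short_id, clean_title, extension)


print(default_filename("2107.05580v1", "A Study: of Things!"))
print(default_filename("2107.05580v1", "", "tar.gz"))
```

One consequence worth noting: a legacy short ID such as `quant-ph/0201082v1` contains a `/`, which `os.path.join` treats as a directory separator, so passing an explicit `filename` may be the safer choice for such results.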

def download_source(self, dirpath: str = './', filename: str = '') -> str:

Downloads the source tarfile for this result to the specified directory.

If filename is empty, a default filename with the tar.gz extension is generated from the result's short ID and title.

class Result.Author:

A light inner class for representing a result's authors.

Result.Author(name: str)

Constructs an Author with the specified name.

In most cases, prefer using Author._from_feed_author to parsing and constructing Authors yourself.

name: str

The author's name.

class Result.MissingFieldError(builtins.Exception):

An error indicating an entry is unparseable because it lacks required fields.

Result.MissingFieldError(missing_field)
missing_field: str

The required field missing from the would-be entry.

message: str

Message describing what caused this error.

Inherited members from builtins.BaseException: with_traceback, args
class SortCriterion(enum.Enum):
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

A SortCriterion identifies a property by which search results can be sorted.

See the arXiv API User's Manual: sort order for return results.

Relevance = <SortCriterion.Relevance: 'relevance'>
LastUpdatedDate = <SortCriterion.LastUpdatedDate: 'lastUpdatedDate'>
SubmittedDate = <SortCriterion.SubmittedDate: 'submittedDate'>
Inherited members from enum.Enum: name, value
class SortOrder(enum.Enum):
class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """
    Ascending = "ascending"
    Descending = "descending"

A SortOrder indicates the order in which search results are sorted according to the specified arxiv.SortCriterion.

See the arXiv API User's Manual: sort order for return results.

Ascending = <SortOrder.Ascending: 'ascending'>
Descending = <SortOrder.Descending: 'descending'>
Inherited members from enum.Enum: name, value
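
The enum values are the strings the arXiv API expects for its `sortBy` and `sortOrder` query parameters. The sketch below shows one plausible way those values end up in a request URL; it redefines the two enums locally so it runs on its own, and `build_query_url` is a hypothetical helper rather than the package's own URL builder.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def build_query_url(query, sort_by, sort_order, start=0, max_results=10):
    """Hypothetical helper: encode a search as an arXiv API query URL."""
    args = {
        "search_query": query,
        "sortBy": sort_by.value,        # the enum's string value is sent
        "sortOrder": sort_order.value,
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?{}".format(urlencode(args))


url = build_query_url("quantum", SortCriterion.SubmittedDate, SortOrder.Descending)
print(url)
```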
class Client:
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = 'http://export.arxiv.org/api/query?{}'
    """The arXiv query API endpoint format."""
    page_size: int
    """Maximum number of results fetched in a single API request."""
    delay_seconds: int
    """Number of seconds to wait between API requests."""
    num_retries: int
    """Number of times to retry a failing API request."""
    _last_request_dt: datetime

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: int = 3,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return '{}(page_size={}, delay_seconds={}, num_retries={})'.format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries)
        )

    def get(self, search: Search) -> Generator[Result, None, None]:
        """
        **Deprecated** after 1.2.0; use `Client.results`.
        """
        warnings.warn(
            "The 'get' method is deprecated, use 'results' instead",
            DeprecationWarning,
            stacklevel=2
        )
        return self.results(search)

    def results(self, search: Search) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        offset = 0
        # total_results may be reduced according to the feed's
        # opensearch:totalResults value.
        total_results = search.max_results
        first_page = True
        while offset < total_results:
            page_size = min(self.page_size, search.max_results - offset)
            logger.info("Requesting {} results at offset {}".format(
                page_size,
                offset,
            ))
            page_url = self._format_url(search, offset, page_size)
            feed = self._parse_feed(page_url, first_page)
            if first_page:
                # NOTE: this is an ugly fix for a known bug. The totalresults
                # value is set to 1 for results with zero entries. If that API
                # bug is fixed, we can remove this conditional and always set
                # `total_results = min(...)`.
                if len(feed.entries) == 0:
                    logger.info("Got empty results; stopping generation")
                    total_results = 0
                else:
                    total_results = min(
                        total_results,
                        int(feed.feed.opensearch_totalresults)
                    )
                    logger.info("Got first page; {} of {} results available".format(
                        total_results,
                        search.max_results
                    ))
                # Subsequent pages are not the first page.
                first_page = False
            # Update offset for next request: account for received results.
            offset += len(feed.entries)
            # Yield query results until page is exhausted.
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError:
                    logger.warning("Skipping partial result")
                    continue

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs an API request URL for `search` that returns up to
        `page_size` results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update({
            "start": start,
            "max_results": page_size,
        })
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self,
        url: str,
        first_page: bool = True
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        # Invoke the recursive helper with initial available retries.
        return self.__try_parse_feed(
            url,
            first_page=first_page,
            retries_left=self.num_retries
        )

649    def __try_parse_feed(
650        self,
651        url: str,
652        first_page: bool,
653        retries_left: int,
654        last_err: Exception = None,
655    ) -> feedparser.FeedParserDict:
656        """
657        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
658        number of seconds has not passed since `_parse_feed` was last called,
659        sleeps until delay_seconds seconds have passed.
660        """
661        retry = self.num_retries - retries_left
662        # If this call would violate the rate limit, sleep until it doesn't.
663        if self._last_request_dt is not None:
664            required = timedelta(seconds=self.delay_seconds)
665            since_last_request = datetime.now() - self._last_request_dt
666            if since_last_request < required:
667                to_sleep = (required - since_last_request).total_seconds()
668                logger.info("Sleeping for %f seconds", to_sleep)
669                time.sleep(to_sleep)
670        logger.info("Requesting page of results", extra={
671            'url': url,
672            'first_page': first_page,
673            'retry': retry,
674            'last_err': last_err.message if last_err is not None else None,
675        })
676        feed = feedparser.parse(url)
677        self._last_request_dt = datetime.now()
678        err = None
679        if feed.status != 200:
680            err = HTTPError(url, retry, feed)
681        elif len(feed.entries) == 0 and not first_page:
682            err = UnexpectedEmptyPageError(url, retry)
683        if err is not None:
684            if retries_left > 0:
685                return self.__try_parse_feed(
686                    url,
687                    first_page=first_page,
688                    retries_left=retries_left-1,
689                    last_err=err,
690                )
691            # Feed was never returned in self.num_retries tries. Raise the last
692            # exception encountered.
693            raise err
694        return feed
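
The delay-and-retry behavior of `__try_parse_feed` can be illustrated without network access. The following is a minimal sketch, not the library's implementation: `RateLimitedFetcher`, its `fetch` callable, and `flaky_fetch` are hypothetical names introduced here, and the sketch uses a loop where the library uses recursion.

```python
import time
from datetime import datetime, timedelta
from typing import Callable, Optional


class RateLimitedFetcher:
    """Hypothetical stand-in mirroring the client's delay and retry rules."""

    def __init__(self, fetch: Callable[[str], str],
                 delay_seconds: float = 3, num_retries: int = 3):
        self.fetch = fetch  # injected so the sketch runs offline
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt: Optional[datetime] = None
        self.attempts = 0

    def _wait_for_rate_limit(self) -> None:
        # Sleep only if the previous request was too recent.
        if self._last_request_dt is None:
            return
        required = timedelta(seconds=self.delay_seconds)
        elapsed = datetime.now() - self._last_request_dt
        if elapsed < required:
            time.sleep((required - elapsed).total_seconds())

    def get(self, url: str) -> str:
        last_err: Optional[Exception] = None
        # One initial try plus up to num_retries retries.
        for _ in range(self.num_retries + 1):
            self._wait_for_rate_limit()
            self.attempts += 1
            try:
                return self.fetch(url)
            except OSError as err:
                last_err = err
            finally:
                # Record the request time whether or not it succeeded.
                self._last_request_dt = datetime.now()
        # Every try failed: raise the last error encountered.
        raise last_err


calls = []


def flaky_fetch(url: str) -> str:
    """Simulated fetch that fails twice, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise OSError("simulated transient failure")
    return "ok"


fetcher = RateLimitedFetcher(flaky_fetch, delay_seconds=0.01, num_retries=3)
result = fetcher.get("http://example.invalid/feed")
print(result, fetcher.attempts)
```

As in the source listing, the timestamp is updated after every request, so retries are spaced by `delay_seconds` as well.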

Specifies a strategy for fetching results from arXiv's API.

This class hides pagination and retry logic, and exposes Client.results.

Client(page_size: int = 100, delay_seconds: int = 3, num_retries: int = 3)
    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: int = 3,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None

Constructs an arXiv API client with the specified options.

Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

query_url_format = 'http://export.arxiv.org/api/query?{}'

The arXiv query API endpoint format.
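
As a rough illustration of how `_format_url` fills this format string, the sketch below rebuilds the same steps with the standard library. `format_url` is a hypothetical free function, and the `search_query` key is an assumption about what `Search._url_args()` returns; only `start` and `max_results` are confirmed by the source listing.

```python
from urllib.parse import urlencode

QUERY_URL_FORMAT = "http://export.arxiv.org/api/query?{}"


def format_url(url_args: dict, start: int, page_size: int) -> str:
    """Build a page URL from search arguments plus paging parameters."""
    args = dict(url_args)  # copy so the caller's dict is not mutated
    args.update({"start": start, "max_results": page_size})
    return QUERY_URL_FORMAT.format(urlencode(args))


# "search_query" is an assumed key, used here for illustration only.
url = format_url({"search_query": "quantum"}, start=100, page_size=50)
print(url)
```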

page_size: int

Maximum number of results fetched in a single API request.

delay_seconds: int

Number of seconds to wait between API requests.

num_retries: int

Number of times to retry a failing API request.

def get(self, search: arxiv.arxiv.Search) -> Generator[arxiv.arxiv.Result, NoneType, NoneType]:
    def get(self, search: Search) -> Generator[Result, None, None]:
        """
        **Deprecated** after 1.2.0; use `Client.results`.
        """
        warnings.warn(
            "The 'get' method is deprecated, use 'results' instead",
            DeprecationWarning,
            stacklevel=2
        )
        return self.results(search)

Deprecated after 1.2.0; use Client.results.

def results(self, search: arxiv.arxiv.Search) -> Generator[arxiv.arxiv.Result, NoneType, NoneType]:
    def results(self, search: Search) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        offset = 0
        # total_results may be reduced according to the feed's
        # opensearch:totalResults value.
        total_results = search.max_results
        first_page = True
        while offset < total_results:
            page_size = min(self.page_size, search.max_results - offset)
            logger.info("Requesting {} results at offset {}".format(
                page_size,
                offset,
            ))
            page_url = self._format_url(search, offset, page_size)
            feed = self._parse_feed(page_url, first_page)
            if first_page:
                # NOTE: this is an ugly fix for a known bug. The totalresults
                # value is set to 1 for results with zero entries. If that API
                # bug is fixed, we can remove this conditional and always set
                # `total_results = min(...)`.
                if len(feed.entries) == 0:
                    logger.info("Got empty results; stopping generation")
                    total_results = 0
                else:
                    total_results = min(
                        total_results,
                        int(feed.feed.opensearch_totalresults)
                    )
                    logger.info("Got first page; {} of {} results available".format(
                        total_results,
                        search.max_results
                    ))
                # Subsequent pages are not the first page.
                first_page = False
            # Update offset for next request: account for received results.
            offset += len(feed.entries)
            # Yield query results until page is exhausted.
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError:
                    logger.warning("Skipping partial result")
                    continue

Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.

If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.

For more on using generators, see Generators.
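
The paging arithmetic in `results` can be traced with a small offline sketch. `page_requests` is a hypothetical helper, not part of the package. It assumes the number of available results is known up front and that the server always returns every entry it has for a page. The real client learns the total only from its first response, so it makes at least one request whenever `max_results` is positive.

```python
from typing import Iterator, Tuple


def page_requests(
    max_results: float, page_size: int, available: int
) -> Iterator[Tuple[int, int]]:
    """Yield the (offset, size) pair for each page request, in order."""
    # The first page's totalResults value caps how many results are fetched.
    total = min(max_results, available)
    offset = 0
    while offset < total:
        # Same expression as in the source listing.
        size = min(page_size, max_results - offset)
        yield offset, size
        # Assumption: the server returns every entry it has for this page.
        offset += min(size, available - offset)


capped = list(page_requests(max_results=250, page_size=100, available=1000))
unbounded = list(
    page_requests(max_results=float("inf"), page_size=100, available=230)
)
print(capped)     # [(0, 100), (100, 100), (200, 50)]
print(unbounded)  # [(0, 100), (100, 100), (200, 100)]
```

With a finite `max_results`, the last request shrinks to the remaining count; with the default `float('inf')`, every request asks for a full page and the loop stops once the offset reaches the reported total.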

class ArxivError(builtins.Exception):
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return '{} ({})'.format(self.message, self.url)

This package's base Exception class.

ArxivError(url: str, retry: int, message: str)
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

Constructs an ArxivError encountered while fetching the specified URL.

url: str

The feed URL that could not be fetched.

retry: int

The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.

message: str

Message describing what caused this error.

Inherited Members
builtins.BaseException
with_traceback
args
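
Because both error types raised by `Client.results` derive from `ArxivError`, a single `except ArxivError` clause handles either one. The sketch below shows this using local stand-in classes that copy the constructors and `__str__` from the listings on this page, so it runs without the package installed. `describe` is a hypothetical helper, and the URL is an illustrative value.

```python
class ArxivError(Exception):
    """Stand-in copy of the package's base exception, for illustration."""

    def __init__(self, url: str, retry: int, message: str):
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return '{} ({})'.format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """Stand-in copy of the empty-page error."""

    def __init__(self, url: str, retry: int):
        super().__init__(url, retry, "Page of results was unexpectedly empty")


def describe(err: ArxivError) -> str:
    """Hypothetical helper: summarize an error with its try number."""
    return "try {}: {}".format(err.retry, err)


try:
    # Illustrative URL; retry=2 means the second retry failed.
    raise UnexpectedEmptyPageError(
        "http://export.arxiv.org/api/query?start=100", 2
    )
except ArxivError as err:
    # Catching the base class also catches its subclasses.
    summary = describe(err)

print(summary)
```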
class UnexpectedEmptyPageError(ArxivError):
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """
    def __init__(self, url: str, retry: int):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return '{}({}, {})'.format(
            _classname(self),
            repr(self.url),
            repr(self.retry)
        )

An error raised when a page of results that should be non-empty is empty.

This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.

See Client.results for usage.

UnexpectedEmptyPageError(url: str, retry: int)
    def __init__(self, url: str, retry: int):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        super().__init__(url, retry, "Page of results was unexpectedly empty")

Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.

Inherited Members
ArxivError
url
retry
message
builtins.BaseException
with_traceback
args
class HTTPError(ArxivError):
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""
    entry: feedparser.FeedParserDict
    """The feed entry describing the error, if present."""

    def __init__(self, url: str, retry: int, feed: feedparser.FeedParserDict):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = feed.status
        # If the feed is valid and includes a single entry, trust it's an
        # explanation.
        if not feed.bozo and len(feed.entries) == 1:
            self.entry = feed.entries[0]
        else:
            self.entry = None
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}: {}".format(
                self.status,
                self.entry.summary if self.entry else None,
            ),
        )

    def __repr__(self) -> str:
        return '{}({}, {}, {})'.format(
            _classname(self),
            repr(self.url),
            repr(self.retry),
            repr(self.status)
        )

A non-200 status encountered while fetching a page of results.

See Client.results for usage.

HTTPError(url: str, retry: int, feed: feedparser.util.FeedParserDict)
    def __init__(self, url: str, retry: int, feed: feedparser.FeedParserDict):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = feed.status
        # If the feed is valid and includes a single entry, trust it's an
        # explanation.
        if not feed.bozo and len(feed.entries) == 1:
            self.entry = feed.entries[0]
        else:
            self.entry = None
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}: {}".format(
                self.status,
                self.entry.summary if self.entry else None,
            ),
        )

Constructs an HTTPError for the specified status code, encountered for the specified API URL after retry tries.

status: int

The HTTP status reported by feedparser.

entry: feedparser.util.FeedParserDict

The feed entry describing the error, if present.

Inherited Members
ArxivError
url
retry
message
builtins.BaseException
with_traceback
args