arxiv
arxiv.py
Python wrapper for the arXiv API.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
Usage
Install the package:
$ pip install arxiv # Or `uv add arxiv` or similar.
In your Python code, include the line:
import arxiv
Examples
Fetching results
import arxiv
# Construct the default API client.
client = arxiv.Client()
# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query = "quantum",
    max_results = 10,
    sort_by = arxiv.SortCriterion.SubmittedDate
)
results = client.results(search)
# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])
# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)
# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
> [!TIP]
> [arxivql](https://pypi.org/project/arxivql/) may simplify constructing complex query strings.
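For simple cases, a query string like the one above can also be assembled by hand. The following is only a minimal sketch: `and_query` is a hypothetical helper, not part of this package or of arxivql, and it assumes the field prefixes (`au`, `ti`, `cat`, and so on) described in the arXiv API User Manual.

```python
def and_query(**fields: str) -> str:
    """Join field-prefixed terms with AND (hypothetical helper).

    Keyword names are used as arXiv field prefixes; values are used as-is,
    so any quoting of multi-word phrases is left to the caller.
    """
    return " AND ".join(f"{prefix}:{term}" for prefix, term in fields.items())


print(and_query(au="del_maestro", ti="checkerboard"))
# prints: au:del_maestro AND ti:checkerboard
```

The resulting string is unencoded, which is the form `Search(query=...)` expects.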
Fetching results with a custom client
import arxiv
big_slow_client = arxiv.Client(
    page_size = 1000,
    delay_seconds = 10.0,
    num_retries = 5
)
# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
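To get a feel for how `page_size` and `delay_seconds` trade off, the sketch below estimates the number of requests and the minimum time spent waiting between them. `estimate_fetch` is a hypothetical helper, not part of the package, and it assumes every page comes back full and no request needs a retry.

```python
import math


def estimate_fetch(total_results, page_size=100, delay_seconds=3.0):
    """Return (request count, minimum seconds spent waiting between requests).

    Assumes full pages and no retries; at least one request is always made.
    """
    requests_needed = max(1, math.ceil(total_results / page_size))
    # The delay applies between consecutive requests, not before the first.
    return requests_needed, (requests_needed - 1) * delay_seconds


print(estimate_fetch(10_000))                                      # (100, 297.0)
print(estimate_fetch(10_000, page_size=1000, delay_seconds=10.0))  # (10, 90.0)
```

Under these assumptions, larger pages reduce the number of round trips even when the delay between them is longer.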
Downloading a paper
import arxiv
from urllib.request import urlretrieve
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF.
urlretrieve(paper.pdf_url, "paper.pdf")
# Download the source tarball.
urlretrieve(paper.source_url(), "paper.tar.gz")
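When downloading many papers, it may help to derive filenames from each result's ID. The sketch below mirrors the string split used by `Result.get_short_id` in the source listing; `short_id` and `safe_filename` are hypothetical helpers written for illustration, not package functions.

```python
def short_id(entry_id: str) -> str:
    """Same split as Result.get_short_id: keep what follows 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def safe_filename(entry_id: str, extension: str) -> str:
    """Build a flat filename from an entry URL (hypothetical helper).

    Legacy IDs such as 'quant-ph/0201082v1' contain a slash, which would
    otherwise be interpreted as a directory separator.
    """
    return f"{short_id(entry_id).replace('/', '_')}.{extension}"


print(safe_filename("https://arxiv.org/abs/1605.08386v1", "pdf"))
# prints: 1605.08386v1.pdf
print(safe_filename("https://arxiv.org/abs/quant-ph/0201082v1", "tar.gz"))
# prints: quant-ph_0201082v1.tar.gz
```

With a real result, this could be used as `urlretrieve(paper.pdf_url, safe_filename(paper.entry_id, "pdf"))`. Note that `pdf_url` and `source_url()` are `None` when a result has no PDF link, so it is worth checking them before calling `urlretrieve`.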
Logging
To inspect this package's network behavior and API logic, configure a DEBUG-level logger.
>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
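`logging.basicConfig` configures the root logger, which is why the `urllib3` lines also appear above. If only this package's messages are wanted, one option is to configure a named logger instead. The sketch below assumes the package's loggers live under the name `arxiv` (as the `arxiv.arxiv` prefix in the output suggests); `enable_arxiv_debug_logging` is a hypothetical helper.

```python
import logging


def enable_arxiv_debug_logging() -> logging.Logger:
    """Attach a DEBUG-level stream handler to the 'arxiv' logger only."""
    logger = logging.getLogger("arxiv")  # assumed parent of 'arxiv.arxiv'
    logger.setLevel(logging.DEBUG)
    # Avoid duplicate lines if the root logger is also configured.
    logger.propagate = False
    if not logger.handlers:  # calling this twice should not add a second handler
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
        logger.addHandler(handler)
    return logger
```

Child loggers such as `arxiv.arxiv` inherit the level and handler, while `urllib3` output stays at whatever level the root logger uses.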
Types
Client
A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.
Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
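The rate limit only works if the time of the last request is shared, which is why reusing one client matters. The sketch below illustrates the idea with a hypothetical `MinIntervalGate` class; it is not the package's implementation (the source listing tracks wall-clock `datetime` values, while this sketch assumes a monotonic clock and makes the clock and sleep functions injectable for testing).

```python
import time
from typing import Callable, Optional


class MinIntervalGate:
    """Blocks so that successive calls to wait() are at least delay_seconds apart."""

    def __init__(
        self,
        delay_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay_seconds - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

Two separate gates (or two separate clients) would each keep their own timestamp, so requests made through them could land closer together than the configured delay.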
Search
A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
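`Client.results` also accepts an `offset`, and the way it interacts with `max_results` can be surprising: per the source listing, the number of yielded results is `max_results - offset`, not `max_results`. The sketch below reproduces that arithmetic on plain integers; `result_window` is a hypothetical stand-in that makes no API calls.

```python
import itertools
from typing import List, Optional


def result_window(total: int, max_results: Optional[int], offset: int = 0) -> List[int]:
    """Indices that would be yielded from a result set of `total` records."""
    feed = iter(range(offset, total))  # the feed starts at `offset`
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return []
    return list(itertools.islice(feed, limit))


print(result_window(10, max_results=5))               # [0, 1, 2, 3, 4]
print(result_window(10, max_results=5, offset=2))     # [2, 3, 4]
print(result_window(10, max_results=5, offset=5))     # []
print(result_window(10, max_results=None, offset=8))  # [8, 9]
```

In other words, `max_results` acts as an upper bound on the index reached, so an `offset` at or beyond it discards the whole result set.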
Result
The Result objects yielded by Client.results include metadata about each paper.
The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.
Development
This project uses uv for development, while maintaining compatibility with traditional pip installation for end users.
""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import requests

from importlib.metadata import PackageNotFoundError, version
from urllib.parse import urlencode
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Generator, Iterator

from . import _feed
from ._feed import ParsedFeed


logger = logging.getLogger(__name__)

try:
    __version__ = version("arxiv")
except PackageNotFoundError:
    __version__ = "0.0.0+unknown"

_USER_AGENT = f"arxiv.py/{__version__}"

_DEFAULT_TIME = datetime.min


class Result:
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: list[Result.Author]
    """The result's authors, including any `<arxiv:affiliation>` data."""
    summary: str
    """The result abstract."""
    comment: str | None
    """The authors' comment if present."""
    journal_ref: str | None
    """A journal reference if present."""
    doi: str | None
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: list[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: list[Result.Link]
    """Up to three URLs associated with this result."""
    pdf_url: str | None
    """The URL of a PDF version of this result if present among links."""

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: list[Result.Author] | None = None,
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: list[str] | None = None,
        links: list[Result.Link] | None = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, results are produced by `Client.results`, which parses
        API responses internally.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors or []
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories or []
        self.links = links or []
        # Calculated members
        self.pdf_url = Result._get_pdf_url(self.links)

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def source_url(self) -> str | None:
        """
        Derives a URL for the source tarfile for this result.
        """
        if self.pdf_url is None:
            return None
        return self.pdf_url.replace("/pdf/", "/src/")

    @staticmethod
    def _get_pdf_url(links: list[Result.Link]) -> str | None:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    @staticmethod
    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC `time.struct_time` into a time-zone-aware `datetime`.

        Retained as a stable utility for callers that historically relied on
        feedparser's `*_parsed` time tuples; the internal Atom parser produces
        `datetime` objects directly.
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author:
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""
        affiliation: list[str]
        """
        Any `<arxiv:affiliation>` values associated with this author. Most
        results have no affiliation data and this is an empty list; some
        results have one or more affiliation strings per author.

        See https://github.com/lukasschwab/arxiv.py/issues/62.
        """

        def __init__(self, name: str, affiliation: list[str] | None = None):
            """
            Constructs an `Author` with the specified name and (optional)
            affiliations.
            """
            self.name = name
            self.affiliation = affiliation or []

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            if self.affiliation:
                return "{}({}, affiliation={})".format(
                    _classname(self), repr(self.name), repr(self.affiliation)
                )
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link:
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str | None
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str | None
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str | None = None,
            rel: str = "",
            content_type: str | None = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field: str):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search:
    """
    A specification for a search of arXiv's database.

    To run a search, pass it to `Client.results`.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: list[str] | None = None,
        max_results: int | None = 100,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list or []
        self.max_results = max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        if self.query and self.id_list:
            return f"Search(query='{self.query}', id_list={len(self.id_list)} items)"
        elif self.query:
            return f"Search(query='{self.query}')"
        elif self.id_list:
            return f"Search(id_list={len(self.id_list)} items)"
        else:
            return "Search(empty)"

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }


class Client:
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime | None
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        return f"Client(page_size={self.page_size}, delay={self.delay_seconds}s, retries={self.num_retries})"

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.results:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = feed.header.total_results
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.results),
            total_results,
        )

        while feed.results:
            yield from feed.results
            offset += len(feed.results)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request URL for `search` that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": str(start),
                "max_results": str(page_size),
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(self, url: str, first_page: bool = True, _try_index: int = 0) -> ParsedFeed:
        """
        Fetches the specified URL and parses it as an Atom feed.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> ParsedFeed:
        """
        Helper for `_parse_feed`. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": _USER_AGENT})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = _feed.parse(resp.content)
        if len(feed.results) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.malformed:
            logger.warning("Malformed feed; consider handling: %s", feed.error)

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __reduce__(self) -> tuple:
        return (self.__class__, (self.url, self.retry, self.message))

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: ParsedFeed
    """
    The raw parsed feed. Sometimes this contains useful diagnostic information,
    e.g. in `bozo_exception`.
    """

    def __init__(self, url: str, retry: int, raw_feed: ParsedFeed):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __reduce__(self) -> tuple:
        return (self.__class__, (self.url, self.retry, self.raw_feed))

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by the underlying request."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __reduce__(self) -> tuple:
        return (self.__class__, (self.url, self.retry, self.status))

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o: object) -> str:
    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
    return "arxiv.{}".format(o.__class__.__qualname__)
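The retry behavior in `Client._parse_feed` above can be summarized as "one initial try plus up to `num_retries` retries, re-raising the last error". The sketch below expresses the same counting with a loop instead of recursion. `call_with_retries` and its `retry_on` parameter are hypothetical names used for illustration; the package itself retries on `HTTPError`, `UnexpectedEmptyPageError`, and `requests.exceptions.ConnectionError`.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[int], T],
    num_retries: int = 3,
    retry_on: tuple[type[Exception], ...] = (ConnectionError,),
) -> T:
    """Call fetch(try_index) until it succeeds or retries are exhausted.

    try_index is 0 for the initial try, 1 for the first retry, and so on,
    so fetch is called at most num_retries + 1 times.
    """
    try_index = 0
    while True:
        try:
            return fetch(try_index)
        except retry_on:
            if try_index >= num_retries:
                raise  # give up: propagate the most recent error
            try_index += 1
```

With the default `num_retries=3`, a request that keeps failing is therefore attempted four times before the error reaches the caller.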
A Result is an entry in an arXiv query results feed.
See the arXiv API User's Manual: Details of Atom Results Returned.
def __init__(
    self,
    entry_id: str,
    updated: datetime = _DEFAULT_TIME,
    published: datetime = _DEFAULT_TIME,
    title: str = "",
    authors: list[Result.Author] | None = None,
    summary: str = "",
    comment: str = "",
    journal_ref: str = "",
    doi: str = "",
    primary_category: str = "",
    categories: list[str] | None = None,
    links: list[Result.Link] | None = None,
):
    """
    Constructs an arXiv search result item.

    In most cases, results are produced by `Client.results`, which parses
    API responses internally.
    """
    self.entry_id = entry_id
    self.updated = updated
    self.published = published
    self.title = title
    self.authors = authors or []
    self.summary = summary
    self.comment = comment
    self.journal_ref = journal_ref
    self.doi = doi
    self.primary_category = primary_category
    self.categories = categories or []
    self.links = links or []
    # Calculated members
    self.pdf_url = Result._get_pdf_url(self.links)
Constructs an arXiv search result item.
In most cases, results are produced by Client.results, which parses API responses internally.
def get_short_id(self) -> str:
    """
    Returns the short ID for this result.

    + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
    `result.get_short_id()` returns `2107.05580v1`.

    + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
    `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
    2007 arXiv identifier format).

    For an explanation of the difference between arXiv's legacy and current
    identifiers, see [Understanding the arXiv
    identifier](https://arxiv.org/help/arxiv_identifier).
    """
    return self.entry_id.split("arxiv.org/abs/")[-1]
Returns the short ID for this result.
- If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".
- If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).
For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
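As a minimal sketch of the behavior described above, the following standalone function mirrors the string split used by get_short_id. The function name short_id and the constant ABS_PREFIX are illustrative and not part of the package; the sketch needs no network access.

```python
# Hypothetical standalone mirror of Result.get_short_id, for illustration only.
ABS_PREFIX = "arxiv.org/abs/"


def short_id(entry_id: str) -> str:
    """Return everything after the last "arxiv.org/abs/" in entry_id."""
    return entry_id.split(ABS_PREFIX)[-1]


# Current-style identifier.
print(short_id("https://arxiv.org/abs/2107.05580v1"))
# Legacy (pre-March 2007) identifier; the archive prefix is kept.
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))
```

Because the split keeps only the last segment, a string that does not contain the prefix is returned unchanged.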
class Author:
    """
    A light inner class for representing a result's authors.
    """

    name: str
    """The author's name."""
    affiliation: list[str]
    """
    Any `<arxiv:affiliation>` values associated with this author. Most
    results have no affiliation data and this is an empty list; some
    results have one or more affiliation strings per author.

    See https://github.com/lukasschwab/arxiv.py/issues/62.
    """

    def __init__(self, name: str, affiliation: list[str] | None = None):
        """
        Constructs an `Author` with the specified name and (optional)
        affiliations.
        """
        self.name = name
        self.affiliation = affiliation or []

    def __str__(self) -> str:
        return self.name

    def __repr__(self) -> str:
        if self.affiliation:
            return "{}({}, affiliation={})".format(
                _classname(self), repr(self.name), repr(self.affiliation)
            )
        return "{}({})".format(_classname(self), repr(self.name))

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Result.Author):
            return self.name == other.name
        return False
A light inner class for representing a result's authors.
def __init__(self, name: str, affiliation: list[str] | None = None):
    """
    Constructs an `Author` with the specified name and (optional)
    affiliations.
    """
    self.name = name
    self.affiliation = affiliation or []
Constructs an Author with the specified name and (optional) affiliations.
class Link:
    """
    A light inner class for representing a result's links.
    """

    href: str
    """The link's `href` attribute."""
    title: str | None
    """The link's title."""
    rel: str
    """The link's relationship to the `Result`."""
    content_type: str | None
    """The link's HTTP content type."""

    def __init__(
        self,
        href: str,
        title: str | None = None,
        rel: str = "",
        content_type: str | None = None,
    ):
        """
        Constructs a `Link` with the specified link metadata.
        """
        self.href = href
        self.title = title
        self.rel = rel
        self.content_type = content_type

    def __str__(self) -> str:
        return self.href

    def __repr__(self) -> str:
        return "{}({}, title={}, rel={}, content_type={})".format(
            _classname(self),
            repr(self.href),
            repr(self.title),
            repr(self.rel),
            repr(self.content_type),
        )

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Result.Link):
            return self.href == other.href
        return False
A light inner class for representing a result's links.
def __init__(
    self,
    href: str,
    title: str | None = None,
    rel: str = "",
    content_type: str | None = None,
):
    """
    Constructs a `Link` with the specified link metadata.
    """
    self.href = href
    self.title = title
    self.rel = rel
    self.content_type = content_type
Constructs a Link with the specified link metadata.
class MissingFieldError(Exception):
    """
    An error indicating an entry is unparseable because it lacks required
    fields.
    """

    missing_field: str
    """The required field missing from the would-be entry."""
    message: str
    """Message describing what caused this error."""

    def __init__(self, missing_field: str):
        self.missing_field = missing_field
        self.message = "Entry from arXiv missing required info"

    def __repr__(self) -> str:
        return "{}({})".format(_classname(self), repr(self.missing_field))
An error indicating an entry is unparseable because it lacks required fields.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"
A SortCriterion identifies a property by which search results can be sorted.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"
A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class Search:
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: list[str] | None = None,
        max_results: int | None = 100,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list or []
        self.max_results = max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        if self.query and self.id_list:
            return f"Search(query='{self.query}', id_list={len(self.id_list)} items)"
        elif self.query:
            return f"Search(query='{self.query}')"
        elif self.id_list:
            return f"Search(id_list={len(self.id_list)} items)"
        else:
            return "Search(empty)"

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }
A specification for a search of arXiv's database.
To run a search, pass it to Client.results.
def __init__(
    self,
    query: str = "",
    id_list: list[str] | None = None,
    max_results: int | None = 100,
    sort_by: SortCriterion = SortCriterion.Relevance,
    sort_order: SortOrder = SortOrder.Descending,
):
    """
    Constructs an arXiv API search with the specified criteria.
    """
    self.query = query
    self.id_list = id_list or []
    self.max_results = max_results
    self.sort_by = sort_by
    self.sort_order = sort_order
Constructs an arXiv API search with the specified criteria.
A query string.
This should be unencoded. Use "au:del_maestro AND ti:checkerboard", not "au:del_maestro+AND+ti:checkerboard".
See the arXiv API User's Manual: Details of Query Construction.
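The query must be unencoded because the client URL-encodes it when building the request (its source passes the search arguments through urlencode). The sketch below uses only the standard library to show what happens to each form of the query string. The variable names are illustrative.

```python
from urllib.parse import urlencode

unencoded = "au:del_maestro AND ti:checkerboard"
pre_encoded = "au:del_maestro+AND+ti:checkerboard"

# Spaces become "+" exactly once, which is what the API expects.
good = urlencode({"search_query": unencoded})
# Literal "+" characters are escaped to "%2B", so the API no longer
# sees them as separators between the query terms.
bad = urlencode({"search_query": pre_encoded})

print(good)
print(bad)
```

The first line printed contains `+AND+`; the second contains `%2BAND%2B`, which is the double-encoding that the warning above is meant to prevent.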
A list of arXiv article IDs to which to limit the search.
See the arXiv API User's Manual for documentation of the interaction between query and id_list.
class Client:
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime | None
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        return f"Client(page_size={self.page_size}, delay={self.delay_seconds}s, retries={self.num_retries})"

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.results:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = feed.header.total_results
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.results),
            total_results,
        )

        while feed.results:
            yield from feed.results
            offset += len(feed.results)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": str(start),
                "max_results": str(page_size),
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(self, url: str, first_page: bool = True, _try_index: int = 0) -> ParsedFeed:
        """
        Fetches the specified URL and parses it as an Atom feed.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> ParsedFeed:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": _USER_AGENT})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = _feed.parse(resp.content)
        if len(feed.results) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.malformed:
            logger.warning("Malformed feed; consider handling: %s", feed.error)

        return feed
Specifies a strategy for fetching results from arXiv's API.
This class hides pagination and retry logic behind Client.results.
def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
    """
    Constructs an arXiv API client with the specified options.

    Note: the default parameters should provide a robust request strategy
    for most use cases. Extreme page sizes, delays, or retries risk
    violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
    brittle behavior, and inconsistent results.
    """
    self.page_size = page_size
    self.delay_seconds = delay_seconds
    self.num_retries = num_retries
    self._last_request_dt = None
    self._session = requests.Session()
Constructs an arXiv API client with the specified options.
Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.
Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.
The API's limit is 2000 results per page.
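To make the page-size trade-off concrete, the sketch below lists the (start, max_results) pairs a client would request to cover a given number of results. It assumes every page comes back full. The helper name page_requests is hypothetical and not part of the package.

```python
def page_requests(total_results: int, page_size: int) -> list[tuple[int, int]]:
    """List the (start, max_results) pairs needed to cover total_results.

    Assumes each response returns a full page; illustrative only.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [(start, page_size) for start in range(0, total_results, page_size)]


# 250 results at 100 per page take three round-trips...
print(page_requests(250, 100))
# ...but a single round-trip at 1000 per page.
print(page_requests(250, 1000))
```

With the default three-second delay between requests, fewer round-trips can shorten the total fetch even though each larger page takes longer to retrieve.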
Number of seconds to wait between API requests.
arXiv's Terms of Use ask that you "make no more than one request every three seconds."
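The delay is enforced by sleeping for whatever part of delay_seconds has not yet elapsed since the previous request. The sketch below isolates that arithmetic in a pure function so it can be checked without sleeping or touching the network. The function name seconds_to_sleep is hypothetical, and unlike the client it takes the timestamps as arguments instead of reading the clock.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime], now: datetime, delay_seconds: float
) -> float:
    """Return how long to wait before the next request may be sent."""
    if last_request is None:
        # No earlier request, so there is nothing to wait for.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed < required:
        return (required - elapsed).total_seconds()
    return 0.0


t0 = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_sleep(None, t0, 3.0))                        # first request
print(seconds_to_sleep(t0, t0 + timedelta(seconds=1), 3.0))   # 1 s after the last one
print(seconds_to_sleep(t0, t0 + timedelta(seconds=5), 3.0))   # 5 s after the last one
```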
def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
    """
    Uses this client configuration to fetch one page of the search results
    at a time, yielding the parsed `Result`s, until `max_results` results
    have been yielded or there are no more search results.

    If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

    Setting a nonzero `offset` discards leading records in the result set.
    When `offset` is greater than or equal to `search.max_results`, the full
    result set is discarded.

    For more on using generators, see
    [Generators](https://wiki.python.org/moin/Generators).
    """
    limit = search.max_results - offset if search.max_results else None
    if limit and limit < 0:
        return iter(())
    return itertools.islice(self._results(search, offset), limit)
Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.
If all tries fail, raises the last error encountered: an UnexpectedEmptyPageError, an HTTPError, or a requests.exceptions.ConnectionError.
Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.
For more on using generators, see Generators.
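The interaction between offset and max_results can be sketched without the network by replacing the paged result stream with a counter. The limit computation below follows the source of Client.results. The names fake_results and limited are hypothetical stand-ins.

```python
import itertools
from typing import Iterator, Optional


def fake_results() -> Iterator[int]:
    """Stand-in for the API's result stream: yields 0, 1, 2, ..."""
    return itertools.count()


def limited(max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Yield the records that remain after skipping `offset` of them."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())
    # The real client starts fetching at `offset`; here we skip ahead instead.
    stream = itertools.islice(fake_results(), offset, None)
    return itertools.islice(stream, limit)


print(list(limited(10, offset=7)))    # the last three of ten records
print(list(limited(10, offset=10)))   # offset equals max_results
print(list(limited(10, offset=12)))   # offset exceeds max_results
```

In other words, max_results counts records from the start of the result set, not from offset, so a search with max_results=10 and offset=7 yields at most three results.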
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __reduce__(self) -> tuple:
        return (self.__class__, (self.url, self.retry, self.message))

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)
This package's base Exception class.
def __init__(self, url: str, retry: int, message: str):
    """
    Constructs an `ArxivError` encountered while fetching the specified URL.
    """
    self.url = url
    self.retry = retry
    self.message = message
    super().__init__(self.message)
Constructs an ArxivError encountered while fetching the specified URL.
The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: ParsedFeed
    """
    The raw parsed feed. Sometimes this contains useful diagnostic information,
    e.g. in `bozo_exception`.
    """

    def __init__(self, url: str, retry: int, raw_feed: ParsedFeed):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __reduce__(self) -> tuple:
        return (self.__class__, (self.url, self.retry, self.raw_feed))

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )
An error raised when a page of results that should be non-empty is empty.
This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.
See Client.results for usage.
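The retry behavior mentioned above can be sketched as follows. This is an iterative rewrite for illustration, whereas the client's own helper is recursive. EmptyPage, fetch_with_retries, and flaky are hypothetical names standing in for the real error type and request function.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class EmptyPage(Exception):
    """Hypothetical stand-in for a transient, retryable failure."""


def fetch_with_retries(fetch: Callable[[int], T], num_retries: int) -> T:
    """Call fetch(try_index) until it succeeds or the retries are used up."""
    try_index = 0
    while True:
        try:
            return fetch(try_index)
        except EmptyPage:
            if try_index >= num_retries:
                # Out of retries: surface the last error to the caller.
                raise
            try_index += 1


calls: List[int] = []


def flaky(try_index: int) -> str:
    """Fail on the first two tries, then succeed."""
    calls.append(try_index)
    if try_index < 2:
        raise EmptyPage()
    return "page"


print(fetch_with_retries(flaky, num_retries=3))
print(calls)
```

As in the client, try index 0 is the initial attempt, so num_retries=3 allows up to four requests in total.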
def __init__(self, url: str, retry: int, raw_feed: ParsedFeed):
    """
    Constructs an `UnexpectedEmptyPageError` encountered for the specified
    API URL after `retry` tries.
    """
    self.url = url
    self.raw_feed = raw_feed
    super().__init__(url, retry, "Page of results was unexpectedly empty")
Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.
The raw parsed feed. Sometimes this contains useful diagnostic information, e.g. in bozo_exception.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by the underlying request."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __reduce__(self) -> tuple:
        return (self.__class__, (self.url, self.retry, self.status))

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )
A non-200 status encountered while fetching a page of results.
See Client.results for usage.
def __init__(self, url: str, retry: int, status: int):
    """
    Constructs an `HTTPError` for the specified status code, encountered for
    the specified API URL after `retry` tries.
    """
    self.url = url
    self.status = status
    super().__init__(
        url,
        retry,
        "Page request resulted in HTTP {}".format(self.status),
    )
Inherited Members
- builtins.BaseException
- with_traceback
- args