arxiv
arxiv.py
Python wrapper for the arXiv API.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
Usage
Installation
$ pip install arxiv
In your Python script, include the line
import arxiv
Examples
Fetching results
import arxiv
# Construct the default API client.
client = arxiv.Client()
# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query="quantum",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
results = client.results(search)
# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])
# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)
# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
Fetching results with a custom client
import arxiv
big_slow_client = arxiv.Client(
    page_size=1000,
    delay_seconds=10.0,
    num_retries=5,
)
# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
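To make the paging behavior concrete, here is a minimal, self-contained sketch of how a paged generator combined with `itertools.islice` stops requesting pages once `max_results` items have been yielded. It runs against an in-memory list rather than the arXiv API; `FAKE_DATABASE`, `fetch_page`, `iter_results`, and `limited_results` are hypothetical names for illustration and are not part of this package.

```python
import itertools
from typing import Iterator, List, Optional

# Assumption: an in-memory stand-in for the remote result set.
FAKE_DATABASE = [f"paper-{i}" for i in range(25)]
requests_made: List[int] = []  # start offsets of every "request" issued


def fetch_page(start: int, page_size: int) -> List[str]:
    """Pretend to request one page of results starting at `start`."""
    requests_made.append(start)
    return FAKE_DATABASE[start:start + page_size]


def iter_results(page_size: int, offset: int = 0) -> Iterator[str]:
    """Yield results one at a time, fetching a new page only when needed."""
    page = fetch_page(offset, page_size)
    while page:
        yield from page
        offset += len(page)
        if offset >= len(FAKE_DATABASE):
            break
        page = fetch_page(offset, page_size)


def limited_results(
    page_size: int, max_results: Optional[int] = None, offset: int = 0
) -> Iterator[str]:
    """Cap the generator at `max_results`, counting skipped leading records."""
    limit = max_results - offset if max_results else None
    if limit is not None and limit < 0:
        return iter(())
    return itertools.islice(iter_results(page_size, offset), limit)


titles = list(limited_results(page_size=10, max_results=12))
print(len(titles), requests_made)
```

Because the generator is lazy, only two pages are requested for twelve results with a page size of ten; a larger page size trades fewer round-trips for slower individual requests.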
Logging
To inspect this package's network behavior and API logic, configure a DEBUG-level logger.
>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
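If the urllib3 lines are noisier than you need, one option is to raise the level only for this package's loggers. This sketch assumes the loggers live under the `arxiv` namespace, as the `arxiv.arxiv` prefix in the output above suggests; it uses only the standard `logging` module.

```python
import logging

# The root logger stays at WARNING, so urllib3's DEBUG lines are suppressed.
logging.basicConfig(level=logging.WARNING)

# Assumption: the package logs under the "arxiv" namespace (e.g. "arxiv.arxiv").
# Child loggers inherit this level unless they set their own.
logging.getLogger("arxiv").setLevel(logging.DEBUG)
```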
Types
Client
A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.
Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
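The rate-limit idea can be sketched independently of the network code: remember when the last request happened and sleep for whatever remains of the configured delay. The `RateLimiter` class below is a hypothetical illustration, not this package's implementation; its injectable `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Hypothetical helper: enforce a minimum delay between calls to wait()."""

    def __init__(
        self,
        delay_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self.delay_seconds:
                slept = self.delay_seconds - elapsed
                self._sleep(slept)
        # Record the time of this call so the next one is measured against it.
        self._last_request = self._clock()
        return slept
```

Because the limiter keeps state between calls, sharing one instance (or, in this package, one client) is what makes the delay apply across successive searches.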
Search
A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
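As a rough illustration of what a Search turns into on the wire, the sketch below builds a query URL with the standard library. The parameter names (`search_query`, `id_list`, `sortBy`, `sortOrder`, `start`, `max_results`) are the ones visible in the request URL in the Logging section; `build_query_url` itself is a hypothetical helper, not part of this package.

```python
from typing import Sequence
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(
    query: str = "",
    id_list: Sequence[str] = (),
    sort_by: str = "relevance",
    sort_order: str = "descending",
    start: int = 0,
    page_size: int = 100,
) -> str:
    """Hypothetical helper: encode search parameters into a request URL."""
    args = {
        "search_query": query,  # pass the query unencoded; urlencode escapes it
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    return QUERY_URL_FORMAT.format(urlencode(args))


# The query is percent-encoded here: au%3Adel_maestro+AND+ti%3Acheckerboard
print(build_query_url(query="au:del_maestro AND ti:checkerboard"))
```

This is why query strings should be written unencoded, with literal spaces and colons: encoding them yourself would lead to double encoding.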
Result
The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.
The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.
Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.
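If you prefer to download files with your own tooling, you will typically need a short ID and a filesystem-safe filename for each result. The sketch below shows one way to derive both from an `entry_id` URL and a title; `short_id` and `safe_filename` are hypothetical helpers written for this example, and the sample title is made up.

```python
import re


def short_id(entry_id: str) -> str:
    """Return the part of an abstract URL after 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def safe_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Join the short ID, a sanitized title, and an extension with dots."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            # Legacy IDs such as "quant-ph/0201082v1" contain a slash.
            short_id(entry_id).replace("/", "_"),
            # Replace every non-word character so the name is path-safe.
            re.sub(r"[^\w]", "_", nonempty_title),
            extension,
        ]
    )


print(safe_filename("https://arxiv.org/abs/2107.05580v1", "An Example: Title"))
# prints "2107.05580v1.An_Example__Title.pdf"
```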
""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode, urlparse
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List, Optional

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url = Result._substitute_domain(self.source_url(), download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def source_url(self) -> str:
        """
        Derives a URL for the source tarfile for this result.
        """
        return self.pdf_url.replace("/pdf/", "/src/")

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.1"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o):
    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
    return "arxiv.{}".format(o.__class__.__qualname__)
346 347 In most cases, prefer using `Link._from_feed_link` to parsing and 348 constructing `Link`s yourself. 349 """ 350 self.href = href 351 self.title = title 352 self.rel = rel 353 self.content_type = content_type 354 355 def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link: 356 """ 357 Constructs a `Link` with link metadata specified in a link object 358 from a feed entry. 359 360 See usage in `Result._from_feed_entry`. 361 """ 362 return Result.Link( 363 href=feed_link.href, 364 title=feed_link.get("title"), 365 rel=feed_link.get("rel"), 366 content_type=feed_link.get("content_type"), 367 ) 368 369 def __str__(self) -> str: 370 return self.href 371 372 def __repr__(self) -> str: 373 return "{}({}, title={}, rel={}, content_type={})".format( 374 _classname(self), 375 repr(self.href), 376 repr(self.title), 377 repr(self.rel), 378 repr(self.content_type), 379 ) 380 381 def __eq__(self, other) -> bool: 382 if isinstance(other, Result.Link): 383 return self.href == other.href 384 return False 385 386 class MissingFieldError(Exception): 387 """ 388 An error indicating an entry is unparseable because it lacks required 389 fields. 390 """ 391 392 missing_field: str 393 """The required field missing from the would-be entry.""" 394 message: str 395 """Message describing what caused this error.""" 396 397 def __init__(self, missing_field): 398 self.missing_field = missing_field 399 self.message = "Entry from arXiv missing required info" 400 401 def __repr__(self) -> str: 402 return "{}({})".format(_classname(self), repr(self.missing_field))
An entry in an arXiv query results feed.
See the arXiv API User's Manual: Details of Atom Results Returned.
    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw
Constructs an arXiv search result item.
In most cases, prefer using Result._from_feed_entry to parsing and
constructing Results yourself.
    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]
Returns the short ID for this result.
- If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".
- If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).
For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
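The rule above amounts to keeping everything after "arxiv.org/abs/" in the entry URL. A minimal standalone sketch follows; the helper name short_id is hypothetical and only mirrors the one-line body of get_short_id shown in the source listing, so it runs without the arxiv package installed.

```python
def short_id(entry_id: str) -> str:
    """Return the part of an arXiv abstract URL after 'arxiv.org/abs/'."""
    # Hypothetical helper mirroring Result.get_short_id; not part of the library.
    return entry_id.split("arxiv.org/abs/")[-1]


# Current identifier format.
print(short_id("https://arxiv.org/abs/2107.05580v1"))
# Legacy (pre-March 2007) format: the archive prefix and slash are kept.
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))
```

Because the split keeps the last piece, a string that does not contain "arxiv.org/abs/" is returned unchanged.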
    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path
Downloads the PDF for this result to the specified directory.
If filename is empty, a default name is built from the result's short ID and title, with a .pdf extension.
Deprecated: future versions of this client library will not provide
download helpers (out of scope). Use result.pdf_url directly.
    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url = Result._substitute_domain(self.source_url(), download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path
Downloads the source tarfile for this result to the specified directory.
If filename is empty, a default name is built from the result's short ID and title, with a .tar.gz extension.
Deprecated: future versions of this client library will not provide
download helpers (out of scope). Use result.source_url() directly.
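Since the download helpers are deprecated, it may help to see how the download URL is derived. The sketch below reproduces, as standalone functions, the two string transformations visible in the source listing: swapping "/pdf/" for "/src/" and replacing the URL's domain. The function names and the sample pdf_url value are illustrative assumptions, not library API.

```python
from urllib.parse import urlparse


def source_url(pdf_url: str) -> str:
    """Derive a source-tarfile URL from a PDF URL (mirrors Result.source_url)."""
    return pdf_url.replace("/pdf/", "/src/")


def substitute_domain(url: str, domain: str) -> str:
    """Replace the network location of `url` (mirrors Result._substitute_domain)."""
    return urlparse(url)._replace(netloc=domain).geturl()


# Illustrative value; a real one would come from `result.pdf_url`.
pdf_url = "https://arxiv.org/pdf/1605.08386v1"
print(substitute_domain(source_url(pdf_url), "export.arxiv.org"))
```

The resulting URL can then be fetched with whatever HTTP client the calling code already uses.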
    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False
A light inner class for representing a result's authors.
    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False
A light inner class for representing a result's links.
        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type
    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))
An error indicating an entry is unparseable because it lacks required fields.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"
A SortCriterion identifies a property by which search results can be sorted.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"
A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
A specification for a search of arXiv's database.
To run a search, pass it to Client.results on a specific client. Search.results runs the
search with a default client, but is deprecated after 2.0.0.
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order
Constructs an arXiv API search with the specified criteria.
query: str
A query string.
This should be unencoded. Use au:del_maestro AND ti:checkerboard, not au:del_maestro+AND+ti:checkerboard.
See the arXiv API User's Manual: Details of Query Construction.
id_list: List[str]
A list of arXiv article IDs to which to limit the search.
See the arXiv API User's Manual for documentation of the interaction between query and id_list.
max_results: int | None
The maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=None.
The API's limit is 300,000 results per query.
sort_by: SortCriterion
The sort criterion for results.
sort_order: SortOrder
The sort order for results.
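The reason the query should be passed unencoded is that the client URL-encodes the search parameters itself when it builds the request. A minimal sketch of that step, using only the standard library and the parameter names and endpoint format visible in the source listing; the concrete values below are illustrative assumptions:

```python
from urllib.parse import urlencode

# Parameters as Search._url_args would produce them, plus the paging
# parameters the client adds. Values here are illustrative.
url_args = {
    "search_query": "au:del_maestro AND ti:checkerboard",  # unencoded query
    "id_list": ",".join([]),  # an empty id_list becomes an empty string
    "sortBy": "relevance",
    "sortOrder": "descending",
    "start": 0,
    "max_results": 100,  # the page size, not Search.max_results
}

url = "https://export.arxiv.org/api/query?{}".format(urlencode(url_args))
print(url)
```

Passing an already-encoded query such as au:del_maestro+AND+ti:checkerboard would encode the plus signs a second time, so the API would see literal "+" characters rather than spaces.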
    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
Executes the specified search using a default arXiv API client. For info
on default behavior, see Client.__init__ and Client.results.
Deprecated after 2.0.0; use Client.results.
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.1"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed
Specifies a strategy for fetching results from arXiv's API.
This class obscures pagination and retry logic, and exposes
Client.results.
    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()
Constructs an arXiv API client with the specified options.
Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.
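query_url_format = "https://export.arxiv.org/api/query?{}"
The arXiv query API endpoint format.
page_size: int
Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.
The API's limit is 2000 results per page.
delay_seconds: float
Number of seconds to wait between API requests.
arXiv's Terms of Use ask that you "make no more than one request every three seconds."
num_retries: int
Number of times to retry a failing API request before raising an Exception.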
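To make the delay concrete, here is a small sketch of the "sleep until enough time has passed" check, modeled on the rate-limit logic in the client source listing. The Throttle class and its method names are hypothetical and not part of the library; as a simplification, it records the timestamp inside wait() rather than after an HTTP response arrives.

```python
import time
from datetime import datetime, timedelta
from typing import Optional


class Throttle:
    """Hypothetical helper: enforce a minimum delay between calls."""

    def __init__(self, delay_seconds: float = 3.0):
        self.delay_seconds = delay_seconds
        self._last_request_dt: Optional[datetime] = None

    def seconds_to_wait(self, now: datetime) -> float:
        """How long a request made at `now` must wait; 0.0 if none was made yet."""
        if self._last_request_dt is None:
            return 0.0
        required = timedelta(seconds=self.delay_seconds)
        since_last_request = now - self._last_request_dt
        return max(0.0, (required - since_last_request).total_seconds())

    def wait(self) -> None:
        """Sleep if needed, then record the time of this request."""
        to_sleep = self.seconds_to_wait(datetime.now())
        if to_sleep > 0:
            time.sleep(to_sleep)
        self._last_request_dt = datetime.now()
```

Calling wait() before each request keeps consecutive requests at least delay_seconds apart; with the default of 3.0 that matches the Terms of Use guidance quoted above.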
    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)
Uses this client configuration to fetch one page of the search results
at a time, yielding the parsed Results, until max_results results
have been yielded or there are no more search results.
If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.
Setting a nonzero offset discards leading records in the result set.
When offset is greater than or equal to search.max_results, the full
result set is discarded.
For more on using generators, see Generators.
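The interaction between offset and max_results can be tried without any network access. The sketch below reuses the limit arithmetic from the source listing against a stand-in generator; fake_results and limited are hypothetical names, and the stand-in simply yields record indices instead of parsed results.

```python
import itertools
from typing import Iterator, List, Optional


def fake_results(offset: int, total: int = 10) -> Iterator[int]:
    """Stand-in for the paged API: yields record indices from `offset` onward."""
    yield from range(offset, total)


def limited(max_results: Optional[int], offset: int = 0) -> List[int]:
    """Apply the same limit arithmetic as Client.results to the stand-in."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return []
    return list(itertools.islice(fake_results(offset), limit))


print(limited(5))            # first five records
print(limited(5, offset=2))  # records 2..4: the offset eats into max_results
print(limited(5, offset=5))  # nothing left once offset reaches max_results
print(limited(None, offset=8))  # no cap: everything from the offset onward
```

Note that the offset counts against max_results: with max_results=5 and offset=2, only three records are yielded.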
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)
This package's base Exception class.
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)
Constructs an ArxivError encountered while fetching the specified URL.
The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )
An error raised when a page of results that should be non-empty is empty.
This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.
See Client.results for usage.
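The retry behavior that usually resolves this error can be sketched as a recursive wrapper that re-attempts a fetch until the try index reaches num_retries, which is the shape of the retry logic in the client source listing. Everything named here (EmptyPage, fetch_with_retries, flaky) is a hypothetical stand-in so the example runs without the arxiv package or network access.

```python
from typing import Callable, List


class EmptyPage(Exception):
    """Hypothetical stand-in for UnexpectedEmptyPageError."""


def fetch_with_retries(
    fetch: Callable[[int], List[str]], num_retries: int = 3, try_index: int = 0
) -> List[str]:
    """Call `fetch`, retrying on EmptyPage until `num_retries` retries are used."""
    try:
        return fetch(try_index)
    except EmptyPage:
        if try_index < num_retries:
            return fetch_with_retries(fetch, num_retries, try_index + 1)
        raise  # out of retries: surface the last error to the caller


def flaky(try_index: int) -> List[str]:
    """Simulated page fetch that is empty on the first two tries."""
    if try_index < 2:
        raise EmptyPage()
    return ["entry"]


print(fetch_with_retries(flaky))
```

With num_retries=3 a request is attempted at most four times: the initial try plus three retries.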
    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")
Constructs an UnexpectedEmptyPageError encountered for the specified
API URL after retry tries.
The raw output of feedparser.parse. Sometimes this contains useful
diagnostic information, e.g. in 'bozo_exception'.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )
A non-200 status encountered while fetching a page of results.
See Client.results for usage.
    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )
Inherited Members
- builtins.BaseException
- with_traceback
- args