arxiv
arxiv.py
Python wrapper for the arXiv API.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
Usage
Installation
$ pip install arxiv
In your Python script, include the line
import arxiv
Examples
Fetching results
import arxiv
# Construct the default API client.
client = arxiv.Client()
# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query = "quantum",
    max_results = 10,
    sort_by = arxiv.SortCriterion.SubmittedDate
)
results = client.results(search)
# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])
# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)
# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
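How `max_results` and the optional `offset` argument of `Client.results` bound the generator can be sketched without any network access. This is a minimal standalone sketch that mirrors the `itertools.islice` logic in the source listing on this page; `limited_results` and the fake `titles` stream are hypothetical names used only for illustration, not part of the package.

```python
import itertools
from typing import Iterator, Optional


def limited_results(
    all_results: Iterator[str], max_results: Optional[int], offset: int = 0
) -> Iterator[str]:
    """Cap a result stream the way the source listing caps Client.results.

    `all_results` stands in for the stream of results that already starts
    at `offset`; it is an assumption made for this sketch.
    """
    # With no max_results, the stream is consumed until it runs out.
    limit = max_results - offset if max_results else None
    # An offset past max_results discards the whole result set.
    if limit and limit < 0:
        return iter(())
    return itertools.islice(all_results, limit)


# A fake stream that begins at index 3, as if offset=3 had been requested.
titles = (f"paper {i}" for i in range(3, 1000))
print(list(limited_results(titles, max_results=5, offset=3)))
# prints ['paper 3', 'paper 4']
```

Note that `offset` counts against `max_results`: with `max_results=5` and `offset=3`, only two results are yielded.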
Downloading papers
To download a PDF of the paper with ID "1605.08386v1," run a Search and then use Result.download_pdf():
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")
The same interface is available for downloading .tar.gz files of the paper source:
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")
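According to the source listing on this page, downloads go through `export.arxiv.org` by default, and the source archive URL is derived from the PDF URL by swapping `/pdf/` for `/src/`. The sketch below reproduces that URL rewriting with the standard library only. The function names are hypothetical stand-ins for the package's private helpers, and the example PDF URL is an assumed value rather than one fetched from the API.

```python
from urllib.parse import urlparse


def substitute_domain(url: str, domain: str) -> str:
    """Replace the host part of `url` with `domain`."""
    return urlparse(url)._replace(netloc=domain).geturl()


def source_url(pdf_url: str, download_domain: str = "export.arxiv.org") -> str:
    """Derive the source-archive URL from a PDF URL."""
    return substitute_domain(pdf_url, download_domain).replace("/pdf/", "/src/")


print(source_url("https://arxiv.org/pdf/1605.08386v1"))
# prints https://export.arxiv.org/src/1605.08386v1
```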
Fetching results with a custom client
import arxiv
big_slow_client = arxiv.Client(
    page_size = 1000,
    delay_seconds = 10.0,
    num_retries = 5
)
# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
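To get a feel for how `page_size` trades request count against page size, the sketch below lists the `start` offsets a client would request. It assumes every page comes back full, which the real API does not guarantee; `page_offsets` is a hypothetical helper, not a function in the package.

```python
from typing import List


def page_offsets(total_results: int, page_size: int, offset: int = 0) -> List[int]:
    """Start index of each page request, assuming every page is full."""
    return list(range(offset, total_results, page_size))


# 2500 matching results with page_size=1000 need three requests...
print(page_offsets(total_results=2500, page_size=1000))
# prints [0, 1000, 2000]

# ...while the default page_size=100 needs twenty-five.
print(len(page_offsets(total_results=2500, page_size=100)))
# prints 25
```

With `delay_seconds = 10.0`, each extra request also adds up to ten seconds of waiting, so larger pages can finish sooner overall even though each page is slower to fetch.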
Logging
To inspect this package's network behavior and API logic, configure a DEBUG-level logger.
>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
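`logging.basicConfig(level=logging.DEBUG)` turns on DEBUG output for every library in the process. If only this package's messages are wanted, one option is to configure the package's logger by name. The sketch below assumes the logger is named `arxiv` or a child such as `arxiv.arxiv` (the name shown in the log lines above); it uses only the standard `logging` module.

```python
import logging

# Send records to stderr in the same LEVEL:name:message shape shown above.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

# Assumption: the package logs under "arxiv" or a child logger of it.
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.DEBUG)
arxiv_logger.addHandler(handler)
```

Child loggers inherit the level and propagate to the handler, so `arxiv.arxiv` records are emitted while `urllib3` stays at its own level.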
Types
Client
A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.
Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
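The rate limit works because a client remembers when it last made a request and sleeps for whatever remains of `delay_seconds` before the next one. The following is a minimal sketch of that calculation, modeled on the source listing on this page; `seconds_to_sleep` is a hypothetical standalone function, and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime], now: datetime, delay_seconds: float = 3.0
) -> float:
    """How long to wait before the next request to respect the delay."""
    if last_request is None:
        # First request made by this client: no waiting needed.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    return max(0.0, (required - elapsed).total_seconds())


now = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_sleep(now - timedelta(seconds=1), now))
# prints 2.0
```

A fresh client has no record of earlier requests, which is why creating a new client for every call sidesteps the limit, while reusing one enforces it.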
Search
A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
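It can help to see how a search turns into a request URL. The sketch below rebuilds the query string with `urllib.parse.urlencode`, following the parameter names that appear in the log output and source listing on this page. `build_query_url` is a hypothetical helper written for illustration; the package does this internally.

```python
from typing import List, Optional
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(
    query: str = "",
    id_list: Optional[List[str]] = None,
    sort_by: str = "relevance",
    sort_order: str = "descending",
    start: int = 0,
    page_size: int = 100,
) -> str:
    """Build one page request URL from unencoded search parameters."""
    url_args = {
        "search_query": query,
        "id_list": ",".join(id_list or []),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    # urlencode escapes ':' and turns spaces into '+'.
    return QUERY_URL_FORMAT.format(urlencode(url_args))


print(build_query_url(query="au:del_maestro AND ti:checkerboard"))
# prints https://export.arxiv.org/api/query?search_query=au%3Adel_maestro+AND+ti%3Acheckerboard&id_list=&sortBy=relevance&sortOrder=descending&start=0&max_results=100
```

Because the encoding happens at this step, query strings should be passed unencoded: `au:del_maestro AND ti:checkerboard`, not `au:del_maestro+AND+ti:checkerboard`.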
Result
The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.
The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.
Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.
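Two pieces of Result behavior are easy to reproduce offline: extracting the short ID from the entry URL and building the default download filename. This is a standalone sketch modeled on the source listing on this page; `short_id` and `default_filename` are hypothetical function names, and the title is a made-up placeholder rather than a real paper title.

```python
import re


def short_id(entry_id: str) -> str:
    """Strip the abs-URL prefix, leaving the arXiv identifier."""
    return entry_id.split("arxiv.org/abs/")[-1]


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Join the short ID, a sanitized title, and the extension with dots."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            # Legacy IDs contain '/', which is not safe in a filename.
            short_id(entry_id).replace("/", "_"),
            re.sub(r"[^\w]", "_", nonempty_title),
            extension,
        ]
    )


print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))
# prints quant-ph/0201082v1
print(default_filename("https://arxiv.org/abs/2107.05580v1", "A Title"))
# prints 2107.05580v1.A_Title.pdf
```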
""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode, urlparse
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List, Optional

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        # Bodge: construct the source URL from the PDF URL.
        src_url = pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(src_url, path)
        return written_path

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.2.0"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o):
    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
    return "arxiv.{}".format(o.__class__.__qualname__)
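The retry behavior in `_parse_feed` above (one initial try plus up to `num_retries` retries, then re-raise) can be sketched in isolation. This version is iterative rather than recursive, and the built-in `ConnectionError` stands in for the three error types the listing catches; `fetch_with_retries` and `flaky_fetch` are hypothetical names used only for this illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[int], T], num_retries: int = 3) -> T:
    """Call `fetch(try_index)` until it succeeds or retries run out."""
    try_index = 0
    while True:
        try:
            return fetch(try_index)
        except ConnectionError:
            if try_index >= num_retries:
                # Out of retries: surface the last error to the caller.
                raise
            try_index += 1


attempts: List[int] = []


def flaky_fetch(try_index: int) -> str:
    """Simulated request that fails twice before succeeding."""
    attempts.append(try_index)
    if try_index < 2:
        raise ConnectionError("simulated failure")
    return "page"


print(fetch_with_retries(flaky_fetch), attempts)
# prints page [0, 1, 2]
```

With `num_retries=3`, a request that always fails is attempted four times in total before the error reaches the caller.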
272 """ 273 parsed_url = urlparse(url) 274 return parsed_url._replace(netloc=domain).geturl() 275 276 class Author(object): 277 """ 278 A light inner class for representing a result's authors. 279 """ 280 281 name: str 282 """The author's name.""" 283 284 def __init__(self, name: str): 285 """ 286 Constructs an `Author` with the specified name. 287 288 In most cases, prefer using `Author._from_feed_author` to parsing 289 and constructing `Author`s yourself. 290 """ 291 self.name = name 292 293 def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author: 294 """ 295 Constructs an `Author` with the name specified in an author object 296 from a feed entry. 297 298 See usage in `Result._from_feed_entry`. 299 """ 300 return Result.Author(feed_author.name) 301 302 def __str__(self) -> str: 303 return self.name 304 305 def __repr__(self) -> str: 306 return "{}({})".format(_classname(self), repr(self.name)) 307 308 def __eq__(self, other) -> bool: 309 if isinstance(other, Result.Author): 310 return self.name == other.name 311 return False 312 313 class Link(object): 314 """ 315 A light inner class for representing a result's links. 316 """ 317 318 href: str 319 """The link's `href` attribute.""" 320 title: Optional[str] 321 """The link's title.""" 322 rel: str 323 """The link's relationship to the `Result`.""" 324 content_type: str 325 """The link's HTTP content type.""" 326 327 def __init__( 328 self, 329 href: str, 330 title: str = None, 331 rel: str = None, 332 content_type: str = None, 333 ): 334 """ 335 Constructs a `Link` with the specified link metadata. 336 337 In most cases, prefer using `Link._from_feed_link` to parsing and 338 constructing `Link`s yourself. 339 """ 340 self.href = href 341 self.title = title 342 self.rel = rel 343 self.content_type = content_type 344 345 def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link: 346 """ 347 Constructs a `Link` with link metadata specified in a link object 348 from a feed entry. 
349 350 See usage in `Result._from_feed_entry`. 351 """ 352 return Result.Link( 353 href=feed_link.href, 354 title=feed_link.get("title"), 355 rel=feed_link.get("rel"), 356 content_type=feed_link.get("content_type"), 357 ) 358 359 def __str__(self) -> str: 360 return self.href 361 362 def __repr__(self) -> str: 363 return "{}({}, title={}, rel={}, content_type={})".format( 364 _classname(self), 365 repr(self.href), 366 repr(self.title), 367 repr(self.rel), 368 repr(self.content_type), 369 ) 370 371 def __eq__(self, other) -> bool: 372 if isinstance(other, Result.Link): 373 return self.href == other.href 374 return False 375 376 class MissingFieldError(Exception): 377 """ 378 An error indicating an entry is unparseable because it lacks required 379 fields. 380 """ 381 382 missing_field: str 383 """The required field missing from the would-be entry.""" 384 message: str 385 """Message describing what caused this error.""" 386 387 def __init__(self, missing_field): 388 self.missing_field = missing_field 389 self.message = "Entry from arXiv missing required info" 390 391 def __repr__(self) -> str: 392 return "{}({})".format(_classname(self), repr(self.missing_field))
An entry in an arXiv query results feed.
See the arXiv API User's Manual: Details of Atom Results Returned.
    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw
Constructs an arXiv search result item.
In most cases, prefer using `Result._from_feed_entry` to parsing and constructing `Result`s yourself.
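The constructor does not accept `pdf_url` directly; it derives it from `links`. The lookup can be sketched as below. This is a minimal illustration, assuming a simplified `Link` dataclass and a `find_pdf_url` helper, both hypothetical stand-ins rather than the library's own classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    """Hypothetical stand-in for Result.Link, for illustration only."""

    href: str
    title: Optional[str] = None


def find_pdf_url(links: List[Link]) -> Optional[str]:
    """Return the href of the first link titled "pdf", or None if absent."""
    pdf_urls = [link.href for link in links if link.title == "pdf"]
    return pdf_urls[0] if pdf_urls else None


links = [
    Link("https://arxiv.org/abs/1605.08386v1"),
    Link("https://arxiv.org/pdf/1605.08386v1", title="pdf"),
]
print(find_pdf_url(links))
```

A result built without any link titled `"pdf"` therefore ends up with `pdf_url` set to `None`, which is worth checking before calling a download method.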
    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]
Returns the short ID for this result.
- If the result URL is `"https://arxiv.org/abs/2107.05580v1"`, `result.get_short_id()` returns `"2107.05580v1"`.
- If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`, `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March 2007 arXiv identifier format).
For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
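The two cases above can be reproduced on plain strings without a network call. The sketch below applies the same split to a URL string; `short_id` is a hypothetical free function used only for illustration, not part of the package.

```python
def short_id(entry_id: str) -> str:
    """Apply the same split as get_short_id to a plain URL string."""
    return entry_id.split("arxiv.org/abs/")[-1]


for url in (
    "https://arxiv.org/abs/2107.05580v1",
    "https://arxiv.org/abs/quant-ph/0201082v1",
):
    print(short_id(url))
```

Because only the text after `arxiv.org/abs/` is kept, legacy identifiers retain their archive prefix and slash.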
    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path
Downloads the PDF for this result to the specified directory.
If `filename` is empty, a default filename is generated by `_get_default_filename()`. Returns the path the file was written to.
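The default filename joins the short ID, a sanitized title, and the extension. The sketch below re-implements that rule on plain strings so the naming can be previewed without downloading anything; `default_filename` is a hypothetical helper that mirrors the logic of `_get_default_filename` shown in the class source.

```python
import re


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build "<short id>.<sanitized title>.<extension>" from plain strings."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),  # legacy IDs contain a slash
            re.sub(r"[^\w]", "_", nonempty_title),  # non-word characters become "_"
            extension,
        ]
    )


print(default_filename("quant-ph/0201082v1", "A Title: Test"))
print(default_filename("2107.05580v1", "", "tar.gz"))
```

Each non-word character is replaced individually, so a colon followed by a space yields two consecutive underscores.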
    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        # Bodge: construct the source URL from the PDF URL.
        src_url = pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(src_url, path)
        return written_path
Downloads the source tarfile for this result to the specified directory.
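If `filename` is empty, a default filename with a `tar.gz` extension is generated by `_get_default_filename("tar.gz")`. Returns the path the file was written to.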
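The source URL is not read from the feed; it is derived from the PDF URL by swapping the domain and replacing `/pdf/` with `/src/`. A minimal sketch of that derivation on a plain string, where `source_url` is a hypothetical helper name:

```python
from urllib.parse import urlparse


def source_url(pdf_url: str, download_domain: str = "export.arxiv.org") -> str:
    """Swap the URL's domain, then turn the PDF path into a source path."""
    swapped = urlparse(pdf_url)._replace(netloc=download_domain).geturl()
    return swapped.replace("/pdf/", "/src/")


print(source_url("https://arxiv.org/pdf/1605.08386v1"))
```

Since the derivation starts from `pdf_url`, a result without a PDF link has no usable source URL either.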
    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False
A light inner class for representing a result's authors.
    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False
A light inner class for representing a result's links.
        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type
    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))
An error indicating an entry is unparseable because it lacks required fields.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"
A SortCriterion identifies a property by which search results can be sorted.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"
A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
A specification for a search of arXiv's database.
To run a search, pass it to `Client.results`. The deprecated `Search.results` runs it with a default client.
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order
Constructs an arXiv API search with the specified criteria.
A query string.
This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not `au:del_maestro+AND+ti:checkerboard`.
See the arXiv API User's Manual: Details of Query Construction.
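The query must be unencoded because the client URL-encodes the search parameters itself when it builds the request. The sketch below reproduces that step with `urllib.parse.urlencode`. The `build_url` helper and its fixed sort values are illustrative assumptions; only the endpoint format and parameter names come from the source above.

```python
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_url(query: str, start: int = 0, page_size: int = 100) -> str:
    """Hypothetical helper: assemble a request URL the way the listing does."""
    url_args = {
        "search_query": query,
        "id_list": "",
        "sortBy": "relevance",
        "sortOrder": "descending",
        "start": start,
        "max_results": page_size,
    }
    return QUERY_URL_FORMAT.format(urlencode(url_args))


# An unencoded query is encoded exactly once.
print(build_url("au:del_maestro AND ti:checkerboard"))
# A pre-encoded query is encoded again: each "+" becomes "%2B".
print(build_url("au:del_maestro+AND+ti:checkerboard"))
```

In the second call the literal `+` characters no longer stand for spaces, so the API receives a different query from the one intended.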
A list of arXiv article IDs to which to limit the search.
See the arXiv API User's Manual for documentation of the interaction between `query` and `id_list`.
The maximum number of results to be returned in an execution of this search. To fetch every result available, set `max_results=None`.
The API's limit is 300,000 results per query.
    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
Executes the specified search using a default arXiv API client. For info on default behavior, see `Client.__init__` and `Client.results`.
Deprecated after 2.0.0; use `Client.results`.
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.2.0"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed
Specifies a strategy for fetching results from arXiv's API.
This class obscures pagination and retry logic, and exposes `Client.results`.
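The pagination the client hides amounts to requesting pages at increasing offsets until the reported total is reached. The sketch below shows that loop in simplified form against an in-memory list. `paginate`, `fake_fetch_page`, and `DATA` are hypothetical names, and the real client additionally reads the total from the first page and retries failed requests.

```python
from typing import Callable, Iterator, List

DATA = [f"paper-{i}" for i in range(7)]  # stand-in for the remote result set
requests_made: List[int] = []  # records the start index of each "request"


def fake_fetch_page(start: int, page_size: int) -> List[str]:
    """Hypothetical page fetcher backed by an in-memory list."""
    requests_made.append(start)
    return DATA[start : start + page_size]


def paginate(
    fetch_page: Callable[[int, int], List[str]],
    total: int,
    page_size: int,
    offset: int = 0,
) -> Iterator[str]:
    """Yield entries page by page until `total` entries have been covered."""
    while offset < total:
        page = fetch_page(offset, page_size)
        if not page:
            break  # an empty page ends this simplified loop
        yield from page
        offset += len(page)


titles = list(paginate(fake_fetch_page, total=len(DATA), page_size=3))
print(titles)
print(requests_made)
```

Because the function is a generator, no page is requested until the caller starts iterating, which matches how `Client.results` is consumed in the usage examples.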
    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()
Constructs an arXiv API client with the specified options.
Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.
The arXiv query API endpoint format.
Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.
The API's limit is 2000 results per page.
Number of seconds to wait between API requests.
arXiv's Terms of Use ask that you "make no more than one request every three seconds."
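The delay is enforced by comparing the time since the previous request with `delay_seconds` and sleeping for the remainder. The sketch below isolates that arithmetic so it can be checked without sleeping; `seconds_to_sleep` is a hypothetical helper, not a function the package exposes.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime],
    now: datetime,
    delay_seconds: float = 3.0,
) -> float:
    """Return how long to wait so requests are at least delay_seconds apart."""
    if last_request is None:
        return 0.0  # first request: nothing to wait for
    required = timedelta(seconds=delay_seconds)
    since_last_request = now - last_request
    return max((required - since_last_request).total_seconds(), 0.0)


last = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_sleep(last, last + timedelta(seconds=1)))
print(seconds_to_sleep(last, last + timedelta(seconds=5)))
```

With the default three-second delay, a request made one second after the previous one waits the remaining two seconds, while a request made five seconds later proceeds immediately.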
    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)
Uses this client configuration to fetch one page of the search results at a time, yielding the parsed `Result`s, until `max_results` results have been yielded or there are no more search results.
If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.
Setting a nonzero `offset` discards leading records in the result set. When `offset` is greater than or equal to `search.max_results`, the full result set is discarded.
For more on using generators, see Generators.
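The interaction between `offset` and `max_results` comes down to an `itertools.islice` over a stream that already starts at `offset`. The sketch below reproduces that limit calculation with integers standing in for results; `limited` and `results_from` are hypothetical names.

```python
import itertools
from typing import Iterator, Optional


def results_from(offset: int) -> Iterator[int]:
    """Hypothetical result stream: record numbers starting at `offset`."""
    return iter(range(offset, 1000))


def limited(results: Iterator[int], max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Cap a stream that starts at `offset` so it stops at record `max_results`."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())
    return itertools.islice(results, limit)


print(list(limited(results_from(0), max_results=5)))
print(list(limited(results_from(3), max_results=5, offset=3)))
print(list(limited(results_from(5), max_results=5, offset=5)))
```

In other words, `max_results` bounds the position in the full result set rather than the number of items returned after the offset.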
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)
This package's base Exception class.
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)
Constructs an `ArxivError` encountered while fetching the specified URL.
The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )
An error raised when a page of results that should be non-empty is empty.
This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.
See `Client.results` for usage.
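Because empty pages are usually transient, the client retries the request a bounded number of times before the error reaches the caller. The sketch below shows that recursive retry pattern with a deliberately flaky fetch function. `EmptyPageError`, `fetch_with_retries`, and `flaky_fetch` are hypothetical stand-ins for the package's own error and request code.

```python
from typing import Callable, List


class EmptyPageError(Exception):
    """Hypothetical stand-in for UnexpectedEmptyPageError."""


def fetch_with_retries(
    fetch: Callable[[int], List[str]],
    num_retries: int = 3,
    try_index: int = 0,
) -> List[str]:
    """Call `fetch`; on an empty page, retry up to `num_retries` more times."""
    try:
        return fetch(try_index)
    except EmptyPageError:
        if try_index < num_retries:
            return fetch_with_retries(fetch, num_retries, try_index + 1)
        raise  # retries exhausted: let the caller see the error


attempts: List[int] = []


def flaky_fetch(try_index: int) -> List[str]:
    """Simulated request that returns an empty page on its first two tries."""
    attempts.append(try_index)
    if try_index < 2:
        raise EmptyPageError("page of results was unexpectedly empty")
    return ["entry-a", "entry-b"]


print(fetch_with_retries(flaky_fetch))
print(attempts)
```

With `num_retries=3` a request is attempted at most four times, which is why the error's `retry` field counts from 0 for the initial try.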
    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")
Constructs an `UnexpectedEmptyPageError` encountered for the specified API URL after `retry` tries.
The raw output of `feedparser.parse`. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )
A non-200 status encountered while fetching a page of results.
See `Client.results` for usage.
    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )
Inherited Members
- builtins.BaseException
- with_traceback
- args