arxiv
arxiv.py
Python wrapper for the arXiv API.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
Usage
Installation
$ pip install arxiv
In your Python script, include the line
import arxiv
Examples
Fetching results
import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query="quantum",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
Downloading papers
To download a PDF of the paper with ID "1605.08386v1," run a Search and then use Result.download_pdf():
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")
The same interface is available for downloading .tar.gz files of the paper source:
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")
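When no filename is given, the default name is derived from the result's short ID and title. The following is a minimal standalone sketch of that derivation, mirroring `Result._get_default_filename` and the `/pdf/` to `/src/` substitution in `Result.download_source` from the module source. The function names `default_filename` and `source_url`, the sample title, and the sample PDF URL are illustrative assumptions, not part of the package.

```python
import re


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Build '<short id>.<sanitized title>.<extension>' (hypothetical helper)."""
    # The short ID is everything after "arxiv.org/abs/"; legacy IDs contain "/".
    short_id = entry_id.split("arxiv.org/abs/")[-1]
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),
            # Replace every non-word character with an underscore.
            re.sub(r"[^\w]", "_", nonempty_title),
            extension,
        ]
    )


def source_url(pdf_url: str) -> str:
    """Derive a source-archive URL from a PDF URL (hypothetical helper)."""
    return pdf_url.replace("/pdf/", "/src/")


print(default_filename("https://arxiv.org/abs/1605.08386v1", "An Example Title"))
print(default_filename("https://arxiv.org/abs/quant-ph/0201082v1", "", "tar.gz"))
print(source_url("https://arxiv.org/pdf/1605.08386v1"))
```

Because the title is sanitized rather than truncated, long titles produce long filenames; passing an explicit `filename` avoids that.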
Fetching results with a custom client
import arxiv
big_slow_client = arxiv.Client(
    page_size=1000,
    delay_seconds=10.0,
    num_retries=5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
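To get a feel for the trade-off between `page_size` and `delay_seconds`, the arithmetic can be sketched as follows. This is a rough estimate only, assuming every page comes back full and no retries occur; `estimate_paging` is a hypothetical helper, not part of the package.

```python
import math
from typing import Tuple


def estimate_paging(total_results: int, page_size: int, delay_seconds: float) -> Tuple[int, float]:
    """Return (number of requests, minimum seconds spent waiting between them)."""
    requests_needed = math.ceil(total_results / page_size)
    # The delay applies between consecutive requests, not before the first one.
    min_wait = max(requests_needed - 1, 0) * delay_seconds
    return requests_needed, min_wait


# 2,500 results with the custom client above: 3 requests, at least 20 s of waiting.
print(estimate_paging(2500, page_size=1000, delay_seconds=10.0))
# The same 2,500 results with the default client settings (100 per page, 3 s delay).
print(estimate_paging(2500, page_size=100, delay_seconds=3.0))
```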
Logging
To inspect this package's network behavior and API logic, configure a DEBUG-level logger.
>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
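The request URL in the log output can be reproduced with the standard library alone. The sketch below mirrors `Search._url_args` and `Client._format_url` from the module source; `format_url` is a hypothetical standalone function, not the package's API.

```python
from typing import List
from urllib.parse import urlencode

# Endpoint format taken from Client.query_url_format in the module source.
QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def format_url(
    query: str,
    id_list: List[str],
    start: int,
    page_size: int,
    sort_by: str = "relevance",
    sort_order: str = "descending",
) -> str:
    """Build the URL for one page of results (hypothetical helper)."""
    url_args = {
        "search_query": query,
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    # urlencode percent-encodes values, which is why queries are passed unencoded.
    return QUERY_URL_FORMAT.format(urlencode(url_args))


print(format_url("", ["1605.08386v1"], start=0, page_size=100))
print(format_url("au:del_maestro AND ti:checkerboard", [], start=0, page_size=100))
```

Note that `max_results` in the URL is the page size, not the search's overall `max_results`; the overall limit is applied client-side.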
Types
Client
A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.
Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
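The rate limit is enforced by sleeping before a request whenever the previous one was too recent. A minimal sketch of that calculation, modeled on the delay check in the module source, might look like the following; `seconds_to_sleep` is a hypothetical pure function written for illustration, whereas the real client tracks the last request time internally.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(last_request: Optional[datetime], now: datetime, delay_seconds: float) -> float:
    """Return how long to wait before the next request may be sent."""
    if last_request is None:
        # No request has been made yet, so there is nothing to wait for.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    since_last_request = now - last_request
    if since_last_request < required:
        return (required - since_last_request).total_seconds()
    return 0.0


t0 = datetime(2024, 1, 1, 12, 0, 0)
# One second after the last request with a 3-second delay: wait 2 more seconds.
print(seconds_to_sleep(t0, t0 + timedelta(seconds=1), 3.0))
```

This is also why reusing one client matters: a fresh client has no record of earlier requests and will not wait.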
Search
A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
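How `max_results` interacts with the `offset` argument of Client.results can be sketched without touching the network. The example below copies the limit arithmetic from the module source and substitutes a fake feed of 25 integer records for the API; `fake_results` and `limited_results` are hypothetical names used only here.

```python
import itertools
from typing import Iterator, Optional


def fake_results(offset: int = 0) -> Iterator[int]:
    """Stand-in for the paged API feed: yields record indices from `offset` to 24."""
    return iter(range(offset, 25))


def limited_results(max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Apply the same limit arithmetic as Client.results to the fake feed."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        # The offset skips past everything the search was allowed to return.
        return iter(())
    return itertools.islice(fake_results(offset), limit)


print(list(limited_results(10, offset=4)))    # records 4 through 9
print(list(limited_results(10, offset=12)))   # nothing: offset exceeds max_results
print(list(limited_results(None, offset=20))) # no limit: records 20 through 24
```

In other words, the offset counts against `max_results` rather than shifting the window: with `max_results=10` and `offset=4`, six results are yielded, not ten.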
Result
The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.
The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.
Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source.
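Each Result is identified by its `entry_id` URL, and `Result.get_short_id` reduces that URL to the bare arXiv identifier. A standalone sketch of the same string handling, using the two example URLs from the method's docstring, is shown here; `short_id` is a hypothetical free function, while the real method lives on `Result`.

```python
def short_id(entry_id: str) -> str:
    """Return the part of an arXiv abstract URL after 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


# Current identifier format.
print(short_id("https://arxiv.org/abs/2107.05580v1"))
# Legacy (pre-March 2007) identifier format, which includes a category prefix.
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))
```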
""".. include:: ../README.md"""
from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import Dict, Generator, List, Optional

logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result(object):
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: List[Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: Optional[str]
    """The authors' comment if present."""
    journal_ref: Optional[str]
    """A journal reference if present."""
    doi: Optional[str]
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: List[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: List[Link]
    """Up to three URLs associated with this result."""
    pdf_url: Optional[str]
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw

    def _from_feed_entry(entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path

    def download_source(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def _get_pdf_url(links: List[Link]) -> str:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, " "sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.1.3"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o):
    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
    return "arxiv.{}".format(o.__class__.__qualname__)
329 """ 330 return Result.Link( 331 href=feed_link.href, 332 title=feed_link.get("title"), 333 rel=feed_link.get("rel"), 334 content_type=feed_link.get("content_type"), 335 ) 336 337 def __str__(self) -> str: 338 return self.href 339 340 def __repr__(self) -> str: 341 return "{}({}, title={}, rel={}, content_type={})".format( 342 _classname(self), 343 repr(self.href), 344 repr(self.title), 345 repr(self.rel), 346 repr(self.content_type), 347 ) 348 349 def __eq__(self, other) -> bool: 350 if isinstance(other, Result.Link): 351 return self.href == other.href 352 return False 353 354 class MissingFieldError(Exception): 355 """ 356 An error indicating an entry is unparseable because it lacks required 357 fields. 358 """ 359 360 missing_field: str 361 """The required field missing from the would-be entry.""" 362 message: str 363 """Message describing what caused this error.""" 364 365 def __init__(self, missing_field): 366 self.missing_field = missing_field 367 self.message = "Entry from arXiv missing required info" 368 369 def __repr__(self) -> str: 370 return "{}({})".format(_classname(self), repr(self.missing_field))
An entry in an arXiv query results feed.
See the arXiv API User's Manual: Details of Atom Results Returned.
    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List[Author] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List[Link] = [],
        _raw: feedparser.FeedParserDict = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories
        self.links = links
        # Calculated members
        self.pdf_url = Result._get_pdf_url(links)
        # Debugging
        self._raw = _raw
Constructs an arXiv search result item.
In most cases, prefer using Result._from_feed_entry to parsing and constructing Result objects yourself.
    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]
Returns the short ID for this result.
- If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".
- If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).
For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
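As a quick illustration, the sketch below reproduces the split shown in get_short_id as a standalone function, so it runs without the package or network access. The helper name short_id is hypothetical and not part of the package; the two URLs are the ones from the docstring.

```python
def short_id(entry_id: str) -> str:
    """Standalone mirror of the split used in Result.get_short_id (sketch only)."""
    # Everything after "arxiv.org/abs/" is the short ID; legacy IDs keep
    # their category prefix and slash.
    return entry_id.split("arxiv.org/abs/")[-1]


new_style = short_id("https://arxiv.org/abs/2107.05580v1")
legacy = short_id("https://arxiv.org/abs/quant-ph/0201082v1")
print(new_style)  # 2107.05580v1
print(legacy)     # quant-ph/0201082v1
```

Note that a string without the "arxiv.org/abs/" marker is returned unchanged, because split() then yields a single element.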
    def download_pdf(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        written_path, _ = urlretrieve(self.pdf_url, path)
        return written_path
Downloads the PDF for this result to the specified directory.
If filename is empty, a default name is generated by Result._get_default_filename(). Returns the path the file was written to.
    def download_source(self, dirpath: str = "./", filename: str = "") -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        # Bodge: construct the source URL from the PDF URL.
        source_url = self.pdf_url.replace("/pdf/", "/src/")
        written_path, _ = urlretrieve(source_url, path)
        return written_path
Downloads the source tarfile for this result to the specified directory.
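If filename is empty, a default name with the extension "tar.gz" is generated by Result._get_default_filename(). The source URL is derived from the PDF URL by replacing "/pdf/" with "/src/". Returns the path the file was written to.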
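To make the default naming concrete, here is a minimal standalone sketch of the logic shown in _get_default_filename in the class source. The function name default_filename and the sample title are assumptions made for illustration; the real method reads the short ID and title from the Result itself.

```python
import re


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Standalone mirror of Result._get_default_filename (sketch only)."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),             # legacy IDs contain a slash
            re.sub(r"[^\w]", "_", nonempty_title),  # non-word characters become "_"
            extension,
        ]
    )


pdf_name = default_filename("quant-ph/0201082v1", "A Title: with spaces")
src_name = default_filename("2107.05580v1", "", "tar.gz")
print(pdf_name)  # quant-ph_0201082v1.A_Title__with_spaces.pdf
print(src_name)  # 2107.05580v1.UNTITLED.tar.gz
```

Passing an explicit filename to download_pdf or download_source bypasses this logic entirely.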
    class Author(object):
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        def _from_feed_author(feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False
A light inner class for representing a result's authors.
    class Link(object):
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: Optional[str]
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        def _from_feed_link(feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel"),
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False
A light inner class for representing a result's links.
        def __init__(
            self,
            href: str,
            title: str = None,
            rel: str = None,
            content_type: str = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type
    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))
An error indicating an entry is unparseable because it lacks required fields.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"
A SortCriterion identifies a property by which search results can be sorted.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"
A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class Search(object):
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: List[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, " "sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> Dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
A specification for a search of arXiv's database.
To run a search, pass it to Client.results. The deprecated Search.results method runs the search with a default client.
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order
Constructs an arXiv API search with the specified criteria.
query: str
A query string.
This should be unencoded. Use au:del_maestro AND ti:checkerboard, not au:del_maestro+AND+ti:checkerboard.
See the arXiv API User's Manual: Details of Query Construction.
id_list: List[str]
A list of arXiv article IDs to which to limit the search.
See the arXiv API User's Manual for documentation of the interaction between query and id_list.
max_results: int | None
The maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=None.
The API's limit is 300,000 results per query.
sort_by: SortCriterion
The sort criterion for results.
sort_order: SortOrder
The sort order for results.
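The reason the query should be unencoded is that the client URL-encodes the search parameters itself. The sketch below mirrors Search._url_args and Client._format_url from the class sources using only the standard library. The function build_query_url and its keyword defaults are assumptions for illustration, not part of the package.

```python
from typing import List
from urllib.parse import urlencode

# Endpoint format as shown in the Client class source.
QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(
    query: str,
    id_list: List[str],
    sort_by: str = "relevance",
    sort_order: str = "descending",
    start: int = 0,
    page_size: int = 100,
) -> str:
    """Hypothetical helper combining _url_args and _format_url (sketch only)."""
    url_args = {
        "search_query": query,
        "id_list": ",".join(id_list),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,
        "max_results": page_size,
    }
    return QUERY_URL_FORMAT.format(urlencode(url_args))


good_url = build_query_url("au:del_maestro AND ti:checkerboard", [])
bad_url = build_query_url("au:del_maestro+AND+ti:checkerboard", [])
print(good_url)
print(bad_url)
```

With the unencoded query, spaces become "+" exactly once. With the pre-encoded query, each literal "+" is escaped again as "%2B", so the API would receive a different search string.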
    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
Executes the specified search using a default arXiv API client. For info on default behavior, see Client.__init__ and Client.results.
Deprecated after 2.0.0; use Client.results.
class Client(object):
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        # TODO: develop a more informative string representation.
        return repr(self)

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": start,
                "max_results": page_size,
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.1.3"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed
Specifies a strategy for fetching results from arXiv's API.
This class hides pagination and retry logic, and exposes Client.results.
    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()
Constructs an arXiv API client with the specified options.
Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.
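query_url_format = "https://export.arxiv.org/api/query?{}"
The arXiv query API endpoint format.
page_size: int
Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.
The API's limit is 2000 results per page.
delay_seconds: float
Number of seconds to wait between API requests.
arXiv's Terms of Use ask that you "make no more than one request every three seconds."
num_retries: int
Number of times to retry a failing API request before raising an Exception.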
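The delay is enforced by comparing the time since the previous request with delay_seconds and sleeping for the remainder. The sketch below isolates that arithmetic from the class source as a pure function, so it can be checked without sleeping or making requests. The name seconds_to_sleep and the explicit now argument are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime], now: datetime, delay_seconds: float = 3.0
) -> float:
    """Hypothetical helper: how long to wait before the next request (sketch only)."""
    if last_request is None:
        return 0.0  # no earlier request, so nothing to wait for
    required = timedelta(seconds=delay_seconds)
    since_last_request = now - last_request
    if since_last_request < required:
        return (required - since_last_request).total_seconds()
    return 0.0


t0 = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_sleep(None, t0))                       # 0.0
print(seconds_to_sleep(t0, t0 + timedelta(seconds=1)))  # 2.0
print(seconds_to_sleep(t0, t0 + timedelta(seconds=5)))  # 0.0
```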
    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)
Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Result objects, until max_results results have been yielded or there are no more search results.
If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.
Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.
For more on using generators, see Generators.
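The interaction between max_results and offset can be sketched without network access. Below, fake_results is a hypothetical stand-in for the paged fetch (it pretends the search has ten records, indexed 0 to 9), and limited_results mirrors the limit computation shown in the Client.results source.

```python
import itertools
from typing import Iterator, Optional


def fake_results(offset: int) -> Iterator[int]:
    """Stand-in for the paged fetch: yields record indices from offset up to 9."""
    return iter(range(offset, 10))


def limited_results(max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Mirror of the limit logic in Client.results (sketch only)."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())  # offset is past max_results: nothing to yield
    return itertools.islice(fake_results(offset), limit)


print(list(limited_results(5)))               # [0, 1, 2, 3, 4]
print(list(limited_results(5, offset=2)))     # [2, 3, 4]
print(list(limited_results(5, offset=5)))     # []
print(list(limited_results(None, offset=8)))  # [8, 9]
```

In this sketch max_results counts from the start of the full result set, not from offset, so max_results=5 with offset=2 yields three records rather than five.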
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)
This package's base Exception class.
    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)
Constructs an ArxivError encountered while fetching the specified URL.
retry: int
The request try number which encountered this error; 0 for the initial try, 1 for the first retry, and so on.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )
An error raised when a page of results that should be non-empty is empty.
This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.
See Client.results for usage.
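Because empty pages are usually transient, the client retries them. The sketch below shows the same retry shape as Client._parse_feed in isolation. EmptyPage, fetch_with_retries, and flaky_fetch are hypothetical names invented for this example; the real client also retries on HTTPError and connection errors and waits delay_seconds between requests.

```python
from typing import Callable, List


class EmptyPage(Exception):
    """Hypothetical stand-in for UnexpectedEmptyPageError."""


def fetch_with_retries(
    fetch: Callable[[int], List[str]], num_retries: int = 3, try_index: int = 0
) -> List[str]:
    """Retry fetch up to num_retries times, then re-raise (sketch only)."""
    try:
        return fetch(try_index)
    except EmptyPage:
        if try_index < num_retries:
            return fetch_with_retries(fetch, num_retries, try_index + 1)
        raise


def flaky_fetch(try_index: int) -> List[str]:
    """Simulated page fetch that is empty on the first two tries."""
    if try_index < 2:
        raise EmptyPage("empty page on try %d" % try_index)
    return ["entry-a", "entry-b"]


page = fetch_with_retries(flaky_fetch)
print(page)  # ['entry-a', 'entry-b']
```

With num_retries=3 the fetch is attempted at most four times: the initial try plus three retries.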
    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")
Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.
raw_feed: feedparser.FeedParserDict
The raw output of feedparser.parse. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )
A non-200 status encountered while fetching a page of results.
See Client.results for usage.
    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )
Inherited Members
- builtins.BaseException
- with_traceback
- args