arxiv
arxiv.py
Python wrapper for the arXiv API.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
Usage
Install the package:
$ pip install arxiv # Or `uv add arxiv` or similar.
In your Python code, include the line:
import arxiv
Examples
[!TIP] [arxivql](https://pypi.org/project/arxivql/) may simplify constructing complex query strings.
Fetching results
import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query="quantum",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
Fetching results with a custom client
import arxiv

big_slow_client = arxiv.Client(
    page_size=1000,
    delay_seconds=10.0,
    num_retries=5,
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
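As a rough illustration of the trade-off between `page_size` and `delay_seconds`, the number of page requests and the minimum time spent waiting between them can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not part of the package: `estimate_requests` and `minimum_wait_seconds` are hypothetical helper names, and the estimate ignores retries and network latency.

```python
import math


def estimate_requests(total_results: int, page_size: int) -> int:
    """Number of page requests needed to fetch `total_results` results."""
    return math.ceil(total_results / page_size)


def minimum_wait_seconds(total_results: int, page_size: int, delay_seconds: float) -> float:
    """Lower bound on the time spent sleeping between consecutive requests."""
    requests_needed = estimate_requests(total_results, page_size)
    # The delay applies between requests, so the first request does not wait.
    return max(requests_needed - 1, 0) * delay_seconds


# 2,500 results with page_size=1000 and delay_seconds=10.0 (the client above).
print(estimate_requests(2500, 1000), minimum_wait_seconds(2500, 1000, 10.0))
# The same 2,500 results with the default page_size=100 and delay_seconds=3.0.
print(estimate_requests(2500, 100), minimum_wait_seconds(2500, 100, 3.0))
```

Larger pages mean fewer requests and less time spent waiting, at the cost of slower individual responses.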
Logging
To inspect this package's network behavior and API logic, configure a DEBUG-level logger.
>>> import logging, arxiv
>>> logging.basicConfig(level=logging.DEBUG)
>>> client = arxiv.Client()
>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
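The request URL that appears in the log can be reproduced with the standard library alone. The sketch below assumes only the parameter names visible in that URL (`search_query`, `id_list`, `sortBy`, `sortOrder`, `start`, `max_results`). `build_query_url` is a hypothetical helper written for illustration, not a function exported by this package.

```python
from urllib.parse import urlencode

QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(
    query="",
    id_list=None,
    sort_by="relevance",
    sort_order="descending",
    start=0,
    page_size=100,
):
    """Build an arXiv API query URL from unencoded search parameters."""
    params = {
        "search_query": query,
        "id_list": ",".join(id_list or []),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": str(start),
        "max_results": str(page_size),
    }
    # urlencode percent-encodes the values, so queries are passed unencoded.
    return QUERY_URL_FORMAT.format(urlencode(params))


print(build_query_url(id_list=["1605.08386v1"]))
print(build_query_url(query="au:del_maestro AND ti:checkerboard"))
```

Comparing a logged URL against one built this way can help confirm that a query string is being encoded as expected.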
Types
Client
A Client specifies a reusable strategy for fetching results from arXiv's API. For most use cases the default client should suffice.
Client configurations specify pagination and retry logic. Reusing a client allows successive API calls to use the same connection pool and ensures they abide by the rate limit you set.
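The rate limit amounts to sleeping until `delay_seconds` have passed since the client's previous request. A minimal sketch of that calculation follows. `seconds_to_wait` is a hypothetical standalone function for illustration, not the package's internal method.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_wait(last_request: Optional[datetime], now: datetime, delay_seconds: float) -> float:
    """How long to sleep before the next request to respect the delay."""
    if last_request is None:
        # No request has been made yet, so there is nothing to wait for.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    elapsed = now - last_request
    if elapsed < required:
        return (required - elapsed).total_seconds()
    return 0.0


now = datetime(2024, 1, 1, 12, 0, 0)
print(seconds_to_wait(None, now, 3.0))
print(seconds_to_wait(now - timedelta(seconds=1), now, 3.0))
print(seconds_to_wait(now - timedelta(seconds=5), now, 3.0))
```

Because the timestamp of the last request lives on the client, two separate clients would not share this bookkeeping, which is one reason to reuse a single client.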
Search
A Search specifies a search of arXiv's database. Use Client.results to get a generator yielding Results.
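`Client.results` also accepts an `offset`, which interacts with `max_results`: leading records are skipped, and the number of results yielded shrinks accordingly. The sketch below imitates that interaction with `itertools.islice` over a stand-in generator. `fake_results` and `limited_results` are hypothetical names, and no API requests are made.

```python
import itertools
from typing import Iterator, Optional


def fake_results(total: int, offset: int = 0) -> Iterator[int]:
    """Stand-in for the API: yields result indices starting at `offset`."""
    yield from range(offset, total)


def limited_results(total: int, max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    """Yield at most `max_results - offset` results, starting at `offset`."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        # The offset skips past everything the search was allowed to return.
        return iter(())
    return itertools.islice(fake_results(total, offset), limit)


print(list(limited_results(total=100, max_results=10)))
print(list(limited_results(total=100, max_results=10, offset=4)))
print(list(limited_results(total=100, max_results=10, offset=12)))
print(list(limited_results(total=5, max_results=None, offset=2)))
```

With `max_results=None` there is no cap, so iteration continues until the stand-in result set is exhausted.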
Result
The Result objects yielded by Client.results include metadata about each paper and helper methods for downloading their content.
The meaning of the underlying raw data is documented in the arXiv API User Manual: Details of Atom Results Returned.
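Result also exposes helper methods for downloading papers: Result.download_pdf and Result.download_source. Their docstrings mark both as deprecated; future versions will not provide download helpers, so prefer using Result.pdf_url and Result.source_url directly.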
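When no filename is given, the download helpers derive one from the result's short ID and title. The following sketch shows that derivation on plain strings, assuming an entry URL of the form `https://arxiv.org/abs/{id}`. `short_id` and `default_filename` are hypothetical standalone functions for illustration, not the package's own methods.

```python
import re


def short_id(entry_id: str) -> str:
    """Extract the arXiv identifier from an abstract-page URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


def default_filename(entry_id: str, title: str, extension: str = "pdf") -> str:
    """Combine the short ID and a sanitized title into a filename."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            # Legacy identifiers contain a slash, which is unsafe in filenames.
            short_id(entry_id).replace("/", "_"),
            # Replace every non-word character in the title with an underscore.
            re.sub(r"[^\w]", "_", nonempty_title),
            extension,
        ]
    )


print(short_id("https://arxiv.org/abs/2107.05580v1"))
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))
print(default_filename("https://arxiv.org/abs/quant-ph/0201082v1", "A Title", "tar.gz"))
```

Passing an explicit `filename` to a download helper bypasses this derivation entirely.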
Development
This project uses UV for development, while maintaining compatibility with traditional pip installation for end users.
Development Setup
Install UV (if you haven't already):
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone and setup:
git clone https://github.com/lukasschwab/arxiv.py
cd arxiv.py
make dev-setup
""".. include:: ../README.md"""

from __future__ import annotations

import logging
import time
import itertools
import feedparser
import os
import math
import re
import requests
import warnings

from urllib.parse import urlencode, urlparse
from urllib.request import urlretrieve
from datetime import datetime, timedelta, timezone
from calendar import timegm

from enum import Enum
from typing import TYPE_CHECKING, Generator, Iterator

if TYPE_CHECKING:
    from typing_extensions import TypedDict
    import feedparser

    class FeedParserDict(TypedDict, total=False):
        id: str
        title: str
        summary: str
        authors: list[dict[str, str]]
        links: list[dict[str, str]]
        tags: list[dict[str, str]]
        updated_parsed: time.struct_time
        published_parsed: time.struct_time
        arxiv_comment: str
        arxiv_journal_ref: str
        arxiv_doi: str
        arxiv_primary_category: dict[str, str]


logger = logging.getLogger(__name__)

_DEFAULT_TIME = datetime.min


class Result:
    """
    An entry in an arXiv query results feed.

    See [the arXiv API User's Manual: Details of Atom Results
    Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).
    """

    entry_id: str
    """A url of the form `https://arxiv.org/abs/{id}`."""
    updated: datetime
    """When the result was last updated."""
    published: datetime
    """When the result was originally published."""
    title: str
    """The title of the result."""
    authors: list[Result.Author]
    """The result's authors."""
    summary: str
    """The result abstract."""
    comment: str | None
    """The authors' comment if present."""
    journal_ref: str | None
    """A journal reference if present."""
    doi: str | None
    """A URL for the resolved DOI to an external resource if present."""
    primary_category: str
    """
    The result's primary arXiv category. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    categories: list[str]
    """
    All of the result's categories. See [arXiv: Category
    Taxonomy](https://arxiv.org/category_taxonomy).
    """
    links: list[Result.Link]
    """Up to three URLs associated with this result."""
    pdf_url: str | None
    """The URL of a PDF version of this result if present among links."""
    _raw: feedparser.FeedParserDict
    """
    The raw feedparser result object if this Result was constructed with
    Result._from_feed_entry.
    """

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: list[Result.Author] | None = None,
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: list[str] | None = None,
        links: list[Result.Link] | None = None,
        _raw: feedparser.FeedParserDict | None = None,
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer using `Result._from_feed_entry` to parsing and
        constructing `Result`s yourself.
        """
        self.entry_id = entry_id
        self.updated = updated
        self.published = published
        self.title = title
        self.authors = authors or []
        self.summary = summary
        self.comment = comment
        self.journal_ref = journal_ref
        self.doi = doi
        self.primary_category = primary_category
        self.categories = categories or []
        self.links = links or []
        # Calculated members
        self.pdf_url = Result._get_pdf_url(self.links)
        # Debugging
        self._raw = _raw

    @classmethod
    def _from_feed_entry(cls, entry: feedparser.FeedParserDict) -> Result:
        """
        Converts a feedparser entry for an arXiv search result feed into a
        Result object.
        """
        if not hasattr(entry, "id"):
            raise Result.MissingFieldError("id")
        # Title attribute may be absent for certain titles. Defaulting to "0" as
        # it's the only title observed to cause this bug.
        # https://github.com/lukasschwab/arxiv.py/issues/71
        # title = entry.title if hasattr(entry, "title") else "0"
        title = "0"
        if hasattr(entry, "title"):
            title = entry.title
        else:
            logger.warning("Result %s is missing title attribute; defaulting to '0'", entry.id)
        return Result(
            entry_id=entry.id,
            updated=Result._to_datetime(entry.updated_parsed),
            published=Result._to_datetime(entry.published_parsed),
            title=re.sub(r"\s+", " ", title),
            authors=[Result.Author._from_feed_author(a) for a in entry.authors],
            summary=entry.summary,
            comment=entry.get("arxiv_comment"),
            journal_ref=entry.get("arxiv_journal_ref"),
            doi=entry.get("arxiv_doi"),
            primary_category=entry.arxiv_primary_category.get("term"),
            categories=[tag.get("term") for tag in entry.tags],
            links=[Result.Link._from_feed_link(link) for link in entry.links],
            _raw=entry,
        )

    def __str__(self) -> str:
        return self.entry_id

    def __repr__(self) -> str:
        return (
            "{}(entry_id={}, updated={}, published={}, title={}, authors={}, "
            "summary={}, comment={}, journal_ref={}, doi={}, "
            "primary_category={}, categories={}, links={})"
        ).format(
            _classname(self),
            repr(self.entry_id),
            repr(self.updated),
            repr(self.published),
            repr(self.title),
            repr(self.authors),
            repr(self.summary),
            repr(self.comment),
            repr(self.journal_ref),
            repr(self.doi),
            repr(self.primary_category),
            repr(self.categories),
            repr(self.links),
        )

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Result):
            return self.entry_id == other.entry_id
        return False

    def get_short_id(self) -> str:
        """
        Returns the short ID for this result.

        + If the result URL is `"https://arxiv.org/abs/2107.05580v1"`,
        `result.get_short_id()` returns `2107.05580v1`.

        + If the result URL is `"https://arxiv.org/abs/quant-ph/0201082v1"`,
        `result.get_short_id()` returns `"quant-ph/0201082v1"` (the pre-March
        2007 arXiv identifier format).

        For an explanation of the difference between arXiv's legacy and current
        identifiers, see [Understanding the arXiv
        identifier](https://arxiv.org/help/arxiv_identifier).
        """
        return self.entry_id.split("arxiv.org/abs/")[-1]

    def _get_default_filename(self, extension: str = "pdf") -> str:
        """
        A default `to_filename` function for the extension given.
        """
        nonempty_title = self.title if self.title else "UNTITLED"
        return ".".join(
            [
                self.get_short_id().replace("/", "_"),
                re.sub(r"[^\w]", "_", nonempty_title),
                extension,
            ]
        )

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the PDF for this result to the specified directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.pdf_url` directly.
        """
        if not filename:
            filename = self._get_default_filename()
        path = os.path.join(dirpath, filename)
        if self.pdf_url is None:
            raise ValueError("No PDF URL available for this result")
        pdf_url = Result._substitute_domain(self.pdf_url, download_domain)
        written_path, _ = urlretrieve(pdf_url, path)
        return written_path

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org",
    ) -> str:
        """
        Downloads the source tarfile for this result to the specified
        directory.

        The filename is generated by calling `to_filename(self)`.

        **Deprecated:** future versions of this client library will not provide
        download helpers (out of scope). Use `result.source_url` directly.
        """
        if not filename:
            filename = self._get_default_filename("tar.gz")
        path = os.path.join(dirpath, filename)
        source_url_str = self.source_url()
        if source_url_str is None:
            raise ValueError("No source URL available for this result")
        source_url = Result._substitute_domain(source_url_str, download_domain)
        written_path, _ = urlretrieve(source_url, path)
        return written_path

    def source_url(self) -> str | None:
        """
        Derives a URL for the source tarfile for this result.
        """
        if self.pdf_url is None:
            return None
        return self.pdf_url.replace("/pdf/", "/src/")

    @staticmethod
    def _get_pdf_url(links: list[Result.Link]) -> str | None:
        """
        Finds the PDF link among a result's links and returns its URL.

        Should only be called once for a given `Result`, in its constructor.
        After construction, the URL should be available in `Result.pdf_url`.
        """
        pdf_urls = [link.href for link in links if link.title == "pdf"]
        if len(pdf_urls) == 0:
            return None
        elif len(pdf_urls) > 1:
            logger.warning("Result has multiple PDF links; using %s", pdf_urls[0])
        return pdf_urls[0]

    @staticmethod
    def _to_datetime(ts: time.struct_time) -> datetime:
        """
        Converts a UTC time.struct_time into a time-zone-aware datetime.

        This will be replaced with feedparser functionality [when it becomes
        available](https://github.com/kurtmckee/feedparser/issues/212).
        """
        return datetime.fromtimestamp(timegm(ts), tz=timezone.utc)

    @staticmethod
    def _substitute_domain(url: str, domain: str) -> str:
        """
        Replaces the domain of the given URL with the specified domain.

        This is useful for testing purposes.
        """
        parsed_url = urlparse(url)
        return parsed_url._replace(netloc=domain).geturl()

    class Author:
        """
        A light inner class for representing a result's authors.
        """

        name: str
        """The author's name."""

        def __init__(self, name: str):
            """
            Constructs an `Author` with the specified name.

            In most cases, prefer using `Author._from_feed_author` to parsing
            and constructing `Author`s yourself.
            """
            self.name = name

        @classmethod
        def _from_feed_author(cls, feed_author: feedparser.FeedParserDict) -> Result.Author:
            """
            Constructs an `Author` with the name specified in an author object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Author(feed_author.name)

        def __str__(self) -> str:
            return self.name

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.name))

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Author):
                return self.name == other.name
            return False

    class Link:
        """
        A light inner class for representing a result's links.
        """

        href: str
        """The link's `href` attribute."""
        title: str | None
        """The link's title."""
        rel: str
        """The link's relationship to the `Result`."""
        content_type: str | None
        """The link's HTTP content type."""

        def __init__(
            self,
            href: str,
            title: str | None = None,
            rel: str = "",
            content_type: str | None = None,
        ):
            """
            Constructs a `Link` with the specified link metadata.

            In most cases, prefer using `Link._from_feed_link` to parsing and
            constructing `Link`s yourself.
            """
            self.href = href
            self.title = title
            self.rel = rel
            self.content_type = content_type

        @classmethod
        def _from_feed_link(cls, feed_link: feedparser.FeedParserDict) -> Result.Link:
            """
            Constructs a `Link` with link metadata specified in a link object
            from a feed entry.

            See usage in `Result._from_feed_entry`.
            """
            return Result.Link(
                href=feed_link.href,
                title=feed_link.get("title"),
                rel=feed_link.get("rel") or "",
                content_type=feed_link.get("content_type"),
            )

        def __str__(self) -> str:
            return self.href

        def __repr__(self) -> str:
            return "{}({}, title={}, rel={}, content_type={})".format(
                _classname(self),
                repr(self.href),
                repr(self.title),
                repr(self.rel),
                repr(self.content_type),
            )

        def __eq__(self, other: object) -> bool:
            if isinstance(other, Result.Link):
                return self.href == other.href
            return False

    class MissingFieldError(Exception):
        """
        An error indicating an entry is unparseable because it lacks required
        fields.
        """

        missing_field: str
        """The required field missing from the would-be entry."""
        message: str
        """Message describing what caused this error."""

        def __init__(self, missing_field: str):
            self.missing_field = missing_field
            self.message = "Entry from arXiv missing required info"

        def __repr__(self) -> str:
            return "{}({})".format(_classname(self), repr(self.missing_field))


class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    """
    A SortOrder indicates order in which search results are sorted according
    to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"


class Search:
    """
    A specification for a search of arXiv's database.

    To run a search, use `Search.run` to use a default client or `Client.run`
    with a specific client.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: list[str] | None = None,
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list or []
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        if self.query and self.id_list:
            return f"Search(query='{self.query}', id_list={len(self.id_list)} items)"
        elif self.query:
            return f"Search(query='{self.query}')"
        elif self.id_list:
            return f"Search(id_list={len(self.id_list)} items)"
        else:
            return "Search(empty)"

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Iterator[Result]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)


class Client:
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class obscures pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime | None
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        return f"Client(page_size={self.page_size}, delay={self.delay_seconds}s, retries={self.num_retries})"

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": str(start),
                "max_results": str(page_size),
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.2"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed


class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )


class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )


def _classname(o: object) -> str:
    """A helper function for use in __repr__ methods: arxiv.Result.Link."""
    return "arxiv.{}".format(o.__class__.__qualname__)
Use `result.source_url` directly. 262 """ 263 if not filename: 264 filename = self._get_default_filename("tar.gz") 265 path = os.path.join(dirpath, filename) 266 source_url_str = self.source_url() 267 if source_url_str is None: 268 raise ValueError("No source URL available for this result") 269 source_url = Result._substitute_domain(source_url_str, download_domain) 270 written_path, _ = urlretrieve(source_url, path) 271 return written_path 272 273 def source_url(self) -> str | None: 274 """ 275 Derives a URL for the source tarfile for this result. 276 """ 277 if self.pdf_url is None: 278 return None 279 return self.pdf_url.replace("/pdf/", "/src/") 280 281 @staticmethod 282 def _get_pdf_url(links: list[Result.Link]) -> str | None: 283 """ 284 Finds the PDF link among a result's links and returns its URL. 285 286 Should only be called once for a given `Result`, in its constructor. 287 After construction, the URL should be available in `Result.pdf_url`. 288 """ 289 pdf_urls = [link.href for link in links if link.title == "pdf"] 290 if len(pdf_urls) == 0: 291 return None 292 elif len(pdf_urls) > 1: 293 logger.warning("Result has multiple PDF links; using %s", pdf_urls[0]) 294 return pdf_urls[0] 295 296 @staticmethod 297 def _to_datetime(ts: time.struct_time) -> datetime: 298 """ 299 Converts a UTC time.struct_time into a time-zone-aware datetime. 300 301 This will be replaced with feedparser functionality [when it becomes 302 available](https://github.com/kurtmckee/feedparser/issues/212). 303 """ 304 return datetime.fromtimestamp(timegm(ts), tz=timezone.utc) 305 306 @staticmethod 307 def _substitute_domain(url: str, domain: str) -> str: 308 """ 309 Replaces the domain of the given URL with the specified domain. 310 311 This is useful for testing purposes. 312 """ 313 parsed_url = urlparse(url) 314 return parsed_url._replace(netloc=domain).geturl() 315 316 class Author: 317 """ 318 A light inner class for representing a result's authors. 
319 """ 320 321 name: str 322 """The author's name.""" 323 324 def __init__(self, name: str): 325 """ 326 Constructs an `Author` with the specified name. 327 328 In most cases, prefer using `Author._from_feed_author` to parsing 329 and constructing `Author`s yourself. 330 """ 331 self.name = name 332 333 @classmethod 334 def _from_feed_author(cls, feed_author: feedparser.FeedParserDict) -> Result.Author: 335 """ 336 Constructs an `Author` with the name specified in an author object 337 from a feed entry. 338 339 See usage in `Result._from_feed_entry`. 340 """ 341 return Result.Author(feed_author.name) 342 343 def __str__(self) -> str: 344 return self.name 345 346 def __repr__(self) -> str: 347 return "{}({})".format(_classname(self), repr(self.name)) 348 349 def __eq__(self, other: object) -> bool: 350 if isinstance(other, Result.Author): 351 return self.name == other.name 352 return False 353 354 class Link: 355 """ 356 A light inner class for representing a result's links. 357 """ 358 359 href: str 360 """The link's `href` attribute.""" 361 title: str | None 362 """The link's title.""" 363 rel: str 364 """The link's relationship to the `Result`.""" 365 content_type: str | None 366 """The link's HTTP content type.""" 367 368 def __init__( 369 self, 370 href: str, 371 title: str | None = None, 372 rel: str = "", 373 content_type: str | None = None, 374 ): 375 """ 376 Constructs a `Link` with the specified link metadata. 377 378 In most cases, prefer using `Link._from_feed_link` to parsing and 379 constructing `Link`s yourself. 380 """ 381 self.href = href 382 self.title = title 383 self.rel = rel 384 self.content_type = content_type 385 386 @classmethod 387 def _from_feed_link(cls, feed_link: feedparser.FeedParserDict) -> Result.Link: 388 """ 389 Constructs a `Link` with link metadata specified in a link object 390 from a feed entry. 391 392 See usage in `Result._from_feed_entry`. 
393 """ 394 return Result.Link( 395 href=feed_link.href, 396 title=feed_link.get("title"), 397 rel=feed_link.get("rel") or "", 398 content_type=feed_link.get("content_type"), 399 ) 400 401 def __str__(self) -> str: 402 return self.href 403 404 def __repr__(self) -> str: 405 return "{}({}, title={}, rel={}, content_type={})".format( 406 _classname(self), 407 repr(self.href), 408 repr(self.title), 409 repr(self.rel), 410 repr(self.content_type), 411 ) 412 413 def __eq__(self, other: object) -> bool: 414 if isinstance(other, Result.Link): 415 return self.href == other.href 416 return False 417 418 class MissingFieldError(Exception): 419 """ 420 An error indicating an entry is unparseable because it lacks required 421 fields. 422 """ 423 424 missing_field: str 425 """The required field missing from the would-be entry.""" 426 message: str 427 """Message describing what caused this error.""" 428 429 def __init__(self, missing_field: str): 430 self.missing_field = missing_field 431 self.message = "Entry from arXiv missing required info" 432 433 def __repr__(self) -> str: 434 return "{}({})".format(_classname(self), repr(self.missing_field))
An entry in an arXiv query results feed.
See the arXiv API User's Manual: Details of Atom Results Returned.
def __init__(
    self,
    entry_id: str,
    updated: datetime = _DEFAULT_TIME,
    published: datetime = _DEFAULT_TIME,
    title: str = "",
    authors: list[Result.Author] | None = None,
    summary: str = "",
    comment: str = "",
    journal_ref: str = "",
    doi: str = "",
    primary_category: str = "",
    categories: list[str] | None = None,
    links: list[Result.Link] | None = None,
    _raw: feedparser.FeedParserDict | None = None,
):
Constructs an arXiv search result item.
In most cases, prefer using Result._from_feed_entry to parsing and
constructing Results yourself.
def get_short_id(self) -> str:
Returns the short ID for this result.
- If the result URL is "https://arxiv.org/abs/2107.05580v1", result.get_short_id() returns "2107.05580v1".
- If the result URL is "https://arxiv.org/abs/quant-ph/0201082v1", result.get_short_id() returns "quant-ph/0201082v1" (the pre-March 2007 arXiv identifier format).
For an explanation of the difference between arXiv's legacy and current identifiers, see Understanding the arXiv identifier.
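The two cases above can be reproduced without the library. The following is a minimal sketch that mirrors the one-line body of get_short_id shown in the source; the standalone function name short_id is an assumption made for illustration and is not part of the package.

```python
def short_id(entry_id: str) -> str:
    """Return everything after 'arxiv.org/abs/' in an abstract URL."""
    # Same expression as the body of Result.get_short_id in the source listing.
    return entry_id.split("arxiv.org/abs/")[-1]


# Current-style identifier.
new_style = short_id("https://arxiv.org/abs/2107.05580v1")
# Legacy (pre-March 2007) identifier, which keeps its archive prefix.
legacy = short_id("https://arxiv.org/abs/quant-ph/0201082v1")
print(new_style, legacy)
```

Because the split takes the last segment, a string that does not contain "arxiv.org/abs/" is returned unchanged.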
def download_pdf(
    self,
    dirpath: str = "./",
    filename: str = "",
    download_domain: str = "export.arxiv.org",
) -> str:
Downloads the PDF for this result to the specified directory.
If filename is empty, a default filename is generated by _get_default_filename.
Deprecated: future versions of this client library will not provide
download helpers (out of scope). Use result.pdf_url directly.
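Since the download helpers are deprecated, callers may want to fetch result.pdf_url themselves. The sketch below assumes only the standard library and mirrors the steps visible in the source of download_pdf: build a default filename, swap the domain, then call urlretrieve. The function names substitute_domain, default_filename, and download are hypothetical stand-ins for the private helpers, not the package's API.

```python
import os
import re
from urllib.parse import urlparse
from urllib.request import urlretrieve


def substitute_domain(url: str, domain: str) -> str:
    """Replace the host of `url` with `domain`, keeping scheme and path."""
    return urlparse(url)._replace(netloc=domain).geturl()


def default_filename(short_id: str, title: str, extension: str = "pdf") -> str:
    """Build '<short id>.<sanitized title>.<extension>' as the source does."""
    nonempty_title = title if title else "UNTITLED"
    return ".".join(
        [
            short_id.replace("/", "_"),
            re.sub(r"[^\w]", "_", nonempty_title),
            extension,
        ]
    )


def download(url: str, dirpath: str, filename: str, domain: str = "export.arxiv.org") -> str:
    """Download `url` (rewritten to `domain`) into `dirpath`; performs a network request."""
    path = os.path.join(dirpath, filename)
    written_path, _ = urlretrieve(substitute_domain(url, domain), path)
    return written_path


# No network access happens here; these only show the derived values.
print(substitute_domain("https://arxiv.org/pdf/1605.08386v1", "export.arxiv.org"))
print(default_filename("quant-ph/0201082v1", "A Title!"))
```

Calling download(result.pdf_url, "./", default_filename(result.get_short_id(), result.title)) would then approximate the deprecated helper, under the assumption that pdf_url is not None.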
def download_source(
    self,
    dirpath: str = "./",
    filename: str = "",
    download_domain: str = "export.arxiv.org",
) -> str:
Downloads the source tarfile for this result to the specified directory.
If filename is empty, a default filename with a tar.gz extension is generated by _get_default_filename.
Deprecated: future versions of this client library will not provide
download helpers (out of scope). Use result.source_url() directly.
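The source listing derives the source tarfile URL from the PDF URL by a path substitution. A minimal sketch of that derivation follows; the free function derive_source_url is an assumption for illustration, written to match the body of Result.source_url.

```python
from typing import Optional


def derive_source_url(pdf_url: Optional[str]) -> Optional[str]:
    """Swap the '/pdf/' path segment for '/src/'; pass None through."""
    if pdf_url is None:
        return None
    return pdf_url.replace("/pdf/", "/src/")


print(derive_source_url("https://arxiv.org/pdf/1605.08386v1"))
```

A result with no PDF link therefore has no source URL either, which is why download_source raises ValueError in that case.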
class Author:
A light inner class for representing a result's authors.
class Link:
A light inner class for representing a result's links.
def __init__(
    self,
    href: str,
    title: str | None = None,
    rel: str = "",
    content_type: str | None = None,
):
Constructs a Link with the specified link metadata.
In most cases, prefer using Link._from_feed_link to parsing and constructing Links yourself.
class MissingFieldError(Exception):
An error indicating an entry is unparseable because it lacks required fields.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class SortCriterion(Enum):
    """
    A SortCriterion identifies a property by which search results can be
    sorted.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"
A SortCriterion identifies a property by which search results can be sorted.
See the arXiv API User's Manual: sort order for return results.
Inherited Members
- enum.Enum
- name
- value
class SortOrder(Enum):
    """
    A SortOrder indicates the order in which search results are sorted
    according to the specified arxiv.SortCriterion.

    See [the arXiv API User's Manual: sort order for return
    results](https://arxiv.org/help/api/user-manual#sort).
    """

    Ascending = "ascending"
    Descending = "descending"
A SortOrder indicates the order in which search results are sorted according to the specified SortCriterion.
See the arXiv API User's Manual: sort order for return results.
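To make the role of the two enums concrete, here is a small sketch of how their values end up as the sortBy and sortOrder request parameters, following Search._url_args in the source. The enum classes are redefined locally so the example runs without the package, and the helper name sort_args is an assumption.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    # Local copy of the values listed in the package source.
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def sort_args(query: str, sort_by: SortCriterion, sort_order: SortOrder) -> dict:
    """Build the query-string parameters that carry the sort settings."""
    return {
        "search_query": query,
        "sortBy": sort_by.value,
        "sortOrder": sort_order.value,
    }


args = sort_args("quantum", SortCriterion.SubmittedDate, SortOrder.Descending)
print(urlencode(args))
```

The enum members exist so that callers pass symbolic names while the client sends the exact strings the API expects.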
Inherited Members
- enum.Enum
- name
- value
class Search:
    """
    A specification for a search of arXiv's database.

    To run a search, pass it to `Client.results`. `Search.results` runs the
    search with a default client, but is deprecated.
    """

    query: str
    """
    A query string.

    This should be unencoded. Use `au:del_maestro AND ti:checkerboard`, not
    `au:del_maestro+AND+ti:checkerboard`.

    See [the arXiv API User's Manual: Details of Query
    Construction](https://arxiv.org/help/api/user-manual#query_details).
    """
    id_list: list[str]
    """
    A list of arXiv article IDs to which to limit the search.

    See [the arXiv API User's
    Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
    for documentation of the interaction between `query` and `id_list`.
    """
    max_results: int | None
    """
    The maximum number of results to be returned in an execution of this
    search. To fetch every result available, set `max_results=None`.

    The API's limit is 300,000 results per query.
    """
    sort_by: SortCriterion
    """The sort criterion for results."""
    sort_order: SortOrder
    """The sort order for results."""

    def __init__(
        self,
        query: str = "",
        id_list: list[str] | None = None,
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending,
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        """
        self.query = query
        self.id_list = id_list or []
        # Handle deprecated v1 default behavior.
        self.max_results = None if max_results == math.inf else max_results
        self.sort_by = sort_by
        self.sort_order = sort_order

    def __str__(self) -> str:
        if self.query and self.id_list:
            return f"Search(query='{self.query}', id_list={len(self.id_list)} items)"
        elif self.query:
            return f"Search(query='{self.query}')"
        elif self.id_list:
            return f"Search(id_list={len(self.id_list)} items)"
        else:
            return "Search(empty)"

    def __repr__(self) -> str:
        return ("{}(query={}, id_list={}, max_results={}, sort_by={}, sort_order={})").format(
            _classname(self),
            repr(self.query),
            repr(self.id_list),
            repr(self.max_results),
            repr(self.sort_by),
            repr(self.sort_order),
        )

    def _url_args(self) -> dict[str, str]:
        """
        Returns a dict of search parameters that should be included in an API
        request for this search.
        """
        return {
            "search_query": self.query,
            "id_list": ",".join(self.id_list),
            "sortBy": self.sort_by.value,
            "sortOrder": self.sort_order.value,
        }

    def results(self, offset: int = 0) -> Iterator[Result]:
        """
        Executes the specified search using a default arXiv API client. For info
        on default behavior, see `Client.__init__` and `Client.results`.

        **Deprecated** after 2.0.0; use `Client.results`.
        """
        warnings.warn(
            "The 'Search.results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self, offset=offset)
A specification for a search of arXiv's database.
To run a search, pass it to Client.results. Search.results runs the search
with a default client, but is deprecated after 2.0.0.
def __init__(
    self,
    query: str = "",
    id_list: list[str] | None = None,
    max_results: int | None = None,
    sort_by: SortCriterion = SortCriterion.Relevance,
    sort_order: SortOrder = SortOrder.Descending,
):
Constructs an arXiv API search with the specified criteria.
query: str
A query string.
This should be unencoded. Use au:del_maestro AND ti:checkerboard, not
au:del_maestro+AND+ti:checkerboard.
See the arXiv API User's Manual: Details of Query Construction.
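The reason the query should be unencoded is that the client URL-encodes it when building the request (the source passes the arguments through urlencode). The sketch below uses only the standard library to show what happens in each case; the helper name encode_query is an assumption for illustration.

```python
from urllib.parse import urlencode


def encode_query(query: str) -> str:
    """Encode a query the way a urlencode-based request builder would."""
    return urlencode({"search_query": query})


# Unencoded input: spaces become '+' and ':' becomes '%3A' exactly once.
good = encode_query("au:del_maestro AND ti:checkerboard")
# Pre-encoded input: the literal '+' characters are escaped again as '%2B',
# so the API would no longer see them as spaces.
bad = encode_query("au:del_maestro+AND+ti:checkerboard")

print(good)
print(bad)
```

Passing an already-encoded string therefore changes the meaning of the query rather than saving any work.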
id_list: list[str]
A list of arXiv article IDs to which to limit the search.
See the arXiv API User's Manual for documentation of the interaction between query and id_list.
max_results: int | None
The maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=None.
The API's limit is 300,000 results per query.
def results(self, offset: int = 0) -> Iterator[Result]:
Executes the specified search using a default arXiv API client. For info
on default behavior, see Client.__init__ and Client.results.
Deprecated after 2.0.0; use Client.results.
class Client:
    """
    Specifies a strategy for fetching results from arXiv's API.

    This class hides pagination and retry logic, and exposes
    `Client.results`.
    """

    query_url_format = "https://export.arxiv.org/api/query?{}"
    """
    The arXiv query API endpoint format.
    """
    page_size: int
    """
    Maximum number of results fetched in a single API request. Smaller pages can
    be retrieved faster, but may require more round-trips.

    The API's limit is 2000 results per page.
    """
    delay_seconds: float
    """
    Number of seconds to wait between API requests.

    [arXiv's Terms of Use](https://arxiv.org/help/api/tou) ask that you "make no
    more than one request every three seconds."
    """
    num_retries: int
    """
    Number of times to retry a failing API request before raising an Exception.
    """

    _last_request_dt: datetime | None
    _session: requests.Session

    def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
        """
        Constructs an arXiv API client with the specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv [API Terms of Use](https://arxiv.org/help/api/tou),
        brittle behavior, and inconsistent results.
        """
        self.page_size = page_size
        self.delay_seconds = delay_seconds
        self.num_retries = num_retries
        self._last_request_dt = None
        self._session = requests.Session()

    def __str__(self) -> str:
        return f"Client(page_size={self.page_size}, delay={self.delay_seconds}s, retries={self.num_retries})"

    def __repr__(self) -> str:
        return "{}(page_size={}, delay_seconds={}, num_retries={})".format(
            _classname(self),
            repr(self.page_size),
            repr(self.delay_seconds),
            repr(self.num_retries),
        )

    def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
        """
        Uses this client configuration to fetch one page of the search results
        at a time, yielding the parsed `Result`s, until `max_results` results
        have been yielded or there are no more search results.

        If all tries fail, raises an `UnexpectedEmptyPageError` or `HTTPError`.

        Setting a nonzero `offset` discards leading records in the result set.
        When `offset` is greater than or equal to `search.max_results`, the full
        result set is discarded.

        For more on using generators, see
        [Generators](https://wiki.python.org/moin/Generators).
        """
        limit = search.max_results - offset if search.max_results else None
        if limit and limit < 0:
            return iter(())
        return itertools.islice(self._results(search, offset), limit)

    def _results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        page_url = self._format_url(search, offset, self.page_size)
        feed = self._parse_feed(page_url, first_page=True)
        if not feed.entries:
            logger.info("Got empty first page; stopping generation")
            return
        total_results = int(feed.feed.opensearch_totalresults)
        logger.info(
            "Got first page: %d of %d total results",
            len(feed.entries),
            total_results,
        )

        while feed.entries:
            for entry in feed.entries:
                try:
                    yield Result._from_feed_entry(entry)
                except Result.MissingFieldError as e:
                    logger.warning("Skipping partial result: %s", e)
            offset += len(feed.entries)
            if offset >= total_results:
                break
            page_url = self._format_url(search, offset, self.page_size)
            feed = self._parse_feed(page_url, first_page=False)

    def _format_url(self, search: Search, start: int, page_size: int) -> str:
        """
        Constructs a request URL for `search` that returns up to `page_size`
        results starting with the result at index `start`.
        """
        url_args = search._url_args()
        url_args.update(
            {
                "start": str(start),
                "max_results": str(page_size),
            }
        )
        return self.query_url_format.format(urlencode(url_args))

    def _parse_feed(
        self, url: str, first_page: bool = True, _try_index: int = 0
    ) -> feedparser.FeedParserDict:
        """
        Fetches the specified URL and parses it with feedparser.

        If a request fails or is unexpectedly empty, retries the request up to
        `self.num_retries` times.
        """
        try:
            return self.__try_parse_feed(url, first_page=first_page, try_index=_try_index)
        except (
            HTTPError,
            UnexpectedEmptyPageError,
            requests.exceptions.ConnectionError,
        ) as err:
            if _try_index < self.num_retries:
                logger.debug("Got error (try %d): %s", _try_index, err)
                return self._parse_feed(url, first_page=first_page, _try_index=_try_index + 1)
            logger.debug("Giving up (try %d): %s", _try_index, err)
            raise err

    def __try_parse_feed(
        self,
        url: str,
        first_page: bool,
        try_index: int,
    ) -> feedparser.FeedParserDict:
        """
        Recursive helper for _parse_feed. Enforces `self.delay_seconds`: if that
        number of seconds has not passed since `_parse_feed` was last called,
        sleeps until delay_seconds seconds have passed.
        """
        # If this call would violate the rate limit, sleep until it doesn't.
        if self._last_request_dt is not None:
            required = timedelta(seconds=self.delay_seconds)
            since_last_request = datetime.now() - self._last_request_dt
            if since_last_request < required:
                to_sleep = (required - since_last_request).total_seconds()
                logger.info("Sleeping: %f seconds", to_sleep)
                time.sleep(to_sleep)

        logger.info("Requesting page (first: %r, try: %d): %s", first_page, try_index, url)

        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.3.2"})
        self._last_request_dt = datetime.now()
        if resp.status_code != requests.codes.OK:
            raise HTTPError(url, try_index, resp.status_code)

        feed = feedparser.parse(resp.content)
        if len(feed.entries) == 0 and not first_page:
            raise UnexpectedEmptyPageError(url, try_index, feed)

        if feed.bozo:
            logger.warning(
                "Bozo feed; consider handling: %s",
                feed.bozo_exception if "bozo_exception" in feed else None,
            )

        return feed
Specifies a strategy for fetching results from arXiv's API.
This class hides pagination and retry logic, and exposes
Client.results.
def __init__(self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3):
Constructs an arXiv API client with the specified options.
Note: the default parameters should provide a robust request strategy for most use cases. Extreme page sizes, delays, or retries risk violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.
query_url_format = "https://export.arxiv.org/api/query?{}"
The arXiv query API endpoint format.
page_size: int
Maximum number of results fetched in a single API request. Smaller pages can be retrieved faster, but may require more round-trips.
The API's limit is 2000 results per page.
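The trade-off between page size and round-trips can be estimated with a little arithmetic. The sketch below is a simplification: it assumes every page comes back full, whereas the client in the source advances by the number of entries actually returned. The function name page_starts is hypothetical.

```python
from typing import List


def page_starts(total_results: int, page_size: int) -> List[int]:
    """List the `start` offsets needed to cover `total_results` records."""
    return list(range(0, total_results, page_size))


# 250 matching records with the default page size of 100: three requests.
default_pages = page_starts(250, 100)
# The same records with a page size of 1000: a single, larger request.
big_pages = page_starts(250, 1000)
print(default_pages, big_pages)
```

With the default delay between requests, the number of offsets gives a rough lower bound on how long a full crawl takes.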
delay_seconds: float
Number of seconds to wait between API requests.
arXiv's Terms of Use ask that you "make no more than one request every three seconds."
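The delay is enforced by comparing the time since the previous request against delay_seconds and sleeping for the remainder. A minimal sketch of that calculation follows, modeled on the rate-limit check in the source; the function seconds_to_sleep and its explicit now parameter are assumptions introduced so the logic can be run without real waiting.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_to_sleep(
    last_request: Optional[datetime], now: datetime, delay_seconds: float
) -> float:
    """Return how long to wait before the next request may be sent."""
    if last_request is None:
        # No earlier request, so nothing to wait for.
        return 0.0
    required = timedelta(seconds=delay_seconds)
    since_last_request = now - last_request
    if since_last_request < required:
        return (required - since_last_request).total_seconds()
    return 0.0


last = datetime(2024, 1, 1, 12, 0, 0)
# One second after the previous request with a three-second delay.
wait = seconds_to_sleep(last, last + timedelta(seconds=1), 3.0)
print(wait)
```

A caller would pass the result to time.sleep before issuing the request.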
def results(self, search: Search, offset: int = 0) -> Iterator[Result]:
Uses this client configuration to fetch one page of the search results at a time, yielding the parsed Results, until max_results results have been yielded or there are no more search results.
If all tries fail, raises an UnexpectedEmptyPageError or HTTPError.
Setting a nonzero offset discards leading records in the result set. When offset is greater than or equal to search.max_results, the full result set is discarded.
For more on using generators, see Generators (https://wiki.python.org/moin/Generators).
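The interaction between offset and max_results is easiest to see with a stand-in generator. The following is a minimal sketch, not the package's implementation: `fake_results` is a hypothetical replacement for the client's internal generator, assumed to start yielding at `offset` from a result set of 100 records, and the free function `results` reuses the limit arithmetic from the source listing.

```python
import itertools
from typing import Iterator, Optional


def fake_results(offset: int) -> Iterator[int]:
    """Hypothetical stand-in: yields record indices offset..99."""
    yield from range(offset, 100)


def results(max_results: Optional[int], offset: int = 0) -> Iterator[int]:
    # Same limit arithmetic as the source listing: offset counts against max_results.
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        return iter(())
    return itertools.islice(fake_results(offset), limit)


first_ten = list(results(10))           # records 0..9
skip_three = list(results(10, 3))       # records 3..9: seven results, not ten
nothing_left = list(results(10, 10))    # offset == max_results: empty
unbounded_tail = list(results(None, 95))  # no max_results: everything from 95 on
```

Note that a nonzero offset shrinks the number of yielded results rather than shifting a fixed-size window: with max_results of 10 and an offset of 3, only seven results come back.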
class ArxivError(Exception):
    """This package's base Exception class."""

    url: str
    """The feed URL that could not be fetched."""
    retry: int
    """
    The request try number which encountered this error; 0 for the initial try,
    1 for the first retry, and so on.
    """
    message: str
    """Message describing what caused this error."""

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs an `ArxivError` encountered while fetching the specified URL.
        """
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)
This package's base Exception class.
def __init__(self, url: str, retry: int, message: str):
    """
    Constructs an `ArxivError` encountered while fetching the specified URL.
    """
    self.url = url
    self.retry = retry
    self.message = message
    super().__init__(self.message)
Constructs an ArxivError encountered while fetching the specified URL.
retry: The request try number that encountered this error: 0 for the initial try, 1 for the first retry, and so on.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class UnexpectedEmptyPageError(ArxivError):
    """
    An error raised when a page of results that should be non-empty is empty.

    This should never happen in theory, but happens sporadically due to
    brittleness in the underlying arXiv API; usually resolved by retries.

    See `Client.results` for usage.
    """

    raw_feed: feedparser.FeedParserDict
    """
    The raw output of `feedparser.parse`. Sometimes this contains useful
    diagnostic information, e.g. in 'bozo_exception'.
    """

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs an `UnexpectedEmptyPageError` encountered for the specified
        API URL after `retry` tries.
        """
        self.url = url
        self.raw_feed = raw_feed
        super().__init__(url, retry, "Page of results was unexpectedly empty")

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.raw_feed)
        )
An error raised when a page of results that should be non-empty is empty.
This should never happen in theory, but happens sporadically due to brittleness in the underlying arXiv API; usually resolved by retries.
See Client.results for usage.
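Since sporadic empty pages are usually resolved by retrying, the general pattern can be sketched as a bounded retry loop. This is a hedged illustration, not the client's actual logic: `EmptyPage` is a hypothetical stand-in for UnexpectedEmptyPageError, and `fetch_with_retries`, `fetch_page`, and `max_tries` are names invented for this example. The real client also waits between tries, which is omitted here.

```python
from typing import Callable, List, Optional


class EmptyPage(Exception):
    """Hypothetical stand-in for UnexpectedEmptyPageError."""

    def __init__(self, retry: int):
        self.retry = retry
        super().__init__("Page of results was unexpectedly empty")


def fetch_with_retries(fetch_page: Callable[[], List[str]], max_tries: int = 3) -> List[str]:
    """Call fetch_page up to max_tries times; raise the last error if every try is empty."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error: Optional[EmptyPage] = None
    for attempt in range(max_tries):  # attempt 0 is the initial try
        page = fetch_page()
        if page:
            return page
        # Record which try failed, mirroring the `retry` numbering convention.
        last_error = EmptyPage(retry=attempt)
    assert last_error is not None
    raise last_error
```

A caller that sees the error escape this loop knows that every try came back empty; the `retry` attribute on the raised error identifies the last try that failed.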
def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
    """
    Constructs an `UnexpectedEmptyPageError` encountered for the specified
    API URL after `retry` tries.
    """
    self.url = url
    self.raw_feed = raw_feed
    super().__init__(url, retry, "Page of results was unexpectedly empty")
Constructs an UnexpectedEmptyPageError encountered for the specified API URL after retry tries.
raw_feed: The raw output of feedparser.parse. Sometimes this contains useful diagnostic information, e.g. in 'bozo_exception'.
Inherited Members
- builtins.BaseException
- with_traceback
- args
class HTTPError(ArxivError):
    """
    A non-200 status encountered while fetching a page of results.

    See `Client.results` for usage.
    """

    status: int
    """The HTTP status reported by feedparser."""

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs an `HTTPError` for the specified status code, encountered for
        the specified API URL after `retry` tries.
        """
        self.url = url
        self.status = status
        super().__init__(
            url,
            retry,
            "Page request resulted in HTTP {}".format(self.status),
        )

    def __repr__(self) -> str:
        return "{}({}, {}, {})".format(
            _classname(self), repr(self.url), repr(self.retry), repr(self.status)
        )
A non-200 status encountered while fetching a page of results.
See Client.results for usage.
def __init__(self, url: str, retry: int, status: int):
    """
    Constructs an `HTTPError` for the specified status code, encountered for
    the specified API URL after `retry` tries.
    """
    self.url = url
    self.status = status
    super().__init__(
        url,
        retry,
        "Page request resulted in HTTP {}".format(self.status),
    )
Inherited Members
- builtins.BaseException
- with_traceback
- args
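Because both error types derive from ArxivError, a single `except` clause on the base class can handle either one. The sketch below uses local stand-in classes that mirror the source listings so that it runs without the package or network access; `describe` is a hypothetical helper and the URL is a placeholder, not a real endpoint.

```python
class ArxivError(Exception):
    """Stand-in mirroring the base class listing."""

    def __init__(self, url: str, retry: int, message: str):
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(self.message)

    def __str__(self) -> str:
        return "{} ({})".format(self.message, self.url)


class HTTPError(ArxivError):
    """Stand-in mirroring the HTTPError listing."""

    def __init__(self, url: str, retry: int, status: int):
        self.status = status
        super().__init__(url, retry, "Page request resulted in HTTP {}".format(status))


def describe(error: ArxivError) -> str:
    """Summarize any package error; append the status code when one is present."""
    summary = "try {}: {}".format(error.retry, error)
    if isinstance(error, HTTPError):
        summary += " [status={}]".format(error.status)
    return summary


try:
    # Placeholder URL for illustration only.
    raise HTTPError("https://example.invalid/api/query?x", retry=2, status=503)
except ArxivError as err:  # the base class also catches HTTPError
    message = describe(err)
```

With the real package, the same pattern would presumably wrap iteration over `client.results(search)`, since the errors are raised while pages are being fetched.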