
Graphs at Work

You might never invert a binary tree, but graph theory — and graph algorithms — worm their way into software engineering, whether you recognize them or not. One such problem wormed its way into my brain last week and reminded me that API engineering isn’t all just flat-mapping responses. Conceiving of it as a graph problem doesn’t just guide the implementation: it determines the product’s attributes, which in turn shape its direction.

One of Sequin’s products is a two-way integration between our customers’ HubSpot accounts and their Postgres databases, using the HubSpot API to live-update the database and forward writes to Postgres back to HubSpot. There’s a variety of entities in HubSpot: Companies, Contacts, Deals, and all the custom object types our customers’ hearts desire. (Want to index Cars in HubSpot? Make a custom Car type.) Our customers might have thousands of Contacts, hundreds of Companies, and so on.

Of course, these entities aren’t islands; a Contact works for a Company, a Deal binds a company and is negotiated by a Contact, and so on. Rather than capturing every possible relationship on each type, HubSpot has a separate associations model for connecting two objects. For example, an association might declare “Contact Lukas Schwab is associated with Company Sequin.” Those associations are always bidirectional; if I’m associated with Sequin, Sequin’s also associated with me.

Because associations are a core part of HubSpot’s product model, Sequin’s sync needs to scrape them into Postgres. HubSpot’s API makes that a little tricky; I’ll try to explain this in brief, because it’s important to understanding the task at hand.

If you want to fetch all associations between Contacts and Companies from the HubSpot v3 API, you can use either of two endpoints:

Because HubSpot limits each customer’s daily API usage, the second endpoint is the much more desirable access pattern. Still, we have to pick a request strategy. Given the association types a customer wants synced to Postgres — type pairs like Contact ↔︎ Company — what fromObjectTypes should Sequin list? The project requirements are to design a request strategy which

  1. Systematically pulls all associations the customer desires, specified as pairs of types.
  2. Minimizes the number of requests to HubSpot’s API.

For instance, a customer’s desired association types might look like this:

From      To
Contact   Company
Deal      Contact
Company   Deal
Company   Car
Company   Policy
Car       Policy
Company   Owner

But it’s easier to start thinking about a simpler configuration:

From      To
Contact   Company
Deal      Contact

This is when I recognized graphs would be good visualizations for request strategies. Make each unique type (in this example, Contact, Company, and Deal) a vertex; then the associations we want to fetch are edges on the graph. The customer configuration is essentially an adjacency list!
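In code, that configuration really is just an edge list. Here’s a quick sketch in Python (the implementation in the footnotes is Elixir; the names here are illustrative):

```python
from collections import defaultdict

# The customer's desired association types, as unordered pairs (the edges).
desired = [("Contact", "Company"), ("Deal", "Contact")]

# Each unique type becomes a vertex; build the adjacency list.
adjacency = defaultdict(set)
for a, b in desired:
    adjacency[a].add(b)
    adjacency[b].add(a)  # associations are always bidirectional

# Contact appears in both pairs, but it's a single vertex with two neighbors.
```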

The associations we want, expressed as a graph: Contact ↔︎ Company and Deal ↔︎ Contact. The Contact type is involved in both associations, but it’s reduced to a single vertex.

Coloring the types you’d list with GET requests gives you a visual representation of each request strategy.

Request strategy (visualizations omitted):

  1. GET contacts
     GET deals
  2. GET contacts with associations=[company, deal]
  3. GET deals
     GET companies

Now that you’re working with graphs, you have access to a pithier expression of the problem. The set of desired associations defines a graph G = (V, E), where the edges E are desired associations and the vertices V are the HubSpot types involved. Our search for a “request strategy” is a search for an optimal subset Vreq ⊆ V such that every e ∈ E is incident upon — touching — at least one object type v ∈ Vreq.
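That condition is mechanical to check. A sketch in Python, with hypothetical names:

```python
def is_valid_strategy(v_req, edges):
    """A strategy is valid when every desired association (edge)
    is incident on at least one listed type in v_req."""
    return all(u in v_req or v in v_req for u, v in edges)

edges = [("Contact", "Company"), ("Deal", "Contact")]
print(is_valid_strategy({"Contact"}, edges))  # True: Contact touches both edges
print(is_valid_strategy({"Company"}, edges))  # False: Deal-Contact is uncovered
```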

All of the strategies in the table above satisfy that condition, but one of them — and only one valid strategy for that graph! — only lists objects of a single type, Contacts.

There isn’t always a single-type silver bullet. For some graphs, no single vertex is incident on every edge.

In this associations graph, the simplest valid strategy lists objects of two types: Deal and Company. No single vertex is incident on every edge.

You’re faced with the algorithmic question I faced on Friday: given any graph like this, how can you systematically produce a valid minimal strategy?

Cracking the graph

As a rule of thumb, greedy graph algorithms — approaches which repeatedly apply a single heuristic to make local decisions — are easier to stomach. Because the heuristic is the crux of the implementation, you can focus on documenting that. With any luck, it won’t give your code reviewer a headache.

Luckily, there’s an intuitive greedy heuristic for selecting Vreq from V! Consider your start state and end state: you begin with Vreq = {} and no covered edges (Ereq = {}), and you’re finished when every edge in E has been covered (Ereq = E).

At any intermediate step, what change to the Vreq you’re building makes the most progress towards that end state? The one that adds the most new edges in E to Ereq!

More formally, given an associations graph G = (V, E),

  1. Let Vreq = {}, Ereq = {}.
  2. While E ≠ {}:
    • Take the vertex v ∈ V incident on the most edges remaining in E.
    • Add v to Vreq, and move every edge incident on v from E to Ereq.
  3. Return Vreq.

G shrinks with each step, so this is guaranteed to terminate for any finite associations graph (and, since there’s a finite limit to the number of object types in a HubSpot account, G is guaranteed to be finite — phew). I’ll call this the clever-greedy algorithm, after Zito.2

This has a straightforward — if inefficient — recursive implementation,3 but don’t sweat it if you don’t want to read code. You can use the visualizations to see successive applications of the algorithm chip away at G as it builds the set of types Vreq.

Step-by-step, with the subgraph visualizations omitted:

Step 0: Vreq = {}, Ereq = {}. Company and Deal both have three incident edges. Take Deal.
Step 1: Vreq = {Deal}, Ereq = {Deal↔︎Contact, Deal↔︎Company, Deal↔︎Policy}. Company has two incident edges; Contact and Policy each have one. Take Company.
Step 2: Vreq = {Deal, Company}, Ereq contains all five desired associations. There aren’t any more edges in G, so we have a valid Vreq!

Problem solved, right? You reframed the problem in terms of graphs, which led you to a greedy algorithm — just a couple-dozen lines of code — that’ll produce a valid, seemingly minimal request strategy every time.

There’s one issue. In practice, what’s optimal? What do the project requirements say about optimality?

  1. Systematically pulls all associations the customer desires, specified as pairs of types.
  2. Minimizes the number of requests to HubSpot’s API.

Are you minimizing API usage with the greedy strategy? Not necessarily! You’re minimizing the number of types you have to list, true, but the API cost of listing a type is nonuniform. Because each GET request only returns a page of ≤100 objects, the number of API requests we have to make for a strategy scales with the total number of objects of the types in our strategy.

Do fewer types mean fewer objects? That’s not an unreasonable guess if you know nothing about the system, but you don’t have to look far for a counterexample. Consider a tweaked version of our initial simple associations graph, where you have additional info about the number of objects in HubSpot: there are 300 Contacts, 100 Companies, and 100 Deals.

Strategy Vreq      API cost
{Contact}          300 Contacts: three requests.
{Deal, Company}    100 Deals + 100 Companies: two requests.

The first strategy is simpler — it only involves listing objects of one type — but it’s costlier. The project requires optimizing API cost, not strategy simplicity!
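Under this cost model (one request per page of at most 100 objects), the comparison is easy to reproduce. A sketch in Python, using the example’s counts:

```python
import math

PAGE_SIZE = 100
object_counts = {"Contact": 300, "Company": 100, "Deal": 100}

def api_cost(v_req):
    # One paginated listing per type: ceil(count / page size) requests each.
    return sum(math.ceil(object_counts[t] / PAGE_SIZE) for t in v_req)

print(api_cost({"Contact"}))          # 3 requests
print(api_cost({"Company", "Deal"}))  # 2 requests
```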

Can you get there by modifying the greedy strategy? That’s not clear. The simple heuristic (“pick the vertex that adds the most new edges in E to Ereq”) also needs to factor in each vertex’s cost. Sometimes a voluminous type is worth requesting because it’s incident on many edges; sometimes it isn’t. In complex graphs, it’s hard to figure out locally whether a given vertex will wind up being a good pick or a bad one. Hmm.

The graph cracks back

Looks can be deceiving! This puzzle is a minimal vertex cover problem disguised as a HubSpot API question. Introducing cost minimization makes it a weighted minimal vertex cover problem. With or without weights, the optimization problem is NP-hard: as far as the career math-freaks know, there isn’t a solution that works in polynomial time. Any general algorithm for truly optimizing your HubSpot associations-fetching is going to be really, really slow. Bummer!

If you don’t care about performance, there’s a brute-force solution in O(2ⁿ):

  1. Filter the 2ⁿ subsets of V for vertex covers of G, the valid API strategies.
  2. For each, calculate the API cost.
  3. Take the lowest-cost strategy.
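For a graph this small, the brute force is perfectly runnable. A sketch in Python, reusing the example’s object counts (names hypothetical):

```python
from itertools import combinations
import math

edges = [("Contact", "Company"), ("Deal", "Contact")]
object_counts = {"Contact": 300, "Company": 100, "Deal": 100}

def is_cover(subset):
    return all(u in subset or v in subset for u, v in edges)

def cost(subset):
    # One request per page of <= 100 objects, for each listed type.
    return sum(math.ceil(object_counts[t] / 100) for t in subset)

vertices = sorted({v for edge in edges for v in edge})
# Enumerate all 2^n subsets of V, keep the vertex covers, take the cheapest.
valid = [
    set(c)
    for r in range(len(vertices) + 1)
    for c in combinations(vertices, r)
    if is_cover(set(c))
]
best = min(valid, key=cost)  # the cheapest cover: Company and Deal, 2 requests
```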

What’s wrong with the clever-greedy algorithm? Provided a graph designed to trip it up, it can’t even guarantee results within any constant-factor multiple of the optimal weight!4 There are other greedy algorithms for vertex cover guaranteeing results no larger than twice the minimum (2-approximations).5 My favorite is Savage’s, which simply prunes the leaves from a spanning tree.6
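Savage’s observation: a depth-first search tree has no cross edges, so every edge touches an internal (non-leaf) vertex of the tree, and those internal vertices form a cover at most twice the minimum size. A sketch in Python, assuming a connected associations graph:

```python
def savage_cover(edges):
    """Return the internal vertices of a DFS spanning tree: a 2-approximate
    vertex cover (Savage, 1982). Assumes the graph is connected."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    visited, internal = set(), set()

    def dfs(u):
        visited.add(u)
        for w in sorted(adjacency[u]):  # sorted only for determinism
            if w not in visited:
                internal.add(u)  # u gains a tree child, so it isn't a leaf
                dfs(w)

    dfs(next(iter(adjacency)))
    return internal

edges = [("Contact", "Company"), ("Contact", "Deal"), ("Deal", "Company"),
         ("Policy", "Deal"), ("Policy", "Company")]
cover = savage_cover(edges)  # covers every edge; at most twice the optimum
```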

It stands to reason that introducing weights makes this harder; unfortunately, since we’re working with weighted vertices, you can’t just apply Savage’s reasoning to a minimum weighted spanning tree (which you could build with Prim’s algorithm). There are 2-approximation algorithms for weighted vertex cover, like this one adapted from Vazirani:7

2.11 Consider the following algorithm for the weighted vertex cover problem. For each vertex v, t(v) is initialized to its weight, and when t(v) drops to 0, v is picked in the cover.

Algorithm 2.17

  1. Initialization:
    • Vreq ← ∅
    • ∀v ∈ V, t(v) ← weight(v)
  2. While Vreq is not a vertex cover do:
    • Pick an uncovered edge, say (u, v). Let m = min (t(u), t(v)).
    • t(u) ← t(u) − m.
    • t(v) ← t(v) − m.
    • Include in Vreq all vertices having t(v) = 0.
  3. Output Vreq.

This has essentially the same halting condition as our greedy algorithm, but it decreases a vertex’s weight t(v) as its neighbors are added to the eventual set cover Vreq. Intuitively, a vertex with a high initial weight is unlikely to ever hit t(v) = 0, so it’s unlikely to wind up in Vreq. An algorithm like Vazirani’s might make a fine solution for getting an okay minimization.
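Here is a sketch of Algorithm 2.17 in Python (edge costs dropped, as in the adaptation above; the example weights are hypothetical API request counts):

```python
def weighted_cover(edges, weight):
    """2-approximate weighted vertex cover, after Vazirani's Algorithm 2.17.
    t(v) starts at the vertex's weight and is drained edge by edge."""
    t = dict(weight)
    v_req = set()

    def uncovered():
        return [(u, v) for u, v in edges if u not in v_req and v not in v_req]

    while uncovered():
        u, v = uncovered()[0]  # pick any uncovered edge
        m = min(t[u], t[v])
        t[u] -= m
        t[v] -= m
        # Include every endpoint whose remaining weight hit zero.
        v_req.update(x for x in (u, v) if t[x] == 0)
    return v_req

edges = [("Contact", "Company"), ("Deal", "Contact")]
weights = {"Contact": 3, "Company": 1, "Deal": 1}
cover = weighted_cover(edges, weights)  # picks Company and Deal: cost 2, not 3
```

On this small example the algorithm avoids the voluminous Contact type entirely, because Contact’s t(v) never drains to zero before its cheaper neighbors’ do.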

Is it the solution for you? Like the earlier question about whether a “small” strategy was optimal, this comes down to the requirements — implicit and explicit — of the project. The pragmatizing effect of project requirements is really, in some sense, what sets software engineering apart from computer science.

This puzzle explicitly requires you to minimize API usage. Unfortunately, neither the clever-greedy algorithm nor Vazirani’s algorithm guarantees optimal outcomes. Vazirani guarantees at worst a 2-approximation of the optimal strategy, whereas the clever-greedy algorithm can’t even promise a constant factor for an unweighted set cover. Practically, though, worst-case performance might not matter! User-provided lists of associations to fetch usually won’t be the graphs that trick clever-greedy into poor outcomes. If customer concerns are well-isolated, an adversarial customer can force your algorithm into an inefficient solution… but they can’t impact anyone else’s experience. To improve the customer experience, what you need to understand is the distribution of strategy performances for a given algorithm on real-world graphs, not the theoretical worst case. Here clever-greedy might be better, even if it assumes constant vertex weights.8

Instead, implicit project requirements — values inherent to the engineering task — could reasonably determine what you implement. New engineers will have to read and modify the algorithm you choose; because that time trades off with other work, your customers may be better-served holistically if you pick a simple, maintainable algorithm here over a strong n-approximation. That’s the principle behind minimizing tech debt.

Lehman describes the evolution of programs through a system of three nesting levels:

  1. S-programs implement specifications: “return the lowest common multiple of two integers.”
  2. P-programs, made up of S-programs, strive to perform optimal behaviors: “play chess.”
  3. E-programs, made up of P-programs, “mechanize a human or societal activity.” Software products, by nature of having collaborators and users, are E-programs.9

The HubSpot project requirements ask you to design a P-program. At several junctures, feeling comfortable working with graphs helped break that task into S-programs: graphs gave us a natural visualization, which helped design a greedy algorithm. Recognizing a well-studied problem in graph theory, the vertex cover, exposed flaws in that greedy algorithm — the adversarial cases — and led to the conclusion all known optimal solutions are inefficient.

Critically, though, graph theory is handy for understanding the relationship between P-program and E-program. Discovering an existing 2-approximation like Vazirani’s might be the difference between a worthwhile feature and a waste of time. Having the vocabulary to study the associations graphs you see in practice lets you study whether worst-case performance matters, or if maybe something simpler (unweighted) will usually do better.

In summary, this isn’t just engineering navel-gazing; recognizing a minimum weighted vertex cover problem when you see one impacts your bottom line. Losing engineering hours to battling P ≠ NP or reading abstruse optimizations won’t keep the lights on.

Brush up on your graph theory, and wield it, but remember you’re building a product — don’t miss the forest for the trees.

I didn’t recognize the weighted vertex cover problem at first. Instead, I worked through expressing this problem as two other NP-hard optimization problems, distrusting my own conclusions. Here’s a bit more math for anyone curious!

My first theory was that I could calculate the minimum edge cover of a modified line graph L(G) — a task much easier than NP-hard — but that was wrong. For our problem, the edges in L(G) are non-independent; you have to take or leave all the Contact edges as a batch. If I successfully reduced minimum vertex cover to minimum edge cover, I would suddenly be very, very popular.

Fiddling with line graphs led me to this more fruitful representation of the three-vertex weighted example discussed above:

The root is an artificial vertex. It’s connected to each vertex in V by an edge weighted by the cost of listing all objects of that type. The final row of vertices represents the association types we want to fetch. An association vertex’s parents are the types at either end of the association. Once you’ve paid the cost of listing Contacts, you have all the associations involving Contacts; that holds for any type, so all of these edges are weighted 0.

The optimal request strategy is a subgraph spanning the blue terminal vertices, the subgraph with the smallest total edge weight that does so. Think of it this way: the bottom row of vertices represents the data we need to get to. We can either go through (list) Deals to get the Contact↔︎Deal associations, or we have to go through (list) Contacts.

We’re looking for a minimal Steiner tree in a directed graph — like a minimal spanning tree, but free to ignore non-terminal vertices. The terminal vertices are in blue above: the root and the desired association types.

The minimal Steiner tree for this simple example has weight 200.

Certain properties of these graphs make them simpler than the general class of Steiner tree problems (namely, their fixed height guarantees that they’re quasi-bipartite), but that doesn’t save them from NP-hard status.

Computing minimum-cost r-Steiner trees is NP-hard for r ≥ 4, even if the underlying graph is quasibipartite. The complexity status for r = 3 is unresolved, and the case r = 2 reduces to the minimum-cost spanning tree problem.10

I’ll spare you the details, but r here is one plus the highest degree of a vertex in G — three for this simple example case, and often greater than or equal to four in practice.

Still, I wondered if there was some other relevant feature of these graphs that simplifies the problem. I looked for a more familiar representation, and found one: a weighted set cover.

For brevity, I’ll use shorthand for the HubSpot types; instead of Contact, Company, Deal, etc., I’ll write A, B, C, and so on. You’re looking for a number of associations specified by unique unordered pairs of those types; for example, {(A, B), (B, C), (B, D), (C, D)}.

This set of pairs is the universe 𝒰 to be covered. The sets doing the covering represent the endpoints we can call: for each type, the associations in 𝒰 incident on that type.

  Aset = {(A, B)}
  Bset = {(A, B), (B, C), (B, D)}
  Cset = {(B, C), (C, D)}
  Dset = {(B, D), (C, D)}

These constitute a family 𝒮 of subsets of 𝒰. The set cover problem is to find a minimal number of members in that family such that their union spans 𝒰, analogous to the optimal HubSpot API strategy. In this case you can get all desired associations by listing objects of two types:

  Bset ∪ Cset = {(A, B), (B, C), (B, D)} ∪ {(B, C), (C, D)}
              = {(A, B), (B, C), (B, D), (C, D)}
              = 𝒰.
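That union is easy to verify mechanically; a quick sketch in Python:

```python
# The universe: desired associations as unordered pairs of types.
universe = {("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")}

# The covering family S: for each type, the associations incident on it.
types = {t for pair in universe for t in pair}
family = {t: {pair for pair in universe if t in pair} for t in types}

covered = family["B"] | family["C"]
print(covered == universe)  # True: listing B and C fetches every association
```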

Bad news: finding a minimal set cover is NP-hard.

As you might imagine, the weighted set cover assigns a weight to each member of 𝒮 and demands you minimize the total weight. For this problem, you’d weight Aset with the API cost of listing all objects of type A… but don’t bother: solving weighted set cover isn’t any easier.

Interestingly, there are exceptions! Finding a minimum set cover on 2-sets is equivalent to finding a minimum edge cover, which is easier. Tugging on this thread11 led me to discovering the weighted vertex cover problem, which exactly matched my original project definition.

That match confirmed it: any weighted vertex cover problem can be expressed as a question about HubSpot associations, so optimizing this HubSpot API strategy is, in fact, NP-hard.

  1. A request for a page of Contacts, with Company and Deal associations included, takes this form. Note the weird URL-encoding.

    curl --request GET \
      --url 'https://api.hubapi.com/crm/v3/objects/contacts?limit=10&associations=contact&associations=deal&associations=company&archived=false' \
      --header 'authorization: Bearer YOUR_ACCESS_TOKEN'
  2. Zito, Michele. “Vertex Cover.” (2005). Also see Lavrov (2020).↩︎

  3. Here’s an Elixir implementation of the greedy strategy for finding Vreq:

    @spec greedy_V_req(list({String.t(), String.t()})) :: list(String.t())
    def greedy_V_req([]), do: [] # Base case: no edges in G.
    def greedy_V_req(graph) do
      # Greedily take the highest-degree vertex.
      taken =
        graph
        |> Enum.flat_map(&Tuple.to_list(&1))
        |> Enum.frequencies()
        |> Enum.max_by(fn {_k, freq} -> freq end)
        |> elem(0)

      # Remove edges incident on taken from graph; recurse.
      subgraph =
        graph
        |> Enum.filter(fn pair -> taken not in Tuple.to_list(pair) end)

      [taken | greedy_V_req(subgraph)]
    end

    &greedy_V_req/1 receives graph as an adjacency list of order-agnostic tuples. To specify the example graph:

      [
        {"Contact", "Company"},
        {"Contact", "Deal"},
        {"Deal", "Company"},
        {"Policy", "Deal"},
        {"Policy", "Company"}
      ]

    This implementation is inefficient — O(|V|²) — but it isn’t worth optimizing.↩︎

  4. Lavrov, Mikhail. “Lecture 36: Approximation Algorithms” (2020). 1-2.↩︎

  5. Zito, Michele. “Vertex Cover.” (2005). Also see Lavrov (2020).↩︎

  6. Savage, Carla. “Depth-first search and the vertex cover problem.” Information processing letters 14, no. 5 (1982): 233-235. A very tidy three pages.↩︎

  7. Vazirani, Vijay V. Approximation Algorithms. Vol. 1. Berlin: Springer, 2001. My adaptation renames C to Vreq for consistency with earlier examples, and removes steps updating c(e) — our problem doesn’t require we track edge “costs,” part of Vazirani’s exercise.↩︎

  8. I want to test this on random graphs. For now, consider the strategy Vazirani’s algorithm picks for this simple weighted graph:

    Vazirani’s algorithm lists all three types, even though, once you’ve listed any pair of types, listing the third is strictly redundant.

    Notably, this four-request solution is still a 2-approximation of the optimal 2-request solution from running clever-greedy on the unweighted graph. You could tweak Vazirani’s last step (“Include in C all vertices having t(v) = 0”) to guard against trivially-unnecessary additions like this.

    Counterintuitively, one 2-approximation algorithm for the unweighted set cover intentionally includes vertices which redundantly cover a given edge; doing so thwarts the clever-greedy adversarial case.↩︎

  9. Lehman, Meir M. “Programs, life cycles, and laws of software evolution.” Proceedings of the IEEE 68, no. 9 (1980): 1060-1076.↩︎

  10. Könemann, Jochen, David Pritchard, and Kunlun Tan. “A partition-based relaxation for Steiner trees.” Mathematical Programming 127, no. 2 (2011): 345-370. The introduction to this paper is the clearest overview of Steiner trees I found.↩︎

  11. Thanks to David Richerby for answering “Set cover problem with sets of size 2” (July 5, 2017).↩︎