You might never invert a binary tree, but graph theory — and graph algorithms — worm their way into software engineering, whether you recognize them or not. This problem wormed its way into my brain last week, where it reminded me API engineering isn’t all just flat-mapping responses. Conceiving this as a graph problem doesn’t just guide implementation: it determines the product’s attributes, which will impact its direction.
One of Sequin’s products is a two-way integration between our customers’ HubSpot accounts and their Postgres databases, using the HubSpot API to live-update the database and forward writes to Postgres back to HubSpot. There’s a variety of entities in HubSpot: Companies, Contacts, Deals, and all the custom object types our customers’ hearts desire. (Want to index Cars in HubSpot? Make a custom Car type.) Our customers might have thousands of Contacts, hundreds of Companies, and so on.
Of course, these entities aren’t islands; a Contact works for a Company, a Deal binds a Company and is negotiated by a Contact, and so on. Rather than capturing every possible relationship on each type, HubSpot has a separate associations model for connecting two objects. For example, an association might declare “Contact Lukas Schwab is associated with Company Sequin.” Those associations are always bidirectional; if I’m associated with Sequin, Sequin’s also associated with me.
Because associations are a core part of HubSpot’s product model, Sequin’s sync needs to scrape them into Postgres. HubSpot’s API makes that a little tricky; I’ll try to explain this in brief, because it’s important to understanding the task at hand.
If you want to fetch all associations between Contacts and Companies from the HubSpot v3 API, you can use either of two endpoints:
/crm/v3/objects/{fromObjectType}/{fromObjectId}/associations/{toObjectType}
This tells you all of the associations between a single object — specified by fromObjectType and fromObjectId — and objects of toObjectType. For example, you could GET /crm/v3/objects/contact/1860/associations/companies to list all Companies associated with Contact 1860.
Of course, if you have 10,000 Contacts, you’ll have to call this endpoint 10,000 times to get a complete picture of Contact ↔︎ Company associations. If you also want associations between Contacts and Deals, you’ll have to make another 10,000 requests with toObjectType set to deal.
/crm/v3/objects/{fromObjectType}
This endpoint for listing objects of fromObjectType, 100 at a time, has a handy associations parameter. When specified, the response includes all associations with objects of the specified types!
If you have 10,000 Contacts and you want to get Contact ↔︎ Company and Contact ↔︎ Deal associations, 100 requests will do the job.^{1}
Because HubSpot limits each customer’s daily API usage, the second endpoint is the much more desirable access pattern. Still, we have to pick a request strategy. Given the association types a customer wants synced to Postgres — type pairs like Contact ↔︎ Company — what fromObjectTypes should Sequin list? The project requirements are to design a request strategy which

1. Systematically pulls all associations the customer desires, specified as pairs of types.
2. Minimizes the number of requests to HubSpot’s API.
For instance, a customer’s desired association types might look like this:
From     To
-------  -------
Contact  Company
Deal     Contact
Company  Deal
Company  Car
Company  Policy
Car      Policy
Company  Owner
But it’s easier to start thinking about a simpler configuration:
From     To
-------  -------
Contact  Company
Deal     Contact
This is when I recognized graphs would be good visualizations for request strategies. Make each unique type (in this example, Contact, Company, and Deal) a vertex; then the associations we want to fetch are edges on the graph. The customer configuration is essentially an edge list!
Coloring the types you’d list with GET requests gives you a visual representation of each request strategy.
Three request strategies for this graph (the original table paired each with a visualization coloring the listed types):

- GET contacts?associations=[company] and GET deals?associations=[contact]
- GET contacts?associations=[company, deal]
- GET deals?associations=[contact] and GET companies?associations=[contact]
Now that you’re working with graphs, you have access to a pithier expression of the problem. The set of desired associations defines a graph G = (V, E), where the edges E are desired associations and the vertices V are the HubSpot types involved. Our search for a “request strategy” is a search for an optimal subset V_{req} ⊆ V such that every e ∈ E is incident upon — touching — at least one object type v ∈ V_{req}.
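That incidence condition is mechanical to check. Here’s a minimal Python sketch of it (the function name and the toy edge list are mine, not Sequin’s code):

```python
def is_valid_strategy(v_req, edges):
    """A strategy is valid iff every desired association (edge)
    touches at least one listed type (a vertex in v_req)."""
    return all(u in v_req or v in v_req for (u, v) in edges)

# The simple two-association example: Contact <-> Company, Deal <-> Contact.
edges = [("Contact", "Company"), ("Deal", "Contact")]
```

Listing only Contacts passes the check for this graph; listing only Companies leaves the Deal ↔︎ Contact edge uncovered.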
All of the strategies in the table above satisfy that condition, but one of them — and only one valid strategy for that graph! — only lists objects of a single type, Contacts.
There isn’t always a single-type silver bullet. For some graphs, no single vertex is incident on every edge.
You’re faced with the algorithmic question I faced on Friday: given any graph like this, how can you systematically produce a valid minimal strategy?
As a rule of thumb, greedy graph algorithms — approaches which repeatedly apply a single heuristic to make local decisions — are easier to stomach. Because the heuristic is the crux of the implementation, you can focus on documenting that. With any luck, it won’t give your code reviewer a headache.
Luckily, there’s an intuitive greedy heuristic for selecting V_{req} from V! Consider your start state and end state:
When your algorithm starts, you have an empty V_{req}; you don’t know what types to request. By definition, the set of edges E_{req} incident on V_{req} is empty: no vertices, no edges.
When your algorithm ends, E_{req} = E: every edge in the graph is incident on V_{req}.
At any intermediate step, what change to the V_{req} you’re building makes the most progress towards that end state? The one that adds the most new edges in E to E_{req}!
More formally, given an associations graph G = (V, E): take the vertex incident on the most edges, add it to V_{req}, and remove it and its incident edges from G. Repeat on the shrunken graph until no edges remain.
G shrinks with each step, so this is guaranteed to terminate for any finite associations graph (and, since there’s a finite limit to the number of object types in a HubSpot account, G is guaranteed to be finite — phew). I’ll call this the clever-greedy algorithm, after Zito.^{2}
This has a straightforward — if inefficient — recursive implementation,^{3} but don’t sweat it if you don’t want to read code. You can use the visualizations to see successive applications of the algorithm chip away at G as it builds the set of types V_{req}.
Step 0: V_{req} = ∅; E_{req} = ∅. Company and Deal both have three incident edges. Take Deal.

Step 1: V_{req} = {Deal}; E_{req} = {Deal↔︎Contact, Deal↔︎Company, Deal↔︎Policy}. Company has two incident edges; Contact and Policy each have one. Take Company.

Step 2: V_{req} = {Deal, Company}; E_{req} = {Deal↔︎Contact, Deal↔︎Company, Deal↔︎Policy, Company↔︎Contact, Company↔︎Policy}. There aren’t any more edges in G, so we have a valid V_{req}!
Problem solved, right? You reframed the problem in terms of graphs, which led you to a greedy algorithm — just a couple-dozen lines of code — that’ll produce a valid, seemingly minimal request strategy every time.
There’s one issue. In practice, what’s optimal? What do the project requirements say about optimality?
1. Systematically pulls all associations the customer desires, specified as pairs of types.
2. Minimizes the number of requests to HubSpot’s API.
Are you minimizing API usage with the greedy strategy? Not necessarily! You’re minimizing the number of types you have to list, true, but the API cost of listing a type is non-uniform. Because each GET request only returns a page of ≤100 objects, the number of API requests we have to make for a strategy scales with the total number of objects of the types in our strategy.
Do fewer types mean fewer objects? That’s not an unreasonable guess if you know nothing about the system, but you don’t have to look far for a counterexample. Consider a tweaked version of our initial simple associations graph, where you have additional info about the number of objects in HubSpot: there are 300 Contacts, 100 Companies, and 100 Deals.
Strategy         API cost
---------------  -----------------------------------------
{Contact}        300 Contacts. ∴ Three requests.
{Deal, Company}  100 Deals. 100 Companies. ∴ Two requests.
The first strategy is simpler — it only involves listing objects of one type — but it’s costlier. The project requires optimizing API cost, not strategy simplicity!
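That cost comparison reduces to ceiling division per listed type. A small Python sketch, assuming the 100-object page size described earlier (the function name and the example counts are illustrative):

```python
import math

PAGE_SIZE = 100  # the list endpoint returns at most 100 objects per page


def api_cost(v_req, object_counts):
    """Total requests for a strategy: one paged listing per type in v_req."""
    return sum(math.ceil(object_counts[t] / PAGE_SIZE) for t in v_req)


# The weighted example: 300 Contacts, 100 Companies, 100 Deals.
counts = {"Contact": 300, "Company": 100, "Deal": 100}
```

Under these counts, {Contact} costs three requests while {Deal, Company} costs two, matching the table above.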
Can you get there by modifying the greedy strategy? That’s not clear. The simple heuristic (“pick the vertex that adds the most new edges in E to E_{req}”) also needs to factor in each vertex’s cost. Sometimes a voluminous type is worth requesting because it’s incident on many edges; sometimes it isn’t. In complex graphs, it’s hard to figure out locally whether a given vertex will wind up being a good pick or a bad one. Hmm.
Looks can be deceiving! This puzzle is a minimal vertex cover problem disguised as a HubSpot API question. Introducing cost minimization makes it a weighted minimal vertex cover problem. With or without weights, the optimization problem is NP-hard: as far as the career math freaks know, there isn’t a solution that works in polynomial time. Any general algorithm for truly optimizing your HubSpot associations-fetching is going to be really, really slow. Bummer!
If you don’t care about performance, there’s a brute-force solution in O(2^{n}): enumerate every subset of V, keep the subsets that cover every edge, and return the cheapest.
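A Python sketch of that brute-force enumeration (the names, and the toy weights measured in pages, are illustrative):

```python
from itertools import combinations


def brute_force_cover(vertices, edges, weight):
    """Try every subset of V; the smallest-weight valid cover wins. O(2^n)."""
    best = None
    for r in range(len(vertices) + 1):
        for subset in combinations(vertices, r):
            chosen = set(subset)
            # Valid iff every edge touches a chosen vertex.
            if all(u in chosen or v in chosen for (u, v) in edges):
                cost = sum(weight[v] for v in chosen)
                if best is None or cost < best[0]:
                    best = (cost, chosen)
    return best


vertices = ["Contact", "Company", "Deal"]
edges = [("Contact", "Company"), ("Deal", "Contact")]
weights = {"Contact": 3, "Company": 1, "Deal": 1}  # pages per type, per the example
```

On the weighted example this prefers {Company, Deal} (two pages) over {Contact} (three pages).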
What’s wrong with the clever-greedy algorithm? Provided a graph designed to trip it up, it can’t even guarantee results within any constant-factor multiple of the optimal weight!^{4} There are other greedy algorithms for vertex cover guaranteeing results no larger than twice the minimum (2-approximations).^{5} My favorite is Savage’s, which simply prunes the leaves from a spanning tree.^{6}
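Savage’s idea can be sketched briefly: keep the internal (non-leaf) vertices of a depth-first spanning forest, which form a vertex cover no more than twice the minimum size. This Python sketch is my paraphrase of the idea, not Savage’s own pseudocode:

```python
def savage_cover(vertices, edges):
    """Internal vertices of a DFS spanning forest: a 2-approximate vertex cover."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    visited, internal = set(), set()

    def dfs(u):
        visited.add(u)
        for w in adj[u]:
            if w not in visited:
                internal.add(u)  # u has a DFS child, so it's a non-leaf
                dfs(w)

    for v in vertices:  # one tree per connected component
        if v not in visited:
            dfs(v)
    return internal
```

On the simple two-edge example graph, the DFS rooted at Contact makes Contact the only internal vertex, recovering the single-type strategy.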
It stands to reason that introducing weights makes this harder; unfortunately, since we’re working with weighted vertices, you can’t just apply Savage’s reasoning to a minimum weighted spanning tree (which you could build with Prim’s algorithm). There are 2-approximation algorithms for weighted vertex cover, like this one adapted from Vazirani:^{7}
2.11 Consider the following algorithm for the weighted vertex cover problem. For each vertex v, t(v) is initialized to its weight, and when t(v) drops to 0, v is picked in the cover.
Algorithm 2.17
 Initialization:
 V_{req} ← ∅
 ∀v ∈ V, t(v) ← weight(v)
 While V_{req} is not a vertex cover do:
 Pick an uncovered edge, say (u, v). Let m = min(t(u), t(v)).
 t(u) ← t(u) − m.
 t(v) ← t(v) − m.
 Include in V_{req} all vertices having t(v) = 0.
 Output V_{req}.
This has essentially the same halting condition as our greedy algorithm, but it decreases a vertex’s weight t(v) as its neighbors are added to the eventual vertex cover V_{req}. Intuitively, a vertex with a high initial weight is unlikely to ever hit t(v) = 0, so it’s unlikely to wind up in V_{req}. An algorithm like Vazirani’s might make a fine solution for getting an okay minimization.
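A rough Python translation of Algorithm 2.17 (edge-scanning order is arbitrary here, so different orders can produce different covers, all within the 2-approximation bound):

```python
def vazirani_cover(edges, weight):
    """Decrease t(v) along uncovered edges; vertices whose t hits 0 join the cover."""
    t = dict(weight)  # t(v) is initialized to weight(v)
    v_req = set()

    def uncovered():
        return [(u, v) for (u, v) in edges if u not in v_req and v not in v_req]

    while uncovered():
        u, v = uncovered()[0]       # pick any uncovered edge
        m = min(t[u], t[v])
        t[u] -= m
        t[v] -= m
        v_req |= {w for w in (u, v) if t[w] == 0}
    return v_req
```

On the weighted three-type example (Contact costing three pages, Company and Deal one each), this happens to return the optimal {Company, Deal}.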
Is it the solution for you? Like the earlier question about whether a “small” strategy was optimal, this comes down to the requirements — implicit and explicit — of the project. The pragmatizing effect of project requirements is really, in some sense, what sets software engineering apart from computer science.
This puzzle explicitly requires you minimize API usage. Unfortunately, neither the clever-greedy algorithm nor Vazirani’s algorithm guarantees optimal outcomes. Vazirani guarantees at worst a 2-approximation of the optimal strategy, whereas the clever-greedy algorithm can’t even promise a constant factor for an unweighted vertex cover. Practically, though, worst-case performance might not matter! User-provided lists of associations to fetch usually won’t be the graphs that trick clever-greedy into poor outcomes. If customer concerns are well-isolated, an adversarial customer can force your algorithm into an inefficient solution… but they can’t impact anyone else’s experience. To improve the customer experience, what you need to understand is the distribution of strategy performances for a given algorithm on real-world graphs, not the theoretical worst case. Here clever-greedy might be better, even if it assumes constant vertex weights.^{8}
Instead, implicit project requirements — values inherent to the engineering task — could reasonably determine what you implement. New engineers will have to read and modify the algorithm you choose; because that time trades off with other work, your customers may be better-served holistically if you pick a simple, maintainable algorithm here over a strong n-approximation. That’s the principle behind minimizing tech debt.
Lehman describes the evolution of programs through a system of three nesting levels: S-programs, whose correctness follows from a formal specification; P-programs, which solve a precisely statable real-world problem; and E-programs, which are embedded in the world they model and must evolve along with it.
The HubSpot project requirements ask you to design a P-program. At several junctures, feeling comfortable working with graphs helped break that task into S-programs: graphs gave us a natural visualization, which helped design a greedy algorithm. Recognizing a well-studied problem in graph theory, the vertex cover, exposed flaws in that greedy algorithm — the adversarial cases — and led to the conclusion all known optimal solutions are inefficient.
Critically, though, graph theory is handy for understanding the relationship between P-program and E-program. Discovering an existing 2-approximation like Vazirani’s might be the difference between a worthwhile feature and a waste of time. Having the vocabulary to study the associations graphs you see in practice lets you study whether worst-case performance matters, or if maybe something simpler (unweighted) will usually do better.
In summary, this isn’t just engineering navel-gazing; recognizing a minimum weighted vertex cover problem when you see one impacts your bottom line. Losing engineering hours to battling P ≠ NP or reading abstruse optimizations won’t keep the lights on.
Brush up on your graph theory, and wield it, but remember you’re building a product — don’t miss the forest for the trees.
I didn’t recognize the weighted vertex cover problem at first. Instead, I worked through expressing this problem as two other NP-hard optimization problems, distrusting my own conclusions. Here’s a bit more math for anyone curious!
My first theory was that I could calculate the minimum edge cover of a modified line graph L(G) — a task much easier than NP-hard — but that was wrong. For our problem, the edges in L(G) are non-independent; you have to take or leave all the Contact edges as a batch. If I successfully reduced minimum vertex cover to minimum edge cover, I would suddenly be very, very popular.
Fiddling with line graphs led me to this more fruitful representation of the three-vertex weighted example discussed above:
The optimal request strategy is a subgraph spanning the blue terminal vertices, the subgraph with the smallest total edge weight that does so. Think of it this way: the bottom row of vertices represents the data we need to get to. We can either go through (list) Deals to get the Contact↔︎Deal associations, or we have to go through (list) Contacts.
We’re looking for a minimal Steiner tree in a directed graph — like a minimal spanning tree, but free to ignore nonterminal vertices. The terminal vertices are in blue above: the root and the desired association types.
Certain properties of these graphs make them simpler than the general class of Steiner tree problems (namely, their fixed height guarantees that they’re quasi-bipartite), but that doesn’t save them from NP-hard status.
Computing minimum-cost r-Steiner trees is NP-hard for r ≥ 4, even if the underlying graph is quasi-bipartite. The complexity status for r = 3 is unresolved, and the case r = 2 reduces to the minimum-cost spanning tree problem.^{10}
I’ll spare you the details, but r here is one plus the highest degree of a vertex in G — three for this simple example case, and often greater than or equal to four in practice.
Still, I wondered if there was some other relevant feature of these graphs that simplifies the problem. I looked for a more familiar representation, and found one: a weighted set cover.
For brevity, I’ll use shorthand for the HubSpot types; instead of Contact, Company, Deal, etc., I’ll write A, B, C, and so on. You’re looking for a number of associations specified by unique unordered pairs of those types; for example, {(A, B), (B, C), (B, D), (C, D)}.
This set of pairs is the universe 𝒰 to be covered. The sets doing the covering represent the endpoints we can call: for each type, the associations in 𝒰 incident on that type:

A_{set} = {(A, B)}
B_{set} = {(A, B), (B, C), (B, D)}
C_{set} = {(B, C), (C, D)}
D_{set} = {(B, D), (C, D)}
These constitute a family 𝒮 of subsets of 𝒰. The set cover problem is to find a minimal number of members in that family such that their union spans 𝒰, analogous to the optimal HubSpot API strategy. In this case you can get all desired associations by listing objects of two types: B_{set}⋃C_{set} = {(A, B), (B, C), (B, D)}⋃{(B, C), (C, D)} = {(A, B), (B, C), (B, D), (C, D)} = 𝒰.
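Constructing 𝒮 from 𝒰 and verifying that cover is mechanical. A Python sketch using frozensets for the unordered pairs (the variable names are mine):

```python
# The universe: desired associations as unordered pairs of shorthand types.
U = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]}

# The family S: for each type, the pairs in U incident on that type.
S = {t: {pair for pair in U if t in pair} for t in "ABCD"}
```

Unioning B_{set} and C_{set} reproduces 𝒰, whereas, say, A_{set} ⋃ D_{set} misses (B, C).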
Bad news: finding a minimal set cover is NP-hard.
As you might imagine, the weighted set cover assigns a weight to each member of 𝒮 and demands you minimize the weights. For this problem, you’d assign A_{set} a weight equal to the API cost of listing all objects of type A… but don’t bother: solving weighted set cover isn’t any easier.
Interestingly, there are exceptions! Finding a minimum set cover on 2-sets is equivalent to finding a minimum edge cover, which is easier. Tugging on this thread^{11} led me to discover the weighted vertex cover problem, which exactly matched my original project definition.
That match confirmed it: any weighted vertex cover problem can be expressed as a question about HubSpot associations, so optimizing this HubSpot API strategy is, in fact, NP-hard.
A request for a page of Contacts, with Company and Deal associations included, takes this form. Note the weird URL-encoding.
curl --request GET \
  --url 'https://api.hubapi.com/crm/v3/objects/contacts?limit=10&associations=contact&associations=deal&associations=company&archived=false' \
  --header 'authorization: Bearer YOUR_ACCESS_TOKEN'
Zito, Michele. “Vertex Cover.” (2005). Also see Lavrov (2020).↩︎
Here’s an Elixir implementation of the greedy strategy for finding V_{req}:
@spec greedy_V_req(list({String.t(), String.t()})) :: list(String.t())
def greedy_V_req([]), do: [] # Base case: no edges in G.

def greedy_V_req(graph) do
  # Greedily take the highest-degree vertex.
  taken =
    graph
    |> Enum.flat_map(&Tuple.to_list(&1))
    |> Enum.frequencies()
    |> Enum.max_by(fn {_k, freq} -> freq end)
    |> elem(0)

  # Remove taken from graph; recurse.
  subgraph =
    graph
    |> Enum.filter(fn pair -> taken not in Tuple.to_list(pair) end)

  [taken | greedy_V_req(subgraph)]
end
&greedy_V_req/1 receives graph as an edge list of order-agnostic tuples. To specify the example graph:
greedy_V_req([
{"Contact", "Company"},
{"Contact", "Deal"},
{"Deal", "Company"},
{"Policy", "Deal"},
{"Policy", "Company"}
])
This implementation is inefficient — O(V^{2}) — but it isn’t worth optimizing.↩︎
Lavrov, Mikhail. “Lecture 36: Approximation Algorithms” (2020). 12.↩︎
Zito, Michele. “Vertex Cover.” (2005). Also see Lavrov (2020).↩︎
Savage, Carla. “Depth-first search and the vertex cover problem.” Information Processing Letters 14, no. 5 (1982): 233-235. A very tidy three pages.↩︎
Vazirani, Vijay V. Approximation Algorithms. Vol. 1. Berlin: Springer, 2001. My adaptation renames C to V_{req} for consistency with earlier examples, and removes steps updating c(e) — our problem doesn’t require we track edge “costs,” part of Vazirani’s exercise.↩︎
I want to test this on random graphs. For now, consider the strategy Vazirani’s algorithm picks for this simple weighted graph:
Notably, this four-request solution is still a 2-approximation of the optimal two-request solution from running clever-greedy on the unweighted graph. You could tweak Vazirani’s last step (“Include in C all vertices having t(v) = 0”) to guard against trivially unnecessary additions like this.
Counterintuitively, one 2-approximation algorithm for the unweighted vertex cover intentionally includes vertices which redundantly cover a given edge; doing so thwarts the clever-greedy adversarial case.↩︎
Lehman, Meir M. “Programs, life cycles, and laws of software evolution.” Proceedings of the IEEE 68, no. 9 (1980): 1060-1076.↩︎
Könemann, Jochen, David Pritchard, and Kunlun Tan. “A partition-based relaxation for Steiner trees.” Mathematical Programming 127, no. 2 (2011): 345-370. The introduction to this paper is the clearest overview of Steiner trees I found.↩︎
Thanks to David Richerby for answering “Set cover problem with sets of size 2” (July 5, 2017).↩︎