‘Magic’ in Elixir’s Logger

I first drafted this post in December 2022. I stopped following progress in Elixir shortly after; maybe its logging has completely changed! That’s besides the point: this is a post about control and cognitive load, specifically with process-scoped global variables. Even if the examples go stale, the principle holds.

def log_and_operate(trace_id) do
  Logger.metadata(trace_id: trace_id)
  Logger.info("Reached pre-operation")
  operate()
  Logger.info("Reached post-operation")
end

Setting a trace_id in your structured logger should contextualize one log entry by locating it in a stream of other log entries with the same key. You include it because you want to fetch your logs later, search for something like trace_id:"correct-horse", and see logs like this:

But, in Elixir¹, operate() has free access to all the logger metadata, including the trace_id field doing the correlating! If it so chooses, it’ll break your trace_id connection between the pre- and post-operation log lines.

def operate() do
  Logger.metadata(trace_id: "battery-staple")
  Logger.info("Changed the trace ID")
end

Of course, that’s a reductive example. In practice, your trace_id has to survive a huge tree of conditionally-connected codepaths, any of which could trample your precious log metadata. You have to read a lot of code to feel confident that won’t happen.

Logger metadata is a process-scoped variable, so it’s intrinsically tied to what happens in that process, and in what order (“control”²). Even if a function with Logger calls doesn’t receive any logger-specific arguments — that is, it doesn’t explicitly depend on logging elsewhere — its behavior can implicitly depend on all sequentially prior operations in the same process. If you imagine the process’s execution as the depth-first traversal of a tree of function calls, this means knowing all the previously-visited vertices, not just the lexical parents of the current vertex.

Elixir’s standard Logger does what Peter Bourgon calls magic: it obscures a complex process-global data-interdependency. Unfortunately, ruling out influence is an indispensable part of reading code. When influence extends to a whole process, and when your program is anything but strictly non-recursive, you can’t just treat code lexically: you have to mentally model the process execution as you read it.

Functional work-arounds

One work-around is to, for some particularly sensitive metadata keys, defer to those already on the logger:

# safe_set_trace_id is a dedicated setter for a single Logger metadata field.
def safe_set_trace_id(trace_id) do
  if !Logger.metadata(:trace_id) do
    Logger.metadata(trace_id: trace_id)
  end
end

# protected_metadata and safe_set_metadata protect a static set of metadata
# keys.
@protected_metadata [:trace_id, :others]
def safe_set_metadata(opts) do
  opts
  |> Enum.each(fn {key, value} ->
    if !Enum.member(@protected_metadata, key) || !Logger.metadata(key) do
      Logger.metadata(key, value)
  end)
end

This requires either sharing a canonical set of precious metadata keys, then enforcing that every Logger call checks the list, or writing a module wrapping Logger and forbidding direct Logger calls.

Alternatively, you can enforce that keys for correlation be unchanged by redundantly(?) re-setting them:

def log_and_operate(trace_id) do
  Logger.metadata(trace_id: trace_id)
  Logger.info("Reached pre-operation")
  operate()
  Logger.info("Reached post-operation", trace_id: trace_id)
end

Better: explicit sharing

The Go approach is to relay the updates through return values: an updated logger or context.Context, which callers can take or leave selectively.

In Go convention, instead of having process-scoped logger metadata, loggers (with metadata) are objects. Of course, you commonly want to inherit some logger metadata; that’s why you see code building on a service-wide base logger:

func (s *Server) Operate() {
  s.Logger.With("key", "value")
}

func Operate(ctx context.Context) {
  log.From(ctx).With("key", "value")
}

Or receiving a logger from the caller, then forking it — extending the received logger’s metadata in a local copy:

func Operate(logger *zap.SugaredLogger) {
  var value string
  logger = logger.With("key", "value")
}

Even in Go, sometimes you do want to modify a logger for the caller scope. How do you do this without process-scoped magic? Return values that let the caller decide what to do with their own logger. Admittedly, this can feel a little clumsy — threading contextual data up and down the callstack resists later refactoring, and including an updated logger among return values feels even weirder. Regardless, at least the Go approach is explicit.

That’s one of my main complaints about Elixir. Every function call implicitly delegates — even just briefly — control of the caller’s logger. If you want to make any strong guarantees about the logging behavior of any given function, you have to know

And I don’t want to know that much! Excluding swaths of code from consideration is essential to being productive in a big, shared codebase (and especially for doing the kind of parachute-in spot debugging routinely required in a live product).

My point isn’t that you should pick Go³, but that you should spurn ‘magic,’ particularly the “comprehension-subverting magic of globals” (Bourgon), a point on which the Go community is pretty disciplined. Think less, get more done; better to be ugly and explicit than opaque.

At its heart, Elixir’s Logger really uses the metadata management API exposed by Erlang’s logger module.↩︎
For some more formal discussion, see Moseley, Ben, and Peter Marks. “Out of the tar pit.” Software Practice Advancement (SPA) 2006 (2006):

Control is basically about the order in which things happen.

The problem with control is that very often we do not want to have to be concerned with this. […]

This is because the person reading the code above must effectively duplicate the work of the hypothetical compoiler — they must (by virtue of the definition of the language semantics) start with the assumption that the ordering specified is seignificant, and then by further inspection determine that it is not (in cases less trivial than the one above determining this can become very difficult). The problem here is that mistakes in this determination can lead to the introduction of very subtle and hard-to-find bugs. (p. 8-9)

Available online.↩︎
This isn’t to say the language you write doesn’t matter. Choose wisely! Some things to consider:
1. Strong structured logging.
2. Static types.
3. Linters. Do they exist? Are they extensible?
4. Talent. Is the language niche? How much time will engineers spend learning it? Can you find community-maintained clients for common APIs, or will you have to roll your own?
Anecdotally, the functional PLs I’ve written professionally (Scala and Elixir) have disappointed in several of these respects. ↩︎