Maki: on-disk memoization for (deterministic) fun and profit
Greetings, earthlings! Today I will present an exciting new OCaml library called maki. This post will be longer than usual, because it is actually two posts in one:
- first, I'll present the goals of Maki, sketch its design, and present the most salient parts of its API;
- then I will demonstrate how to use it to quickly build a (very naive) OCaml build system, featuring parallel build and cached computations thanks to Maki.
Overview of Maki
In a nutshell: maki is an OCaml library for on-disk memoization. It allows
you to cache the results of pure computations on disk, with some special
support for computations that input or output files (and that are deterministic).
The initial use-case for it is to memoize the results of running theorem
provers on a large set of problems (to benchmark them), so that
the hours- or days-long computation is parallelized and interruptible at any
moment: to resume an interrupted computation, just re-compute everything
from scratch and let the memoization take care of not repeating what was
already done. In other words, it looks like this:
open Lwt.Infix
let benchmark
  : prover list -> file list -> (prover * result list) list Lwt.t
  = fun provers files ->
    (* no more than 10 simultaneous jobs *)
    let limit = Maki.Limit.create 10 in
    Lwt_list.map_p
      (fun p ->
         Lwt_list.map_p
           (fun f ->
              (* here be magic *)
              Maki.call ~limit (fun () -> run_prover_on_file p f))
           files
         >|= fun results ->
         p, results)
      provers
This code would be very naive if it weren't for maki, because:
- we would be running length provers * length files sub-processes at the same time; if we have a lot of files, this will freeze the machine or just not work;
- if we stop the computation before it completes, all the work is lost.
But the point is that Maki.call (fun () -> run_prover_on_file p f)
will only be computed once
for a given pair (prover, file); if it was already computed before we
kill and restart the computation, then this call returns immediately with
the result. The parameter limit is there to limit the number of
jobs that run concurrently.
Maki builds on SHA1 (values are memoized by the hash of their computation)
and Lwt (for concurrency and IOs;
limit is basically just Lwt_pool.t underneath).
Assume we have a list of files f1, …, f100, and provers p1 and p2.
If we call benchmark [p1;p2] [f1;...;f100] and want to print
the results, nothing could be easier:
(* ... *)
let benchmark_and_print provers files =
  benchmark provers files >>= fun l ->
  print_results l

let () =
  Lwt_main.run (benchmark_and_print [p1;p2] [f1;...;f100])
We can just run benchmark_and_print on the same list of provers and files;
it will call benchmark (which returns quickly because all its calls
to run_prover_on_file have been memoized), and print its results.
If we add a few files f101,…,f105 to the list,
the only work to do is calling run_prover_on_file p1 f10{1..5} and
run_prover_on_file p2 f10{1..5}; the other 200 invocations are cached
and return immediately. If we hadn't run benchmark earlier,
however, then benchmark_and_print would first do all the
computations, cache them, and print the results.
The point is that, for the user, everything happens as if the pure computations were done every time; the only difference is speed.
The design and implementation
The code first: here is the repository (and the current API documentation). The most important modules are:
- Maki: the main module and entry point to the library. It contains a few submodules; in particular, Maki.Value is used to describe how to store memoized results on the disk and retrieve them.
- Maki_storage: a basic abstraction of a persistent key-value store (sketched below), with a very simple default implementation that stores a pair key -> value as a file ~/.cache/maki/key containing value. It is possible to provide one's own implementation, for instance using a database.
- Maki_bencode: some serialization support.
- Maki_log: used mostly for debugging purposes.
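To give a feel for the storage abstraction, here is a purely hypothetical sketch of the kind of interface such a key-value store exposes; the actual Maki_storage signature differs in its details (see the API documentation), so treat this as illustration only.

(* hypothetical sketch only -- not the real Maki_storage interface *)
module type STORAGE = sig
  val get : string -> string option Lwt.t   (* look a key up *)
  val set : string -> string -> unit Lwt.t  (* store a binding *)
  val remove : string -> unit Lwt.t         (* delete a binding *)
end

The default file-based implementation provides exactly these operations over ~/.cache/maki; a database- or memcache-backed one would provide the same ones.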
In practice, memoizing a function is done this way:
let call_prover_on_file ?limit prover file =
  let module V = Maki.Value in
  Maki.call_exn
    ?limit
    ~lifetime:(`KeepFor Maki.Time.(days 2))
    ~deps:[V.pack Config.maki config;
           V.pack_program prover;
           V.pack_file file]
    ~op:Res.maki
    ~name:"call_prover_on_file"
    (fun () -> actual_shell_invocation prover file)
What's going on here?
- Maki.call_exn : […other arguments…] -> (unit -> 'a Lwt.t) -> 'a Lwt.t is the actual memoization function. There is also Maki.call : […] -> (unit -> 'a Lwt.t) -> ('a, exn) result Lwt.t when error handling matters.
- the parameter deps (not optional) describes the set of parameters f depends upon. When calling Maki.call_exn ~deps f, the user promises that f () always returns the same result for the same value of deps. Conceptually, Maki.call_exn f is similar to f (), but f () is called only if its result couldn't be found on disk. In this case, we promise that actual_shell_invocation prover file returns the same result for fixed values of (config, prover, file) (the prover should be deterministic). Maki.Value contains some builtin serialization operators for basic values, files, programs, integers, lists, and sets.
- the parameter op describes how to serialize and deserialize the result of f (). If the result was cached, op is used to deserialize and return it; if it was not cached, f () is called, and its result is serialized and put on disk.
- the parameter name is the unique name of this computation. Caching is controlled by the pair (name, serialization of deps).
- the optional parameter lifetime controls how long the memoized value will stay on disk (here, we are allowed to remove it if it's not used for 2 days). More on this when I explain maki_gc below.
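To make these ingredients concrete, here is a minimal, self-contained sketch (not taken from the maki repository) that memoizes a small file-based computation: collecting the distinct words of a text file. It only relies on combinators already seen above (pack_file, set string); the name words_of_file and the body are mine.

open Lwt.Infix

(* memoize the set of distinct words occurring in a text file;
   V.pack_file keys the cache on the file's content (via its SHA1),
   so editing the file invalidates the entry and the body is re-run *)
let words_of_file ?limit file : string list Lwt.t =
  let module V = Maki.Value in
  Maki.call_exn
    ?limit
    ~lifetime:(`KeepFor Maki.Time.(days 1))
    ~deps:[V.pack_file file]
    ~op:V.(set string)
    ~name:"words_of_file"
    (fun () ->
       Lwt_stream.to_list (Lwt_io.lines_of_file file) >|= fun lines ->
       CCList.flat_map (CCString.Split.list_cpy ~by:" ") lines
       |> List.map String.trim
       |> List.filter (fun s -> s <> "")
       |> CCList.sort_uniq ~cmp:String.compare)

Calling words_of_file "foo.txt" twice, even in two separate runs of the program, executes the body only once; the second call is answered from the on-disk cache.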
The memoization scheme
Invoking Maki.call_exn ~name ~deps ~op f starts by mapping each
value in deps into a Bencode value.
Each type is encoded in a different way; files are encoded into a pair
(absolute_path, sha1), programs are looked up in $PATH and then
processed like files, lists are mapped to Bencode lists, etc.
Then, name and the encoding of deps are combined into a string s,
and SHA1(s) becomes the unique identifier of this computation (up to
very, very unlikely collisions).
The library looks SHA1(s) up in the file storage. If not found,
then f() is computed, obtaining result, and op.to_bencode(result) is
stored under the key SHA1(s). This way, next time we ask for this
very same computation, SHA1(s) will be a cache hit and f() will
not be called!
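In pseudo-OCaml, the whole scheme fits in a few lines. This is only a conceptual sketch: encode_deps, sha1_hex, storage_get, storage_set and the of_bencode/to_bencode accessors are hypothetical helpers standing in for maki's internals (Bencode encoding plus Maki_storage), not actual functions of the library.

(* conceptual sketch only -- hypothetical helpers, not maki's actual code *)
let call_exn ~name ~deps ~op f =
  let s = name ^ encode_deps deps in   (* name + bencoded dependencies *)
  let key = sha1_hex s in              (* unique id of this computation *)
  match%lwt storage_get key with
  | Some data ->
    (* cache hit: deserialize the stored result and return it *)
    Lwt.return (of_bencode op data)
  | None ->
    (* cache miss: run the computation, then store its encoding *)
    let%lwt result = f () in
    let%lwt () = storage_set key (to_bencode op result) in
    Lwt.return result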
In some sense, this is similar to how git, nix, and the like
store immutable values by content (maki is a kind of Merkle tree since
the result of a computation is hashed before it is used as the
dependency of another computation). The difference is that we do not
access values by their content, but by the computation that produces them.
Note: most operations in maki are cached throughout a given process;
for instance, once the SHA1 of a file is computed, it is cached along with
the file's current mtime (and recomputed only if this timestamp changes).
Cleaning the cache: maki_gc
Values are actually stored on disk with some metadata,
including the lifetime parameter mentioned above. A simple tool called
maki_gc can be used to remove key/value bindings that have expired.
What maki is not
- a reliable long-term storage (it's just caching!).
- a gigantic distributed system. It (currently) only works on one computer, although one can dream of a distributed implementation of Maki_storage.t based on memcache and a DHT.
- stable. Currently it's still in the "experimental" phase, but feedback is welcome!
Case study: a toy build system
I think maki might be useful for many things involving heavy computations,
but since the name is derived from "make" I want to show how to build a
very naive build system for OCaml with it.
The code is here — a bit more than 300 lines for compiling
OCaml libraries and executables, parallelizing jobs, and avoiding
recompiling.
Note: do not use this in production, or kittens will die of sadness.
There is some boilerplate and code for corner cases (mapping module
names to file names, etc.), in addition to parsing an _oasis file to get
a description of libraries and executables.
Let us review the central parts of the build system:
Computing dependencies
The first function is find_deps. It takes a module m,
a list of libraries deps that are required for building m, and
the path in which m lives; it returns the list of modules m
depends upon using a memoized invocation of ocamldep.
(* other modules [m] depends on *)
let find_deps ~deps ~path m : string list Lwt.t =
  let file = module_to_ml ~path m in
  let pdeps = CCList.flat_map (fun d -> ["-package"; d]) deps in
  (* call "ocamldep" *)
  Maki.call_exn ~name:"find_deps" ~limit
    ~deps:[V.pack_program "ocamldep"; V.pack_file file; V.pack_set V.string pdeps]
    ~op:V.(set string)
    (fun () ->
       shellf "@[<h>ocamlfind ocamldep -modules %a %s@]"
         pp_strings_space pdeps file
       >|= fun (out,_,_) ->
       out
       |> CCString.Split.left_exn ~by:":"
       |> snd
       |> CCString.Split.list_cpy ~by:" "
       |> List.map String.trim
       |> List.filter (fun s -> s<>"")
    )
  >|= fun l ->
  Maki_log.logf 5 (fun k->k "deps of %s/%s: %a" path m pp_strings l);
  l
The body of the call consists of calling ocamlfind ocamldep in a shell
and parsing its output.
The result is then memoized using ~op:Maki.Value.(set string) (the order of
the modules does not matter). It depends on the program ocamldep, the
module m (rather, the actual .ml file for m), and the list of library
dependencies. Indeed, if all those are fixed, ocamldep should always return
the same result.
Finding recursive dependencies
When compiling an executable, we are only given the main module's name m,
so we need to discover the list of modules it depends on. The
function find_local_deps_rec returns this list (without duplicates), by a
recursive computation. Note that every step is memoized!
- call find_local_deps (a wrapper around find_deps that only keeps dependencies in the same path), obtaining a list mdeps;
- for each module m' in mdeps, compute its own recursive dependencies (possibly memoized already);
- flatten the result and append mdeps (the immediate dependencies);
- remove duplicates.
(* find recursive deps (without duplicates) *)
let rec find_local_deps_rec ~deps ~path m =
  let%lwt mdeps = find_local_deps ~deps ~path m in
  Maki.call_exn
    ~name:"find_local_deps_rec"
    ~deps:[V.pack_string m; V.pack_string path; V.pack_set V.string deps]
    ~op:V.(set string)
    (fun () ->
       Lwt_list.map_p (find_local_deps_rec ~deps ~path) mdeps
       >|= List.flatten
       >|= (fun l -> List.rev_append mdeps l)
       >|= CCList.sort_uniq ~cmp:String.compare)
Linking order
We skip over build_interface, which invokes ocamlc to build a .cmi,
and move on to something more interesting. When linking an executable, we have to provide
all its modules in a topological order (that is, if A depends on B, then
B has to come earlier than A in the list). Again, this is easy to write
in a naive way, as we can let maki's memoization avoid duplicating
costly computations (such as calls to ocamldep):
- given the list of modules to sort, compute mdeps, the direct dependencies of each module (using find_deps);
- the mapping m -> find_deps m describes a directed graph; compute some topological ordering of this graph (here, using containers' graph module). Since this is relatively cheap, we do not bother memoizing it.
(* find a topological ordering of given modules *)
let topo_order ~path ~deps modules : string list Lwt.t =
  let%lwt mdeps =
    Lwt_list.map_p
      (fun m ->
         find_deps ~deps ~path m
         >|= fun l -> m, List.filter (fun m' -> List.mem m' modules) l)
      modules
  in
  (* build a graph to obtain a topological order *)
  let g = CCGraph.make_tuple (fun m -> List.assoc m mdeps |> CCList.to_seq) in
  let l = CCGraph.topo_sort ~rev:true ~graph:g (CCList.to_seq modules) in
  Lwt.return l
Compiling a .cmo file
Now for the final blow: compile a module into a .cmo bytecode file.
Native compilation is left as an exercise! Again, we just write the naive
code:
- compute dependencies;
- compile each dependency, recursively and in parallel;
- build the .cmi interface;
- build the .cmo file, memoizing the result as a file.
There are several interesting points here:
- compiling a module Foo depends on several values:
  - the ocamlc program (if we modify the compiler, e.g. after an upgrade, then compiled files will change);
  - the input file foo.ml itself;
  - the library dependencies;
  - the list m_deps' (actually, a set): this is the list of .cmo that Foo directly depends upon. I think this is overkill for bytecode compilation, but it plays a role in native compilation, since inlining can happen between modules.
- the op (used to serialize and deserialize the result) is Maki.Value.file. This means that the computation returns a file name, and the file's content is assumed to be the actual computation's result. In this case, the on-disk memoization table only contains the output's path + SHA1. Upon cache hit, maki will check whether the file exists and has the correct hash; otherwise the computation is done again.
- the computation might fail (here, it just checks whether the expected output file was produced).
(* build module [m] (after building its dependencies).
   @param path path in which [m] lives
   @param deps libraries [m] depends upon *)
let rec build_module ~path ~deps m : Maki.path Lwt.t =
  (* compute deps *)
  let%lwt m_deps = find_local_deps ~deps ~path m in
  (* build deps, obtain the resulting .cmo files *)
  let%lwt m_deps' =
    Lwt_list.map_p (fun m' -> build_module ~path ~deps m') m_deps
  in
  (* build interface *)
  let%lwt _ = build_interface ~path ~deps m in
  let file_ml = module_to_ml ~path m in
  let file_cmo = module_to_cmo ~path m in
  Maki.call_exn
    ~name:"build_module" ~limit
    ~lifetime:(`KeepFor Maki.Time.(hours 2))
    ~deps:[V.pack_program "ocamlc"; V.pack_file file_ml;
           V.pack_set V.string deps; V.pack_set V.file m_deps']
    ~op:V.file
    (fun () ->
       shellf "@[<h>ocamlfind ocamlc -I %s -c %a %a %s -o %s@]"
         path
         pp_strings_space (deps_to_args deps)
         pp_strings_space m_deps'
         file_ml file_cmo
       >>= fun (o,e,_) ->
       if Sys.file_exists file_cmo
       then Lwt.return file_cmo
       else failf_lwt "failed to build %s for %s:\n%s\n%s" file_cmo m o e)
After that, there is some glue code for parsing command-line arguments,
linking libraries and executables (using topo_order from above),
parsing the _oasis file, etc.
I tested this build system on a few small, pure OCaml projects (it doesn't
handle C bindings or packs!); it works and does a decent job at
parallelizing. Note, however, that it computes more hashes than most build
systems (which are happy using timestamps as a shortcut, since they
are not interested in the produced files' content, but only in their
being up to date w.r.t. their sources).
Conclusion
Well, that was quite a long post! I intended to give a tour of maki's original
intent and design, and to demonstrate how to use it by showing this
baroque build-system-that-you-should-not-actually-use.
The idea should be quite straightforward to port to other languages, in particular Haskell, where the notion of purity is enforceable in types (although here we cut the user some slack, allowing her to run sub-processes, create files, etc., as long as determinism is guaranteed).