Maki: on-disk memoization for (deterministic) fun and profit
Greetings, earthlings! Today I will present an exciting new OCaml library called maki. This post will be longer than usual, because it is actually two posts in one:
- first, I'll present the goals of Maki, sketch its design, and describe the most salient parts of its API;
- then I will demonstrate how to use it to quickly build a (very naive) OCaml build system, featuring parallel build and cached computations thanks to Maki.
Overview of Maki
In a nutshell: maki is an OCaml library for on-disk memoization. It allows you to cache the results of pure computations on disk, with special support for computations that input or output files (and that are deterministic).
The initial use-case for it is to memoize the results of running theorem
provers on a large set of problems (to benchmark them), so that
the hours- or days-long computation is parallelized and interruptible at any
moment: to resume an interrupted computation, just re-compute everything
from scratch and let the memoization take care of not repeating what was
already done. In other words, it looks like this:
open Lwt.Infix

let benchmark
  : prover list -> file list -> (prover * result list) list Lwt.t
  = fun provers files ->
    (* no more than 10 simultaneous jobs *)
    let limit = Maki.Limit.create 10 in
    Lwt_list.map_p
      (fun p ->
         Lwt_list.map_p
           (fun f ->
              (* here be magic *)
              Maki.call ~limit (fun () -> run_prover_on_file p f))
           files
         >|= fun results ->
         (p, results))
      provers
This code would be very naive if it weren't for maki, because:
- we would be running length provers * length files sub-processes at the same time; if we have a lot of files this will freeze the machine or just not work;
- if we stop the computation before it completes, all the work is lost.
But the point is that Maki.call (fun () -> run_prover_on_file p f) will only be computed once for a given pair (prover, file); if it was already computed before we kill and restart the computation, then this call returns immediately with the result. The parameter limit is there to limit the number of jobs that run concurrently.
Maki builds on SHA1 (values are memoized by the hash of their computation)
and Lwt (for concurrency and IOs;
limit
is basically just Lwt_pool.t
underneath).
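For readers who have not used Lwt_pool before, the idea behind limit can be sketched directly with Lwt_pool (illustrative only; maki wraps this behind Maki.Limit for you):

(* a pool of 10 unit "slots": it carries no state, it only bounds concurrency *)
let limit : unit Lwt_pool.t = Lwt_pool.create 10 (fun () -> Lwt.return_unit)

(* at most 10 calls to [job] run at the same time *)
let run_limited job = Lwt_pool.use limit (fun () -> job ())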
Assume we have a list of files f1, …, f100, and provers p1 and p2. If we call benchmark [p1;p2] [f1;...;f100] and want to print the results, nothing could be easier:
(* ... *)
let benchmark_and_print provers files =
  benchmark provers files >>= fun l ->
  print_results l

let () =
  Lwt_main.run (benchmark_and_print [p1;p2] [f1;...;f100])
We can just run benchmark_and_print
on the same list of provers and files;
it will call benchmark
(which returns quickly because all its calls
to run_prover_on_file
have been memoized), and print its results.
If we add a few files f101, …, f105 to the list, the only work to do is calling run_prover_on_file p1 f10{1..5} and run_prover_on_file p2 f10{1..5}; the other 200 invocations are cached and return immediately. If we hadn't run benchmark earlier, however, then benchmark_and_print would first do all the computations, cache them, and print the result.
The point is that for the user, everything happens as if the pure computations were done every time; the only difference is speed.
The design and implementation
The code first: here is the repository (and the current API documentation). The most important modules are:
- Maki: the main module and entry point to the library. It contains a few submodules; in particular, Maki.Value is used to describe how to store memoized results on disk and retrieve them.
- Maki_storage: a basic abstraction of a persistent key-value store, with a very simple default implementation that stores a pair key -> value as a file ~/.cache/maki/key containing value. It is possible to provide one's own implementation, for instance using a database.
- Maki_bencode: some serialization support.
- Maki_log: used mostly for debugging purposes.
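To make the default storage concrete, here is a deliberately simplified sketch of the "one file per key under ~/.cache/maki" idea (plain OCaml for illustration only; this is not the actual Maki_storage implementation and it ignores error handling):

(* illustration only: a file-per-key store in ~/.cache/maki
   (assumes the directory already exists) *)
let dir = Filename.concat (Sys.getenv "HOME") ".cache/maki"

let set key value =
  let oc = open_out (Filename.concat dir key) in
  output_string oc value;
  close_out oc

let get key =
  let file = Filename.concat dir key in
  if Sys.file_exists file then (
    let ic = open_in file in
    let n = in_channel_length ic in
    let v = really_input_string ic n in
    close_in ic;
    Some v
  ) else None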
In practice, memoizing a function is done this way:
let call_prover_on_file ?limit prover file =
  let module V = Maki.Value in
  Maki.call_exn
    ?limit
    ~lifetime:(`KeepFor Maki.Time.(days 2))
    ~deps:[V.pack Config.maki config;
           V.pack_program prover;
           V.pack_file file]
    ~op:Res.maki
    ~name:"call_prover_on_file"
    (fun () -> actual_shell_invocation prover file)
What's going on here?
- Maki.call_exn : […other arguments…] -> (unit -> 'a Lwt.t) -> 'a Lwt.t is the actual memoization function. There is also Maki.call : […] -> (unit -> 'a Lwt.t) -> ('a, exn) result Lwt.t when error handling matters.
- the parameter deps (not optional) describes the set of parameters f depends upon. When calling Maki.call_exn ~deps f, the user promises that f () always returns the same result for the same value of deps. Conceptually, Maki.call_exn f is similar to f (), but f () is called only if its result couldn't be found on disk. In this case, we promise that actual_shell_invocation prover file returns the same result for fixed values of (config, prover, file) (the prover should be deterministic). Maki.Value contains some builtin serialization operators for basic values, files, programs, integers, lists, and sets.
- the parameter op describes how to serialize and deserialize the result of f (). If the result was cached, op is used to deserialize and return it; if it was not cached, f () is called, and its result is serialized and put on disk.
- the parameter name is the unique name of this computation. Caching is controlled by the pair (name, serialization of deps).
- the optional parameter lifetime controls how long the memoized value will stay on disk (here, we are allowed to remove it if it's not used for 2 days). More on this when I explain maki_gc below.
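As a smaller, self-contained illustration of these parameters, here is how one might memoize a trivial computation on a single file (a sketch: count_lines is not part of the post's code, and it assumes V.int is the builtin operator for integers mentioned above):

(* count the lines of [file]; recomputed only when the file's content changes *)
let count_lines file : int Lwt.t =
  let module V = Maki.Value in
  Maki.call_exn
    ~name:"count_lines"
    ~deps:[V.pack_file file]
    ~op:V.int  (* assumed: the builtin operator for integers *)
    (fun () ->
       Lwt_stream.fold (fun _line n -> n + 1) (Lwt_io.lines_of_file file) 0)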
The memoization scheme
Invoking Maki.call_exn ~name ~deps ~op f
starts by mapping each
value in deps
into a Bencode value.
Each type is encoded in a different way; files are encoded into a pair
(absolute_path, sha1)
, programs are looked up in $PATH
and then
processed like files, lists are mapped to Bencode
lists, etc.
Then, name and the encoding of deps are combined into a string s, and SHA1(s) becomes the unique identifier of this computation (up to very, very unlikely collisions).
The library looks SHA1(s)
up in the file storage. If not found,
then f()
is computed, obtaining result
, and op.to_bencode(result)
is
stored under the key SHA1(s)
. This way, next time we ask for this
very same computation, SHA1(s)
will be a cache hit and f()
will
not be called!
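To summarize, the whole scheme fits in a few lines of pseudocode (the helpers sha1, bencode_of_deps, encode, decode, storage_find and storage_add are made up and stand in for maki's internals; this is not the library's real code):

(* pseudocode: roughly what Maki.call_exn does, with made-up helpers *)
let call_exn ~name ~deps ~op f =
  (* the computation's unique identifier *)
  let key = sha1 (name ^ bencode_of_deps deps) in
  match storage_find key with
  | Some data -> Lwt.return (decode op data)   (* cache hit: just deserialize *)
  | None ->
    f () >>= fun result ->                     (* cache miss: run the function, *)
    storage_add key (encode op result);        (* serialize and store its result *)
    Lwt.return result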
In some sense, this is similar to how git, nix, and the like
store immutable values by content (maki
is a kind of Merkle tree since
the result of a computation is hashed before it is used as the
dependency of another computation). The difference is that we do not
access values by their content, but by the computation that produces them.
Note: most operations in maki
are cached throughout a given process;
for instance, once the SHA1 of a file is computed, it is cached along with
the file's current mtime
(and recomputed only if this timestamp changes).
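The idea behind that in-process cache is roughly the following (an illustrative sketch, not maki's code; compute_sha1 is a hypothetical stand-in for the actual hashing function):

(* in-process (file -> sha1) cache, invalidated when the mtime changes *)
let sha1_cache : (string, float * string) Hashtbl.t = Hashtbl.create 64

let sha1_of_file file =
  let mtime = (Unix.stat file).Unix.st_mtime in
  match Hashtbl.find_opt sha1_cache file with
  | Some (m, h) when m = mtime -> h              (* mtime unchanged: reuse *)
  | _ ->
    let h = compute_sha1 file in                 (* hypothetical hashing function *)
    Hashtbl.replace sha1_cache file (mtime, h);
    h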
Cleaning the cache: maki_gc
Values are actually stored on disk with a few meta-data attributes,
including the lifetime
parameter mentioned above. A simple tool called
maki_gc
can be used to remove key/value bindings that have expired.
What maki is not
- a reliable long-term storage (it's just caching!).
- a gigantic distributed system. It (currently) only works on one computer, although one can dream of a distributed implementation of Maki_storage.t based on memcache and a DHT.
- stable. Currently it's still in the "experimental" phase, but feedback is welcome!
Case study: a toy build system
I think maki
might be useful for many things involving heavy computations,
but since the name is derived from "make" I want to show how to build a
very naive build system for OCaml with it.
The code is here — a bit more than 300 lines for compiling
OCaml libraries and executables, parallelizing jobs, and avoiding
recompiling.
Note: do not use this in production, or kittens will die of sadness.
There is some boilerplate and code for corner cases, mapping module names to file names, etc., in addition to parsing an _oasis file to get a description of libraries and executables.
Let us review the central parts of the build system:
Computing dependencies
The first function is find_deps
. It takes a module m
,
a list of libraries deps
that are required for building m
, and
the path
in which m
lives; it returns the list of modules m
depends upon using a memoized invocation of ocamldep
.
(* other modules [m] depends on *)
let find_deps ~deps ~path m : string list Lwt.t =
  let file = module_to_ml ~path m in
  let pdeps = CCList.flat_map (fun d -> ["-package"; d]) deps in
  (* call "ocamldep" *)
  Maki.call_exn ~name:"find_deps" ~limit
    ~deps:[V.pack_program "ocamldep"; V.pack_file file; V.pack_set V.string pdeps]
    ~op:V.(set string)
    (fun () ->
       shellf "@[<h>ocamlfind ocamldep -modules %a %s@]"
         pp_strings_space pdeps file
       >|= fun (out,_,_) ->
       out
       |> CCString.Split.left_exn ~by:":"
       |> snd
       |> CCString.Split.list_cpy ~by:" "
       |> List.map String.trim
       |> List.filter (fun s -> s<>""))
  >|= fun l ->
  Maki_log.logf 5 (fun k->k "deps of %s/%s: %a" path m pp_strings l);
  l
The body of the call consists of calling ocamlfind ocamldep in a shell and parsing the output.
This result is then memoized using ~op:Maki.Value.(set string)
(the order of
the modules does not matter). It depends on the program ocamldep
, the
module m
(rather, the actual .ml
file for m
) and the list of library
dependencies. Indeed, if all those are fixed, ocamldep
should always return
the same result.
Finding recursive dependencies
When compiling an executable, we are only given the main module's name m
,
so we need to discover the list of modules it depends on. The
function find_local_deps_rec
returns this list (without duplicates), by a
recursive computation. Note that every step is memoized!
- call find_local_deps (which is a wrapper of find_deps that only keeps dependencies in the same path), obtain a list mdeps
- for each module m' in mdeps, compute its own recursive dependencies (possibly memoized already)
- flatten the result and append mdeps (immediate dependencies)
- remove duplicates
(* find recursive deps (without duplicates) *)
let rec find_local_deps_rec ~deps ~path m =
  let%lwt mdeps = find_local_deps ~deps ~path m in
  Maki.call_exn
    ~name:"find_local_deps_rec"
    ~deps:[V.pack_string m; V.pack_string path; V.pack_set V.string deps]
    ~op:V.(set string)
    (fun () ->
       Lwt_list.map_p (find_local_deps_rec ~deps ~path) mdeps
       >|= List.flatten
       >|= (fun l -> List.rev_append mdeps l)
       >|= CCList.sort_uniq ~cmp:String.compare)
Linking order
We skip over build_interface, which invokes ocamlc to build a .cmi,
for something more interesting. When linking an executable, we have to provide
all its modules in a topological order (that is, if A
depends on B
, then
B
has to come earlier than A
in the list). Again this is easy to write
in a naive way, as we can let maki
's memoization avoid duplicating
costly computations (such as calls to ocamldep
):
- given the list of modules to sort, compute mdeps, the direct dependencies of each module (using find_deps);
- the mapping m -> find_deps m describes a directed graph; compute some topological ordering of this graph (here, using containers' graph module). Since it is relatively cheap, we do not bother memoizing it.
(* find a topological ordering of given modules *)
let topo_order ~path ~deps modules : string list Lwt.t =
  let%lwt mdeps =
    Lwt_list.map_p
      (fun m ->
         find_deps ~deps ~path m
         >|= fun l -> m, List.filter (fun m' -> List.mem m' modules) l)
      modules
  in
  (* build a graph to obtain a topological order *)
  let g = CCGraph.make_tuple (fun m -> List.assoc m mdeps |> CCList.to_seq) in
  let l = CCGraph.topo_sort ~rev:true ~graph:g (CCList.to_seq modules) in
  Lwt.return l
Compiling a .cmo file
Now for the final blow: compile a module into a .cmo
bytecode file.
Native compilation is left as an exercise! Again, we just write the naive
code:
- compute dependencies;
- compile each dependency, recursively and in parallel;
- build the .cmi interface;
- build the .cmo file, memoizing the result as a file.
There are several interesting points here:
- compiling a module Foo depends on several values:
  - the ocamlc program (if we modify the compiler, e.g. after an upgrade, then compiled files will change);
  - the input file foo.ml itself;
  - the library dependencies;
  - the list m_deps' (actually, a set): this is the list of .cmo that Foo directly depends upon. I think this is overkill for bytecode compilation, but it plays a role in native compilation since inlining can happen between modules.
- the op (used to serialize and deserialize the result) is Maki.Value.file. This means that the computation returns a file name, and the file's content is assumed to be the actual computation's result. In this case, the on-disk memoization table only contains the output's path + SHA1. Upon cache hit, maki will check if the file exists and if it has the correct hash; otherwise the computation is done again.
- the computation might fail (here, it just checks if the expected output file was produced).
(* build module [m] (after building its dependencies).
   @param path path in which [m] lives
   @param deps library [m] depends upon *)
let rec build_module ~path ~deps m : Maki.path Lwt.t =
  (* compute deps *)
  let%lwt m_deps = find_local_deps ~deps ~path m in
  (* build deps, obtain the resulting .cmo files *)
  let%lwt m_deps' =
    Lwt_list.map_p (fun m' -> build_module ~path ~deps m') m_deps
  in
  (* build interface *)
  let%lwt _ = build_interface ~path ~deps m in
  let file_ml = module_to_ml ~path m in
  let file_cmo = module_to_cmo ~path m in
  Maki.call_exn
    ~name:"build_module" ~limit
    ~lifetime:(`KeepFor Maki.Time.(hours 2))
    ~deps:[V.pack_program "ocamlc"; V.pack_file file_ml;
           V.pack_set V.string deps; V.pack_set V.file m_deps']
    ~op:V.file
    (fun () ->
       shellf "@[<h>ocamlfind ocamlc -I %s -c %a %a %s -o %s@]"
         path
         pp_strings_space (deps_to_args deps)
         pp_strings_space m_deps'
         file_ml file_cmo
       >>= fun (o,e,_) ->
       if Sys.file_exists file_cmo
       then Lwt.return file_cmo
       else failf_lwt "failed to build %s for %s:\n%s\n%s" file_cmo m o e)
After that, there is some glue code for parsing command line arguments,
linking libraries and executables (using topo_sort
from above),
parsing the _oasis
file, etc.
I tested this build system on a few small, pure OCaml projects (it doesn't handle C bindings or packs!); it works and does a decent job at parallelizing. Note, however, that it computes more hashes than most build systems (which are happy to use timestamps as a shortcut, since they are not interested in the produced files' content, but only in whether they are up to date w.r.t. their sources).
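To make the contrast concrete, here is the difference between the two up-to-date checks, in illustrative pseudocode (storage_mem and sha1_of_deps are made-up helpers, not maki's API):

(* make-style: a target is up to date if it is newer than all of its sources *)
let up_to_date_mtime ~target ~sources =
  Sys.file_exists target &&
  List.for_all
    (fun src -> (Unix.stat src).Unix.st_mtime <= (Unix.stat target).Unix.st_mtime)
    sources

(* maki-style: the result is keyed by the hash of its inputs, so the check is
   "does the store contain an entry for sha1(deps)?" *)
let up_to_date_hash ~deps = storage_mem (sha1_of_deps deps)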
Conclusion
Well, that was quite a long post! I intended to give a tour of maki's original intent and design, and to demonstrate how to use it by showing this baroque build system that-you-should-not-actually-use.
The idea should be quite straightforward to port to other languages, in particular Haskell, where the notion of purity is enforceable in types (although here we cut the user some slack, allowing her to run sub-processes, create files, etc., as long as determinism is guaranteed).