The function of the OCaml standard lib is not implemented tail-recursively. The current version (as of 4.00 on my computer) is

let rec map f = function
    [] -> []
  | a::l -> let r = f a in r :: map f l

It's pretty efficient on small lists, but blows up the stack on big lists. After discussing on IRC with a friend of mine, that loves C and believes functional languages are too slow, I tried to write an efficient, although totally unsafe, version of The trick is to use the module Obj to manipulate the local, non-shared list we are creating. The code is:

let unsafe_map f l =
  let rec recurse first prev l =
  match l with
    | [] -> assert false
    | x::[] ->
    let l' = [f x] in
    Obj.set_field (Obj.repr prev) 1 (Obj.repr l'); first
    | x::l' ->
      let cur = [f x] in
      Obj.set_field (Obj.repr prev) 1 (Obj.repr cur);
      recurse first cur l'
  in match l with
    | [] -> []
    | x::l' -> let first = [f x] in recurse first first l'

So, it's uglier, and longer. However, it is tail-recursive, and allocates only n elements, where n is the length of the list l. The helper function recurse traverses l, applying f to its elements, keeping a handle on the previous list node. It then modifies (unsafe) the next pointer of the previous node to point to the new one, and continue until it reaches the last node. Then it can return a pointer to the very first node of the list.

Let's benchmark it:

let bench n =
  (* generate big list *)
  let rec generate acc i = match i with
    | 0 -> acc
    | _ -> generate (i :: acc) (i-1)
  let l = generate [] n in
  let t1 = Unix.gettimeofday () in
  let l' = unsafe_map f l in
  let t2 = Unix.gettimeofday () in
  let l'' = List.rev (List.rev_map f l) in
  let t3 = Unix.gettimeofday () in
  Format.printf "%d elements: %fs for unsafe, %f for safe@."
    n (t2 -. t1) (t3 -. t2)

let _ =
  List.iter bench [10; 100; 10000; 1000000; 10000000]

Here we compare the unsafe_map implementation with fun f l -> List.rev ( f l), but not with because it cannot handle very long lists. The results on my laptop (with a dual-core "Genuine Intel(R) CPU U7300 @ 1.30GHz", according to /proc/cpuinfo), are:

10 elements: 0.000004s for unsafe, 0.000002 for safe
100 elements: 0.000005s for unsafe, 0.000005 for safe
10000 elements: 0.000395s for unsafe, 0.000490 for safe
1000000 elements: 0.186290s for unsafe, 0.470396 for safe
10000000 elements: 2.000005s for unsafe, 4.958645 for safe

So we see that allocating twice as few list nodes pays off quite quickly.