10.1. File Input and Output in OCaml

File: ReadingFiles.ml

Any realistic program interacts with the outside world, either by getting input from the user via a textual or graphical interface, or by reading from and writing to files.

Input/Output (IO) with files in OCaml can be implemented in multiple ways, and we will employ some of the state-of-the-art libraries that provide convenient mechanisms to do so. In order to compile and run the rest of this lecture, please make sure that you have the packages core and batteries installed via opam:

opam install core batteries

Amongst other things, core redefines and enhances some of the familiar modules that we have used before, such as List and Array. Specifically, it heavily uses labelled (named) arguments for functions. Such an argument requires its label to be provided before the value passed (in the form ~name:value). In return, labelled arguments can be placed at any position in the argument list. As an example, the following expression, written with the standard List module:

List.filter (fun x -> x > 1) [1; 2; 3];;

can be written, using the version of List from core, as follows:

List.filter ~f:(fun x -> x > 1) [1; 2; 3];;

or:

List.filter [1; 2; 3] ~f:(fun x -> x > 1);;

Since the parameter f is named, it can be located at any position.
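
Labelled arguments are not specific to core: any function can declare them. The following minimal sketch defines a hypothetical function clamp (it is not part of core or of this lecture's code) with two labelled arguments, and calls it with the labels given in different orders:

```ocaml
(* [clamp] is a hypothetical example function: it restricts [x] to the
   interval between [lo] and [hi]. Both bounds are labelled arguments. *)
let clamp ~(lo : int) ~(hi : int) (x : int) : int =
  if x < lo then lo else if x > hi then hi else x

(* Labelled arguments may be passed in any order. *)
let a = clamp ~lo:0 ~hi:10 42
let b = clamp ~hi:10 ~lo:0 42

(* When a variable has the same name as the label,
   [~lo] abbreviates [~lo:lo]. *)
let c =
  let lo = 0 and hi = 10 in
  clamp ~lo ~hi (-5)

let () = assert (a = 10 && b = 10 && c = 0)
```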

10.1.1. Reading and Writing with Channels

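In an operating system, files can be accessed for reading and writing by multiple applications concurrently, so every program has to acquire and release its access to a file explicitly. OCaml does this via channels: an input channel (In_channel) is a buffered handle for reading from an open file, and an output channel (Out_channel) is a buffered handle for writing to one. Note that a channel by itself does not lock the file, so it does not prevent other applications from reading or modifying the file at the same time.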

A channel for reading can be used as in the following example that reads all lines from a file with the path filename:

let read_file_to_strings filename =
  let file = In_channel.create filename in
  let strings = In_channel.input_lines file in
  In_channel.close file;
  strings

Notice that before the function returns its result, it has to close the channel, thus releasing the underlying file handle. This should be done because operating systems limit the number of files that can be open simultaneously for reading and writing.

In OCaml, the pattern of opening a file, reading from it, and closing the channel after completing the operation is captured by the with_file function of In_channel, which takes a file name and a function f that tells how to obtain a result from the input channel input of the file:

let read_file_to_single_string filename =
  In_channel.with_file filename ~f:(fun input ->
      In_channel.input_all input)
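
To illustrate what such a function takes care of, here is a minimal sketch of a similar helper written with the standard library only. The names with_input_file and first_line are hypothetical, and the actual implementation of with_file in core may differ; the point is only that the channel gets closed whether f returns normally or raises an exception:

```ocaml
(* [with_input_file] is a hypothetical stand-in for [In_channel.with_file],
   written with the standard library only. It opens [filename], passes the
   channel to [f], and closes the channel even if [f] raises an exception. *)
let with_input_file (filename : string) ~(f : Stdlib.in_channel -> 'a) : 'a =
  let ic = Stdlib.open_in filename in
  Stdlib.Fun.protect
    ~finally:(fun () -> Stdlib.close_in_noerr ic)
    (fun () -> f ic)

(* Usage: read the first line of a file, or [None] if the file is empty. *)
let first_line (filename : string) : string option =
  with_input_file filename ~f:(fun ic ->
      try Some (Stdlib.input_line ic) with Stdlib.End_of_file -> None)
```

Compared to read_file_to_strings above, the caller never has to remember to close the channel, and an exception raised while reading does not leak the open file.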

Writing to files is done similarly, although the corresponding functions for manipulating output channels take some additional parameters:

let write_string_to_file filename text =
  let outc = Out_channel.create ~append:false filename in
  Out_channel.output_string outc text;
  Out_channel.close outc

let write_strings_to_file filename lines =
  Out_channel.with_file ~append:false ~fail_if_exists:false
    filename ~f:(fun out -> List.iter lines ~f:(fun s -> Out_channel.fprintf out "%s\r\n" s))

For instance, both Out_channel.create and Out_channel.with_file take optional parameters ~append and ~fail_if_exists (which come with default values) that determine the behaviour when the file already exists. By passing ~append:false we indicate that the contents of the file need to be overwritten, rather than appended to.
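
The difference between overwriting and appending can also be observed with the standard library alone. The sketch below is not how Out_channel.create is implemented; write_text is a hypothetical helper that selects between the open flags Open_trunc and Open_append, and the file permissions 0o644 are an assumption:

```ocaml
(* [write_text] is a hypothetical helper built on the standard library's
   [open_out_gen]. With [~append:false] the file is truncated before
   writing; with [~append:true] the text is added at the end. *)
let write_text ~(append : bool) (filename : string) (text : string) : unit =
  let mode = if append then Stdlib.Open_append else Stdlib.Open_trunc in
  let oc =
    Stdlib.open_out_gen
      [ Stdlib.Open_wronly; Stdlib.Open_creat; mode; Stdlib.Open_text ]
      0o644 filename
  in
  Stdlib.Fun.protect
    ~finally:(fun () -> Stdlib.close_out_noerr oc)
    (fun () ->
      Stdlib.output_string oc text;
      (* Flush explicitly so that write errors are not silently ignored. *)
      Stdlib.flush oc)
```

For example, calling write_text ~append:false f "one" followed by write_text ~append:true f "two" leaves the file with the contents onetwo, while a further call with ~append:false replaces everything written so far.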

10.1.2. Copying Files

We can use the functions above to copy files:

let copy_file old_file new_file =
  let contents = read_file_to_single_string old_file in
  write_string_to_file new_file contents
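
Note that copy_file loads the entire file into memory as a single string, which is fine for small files. For large files one may prefer to copy in fixed-size chunks. The following is a sketch under stated assumptions: it uses only the standard library, and both the name copy_file_chunked and the 64 KiB buffer size are arbitrary choices made for illustration:

```ocaml
(* Copies [old_file] to [new_file] through a fixed-size buffer, so that
   memory use does not depend on the size of the file. *)
let copy_file_chunked (old_file : string) (new_file : string) : unit =
  let buffer = Stdlib.Bytes.create 65536 in
  let ic = Stdlib.open_in_bin old_file in
  Stdlib.Fun.protect
    ~finally:(fun () -> Stdlib.close_in_noerr ic)
    (fun () ->
      let oc = Stdlib.open_out_bin new_file in
      Stdlib.Fun.protect
        ~finally:(fun () -> Stdlib.close_out_noerr oc)
        (fun () ->
          (* [input] returns 0 only when the end of the file is reached. *)
          let rec loop () =
            let n = Stdlib.input ic buffer 0 (Stdlib.Bytes.length buffer) in
            if n > 0 then begin
              Stdlib.output oc buffer 0 n;
              loop ()
            end
          in
          loop ();
          Stdlib.flush oc))
```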

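Any Unix-like system comes with utilities that compute a checksum or hash of a file. These can be used to check that the contents of a file are intact, for example, that a copy is identical to the original. This can be done for a file filename using either of the following commands (on most Linux distributions the second one is called md5sum rather than md5):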

cksum filename

or:

md5 filename
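
A similar check can be made from within OCaml. The sketch below relies on the Digest module of the standard library, whose top-level functions compute MD5 digests; the helper names md5_of_file and same_digest are hypothetical. The hexadecimal string it produces should coincide with the digest printed by the md5 utility for the same file:

```ocaml
(* Hexadecimal MD5 digest of the contents of a file. *)
let md5_of_file (filename : string) : string =
  Stdlib.Digest.to_hex (Stdlib.Digest.file filename)

(* Checks that two files have the same MD5 digest, which, barring hash
   collisions, means that their contents are the same. *)
let same_digest (file1 : string) (file2 : string) : bool =
  Stdlib.String.equal (md5_of_file file1) (md5_of_file file2)
```

For instance, same_digest old_file new_file is expected to return true after a successful call to copy_file old_file new_file.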

10.1.3. Representing Strings

One can think of files as sequences of 0s and 1s stored in a computer’s memory. How can one tell whether a file stores “text” or is “binary”?

Text files are identified (usually empirically) according to the encoding used to represent text in them. One of the most common encodings, ASCII, uses 7-bit codes, each stored in one 8-bit byte (OCaml type char), to encode 128 characters, including upper- and lowercase letters of the Latin alphabet, digits, and some punctuation marks. Another encoding, UTF-8, uses sequences of one to four bytes per character: it encodes the ASCII characters exactly as ASCII does, and uses longer sequences for the letters of most existing alphabets. OCaml strings are treated as sequences of bytes (represented by the data type char). Therefore, the characters from ASCII are represented by char accurately, while the other UTF-8 characters are broken into several bytes when considered as string components. The difference can be observed via the following example:

utop # let ascii_string = "ATR";;
val ascii_string : string = "ATR"
utop # String.length ascii_string;;
- : int = 3
utop # ascii_string.[2];;
- : char = 'R'

Let us try a string that has a Cyrillic character, which UTF-8 encodes with two bytes:

utop # let utf8_string = "ATЯ";;
val utf8_string : string = "ATЯ"
utop # String.length utf8_string;;
- : int = 4
utop # utf8_string.[2];;
- : char = '\208'
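
To make the byte-level picture explicit, the following sketch lists the bytes of such a string and counts its characters rather than its bytes. It assumes that the string is valid UTF-8 (where the Cyrillic letter above is stored as the two bytes 208 and 175), and the helper names byte_codes and utf8_length are hypothetical:

```ocaml
(* The string "AT" followed by one Cyrillic letter, written with the
   explicit UTF-8 bytes 0xD0 0xAF so that the example does not depend
   on the encoding of the source file. *)
let sample = "AT\xD0\xAF"

(* The numeric codes of all bytes of a string. *)
let byte_codes (s : string) : int list =
  Stdlib.List.init (Stdlib.String.length s)
    (fun i -> Stdlib.Char.code (Stdlib.String.get s i))

(* Number of characters in a valid UTF-8 string: every character
   contributes exactly one byte that is not a continuation byte, and
   continuation bytes have the bit pattern 10xxxxxx. *)
let utf8_length (s : string) : int =
  let count = ref 0 in
  Stdlib.String.iter
    (fun c -> if Stdlib.Char.code c land 0xC0 <> 0x80 then incr count)
    s;
  !count
```

Here byte_codes sample evaluates to [65; 84; 208; 175], and utf8_length sample evaluates to 3, whereas String.length sample is 4.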

When working with strings, the following functions, implemented via core machinery, will come in handy:

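(* Removes leading and trailing newlines, spaces, and carriage returns. *)
let trimmer = Core.String.strip
   ~drop:(fun c -> Core.List.mem ['\n'; ' '; '\r'] c
                    ~equal:(fun a b -> equal_char a b))

(* Splits a string at newlines, spaces, and carriage returns,
   discarding the empty fragments between adjacent separators. *)
let splitter s =
  String.split_on_chars ~on:['\n'; ' '; '\r'] s |>
  List.filter ~f:(fun s -> not @@ String.is_empty s)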
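
To see what splitter computes without depending on core, here is a sketch of a function with the same intended behaviour that uses only the standard library. The name split_words is hypothetical; the function first turns the other two separators into spaces and then splits on spaces:

```ocaml
(* Splits [s] at newlines, carriage returns, and spaces, and drops the
   empty fragments that appear between adjacent separators. *)
let split_words (s : string) : string list =
  s
  |> Stdlib.String.map (function '\n' | '\r' -> ' ' | c -> c)
  |> Stdlib.String.split_on_char ' '
  |> Stdlib.List.filter (fun w -> Stdlib.String.length w > 0)

(* words is ["to"; "be"; "or"; "not"] *)
let words = split_words "to be\r\nor  not\n"
```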