The post How to Compute Fibonacci Numbers? appeared first on niche computing science.

Let `Nat` be the type of natural numbers. We shall all be familiar with the following definition of Fibonacci numbers:

```
fib :: Nat -> Nat
fib 0     = 0
fib 1     = 1
fib (n+2) = fib (n+1) + fib n
```

(When defining functions on natural numbers I prefer to see `0` and `(+1)` (and thus `(+2) = (+1) . (+1)`) as constructors that can appear on the LHS, while avoiding subtraction on the RHS. It makes some proofs more natural, and it is not hard to recover the Haskell definition anyway.)

Executing the definition without other support (such as memoisation) gives you a very slow algorithm, due to lots of re-computation. Some programming textbooks I had in the 80's wrongly used this as evidence that “recursion is slow” (`fib` is usually one of the only two examples in the sole chapter on recursion in such books, the other being tree traversal).

By defining `fib2 n = (fib n, fib (n+1))`, one can easily derive an inductive definition of `fib2`:

```
fib2 :: Nat -> (Nat, Nat)
fib2 0     = (0, 1)
fib2 (n+1) = (y, x+y)
  where (x,y) = fib2 n
```

which computes `fib n` (and `fib (n+1)`) in `O(n)` recursive calls. Be warned, however, that this does not imply that `fib2 n` runs in `O(n)` time, as we shall see soon.
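Since n+k patterns are no longer part of standard Haskell, here is one way the definition above might be rendered in plain Haskell (a sketch, using `Integer` in place of `Nat`):

```haskell
-- fib2 in standard Haskell (no n+k patterns), with Integer for Nat.
-- fib2 n computes (fib n, fib (n+1)) in O(n) recursive calls.
fib2 :: Integer -> (Integer, Integer)
fib2 0 = (0, 1)
fib2 n = let (x, y) = fib2 (n - 1)
         in (y, x + y)

main :: IO ()
main = print (fib2 10)  -- (55,89)
```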

To be even faster, some might recall, do we not have a closed-form formula for Fibonacci numbers?

```
fib n = (((1+√5)/2)^n - ((1-√5)/2)^n) / √5
```

It was believed that the formula was discovered by Jacques P. M. Binet in 1843, thus by convention we call it *Binet's formula*, although the formula can be traced back earlier. Proving (or even discovering) the formula is a very good exercise in inductive proofs. On that I recommend this tutorial by Joe Halpern (CS 280 @ Cornell, 2005). Having a closed-form formula gives one the impression that it yields a quick algorithm. Some even claim that it delivers an `O(1)` algorithm for computing Fibonacci numbers. One shall not assume, however, that `((1+√5)/2)^n` and `((1-√5)/2)^n` can always be computed in a snap!

When processing large numbers, we cannot assume that arithmetic operations such as addition and multiplication take constant time. In fact, it is fascinating knowing that multiplying large numbers, something that appears to be the most fundamental, is a research topic that can still see new breakthrough in 2019 [HvdH19].
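To see the point concretely, here is a sketch (my code, not from Holloway's thesis) comparing a naive Binet computation in `Double` arithmetic against an exact iterative computation; with only 53 bits of mantissa, the rounded formula can be trusted only for small `n`:

```haskell
-- Naive Binet's formula in Double arithmetic, versus an exact
-- iterative computation.  Double's 53-bit mantissa means the
-- rounded formula stops being reliable once fib n grows large.
fibBinet :: Int -> Integer
fibBinet n = round ((phi ^ n - psi ^ n) / sqrt 5)
  where phi = (1 + sqrt 5) / 2 :: Double
        psi = (1 - sqrt 5) / 2

fibExact :: Int -> Integer
fibExact n = go n 0 1
  where go 0 a _ = a
        go k a b = go (k - 1) b (a + b)

main :: IO ()
main = print [ n | n <- [0 .. 100], fibBinet n /= fibExact n ]
```

On a typical IEEE 754 platform the printed list is non-empty well before `n = 100`, even though the closed-form formula is mathematically exact.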

There is another family of algorithms that manages to compute `fib n` in `O(log n)` recursive calls. To construct such algorithms, one might start by asking oneself: can we express `fib (n+k)` in terms of `fib n` and `fib k` (and some other nearby `fib` if necessary)? Given such a formula, we can perhaps compute `fib (n+n)` from `fib n`, and design an algorithm that uses only `O(log n)` recursive calls.

Indeed, for `n >= 1`, we have

```
fib (n+k) = fib (n-1) * fib k + fib n * fib (k+1)    -- (Vor)
```

This property can be traced back to Nikolai N. Vorobev, and we therefore refer to it as *Vorobev's Equation*. A proof will be given later. For now, let us see how it helps us.
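Before the proof, a quick sanity check of (Vor) on small inputs (a test, of course, not a proof):

```haskell
-- Checking Vorobev's equation on small values of n and k.
fib :: Int -> Integer
fib n = fs !! n
  where fs = 0 : 1 : zipWith (+) fs (tail fs)

vorHolds :: Int -> Int -> Bool
vorHolds n k = fib (n + k) == fib (n - 1) * fib k + fib n * fib (k + 1)

main :: IO ()
main = print (and [ vorHolds n k | n <- [1 .. 20], k <- [0 .. 20] ])  -- True
```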

With Vorobev’s equation we can derive a number of (similar) algorithms that compute `fib n` in `O(log n)` recursive calls. For example, letting `n, k` in (Vor) be `n+1, n`, we get

```
fib (2n+1) = (fib (n+1))^2 + (fib n)^2    -- (1)
```

Letting `n, k` be `n+1, n+1`, we get

```
fib (2n+2) = 2 * fib n * fib (n+1) + (fib (n+1))^2    -- (2)
```

Subtracting (1) from (2), we get

```
fib 2n = 2 * fib n * fib (n+1) - (fib n)^2    -- (3)
```

The arguments on the LHS of (1) and (3) are respectively odd and even, while their RHSs involve only `fib n` and `fib (n+1)`. Defining `fib2v n = (fib n, fib (n+1))`, we can derive the program below, which uses only `O(log n)` recursive calls.

```
fib2v :: Nat -> (Nat, Nat)
fib2v 0 = (0, 1)
fib2v n | n `mod` 2 == 0 = (c, d)
        | otherwise      = (d, c + d)
  where (a, b) = fib2v (div n 2)
        c = 2 * a * b - a * a
        d = a * a + b * b
```
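Rendered in standard Haskell (again with `Integer` standing in for `Nat`), the program can be run directly:

```haskell
-- fib2v: computes (fib n, fib (n+1)) in O(log n) recursive calls,
-- using equations (3) (even case) and (1) (odd case).
fib2v :: Integer -> (Integer, Integer)
fib2v 0 = (0, 1)
fib2v n
  | even n    = (c, d)
  | otherwise = (d, c + d)
  where
    (a, b) = fib2v (n `div` 2)
    c = 2 * a * b - a * a   -- fib (2m),   by (3), where m = n `div` 2
    d = a * a + b * b       -- fib (2m+1), by (1)

main :: IO ()
main = print (fst (fib2v 100))  -- 354224848179261915075
```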

Having so many algorithms, the ultimate question is: which runs faster?

Interestingly, in 1988, James L. Holloway devoted an entire Master’s thesis to analysis and benchmarking of algorithms computing Fibonacci numbers. The thesis reviewed algorithms including (counterparts of) all those mentioned in this post so far, and some more algorithms based on matrix multiplication. I will summarise some of his results below.

For a theoretical analysis, we need to know the number of bits needed to represent `fib n`. Holloway estimated that to represent `fib n` we need approximately `n * 0.69424` bits; we will denote this number by `N n`. That `N n` is linear in `n` is consistent with our impression that `fib n` grows exponentially in `n`.

Algorithm `fib2` makes `O(n)` recursive calls, but that does not mean the running time is `O(n)`. Instead, `fib2 n` needs around `N (n^2/2 - n/2)` bit operations to compute. (Note that we are not talking about big-O here, but an approximate upper bound.)

What about Binet's formula? We can compute `√5` by Newton's method. One can assume that each `n`-bit division needs `n^2` operations. In each round, however, we need only the most significant `N n + log n` bits. Overall, the number of bit operations needed to compute Binet's formula is dominated by `log n * (N n + log n)^2`, which is not faster than `fib2`.

Holloway studied several matrix-based algorithms. Generally, they need around `(N n)^2` bit operations, multiplied by different constants. Meanwhile, algorithms based on Vorobev's Equation perform quite well: it takes about `1/2 * (N n)^2` bit operations to compute `fib2v n`!

What about benchmarking? Holloway ran each algorithm for up to five minutes. In one of the experiments, the program based on Binet's formula exceeded 5 minutes when `log n = 7`, while the program based on `fib2` terminated within 5 minutes until `log n = 15`. In another experiment (using simpler programs, considering only cases where `n` is a power of `2`), the program based on Binet's formula exceeded 5 minutes when `log n = 13`. Meanwhile the matrix-based algorithms terminated within 3 to 5 seconds, and the program based on Vorobev's Equation terminated within around 2 seconds.

Finally, let us see how Vorobev's Equation can be proved. We perform induction on `n`. The cases `n := 1` and `n := 2` can be easily established. Assume the equation holds for `n` (that is, (Vor)) and for `n := n+1` (abbreviating `fib` to `f`):

```
f (n+1+k) = f n * f k + f (n+1) * f (k+1)    -- (Vor')
```

We prove the case for `n := n+2`:

```
   f (n+2+k)
 =   { definition of f }
   f (n+k) + f (n+k+1)
 =   { (Vor) & (Vor') }
   f (n-1) * f k + f n * f (k+1) +
   f n * f k + f (n+1) * f (k+1)
 =   { f (n+1) = f n + f (n-1) }
   f (n+1) * f k + f n * f (k+1) + f (n+1) * f (k+1)
 =   { f (n+2) = f (n+1) + f n }
   f (n+1) * f k + f (n+2) * f (k+1) .
```

This completes the proof.

Dijkstra derived another algorithm that computes `fib n` in `O(log n)` recursive calls in EWD654 [Dij78].

Besides his master’s thesis, Holloway and his supervisor Paul Cull also published a journal version of their results [CH89]. I do not know the whereabouts of Holloway — it seems that he didn’t pursue a career in academics. I wish him all the best. It comforts me imagining that any thesis that is written with enthusiasm and love, whatever the topic, will eventually be found by some readers who are also enthusiastic about it, somewhere, sometime.

I found much interesting information on this page hosted by Ron Knott from the University of Surrey, and would recommend it too.

After the flamewar, Yoda Lee (李祐棠) conducted many experiments computing Fibonacci numbers, taking into consideration things like the precision of floating point computation and the choice of suitable floating point libraries. It is worth a read too. (In Chinese.)

So, what was the flamewar about? It started with someone suggesting that we should store on the moon (yes, the moon; don't ask me why) some important constants such as `π` and `e`, so that, with the constants available in very large precision, many problems could be solved in constant time. Then people started arguing about what it means to compute something in constant time, whether Binet's formula gives you a constant-time algorithm… and here we are. Silly, but we learned something fun.

[**CH89**] Paul Cull, James L. Holloway. Computing fibonacci numbers quickly. Information Processing Letters, 32(3), pp 143-149. 1989.

[**Dij78**] Dijkstra. In honor of Fibonacci. EWD654, 1978.

[**Hol88**] James L. Holloway. Algorithms for Computing Fibonacci Numbers Quickly. Master's thesis, Oregon State University, 1988.

[**HvdH19**] David Harvey, Joris van der Hoeven. Integer multiplication in time `O(n log n)`. 2019. hal-02070778.

The post Adjoint Functors Induce Monads and Comonads appeared first on niche computing science.

Given categories `C` and `D`, we call two functors `L : C → D` and `R : D → C` a pair of *adjoint functors* if, for all objects `A` in `C` and `B` in `D`, we have the following *natural isomorphism*:

```
Hom (L A, B) ≅ Hom (A, R B)
```

This is denoted by `L ⊣ R`. Functors `L` and `R` are respectively called *the left and the right adjoint*.

Concepts such as `Hom (A, B)` and natural isomorphism will be explained in more detail later. For now, it suffices to say that `Hom (A, B)` is the collection of all *morphisms* from `A` to `B`. For example, in Set (the category of sets and total functions), `Hom (A, B)` is the set of all functions having type `A → B`, and `Hom (L A, B) ≅ Hom (A, R B)` can be understood as

```
L A → B ≅ A → R B
```

That is, given a function `L A → B` there is a unique corresponding function `A → R B`, and vice versa. A typical example is when `L A = A × S` and `R B = S → B` for some `S`. Indeed we have

```
(A × S) → B ≅ A → (S → B)
```

with the mapping from left to right being `curry`, and the reverse mapping being `uncurry`.
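In Haskell the two mappings are the standard `curry` and `uncurry`; renaming them to match the `ϕ` and `θ` used below, the round trips can be checked on concrete inputs (checking the isomorphism pointwise, not proving naturality):

```haskell
-- curry and uncurry witness Hom (A × S, B) ≅ Hom (A, S → B).
phi :: ((a, s) -> b) -> (a -> s -> b)
phi = curry

theta :: (a -> s -> b) -> ((a, s) -> b)
theta = uncurry

main :: IO ()
main = do
  let f (x, s) = x + s            -- some f : L A → B
  print (theta (phi f) (3, 4))    -- round trip gives f back: f (3,4) = 7
  print (phi f 3 4)               -- 7 again
```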

Note that Set is but one instance of a category (one that is easier for me to understand). The notion of adjoint functors is much more general. For example, when the categories are such that the objects are elements of a partially ordered set, and there is a morphism `a → b` if `a ≼ b` (thus there is either zero or one morphism between any two objects), `L` and `R` being adjoint functors means that they form a Galois connection.

That `Hom (L A, B)` and `Hom (A, R B)` are isomorphic means that there exists a pair of mappings `ϕ : Hom (L A, B) → Hom (A, R B)` and `θ : Hom (A, R B) → Hom (L A, B)` such that `ϕ ∘ θ = id` and `θ ∘ ϕ = id`. Being *naturally isomorphic* refers to an additional constraint: `ϕ` and `θ` must be natural with respect to `A` and `B`. This is an important property that will be explained and used later.

If `L : C → D` and `R : D → C` form a pair of adjoint functors, `R ∘ L` is a monad, while `L ∘ R` is a comonad.

Recall the example `L A = A × S` and `R B = S → B`. Indeed, we have `(R ∘ L) A = S → (A × S)` — the type of the state monad!

Merely having the type does not constitute a monad — we have got to construct the monad operators. In a more programming-oriented definition, a monad `M : * → *` comes with two operators `return : A → M A` and `(>>=) : M A → (A → M B) → M B`. In the traditional, mathematics-oriented definition, a monad `M` comes with three operators: `return`, `map : (A → B) → M A → M B`, and `join : M (M A) → M A` — as a convention, `return` and `join` are often respectively written as `η` and `μ`. Dually, a comonad `N` comes with three operators: `ε : N B → B`, `map : (A → B) → N A → N B`, and `δ : N B → N (N B)`.

As mentioned before, adjoint functors `L` and `R` induce a monad `M = R ∘ L` and a comonad `N = L ∘ R`. The operators `η` and `ε` are given by:

```
η : A → R (L A)
η = ϕ id         -- id : L A → L A
ε : L (R B) → B
ε = θ id         -- id : R B → R B
```

The types of `id` are given in the comments. Operators `μ` and `δ` can then be defined by:

```
μ : R (L (R (L A))) → R (L A)
μ = R ε
δ : L (R B) → L (R (L (R B)))
δ = L η
```

where `η : R B → R (L (R B))` and `ε : L (R (L A)) → L A`.
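For the `L A = A × S`, `R B = S → B` adjunction, these definitions can be spelled out in Haskell: with `ϕ = curry` and `θ = uncurry`, the induced `η` and `μ` are exactly the unit and join of the state monad. A sketch, with the functors left implicit:

```haskell
-- The monad induced by (× s) ⊣ (s →) is the state monad s -> (a, s).
type M s a = s -> (a, s)

eta :: a -> M s a           -- η = ϕ id, with ϕ = curry
eta = curry id

mu :: M s (M s a) -> M s a  -- μ = R ε, with ε = θ id = uncurry id
mu m = uncurry id . m       -- R acts on ε by post-composition

main :: IO ()
main = do
  print (eta 'x' (0 :: Int))             -- ('x',0)
  print (mu (eta (eta 'x')) (0 :: Int))  -- ('x',0): μ ∘ η = id
```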

The operators do have the correct types. But do they satisfy the monad laws? There are six monad laws for `(M, η, μ)`:

1. `M id = id`
2. `M f ∘ M g = M (f ∘ g)`
3. `η ∘ f = M f ∘ η`
4. `M f ∘ μ = μ ∘ M (M f)`
5. `μ ∘ η = id = μ ∘ M η`
6. `μ ∘ μ = μ ∘ M μ`

The first two laws demand that `M` be a functor. Since `L` and `R` are functors, these two laws are immediately true. Laws 3 and 4 demand that `η` and `μ` be natural transformations, while laws 5 and 6 are important computational rules for monads. We have got to check that they do hold for the definitions of `μ` and `η`.

For comonads there is a collection of dual laws. Since the proofs are dual, we talk only about the laws for monads in this post.

To prove the four remaining monad laws we need more properties of `ϕ` and `θ`. For that we give the concepts of hom-set and natural isomorphism, which we quickly skimmed through above, a closer look.

The collection of all morphisms from `A` to `B` (both of them objects in category `C`) is denoted by `Hom(A,B)`. (In general `Hom(A,B)` is not necessarily a set. When it happens to be a set, `C` is called a *locally small category*. See hom-set on ncatlab for details.)

Given a category `C`, `Hom` is also a functor `Cᵒᵖ × C → Set` (where `Cᵒᵖ` denotes the dual category of `C`). It maps an object `(A,B)` (in `Cᵒᵖ × C`) to `Hom(A,B)`, which is now an object in Set, and maps a pair of morphisms `f : A₂ → A₁` and `g : B₁ → B₂` to a morphism `Hom(A₁,B₁) → Hom(A₂,B₂)` in Set, defined by

```
Hom : (A₂ → A₁ × B₁ → B₂) → Hom(A₁,B₁) → Hom(A₂,B₂)
Hom (f,g) h = g ∘ h ∘ f
```

(The “type” given to `Hom` is not rigorous notation, but an aid to understanding. For more details, see hom-functor on ncatlab.)

Recall that, given functors `F` and `G`, when we say `h : F → G` is a natural transformation (from `F` to `G`), we mean that `h` is a family of morphisms: for each object `A` there is a morphism `h : F A → G A`, and for all `f : A → B` we have the following:

```
h ∘ F f = G f ∘ h
```

The types of `h` on the left and right hand sides are respectively `F B → G B` and `F A → G A`.

Recall also that the definition of adjoint functors demands that `ϕ` and `θ` be *natural with respect to `A` and `B`*. What we mean by being natural here is essentially the same, but slightly complicated by the fact that `Hom` is a more complex functor: for all `f : A₂ → A₁` and `g : B₁ → B₂`, we want

```
ϕ ∘ Hom (L f, g) = Hom (f, R g) ∘ ϕ
θ ∘ Hom (f, R g) = Hom (L f, g) ∘ θ
```

If we expand the definition of `Hom`, apply both sides of the equation regarding `ϕ` to an argument `h : L A₁ → B₁`, and apply both sides of the equation regarding `θ` to an argument `k : A₁ → R B₁`, we get

```
ϕ (g ∘ h ∘ L f) = R g ∘ ϕ h ∘ f
θ (R g ∘ k ∘ f) = g ∘ θ k ∘ L f
```

These naturality conditions will be of crucial importance in the proofs later.

Now we are ready to prove the monad laws 3-6:

3. `η ∘ f = M f ∘ η`:

```
   R (L f) ∘ ϕ id
 =   { ϕ (g ∘ h ∘ L f) = R g ∘ ϕ h ∘ f, [g, h, f := L f, id, id] }
   ϕ (L f ∘ id ∘ L id)
 = ϕ (L f)
 = ϕ (id ∘ id ∘ L f)
 =   { ϕ (g ∘ h ∘ L f) = R g ∘ ϕ h ∘ f, [g, h := id, id] }
   ϕ id ∘ f
```

4. `M f ∘ μ = μ ∘ M (M f)`

```
   R (θ id) ∘ R (L (R (L f)))
 = R (θ id ∘ L (R (L f)))
 =   { θ (R g ∘ k ∘ f) = g ∘ θ k ∘ L f, [g, k, f := id, id, R (L f)] }
   R (θ (R id ∘ id ∘ R (L f)))
 = R (θ (R (L f)))
 =   { θ (R g ∘ k ∘ f) = g ∘ θ k ∘ L f, [g, k, f := L f, id, id] }
   R (L f ∘ θ id ∘ id)
 = R (L f) ∘ R (θ id)
```

5.1 `μ ∘ η = id`

```
   R (θ id) ∘ ϕ id
 =   { ϕ (g ∘ h ∘ L f) = R g ∘ ϕ h ∘ f, [g, h, f := θ id, id, id] }
   ϕ (θ id ∘ id ∘ id)
 = ϕ (θ id)
 =   { ϕ ∘ θ = id }
   id
```

5.2 `μ ∘ M η = id`

```
   R (θ id) ∘ R (L (ϕ id))
 = R (θ id ∘ L (ϕ id))
 =   { θ (R g ∘ k ∘ f) = g ∘ θ k ∘ L f, [g, k, f := id, id, ϕ id] }
   R (θ (R id ∘ id ∘ ϕ id))
 = R (θ (ϕ id))
 =   { θ ∘ ϕ = id }
   R id
 = id .
```

6. `μ ∘ μ = μ ∘ M μ`

```
   R (θ id) ∘ R (L (R (θ id)))
 = R (θ id ∘ L (R (θ id)))
 =   { θ (R g ∘ k ∘ f) = g ∘ θ k ∘ L f, [g, k, f := id, id, R (θ id)] }
   R (θ (R id ∘ id ∘ R (θ id)))
 = R (θ (R (θ id)))
 =   { θ (R g ∘ k ∘ f) = g ∘ θ k ∘ L f, [g, k, f := θ id, id, id] }
   R (θ id ∘ θ id)
 = R (θ id) ∘ R (θ id)
```

The proofs above use only the functor laws, the fact that `ϕ` and `θ` are inverses, and the naturality laws of `ϕ` and `θ`. Traditionally, the proofs would proceed by diagram chasing, which would probably be easier for those familiar with them. I am personally happy about being able to construct these equational proofs, guided mostly by the syntax.

- adjoint functor on ncatlab.
- Anton Hilado, Adjoint Functors and Monads, June 20, 2017.
- Thorsten Wißmann, Adjunctions and monads. Seminar “Categories in Programming”, June 3, 2015.
- Steve Awodey, Monads and algebras. Course Notes of Category Theory, LMU Munich, Sommer Semester 2011.

The post Deriving Monadic Programs appeared first on niche computing science.

That was how I started to take an interest in the reasoning and derivation of monadic programs. Several years having passed, I have collaborated with many nice people, managed to get some results published, failed to publish some stuff I personally like, and am still working on some interesting tiny problems. This post summarises what was done, and what remains to be done.

Prior to that, all program reasoning I had done was restricted to pure programs. They are beautiful mathematical expressions suitable for equational reasoning, while effectful programs are the awkward squad not worthy of rigorous treatment — so I thought, and I could not have been more wrong! It turned out that there is plenty of fun reasoning one can do with monadic programs. The rule of the game is that you do not know how the monad you are working with is implemented, and thus rely only on the monad laws:

```
return x >>= f = f x
m >>= return = m
(m >>= f) >>= g = m >>= (\x -> f x >>= g)
```

and the laws of the effect operators. For the non-determinism monad we usually assume two operators: `0` for failure, and `(|)` for non-deterministic choice (usually denoted by `mzero` and `mplus` of the type class `MonadPlus`). It is usually assumed that `(|)` is associative with `0` as its identity element, and that they interact with `(>>=)` by the following laws:

```
0 >>= f = 0 (left-zero)
(m1 | m2) >>= f = (m1 >>= f) | (m2 >>= f) (left-distr.)
m >>= 0 = 0 (right-zero)
m >>= (\x -> f1 x | f2 x) = (m >>= f1) | (m >>= f2) (right-distr.)
```

The four laws are respectively named *left-zero*, *left-distributivity*, *right-zero*, and *right-distributivity*, which we will discuss more later. These laws are sufficient for proving quite a lot of interesting properties of the non-determinism monad, as well as properties of Spark programs. I find it very fascinating.
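As a tiny illustration, in the list monad (taking `0` to be `[]` and `(|)` to be `(++)`) the two left laws can be checked directly on examples:

```haskell
-- Checking left-zero and left-distributivity in the list monad,
-- where 0 = [] and (|) = (++).
main :: IO ()
main = do
  let f x = [x, x + 1]
  print (([] >>= f) == ([] :: [Int]))                           -- left-zero
  print ((([1] ++ [2]) >>= f) == (([1] >>= f) ++ ([2] >>= f)))  -- left-distr.
```

Both checks print `True`; the list monad also satisfies the two right laws only up to the order of results, which is exactly the point made below about local versus global state.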

Unfortunately, it turned out that monads were too heavy a machinery for the target readers of the Spark paper. The version we eventually published in NETYS 2017 [CHLM17] consists of pure-looking functional programs that occasionally use “non-deterministic functions” in an informal, but probably more accessible, way. Ondřej Lengál should be given credit for most, if not all, of the proofs. My proofs using the non-determinism monad were instead collected in a tech. report [Mu19a]. (Why a tech. report? We will come to that later.)

Certainly, it would be more fun if, besides non-determinism, more effects are involved. I have also been asking myself: rather than proving properties of given programs, can I *derive* monadic programs? For example, is it possible to start from a non-deterministic specification, and derive a program solving the problem using states?

The most obvious class of problems that involve both non-determinism and state is backtracking programs. Thus I tried to tackle a problem previously dealt with by Jeremy Gibbons and Ralf Hinze [GH11], the `n`-queens problem: placing `n` queens on an `n` by `n` chess board in such a way that no queen can attack another. The specification non-deterministically generates all chess arrangements, before filtering out the safe ones. We wish to derive a backtracking program that remembers the currently occupied diagonals in a state monad.

Jeremy Gibbons suggested generalising the problem a bit: given a problem specification in terms of a non-deterministic `scanl`, is it possible to transform it into a non-deterministic *and* stateful `foldr`?

Assuming all the previous laws and, in addition, laws about `get` and `put` of the state monad (the same as those assumed by Gibbons and Hinze [GH11], omitted here), I managed to come up with some general theorems for such transformations.

The interaction between non-determinism and state turned out to be intricate. Recall the *right-zero* and *right-distributivity* laws:

```
m >>= 0 = 0 (right-zero)
m >>= (\x -> f1 x | f2 x) = (m >>= f1) | (m >>= f2) (right-distr.)
```

While they do not explicitly mention state at all, in the presence of state these two laws imply that *each non-deterministic branch has its own copy of the state*. In the *right-zero* law, if a computation fails, it just fails — all state modifications in `m` are forgotten. In *right-distributivity*, the two `m`s on the RHS each operate on their local copy of the state, thus locally it appears that the side effects in `m` happen only once.

We call a non-deterministic state monad satisfying these laws a *local state* monad. A typical example is `M a = S -> List (a, S)`, where `S` is the type of the state — modulo order and repetition in the list monad, that is. The same monad can be constructed by `StateT s (ListT Identity)` in the Monad Transformer Library. With effect handling [KI15], we get the desired monad if we run the handler for state before that for list.
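A minimal rendition of this monad in Haskell may look as follows (a sketch: `zero` and `(|||)` are my names for `0` and `(|)`, since `|` alone is not a legal Haskell operator, and the laws hold only modulo the order and repetition of results):

```haskell
-- A local-state monad M s a = s -> [(a, s)], essentially StateT s []
-- written out directly.  Each branch carries its own copy of the state.
newtype M s a = M { runM :: s -> [(a, s)] }

instance Functor (M s) where
  fmap f (M m) = M (\s -> [ (f a, s') | (a, s') <- m s ])

instance Applicative (M s) where
  pure a = M (\s -> [(a, s)])
  M mf <*> M ma = M (\s -> [ (f a, s2) | (f, s1) <- mf s, (a, s2) <- ma s1 ])

instance Monad (M s) where
  M m >>= f = M (\s -> concat [ runM (f a) s' | (a, s') <- m s ])

zero :: M s a                 -- failure, the 0 of the text
zero = M (const [])

(|||) :: M s a -> M s a -> M s a   -- choice, the (|) of the text
M m1 ||| M m2 = M (\s -> m1 s ++ m2 s)

get :: M s s
get = M (\s -> [(s, s)])

put :: s -> M s ()
put s = M (\_ -> [((), s)])
```

For example, `runM ((put 1 ||| put 2) >> get) 0` yields `[(1,1),(2,2)]`: each branch sees only its own copy of the state.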

The local state monad is the ideal combination of non-determinism and state we would like to have. It has nice properties, and is much more manageable. However, there are practical situations in which one may want the state to be shared globally — for example, when the state is a large array that is costly to copy. Typically one then uses operations to explicitly “roll back” the global state to its previous configuration upon the end of each non-deterministic branch.

Can we reason about programs that use a global state?

The non-determinism monad with a global state turns out to be a weird beast to tame.

While we are concerned with what laws a monad satisfies, rather than how it is implemented, we digress a little and consider how to implement a global-state monad, just to see the issues involved. By intuition one might guess `M a = S -> (List a, S)`, but that is not even a monad — the direct but naive implementation of its `(>>=)` does not meet the monad laws! The type `ListT (State s)` generated using the Monad Transformer Library expands to essentially the same implementation, and is flawed in the same way (but the authors of MTL do not seem to have bothered fixing it). For correct implementations, see the discussions on the Haskell wiki. With effect handling [KI15], we do get a monad by running the handler for list before that for state.

Assume that we do have a correct implementation of a global-state monad: what can we say about it? We no longer have the *right-zero* and *right-distributivity* laws, but *left-zero* and *left-distributivity* still hold. For now we assume an informal, intuitive understanding of the semantics: a global state is shared among non-deterministic branches, which are executed left-to-right. We will need more laws to, for example, formally specify what we mean by “the state is shared”. This will turn out to be tricky, so for illustrative purposes we postpone it and carry on with the informal understanding.

In backtracking algorithms that keep a global state, it is a common pattern to

- update the current state to its next step,
- recursively search for solutions, and
- roll back the state to the previous step.

To implement such pattern as a monadic program, one might come up with something like the code below:

```
modify next >> search >>= modReturn prev
```

where `next` advances the state, `prev` undoes the modification of `next`, and `modify` and `modReturn` are defined by:

```
modify f = get >>= (put . f)
modReturn f v = modify f >> return v
```

Let the initial state be `st` and assume that `search` found three choices `m1 | m2 | m3`. The intention is that `m1`, `m2`, and `m3` all start running with state `next st`, and that the state is restored to `prev (next st) = st` afterwards. By *left-distributivity*, however,

```
modify next >> (m1 | m2 | m3) >>= modReturn prev =
modify next >> ( (m1 >>= modReturn prev) |
(m2 >>= modReturn prev) |
(m3 >>= modReturn prev))
```

which, with a global state, means that `m2` starts with state `st`, after which the state is rolled back too early to `prev st`. The computation `m3` starts with `prev st`, after which the state is rolled back too far, to `prev (prev st)`.

We need a way to say that “`modify next` and `modReturn prev` are run exactly once, respectively before and after all non-deterministic branches in `search`.” Fortunately, we have discovered a curious technique. Since non-deterministic branches are executed sequentially, the program

```
(modify next >> 0) | m1 | m2 | m3 | (modify prev >> 0)
```

executes `modify next` and `modify prev` once, respectively before and after all the non-deterministic branches, even if they fail. Note that `modify next >> 0` does not generate a result; its presence is merely for the side effect of `modify next`.

The reader might wonder: now that we are using `(|)` as a sequencing operator, does it simply coincide with `(>>)`? Recall that we still have left-distributivity and, therefore, `(m1 | m2) >> n` equals `(m1 >> n) | (m2 >> n)`. That is, `(|)` acts as “insertion points” into which future code following `(>>)` can be inserted! This is certainly a dangerous feature, whose undisciplined use can lead to chaos.

To be slightly more disciplined, we can go a bit further by defining the following variation of `put`, which restores the original state when it is backtracked over:

```
putR s = get >>= (\s0 -> put s | (put s0 >> 0))
```

To see how it works, assume that some computation `comp` follows `putR s`. By left-distributivity we get:

```
   putR s >> comp
 = (get >>= (\s0 -> put s | (put s0 >> 0))) >> comp
 =   { monad laws, left-distr., left-zero }
   get >>= (\s0 -> (put s >> comp) |
                   (put s0 >> 0))
```

Therefore, `comp` runs with the new state `s`. After it finishes, the previous state `s0` is restored.

The hope is that, by replacing all occurrences of `put` with `putR`, we can program as if we were working with local states, while there is actually a shared global state.

(I later learned that Tom Schrijvers had developed similar and more complete techniques, in the context of simulating Prolog boxes in Haskell.)

So was the idea. I had to find out what laws are sufficient to formally specify the behaviour of a global state monad (note that the discussion above has been informal), and make sure that there exists a model/implementation satisfying these laws.

I prepared a draft paper containing proofs about Spark functions using the non-determinism monad, a derivation of backtracking algorithms solving problems including `n`-queens using a local-state monad and, after proposing laws a global-state monad should satisfy, a derivation of another backtracking algorithm using a shared global state. I submitted the draft and also sent it to some friends for comments. Very soon, Tom Schrijvers wrote back and warned me: the laws I proposed for the global-state monad could not be true!

I quickly withdrew the draft, and invited Tom Schrijvers to collaborate and fix the issues. Together with Koen Pauwels, they carefully figured out what the laws should be, showed that the laws are sufficient to guarantee that one can simulate local states using a global state (in the context of effect handling), that there exists a model/implementation of the monad, and verified key theorems in Coq. That resulted in a paper Handling local state with global state, which we published in MPC 2019.

The paper is about semantical concerns of the local/global state interaction. I am grateful to Koen and Tom, who deserve credit for most of the hard work — without their help the paper could not have been done. The backtracking algorithm, meanwhile, became a motivating example that was only briefly mentioned.

I was still holding out hope that my derivations could be published in a conference or journal, until I noticed, by chance, a submission to MPC 2019 by Affeldt et al. [ANS19]. They formalised a hierarchy of monadic effects in Coq and, for demonstration, needed examples of equational reasoning about monadic programs. They somehow found the draft that was previously withdrawn, and corrected some of its errors. I am still not sure how that happened — I might have put the draft on my web server to communicate with my students, and somehow it showed up in the search engine. The file name was `test.pdf`. And that was how the draft was cited!

“Oh my god,” I thought in horror, “please do not cite an unfinished work of mine, especially when it is called `test.pdf`!”

I quickly wrote to the authors, thanked them for noticing the draft and finding errors in it, and said that I would turn it into tech. reports, which they could cite more properly. That resulted in two tech. reports: *Equational reasoning for non-determinism monad: the case of Spark aggregation* [Mu19a] contains my proofs of Spark programs, and *Calculating a backtracking algorithm: an exercise in monadic program derivation* [Mu19b] the derivation of backtracking algorithms.

There are plenty of potentially interesting topics one can pursue in monadic program derivation. For one, people have been suggesting pointwise notations for relational program calculation (e.g. de Moor and Gibbons [dMG00], Bird and Rabe [BR19]). I believe that monads offer a good alternative. Plenty of relational program calculation can be carried out in terms of the non-determinism monad. Program refinement can be defined by

`m1 ⊆ m2 ≡ m1 | m2 = m2`

This definition applies to monads having other effects too. I have a draft demonstrating the idea with quicksort. Sorting is specified by a non-determinism monad returning a permutation of the input that is sorted; when the ordering is not anti-symmetric, there can be more than one way to sort a list, therefore the specification is non-deterministic. From the specification, one can derive pure quicksort on lists, as well as a quicksort that mutates an array. Let us hope I have better luck publishing it this time.
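Interpreting the non-determinism monad as lists (modulo order and repetition of results), this refinement order can be sketched as a subset check; `refines` is my name for it, not one from the draft:

```haskell
import Data.List (nub, sort)

-- m1 ⊆ m2  ≡  m1 | m2 = m2, read on lists modulo order and repetition:
-- every result of m1 is already a possible result of m2.
refines :: Ord a => [a] -> [a] -> Bool
refines m1 m2 = norm (m1 ++ m2) == norm m2
  where norm = sort . nub

main :: IO ()
main = do
  print ([2] `refines` [1, 2, 3])   -- True: a deterministic refinement
  print ([4] `refines` [1, 2, 3])   -- False
```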

With Kleisli composition, there is even a natural definition of factors. Lifting `(⊆)` to functions (that is, `f ⊆ g ≡ (∀ x : f x ⊆ g x)`), and recalling that `(f >=> g) x = f x >>= g`, the left factor `(\)` can be specified by the Galois connection:

`(f >=> g) ⊆ h ≡ g ⊆ (f \ h)`

That is, `f \ h` is the most non-deterministic (least constrained) monadic program that, when run after the postcondition set up by `f`, still meets the result specified by `h`.

If, in addition, we have a proper notion of *converses*, I believe that plenty of optimisation problems can be specified and solved using calculation rules of factors and converses. I believe these are worth exploring.

[**ANS19**] Reynald Affeldt, David Nowak and Takafumi Saikawa. A hierarchy of monadic effects for program verification using equational reasoning. In *Mathematics of Program Construction (MPC)*, Graham Hutton, editor, pp. 226-254. Springer, 2019.

[**BR19**] Richard Bird, Florian Rabe. How to calculate with nondeterministic functions. In *Mathematics of Program Construction (MPC)*, Graham Hutton, editor, pp. 138-154. Springer, 2019.

[**CHLM17**] Yu-Fang Chen, Chih-Duo Hong, Ondřej Lengál, Shin-Cheng Mu, Nishant Sinha, and Bow-Yaw Wang. An executable sequential specification for Spark aggregation. In *Networked Systems (NETYS)*, pp. 421-438. 2017.

[**GH11**] Jeremy Gibbons, Ralf Hinze. Just do it: simple monadic equational reasoning. In *International Conference on Functional Programming (ICFP)*, pp 2-14, 2011.

[**KI15**] Oleg Kiselyov, Hiromi Ishii. Freer monads, more extensible effects. In *Symposium on Haskell*, pp 94-105, 2015.

[**dMG00**] Oege de Moor, Jeremy Gibbons. Pointwise relational programming. In Rus, T. (ed.) *Algebraic Methodology and Software Technology*. pp. 371–390, Springer, 2000.

[**Mu19a**] Shin-Cheng Mu. Equational reasoning for non-determinism monad: the case of Spark aggregation. Tech. Report TR-IIS-19-002, Institute of Information Science, Academia Sinica, June 2019.

[**Mu19b**] Shin-Cheng Mu. Calculating a backtracking algorithm: an exercise in monadic program derivation. Tech. Report TR-IIS-19-003, Institute of Information Science, Academia Sinica, June 2019.

[**PSM19**] Koen Pauwels, Tom Schrijvers and Shin-Cheng Mu. Handling local state with global state. In *Mathematics of Program Construction (MPC)*, Graham Hutton, editor, pp. 18-44. Springer, 2019.

The post Deriving Monadic Programs appeared first on niche computing science.

]]>The post Proving the Church-Rosser Theorem Using a Locally Nameless Representation appeared first on niche computing science.

]]>With the only reference I used, Engineering Formal Metatheory by Aydemir et al., which outlines the principal ideas of the approach, I imagined how it works and tried to implement my own version in Agda. My first few implementations, however, all ended up in a mess. There appeared to be an endless number of properties to prove. Besides the complexity of the language I was implementing, there must have been something I got wrong about the locally nameless representation. Realising that I could not finish the project this way, I eventually decided to learn from the basics.

I started with the tutorial by Chargueraud, which comes with complete code in Coq, and followed in his footsteps using Agda. The task is a classical one: define untyped λ-calculus and its reduction rules, and prove the Church-Rosser theorem.

From an abstract level, there is nothing too surprising. We define a syntax for untyped λ-calculus that distinguishes between free and bound variables:

```
```data Term : Set where
bv : (i : BName) → Term
fv : (x : FName) → Term
ƛ : (e : Term) → Term
_·_ : (e₁ : Term) → (e₂ : Term) → Term

where `BName = ℕ` represents bound variables by de Bruijn indices, while `FName` is the type of free variables. The latter can be any type that supports equality checking and a method that generates a new variable not in a given set (in fact, a `List`) of variables. If one takes `FName = String`, the expression `λ x → x y`, where `y` occurs free, is represented by `ƛ (bv 0 · fv "y")`. For ease of implementation, one may take `FName = ℕ` as well.

Not all terms you can build are valid. For example, `ƛ (bv 1 · fv "y")` is not a valid term since there is only one `ƛ` binder. How do we distinguish the valid terms from the invalid ones? I would (and did) switch to a dependent datatype `Term n`, indexed by the number of enclosing binders, and let `BName = Fin n`. The index is passed top-down and is incremented each time we encounter a `ƛ`. Closed terms are then represented by the type `Term 0`.

The representation above, if it works, has the advantage that any term that can be built at all must be valid. Choosing such a representation was perhaps the first thing I did wrong, however. Chargueraud mentioned a similar predicate that also passes the “level” information top-down, and claimed that the predicate, on which we will have to perform induction to prove properties about terms, does not fit the usual pattern of induction. This was probably why I had so much trouble proving properties about terms.

The way to go, instead, is to use a predicate that assembles the information bottom-up. The predicate `LC` (“locally closed” — a term is valid if it is locally closed) is defined by:

```
```data LC : Term → Set where
fv : ∀ x → LC (fv x)
ƛ : (L : FNames) → ∀ {e} →
(fe : ∀ {x} → (x∉L : x ∉ L) → LC ([ 0 ↦ fv x ] e)) → LC (ƛ e)
_·_ : ∀ {e₁ e₂} → LC e₁ → LC e₂ → LC (e₁ · e₂)

A free variable alone is a valid term. An application `f · e` is valid if both `f` and `e` are. And an abstraction `ƛ e` is valid if `e` becomes a valid term after we substitute any free variable `x` for the first (`0`-th) bound variable. There can be an additional constraint on `x`: that it is not in `L`, a finite set of “protected” variables — such *co-finite* quantification is one of the features of the locally nameless style.

The “open” operator `[ n ↦ t ] e` substitutes the term `t` for the `n`-th bound variable in `e`. It is defined by

```
```[_↦_] : ℕ → Term → Term → Term
[ n ↦ t ] (bv i) with n ≟ i
... | yes _ = t
... | no _ = bv i
[ n ↦ t ] (fv y) = fv y
[ n ↦ t ] (ƛ e) = ƛ ([ suc n ↦ t ] e)
[ n ↦ t ] (e₁ · e₂) = [ n ↦ t ] e₁ · [ n ↦ t ] e₂

Note how `n` is incremented each time we go into a `ƛ`. A dual operator,

```
```[_↤_] : ℕ → FName → Term → Term

abstracts a given free variable, replacing its occurrences in a term by the `n`-th bound variable.
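To make the pair of operators concrete, here is a Haskell sketch (the post's code is in Agda; the names `open'` and `close'` are mine), together with the round-trip law that opening after closing restores a locally closed term:

```haskell
-- A Haskell sketch of the locally nameless syntax.
data Term = BV Int | FV String | Lam Term | App Term Term
  deriving (Eq, Show)

-- "open": substitute term t for the n-th bound variable
open' :: Int -> Term -> Term -> Term
open' n t (BV i) | n == i    = t
                 | otherwise = BV i
open' _ _ (FV y)      = FV y
open' n t (Lam e)     = Lam (open' (n + 1) t e)   -- go under a binder
open' n t (App e1 e2) = App (open' n t e1) (open' n t e2)

-- "close": the dual, abstracting free variable x into the n-th bound one
close' :: Int -> String -> Term -> Term
close' n x (FV y) | x == y    = BV n
                  | otherwise = FV y
close' _ _ (BV i)      = BV i
close' n x (Lam e)     = Lam (close' (n + 1) x e)
close' n x (App e1 e2) = App (close' n x e1) (close' n x e2)
```

For a locally closed `e`, `open' k (FV x) (close' k x e) == e`; for example, with `e = Lam (App (BV 0) (FV "y"))`, closing `"y"` gives `Lam (App (BV 0) (BV 1))` and opening restores `e`.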

Small-step β reduction can be defined by:

```
```data _β→_ : Term → Term → Set where
β-red : ∀ {t₁ t₂} → Body t₁ → LC t₂
→ ((ƛ t₁) · t₂) β→ [ 0 ↦ t₂ ] t₁
β-app₁ : ∀ {t₁ t₁' t₂} → LC t₂
→ t₁ β→ t₁'
→ (t₁ · t₂) β→ (t₁' · t₂)
β-app₂ : ∀ {t₁ t₂ t₂'} → LC t₁
→ t₂ β→ t₂'
→ (t₁ · t₂) β→ (t₁ · t₂')
β-ƛ : ∀ L {t₁ t₁'}
→ (∀ x → x ∉ L → ([ 0 ↦ fv x ] t₁) β→ ([ 0 ↦ fv x ] t₁'))
→ ƛ t₁ β→ ƛ t₁'

where `β-red` reduces a redex, `β-app₁` and `β-app₂` allow reduction on, respectively, the left and right hand sides of an application, and `β-ƛ` goes into a `ƛ` abstraction — again we use co-finite quantification.

Given `_β→_` we can define its reflexive, transitive closure `_β→*_`, and its reflexive, transitive, symmetric closure `_β≣_`. The aim is to prove that `_β→*_` is confluent:

```
```β*-confluent :
∀ {m s t} → (m β→* s) → (m β→* t)
→ ∃ (λ u → (s β→* u) × (t β→* u))

`which leads to the Church-Rosser property:`

```
```β-Church-Rosser : ∀ {t₁ t₂} → (t₁ β≣ t₂)
→ ∃ (λ t → (t₁ β→* t) × (t₂ β→* t))

At an abstract level, the proof follows the classical route: it turns out that it is easier to prove the confluence of a “parallel reduction” relation `_⇉_`, which allows β reduction to happen in several places of a term in one step. We then prove that `_β→*_` is equivalent to `_⇉*_`, thereby proving the confluence of `_β→*_` as well. All of this can be carried out relatively cleanly.

The gory details, however, hide in proving the infrastructural properties supporting the abstract view of the proofs.

Confluence, Church-Rosser… these are the interesting things we *want* to prove. However, we often end up spending most of the time proving those infrastructure properties we *have* to prove — which is why there has been so much recent research hoping to find better representations that simplify them. The locally nameless style is supposed to be such a representation. (Another, orthogonal topic is to seek generic representations such that the proofs can be done once for all languages.)

In my code, most of these properties are piled up in the file `Infrastructure.agda`. They range from things you might expect to have:

```
```open-term : ∀ k t {e} → LC e → e ≡ [ k ↦ t ] e
close-var-open-aux : ∀ k x e → LC e → e ≡ [ k ↦ fv x ] ([ k ↤ x ] e)

to things not so obvious:

```
```open-var-inj : ∀ k x t u → x ∉ fvars t → x ∉ fvars u
→ [ k ↦ fv x ] t ≡ [ k ↦ fv x ] u → t ≡ u
open-term-aux : ∀ j t i u e → ¬ (i ≡ j)
→ [ j ↦ t ] e ≡ [ i ↦ u ] ([ j ↦ t ] e)
→ e ≡ [ i ↦ u ] e

The lemma `open-var-inj` is one of the early lemmas that appear in Chargueraud’s tutorial, which might give one the impression that it is an easy first lemma to prove. On the contrary, it is among the tedious ones — I needed a 40-line proof (most of the cases were simply eliminated by contradiction, though).

It takes experience and intuition to know which lemmas are needed. Without any promise that it would work, I would think something must have gone wrong when I found myself having to prove weird-looking lemmas like:

```
```close-var-rec-open :
∀ x y z t i j
→ ¬(i ≡ j) → ¬(y ≡ x) → y ∉ fvars t
→ [ i ↦ fv y ] ([ j ↦ fv z ] ([ j ↤ x ] t))
≡ [ j ↦ fv z ] ([ j ↤ x ] ([ i ↦ fv y ] t))

which is not easy to prove either.

So, is the locally nameless representation what it claims to be — a way to represent binders that simplifies the infrastructural proofs and is easier to scale up? When I was struggling with some of the proofs in `Infrastructure.agda`, I did wonder whether the claim is true only for Coq, with its cleverly designed proof tactics, but not for Agda, where everything is done by hand (so far). Once the infrastructural proofs were done, however, the rest was carried out very pleasantly.

To make a fair comparison, I should re-implement everything again using de Bruijn notation. That has to wait till some other time, though. (Anyone want to give it a try?)

It could be the case that, while some proofs are easily dismissed in Coq using tactics, in Agda the programmer should develop some more abstractions. I did find myself repeating some proof patterns, and found one or two lemmas that are not present in Chargueraud’s tutorial which, when used in Agda, simplify the proofs a bit. There could be more, but at this moment I am perhaps too involved in the details to see the patterns from a higher viewpoint.

The exercise does pay off, though. Now I feel I am much more familiar with this style, and am perhaps more prepared to use it in my own project.

A zip file containing all the code.

- Brian Aydemir, Arthur Chargueraud, Benjamin Pierce, Randy Pollack, and Stephanie Weirich. Engineering Formal Metatheory. POPL ’08.
- Arthur Chargueraud. The Locally Nameless Representation.
*Journal of Automated Reasoning*, May 2011

The post Proving the Church-Rosser Theorem Using a Locally Nameless Representation appeared first on niche computing science.

]]>The post Calculating Programs from Galois Connections appeared first on niche computing science.

]]>In program construction one often encounters program specifications of the form “… the smallest such number”, “the longest prefix of the input list satisfying …”, etc. A typical example is whole number division: given a natural number `x` and a positive integer `y`, `x / y` is the largest natural number that, when multiplied by `y`, is at most `x`. For another example, the Haskell function `takeWhile p` returns the longest prefix of the input list such that all elements satisfy predicate `p`.

Such specifications can be seen as consisting of two parts. The *easy part* specifies a collection of solution candidates: numbers that are at most `x` after multiplication with `y`, or all prefixes of the input list. The *hard part*, on the other hand, picks one optimal solution, such as the largest, the longest, etc., among the collection.

Our goal is to calculate programs for such specifications. But how best should the specification be given in the first place? Take division for example: one might start from a specification that literally translates our description above into mathematics:

```
``` x / y = ⋁{ z | z * y ≤ x }

As we know, however, suprema are in general not easy to handle. One could also explicitly name the remainder:

```
``` z = x / y ≡ (∃ r : 0 ≤ r < y : x = z * y + r)

at the cost of existentially quantifying over the remainder.

A third option looks surprisingly simple: given `x` and `y`, the value `x / y` is such that for all `z`,

```
``` z * y ≤ x ≡ z ≤ x / y    (1)

Why is this sufficient as a definition of `x / y`? Firstly, by substituting `x / y` for `z`, the right hand side of `≡` reduces to true, and we obtain on the left hand side `(x / y) * y ≤ x`. This tells us that `x / y` is a candidate: it satisfies the easy part of the specification. Secondly, read the definition from left to right: `z * y ≤ x ⇒ z ≤ x / y`. It says that `x / y` is the largest among all the numbers satisfying the easy part.
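A direct, if slow, way to turn this reading into a program is to search for the largest candidate; a small Haskell sketch (the name `divGC` is mine):

```haskell
-- x / y as pinned down by (1): the largest z with z * y <= x.
-- A linear search over candidates; assumes x >= 0 and y > 0.
divGC :: Integer -> Integer -> Integer
divGC x y = last [ z | z <- [0 .. x], z * y <= x ]
```

It agrees with the built-in division: `divGC 17 5` yields `3`, just as `div 17 5` does.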

Equations of this form are called *Galois connections*. Given preorders `⊑` and `≤`, functions `f` and `g` form a Galois connection if for all `x` and `z` we have

```
``` f z ⊑ x ≡ z ≤ g x    (2)

The function `f` is called the lower adjoint and `g` the upper adjoint.

The definition of division above is a Galois connection where `f = (* y)` and `g = (/ y)`. For another example, `takeWhile p` can be specified as an upper adjoint:

```
``` map p? zs ⊑ xs ≡ zs ⊑ takeWhile p xs    (3)

where `⊑` is the prefix ordering: `ys ⊑ xs` if `ys` is a prefix of `xs`, and `map p?` is a partial function: `map p? xs = xs` if `p x` holds for each `x` in `xs`.
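The connection `(3)` can be checked by brute force over prefixes; for prefixes `zs` of `xs`, `map p? zs ⊑ xs` holds exactly when all elements of `zs` satisfy `p`. A small Haskell sketch (the name `gcTakeWhile` is mine):

```haskell
import Data.List (inits, isPrefixOf)

-- Check (3) for all prefixes zs of xs:
--   (all p zs ∧ zs ⊑ xs)  ≡  zs ⊑ takeWhile p xs
gcTakeWhile :: Eq a => (a -> Bool) -> [a] -> Bool
gcTakeWhile p xs =
  and [ (all p zs && zs `isPrefixOf` xs)
          == (zs `isPrefixOf` takeWhile p xs)
      | zs <- inits xs ]
```

For instance, `gcTakeWhile even [2,4,5,6]` evaluates to `True`.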

We love Galois connections because once two functions are identified as such, a long list of useful properties follows: `f (g x) ⊑ x`, `z ≤ g (f z)`, `f` and `g` are monotonic and are inverses of each other in the other’s range… etc.

These are all very nice. But can one calculate a program from a Galois connection? Given `⊑`, `≤`, and `f`, how does one construct `g`?

José discovered and proposed a relational operator to handle such calculations. To use the operator, we have to turn the Galois connection `(1)` into point-free style. We look at the left hand side of `(1)`, `f z ⊑ x`, and try to write it as a relation between `z` and `x`. Let `f°` denote the relational converse of `f` (roughly, think of it as the inverse function of `f`, that is, it maps `f z` to `z`), and let `∘` denote relational composition, function composition extended to relations. Thus `f z ⊑ x` translates to

```
``` f° ∘ (⊑)

It is a relation between `z` and `x`: putting `x` on the left hand side of `f° ∘ (⊑)`, it relates, through `⊑`, to `f z`, which is then mapped to `z` through `f°`.

Then we wish that `f° ∘ (⊑)` can be transformed into a (relational) fold or unfold, which is often the case because the defining components `⊑`, `≤`, and `f` are often folds or unfolds. Consider the lower adjoint of `takeWhile p` in `(3)`. Since `⊑`, the relation that takes a list and returns a prefix of the list, can be defined as a fold on lists, `(map p?)° ∘ (⊑)`, by fold fusion, is also a fold. Consider `(1)`: since `≤` and `(* y)` are both folds on natural numbers, `(* y)° ∘ (≤)` can be both a fold and an unfold.

In our paper we showed that a Galois connection `(2)` can be transformed into

```
``` g = (f° ∘ (⊑)) ↾ (≥)

where `↾` is the new operator José introduced. The relation `S ↾ R`, pronounced “`S` shrunk by `R`”, is a sub-relation of `S` that yields, for each input, an optimal result under relation `R`. Note that the equation makes the easy/hard division explicit: `f° ∘ (⊑)` is the easy part (we want a solution `z` that satisfies `f z ⊑ x`), while `≥` is the criterion we use, in the hard part, to choose an optimal solution.

The `↾` operator is similar to the `min` operator of Bird and de Moor, without having to use sets (which would need a power allegory). It satisfies a number of useful properties. In particular, we have theorems stating when `(↾ R)` promotes into folds and unfolds. For example,

```
``` (fold S) ↾ R ⊇ fold (S ↾ R)

if `R` is transitive and `S` is monotonic on `R`.

With the theorems we can calculate `g`. Given `g`, specified as an upper adjoint in a Galois connection with lower adjoint `f`, we first try to turn `f° ∘ (⊑)` into a fold or an unfold, and then apply the theorems to promote `(↾ (≥))`. For more details, take a look at our paper!

The post Calculating Programs from Galois Connections appeared first on niche computing science.

]]>The post Evaluating Simple Polynomials appeared first on niche computing science.

]]>This is what I eventually came up with: given a list of numbers `a₀, a₁, a₂, ..., aₙ` and a constant `X`, compute `a₀ + a₁X + a₂X² + ... + aₙXⁿ`. In Haskell it can be specified as a one-liner:

```
``` poly as = sum (zipWith (×) as (iterate (×X) 1))

One problem with this example is that the specification is already good enough: it is a nice linear-time algorithm. To save some multiplications, perhaps, we may try to simplify it further.

It is immediate that `poly [] = 0`. For the non-empty case, we reason:

```
``` poly (a : as)
= { definition of poly }
sum (zipWith (×) (a:as) (iterate (×X) 1))
= { definition of iterate }
sum (zipWith (×) (a:as) (1 : iterate (×X) X))
= { definition of zipWith }
sum (a : zipWith (×) as (iterate (×X) X))
= { definition of sum }
a + sum (zipWith (×) as (iterate (×X) X))

The expression to the right of `a +` is unfortunately not `poly as` — the last argument to `iterate` is `X` rather than `1`. One possibility is to generalise `poly` to take another argument. For this problem, however, we can do slightly better:

```
``` a + sum (zipWith (×) as (iterate (×X) X))
= { since iterate f (f b) = map f (iterate f b) }
a + sum (zipWith (×) as (map (×X) (iterate (×X) 1)))
= { zipWith (⊗) as . map (⊗X) = map (⊗X) . zipWith (⊗) as
if ⊗ associative }
a + sum (map (×X) (zipWith (×) as (iterate (×X) 1)))
= { sum . map (×X) = (×X) . sum }
a + (sum (zipWith (×) as (iterate (×X) 1))) × X
= { definition of poly }
a + (poly as) × X

We have thus come up with the program

```
``` poly [] = 0
poly (a : as) = a + (poly as) × X

Besides the definitions of `sum`, `zipWith`, `iterate`, etc., the rules used include:

- `map f (iterate f x) = iterate f (f x)`;
- `zipWith (⊗) as . map (⊗X) = map (⊗X) . zipWith (⊗) as` if `⊗` is associative;
- `sum . map (×X) = (×X) . sum`, a special case of `foldr ⊕ e . map (⊗X) = (⊗X) . foldr ⊕ e` if `(a ⊕ b) ⊗ X = (a ⊗ X) ⊕ (b ⊗ X)` and `e ⊗ X = e`.
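The two versions can be compared directly in Haskell; in the sketch below (names mine) the constant `X` is passed as an explicit parameter so that the definitions can be tested:

```haskell
-- the specification: sum of aᵢ·xⁱ
polySpec :: Num a => a -> [a] -> a
polySpec x as = sum (zipWith (*) as (iterate (* x) 1))

-- the derived, Horner-style program
polyHorner :: Num a => a -> [a] -> a
polyHorner _ []       = 0
polyHorner x (a : as) = a + polyHorner x as * x
```

Both give `17` for `x = 2` and coefficients `[1,2,3]`, i.e. `1 + 2·2 + 3·4`.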

Well, this is not a very convincing example. Ideally I’d like to have a derivation, like the steep list, where we gain some improvement in complexity by calculation.

What is your favourite example for functional program calculation?

The post Evaluating Simple Polynomials appeared first on niche computing science.

]]>The post Sum of Squares of Differences appeared first on niche computing science.

]]>```
```|[ con N {N ≥ 2}; a : array [0..N) of int;
var r : int;
S
{ r = (Σ i,j : 0 ≤ i < j < N : (a.i - a.j)²) }
]|

In words: given an array of integers having at least two elements, compute the sum of squares of the differences between all pairs of elements. (Following the convention of the guarded command language, function application is written `f.x`, and an array is seen as a function from indices to values.)

It is not hard to quickly write up an `O(N²)` program using nested loops, which, I have to confess, is what I would have done before reading Kaldewaij’s book and realising that it is possible to do the task in linear time using one loop. Unfortunately, not many students managed to come up with this solution, so I think it is worth some discussion.

Before we solve the problem, let us review the “Dutch style” quantifier syntax and rules. Given a commutative, associative binary operator `⊕` with unit element `e`, if we informally denote the (integral) values in the interval `[A .. B)` by `i₀, i₁, i₂, ..., iₙ`, the quantified expression:

```
``` (⊕ i : A ≤ i < B : F.i)

informally denotes `F.i₀ ⊕ F.i₁ ⊕ F.i₂ ⊕ ... ⊕ F.iₙ`. More generally, if all values satisfying predicate `R` can be enlisted `i₀, i₁, i₂, ..., iₙ`, the expression

```
``` (⊕ i : R.i : F.i)

denotes `F.i₀ ⊕ F.i₁ ⊕ F.i₂ ⊕ ... ⊕ F.iₙ`. We omit the `i` in `R.i` and `F.i` when there can be no confusion.

A more formal characterisation of the quantified expression is given by the following rules:

1. `(⊕ i : false : F.i) = e`
2. `(⊕ i : i = x : F.i) = F.x`
3. `(⊕ i : R : F) ⊕ (⊕ i : S : F) = (⊕ i : R ∨ S : F) ⊕ (⊕ i : R ∧ S : F)`
4. `(⊕ i : R : F) ⊕ (⊕ i : R : G) = (⊕ i : R : F ⊕ G)`
5. `(⊕ i : R.i : (⊕ j : S.j : F.i.j)) = (⊕ j : S.j : (⊕ i : R.i : F.i.j))`

Rules 1 and 3 give rise to a useful rule, “split off `n`”: consider `i` such that `0 ≤ i < n + 1`. If `n > 0`, the set of possible values of `i` can be split into two subsets: `0 ≤ i < n` and `i = n`. By rule 3 (reversed) and rules 1 and 2 we get:

```
``` (⊕ i : 0 ≤ i < n + 1 : F.i) = (⊕ i : 0 ≤ i < n : F.i) ⊕ F.n

Expressions quantifying over more than one variable can be expressed in terms of quantifiers over single variables:

```
``` (⊕ i,j : R.i ∧ S.i,j : F.i.j) = (⊕ i : R.i : (⊕ j : S.i.j : F.i.j))

If `⊗` distributes into `⊕`, we have an additional property:

```
``` x ⊗ (⊕ i : R : F) = (⊕ i : R : x ⊗ F)

As a convention, `(+ i : R : F)` is often written `(Σ i : R : F)`.
.

The first step is to turn the constant `N` into a variable `n`. The main worker of the program is going to be a loop, in whose invariant we try to maintain:

```
``` P ≣ r = (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²)

At the end of the loop body we increment `n`, and the loop terminates when `n` coincides with `N`:

```
``````
{ Inv: P ∧ 2 ≤ n ≤ N , Bound: N - n}
do n ≠ N → ... ; n := n + 1 od
```

We shall then find out how to update `r` before `n := n + 1` in a way that preserves `P`.

Assume that `P` and `2 ≤ n ≤ N` hold. To find out how to update `r`, we substitute `n + 1` for `n` in the desired value of `r`:

```
``` (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²)[n+1 / n]
= (Σ i,j : 0 ≤ i < j < n + 1 : (a.i - a.j)²)
= { split off j = n }
(Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²) +
(Σ i : 0 ≤ i < n : (a.i - a.n)²)
= { P }
r + (Σ i : 0 ≤ i < n : (a.i - a.n)²)

This is where most people stop the calculation and start constructing a loop computing `(Σ i : 0 ≤ i < n : (a.i - a.n)²)`. One might later realise, however, that most computations are repeated. Indeed, the expression above can be expanded further:

```
``` r + (Σ i : 0 ≤ i < n : (a.i - a.n)²)
= { (x - y)² = x² - 2xy + y² }
r + (Σ i : 0 ≤ i < n : a.i² - 2 × a.i × a.n + a.n²)
= { Rule 4 }
r + (Σ i : 0 ≤ i < n : a.i²)
- (Σ i : 0 ≤ i < n : 2 × a.i × a.n)
+ (Σ i : 0 ≤ i < n : a.n²)
= { a.n is a constant, multiplication distributes into addition }
r + (Σ i : 0 ≤ i < n : a.i²)
- 2 × (Σ i : 0 ≤ i < n : a.i) × a.n
+ (Σ i : 0 ≤ i < n : a.n²)
= { simplifying the last term }
r + (Σ i : 0 ≤ i < n : a.i²)
- 2 × (Σ i : 0 ≤ i < n : a.i) × a.n + n × a.n²

which suggests that we can store the values of `(Σ i : 0 ≤ i < n : a.i²)` and `(Σ i : 0 ≤ i < n : a.i)` in two additional variables:

```
``` Q₀ ≣ s = (Σ i : 0 ≤ i < n : a.i²)
Q₁ ≣ t = (Σ i : 0 ≤ i < n : a.i)

It merely takes some routine calculation to find out how to update `s` and `t`. The resulting code is:

```
```|[ con N {N ≥ 2}; a : array [0..N) of int;
var r, s, t, n : int;
r, s, t, n := (a.0 - a.1)², a.0² + a.1², a.0 + a.1, 2
{ Inv: P ∧ Q₀ ∧ Q₁ ∧ 2 ≤ n ≤ N , Bound: N - n }
; do n ≠ N →
r := r + s - 2 × t × a.n + n × a.n²;
s := s + a.n²;
t := t + a.n;
n := n + 1
od
{ r = (Σ i,j : 0 ≤ i < j < N : (a.i - a.j)²) }
]|
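For readers who want to test the derivation, here is a transliteration into Haskell (names mine; `go`'s accumulating parameters correspond to `r`, `s`, `t` and `n`), together with the quadratic specification as an oracle:

```haskell
-- one-pass program maintaining the invariants P, Q₀ and Q₁
sumSqDiff :: [Integer] -> Integer
sumSqDiff (a0 : a1 : rest) =
  go ((a0 - a1) ^ 2) (a0 ^ 2 + a1 ^ 2) (a0 + a1) 2 rest
  where
    go r _ _ _ []       = r
    go r s t n (a : as) =
      go (r + s - 2 * t * a + n * a ^ 2)  -- update r as derived above
         (s + a ^ 2) (t + a) (n + 1) as
sumSqDiff _ = error "needs at least two elements"

-- the O(N²) specification: sum over all pairs i < j
sumSqDiffSpec :: [Integer] -> Integer
sumSqDiffSpec xs =
  sum [ (x - y) ^ 2 | (i, x) <- idx, (j, y) <- idx, i < j ]
  where idx = zip [0 :: Int ..] xs
```

On `[9,6,14,20]`, for example, both return `451`.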

Among those students who did come up with a program, most of them resorted to a typical two-loop, `O(N²)` solution. Given that this 9-hour course is, for almost all of them, their first exposure to program derivation, I should perhaps be happy enough that around 3 to 4 out of 38 students came up with something like the program above.

One student, however, delivered a program I did not expect to see:

```
```|[ con N {N ≥ 2}; a : array [0..N) of int;
var r, i, j : int;
r, i, j := 0, 0, 0
{ Inv: ... ∧ 0 ≤ i ≤ j ∧ 0 ≤ j ≤ N, Bound: ? }
; do j ≠ N →
if i < j → r := r + (a.i - a.j)²; i := i + 1
| i = j → i, j := 0, j + 1
fi
od
]|

The program uses only one loop, but is still `O(N²)` — on closer inspection one realises that it is actually simulating the inner loop manually. Still, I’d be happy if the student could show me a correctness proof, with a correct loop invariant and a bound, since both of them are more complex than what I expected the students to learn. Unfortunately, in the answer handed in, the program, the invariant, and the bound all contain some bugs. Anyone want to give it a try?

The post Sum of Squares of Differences appeared first on niche computing science.

]]>The post An Exercise Utilising Galois Connections appeared first on niche computing science.

]]>Given two preordered sets `(A, ⊑)` and `(B, ≼)`, two functions `f : A → B` and `g : B → A` form a Galois connection between them if for all `a : A` and `b : B` we have

` f a ≼ b ≣ a ⊑ g b`

We will refer to this defining property as “GC” later. The function `f` is called the lower adjoint and `g` the upper adjoint of the Galois connection. Galois connections are interesting because once two functions are identified as such, they immediately satisfy a rich collection of useful properties:

- letting `a := g b` in GC, we get `f (g b) ≼ b`;
- letting `b := f a`, we get `a ⊑ g (f a)`;
- `f` is monotonic, since:

```
  f a₁ ≼ f a₂
≣   { GC }
  a₁ ⊑ g (f a₂)
⇐   { since a ⊑ g (f a) }
  a₁ ⊑ a₂
```

- similarly, `g` is monotonic: `b₁ ≼ b₂ ⇒ g b₁ ⊑ g b₂`,
,

and many more.

In the recent work of Sharon and me on maximally dense segments we needed quite a number of functions to be monotonic, idempotent, etc. It only occurred to me after submitting the paper: could they be defined as Galois connections? The number of properties we needed in the paper is huge and it would be nice to establish them on fewer basic properties. And it looks prettier.

One such function is `trim` in the paper, but it is sufficient to consider a simplification: let `sam : [Int] → [Int]` (for “sum atmost”) return the longest prefix of the input list whose sum is no larger than a constant `U`. Denote “`x` is a prefix of `y`” by `x ⊑ y`. We want to show that `sam` satisfies

- monotonicity: `x ⊑ y ⇒ sam x ⊑ sam y`, and
- idempotence: `sam (sam x) = sam x`.

Can they be derived by defining `sam` as a Galois connection?

I learned from José N. Oliveira‘s talk A Look at Program “G”alculation in IFIP WG 2.1 #65 Meeting how (the uncurried version of) `take` can be defined as a Galois connection. It turns out that `sam` is just the same. We consider a slight generalisation `sam' : (Int, [Int]) → [Int]` that takes the upper bound as a parameter. It can be characterised by:

`sum y ≤ b ∧ y ⊑ x ≣ y ⊑ sam' (b, x)`

There is in fact a Galois connection hidden already! To see that, define `⟨f, g⟩ a = (f a, g a)` (in the Haskell Hierarchy Library it is defined in Control.Arrow as `&&&`), and denote the product of binary relations by `×`; that is, if `a ≤ b` and `x ⊑ y` then `(a,x)` is related to `(b,y)` by `≤×⊑`. We write a composed relation as an infix operator by surrounding it in square brackets: `(a,x) [≤×⊑] (b,y)`.

Using these notations, the defining equation of `sam'` can be rewritten as:

`⟨sum, id⟩ y [≤×⊑] (b,x) ≣ y ⊑ sam' (b,x)`

Thus `sam'` is the upper adjoint in a Galois connection between `((Int, [Int]), ≤×⊑)` and `([Int], ⊑)`!

Now that `⟨sum, id⟩` and `sam'` form a Galois connection, we have:

- `f (g b) ≼ b` instantiates to `⟨sum, id⟩ (sam' (b,x)) [≤×⊑] (b,x)`, that is, `sum (sam' (b,x)) ≤ b` and `sam' (b,x) ⊑ x`;
- `a ⊑ g (f a)` instantiates to `x ⊑ sam' (sum x, x)`; together with the previous property we have `x = sam' (sum x, x)`;
- monotonicity of the lower adjoint instantiates to `y₁ ⊑ y₂ ⇒ sum y₁ ≤ sum y₂ ∧ y₁ ⊑ y₂`;
- monotonicity of the upper adjoint instantiates to `(b₁,x₁) [≤×⊑] (b₂,x₂) ⇒ sam' (b₁,x₁) ⊑ sam' (b₂,x₂)`, that is, `b₁ ≤ b₂ ∧ x₁ ⊑ x₂ ⇒ sam' (b₁,x₁) ⊑ sam' (b₂,x₂)`, a generalisation of the monotonicity we want.

Finally, to show idempotence, we reason

```
sam' (b₁, x) ⊑ sam' (b₁, sam' (b₂, x))
≣ { GC }
⟨sum, id⟩ (sam' (b₁, x)) [≤×⊑] (b₁, sam' (b₂, x))
≣ { definitions }
sum (sam' (b₁, x)) ≤ b₁ ∧ sam' (b₁, x) ⊑ sam' (b₂, x)
⇐ { properties above }
b₁ ≤ b₂
```
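For lists of non-negative numbers (so that prefix sums are non-decreasing and the characterisation indeed defines a function), `sam'` can be implemented directly; a Haskell sketch, with the derived properties as checks:

```haskell
import Data.List (inits, isPrefixOf)

-- the longest prefix of the input whose sum is at most b
-- (assumes non-negative elements and b >= 0)
sam' :: Int -> [Int] -> [Int]
sam' b = last . takeWhile ((<= b) . sum) . inits
```

For example, `sam' 5 [1,2,3,4]` is `[1,2]`; one can also check that `sum (sam' b x) ≤ b`, that `sam' b x` is a prefix of `x`, and that `sam' (sum x) x == x`.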

These are all nice and pretty. There is another function, however, that is much harder to deal with, which I will write about next time.

The post An Exercise Utilising Galois Connections appeared first on niche computing science.

]]>The post Finding Maximally Dense Segments appeared first on niche computing science.

]]>The basic form of the problem looks like a natural variation of the classical *maximum segment sum* problem: given a list of numbers, find a consecutive segment whose *average*, that is, sum divided by length, is maximum. The problem would be trivial without more constraints, since one could simply return the largest element, thus we usually impose a lower bound `L` on the length of feasible segments.

It was noticed by Huang [3] that a segment having maximum average need not be longer than `2L - 1`: given a segment of `2L` elements or more, we cut it in the middle. If the two halves have different averages, we keep the one with the larger average. Otherwise the two halves have the same average. Either way, we get a shorter, feasible segment whose average is not lower. The fact hints at a trivial `O(nL)` algorithm: for each suffix of the list, find its best prefix up to `2L - 1` elements long.
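The specification itself can be written down directly; a Haskell sketch (names mine), quadratic-time but useful as a test oracle:

```haskell
import Data.List (inits, tails, maximumBy)
import Data.Ord (comparing)

-- among all segments of length at least l, one with maximum average
-- (assumes the input has at least l elements)
maxAvgSeg :: Int -> [Double] -> [Double]
maxAvgSeg l xs =
  maximumBy (comparing avg)
    [ seg | t <- tails xs, seg <- inits t, length seg >= l ]
  where avg seg = sum seg / fromIntegral (length seg)
```

By Huang's observation, additionally restricting `seg` to at most `2l - 1` elements would not change the optimum. For instance, `maxAvgSeg 2 [1,2,3,4]` is `[3,4]`.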

A more difficult challenge, however, is to come up with an algorithm that is `O(n)`, independently of `L`. The problem can be generalised to the case where the elements do not have length 1, but each has a *width*, and the goal is to maximise the *density* — the sum of the elements divided by the sum of their widths. This makes the problem slightly more complicated, but does not change its nature. If we go on to impose an upper bound `U` on the length as well, however, the problem becomes much more difficult. There was a published algorithm that claimed to be linear, only to be found not so. We discovered that two later algorithms, which appeared to have concluded the problem, also fail for a boundary case. The bug is easy to fix for one of the algorithms, but might not be so for the other.

Our algorithm closely relates to that of Chung and Lu [1] and that of Goldwasser et al [2]. The algorithm is perhaps too complex to present in detail in a blog post (that’s why we need a paper!), but I will try to give an outline using pictures from the paper, my slides and poster.

One of the ways to visualise the problem is to see each element as a block, the number being the area of the block, and the density its height. The input is a list of (area, width) pairs, and the goal is to find a consecutive segment maximising the height. Shown below is the input list `[(9,6),(6,2),(14,7),(20,4),(-10,5),(20,8),(-2,2),(27,6)]`, and the dashed line is their average height:

Notice that an area can be negative. In the paper, since the letter `w` is used for “window” (to be explained below), we instead refer to the width as “breadth”.

Many optimal segment problems (finding some optimal segment of a given list) are solved by finding, for each suffix, its optimal prefix, as shown below. Each bar is a suffix of the input, and the blue part is its optimal prefix:

It is preferable that the optimal prefix of `a : x` can be computed from the optimal prefix of `x`, that is, that the function computing the optimal prefix is a `foldr`. If that is true, the algorithm may keep a pair of (optimal segment, optimal prefix). Each time a new element is read, it computes the new optimal prefix from the previous optimal prefix, and updates the optimal segment if the new prefix is better. If you like structured recursion (or the so-called “origami programming”), this form of computation is an instance of a *zygomorphism*.

For each optimal prefix to be computable from the previous optimal prefix, it may not extend further than the latter. We do not want the following to happen:

However, it appears to be possible for the maximally dense prefix! Imagine adding a very small, or even negative, area: we might get a denser prefix by extending further to the right, since the denominator becomes larger.

The first theorem we had to prove aimed to show that it does not matter — if a maximally dense prefix extends further than the previous one, it is going to be suboptimal anyway. Thus it is safe if we always start from the right end of the previous prefix. That is, we do not compute the maximally dense prefix of the entire input, but merely the maximally dense prefix *of the previous prefix*.

This is an instance of the *sliding window* scheme proposed by Zantema [4]. The blue part is like a “window” of the list, containing enough information to guarantee the correctness of the algorithm. As the algorithm progresses, the two ends of the window keep sliding to the left, hence the name.

To formally show that the window contains enough information to compute the maximally dense segment, we have to state clearly what a window is and what invariant it satisfies. It turned out to be quite tricky to formally state the intuition that “the window does not always give you the optimal prefix, but it does when it matters,” and this was the first challenge we met.

Since we aim at computing a segment at least `L`

units in breadth, it might be handy to split the window into a “compulsory part” (the shortest prefix that is at least `L`

units wide) and the rest, the “optional part”. The algorithm thus looks like this:

where the yellow bars are the compulsory parts and blue bars the optional parts. Each time we read an element into the compulsory part, zero or more elements (since the elements have non-uniform breadths) may be shifted from the compulsory part to the optional part. Then we compute a maximally dense prefix (the yellow and the blue parts together) that does not extend further than the previous one. The best among all these prefixes is the maximally dense segment.

We want a linear time algorithm, which means that all the computation from a pair of yellow-blue bars to the next pair has to be done in (amortised) constant time — how is that possible at all? To do so we will need to exploit some structure in the optional part, based on properties of density and segments.

A non-empty list of elements `x`

is called *right-skew* if, for every non-empty `x₁`

and `x₂`

such that `x₁ ⧺ x₂ = x`

, we have `density x₁ ≤ density x₂`

. Informally, a right-skew list is drawn as the blue wavy block below:

The rising wavy slope informally hints that the right half has a higher density than the left half wherever you make the cut. However, the drawing risks the misunderstanding that a right-skew segment is a list of elements with ascending areas or densities. Note that neither the areas nor the densities of individual elements have to be ascending. For example, the list `[(9,6),(6,2),(14,7)]`, with densities `[1.5, 3, 2]`, is right-skew.
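The definition translates directly into a quadratic, purely illustrative Haskell check (my own transcription, using `Rational` to avoid floating-point surprises; `density` is total area divided by total breadth):

```haskell
-- (area, breadth) pairs; a quadratic, illustrative right-skew test.
density :: [(Rational, Rational)] -> Rational
density xs = sum (map fst xs) / sum (map snd xs)

-- x is right-skew if every split into non-empty x1 ++ x2
-- satisfies density x1 <= density x2.
rightSkew :: [(Rational, Rational)] -> Bool
rightSkew x = and [ density x1 <= density x2
                  | k <- [1 .. length x - 1]
                  , let (x1, x2) = splitAt k x ]
```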

Right-skew lists are useful because of the following property. Imagine placing a list `z` next to `x`, as depicted above. To find a maximally dense prefix of `z ⧺ x` starting with `z`, it is sufficient to consider only `z` and `z ⧺ x` — nothing in between, such as `z ⧺ x₁`, can be denser than the two ends!

Given a window with compulsory part `c` and optional part `x`, if we can partition `x` into `x₁ ⧺ x₂ ⧺ ... ⧺ xₙ` such that `x₁`, `x₂`, … `xₙ` are all right-skew, then to compute the maximally dense prefix of `c ⧺ x`, we only need to consider `c`, `c ⧺ x₁`, `c ⧺ x₁ ⧺ x₂`, … and `c ⧺ x₁ ⧺ x₂ ⧺ ... ⧺ xₙ`.

Such a partition is always possible for any list `x` — after all, each element by itself constitutes a singleton right-skew list. However, there is one unique right-skew partition such that the densities of `x₁`, `x₂`, … `xₙ` are strictly decreasing. This is called the *decreasing right-skew partition* (DRSP) of `x`. We will partition the optional part of the window into its DRSP. A window now looks like the picture below:

Sharon summarised many nice properties of the DRSP in the paper, for which we unfortunately do not have space here. We will only look at some properties that matter for this blog post. Firstly, consider the diagram below:

In the bottom row, the leftmost block is the density of `c`, the second block is the density of `c ⧺ x₁`, etc. If the segments `x₁`, `x₂`, … `xₙ` have decreasing densities, the densities of `c`, `c ⧺ x₁`, `c ⧺ x₁ ⧺ x₂`, … and `c ⧺ x₁ ⧺ x₂ ⧺ ... ⧺ xₙ` must be *bitonic* — first ascending, then descending. It helps to efficiently locate the maximally dense prefix.

Secondly, the DRSP can be built and maintained in a `foldr`

. The following diagram depicts how the DRSP for the list of areas `[1,4,2,5,3]`

(all with breadth `1`

) can be built by adding elements from the left one by one (which eventually results in one big partition):

The rule is that a block newly added from the left keeps merging with the blocks to its right until it encounters a block shorter (that is, less dense) than itself. The top-left of the diagram indicates that the DRSP of `[3]` is itself. Since `5 > 3`, adding `5` results in a partition containing two segments. When `2` is added, it is merged with `5` to form a new segment with density `3.5`. No merging is triggered by the addition of `4`, since `4 > 3.5` and thus `[4, 3.5, 3]` form a decreasing sequence. The newly added `1` first merges with `4`, forming a block having density `2.5`. Since `2.5 < 3.5`, it in turn merges with the block `[2,5]`. Eventually all the elements are grouped into one segment with density `3`. One important thing here is that adding a new element only involves merging some initial parts of the DRSP.

Recall that our algorithm computes, for each suffix, a prefix (a window) that is possibly optimal and contains enough information to compute all future optimal solutions. Since a feasible prefix is at least `L` units wide, we split it into a (yellow) compulsory part and a (blue) optional part. To obtain a linear-time algorithm, we have to compute each row from the previous row in amortised constant time (the corresponding diagram is duplicated here):

The diagram below depicts how to go from one row to the next. The blue part is partitioned into DRSP. Each time an element is added to the yellow part, some elements may be shifted to the blue part, and that may trigger some right-skew segments in the blue part to be merged (second row). Then we look for a maximally dense prefix by going from right to left, chopping away segments, until we find the peak (third row):

Note that the operation shown in the third row (chopping to find the maximum) always chops away a right-skew segment in its entirety. It is important that the merging happens at the left end of the optional part, while the chopping happens at the right end. By using a tree-like data structure, each merge can be an `O(1)` operation. With this data structure we may argue that, since each element can be merged at most once, only `O(n)` merges can happen throughout the algorithm. Similarly, each element can be chopped away at most once, so chopping can happen at most `O(n)` times as well. Therefore the operations in the second and third rows above are both amortised `O(1)`.

The discussion so far already allows us to develop an algorithm for the maximally dense segment problem without an upper bound on the breadth of feasible segments. Having the upper bound makes the problem much harder because, unlike the chopping depicted above, the upper bound may cut through the middle of a right-skew segment:

And a right-skew segment, with some elements removed, might not be right-skew anymore!

Our solution is to develop another data structure that allows efficient removal from the right end of a DRSP, while maintaining the DRSP structure. The final configuration of a window looks like the diagram below, where the new data structure is represented by the green blocks:

Unfortunately, it is inefficient to add new elements from the left into the green blocks. Therefore we maintain the window in a way similar to how a queue is implemented using two lists: new elements are added from the left into the blue blocks, and when we need to remove elements from the right, blue blocks are converted into green blocks in large chunks.

For more details, see the paper!

- Chung, Kai-Min and Lu, Hsueh-I. An Optimal Algorithm for the Maximum-Density Segment Problem. SIAM Journal on Computing 34(2):373-387, 2004.
- Goldwasser, Michael H. and Kao, Ming-Yang and Lu, Hsueh-I. Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications. Journal of Computer and System Sciences, 70(2):128-144, 2005.
- Huang, Xiaoqiu. An algorithm for identifying regions of a DNA sequence that satisfy a content requirement. Computer Applications in the Biosciences 10(3):219-225, 1994.
- Zantema, Hans. Longest segment problems. Science of Computer Programming, 18(1):39-66, 1992.

The post Finding Maximally Dense Segments appeared first on niche computing science.

]]>I learned the functional derivation of the maximum segment sum problem from one of Jeremy's papers and was very amazed. It was perhaps one of the early incidents that inspired my interest in program calculation.

The post The Maximum Segment Sum Problem: Its Origin, and a Derivation appeared first on niche computing science.

]]>Given a list of numbers, the task is to compute the largest possible sum of a consecutive segment. In a functional language the problem can be specified by:

```
mss = max . map sum . segments
```

where `segments = concat . map inits . tails`

enlists all segments of the input list, `map sum`

computes the sum of each of the segments, before `max :: Ord a ⇒ [a] → a`

picks the maximum. The specification, if executed, is a cubic time algorithm. Yet there is a linear time algorithm scanning through the list only once:

```
mss = snd . foldr step (0,0)
where step x (p,s) = (0 ↑ (x+p), (0 ↑ (x+p)) ↑ s)
```

where `a ↑ b`

yields the maximum of `a`

and `b`

.

Both the specification and the linear time program are short. The program is merely a `foldr`

that can be implemented as a simple for-loop in an imperative language. Without some reasoning, however, it is not that trivial to see why the program is correct (hint: the `foldr`

computes a pair of numbers, the first one being the maximum sum of all *prefixes* of the given list, while the second is the maximum sum of all segments). Derivation of the program (given below) is mostly mechanical, once you learn the basic principles of program calculation. Thus the problem has become a popular choice as the first non-trivial example of program derivation.
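Both versions can be run side by side; a quick check (with `inits` and `tails` from `Data.List`, and `maximum` standing in for `max`) confirms they agree on small inputs:

```haskell
import Data.List (inits, tails)

segments :: [Int] -> [[Int]]
segments = concat . map inits . tails

-- The cubic-time specification.
mssSpec :: [Int] -> Int
mssSpec = maximum . map sum . segments

-- The linear-time foldr.
mssFast :: [Int] -> Int
mssFast = snd . foldr step (0, 0)
  where step x (p, s) = (0 `max` (x + p), (0 `max` (x + p)) `max` s)
```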

Jon Bentley recorded in Programming Pearls that the problem was proposed by Ulf Grenander of Brown University. In a pattern-matching procedure he designed, a subarray having maximum sum is the most likely to yield a certain pattern in a digitised image. The two-dimensional problem took too much time to solve, so he simplified it to one dimension in order to understand its structure.

In 1977 [Grenander] described the problem to Michael Shamos of UNILOGIC, Ltd. (then of Carnegie-Mellon University) who overnight designed Algorithm 3. When Shamos showed me the problem shortly thereafter, we thought that it was probably the best possible; … A few days later Shamos described the problem and its history at a Carnegie-Mellon seminar attended by statistician Jay Kadane, who designed Algorithm 4 within a minute.

Jon Bentley, Programming Pearls (1st edition), page 76.

Jay Kadane’s Algorithm 4 is the now well-known linear time algorithm, the imperative version of the functional program above:

```
maxpre, maxseg = 0, 0
for i in range(0, N):
    maxpre = max(0, maxpre + a[i])
    maxseg = max(maxpre, maxseg)
```

Algorithm 3, on the other hand, is a divide and conquer algorithm. An array `a`

is split into two halves `a₁ ⧺ a₂`

, and the algorithm recursively computes the maximum segment sums of `a₁`

and `a₂`

. However, there could be some segment across `a₁`

and `a₂`

that yields a good sum, therefore the algorithm performs two additional loops respectively computing the maximum suffix sum of `a₁`

and the maximum prefix sum of `a₂`

, whose sum is the maximum sum of a segment crossing the edge. The algorithm runs in `O(N log N)` time. (My pseudo-Python translation of the algorithm is given below.)

In retrospect, Shamos did not have to compute the maximum prefix and suffix sums in two loops each time. The recursive function could have computed a ~~triple~~ quadruple of (maximum prefix sum, maximum segment sum, maximum suffix sum, and sum of the whole array) for each array. The prefix and suffix sums could thus be computed bottom-up. I believe that would result in an `O(N)` algorithm. This linear time complexity might suggest that the “divide” is superficial — we do not have to divide the array in the middle. It is actually easier to divide the array into a head and a tail — which was perhaps how Kadane quickly came up with Algorithm 4!
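Here is a sketch of that idea (my reconstruction, not Shamos’s code): each recursive call returns a quadruple of (maximum prefix sum, maximum segment sum, maximum suffix sum, total sum), and the two extra loops disappear into a constant-time merge. As written, the `length` and `splitAt` keep it from being truly linear, but the recurrence itself does only constant work per merge.

```haskell
-- Quadruple: (max prefix sum, max segment sum, max suffix sum, total).
mssDC :: [Int] -> Int
mssDC xs = let (_, s, _, _) = go xs in s
  where
    go []  = (0, 0, 0, 0)
    go [x] = let v = 0 `max` x in (v, v, v, x)
    go ys  = merge (go l) (go r)
      where (l, r) = splitAt (length ys `div` 2) ys
    -- Combining the two halves takes constant time.
    merge (p1, s1, t1, a1) (p2, s2, t2, a2) =
      ( p1 `max` (a1 + p2)         -- prefix: within left, or all of left ++ prefix of right
      , maximum [s1, s2, t1 + p2]  -- segment: left, right, or crossing the edge
      , t2 `max` (t1 + a2)         -- suffix: within right, or suffix of left ++ all of right
      , a1 + a2 )
```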

I learned the functional derivation of the maximum segment sum problem from one of Jeremy’s papers [3] and was very amazed. It was perhaps one of the early incidents that inspired my interest in program calculation. The derivation does not appear to be very well known outside the program derivation circle — not even among functional programmers — so I would like to redo it here.

The first few steps of the derivation goes:

```
max . map sum . segs
= { definition of segs }
max . map sum . concat . map inits . tails
= { since map f . concat = concat . map (map f) }
max . concat . map (map sum) . map inits . tails
= { since max . concat = max . map max }
max . map max . map (map sum) . map inits . tails
= { since map f . map g = map (f.g) }
max . map (max . map sum . inits) . tails
```

The purpose of the book-keeping transformation above is to push `max . map sum`

closer to `inits`

. The fragment `max . map sum . inits`

is a function which, given a list of numbers, computes the maximum sum among all its prefixes. We denote it by `mps`

, for maximum prefix sum. The specification has been transformed to:

` mss = max . map mps . tails `

This is a common strategy for segment problems: to solve a problem looking for an optimal segment, proceed by looking for an optimal prefix of each suffix. (Symmetrically, we could process the list the other way round, looking for an optimal suffix of each prefix.)

We wish that `mps`

for each of the suffixes can be efficiently computed in an incremental manner. For example, to compute `mps [-1,3,3,-4]`

, rather than actually enumerating all suffixes, we wish that it can be computed from `-1`

and `mps [3,3,-4] = 6`

, which can in turn be computed from `3`

and `mps [3,-4] = 3`

, all in constant time. In other words, we wish that `mps`

is a `foldr`

using a constant time step function. If this is true, one can imagine that we could efficiently implement `map mps . tails`

in linear time. Indeed, `scanr f e = map (foldr f e) . tails`

!
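The identity can be checked on an example, say with summation as the fold:

```haskell
import Data.List (tails)

-- scanr f e computes foldr f e on every suffix, in a single pass.
checkScanr :: [Int] -> Bool
checkScanr xs = scanr (+) 0 xs == map (foldr (+) 0) (tails xs)
```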

The aim now is to turn `mps = max . map sum . inits`

into a `foldr`

. Luckily, `inits`

is actually a `foldr`

. In the following we will perform `foldr`

-fusion twice, respectively fusing `map sum`

and `max`

into `inits`

, thus turning the entire expression into a `foldr`

.

The first fusion goes:

```
max . map sum . inits
= { definition of inits }
max . map sum . foldr (\x xss -> [] : map (x:) xss) [[]]
= { fold fusion, see below }
max . foldr zplus [0]
```

The fusion condition can be established below, through which we also construct the definition of `zplus`

:

```
map sum ([] : map (x:) xss)
= 0 : map (sum . (x:)) xss
= { by definition, sum (x : xs) = x + sum xs }
0 : map (x+) (map sum xss)
= { define zplus x xss = 0 : map (x+) xss }
zplus x (map sum xss)
```
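A quick sanity check of the fused program against the original expression (illustrative only — the fusion argument above is the real proof):

```haskell
import Data.List (inits)

zplus :: Int -> [Int] -> [Int]
zplus x xss = 0 : map (x +) xss

-- Both sides compute the maximum prefix sum.
lhs, rhs :: [Int] -> Int
lhs = maximum . map sum . inits
rhs = maximum . foldr zplus [0]
```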

We continue with the derivation and perform another fusion:

```
max . foldr zplus [0]
= { fold fusion, let zmax x y = 0 ↑ (x+y) }
foldr zmax 0 {-"."-}
```

For the second fold fusion to work, we have to prove the following fusion condition:

```
max (0 : map (x+) xs)
= 0 ↑ max (map (x+) xs)
= { since max (map (x +) xs) = x + max xs }
0 ↑ (x + max xs) {-"."-}
```

The property `max (map (x +) xs) = x + max xs` in the last step follows from the fact that `(+)` distributes over `(↑)`, that is, `(x + y) ↑ (x + z) = x + (y ↑ z)`. This is the key property that allows the whole derivation to work.

By performing `foldr`

-fusion twice we have established that

`mps = foldr zmax 0`

In words, `mps (x : xs)`

, the best prefix sum of `x : xs`

, can be computed by `zmax x (mps xs)`

. The definition of `zmax`

says that if `x + mps xs`

is positive, it is the maximum prefix sum; otherwise we return `0`

, sum of the empty prefix.
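The resulting `foldr` indeed computes `mps` incrementally, matching the examples given earlier (`mps [3,-4] = 3`, `mps [3,3,-4] = 6`):

```haskell
-- Maximum prefix sum as a foldr with a constant-time step.
mps :: [Int] -> Int
mps = foldr zmax 0
  where zmax x y = 0 `max` (x + y)
```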

Therefore, `mss`

can be computed by a `scanr`

:

```
mss
= { reasoning so far }
max . map (foldr zmax 0) . tails
= { introducing scanr }
max . scanr zmax 0 {-"."-}
```

We have derived `mss = max . scanr zmax 0`

, where `zmax x y = 0 ↑ (x+y)`

.

Many functional derivations usually stop here. This gives us an algorithm that runs in linear time, but takes linear space. A tupling transformation eliminates the need for linear space:

` mss = snd . (head &&& max) . scanr zmax 0 `

where `(f &&& g) a = (f a, g a)`

. The part `(head &&& max) . scanr zmax 0`

returns a pair, the first component being the result of `mps`

, the second `mss`

. By some mechanical simplification we get the final algorithm:

```
mss = snd . foldr step (0,0)
where step x (p,s) = (0 ↑ (x+p), (0 ↑ (x+p)) ↑ s)
```

The maximum segment sum problem later turned out to be an example of Richard and Oege’s Greedy Theorem [2]. It is an exercise in the Algebra of Programming book, but I have not seen the solution given anywhere. For completeness, I recorded a relational derivation in a paper of mine about some other variations of the maximum segment sum problem [4].

- Bentley, Jon. Programming Pearls. Addison-Wesley, Inc, 1987.
- Bird, Richard and de Moor, Oege. Algebra of Programming. Prentice-Hall, 1997.
- Gibbons, Jeremy. Calculating Functional Programs. Proceedings of ISRG/SERG Research Colloquium, Oxford Brookes University, November 1997.
- Mu, Shin-Cheng. Maximum segment sum is back: deriving algorithms for two segment problems with bounded lengths. Partial Evaluation and Program Manipulation (PEPM ’08), pp 31-39. January 2008.

```
def mss(l, u):
    if l > u:
        return 0                    # empty array
    elif l == u:
        return max(0, a[l])         # singleton array
    else:
        m = (l + u) // 2
        # compute maximum suffix sum of a[l..m]
        total, maxToLeft = 0, 0
        for i in range(m, l - 1, -1):
            total = total + a[i]
            maxToLeft = max(maxToLeft, total)
        # compute maximum prefix sum of a[m+1..u]
        total, maxToRight = 0, 0
        for i in range(m + 1, u + 1):
            total = total + a[i]
            maxToRight = max(maxToRight, total)
        maxCrossing = maxToLeft + maxToRight
        # recursively compute mss of a[l..m] and a[m+1..u]
        maxInL = mss(l, m)
        maxInR = mss(m + 1, u)
        return max(maxInL, maxCrossing, maxInR)
```


]]>