In the previous post we reviewed Hans Zantema’s algorithm for solving longest segment problems with suffix and overlap-closed predicates. For predicates that are not overlap-closed, Zantema derived a so-called “windowing” technique, which will be the topic of this post.
A brief review: the longest segment problem takes the form:
max# ∘ p ◁ ∘ segs
where segs :: [a] → [[a]]
, defined by segs = concat ∘ map inits ∘ tails
returns all consecutive segments of the input list; p ◁
is an abbreviation for filter p
, and max# :: [[a]] → [a]
returns the longest list from the input list of lists. In words, the task is to compute the longest consecutive segment of the input that satisfies predicate p
.
A predicate p
is suffix-closed if p (xs ⧺ ys) ⇒ p ys
. For suffix-closed p
, Zantema proposed a technique that, from a high-level point of view, looks just like the right solution to such problems. We scan through the input list using a foldr
from the right to the left, during which we try to maintain the longest segment satisfying p
so far. Also, we keep a prefix of the list that is as long as the currently longest segment, which we call the window.
If, when we move one element to the right, the window (now one element longer than the currently longest segment) happens to satisfy p
, it becomes the new optimal solution. Otherwise we drop the right-most element of the window so that it slides leftwards, retaining the length. Notice that it implies that we’d better represent the window using a queue, so we can efficiently add elements from the left and drop from the right.
Derivation of the algorithm is a typical case of tupling.
Tupling
Given a function h
, we attempt to compute it efficiently by turning it into a foldr
. It would be possible if the value of the inductive case h (x : xs)
were determined solely by x
and h xs
, that is:
h (x : xs) = f x (h xs)
for some f
. With some investigation, however, it would turn out that h (x : xs)
also depends on some g
:
h (x : xs) = f x (g (x : xs)) (h xs)
Therefore, we instead try to construct their split ⟨ h , g ⟩
as a fold, where the split is defined by:
⟨ h , g ⟩ xs = (h xs, g xs)
and h = fst . ⟨ h , g ⟩
.
If ⟨ h , g ⟩
is indeed a fold, it should scan through the list and construct a pair of a h
-value and a g
-value. To make it feasible, it is then hoped that g (x : xs)
can be determined by g xs
and h xs
. Otherwise, we may have to repeat the process again, making the fold return a triple.
Segment/Prefix Decomposition
Let us look into the longest segment problem. For suffix-closed p
it is reasonable to assume that p []
is true — otherwise p
would be false everywhere. Therefore, for the base case we have max# ∘ p ◁ ∘ segs ▪ [] = []
. We denote function application by ▪
to avoid too many parentheses.
Now the inductive case. It is not hard to derive an alternative definition of segs
:
segs [] = [[]]
segs (x : xs) = inits (x:xs) ⧺ segs xs
therefore, we derive:
max# ∘ p ◁ ∘ segs ▪ (x : xs)
= max# ∘ p ◁ ▪ (inits (x : xs) ⧺ segs xs)
= (max# ∘ p ◁ ∘ inits ▪ (x : xs)) ↑#
(max# ∘ p ◁ ∘ segs ▪ xs)
where xs ↑# ys
returns the longer one between xs
and ys
.
It suggests that we maintain, by a foldr
, a pair containing the longest segment and the longest prefix satisfying p
(that is, max# ∘ p ◁ ∘ inits
). It is then hoped that max# ∘ p ◁ ∘ inits ▪ (x : xs)
can be computed using max# ∘ p ◁ ∘ inits ▪ xs
. And luckily, it is indeed the case, implied by the following proposition proved in an earlier post:
Proposition 1: If
p
is suffix-closed, we have:p ◁ ∘ inits ▪ (x : xs) = finits (max# ∘ p ◁ ∘ inits ▪ xs)
where
finits ys = p ◁ ∘ inits ▪ (x : ys)
.
Proposition 1 says that the list (or set) of all the prefixes of x : xs
that satisfies p
can be computed by the longest prefix of xs
(call it ys
) satisfying p
, provided that p
is suffix-closed. A naive way to do so is simply by computing all the prefixes of x : ys
and do the filtering again, as is done in finits
.
This was the route taken in the previous post. It would turn out, however, to come up with an efficient implementation of f
we need some more properties from p
, such as that it is also overlap-closed.
The “Window”
Proposition 1 can be strengthened: to compute all the prefixes of x : xs
that satisfies p
using finits
we do not strictly have to start with ys
. Any prefix of xs
longer than ys
will do.
Proposition 2: If
p
is suffix-closed, we have:p ◁ ∘ inits ▪ (x : xs) = finits (take i xs)
where
finits ys = p ◁ ∘ inits ▪ (x : ys)
, andi ≥ length ∘ max# ∘ p ◁ ∘ inits ▪ xs
.
In particular, we may choose i
to be the length of the longest segment:
Lemma 1:
length ∘ max# ∘ p ◁ ∘ segs ▪ xs ≥ length ∘ max# ∘ p ◁ ∘ inits ▪ xs
Appealing to intuition, Lemma 1 is true because segs xs
is a superset of inits xs
.
Remark: Zantema proved Proposition 1 by contradiction. The purpose of an earlier post was to give a constructive proof of Proposition 1, which was considerably harder than I expected. I’d be interested to see a constructive proof of Proposition 2.
Now we resume the reasoning:
max# ∘ p ◁ ∘ segs ▪ (x : xs)
= max# ∘ p ◁ ▪ (inits (x : xs) ⧺ segs xs)
= (max# ∘ p ◁ ∘ inits ▪ (x : xs)) ↑#
(max# ∘ p ◁ ∘ segs ▪ xs)
= { Proposition 2 and Lemma 1 }
let s = max# ∘ p ◁ ∘ segs ▪ xs
in (max# ∘ finits ▪ (x : take (length s) xs)) ↑# s
Define window xs = take (length ∘ max# ∘ p ◁ ∘ segs ▪ xs) xs
, the reasoning above suggest that we may try the following tupling:
max# ∘ p ◁ ∘ segs
= fst ∘ ⟨ max# ∘ p ◁ ∘ segs , window ⟩
Maintaining the Longest Segment and the Window
The task now is to express ⟨ max# ∘ p ◁ ∘ segs , window ⟩
as a foldr
. We can do so only if both max# ∘ p ◁ ∘ segs ▪ (x : xs)
and window (x : xs)
can be determined by max# ∘ p ◁ ∘ segs ▪ xs
and window xs
. Let us see whether it is the case.
Maintaining the Longest Segment
Regarding max# ∘ p ◁ ∘ segs ▪ (x : xs)
, we have derived:
max# ∘ p ◁ ∘ segs ▪ (x : xs)
= { as shown above, let s = max# ∘ p ◁ ∘ segs ▪ xs }
(max# ∘ p ◁ ∘ inits ▪ (x : take (length s) xs)) ↑# s
Let s = max# ∘ p ◁ ∘ segs ▪ xs
. We consider two cases:
- Case
p (x : take (length s) xs)
. We reason:(max# ∘ p ◁ ∘ inits ▪ (x : take (length s) xs)) ↑# s = { see below } (x : take (length s) xs) ↑# s = { since the LHS is one element longer than the RHS } x : take (length s) xs = { definition of window } x : window xs
The first step is correct because, for all
zs
,p zs
implies thatmax# ∘ p ◁ ∘ inits ▪ zs = zs
. - Case
¬ p (x : take (length s) xs)
. In this case(max# ∘ p ◁ ∘ inits ▪ (x : take (length s) xs)) ↑# s
must bes
, since¬ p zs
implies thatlength∘ max# ∘ p ◁ ∘ inits ▪ zs < length zs
.
Maintaining the Window
Now consider the window. Also, we do a case analysis:
- Case
p (x : take (length s) xs)
. We reason:window (x : xs) = take (length ∘ max# ∘ p ◁ ∘ segs ▪ (x : xs)) (x : xs) = { by the reasoning above } take (length (x : take (length s) xs)) (x : xs) = { take and length } x : take (length (take (length s)) xs) xs = { take and length } x : take (length s) xs = x : window xs
- Case
¬ p (x : take (length s) xs)
. We reason:window (x : xs) = take (length ∘ max# ∘ p ◁ ∘ segs ▪ (x : xs)) (x : xs) = { by the reasoning above } take (length s) (x : xs) = { take and length } x : take (length (s-1)) xs = x: init (window xs)
The Algorithm
In summary, the reasoning above shows that
⟨ max# ∘ p ◁ ∘ segs , window ⟩ = foldr step ([], [])
where step
is given by
step x (s, w) | p (x : w) = (x : w, x : w)
| otherwise = (s, x : init w)
As is typical of many program derivations, after much work we deliver an algorithm that appears to be rather simple. The key invariants that made the algorithm correct, such as that s
is the optimal segment and that w
is as long as s
, are all left implicit. It would be hard to prove the correctness of the algorithm without these clues.
We are not quite done yet. The window w
had better be implemented using a queue, so that init w
can be performed efficiently. The algorithm then runs in time linear to the length of the input list, provided that p (x : w)
can be performed in constant time -- which is usually not the case for interesting predicates. We may then again tuple the fold with some information that helps to compute p
efficiently. But I shall stop here.
Reference
- H. Zantema. Longest Segment Problems Sci. Comput. Program. 18(1): 39-66 (1992).