Maximum Segment Density with Lower-Bounded Length
Shin-Cheng Mu
May 2007
This literate Haskell script presents a program solving
the maximum segment density problem: given a list of numbers,
compute the maximum density (average) of a segment that is
longer than a lower-bound.
At the first glance, the derivation would make use of
the Thinning Theorem. It turns out that we need a kind
of "global" thinning difficult to be modelled by a preorder.
See Mu (2007) for details. The derived algorithm is similar
to that of Lin et al. (2002).
Functions defined:
msdl_s :: Len -> [V] -> R
The specification.
msdl_1 :: Len -> [V] -> R
An intermediate stage of derivation kept for expository
purpose. We pass a set of solutions around, represented
naively using a list.
msdl_2 :: Int -> [V] -> R
The next stage of derivation. After exploiting some
properties we realise that the set can be represented
by a sequence. The sequences are still represented by
lists.
msdl_l :: Len -> [V] -> R
Changing the representation by fusing "diffs" into
the fold. The program has become rather cryptical.
msdl :: Len -> [V] -> R
Derived program, instead of lists, we use
SimpleQueue and BraunSeq in Edison.
QuickCheck properties:
test_lb :: Len
The lower-bound on the length of segments.
prop_msdl_1_correct :: [V] -> Bool
That msdl_1 implements mssu_s.
prop_msdl_2_correct :: [V] -> Bool
That msdl_2 implements mssu_s.
prop_msdl_l_correct :: [V] -> Bool
That msdl_l implements mssu_s.
prop_msdl_correct :: [V] -> Bool
That msdl implements mssu_s.
> import List
This program uses Edison. If you do not have Edison installed,
comment out the line below as well as the section for mssu.
> import qualified Data.Edison.Seq.SimpleQueue as Q
> import qualified Data.Edison.Seq.SizedSeq as Sz
> import qualified Data.Edison.Seq.BraunSeq as B
> import Test.QuickCheck
> type V = Integer
> type Len = Int
> type R = Rational
Specification. In order to avoid round-off errors during
QuickChecking, we assume that V, the type of values we deal,
is Integer and use Rational to represent density. In general,
V need not be restricted to Integer.
> msdl_s :: Len -> [V] -> R
> msdl_s lb = maxlist . map avg . filter ((lb <=) . snd) .
> map (split (sum, length)) . concat . map inits . tails
> maxlist :: Ord v => [v] -> v
> maxlist = foldr1 bmax
> x `bmax` y | x < y = y
> | otherwise = x
> split (f,g) a = (f a, g a)
> avg :: (V,Len) -> R
> avg (s,l) = (fromIntegral s) / (fromIntegral l)
After (global) thinning, the algorithm is expressed by a fold
passing around a set of solutions.
> msdl_1 :: Len -> [V] -> R
> msdl_1 lb = snd . foldr (step1 lb) (([],[(0,0)]),-32767)
> type SL = (V, Len)
> type Queue a = [a]
> type Set a = [a]
> type ST1 = ((Queue V, Set SL),R)
> step1 :: Len -> V -> ST1 -> ST1
> step1 lb a ((x,ys),m) | length x < lb =
> ((a:x,ys),avg (split (sum, length) (a:x)))
> step1 lb a ((x,ys),m) | otherwise =
> ((a:x',ys'), m `bmax` findmax1 (a:x') ys')
> where (x',b) = split (init,last) x
> ys' = (0,0) : rm1 b ys
> add a (s,l) = (a+s,1+l)
> acat (s1,l1) (s2,l2) = (s1+s2,l1+l2)
> rm1 b ys = [add b y | y <- ys, and [safe b y y' | y' <- ys]]
> safe b (s1,l1) (s2,l2) = l2 <= l1 ||
> avg (b+s1,1+l1) > avg (s2-s1,l2-l1)
> findmax1 x ys = maxlist [avg ((s,l) `acat` y) | y <- ys ]
> where (s,l) = split (sum, length) x
Testing.
> test_lb :: Len
> test_lb = 4
> prop_msdl_1_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl_1 test_lb x
> longer 0 = arbitrary
> longer (n+1) = do a <- arbitrary
> x <- longer n
> return (a:x)
By exploiting properties of the set xs, we realise that
there are faster ways to implement rm and maxfind.
The program is kept for illustrative purpose because
after introducing diffs, the program becomes rather
cryptical.
> msdl_2 :: Int -> [V] -> R
> msdl_2 lb = snd . foldr (step2 lb) (([],[]),-32767)
> type ST2 = ((Queue V, [SL]),R)
> step2 :: Len -> V -> ST2 -> ST2
> step2 lb a ((x,ys),m) | length x < lb =
> ((a:x,ys),avg (split (sum, length) (a:x)))
> step2 lb a ((x,ys),m) | otherwise =
> ((a:x',ys'), m `bmax` findmax2 (split (sum, length) (a:x')) ys')
> where (x',b) = split (init,last) x
> ys' = rm2 lb b ys
> rm2 lb b ys = (bump . dropR_l long) ((b,1) : map (add b) ys)
> where long (s,l) = l > lb -1
> bump [] = []
> bump [y] = [y]
> bump (y:z:ys)
> | avg y > avg z = y:z:ys
> | otherwise = bump (z:ys)
> findmax2 x ys = findmax2' x ((0,0):ys)
> findmax2' x [] = avg x
> findmax2' x [y] = avg x `bmax` avg (x `acat` y)
> findmax2' x (y:z:ys)
> | avg (x `acat` y) > avg (x `acat` z) = avg (x `acat` y)
> | otherwise = findmax2' x (z:ys)
> dropR_l p [] = []
> dropR_l p xs | p (last xs) = dropR_l p (init xs)
> | otherwise = xs
> prop_msdl_2_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl_2 test_lb x
The final algorithm. Refined from msdl2 by planting diffs.
> msdl_l :: Len -> [V] -> R
> msdl_l lb = snd . foldr (stepl_l lb) (([],((0,0),[])),-32767)
> type ST = ((Queue V, (SL, [SL])), R)
> stepl_l :: Len -> V -> ST -> ST
> stepl_l lb a ((x,ys),m) | length x < lb =
> ((a:x,ys),avg (split (sum, length) (a:x)))
> stepl_l lb a ((x,ys),m) | otherwise =
> ((a:x',ys'),
> m `bmax` findmax_l (split (sum, length) (a:x')) ys')
> where (x',b) = split (init,last) x
> ys' = rm_l lb b ys
> rm_l lb b ((c,n),ys) =
> ((c',n'), (bump . dropR_l long) ((c,n):ys))
> where long (s,l) = 1+n-l > lb -1
> (c',n') = (b+c,1+n)
> bump [] = []
> bump [y] = [y]
> bump (y:z:ys)
> | avgd (c',n') y > avgd (c',n') z = y:z:ys
> | otherwise = bump (z:ys)
> avgd (c,n) (s,l) = avg (c-s,n-l)
> avgdx (t,m) (c,n) (s,l) = avg (t+c-s, m+n-l)
> findmax_l x ((c,n),ys) = findmax_l' x ((c,n),(c,n):ys)
> findmax_l' x ((c,n),[]) = avg x
> findmax_l' x ((c,n),[y]) = avg x `bmax` avgdx x (c,n) y
> findmax_l' x ((c,n),y:z:ys)
> | avgdx x (c,n) y > avgdx x (c,n) z
> = avgdx x (c,n) y
> | otherwise = findmax_l' x ((c,n),z:ys)
> prop_msdl_l_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl_l test_lb x
Derived algorithm, using advanced data structures
for sequences. If you do not have Edison installed,
comment out the rest of the file.
> msdl :: Len -> [V] -> R
> msdl lb = snd . foldr (stepl lb) (((Sz.empty,0),((0,0),Sz.empty)),-32767)
We use a SimpleQueue to implement the prefix, and a Braun sequence
to implement the set of solutions. We need to maintain the sizes
of both (by using the Sized module of Edison). Also, we need to
maintain the sum of the prefix.
> type PrefixSeq a = (Sz.Sized (Q.Seq) a, V) -- maintaining the sum.
> type SolSeq a = Sz.Sized (B.Seq) a
> type STQ = ((PrefixSeq V, (SL, SolSeq SL)) ,R)
> stepl :: Len -> V -> STQ -> STQ
> stepl lb a ((x,ys),m) | pssize x < lb =
> ((a `pslcons` x,ys),avg (split (pssum, pssize) (a `pslcons` x)))
> stepl lb a ((x,ys),m) | otherwise =
> ((a `pslcons` x',ys'),
> m `bmax` findmax (split (pssum, pssize) (a `pslcons` x')) ys')
> where (x',b) = split (psrtail,psrhead) x
> ys' = rm lb b ys
Some functions maintaining the prefix.
> pslcons a (x, s) = (a `Sz.lcons` x, s+a)
> psrtail (x,s) = (Sz.rtail x, s-(Sz.rhead x))
> psrhead (x,s) = Sz.rhead x
> pssize (x,s) = Sz.size x
> pssum (x,s) = s
> rm lb b ((c,n),ys) =
> ((c',n'), (bump . dropR long) ((c,n) `Sz.lcons` ys))
> where long (s,l) = 1+n-l > lb -1
> (c',n') = (b+c,1+n)
> bump ys
> | Sz.null ys = ys
> | Sz.null (Sz.ltail ys) = ys
> | otherwise = let (y,ys') = (Sz.lhead ys, Sz.ltail ys)
> z = Sz.lhead ys'
> in if avgd (c',n') y > avgd (c',n') z
> then ys else bump ys'
> dropR p xs
> | Sz.null xs = xs
> | p (Sz.rhead xs) = dropR p (Sz.rtail xs)
> | otherwise = xs
> findmax x ((c,n),ys) =
> if i == k-1 then avg x `bmax` avgdx x (c,n) (Sz.lookup i ys')
> else avgdx x (c,n) (Sz.lookup (i+1) ys')
> where ys' = (c,n) `Sz.lcons` ys
> k = Sz.size ys'
> i = bsearch p 0 (Sz.size ys')
> p j = (j == k-1) ||
> (avgdx x (c,n) (Sz.lookup j ys') <=
> avgdx x (c,n) (Sz.lookup (j+1) ys'))
The function call bsearch p l r returns the largest i within
the range l <= i < r such that p i holds.
> bsearch p l r
> | l == r = -1
> | l+1 == r = if p l then l else -1
> | otherwise = let m = (l + r) `div` 2
> in if p m then bsearch p m r
> else bsearch p l m
> prop_msdl_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl test_lb x
References
[LJ02] Y.-L. Lin, T. Jiang, and K.-M. Chao. Efficient algorithms
for locating the length-constrained heaviest segments, with
applications to biomolecular sequence analysis. In Proceedings
of the 27th International Symposium on Mathematical Foundations
of Computer Science, Lecture Notes in Computer Science 2420,
pages 459–470. Springer-Verlag, 2002.
[Mu07] S-C. Mu. Maximum segment sum is back: deriving algorithms
for two maximum segment problems with bounded lengths.
Under submission. 2007.