Maximum Segment Density with Lower-Bounded Length
Shin-Cheng Mu
Sep 2007
This literate Haskell script presents a program solving
the maximum segment density problem: given a list of numbers,
compute the maximum density (average) of a segment that is
longer than a lower-bound.
At the first glance, the derivation would make use of
the Thinning Theorem. It turns out that we need a kind
of "global" thinning difficult to be modelled by a preorder.
See Mu (2007) for details. The derived algorithm is similar
to that of Lin et al. (2002).
Functions defined:
msdl_s :: Len -> [V] -> R
The specification.
msdl_1 :: Len -> [V] -> R
An intermediate stage of derivation kept for expository
purpose. We pass a set of solutions around, represented
naively using a list.
msdl_2 :: Int -> [V] -> R
The next stage of derivation. After exploiting some
properties we realise that the set can be represented
by a sequence. With the discovery of Goldwasser et al. (2005)
we are able to refine the algorithm to amortised linear
time. The sequences are still represented by lists.
msdl_l :: Len -> [V] -> R
Changing the representation by fusing "diffs" into
the fold. The program has become rather cryptical.
msdl :: Len -> [V] -> R
Derived program, instead of lists, we use
SimpleQueue and BraunSeq in Edison.
QuickCheck properties:
test_lb :: Len
The lower-bound on the length of segments.
prop_msdl_1_correct :: [V] -> Bool
That msdl_1 implements msdl_s.
prop_msdl_2_correct :: [V] -> Bool
That msdl_2 implements msdl_s.
prop_msdl_l_correct :: [V] -> Bool
That msdl_l implements msdl_s.
prop_msdl_correct :: [V] -> Bool
That msdl implements msdl_s.
> import List
> import Control.Arrow
This program uses Edison. If you do not have Edison installed,
comment out the line below as well as the section for mssu.
> import qualified Data.Edison.Seq.SimpleQueue as Q
> import qualified Data.Edison.Seq.SizedSeq as Sz
> import qualified Data.Edison.Seq.BraunSeq as B
> import Test.QuickCheck
> type V = Integer
> type Len = Int
> type R = Rational
Specification. In order to avoid round-off errors during
QuickChecking, we assume that V, the type of values we deal,
is Integer and use Rational to represent density. In general,
V need not be restricted to Integer.
> msdl_s :: Len -> [V] -> R
> msdl_s lb = maxlist . map avg . filter ((lb <=) . snd) .
> map sumlen . concat . map inits . tails
> sumlen = sum &&& length
> maxlist :: Ord v => [v] -> v
> maxlist = foldr1 bmax
> x `bmax` y | x < y = y
> | otherwise = x
> avg :: (V,Len) -> R
> avg (s,l) = (fromIntegral s) / (fromIntegral l)
After (global) thinning, the algorithm is expressed by a fold
passing around a set of solutions.
> msdl_1 :: Len -> [V] -> R
> msdl_1 lb = snd . foldr (step1 lb) (([],[(0,0)]),-32767)
> type SL = (V, Len)
> type Queue a = [a]
> type Set a = [a]
> type ST1 = ((Queue V, Set SL),R)
> step1 :: Len -> V -> ST1 -> ST1
> step1 lb a ((x,ys),m) | length x < lb =
> ((a:x,ys),avg (sumlen (a:x)))
> step1 lb a ((x,ys),m) | otherwise =
> ((a:x',ys'), m `bmax` findmax1 (a:x') ys')
> where (x',b) = (init &&& last) x
> ys' = (0,0) : rm1 (map (add b) ys)
> add a (s,l) = (a+s,1+l)
> acat (s1,l1) (s2,l2) = (s1+s2,l1+l2)
> rm1 ys = [y | y <- ys, and [safe y y' | y' <- ys]]
> where safe (s1,l1) (s2,l2) = l2 <= l1 ||
> avg (s1,l1) > avg (s2-s1,l2-l1)
> findmax1 x ys = maxlist [avg ((s,l) `acat` y) | y <- ys ]
> where (s,l) = sumlen x
Testing.
> test_lb :: Len
> test_lb = 4
> prop_msdl_1_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl_1 test_lb x
> longer 0 = arbitrary
> longer (n+1) = do a <- arbitrary
> x <- longer n
> return (a:x)
By exploiting properties of the set xs, we realise that
there are faster ways to implement rm and maxfind.
The program is kept for illustrative purpose because
after introducing diffs, the program becomes rather
cryptical.
> msdl_2 :: Int -> [V] -> R
> msdl_2 lb = snd . foldr (step2 lb) (([],[]),-32767)
> type ST2 = ((Queue V, [SL]),R)
> step2 :: Len -> V -> ST2 -> ST2
> step2 lb a ((x,ys),m) | length x < lb =
> ((a:x,ys),avg (sumlen (a:x)))
> step2 lb a ((x,ys),m) | otherwise =
> ((a:x',ys'), m `bmax` m')
> where (x',b) = (init &&& last) x
> sl = sumlen (a:x')
> ys' = cut2 sl (rm2 ((b,1) : map (add b) ys))
> m' = if null ys' then avg sl
> else avg (sl `acat` last ys')
> rm2 [] = error "[]"
> rm2 [y] = [y]
> rm2 (y:z:ys) | avg y > avg z = y:z:ys
> | otherwise = rm2 (z:ys)
> cut2 x [] = error "[]"
> cut2 x [y] =
> if avg x > avg (x `acat` y)
> then []
> else [y]
> cut2 x ys = -- ys = ys'' ++ [z,y]
> let (ys',y) = (init &&& last) ys
> (ys'',z) = (init &&& last) ys'
> in if avg (x `acat` z) > avg (x `acat` y)
> then cut2 x ys'
> else ys
> dropR_l p [] = []
> dropR_l p xs | p (last xs) = dropR_l p (init xs)
> | otherwise = xs
> prop_msdl_2_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl_2 test_lb x
The final algorithm. Refined from msdl2 by planting diffs.
> msdl_l :: Len -> [V] -> R
> msdl_l lb = snd . foldr (stepl_l lb) (([],((0,0),[])),-32767)
> type ST = ((Queue V, (SL, [SL])), R)
> stepl_l :: Len -> V -> ST -> ST
> stepl_l lb a ((x,ys),m) | length x < lb =
> ((a:x,ys),avg (sumlen (a:x)))
> stepl_l lb a ((x,((c,n),ys)),m) | otherwise =
> ((a:x',(cn', ys')), m' `bmax` m)
> where (x',b) = (init &&& last) x
> cn' = (b+c,1+n)
> sl = sumlen (a:x')
> ys' = cut_l sl cn' (rm_l cn' ((c,n):ys))
> m' = if null ys' then avg sl else avgdx sl cn' (last ys')
> rm_l (c',n') [] = error "[]"
> rm_l (c',n') [y] = [y]
> rm_l (c',n') (y:z:ys)
> | avgd (c',n') y > avgd (c',n') z = y:z:ys
> | otherwise = rm_l (c',n') (z:ys)
> avgd (c,n) (s,l) = avg (c-s,n-l)
> avgdx (t,m) (c,n) (s,l) = avg (t+c-s, m+n-l)
> cut_l x _ [] = error "[]"
> cut_l x (c,n) [y] =
> if avg x > avgdx x (c,n) y
> then []
> else [y]
> cut_l x (c,n) ys = -- ys = ys'' ++ [z,y]
> let (ys',y) = (init &&& last) ys
> (ys'',z) = (init &&& last) ys'
> in if avgdx x (c,n) z > avgdx x (c,n) y
> then cut_l x (c,n) ys'
> else ys
> prop_msdl_l_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl_l test_lb x
Derived algorithm, using advanced data structures
for sequences. If you do not have Edison installed,
comment out the rest of the file.
> msdl :: Len -> [V] -> R
> msdl lb = snd . foldr (stepl lb) (((Sz.empty,0),((0,0),Sz.empty)),-32767)
We use two SimpleQueue to implement the prefix and the set of solutions.
We need to maintain the sizes of both (by using the Sized module of Edison).
Also, we need to maintain the sum of the prefix.
> type PrefixSeq a = (Sz.Sized (Q.Seq) a, V) -- maintaining the sum.
> type SolSeq a = Sz.Sized (Q.Seq) a
> type STQ = ((PrefixSeq V, (SL, SolSeq SL)) ,R)
> stepl :: Len -> V -> STQ -> STQ
> stepl lb a ((x,ys),m) | pssize x < lb =
> ((a `pslcons` x,ys),avg ((pssum &&& pssize) (a `pslcons` x)))
> stepl lb a ((x,((c,n),ys)),m) | otherwise =
> ((a `pslcons` x',(cn',ys')), m' `bmax` m)
> where (x',b) = (psrtail &&& psrhead) x
> cn' = (b+c,1+n)
> sl = (pssum &&& pssize) (a `pslcons` x')
> ys' = cut sl cn' (rm cn' ((c,n) `Sz.lcons` ys))
> m' = if Sz.null ys' then avg sl else avgdx sl cn' (Sz.rhead ys')
Some functions maintaining the prefix.
> pslcons a (x, s) = (a `Sz.lcons` x, s+a)
> psrtail (x,s) = (Sz.rtail x, s-(Sz.rhead x))
> psrhead (x,s) = Sz.rhead x
> pssize (x,s) = Sz.size x
> pssum (x,s) = s
> rm (c',n') ys | Sz.null ys = ys
> | Sz.null (Sz.ltail ys) = ys
> | otherwise = let (y,ys') = (Sz.lhead ys, Sz.ltail ys)
> z = Sz.lhead ys'
> in if avgd (c',n') y > avgd (c',n') z
> then ys else rm (c',n') ys'
> dropR p xs
> | Sz.null xs = xs
> | p (Sz.rhead xs) = dropR p (Sz.rtail xs)
> | otherwise = xs
> cut x (c,n) ys
> | Sz.null ys = Sz.empty
> | Sz.size ys == 1 =
> if avg x > avgdx x (c,n) (Sz.rhead ys)
> then Sz.empty else ys
> | otherwise =
> let (ys',y) = (Sz.rtail &&& Sz.rhead) ys
> (ys'',z) = (Sz.rtail &&& Sz.rhead) ys'
> in if avgdx x (c,n) z > avgdx x (c,n) y
> then cut x (c,n) ys'
> else ys
> prop_msdl_correct =
> forAll (longer test_lb) $
> \x -> msdl_s test_lb x == msdl test_lb x
References
[GK05] M. H. Goldwasser, M.-Y. Kao, and H.-I. Lu. Linear-time algorithms
for computing maximum-density sequence segments with bioinformatics
applications. Journal of Computer and System Sciences,
70(2):128–144, 2005.
[LJ02] Y.-L. Lin, T. Jiang, and K.-M. Chao. Efficient algorithms
for locating the length-constrained heaviest segments, with
applications to biomolecular sequence analysis. In Proceedings
of the 27th International Symposium on Mathematical Foundations
of Computer Science, Lecture Notes in Computer Science 2420,
pages 459–470. Springer-Verlag, 2002.
[Mu07] S-C. Mu. Maximum segment sum is back: deriving algorithms
for two maximum segment problems with bounded lengths.
Under submission. 2007.