An executable monadic specification for Spark aggregation

Yu-Fang Chen, Chih-Duo Hong, Ondřej Lengál, Shin-Cheng Mu, Nishant Sinha, and Bow-Yaw Wang. Submitted.
[Code|Supplementary Proofs]

Spark is a new promising platform for scalable data-parallel computation that provides several high-level APIs to perform efficient data aggregation. Distributed computation is inherently non-deterministic, while programmers generally wish that their aggregation yield the same result regardless of how the data is distributed. We present an executable, monadic, formal specification for Spark aggregate combinators in Haskell. By algebraic, equational reasoning of monadic programs, requirements for deterministic outcomes from distributed Spark aggregation are precisely characterized. We report several case studies to analyze deterministic outcomes and correctness of Spark programs, and also illustrate how our executable specification helps developing a distributed Spark vertex coloring program.

Leave a Reply

Your email address will not be published. Required fields are marked *