KMeans Clustering II
The story so far: Previously we saw
- how to encode
n
-dimensional points using plain old Haskell lists, - how to encode a matrix with
r
rows andc
columns as a list of lists, - some basic operations on points and matrices via list-manipulating functions
More importantly, we saw how easy it was to encode dimensionality with refinements over
the len
measure, thereby allowing LiquidHaskell to precisely track the dimensions across
the various operations.
Next, lets use the basic types and operations to develop the actual KMeans clustering algorithm, and, along the way, see how LiquidHaskell lets us write precise module contracts which will ward against various unpleasant lumpenexceptions.
30: {-# LANGUAGE ScopedTypeVariables, TypeSynonymInstances, FlexibleInstances #-} 31: 32: module KMeans (kmeans, kmeansGen) where 33: 34: import KMeansHelper 35: import Prelude hiding (zipWith) 36: import Data.List (sort, span, minimumBy) 37: import Data.Function (on) 38: import Data.Ord (comparing) 39: import Language.Haskell.Liquid.Prelude (liquidAssert, liquidError) 40: 41: instance (GHC.Classes.Eq (KMeans.WrapType [GHC.Types.Double] a))Eq (WrapType [Double] a) where 42: (==) = x:{VV : [{VV : (GHC.Types.Double) | false}] | false} -> y:{VV : [{VV : (GHC.Types.Double) | false}] | false} -> {VV : (GHC.Types.Bool) | ((? Prop([VV])) <=> (x = y))}(==) ({VV : [{VV : (GHC.Types.Double) | false}] | false} -> {VV : [{VV : (GHC.Types.Double) | false}] | false} -> {VV : (GHC.Types.Bool) | false}) -> ({VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false} -> {VV : [{VV : (GHC.Types.Double) | false}] | false}) -> {VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false} -> {VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false} -> {VV : (GHC.Types.Bool) | false}`on` (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) -> {VV : [{VV : (GHC.Types.Double) | false}] | false}getVect 43: 44: instance (GHC.Classes.Ord (KMeans.WrapType [GHC.Types.Double] a))Ord (WrapType [Double] a) where 45: compare = (GHC.Classes.Ord [GHC.Types.Double])comparing (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) -> {VV : [{VV : (GHC.Types.Double) | false}] | false}getVect
Recall that we are using a modified version of an existing KMeans implementation. While not the swiftest implementation, it serves as a nice introduction to the algorithm, and the general invariants carry over to more sophisticated implementations.
A Quick Recap¶
Before embarking on the journey, lets remind ourselves of our destination: the goal of K-Means clustering is
-
Take as Input : A set of points represented by n-dimensional points in Euclidian space
-
Return as Ouptut : A partitioning of the points, into upto K clusters, in a manner that minimizes the sum of distances between each point and its cluster center.
Last time, we introduced a variety of refinement type aliases for Haskell lists
Fixed Length Lists
66: type List a N = {v : [a] | (len v) = N}
Non-empty Lists
70: type NonEmptyList a = {v : [a] | (len v) > 0}
N-Dimensional Points
74: type Point N = List Double N
Matrices
78: type Matrix a Rows Cols = List (List a Cols) Rows
We also saw several basic list operations
82: groupBy :: (a -> a -> Bool) -> [a] -> (Clustering a) 83: partition :: PosInt -> [a] -> (Clustering a) 84: zipWith :: (a -> b -> c) -> xs:[a] -> (List b (len xs)) -> (List c (len xs)) 85: transpose :: c:Int -> r:PosInt -> Matrix a r c -> Matrix a c r
whose types will prove essential in order to verify the invariants of the clustering algorithm. You might open the previous episode in a separate tab to keep those functions handy, but fear not, we will refresh our memory about them when we get around to using them below.
Generalized Points¶
To be more flexible, we will support arbitrary points as long as they can be projected to Euclidian space. In addition to supporting, say, an image or a cat video as a point, this will allow us to weight different dimensions to different degrees.
We represent generalized points with a record
104: data WrapType b a = WrapType {(KMeans.WrapType a b) -> agetVect :: b, (KMeans.WrapType a b) -> bgetVal :: a}
and we can define an alias that captures the dimensionality of the point
110: {-@ type GenPoint a N = WrapType (Point N) a @-}
That is, GenPoint a N
denotes a general a
value that has an
N
-dimensional projection into Euclidean space.
Algorithm: Iterative Clustering¶
Terrific, now that all the pieces are in place lets look at the KMeans
algorithm. We have implemented a function kmeans'
, which takes as input a
dimension n
, the maximum number of clusters k
(which must be positive),
a list of generalized points of dimension n
, and returns a Clustering
(i.e. a list of non-empty lists) of the generalized points.
So much verbiage -- a type is worth a thousand comments!
128: {-@ kmeans' :: n:Int 129: -> k:PosInt 130: -> points:[(GenPoint a n)] 131: -> (Clustering (GenPoint a n)) @-}
There! Crisp and to the point. Sorry. Anyhoo, the function implements the above type.
138: n:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV > 0)} -> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] -> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]kmeans' (GHC.Types.Int)n {VV : (GHC.Types.Int) | (VV > 0)}k [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]points = (GHC.Classes.Eq [[KMeans.WrapType [GHC.Types.Double] a]])fixpoint (n:(GHC.Types.Int) -> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] -> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]refineCluster {VV : (GHC.Types.Int) | (VV = n)}n) {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}] | (VV = initialClustering), (len([VV]) >= 0)}initialClustering 139: where 140: [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}]initialClustering = {VV : (GHC.Types.Int) | (VV > 0)} -> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}]partition {VV : (GHC.Types.Int) | (VV = clusterSize)}clusterSize {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (VV = points), (len([VV]) >= 0)}points 141: (GHC.Types.Int)clusterSize = x:(GHC.Types.Int) -> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = ((x > y) ? x : y))}max {VV : (GHC.Types.Int) | (VV = (1 : int))}1 ((GHC.Types.Int)(xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] -> {VV : (GHC.Types.Int) | (VV = len([xs]))}length {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (VV = points), (len([VV]) >= 0)}points x:(GHC.Types.Int) -> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x + y))}+ {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k x:(GHC.Types.Int) -> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x - y))}- {VV : (GHC.Types.Int) | (VV = (1 : int))}1) x:(GHC.Types.Int) -> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x / y))}`div` {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k) 142: 143: fixpoint :: (Eq a) => (a -> a) -> a -> a 144: (GHC.Classes.Eq a) -> (a -> a) -> a -> afixpoint a -> af ax = if (GHC.Types.Bool)(a -> af {VV : a | (VV = x)}x) x:a -> y:a -> {VV : (GHC.Types.Bool) | ((? Prop([VV])) <=> (x = y))}== {VV : a | (VV = x)}x then {VV : a | (VV = x)}x else (GHC.Classes.Eq a) -> (a -> a) -> a -> afixpoint a -> af (a -> af {VV : a | (VV = x)}x)
That is, kmeans'
creates an initialClustering
by
partition
-ing the points
into chunks with clusterSize
elements.
Then, it invokes fixpoint
to iteratively refine the initial
clustering with refineCluster
until it converges to a stable
clustering that cannot be improved upon. This stable clustering
is returned as the output.
LiquidHaskell verifies that kmeans'
adheres to the given signature in two steps.
1. Initial Clustering
First, LiquidHaskell determines from
159: max :: (Ord a) => x:a -> y:a -> {v: a | (v >= x) && ( v >= y) }
that clusterSize
is strictly positive, and hence, from
163: partition :: size:PosInt -> [a] -> (Clustering a)
which we saw last time, that initialClustering
is indeed
a valid Clustering
of (GenPoint a n)
.
2. Fixpoint
Next, LiquidHaskell infers that at the call fixpoint (refineCluster n)
...
, that the type parameter a
of fixpoint
can be instantiated with
Clustering (GenPoint a n)
. This is because initialClustering
is a
valid clustering, as we saw above, and because refineCluster
takes -- and
returns -- valid n
-dimensional clusterings, as we shall see below.
Consequently, the value returned by kmeans'
is also Clustering
of
GenPoint a n
as required.
Refining A Clustering¶
Thus, the real work in KMeans happens inside refineCluster
, which takes a
clustering and improves it, like so:
186: n:(GHC.Types.Int) -> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] -> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]refineCluster (GHC.Types.Int)n [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]clusters = {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (VV = clusters'), (len([VV]) = len([centeredGroups])), (len([VV]) >= 0)}clusters' 187: where 188: -- 1. Compute cluster centers 189: {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)}] | (len([VV]) = len([clusters]))}centers = ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)} -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}) -> xs:[{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}] -> {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)}] | (len([VV]) = len([xs]))}map (n:(GHC.Types.Int) -> {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)} -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}clusterCenter {VV : (GHC.Types.Int) | (VV = n)}n) {VV : [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] | (VV = clusters), (len([VV]) >= 0)}clusters 190: 191: -- 2. Map points to their nearest center 192: [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)]points = [[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)]] -> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)]concat {VV : [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] | (VV = clusters), (len([VV]) >= 0)}clusters 193: [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))]centeredPoints = (GHC.Classes.Ord ([GHC.Types.Double], KMeans.WrapType [GHC.Types.Double] a))sort {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (len([VV]) = len([points])), (len([VV]) >= 0)}[({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))(n:(GHC.Types.Int) -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a) -> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}] -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}nearestCenter {VV : (GHC.Types.Int) | (VV = n)}n (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)p {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)}] | (VV = centers), (len([VV]) = len([clusters])), (len([VV]) >= 0)}centers, (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)p) | p <- {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (VV = points), (len([VV]) >= 0)}points] 194: 195: -- 3. Group points by nearest center to get new clusters 196: [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]centeredGroups = (({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> (GHC.Types.Bool)) -> [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] -> [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]groupBy ((GHC.Classes.Eq [GHC.Types.Double])(==) ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} -> (GHC.Types.Bool)) -> (({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}) -> ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> (GHC.Types.Bool)`on` ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}fst) {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (VV = centeredPoints), (len([VV]) >= 0)}centeredPoints 197: {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (len([VV]) = len([centeredGroups]))}clusters' = ({VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (len([VV]) > 0)} -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) > 0)}) -> xs:[{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (len([VV]) > 0)}] -> {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (len([VV]) = len([xs]))}map ((({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> xs:[({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) = len([xs]))}map ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)snd) {VV : [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a))] | (len([VV]) > 0)}] | (VV = centeredGroups), (len([VV]) >= 0)}centeredGroups
The behavior of refineCluster
is pithily captured by its type
203: {-@ refineCluster :: n:Int 204: -> Clustering (GenPoint a n) 205: -> Clustering (GenPoint a n) @-}
The refined clustering is computed in three steps.
-
First, we compute the
centers :: [(Point n)]
of the currentclusters
. This is achieved by usingclusterCenter
, which maps a list of generalizedn
-dimensional points to a singlen
dimensional point (i.e.Point n
). -
Next, we pair each point
p
in the list of allpoints
with itsnearestCenter
. -
Finally, the pairs in the list of
centeredPoints
are grouped by the center, i.e. the first element of the tuple. The resulting groups are projected back to the original generalized points yielding the new clustering.
The type of the output follows directly from
222: groupBy :: (a -> a -> Bool) -> [a] -> (Clustering a)
from last time. At the call site above, LiquidHaskell infers that
a
can be instantiated with ((Point n), (GenPoint a n))
thereby establishing
that, after projecting away the first element, the output is a list of
non-empty lists of generalized n
-dimensional points.
That leaves us with the two crucial bits of the algorithm: clusterCenter
and nearestCenter
.
Computing the Center of a Cluster¶
The center of an n
-dimensional cluster is simply an n
-dimensional point
whose value in each dimension is equal to the average value of that
dimension across all the points in the cluster.
For example, consider a cluster of 2-dimensional points,
241: exampleCluster = [ [0, 0] 242: , [1, 10] 243: , [2, 20] 244: , [4, 40] 245: , [5, 50] ]
The center of the cluster is
249: exampleCenter = [ (0 + 1 + 2 + 4 + 5 ) / 5 250: , (0 + 10 + 20 + 40 + 50) / 5 ]
which is just
254: exampleCenter = [ 3, 30 ]
Thus, we can compute a clusterCenter
via the following procedure
260: n:(GHC.Types.Int) -> {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)} -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}clusterCenter (GHC.Types.Int)n {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}xs = ({VV : [(GHC.Types.Double)] | (numPoints = len([VV])), (len([VV]) = numPoints), (len([VV]) = len([xs])), (len([VV]) > 0)} -> (GHC.Types.Double)) -> xs:[{VV : [(GHC.Types.Double)] | (numPoints = len([VV])), (len([VV]) = numPoints), (len([VV]) = len([xs])), (len([VV]) > 0)}] -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}map {VV : [(GHC.Types.Double)] | (numPoints = len([VV])), (len([VV]) = numPoints), (len([VV]) = len([xs])), (len([VV]) > 0)} -> (GHC.Types.Double)average {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = numPoints)}] | (v = xs'), (len([v]) = n), (len([v]) >= 0)}xs' 261: where 262: {VV : (GHC.Types.Int) | (VV = len([xs]))}numPoints = xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] -> {VV : (GHC.Types.Int) | (VV = len([xs]))}length {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (v = xs), (len([v]) > 0), (len([v]) >= 0)}xs 263: {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = numPoints)}] | (len([v]) = n)}xs' = c:(GHC.Types.Int) -> r:{VV : (GHC.Types.Int) | (VV > 0)} -> {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = c)}] | (len([v]) = r)} -> {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = r)}] | (len([v]) = c)}transpose {VV : (GHC.Types.Int) | (VV = n)}n {VV : (GHC.Types.Int) | (VV = numPoints),(VV = len([xs]))}numPoints (((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a) -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}) -> xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] -> {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)}] | (len([VV]) = len([xs]))}map (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a) -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}getVect {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (v = xs), (len([v]) > 0), (len([v]) >= 0)}xs) 264: 265: average :: [Double] -> Double 266: {VV : [(GHC.Types.Double)] | (numPoints = len([VV])), (len([VV]) = numPoints), (len([VV]) = len([xs])), (len([VV]) > 0)} -> (GHC.Types.Double)average = ((GHC.Types.Double) -> {VV : (GHC.Types.Int) | (VV > 0)} -> (GHC.Types.Double)`safeDiv` {VV : (GHC.Types.Int) | (VV = numPoints),(VV = len([xs]))}numPoints) ((GHC.Types.Double) -> (GHC.Types.Double)) -> ({VV : [(GHC.Types.Double)] | (numPoints = len([VV])), (len([VV]) = numPoints), (len([VV]) = len([xs])), (len([VV]) > 0)} -> (GHC.Types.Double)) -> {VV : [(GHC.Types.Double)] | (numPoints = len([VV])), (len([VV]) = numPoints), (len([VV]) = len([xs])), (len([VV]) > 0)} -> (GHC.Types.Double). [(GHC.Types.Double)] -> (GHC.Types.Double)sum
First, we transpose
the matrix of points in the cluster.
Suppose that xs
is the exampleCluster
from above
(and so n
is 2
and numPoints
is 5
.)
In this scenario, xs'
is
274: xs' = [ [0, 1, 2, 4, 5] 275: , [0, 10, 20, 40, 50] ]
and so map average xs'
evaluates to exampleCenter
from above.
We have ensured that the division in the average does not lead to any nasty surprises via a safe division function whose precondition checks that the denominator is non-zero, as illustrated here.
285: {- safeDiv :: (Fractional a) => a -> {v:Int | v != 0} -> a -} 286: safeDiv :: (Fractional a) => a -> Int -> a 287: (GHC.Real.Fractional a) -> a -> {VV : (GHC.Types.Int) | (VV > 0)} -> asafeDiv an 0 = {VV : [(GHC.Types.Char)] | false} -> {VV : a | false}liquidError {VV : [(GHC.Types.Char)] | (len([VV]) >= 0)}"divide by zero" 288: safeDiv n d = {VV : a | (VV = n)}n x:a -> y:{VV : a | (VV != 0)} -> {VV : a | (VV = (x / y))}/ ((GHC.Num.Num a)fromIntegral {VV : (GHC.Types.Int) | (VV > 0)}d)
LiquidHaskell verifies that the divide-by-zero never occurs, and furthermore,
that clusterCenter
indeed computes an n
-dimensional center by inferring that
295: {-@ clusterCenter :: n:Int -> NonEmptyList (GenPoint a n) -> Point n @-}
LiquidHaskell deduces that the input list of points xs
is non-empty
from the fact that clusterCenter
is only invoked on the elements of a
Clustering
which comprise only non-empty lists. Since xs
is non-empty,
i.e. (len xs) > 0
, LiquidHaskell infers that numPoints
is positive
(hover over length
to understand why), and hence, LiquidHaskell is
satisfied that the call to safeDiv
will always proceed without any
incident.
To establish the output type Point n
LiquidHaskell leans on the fact that
307: transpose :: n:Int -> m:PosInt -> Matrix a m n -> Matrix a n m
to deduce that xs'
is of type Matrix Double n numPoints
, that is to
say, a list of length n
containing lists of length numPoints
. Since
map
preserves the length, the value map average xs'
is also a list
of length n
, i.e. Point n
.
Finding the Nearest Center¶
The last piece of the puzzle is nearestCenter
which maps each
(generalized) point to the center that it is nearest. The code is
pretty self-explanatory:
324: nearestCenter :: Int -> WrapType [Double] a -> [[Double]] -> [Double] 325: n:(GHC.Types.Int) -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a) -> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}] -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}nearestCenter (GHC.Types.Int)n (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)x = {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)} -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}minKey ({VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)} -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}) -> ([{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}] -> {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)}) -> [{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}] -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}. ({VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)} -> ({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (GHC.Types.Double))) -> xs:[{VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)}] -> {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) = len([xs]))}map (c:{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)} -> ({VV : [(GHC.Types.Double)] | (VV = c), (n = len([VV])), (len([VV]) = n), (len([VV]) = len([c])), (len([VV]) >= 0)} , (GHC.Types.Double))\{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}c -> x1:a -> b -> (a , b)({VV : [(GHC.Types.Double)] | (VV = c), (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}c, a:{VV : [(GHC.Types.Double)] | (len([VV]) >= 0)} -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) >= 0)} -> (GHC.Types.Double)distance {VV : [(GHC.Types.Double)] | (VV = c), (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)}c ((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) = len([c])), (len([VV]) >= 0)} a) -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) = len([c])), (len([VV]) >= 0)}getVect {VV : (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a) | (VV = x)}x)))
We map
the centers to a tuple of center c
and the distance
between
x
and c
, and then we select the tuple with the smallest distance
332: minKey :: (Ord v) => [(k, v)] -> k 333: (GHC.Classes.Ord a) -> {VV : [(b , a)] | (len([VV]) >= 0)} -> bminKey = (a , b) -> afst ((a , b) -> a) -> ({VV : [(a , b)] | (len([VV]) >= 0)} -> (a , b)) -> {VV : [(a , b)] | (len([VV]) >= 0)} -> a. ((a , b) -> (a , b) -> (GHC.Types.Ordering)) -> [(a , b)] -> (a , b)minimumBy ((a , b) -> (a , b) -> (GHC.Types.Ordering)\(a , b)x (a , b)y -> x:a -> y:a -> {VV : (GHC.Types.Ordering) | ((VV = EQ) <=> (x = y)), ((VV = GT) <=> (x > y)), ((VV = LT) <=> (x < y))}compare ((a , b) -> bsnd {VV : (a , b) | (VV = x)}x) ((a , b) -> bsnd {VV : (a , b) | (VV = y)}y))
The interesting bit is that the distance
function uses zipWith
to
ensure that the dimensionality of the center and the point match up.
340: distance :: [Double] -> [Double] -> Double 341: a:{VV : [(GHC.Types.Double)] | (len([VV]) >= 0)} -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) >= 0)} -> (GHC.Types.Double)distance {VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}a {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) >= 0)}b = (GHC.Types.Double) -> (GHC.Types.Double)sqrt ((GHC.Types.Double) -> (GHC.Types.Double)) -> ({VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) = len([b])), (len([VV]) >= 0)} -> (GHC.Types.Double)) -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) = len([b])), (len([VV]) >= 0)} -> (GHC.Types.Double). [(GHC.Types.Double)] -> (GHC.Types.Double)sum ({VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) = len([b])), (len([VV]) >= 0)} -> (GHC.Types.Double)) -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])), (len([VV]) = len([b])), (len([VV]) >= 0)} -> (GHC.Types.Double)$ ((GHC.Types.Double) -> (GHC.Types.Double) -> (GHC.Types.Double)) -> xs:[(GHC.Types.Double)] -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))} -> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}zipWith ((GHC.Types.Double) -> (GHC.Types.Double) -> (GHC.Types.Double)\(GHC.Types.Double)v1 (GHC.Types.Double)v2 -> {VV : (GHC.Integer.Type.Integer) | (VV = 2)}({VV : (GHC.Types.Double) | (VV = v1)}v1 x:(GHC.Types.Double) -> y:(GHC.Types.Double) -> {VV : (GHC.Types.Double) | (VV = (x - y))}- {VV : (GHC.Types.Double) | (VV = v2)}v2) (GHC.Types.Double) -> (GHC.Integer.Type.Integer) -> (GHC.Types.Double)^ 2) {VV : [(GHC.Types.Double)] | (VV = a),(len([VV]) >= 0)}a {VV : [(GHC.Types.Double)] | (VV = b), (len([VV]) = len([a])), (len([VV]) >= 0)}b
LiquidHaskell verifies distance
by inferring that
347: {-@ nearestCenter :: n:Int -> (GenPoint a n) -> [(Point n)] -> (Point n) @-}
First, LiquidHaskell deduces that each center in cs
is indeed n
-dimensional, which
follows from the output type of clusterCenter
. Since x
is a (GenPoint a n)
LiquidHaskell infers that both c
and getVect x
are of an equal length n
.
Consequently, the call to
355: zipWith :: (a -> b -> c) -> xs:[a] -> (List b (len xs)) -> (List c (len xs))
discussed last time is determined to be safe.
Finally, the value returned is just one of the input centers and so is a (Point n)
.
Putting It All Together: Top-Level API¶
We can bundle the algorithm into two top-level API functions.
First, a version that clusters generalized points. In this case, we
require a function that can project
an a
value to an n
-dimensional
point. This function just wraps each a
, clusters via kmeans'
and then
unwraps the points.
374: {-@ kmeansGen :: n:Int 375: -> (a -> (Point n)) 376: -> k:PosInt 377: -> xs:[a] 378: -> (Clustering a) 379: @-} 380: 381: n:(GHC.Types.Int) -> (a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}) -> {VV : (GHC.Types.Int) | (VV > 0)} -> [a] -> [{VV : [a] | (len([VV]) > 0)}]kmeansGen (GHC.Types.Int)n a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}project {VV : (GHC.Types.Int) | (VV > 0)}k = ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)} -> {VV : [a] | (len([VV]) > 0)}) -> xs:[{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}] -> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) = len([xs]))}map (((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a) -> a) -> xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] -> {VV : [a] | (len([VV]) = len([xs]))}map (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a) -> agetVal) 382: ([{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}] -> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) >= 0)}) -> ([a] -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}]) -> [a] -> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) >= 0)}. n:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV > 0)} -> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] -> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]kmeans' {VV : (GHC.Types.Int) | (VV = n)}n {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k 383: ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) >= 0)} -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}]) -> ([a] -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) >= 0)}) -> [a] -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)} a)] | (len([VV]) > 0)}]. (a -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)) -> xs:[a] -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} a)] | (len([VV]) = len([xs]))}map (x:a -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} {VV : a | (VV = x)})\ax -> {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} -> {VV : a | (VV = x)} -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n), (len([VV]) >= 0)} {VV : a | (VV = x)})WrapType (a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}project {VV : a | (VV = x)}x) {VV : a | (VV = x)}x)
Second, a specialized version that operates directly on n
-dimensional
points. The specialized version just calls the general version with a
trivial id
projection.
391: {-@ kmeans :: n:Int 392: -> k:PosInt 393: -> points:[(Point n)] 394: -> (Clustering (Point n)) 395: @-} 396: 397: n:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV > 0)} -> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}] -> [{v : [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}] | (len([v]) > 0)}]kmeans (GHC.Types.Int)n = n:(GHC.Types.Int) -> ({VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)} -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}) -> {VV : (GHC.Types.Int) | (VV > 0)} -> [{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}] -> [{VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])), (len([VV]) = n)}] | (len([VV]) > 0)}]kmeansGen {VV : (GHC.Types.Int) | (VV = n)}n x:{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)} -> {VV : [(GHC.Types.Double)] | (VV = x), (n = len([VV])), (len([VV]) = n)}id
Conclusions¶
I hope that over the last two posts you have gotten a sense of
-
What KMeans clustering is all about,
-
How measures and refinements can be used to describe the behavior of common list operations like
map
,transpose
,groupBy
,zipWith
, and so on, -
How LiquidHaskell's automated inference makes it easy to write and verify invariants of non-trivial programs.
The sharp reader will have noticed that the one major, non syntactic, change to the
original code is the addition of the dimension parameter n
throughout
the code. This is critically required so that we can specify the relevant
invariants (which are in terms of n
.) The value is actually a ghost, and
never ever used. Fortunately, Haskell's laziness means that we don't have
to worry about it (or other ghost variables) imposing any run-time overhead
at all.
Exercise: Incidentally, if you have followed thus far I would
encourage you to ponder about how you might modify the types (and
implementation) to verify that KMeans indeed produces at most k
clusters...