KMeans Clustering II

Posted by Ranjit Jhala Feb 17, 2013

Tags: basic, measures

The story so far: Previously we saw

  • how to encode n-dimensional points using plain old Haskell lists,
  • how to encode a matrix with r rows and c columns as a list of lists,
  • some basic operations on points and matrices via list-manipulating functions

More importantly, we saw how easy it was to encode dimensionality with refinements over the len measure, thereby allowing LiquidHaskell to precisely track the dimensions across the various operations.

Next, lets use the basic types and operations to develop the actual KMeans clustering algorithm, and, along the way, see how LiquidHaskell lets us write precise module contracts which will ward against various unpleasant lumpenexceptions.

30: {-# LANGUAGE ScopedTypeVariables, TypeSynonymInstances, FlexibleInstances #-}
31: 
32: module KMeans (kmeans, kmeansGen) where
33: 
34: import KMeansHelper
35: import Prelude              hiding      (zipWith)
36: import Data.List                        (sort, span, minimumBy)
37: import Data.Function                    (on)
38: import Data.Ord                         (comparing)
39: import Language.Haskell.Liquid.Prelude  (liquidAssert, liquidError)
40: 
41: instance (GHC.Classes.Eq (KMeans.WrapType [GHC.Types.Double] a))Eq (WrapType [Double] a) where
42:    (==) = x:{VV : [{VV : (GHC.Types.Double) | false}] | false}
-> y:{VV : [{VV : (GHC.Types.Double) | false}] | false}
-> {VV : (GHC.Types.Bool) | ((? Prop([VV])) <=> (x = y))}(==) ({VV : [{VV : (GHC.Types.Double) | false}] | false}
 -> {VV : [{VV : (GHC.Types.Double) | false}] | false}
 -> {VV : (GHC.Types.Bool) | false})
-> ({VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false}
    -> {VV : [{VV : (GHC.Types.Double) | false}] | false})
-> {VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false}
-> {VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false}
-> {VV : (GHC.Types.Bool) | false}`on` (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false})
-> {VV : [{VV : (GHC.Types.Double) | false}] | false}getVect
43: 
44: instance (GHC.Classes.Ord (KMeans.WrapType [GHC.Types.Double] a))Ord (WrapType [Double] a) where
45:     compare = (GHC.Classes.Ord [GHC.Types.Double])comparing (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false})
-> {VV : [{VV : (GHC.Types.Double) | false}] | false}getVect

Recall that we are using a modified version of an existing KMeans implementation. While not the swiftest implementation, it serves as a nice introduction to the algorithm, and the general invariants carry over to more sophisticated implementations.

A Quick Recap

Before embarking on the journey, lets remind ourselves of our destination: the goal of K-Means clustering is

  • Take as Input : A set of points represented by n-dimensional points in Euclidian space

  • Return as Ouptut : A partitioning of the points, into upto K clusters, in a manner that minimizes the sum of distances between each point and its cluster center.

Last time, we introduced a variety of refinement type aliases for Haskell lists

Fixed Length Lists
66: type List a N = {v : [a] | (len v) = N}
Non-empty Lists
70: type NonEmptyList a = {v : [a] | (len v) > 0}
N-Dimensional Points
74: type Point N = List Double N
Matrices
78: type Matrix a Rows Cols = List (List a Cols) Rows
We also saw several basic list operations
82: groupBy   :: (a -> a -> Bool) -> [a] -> (Clustering a)
83: partition :: PosInt -> [a] -> (Clustering a)
84: zipWith   :: (a -> b -> c) -> xs:[a] -> (List b (len xs)) -> (List c (len xs))
85: transpose :: c:Int -> r:PosInt -> Matrix a r c -> Matrix a c r

whose types will prove essential in order to verify the invariants of the clustering algorithm. You might open the previous episode in a separate tab to keep those functions handy, but fear not, we will refresh our memory about them when we get around to using them below.

Generalized Points

To be more flexible, we will support arbitrary points as long as they can be projected to Euclidian space. In addition to supporting, say, an image or a cat video as a point, this will allow us to weight different dimensions to different degrees.

We represent generalized points with a record

104: data WrapType b a = WrapType {(KMeans.WrapType a b) -> agetVect :: b, (KMeans.WrapType a b) -> bgetVal :: a}

and we can define an alias that captures the dimensionality of the point

110: {-@ type GenPoint a N  = WrapType (Point N) a @-}

That is, GenPoint a N denotes a general a value that has an N-dimensional projection into Euclidean space.

Algorithm: Iterative Clustering

Terrific, now that all the pieces are in place lets look at the KMeans algorithm. We have implemented a function kmeans', which takes as input a dimension n, the maximum number of clusters k (which must be positive), a list of generalized points of dimension n, and returns a Clustering (i.e. a list of non-empty lists) of the generalized points.

So much verbiage – a type is worth a thousand comments!

128: {-@ kmeans' :: n:Int
129:             -> k:PosInt
130:             -> points:[(GenPoint a n)]
131:             -> (Clustering (GenPoint a n)) @-}

There! Crisp and to the point. Sorry. Anyhoo, the function implements the above type.

138: n:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]kmeans' (GHC.Types.Int)n {VV : (GHC.Types.Int) | (VV > 0)}k [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]points    = (GHC.Classes.Eq [[KMeans.WrapType [GHC.Types.Double] a]])fixpoint (n:(GHC.Types.Int)
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]refineCluster {VV : (GHC.Types.Int) | (VV = n)}n) {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n)} a)] | (len([VV]) > 0)}] | (VV = initialClustering),
                                                                                                       (len([VV]) >= 0)}initialClustering
139:   where
140:     [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n)} a)] | (len([VV]) > 0)}]initialClustering = {VV : (GHC.Types.Int) | (VV > 0)}
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]
-> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                         (len([VV]) = n)} a)] | (len([VV]) > 0)}]partition {VV : (GHC.Types.Int) | (VV = clusterSize)}clusterSize {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (VV = points),
                                                                            (len([VV]) >= 0)}points
141:     (GHC.Types.Int)clusterSize       = x:(GHC.Types.Int)
-> y:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV = ((x > y) ? x : y))}max {VV : (GHC.Types.Int) | (VV = (1  :  int))}1 ((GHC.Types.Int)(xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]
-> {VV : (GHC.Types.Int) | (VV = len([xs]))}length {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (VV = points),
                                                                            (len([VV]) >= 0)}points x:(GHC.Types.Int)
-> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x + y))}+ {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k x:(GHC.Types.Int)
-> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x - y))}- {VV : (GHC.Types.Int) | (VV = (1  :  int))}1) x:(GHC.Types.Int)
-> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x / y))}`div` {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k)
142: 
143:     fixpoint          :: (Eq a) => (a -> a) -> a -> a
144:     (GHC.Classes.Eq a) -> (a -> a) -> a -> afixpoint a -> af ax      = if (GHC.Types.Bool)(a -> af {VV : a | (VV = x)}x) x:a
-> y:a -> {VV : (GHC.Types.Bool) | ((? Prop([VV])) <=> (x = y))}== {VV : a | (VV = x)}x then {VV : a | (VV = x)}x else (GHC.Classes.Eq a) -> (a -> a) -> a -> afixpoint a -> af (a -> af {VV : a | (VV = x)}x)

That is, kmeans' creates an initialClustering by partition-ing the points into chunks with clusterSize elements. Then, it invokes fixpoint to iteratively refine the initial clustering with refineCluster until it converges to a stable clustering that cannot be improved upon. This stable clustering is returned as the output.

LiquidHaskell verifies that kmeans' adheres to the given signature in two steps.

1. Initial Clustering

First, LiquidHaskell determines from
159: max       :: (Ord a) => x:a -> y:a -> {v: a | (v >= x) && ( v >= y) }
that clusterSize is strictly positive, and hence, from
163: partition :: size:PosInt -> [a] -> (Clustering a)

which we saw last time, that initialClustering is indeed a valid Clustering of (GenPoint a n).

2. Fixpoint

Next, LiquidHaskell infers that at the call fixpoint (refineCluster n) ..., that the type parameter a of fixpoint can be instantiated with Clustering (GenPoint a n). This is because initialClustering is a valid clustering, as we saw above, and because refineCluster takes – and returns – valid n-dimensional clusterings, as we shall see below. Consequently, the value returned by kmeans' is also Clustering of GenPoint a n as required.

Refining A Clustering

Thus, the real work in KMeans happens inside refineCluster, which takes a clustering and improves it, like so:

186: n:(GHC.Types.Int)
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]refineCluster (GHC.Types.Int)n [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]clusters = {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n),
                                                            (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (VV = clusters'),
                                                                                                        (len([VV]) = len([centeredGroups])),
                                                                                                        (len([VV]) >= 0)}clusters'
187:   where
188:     -- 1. Compute cluster centers
189:     {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n)}] | (len([VV]) = len([clusters]))}centers        = ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n)} a)] | (len([VV]) > 0)}
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)})
-> xs:[{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n)} a)] | (len([VV]) > 0)}]
-> {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                       (len([VV]) = n)}] | (len([VV]) = len([xs]))}map (n:(GHC.Types.Int)
-> {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}clusterCenter {VV : (GHC.Types.Int) | (VV = n)}n) {VV : [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] | (VV = clusters),
                                                                                                     (len([VV]) >= 0)}clusters
190: 
191:     -- 2. Map points to their nearest center
192:     [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                               (len([VV]) = n)} a)]points         = [[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                (len([VV]) = n)} a)]]
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]concat {VV : [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] | (VV = clusters),
                                                                                                     (len([VV]) >= 0)}clusters
193:     [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                               (len([VV]) = n),
                               (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                 (len([VV]) = n),
                                                                                                 (len([VV]) >= 0)} a))]centeredPoints = (GHC.Classes.Ord
  ([GHC.Types.Double], KMeans.WrapType [GHC.Types.Double] a))sort {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                       (len([VV]) = n),
                                                                                                       (len([VV]) >= 0)} a))] | (len([VV]) = len([points])),
                                                                                                                                (len([VV]) >= 0)}[({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                (len([VV]) = n),
                                                                                                (len([VV]) >= 0)} a))(n:(GHC.Types.Int)
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)
-> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}nearestCenter {VV : (GHC.Types.Int) | (VV = n)}n (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)p {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n)}] | (VV = centers),
                                                        (len([VV]) = len([clusters])),
                                                        (len([VV]) >= 0)}centers, (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)p) | p <- {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                     (len([VV]) = n)} a)] | (VV = points),
                                                                            (len([VV]) >= 0)}points]
194: 
195:     -- 3. Group points by nearest center to get new clusters
196:     [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                      (len([VV]) = n),
                                      (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                        (len([VV]) = n),
                                                                                                        (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]centeredGroups = (({VV : [(GHC.Types.Double)] | (n = len([VV])),
                               (len([VV]) = n),
                               (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                 (len([VV]) = n),
                                                                                                 (len([VV]) >= 0)} a))
 -> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                    (len([VV]) = n),
                                                                                                    (len([VV]) >= 0)} a))
 -> (GHC.Types.Bool))
-> [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                    (len([VV]) = n),
                                                                                                    (len([VV]) >= 0)} a))]
-> [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                         (len([VV]) = n),
                                         (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                           (len([VV]) = n),
                                                                                                           (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]groupBy ((GHC.Classes.Eq [GHC.Types.Double])(==) ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)}
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)}
 -> (GHC.Types.Bool))
-> (({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                    (len([VV]) = n),
                                                                                                    (len([VV]) >= 0)} a))
    -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n),
                                    (len([VV]) >= 0)})
-> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                   (len([VV]) = n),
                                                                                                   (len([VV]) >= 0)} a))
-> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                   (len([VV]) = n),
                                                                                                   (len([VV]) >= 0)} a))
-> (GHC.Types.Bool)`on` ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                (len([VV]) = n),
                                                                                                (len([VV]) >= 0)} a))
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) >= 0)}fst) {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                       (len([VV]) = n),
                                                                                                       (len([VV]) >= 0)} a))] | (VV = centeredPoints),
                                                                                                                                (len([VV]) >= 0)}centeredPoints
197:     {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n),
                                                            (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (len([VV]) = len([centeredGroups]))}clusters'      = ({VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                      (len([VV]) = n),
                                      (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                        (len([VV]) = n),
                                                                                                        (len([VV]) >= 0)} a))] | (len([VV]) > 0)}
 -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                         (len([VV]) = n),
                                                         (len([VV]) >= 0)} a)] | (len([VV]) > 0)})
-> xs:[{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                            (len([VV]) = n),
                                            (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                              (len([VV]) = n),
                                                                                                              (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]
-> {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                               (len([VV]) = n),
                                                               (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (len([VV]) = len([xs]))}map ((({VV : [(GHC.Types.Double)] | (n = len([VV])),
                               (len([VV]) = n),
                               (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                 (len([VV]) = n),
                                                                                                 (len([VV]) >= 0)} a))
 -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n),
                                                  (len([VV]) >= 0)} a))
-> xs:[({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                       (len([VV]) = n),
                                                                                                       (len([VV]) >= 0)} a))]
-> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                        (len([VV]) = n),
                                                        (len([VV]) >= 0)} a)] | (len([VV]) = len([xs]))}map ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                (len([VV]) = n),
                                                                                                (len([VV]) >= 0)} a))
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                 (len([VV]) = n),
                                                 (len([VV]) >= 0)} a)snd) {VV : [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                            (len([VV]) = n),
                                            (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                              (len([VV]) = n),
                                                                                                              (len([VV]) >= 0)} a))] | (len([VV]) > 0)}] | (VV = centeredGroups),
                                                                                                                                                           (len([VV]) >= 0)}centeredGroups

The behavior of refineCluster is pithily captured by its type

203: {-@ refineCluster :: n:Int
204:                   -> Clustering (GenPoint a n)
205:                   -> Clustering (GenPoint a n) @-}

The refined clustering is computed in three steps.

  1. First, we compute the centers :: [(Point n)] of the current clusters. This is achieved by using clusterCenter, which maps a list of generalized n-dimensional points to a single n dimensional point (i.e. Point n).

  2. Next, we pair each point p in the list of all points with its nearestCenter.

  3. Finally, the pairs in the list of centeredPoints are grouped by the center, i.e. the first element of the tuple. The resulting groups are projected back to the original generalized points yielding the new clustering.

The type of the output follows directly from
222: groupBy :: (a -> a -> Bool) -> [a] -> (Clustering a)

from last time. At the call site above, LiquidHaskell infers that a can be instantiated with ((Point n), (GenPoint a n)) thereby establishing that, after projecting away the first element, the output is a list of non-empty lists of generalized n-dimensional points.

That leaves us with the two crucial bits of the algorithm: clusterCenter and nearestCenter.

Computing the Center of a Cluster

The center of an n-dimensional cluster is simply an n-dimensional point whose value in each dimension is equal to the average value of that dimension across all the points in the cluster.

For example, consider a cluster of 2-dimensional points,
241: exampleCluster = [ [0,  0]
242:                  , [1, 10]
243:                  , [2, 20]
244:                  , [4, 40]
245:                  , [5, 50] ]
The center of the cluster is
249: exampleCenter = [ (0 + 1  + 2  + 4  + 5 ) / 5
250:                 , (0 + 10 + 20 + 40 + 50) / 5 ]
which is just
254: exampleCenter = [ 3, 30 ]

Thus, we can compute a clusterCenter via the following procedure

260: n:(GHC.Types.Int)
-> {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}clusterCenter (GHC.Types.Int)n {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}xs = ({VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                              (len([VV]) = numPoints),
                              (len([VV]) = len([xs])),
                              (len([VV]) > 0)}
 -> (GHC.Types.Double))
-> xs:[{VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                                    (len([VV]) = numPoints),
                                    (len([VV]) = len([xs])),
                                    (len([VV]) > 0)}]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}map {VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                             (len([VV]) = numPoints),
                             (len([VV]) = len([xs])),
                             (len([VV]) > 0)}
-> (GHC.Types.Double)average {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = numPoints)}] | (v = xs'),
                                                               (len([v]) = n),
                                                               (len([v]) >= 0)}xs'
261:   where
262:     {VV : (GHC.Types.Int) | (VV = len([xs]))}numPoints      = xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]
-> {VV : (GHC.Types.Int) | (VV = len([xs]))}length {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (v = xs),
                                                                           (len([v]) > 0),
                                                                           (len([v]) >= 0)}xs
263:     {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = numPoints)}] | (len([v]) = n)}xs'            = c:(GHC.Types.Int)
-> r:{VV : (GHC.Types.Int) | (VV > 0)}
-> {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = c)}] | (len([v]) = r)}
-> {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = r)}] | (len([v]) = c)}transpose {VV : (GHC.Types.Int) | (VV = n)}n {VV : (GHC.Types.Int) | (VV = numPoints),(VV = len([xs]))}numPoints (((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                               (len([VV]) = n)} a)
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)})
-> xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                     (len([VV]) = n)} a)]
-> {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                       (len([VV]) = n)}] | (len([VV]) = len([xs]))}map (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}getVect {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (v = xs),
                                                                           (len([v]) > 0),
                                                                           (len([v]) >= 0)}xs)
264: 
265:     average        :: [Double] -> Double
266:     {VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                             (len([VV]) = numPoints),
                             (len([VV]) = len([xs])),
                             (len([VV]) > 0)}
-> (GHC.Types.Double)average        = ((GHC.Types.Double)
-> {VV : (GHC.Types.Int) | (VV > 0)} -> (GHC.Types.Double)`safeDiv` {VV : (GHC.Types.Int) | (VV = numPoints),(VV = len([xs]))}numPoints) ((GHC.Types.Double) -> (GHC.Types.Double))
-> ({VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                                 (len([VV]) = numPoints),
                                 (len([VV]) = len([xs])),
                                 (len([VV]) > 0)}
    -> (GHC.Types.Double))
-> {VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                                (len([VV]) = numPoints),
                                (len([VV]) = len([xs])),
                                (len([VV]) > 0)}
-> (GHC.Types.Double). [(GHC.Types.Double)] -> (GHC.Types.Double)sum

First, we transpose the matrix of points in the cluster. Suppose that xs is the exampleCluster from above (and so n is 2 and numPoints is 5.)

In this scenario, xs' is
274: xs' = [ [0,  1,  2,  4,  5]
275:       , [0, 10, 20, 40, 50] ]

and so map average xs' evaluates to exampleCenter from above.

We have ensured that the division in the average does not lead to any nasty surprises via a safe division function whose precondition checks that the denominator is non-zero, as illustrated here.

285: {- safeDiv   :: (Fractional a) => a -> {v:Int | v != 0} -> a -}
286: safeDiv     :: (Fractional a) => a -> Int -> a
287: (GHC.Real.Fractional a)
-> a -> {VV : (GHC.Types.Int) | (VV > 0)} -> asafeDiv an 0 = {VV : [(GHC.Types.Char)] | false} -> {VV : a | false}liquidError {VV : [(GHC.Types.Char)] | (len([VV]) >= 0)}"divide by zero"
288: safeDiv n d = {VV : a | (VV = n)}n x:a -> y:{VV : a | (VV != 0)} -> {VV : a | (VV = (x / y))}/ ((GHC.Num.Num a)fromIntegral {VV : (GHC.Types.Int) | (VV > 0)}d)

LiquidHaskell verifies that the divide-by-zero never occurs, and furthermore, that clusterCenter indeed computes an n-dimensional center by inferring that

295: {-@ clusterCenter :: n:Int -> NonEmptyList (GenPoint a n) -> Point n @-}

LiquidHaskell deduces that the input list of points xs is non-empty from the fact that clusterCenter is only invoked on the elements of a Clustering which comprise only non-empty lists. Since xs is non-empty, i.e. (len xs) > 0, LiquidHaskell infers that numPoints is positive (hover over length to understand why), and hence, LiquidHaskell is satisfied that the call to safeDiv will always proceed without any incident.

To establish the output type Point n LiquidHaskell leans on the fact that
307: transpose :: n:Int -> m:PosInt -> Matrix a m n -> Matrix a n m

to deduce that xs' is of type Matrix Double n numPoints, that is to say, a list of length n containing lists of length numPoints. Since map preserves the length, the value map average xs' is also a list of length n, i.e. Point n.

Finding the Nearest Center

The last piece of the puzzle is nearestCenter which maps each (generalized) point to the center that it is nearest. The code is pretty self-explanatory:

324: nearestCenter     :: Int -> WrapType [Double] a -> [[Double]] -> [Double]
325: n:(GHC.Types.Int)
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)
-> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}nearestCenter (GHC.Types.Int)n (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)x = {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)}
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) >= 0)}minKey ({VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                      (len([VV]) = n),
                                      (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)}
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)})
-> ([{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}]
    -> {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                            (len([VV]) = n),
                                            (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)})
-> [{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}]
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) >= 0)}. ({VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
 -> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (GHC.Types.Double)))
-> xs:[{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n)}]
-> {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                        (len([VV]) = n),
                                        (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) = len([xs]))}map (c:{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
-> ({VV : [(GHC.Types.Double)] | (VV = c),
                                 (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) = len([c])),
                                 (len([VV]) >= 0)} , (GHC.Types.Double))\{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}c -> x1:a -> b -> (a , b)({VV : [(GHC.Types.Double)] | (VV = c),
                             (n = len([VV])),
                             (len([VV]) = n),
                             (len([VV]) >= 0)}c, a:{VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double)distance {VV : [(GHC.Types.Double)] | (VV = c),
                             (n = len([VV])),
                             (len([VV]) = n),
                             (len([VV]) >= 0)}c ((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n),
                                              (len([VV]) = len([c])),
                                              (len([VV]) >= 0)} a)
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) = len([c])),
                                (len([VV]) >= 0)}getVect {VV : (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a) | (VV = x)}x)))

We map the centers to a tuple of center c and the distance between x and c, and then we select the tuple with the smallest distance

332: minKey  :: (Ord v) => [(k, v)] -> k
333: (GHC.Classes.Ord a) -> {VV : [(b , a)] | (len([VV]) >= 0)} -> bminKey  = (a , b) -> afst ((a , b) -> a)
-> ({VV : [(a , b)] | (len([VV]) >= 0)} -> (a , b))
-> {VV : [(a , b)] | (len([VV]) >= 0)}
-> a. ((a , b) -> (a , b) -> (GHC.Types.Ordering))
-> [(a , b)] -> (a , b)minimumBy ((a , b) -> (a , b) -> (GHC.Types.Ordering)\(a , b)x (a , b)y -> x:a
-> y:a
-> {VV : (GHC.Types.Ordering) | ((VV = EQ) <=> (x = y)),
                                ((VV = GT) <=> (x > y)),
                                ((VV = LT) <=> (x < y))}compare ((a , b) -> bsnd {VV : (a , b) | (VV = x)}x) ((a , b) -> bsnd {VV : (a , b) | (VV = y)}y))

The interesting bit is that the distance function uses zipWith to ensure that the dimensionality of the center and the point match up.

340: distance     :: [Double] -> [Double] -> Double
341: a:{VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double)distance {VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}a {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                             (len([VV]) >= 0)}b = (GHC.Types.Double) -> (GHC.Types.Double)sqrt ((GHC.Types.Double) -> (GHC.Types.Double))
-> ({VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                 (len([VV]) = len([b])),
                                 (len([VV]) >= 0)}
    -> (GHC.Types.Double))
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) = len([b])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double). [(GHC.Types.Double)] -> (GHC.Types.Double)sum ({VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                              (len([VV]) = len([b])),
                              (len([VV]) >= 0)}
 -> (GHC.Types.Double))
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) = len([b])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double)$ ((GHC.Types.Double) -> (GHC.Types.Double) -> (GHC.Types.Double))
-> xs:[(GHC.Types.Double)]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}zipWith ((GHC.Types.Double) -> (GHC.Types.Double) -> (GHC.Types.Double)\(GHC.Types.Double)v1 (GHC.Types.Double)v2 -> {VV : (GHC.Integer.Type.Integer) | (VV = 2)}({VV : (GHC.Types.Double) | (VV = v1)}v1 x:(GHC.Types.Double)
-> y:(GHC.Types.Double)
-> {VV : (GHC.Types.Double) | (VV = (x - y))}- {VV : (GHC.Types.Double) | (VV = v2)}v2) (GHC.Types.Double)
-> (GHC.Integer.Type.Integer) -> (GHC.Types.Double)^ 2) {VV : [(GHC.Types.Double)] | (VV = a),(len([VV]) >= 0)}a {VV : [(GHC.Types.Double)] | (VV = b),
                             (len([VV]) = len([a])),
                             (len([VV]) >= 0)}b

LiquidHaskell verifies distance by inferring that

347: {-@ nearestCenter :: n:Int -> (GenPoint a n) -> [(Point n)] -> (Point n) @-}

First, LiquidHaskell deduces that each center in cs is indeed n-dimensional, which follows from the output type of clusterCenter. Since x is a (GenPoint a n) LiquidHaskell infers that both c and getVect x are of an equal length n.

Consequently, the call to
355: zipWith :: (a -> b -> c) -> xs:[a] -> (List b (len xs)) -> (List c (len xs))

discussed last time is determined to be safe.

Finally, the value returned is just one of the input centers and so is a (Point n).

Putting It All Together: Top-Level API

We can bundle the algorithm into two top-level API functions.

First, a version that clusters generalized points. In this case, we require a function that can project an a value to an n-dimensional point. This function just wraps each a, clusters via kmeans' and then unwraps the points.

374: {-@ kmeansGen :: n:Int
375:               -> (a -> (Point n))
376:               -> k:PosInt
377:               -> xs:[a]
378:               -> (Clustering a)
379:   @-}
380: 
381: n:(GHC.Types.Int)
-> (a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)})
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [a]
-> [{VV : [a] | (len([VV]) > 0)}]kmeansGen (GHC.Types.Int)n a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}project {VV : (GHC.Types.Int) | (VV > 0)}k = ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n)} a)] | (len([VV]) > 0)}
 -> {VV : [a] | (len([VV]) > 0)})
-> xs:[{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n)} a)] | (len([VV]) > 0)}]
-> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) = len([xs]))}map (((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                               (len([VV]) = n)} a)
 -> a)
-> xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                     (len([VV]) = n)} a)]
-> {VV : [a] | (len([VV]) = len([xs]))}map (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)
-> agetVal)
382:                       ([{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                       (len([VV]) = n)} a)] | (len([VV]) > 0)}]
 -> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) >= 0)})
-> ([a]
    -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                             (len([VV]) = n)} a)] | (len([VV]) > 0)}])
-> [a]
-> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) >= 0)}. n:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]kmeans' {VV : (GHC.Types.Int) | (VV = n)}n {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k
383:                       ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n),
                                                      (len([VV]) >= 0)} a)] | (len([VV]) >= 0)}
 -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                          (len([VV]) = n)} a)] | (len([VV]) > 0)}])
-> ([a]
    -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n),
                                                            (len([VV]) >= 0)} a)] | (len([VV]) >= 0)})
-> [a]
-> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                         (len([VV]) = n)} a)] | (len([VV]) > 0)}]. (a
 -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n),
                                                  (len([VV]) >= 0)} a))
-> xs:[a]
-> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                        (len([VV]) = n),
                                                        (len([VV]) >= 0)} a)] | (len([VV]) = len([xs]))}map (x:a
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                 (len([VV]) = n),
                                                 (len([VV]) >= 0)} {VV : a | (VV = x)})\ax -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                             (len([VV]) = n),
                             (len([VV]) >= 0)}
-> {VV : a | (VV = x)}
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                 (len([VV]) = n),
                                                 (len([VV]) >= 0)} {VV : a | (VV = x)})WrapType (a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}project {VV : a | (VV = x)}x) {VV : a | (VV = x)}x)

Second, a specialized version that operates directly on n-dimensional points. The specialized version just calls the general version with a trivial id projection.

391: {-@ kmeans :: n:Int
392:            -> k:PosInt
393:            -> points:[(Point n)]
394:            -> (Clustering (Point n))
395:   @-}
396: 
397: n:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}]
-> [{v : [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}] | (len([v]) > 0)}]kmeans (GHC.Types.Int)n   = n:(GHC.Types.Int)
-> ({VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
    -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)})
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}]
-> [{VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                        (len([VV]) = n)}] | (len([VV]) > 0)}]kmeansGen {VV : (GHC.Types.Int) | (VV = n)}n x:{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
-> {VV : [(GHC.Types.Double)] | (VV = x),
                                (n = len([VV])),
                                (len([VV]) = n)}id

Conclusions

I hope that over the last two posts you have gotten a sense of

  1. What KMeans clustering is all about,

  2. How measures and refinements can be used to describe the behavior of common list operations like map, transpose, groupBy, zipWith, and so on,

  3. How LiquidHaskell’s automated inference makes it easy to write and verify invariants of non-trivial programs.

The sharp reader will have noticed that the one major, non syntactic, change to the original code is the addition of the dimension parameter n throughout the code. This is critically required so that we can specify the relevant invariants (which are in terms of n.) The value is actually a ghost, and never ever used. Fortunately, Haskell’s laziness means that we don’t have to worry about it (or other ghost variables) imposing any run-time overhead at all.

Exercise: Incidentally, if you have followed thus far I would encourage you to ponder about how you might modify the types (and implementation) to verify that KMeans indeed produces at most k clusters…