KMeans Clustering I

Posted by Ranjit Jhala Feb 16, 2013

Tags: basic, measures

Last time we introduced a new specification mechanism called a measure and demonstrated how to use it to encode the length of a list. We saw how measures could be used to verify that functions like head and tail were only called with non-empty lists (whose length was strictly positive). As several folks pointed out, once LiquidHaskell can reason about lengths, it can do a lot more than just analyze non-emptiness.

Indeed!

Over the next two posts, let's see how one might implement a K-Means algorithm that clusters n-dimensional points into groups, and how LiquidHaskell can help us write and enforce various dimensionality invariants along the way.

module KMeansHelper where

import Prelude                          hiding  (zipWith)
import Data.List                                (span)
import Language.Haskell.Liquid.Prelude          (liquidError)

Rather than reinvent the wheel, we will modify an existing implementation of K-Means, available on hackage. This may not be the most efficient implementation, but it's a nice introduction to the algorithm, and the general invariants will hold for more sophisticated implementations.

We have broken this entry into two convenient, bite-sized chunks:

  • Part I Introduces the basic types and list operations needed by KMeans,

  • Part II Describes how the operations are used in the KMeans implementation.

The Game: Clustering Points

The goal of K-Means clustering is the following. Given

  • Input : A set of points, each represented as an n-dimensional point in Euclidean space, return

  • Output : A partitioning of the points into K clusters, in a manner that minimizes the sum of distances between each point and its cluster center, as formalized below.
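
Concretely, with clusters C_1 through C_K, cluster centers mu_1 through mu_K, and d the Euclidean distance, this objective can be written as

$$\sum_{j=1}^{K} \; \sum_{x \in C_j} d(x, \mu_j)$$

(This is just a restatement of the description above; the algorithm itself only approximates this minimum.)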

The Players: Types

Let's make matters concrete by creating types for the different elements of the algorithm.

1. Fixed-Length Lists We will represent n-dimensional points using good old Haskell lists, refined with a predicate that describes the dimensionality (i.e., length). To simplify matters, let's package this into a type alias that denotes lists of a given length N.

{-@ type List a N = {v : [a] | (len v) = N} @-}

2. Points Next, we can represent an N-dimensional point as a list of Double values of length N,

{-@ type Point N = List Double N @-}

3. Clusters A cluster is a non-empty list of points,

{-@ type NonEmptyList a = {v : [a] | (len v) > 0} @-}

4. Clustering And finally, a clustering is a list of (non-empty) clusters.

{-@ type Clustering a  = [(NonEmptyList a)] @-}

Notation: When defining refinement type aliases, we use uppercase variables like N to distinguish value parameters from the lowercase type parameters like a.

Aside: By the way, if you are familiar with the index-style length encoding, e.g. as found in DML or Agda, then it's worth noting that despite appearances, our List and Point definitions are not indexed. We're just using the indices to define abbreviations for the refinement predicates, and we have deliberately chosen the predicates to facilitate SMT-based checking and inference.
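
To see the aliases in action, here is a tiny example (a hypothetical value, not from the post's code): a 3-dimensional point is just a list of Doubles whose length is exactly 3.

-- hypothetical example: the origin in 3-dimensional space
{-@ origin3 :: Point 3 @-}
origin3 :: [Double]
origin3 = [0.0, 0.0, 0.0]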

Basic Operations on Points and Clusters

Ok, with the types firmly in hand, let us go forth and develop the KMeans clustering implementation. We will use a variety of small helper functions (of the kind found in Data.List). Let's get started by looking at them through our newly refined eyes.

Grouping

The first such function is groupBy. We can refine its type so that instead of just producing a [[a]], we know that it produces a Clustering a, which is a list of non-empty lists.

{-@ groupBy       :: (a -> a -> Bool) -> [a] -> (Clustering a) @-}
groupBy _  []     = []
groupBy eq (x:xs) = (x:ys) : groupBy eq zs
  where (ys, zs)  = span (eq x) xs

Intuitively, it's pretty easy to see how LiquidHaskell verifies the refined specification:

  • Each element of the output list is of the form x:ys
  • For any list ys the length is non-negative, i.e. (len ys) >= 0
  • The len of x:ys is 1 + (len ys), that is, strictly positive.
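
For instance, here is a small, hypothetical use of groupBy (not from the post's code): grouping equal adjacent elements. Since every group produced is non-empty, the result is a valid Clustering Int.

-- hypothetical example: groups of equal adjacent elements
{-@ eqGroups :: Clustering Int @-}
eqGroups :: [[Int]]
eqGroups = groupBy (==) [1, 1, 2, 3, 3, 3]   -- [[1,1], [2], [3,3,3]]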

Partitioning

Next, let's look at the function

{-@ partition :: size:PosInt -> [a] -> (Clustering a) @-}
{-@ type PosInt = {v: Int | v > 0 } @-}

which is given a strictly positive integer argument size and a list of a values, and returns a Clustering a, that is, a list of non-empty lists. (Each inner list has a length no larger than size, but we shall elide this for simplicity.)

The function is implemented in a straightforward manner, using the library functions take and drop

partition size []       = []
partition size ys@(_:_) = zs : partition size zs'
  where
    zs                  = take size ys
    zs'                 = drop size ys

To verify that a valid Clustering is produced, LiquidHaskell needs only verify that the list zs above is non-empty, by suitably connecting the properties of the inputs size and ys with the output.

We have verified elsewhere that
take :: n:{v: Int | v >= 0 }
     -> xs:[a]
     -> {v:[a] | (len v) = (if ((len xs) < n) then (len xs) else n) }

In other words, the output list's length is the smaller of the input list's length and n. Thus, since both size and (len ys) are strictly positive, LiquidHaskell deduces that the list returned by take size ys has a strictly positive length, i.e., is non-empty.
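
For intuition, here is a small, hypothetical use of partition (not from the post's code): chopping a list into chunks of two. The final chunk may be shorter than size, but every chunk is non-empty, so the result is a valid Clustering Int.

-- hypothetical example: chunks of size two
{-@ chunks :: Clustering Int @-}
chunks :: [[Int]]
chunks = partition 2 [1, 2, 3, 4, 5]   -- [[1,2], [3,4], [5]]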

Zipping

To compute the Euclidean distance between two points, we will use the zipWith function. We must make sure that it is invoked on points with the same number of dimensions, so we write

{-@ zipWith :: (a -> b -> c)
            -> xs:[a] -> (List b (len xs)) -> (List c (len xs)) @-}
zipWith f (a:as) (b:bs) = f a b : zipWith f as bs
zipWith _ [] []         = []
zipWith _ _  _          = liquidError "Dead Code"  -- unreachable: the lengths must match

The type stipulates that the second input list and the output have the same length as the first input. Furthermore, it rules out the case where one list is empty and the other is not, as in that case the former’s length is zero while the latter’s is not.
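
To see why the refined type is handy, here is a hedged sketch of a distance helper (a hypothetical function, not the post's actual implementation): the specification forces both points to have the same dimension, so the zipWith call inside is guaranteed to line up coordinate by coordinate.

-- hypothetical sketch: squared Euclidean distance between two points
-- of the same dimension
{-@ sqDist :: xs:[Double] -> (Point (len xs)) -> Double @-}
sqDist :: [Double] -> [Double] -> Double
sqDist xs ys = sum (zipWith (\x y -> (x - y) * (x - y)) xs ys)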

Transposing

The last basic operation that we will require is a means to transpose a Matrix, which itself is just a list of lists:

{-@ type Matrix a Rows Cols = (List (List a Cols) Rows) @-}

The transpose operation flips the rows and columns. I confess that I can never really understand matrices without concrete examples, and even then, barely.

So, let's say we have a matrix

m1  :: Matrix Int 4 2
m1  =  [ [1, 2]
       , [3, 4]
       , [5, 6]
       , [7, 8] ]

then the matrix m2 = transpose 2 4 m1 should be

m2 :: Matrix Int 2 4
m2  =  [ [1, 3, 5, 7]
       , [2, 4, 6, 8] ]

We will use a Matrix a m n to represent a single cluster of m points each of which has n dimensions. We will transpose the matrix to make it easy to sum and average the points along each dimension, in order to compute the center of the cluster.
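
As a rough, hypothetical sketch of how this transposition gets used (the actual clustering code is the subject of Part II), one could compute the center of a cluster of m points, each of dimension n, by transposing to an n-by-m matrix and averaging each row:

-- hypothetical sketch: the center of a cluster of m points of dimension n
centerOf :: Int -> Int -> [[Double]] -> [Double]
centerOf n m points = map average (transpose n m points)
  where
    average xs = sum xs / fromIntegral m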

As you can work out from the above, the code for transpose is quite straightforward: each output row is simply the list of heads of the input rows:

transpose       :: Int -> Int -> [[a]] -> [[a]]

transpose 0 _ _ = []

transpose c r ((col00 : col01s) : row1s)
  = row0' : row1s'
    where
      row0'  = col00  : [ col0  | (col0 : _)  <- row1s ]
      rest   = col01s : [ col1s | (_ : col1s) <- row1s ]
      row1s' = transpose (c - 1) r rest

LiquidHaskell verifies that

{-@ transpose :: c:Int -> r:PosInt -> Matrix a r c -> Matrix a c r @-}

Try to work it out for yourself with pencil and paper.

If you like, you can get a hint by seeing how LiquidHaskell figures it out. Let's work backwards.

LiquidHaskell verifies the output type by inferring that

row0'        :: (List a r)
row1s'       :: List (List a r) (c - 1)   -- i.e. Matrix a (c - 1) r

and so, by simply using the measure-refined type for (:)

(:)          :: x:a -> xs:[a] -> { v : [a] | (len v) = 1 + (len xs) }

LiquidHaskell deduces that

row0' : row1s' :: List (List a r) c

That is,

row0' : row1s' :: Matrix a c r

Excellent! Now, continuing backwards: how does it infer the types of row0' and row1s'?

The first case is easy: row0' is just the list of heads of each row, hence a List a r.

Now, let's look at row1s'. Notice that row1s is the matrix of all except the first row of the input Matrix, and so

row1s        :: Matrix a (r-1) c

and, as

col01s       :: List a (c-1)
col1s        :: List a (c-1)

LiquidHaskell deduces that since rest is col01s consed onto the r-1 tails drawn from row1s

rest         = col01s : [ col1s | (_ : col1s) <- row1s ]

the type of rest is

rest         :: List (List a (c - 1)) r

which is just

rest         :: Matrix a r (c-1)

Now, LiquidHaskell deduces row1s' :: Matrix a (c-1) r by inductively plugging in the output type of the recursive call, thereby checking the function’s signature.

Whew! That was a fair bit of work, wasn’t it!

Happily, we didn’t have to do any of it. Instead, using the SMT solver, LiquidHaskell ploughs through calculations like that and guarantees to us that transpose indeed flips the dimensions of the inner and outer lists.

Aside: Comprehensions vs. Map Incidentally, the code above is essentially that of transpose from Data.List, with some extra local variables for exposition. You could instead use map head and map tail, and I encourage you to go ahead and see for yourself.
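
For the curious, here is one possible map-based variant (a sketch to try for yourself, as suggested above; whether LiquidHaskell accepts this exact formulation is left as that exercise):

-- hypothetical map-based variant of transpose
transpose' :: Int -> Int -> [[a]] -> [[a]]
transpose' 0 _ _    = []
transpose' c r rows = map head rows : transpose' (c - 1) r (map tail rows)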

Intermission

Time for a break – go see a cat video! – or skip it, stretch your legs, and return post-haste for the next installment, in which we will use the types and functions described above, to develop the clustering algorithm.