Annoy can efficiently index and search arbitrary float-valued vectors for approximate nearest neighbors, and it tends to work best at around 100 dimensions or fewer. This simple example demonstrates how the library can be used to find similar words based on GloVe vectors.
Usually with k-means, it is hard to work with vectors of more than 5 or so dimensions. And unlike MinHash LSH, this method works with float values instead of being limited to bit vectors.
Prepare the GloVe vectors
from annoy import AnnoyIndex
import numpy as np
Download the GloVe vectors. I’m running this in an IPython shell.
!wget http://nlp.stanford.edu/data/glove.6B.zip && unzip glove.6B.zip
Load the GloVe vectors.
VECTOR_DIM = 50
word_vectors = {}
# Each line holds a word followed by its VECTOR_DIM float components.
with open(f'glove.6B.{VECTOR_DIM}d.txt', encoding='utf-8') as fp:
    for line in fp:
        word, *values = line.split()
        word_vectors[word] = np.array(values, dtype=float)
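To make the file format concrete, here is a minimal sketch of parsing a single line, assuming each line is a word followed by space-separated floats. The helper name `parse_glove_line` and the sample line are made up for illustration and are not part of GloVe or Annoy.

```python
def parse_glove_line(line):
    """Split one GloVe-style text line into (word, list of floats).

    Hypothetical helper for illustration only.
    """
    word, *values = line.split()
    return word, [float(v) for v in values]


# A made-up three-dimensional line; real GloVe lines have 50 to 300 values.
word, vector = parse_glove_line("cat 0.5 -1.25 3.0\n")
print(word, vector)
```

Plain Python lists are enough here; the loading loop above converts to NumPy arrays instead.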
Set aside some words for testing. These words will not be seen by the Annoy index.
test_vectors = {}
for word in ('cat', 'toaster', 'nintendo'):
    # pop() removes the word from the indexed set and returns its vector.
    test_vectors[word] = word_vectors.pop(word)
Create the index
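words, vectors = zip(*word_vectors.items())
# 'angular' was the default metric; recent Annoy versions expect it to be passed explicitly.
index = AnnoyIndex(VECTOR_DIM, 'angular')
for idx, vector in enumerate(vectors):
    index.add_item(idx, vector)
index.build(30)  # 30 trees: more trees improve accuracy but enlarge the index
index.save('test.ann')
# Reload from disk to show that a saved index can be reused.
index = AnnoyIndex(VECTOR_DIM, 'angular')
index.load('test.ann')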
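The index above uses Annoy's angular metric. As I understand the Annoy README, this is the Euclidean distance between normalized vectors, which works out to sqrt(2(1 - cos(u, v))). The sketch below computes that quantity in plain Python. The function name `angular_distance` is my own, and this is an illustration of the formula, not Annoy's internal code.

```python
import math


def angular_distance(u, v):
    """sqrt(2 * (1 - cosine similarity)) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


print(angular_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(angular_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The distance ranges from 0 for vectors pointing the same way to 2 for vectors pointing in opposite directions, so vector length plays no role, only direction.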
Results
I create the function print_nearest to find the most similar words based on the learned vectors.
def print_nearest(word):
    for idx in index.get_nns_by_vector(test_vectors[word], 10):
        print(words[idx])
We can find the most similar words to “cat”.
print_nearest('cat')
dog
rabbit
monkey
rat
cats
snake
pet
mouse
bite
shark
Next, we can find the most similar words to “toaster”.
print_nearest('toaster')
bakes
self-replicating
fondue
souffle
ssaa
13-inch
pomerelle
espresso
blenders
burners
How about “nintendo”?
print_nearest('nintendo')
playstation
xbox
gamecube
sega
wii
consoles
ds
ps3
dreamcast
ps2
We can also see that the Annoy index is fast! I’m again running this in an IPython shell, using the magic function %timeit.
%timeit index.get_nns_by_vector(test_vectors['cat'], 10);
12.1 µs ± 289 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
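Because Annoy returns approximate neighbors, it can be useful to compare its answers against an exact brute-force scan. The sketch below is plain Python on toy vectors. The names `exact_nearest` and `recall` are hypothetical helpers of my own, and the toy data is invented for illustration. A scan like this touches every vector on each query, which is the cost the index avoids.

```python
import heapq
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def exact_nearest(query, vectors, k):
    """Indices of the k vectors most similar to query, best first."""
    return heapq.nlargest(
        k, range(len(vectors)), key=lambda i: cosine_similarity(query, vectors[i])
    )


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result also found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data standing in for the GloVe vectors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
exact = exact_nearest([1.0, 0.0], toy_vectors, 2)
print(exact)                  # [0, 1]
print(recall([0, 2], exact))  # 0.5
```

With the real data, the list returned by index.get_nns_by_vector would take the place of the approximate ids, and the exact ids would come from scanning the same vectors that were added to the index.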