TF-IDF, which stands for term frequency–inverse document frequency, is a statistical method that determines the relative importance of a word in a snippet in the context of a list of snippets. We can use it to find keywords.
TF-IDF gives low scores to words that appear in most snippets (e.g. stopwords like "a", "and", "the") and to words that rarely occur in the given snippet. It gives high scores to words that are rare across the other snippets but occur often in the given snippet.
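To make the scoring concrete, here is a minimal sketch of the textbook formula, score = tf × log(N / df), on a made-up three-snippet corpus. The corpus and the `tf_idf` helper are illustrative assumptions, not part of the dataset or of scikit-learn. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ, but the ranking intuition is the same.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three snippets, already lowercased and tokenized.
snippets = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

def tf_idf(snippets):
    """Return one {word: score} dict per snippet using tf * log(N / df)."""
    n = len(snippets)
    # Document frequency: the number of snippets each word appears in.
    df = Counter(word for tokens in snippets for word in set(tokens))
    scores = []
    for tokens in snippets:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n / df[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(snippets)
# Print the words of the first snippet from highest to lowest score.
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}  {score:.3f}")
```

In this sketch "the" appears in every snippet, so its IDF is log(3/3) = 0 and its score is zero, while "mat", which appears in only one snippet, outranks "cat", which appears in two.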
TF-IDF keyword extraction is useful for:
- Automatic tag generation of a social media post
- Finding websites similar to a given website (document similarity)
- Bag-of-words text classification models
- ...and many more!
We will use the IMDB movie dataset that you can download from Kaggle.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv('IMDB-Movie-Data.csv')[['Title', 'Description']]
df.sample(5)
| Title | Description |
642 | The Ridiculous 6 | An outlaw who was raised by Native Americans d... |
733 | The Da Vinci Code | A murder inside the Louvre and clues in Da Vin... |
951 | The Descendants | A land baron tries to reconnect with his two d... |
938 | The Siege of Jadotville | Irish Commandant Pat Quinlan leads a stand off... |
639 | American Reunion | Jim, Michelle, Stifler, and their friends reun... |
Snippet vectorization (preprocessing + tokenization)
Raw descriptions aren't directly usable. We need to process them into a usable format, i.e. clean and tokenize them. Luckily, scikit-learn takes care of this for us: when we call vectorizer.fit_transform() on our raw descriptions, it cleans and tokenizes the text under the hood.
vectorizer = TfidfVectorizer()
snippet_scores = vectorizer.fit_transform(df.Description)
# Note we have 1000 snippets with 5897 unique words (or tokens)
snippet_scores.shape
(1000, 5897)
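To see what the cleaning and tokenization step produces, one option is to call the vectorizer's analyzer directly. The sketch below assumes default settings and uses a separate, hypothetical `demo_vectorizer` and a made-up sentence so the fitted `vectorizer` above is left untouched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A separate, unfitted vectorizer used only for this demonstration.
demo_vectorizer = TfidfVectorizer()

# build_analyzer() returns the callable that performs preprocessing
# (lowercasing by default) followed by tokenization.
analyze = demo_vectorizer.build_analyzer()

tokens = analyze("A murder inside the Louvre!")
print(tokens)  # ['murder', 'inside', 'the', 'louvre']
```

With the defaults, text is lowercased, punctuation is dropped, and tokens shorter than two characters (here "A") are discarded. Stopwords such as "the" are kept unless a `stop_words` option is set.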
Keyword extraction
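"Features" in the cell below are the words (or tokens) found in the text. We call vectorizer.get_feature_names() and save the result into an array, which we will use later to look up a keyword from its index. (In scikit-learn 1.0 and later the equivalent method is get_feature_names_out(); get_feature_names() was removed in version 1.2.)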
features = np.array(vectorizer.get_feature_names())
np.random.choice(features, 5)
array(['cleanses', 'sky', 'inflatable', 'except', 'pandemic'], dtype='<U17')
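Next, we sort each snippet's scores and keep the tokens with the highest values as its keywords.
def top_n_keywords(scores, features, n=5):
    # scores is one sparse row; sort its token indices by descending TF-IDF score
    tfidf_sorting = np.argsort(scores.toarray()).flatten()[::-1]
    # Map the n highest-scoring indices back to their tokens
    return features[tfidf_sorting][:n]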
df['Top 5 Keywords'] = [top_n_keywords(scores, features) for scores in snippet_scores]
df.sample(5)
| Title | Description | Top 5 Keywords |
16 | Hacksaw Ridge | WWII American Army Medic Desmond T. Doss, who ... | [american, desmond, medic, medal, doss] |
765 | Lavender | After losing her memory, a woman begins to see... | [her, suggests, psychiatrist, unexplained, visit] |
744 | We Are Your Friends | Caught between a forbidden romance and the exp... | [cole, dj, fame, carter, fortune] |
607 | Horrible Bosses | Three friends conspire to murder their awful b... | [awful, they, bosses, conspire, standing] |
982 | Across the Universe | The music of the Beatles and the Vietnam War f... | [beatles, upper, liverpudlian, backdrop, poor] |