Extract Keywords from text snippets

Generated by DALL·E 2.

TF-IDF, which stands for term frequency–inverse document frequency, is a statistical method that determines the relative importance of a word in a snippet in the context of a list of snippets. We can use it to find keywords.

TF-IDF gives low scores to words that appear in most snippets (e.g. stopwords like "a", "and", "the") and to words that rarely occur in the given snippet. It gives high scores to words that occur often in the given snippet but are relatively rare across the other snippets.
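
To make the scoring concrete, here is a minimal sketch that computes a textbook form of TF-IDF by hand, assuming the score is (count / snippet length) * log(N / document frequency). The three snippets are made up for illustration. scikit-learn's TfidfVectorizer smooths the IDF term and L2-normalizes each row by default, so its numbers will differ from these, but the ranking behaves the same way.

```python
import math
from collections import Counter

# Three toy snippets (made up for illustration, not taken from the IMDB data).
snippets = [
    "the heist crew plans the heist",
    "the detective hunts the killer",
    "the crew hunts the treasure",
]
tokenized = [s.split() for s in snippets]
n_snippets = len(tokenized)

# Document frequency: in how many snippets does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / snippet length) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_snippets / doc_freq[word])
        for word, count in counts.items()
    }

# Score the first snippet and print its words from highest to lowest score.
scores = tfidf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:.3f}")
```

In the first snippet, "the" appears in every snippet and therefore scores exactly 0, while "heist" occurs twice and appears nowhere else, so it ranks first.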

TF-IDF keyword extraction is useful for:

  • Automatic tag generation of a social media post
  • Finding websites similar to a given website (document similarity)
  • Bag-of-word text classification models
  • ...and many more!

We will use the IMDB movie dataset that you can download from Kaggle.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv('IMDB-Movie-Data.csv')[['Title', 'Description']]
df.sample(5)
     Title                    Description
642  The Ridiculous 6         An outlaw who was raised by Native Americans d...
733  The Da Vinci Code        A murder inside the Louvre and clues in Da Vin...
951  The Descendants          A land baron tries to reconnect with his two d...
938  The Siege of Jadotville  Irish Commandant Pat Quinlan leads a stand off...
639  American Reunion         Jim, Michelle, Stifler, and their friends reun...

Snippet vectorization (preprocessing + tokenization)

Raw descriptions aren't directly usable. We need to process them into a usable format, i.e. clean and tokenize them. Luckily, scikit-learn takes care of this. When we call vectorizer.fit_transform() on our raw descriptions, it cleans and tokenizes the text under the hood before computing the scores.

vectorizer = TfidfVectorizer()
snippet_scores = vectorizer.fit_transform(df.Description)

# Note we have 1000 snippets with 5897 unique words (or tokens)
snippet_scores.shape
(1000, 5897)
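
To get a feel for what the cleaning and tokenization step does, the sketch below imitates the vectorizer's default behaviour with the standard library only: lowercase the text, then keep runs of two or more word characters (the default token_pattern documented for scikit-learn's text vectorizers). The function name simple_analyzer and the sample sentence are made up for illustration; this is an approximation, not the library's actual code path.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_analyzer(text):
    """Lowercase the text, then extract tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Made-up sentence: punctuation and single-character tokens get dropped.
tokens = simple_analyzer("A DJ's rise to fame, in 3 acts!")
print(tokens)
```

Note that stopwords such as "to" and "in" survive this step. By default they are not removed; TF-IDF only down-weights them.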

Keyword extraction

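Features in the cell below are the words (or tokens) found in the text. We call vectorizer.get_feature_names_out() (it replaces get_feature_names(), which was removed in scikit-learn 1.2) and save the result into an array, which we use later to look up a keyword by its index.

# astype(str) converts the returned object array into a fixed-width string array
features = vectorizer.get_feature_names_out().astype(str)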
np.random.choice(features, 5)
array(['cleanses', 'sky', 'inflatable', 'except', 'pandemic'], dtype='<U17')

Next we do a bit of array manipulation to extract the keywords.

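def top_n_keywords(scores, features, n=5):
    # Indices of this snippet's scores, ordered from highest to lowest
    tfidf_sorting = np.argsort(scores.toarray()).flatten()[::-1]
    # Map the top n indices back to their words
    return features[tfidf_sorting][:n]

# Iterating over the sparse matrix yields one row (one snippet) at a time
df['Top 5 Keywords'] = [top_n_keywords(scores, features) for scores in snippet_scores]
df.sample(5)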
     Title                Description                                         Top 5 Keywords
16   Hacksaw Ridge        WWII American Army Medic Desmond T. Doss, who ...   [american, desmond, medic, medal, doss]
765  Lavender             After losing her memory, a woman begins to see...   [her, suggests, psychiatrist, unexplained, visit]
744  We Are Your Friends  Caught between a forbidden romance and the exp...   [cole, dj, fame, carter, fortune]
607  Horrible Bosses      Three friends conspire to murder their awful b...   [awful, they, bosses, conspire, standing]
982  Across the Universe  The music of the Beatles and the Vietnam War f...   [beatles, upper, liverpudlian, backdrop, poor]
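
The index sorting inside top_n_keywords is easier to follow on a toy row. The sketch below uses a made-up five-word vocabulary and made-up scores (not values from the dataset) to show how np.argsort, a reversal and a slice pick out the highest-scoring words.

```python
import numpy as np

# Toy vocabulary and one row of scores (made-up values for illustration).
features = np.array(["bosses", "conspire", "friends", "murder", "the"])
row = np.array([[0.61, 0.48, 0.35, 0.52, 0.09]])  # shape (1, 5), like one dense row

# argsort returns indices ordered from lowest to highest score;
# reversing them puts the highest score first.
order = np.argsort(row).flatten()[::-1]

# Index the vocabulary with the sorted positions and keep the first three.
top_3 = features[order][:3]
print(top_3)
```

One consequence of this approach: if a snippet contains fewer than n distinct tokens, the remaining slots are filled with words whose score is zero, so very short snippets can yield keywords that never appear in them.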