Extract Keywords from text snippets


TF-IDF, which stands for term frequency–inverse document frequency, is a statistical method that determines the relative importance of a word in a snippet in the context of a list of snippets. We can use it to find keywords.

TF-IDF gives low scores to words that appear in most snippets (e.g. stopwords like "a", "and", "the") and to words that occur only rarely in the given snippet. It gives high scores to words that occur often in the given snippet but are relatively rare across the other snippets.
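As a rough illustration of this scoring, here is a minimal hand-rolled sketch on a made-up three-snippet corpus. It uses the textbook formula tf * log(N / df). The helper names (tf, idf, tfidf) and the toy snippets are assumptions for illustration only, and scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ.

```python
import math

# Toy corpus, made up for illustration (not taken from the IMDB data)
snippets = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [s.split() for s in snippets]

def tf(word, tokens):
    # Term frequency: share of the snippet's tokens that are this word
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    # Inverse document frequency: log of (number of snippets / snippets containing the word)
    df = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / df)

def tfidf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

the_score = tfidf("the", tokenized[0], tokenized)  # "the" is in every snippet
mat_score = tfidf("mat", tokenized[0], tokenized)  # "mat" is only in the first snippet
print(the_score, mat_score)
```

Because "the" appears in all three snippets, its idf is log(3/3) = 0 and its score vanishes, while "mat" keeps a positive score even though it occurs only once.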

TF-IDF keyword extraction is useful for:

Automatic tag generation for social media posts
Finding websites similar to a given website (document similarity)
Bag-of-words text classification models
...and many more!

We will use the IMDB movie dataset that you can download from Kaggle.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('IMDB-Movie-Data.csv')[['Title', 'Description']]
df.sample(5)

	Title	Description
642	The Ridiculous 6	An outlaw who was raised by Native Americans d...
733	The Da Vinci Code	A murder inside the Louvre and clues in Da Vin...
951	The Descendants	A land baron tries to reconnect with his two d...
938	The Siege of Jadotville	Irish Commandant Pat Quinlan leads a stand off...
639	American Reunion	Jim, Michelle, Stifler, and their friends reun...

Snippet vectorization (preprocessing + tokenization)

Raw descriptions aren't very useful as they are. We need to process them into a usable format, i.e. clean and tokenize them. Luckily, scikit-learn takes care of this for us: when we call vectorizer.fit_transform() on the raw descriptions, it cleans and tokenizes the text under the hood before computing the TF-IDF scores.

vectorizer = TfidfVectorizer()
snippet_scores = vectorizer.fit_transform(df.Description)

# Note we have 1000 snippets with 5897 unique words (or tokens)
snippet_scores.shape

(1000, 5897)
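To get a feel for what the cleaning and tokenizing step does, one option is to call the vectorizer's analyzer on a single string. This is a small sketch, assuming the default TfidfVectorizer settings; the example sentence is made up, not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the preprocessing + tokenizing function the vectorizer uses
analyzer = TfidfVectorizer().build_analyzer()

# Made-up sentence for illustration
tokens = analyzer("A Land-Baron tries to reconnect, with his 2 daughters!")
print(tokens)
```

With the defaults, the text is lowercased, punctuation is dropped, and single-character tokens such as "a" and "2" are discarded. Stopwords like "to" and "with" are kept unless a stop word list is configured.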

Keyword extraction

Features in the cell below are the words (or tokens) found in the text. We call vectorizer.get_feature_names_out() (the older get_feature_names() was removed in scikit-learn 1.2) and save the result into a string array, which we will use later to look up a keyword from its index.

features = np.array(vectorizer.get_feature_names_out(), dtype=str)
np.random.choice(features, 5)

array(['cleanses', 'sky', 'inflatable', 'except', 'pandemic'], dtype='<U17')

Next we do a bit of array manipulation to extract the keywords: for each snippet, we sort the word indices by score and keep the words with the n highest scores.

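def top_n_keywords(scores, features, n=5):
    # scores is one sparse row (1 x vocabulary size); order its indices by descending TF-IDF score
    tfidf_sorting = np.argsort(scores.toarray()).flatten()[::-1]
    # Map the n highest-scoring indices back to their words
    return features[tfidf_sorting][:n]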
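The sorting step may be easier to follow on a tiny dense array. The words and scores below are made up for illustration.

```python
import numpy as np

# Made-up vocabulary and one row of made-up scores
toy_features = np.array(["cat", "dog", "fish", "owl"])
toy_scores = np.array([[0.1, 0.7, 0.0, 0.4]])

# argsort gives indices in ascending score order; reversing makes it descending
order = np.argsort(toy_scores).flatten()[::-1]
top2 = toy_features[order][:2]
print(top2)
```

Note that zero scores are sorted along with everything else, so a snippet with fewer than n distinct words would have its top n padded with words it does not contain.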

df['Top 5 Keywords'] = [top_n_keywords(scores, features) for scores in snippet_scores]
df.sample(5)

	Title	Description	Top 5 Keywords
16	Hacksaw Ridge	WWII American Army Medic Desmond T. Doss, who ...	[american, desmond, medic, medal, doss]
765	Lavender	After losing her memory, a woman begins to see...	[her, suggests, psychiatrist, unexplained, visit]
744	We Are Your Friends	Caught between a forbidden romance and the exp...	[cole, dj, fame, carter, fortune]
607	Horrible Bosses	Three friends conspire to murder their awful b...	[awful, they, bosses, conspire, standing]
982	Across the Universe	The music of the Beatles and the Vietnam War f...	[beatles, upper, liverpudlian, backdrop, poor]