Multi-Label Genre Classification: Part I
Unleash the power of data preprocessing for accurate multi-label genre classification in NLP. From data wrangling to feature engineering, discover the essential steps to optimize your models.
Picture abhi baaki hai mere dost (The movie isn't over yet, my friend)
In this post, we'll walk through the basic data preprocessing needed for multi-label genre classification.
Table of Contents
- Import Libraries
- Load Metadata
- Clean Metadata
- Convert to Appropriate Form
- Load Plot Summaries
- Merge the Data
- Write to a CSV File
- Conclusion
1. Import Libraries
Babylonians invented the wheel. I'd much rather just use it
In this section, we import all the libraries needed for data preprocessing.
import json
import tarfile
import pandas as pd
import re
import string
import os
from tqdm import tqdm
from pattern.text.en import singularize
import numpy as np
import tempfile
import sys
import subprocess
from datetime import datetime
from packaging import version
%matplotlib inline
2. Load Metadata
In this section, we will load the metadata that contains information on movie genres.
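PATH = "<YOUR_FILE_PATH>"  # replace with the directory that holds MovieSummaries.tar.gz
def extract_file():
    """
    Extracts all contents of the tar file
    Args: None
    Returns: None
    """
    # the context manager closes the archive once extraction finishes
    with tarfile.open(os.path.join(PATH, "MovieSummaries.tar.gz"), mode="r|gz") as archive:
        archive.extractall(path = PATH)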
metadata_df = pd.read_csv(os.path.join(PATH, 'movie.metadata.tsv'), sep = '\t', header = None)
metadata_df.head()
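The cleaning code in the next section calls `json.loads` on column 8, so each cell of that column is a JSON object whose values are genre names. A minimal sketch of that parsing step, using a made-up cell value (the keys below are placeholders, not real identifiers from the dataset):

```python
import json

# Hypothetical cell from the genre column: a JSON object mapping an ID to a genre name.
sample_cell = '{"/m/id1": "Thriller", "/m/id2": "Science Fiction"}'

# Only the values (the genre names) are needed for classification.
genre_names = list(json.loads(sample_cell).values())
print(genre_names)
```

The keys are discarded because only the human-readable genre names are used as labels.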
3. Clean Metadata
In this section, we will clean the metadata, specifically the genres it contains, to make it more suitable for our application. We'll merge related genres and drop rare ones so that roughly 35 distinct genres remain.
def create_list_genres(x):
    """
    Filters and cleans the genres present in the genre column
    Args: x, the raw JSON string from the genre column
    Returns: Set of cleaned genre tokens
    """
    # drop filler words such as "film" and "movie" before parsing the JSON string
    regex = re.compile(r'((film)s?)|((movie)s?)|((cinema)s?)|((piece)s?)|((period)s?)', re.IGNORECASE)
    x = regex.sub('', x)
    temp_res = list(json.loads(x).values())
    # translation table that maps every punctuation character to a space
    trans_table = str.maketrans({elem: ' ' for elem in string.punctuation})
    res = []
    for elem in temp_res:
        elem = elem.lower()
        # black-and-white movies turned to a single category grayscale
        # (done before punctuation is stripped, because the phrase contains hyphens)
        regex_phrase = re.compile(r'black-and-white', re.IGNORECASE)
        elem = regex_phrase.sub('grayscale', elem)
        ##########################################################################################################
        elem = elem.translate(trans_table)
        ##########################################################################################################
        # combine chinese, indian, japanese movies into one category
        regex_region = re.compile(r'(chinese)|(japanese)|(bollywood)|(filipino)|(tollywood)|(bengali)', re.IGNORECASE)
        elem = regex_region.sub('asian', elem)
        # noir movies are basically crime, drama movies that are generally grayscale
        regex_phrase = re.compile(r'noir', re.IGNORECASE)
        elem = regex_phrase.sub('crime drama', elem)
        # anime and animated cartoons are basically animations
        regex_translate = re.compile(r'(anime)|(animat\S+)', re.IGNORECASE)
        elem = regex_translate.sub('animation', elem)
        # docudrama is a portmanteau of documentary and drama
        regex_translate = re.compile(r'docudrama', re.IGNORECASE)
        elem = regex_translate.sub('documentary drama', elem)
        # capture different versions of representing musical movies into a single genre
        regex_translate = re.compile(r'(musical)|(rock(\S+)?)|(concert)|(operetta)|(hip hop)', re.IGNORECASE)
        elem = regex_translate.sub('music', elem)
        # experimental movies are basically art films (ahead of their time)
        regex_translate = re.compile(r'experimental', re.IGNORECASE)
        elem = regex_translate.sub('art', elem)
        # spy movies involve elements of thrill, crime and action
        regex_translate = re.compile(r'spy', re.IGNORECASE)
        elem = regex_translate.sub('crime thriller action', elem)
        # mockumentary is a portmanteau of mock and documentary
        regex_translate = re.compile(r'mockumentary', re.IGNORECASE)
        elem = regex_translate.sub('comedy documentary', elem)
        # zombie and slasher movies are a subset of horror movies
        regex_translate = re.compile(r'(zombie)|(slasher)', re.IGNORECASE)
        elem = regex_translate.sub('horror', elem)
        # combine the different ways to reference adult movies to a single category
        regex_adult = re.compile(r'(adult)|(sex(\S+)?)|(erotic(\S+)?)|(porn(\S+)?)', re.IGNORECASE)
        elem = regex_adult.sub('porn', elem)
        # swashbuckler movies combine adventure and action scenes
        regex_translate = re.compile(r'swashbuckler', re.IGNORECASE)
        elem = regex_translate.sub('adventure action', elem)
        # science fiction is collapsed into a single token
        regex_translate = re.compile(r'science\s+fiction', re.IGNORECASE)
        elem = regex_translate.sub('scifi', elem)
        # parody and satire are a subset of comedy movies
        regex_translate = re.compile(r'(parody)|(satire)', re.IGNORECASE)
        elem = regex_translate.sub('comedy', elem)
        # suspense and whodunit are synonymous with mystery movies
        regex_translate = re.compile(r'(suspense)|(whodunit)', re.IGNORECASE)
        elem = regex_translate.sub('mystery', elem)
        # the various ways of representing romance movies are condensed into a single category
        regex_similar_genres = re.compile(r'roman\S+')
        elem = regex_similar_genres.sub('romance', elem)
        # political movies are collapsed into a single category
        regex_similar_genres = re.compile(r'politic\S+')
        elem = regex_similar_genres.sub('political', elem)
        # biographies are cast into a single category
        regex_similar_genres = re.compile(r'biograph\S+')
        elem = regex_similar_genres.sub('biographical', elem)
        # psychological movies turned into a single category
        regex_similar_genres = re.compile(r'psycholo\S+')
        elem = regex_similar_genres.sub('psychological', elem)
        ###########################################################################################################
        # a multi-word genre contributes one token per word
        distinct_genres = elem.split()
        res.extend(distinct_genres)
    final_res = [singularize(genre) for genre in res]
    final_res = set(final_res)
    return final_res
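To see how the cleaning steps interact, here is a much smaller sketch of the same idea: lowercase, rewrite a hyphenated phrase, turn punctuation into spaces, merge synonyms, and split into tokens. The rule list and the `normalize_genre` name are illustrative assumptions; the sketch skips the filler-word removal and the `singularize` step of the full function.

```python
import re
import string

# Map every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

# Hypothetical, reduced set of merge rules (the full function has many more).
MERGE_RULES = [
    (re.compile(r"roman\S+"), "romance"),
    (re.compile(r"(zombie)|(slasher)"), "horror"),
]

def normalize_genre(name):
    """Return the cleaned tokens for one raw genre name."""
    name = name.lower()
    # Hyphenated phrases are rewritten first, before hyphens become spaces.
    name = re.sub(r"black-and-white", "grayscale", name)
    name = name.translate(PUNCT_TO_SPACE)
    for pattern, replacement in MERGE_RULES:
        name = pattern.sub(replacement, name)
    return name.split()

tokens = set()
for raw in ["Romantic comedy", "Black-and-white", "Zombie", "Action/Adventure"]:
    tokens.update(normalize_genre(raw))
print(sorted(tokens))
```

Note that a multi-word genre such as "Romantic comedy" yields two separate labels, which is what lets compound genres collapse into a small shared vocabulary.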
def get_frequent_genres(df, num_of_genres):
    """
    Get most frequently occurring genres
    Args:
    1) Metadata Dataframe
    2) Number of genres expected
    Returns:
    1) List of the most frequent genres
    """
    # side effect: stores the cleaned genres in column 9 of the dataframe
    df[9] = df[8].apply(create_list_genres)
    temp = df[9].explode()
    temp = pd.crosstab(temp.index, temp)
    genre_list = list(temp.sum(axis = 0).sort_values(ascending = False).head(num_of_genres).index)
    return genre_list
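The explode-then-crosstab pattern used to count genres can be tried on a toy frame. This is a sketch on invented data; it uses lists rather than sets so the row order is deterministic.

```python
import pandas as pd

# Toy stand-in for the cleaned genre column (invented values).
toy = pd.DataFrame({"genres": [["drama", "comedy"], ["drama"], ["drama", "horror"], ["comedy"]]})

# One row per (movie, genre) pair; the original row index is repeated.
exploded = toy["genres"].explode()

# Rows are movies, columns are genres, cells are 0/1 membership counts.
indicator = pd.crosstab(exploded.index, exploded)

# Column sums give how many movies carry each genre.
counts = indicator.sum(axis=0).sort_values(ascending=False)
top_two = list(counts.head(2).index)
print(top_two)
```

For counting alone, `exploded.value_counts()` would give the same totals; the crosstab form is handy because the same table doubles as a label matrix later on.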
def retain_frequent_genres(movie_genres, genre_list):
    """
    Retains only those genres that are present in the genre list
    Args:
    1) Movie genres filtered and cleaned
    2) List of permissible genres
    Returns:
    1) Filtered list of genres that are present in the genre list
    """
    res = []
    for genre in movie_genres:
        if genre in genre_list:
            res.append(genre)
    if len(res) == 0:
        return None
    return res
genre_list = get_frequent_genres(metadata_df, 35)
# 'of' and 's' are leftover word fragments rather than genres; drop each one if present
genre_list = [genre for genre in genre_list if genre not in ('of', 's')]
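The filtering helper returns `None` when a movie has no frequent genre left, which is what allows such rows to be dropped with `notnull()` in the next section. A compact sketch of the same behavior, with a hypothetical `keep_allowed` helper and an invented allow-list:

```python
def keep_allowed(movie_genres, allowed):
    """Return the genres that appear in `allowed`, or None if none do."""
    kept = [genre for genre in movie_genres if genre in allowed]
    return kept or None

# Hypothetical stand-in for genre_list.
allowed = ["drama", "comedy"]

print(keep_allowed(["drama", "western"], allowed))
print(keep_allowed(["western"], allowed))
```

With only a few dozen permitted genres a list lookup is fast enough; converting the allow-list to a set would be an easy change if it grew.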
4. Convert to Appropriate Form
The metadata dataframe is converted to a one-hot encoded form suitable for our model, with one binary column per retained genre.
metadata_df[10] = metadata_df[9].apply(lambda x: retain_frequent_genres(x, genre_list))
metadata_df = metadata_df[metadata_df[10].notnull()]
create_label_columns = metadata_df[10].explode()
create_label_columns = pd.crosstab(create_label_columns.index, create_label_columns)
create_label_columns
movie_genre_df = pd.DataFrame()
movie_genre_df['Movie_ID'] = metadata_df[0]
movie_genre_df['Movie_Name'] = metadata_df[2]
movie_genre_df
movie_genres_df = pd.concat([movie_genre_df, create_label_columns], axis = 1)
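Because rows were filtered earlier, the dataframe index is no longer contiguous, and `pd.concat(..., axis = 1)` relies on index alignment to pair each movie with its label row. A small sketch on invented data illustrates this:

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind after filtering rows.
toy = pd.DataFrame(
    {"Movie_ID": [11, 22, 33], "genres": [["drama", "comedy"], ["horror"], ["drama"]]},
    index=[0, 2, 5],
)

# Build the binary label table: one column per genre, indexed like the source rows.
exploded = toy["genres"].explode()
labels = pd.crosstab(exploded.index, exploded)

# Concatenating along columns aligns on the shared index values 0, 2 and 5.
combined = pd.concat([toy[["Movie_ID"]], labels], axis=1)
print(combined)
```

Each movie keeps its own labels because both frames share the same index values, not because they happen to be in the same order.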
5. Load Plot Summaries
Next, we load the plot summaries, which serve as the input to our model.
summary_df = pd.read_csv(os.path.join(PATH, 'plot_summaries.txt'), sep = '\t', header = None)
summary_df.columns = ['Movie_ID','Movie_Summary']
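The loading step reads a headerless, tab-separated file with one movie per line. A sketch of the same call on an in-memory string (the two rows are invented) shows the resulting shape; passing `names=` is an equivalent alternative to assigning `columns` afterwards.

```python
import io
import pandas as pd

# Two invented rows in the same "ID<TAB>summary" layout.
raw = "101\tA detective hunts a thief.\n202\tTwo friends open a bakery.\n"

toy_summaries = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["Movie_ID", "Movie_Summary"],
)
print(toy_summaries)
```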
6. Merge the Data
Finally, the dataframes are merged to produce the model inputs (the summaries) alongside the labels used to train it.
preprocessed_df = pd.merge(movie_genres_df, summary_df, how = 'inner', left_on = 'Movie_ID', right_on = 'Movie_ID')
column_order = list(preprocessed_df.columns[:2])
column_order.append(preprocessed_df.columns[-1])
column_order.extend(list(preprocessed_df.columns[2:-1]))
preprocessed_df = preprocessed_df[column_order]
preprocessed_df.iloc[:, 3:].describe()
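An inner merge keeps only the movies that have both labels and a summary. The sketch below shows this on invented data and reorders the columns by name, which is an alternative to the positional slicing used above and is not the original approach.

```python
import pandas as pd

# Invented label and summary frames that overlap on Movie_ID 2 and 3 only.
labels = pd.DataFrame(
    {"Movie_ID": [1, 2, 3], "Movie_Name": ["A", "B", "C"], "drama": [1, 0, 1], "comedy": [0, 1, 0]}
)
summaries = pd.DataFrame({"Movie_ID": [2, 3, 4], "Movie_Summary": ["s2", "s3", "s4"]})

# Movies 1 and 4 are dropped because each is missing from one side.
merged = pd.merge(labels, summaries, how="inner", on="Movie_ID")

# Put the identifying and text columns first, followed by the genre labels.
id_cols = ["Movie_ID", "Movie_Name", "Movie_Summary"]
label_cols = [col for col in merged.columns if col not in id_cols]
merged = merged[id_cols + label_cols]
print(merged)
```

Comparing `len(merged)` with the lengths of the two inputs is a quick way to see how many movies were lost in the join.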
7. Write to a CSV File
The final dataframe is stored in a CSV file for convenient reuse.
preprocessed_df.to_csv(os.path.join(PATH, 'summary_and_labels.csv'), index = False)
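A quick round trip can confirm that the file reads back as expected. This sketch writes an invented frame to a temporary directory rather than to `PATH`; the summary containing a comma checks that quoting survives the trip.

```python
import os
import tempfile
import pandas as pd

# Invented frame with the same kind of columns as the final dataframe.
frame = pd.DataFrame({"Movie_ID": [1, 2], "Movie_Summary": ["a, b", "c"], "drama": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "summary_and_labels.csv")
    # index=False keeps the row index out of the file, as in the step above.
    frame.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(restored)
```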
8. Conclusion
With the preprocessing in place, we can move on to identifying genres from movie summaries with NLP. In this post, we extracted the most common genres from the metadata and converted them to a one-hot encoded form. We then merged the cleaned genre data with the movie summaries to create the dataset that we'll use for model training.