Multi-Label Genre Classification-Part I

Unleash the power of data preprocessing for accurate multi-label genre classification in NLP. From data wrangling to feature engineering, discover the essential steps to optimize your models.

The picture isn't over yet, my friend (Picture abhi baaki hai mere dost)

In this post, we'll walk through the basic data preprocessing needed for multi-label genre classification.

Table of Contents

  1. Import Libraries
  2. Load Metadata
  3. Clean Metadata
  4. Convert to Appropriate Form
  5. Load Plot Summaries
  6. Merge the Data
  7. Write to a CSV File
  8. Conclusion

1. Import Libraries

Babylonians invented the wheel. I'd much rather just use it

In this section, we import all the libraries required for data preprocessing.

import json
import tarfile
import pandas as pd
import re
import string
import os
from tqdm import tqdm
from pattern.text.en import singularize
import numpy as np
import tempfile
import sys
import subprocess
from datetime import datetime
from packaging import version

%matplotlib inline

2. Load Metadata

In this section, we will load the metadata that contains information on movie genres.

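PATH = "<YOUR_FILE_PATH>"  # replace with the directory that holds the dataset

def extract_file():
    """
    Extracts all contents of the tar file
    Args: None
    Returns: None
    """
    # the context manager closes the archive once extraction finishes
    with tarfile.open(os.path.join(PATH, "MovieSummaries.tar.gz"), mode="r|gz") as file:
        file.extractall(path = PATH)

# call extract_file() once if the archive has not been unpacked yet
metadata_df = pd.read_csv(os.path.join(PATH, 'movie.metadata.tsv'), sep = '\t', header = None)
metadata_df.head()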
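
The genre column of the metadata (column 8 in the code that follows) stores each movie's genres as a JSON object whose values are the genre names. Here is a minimal sketch of reading one such cell. The sample string and its keys are made up for illustration, and the helper name `genre_names` is hypothetical.

```python
import json

# Hypothetical cell from the genre column; keys and values are illustrative only.
sample_cell = '{"/m/aaa": "Thriller", "/m/bbb": "Science Fiction", "/m/ccc": "Horror"}'

def genre_names(cell):
    """Return the genre names stored as the values of a JSON-encoded cell."""
    return list(json.loads(cell).values())

names = genre_names(sample_cell)
print(names)
```

Only the values matter for our purposes, which is why the cleaning step later discards the keys.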

3. Clean Metadata

In this section, we will clean the metadata, specifically the genre column, to make it more suitable for our application. We'll merge and remove several genres so that around 35 distinct genres remain.
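
Before the full function, here is a condensed sketch of the idea: lowercase the genre name, turn punctuation into spaces, apply a few merge rules, and split the result into tokens. The three rules shown are only a small subset chosen for illustration, the names `MERGE_RULES` and `normalize` are hypothetical, and the input strings are made up.

```python
import re
import string

# Map every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

# Illustrative subset of merge rules as (pattern, replacement) pairs.
MERGE_RULES = [
    (re.compile(r"noir"), "crime drama"),
    (re.compile(r"(zombie)|(slasher)"), "horror"),
    (re.compile(r"roman\S+"), "romance"),
]

def normalize(genre):
    """Lowercase a genre name, strip punctuation, apply merge rules, split into tokens."""
    genre = genre.lower().translate(PUNCT_TO_SPACE)
    for pattern, replacement in MERGE_RULES:
        genre = pattern.sub(replacement, genre)
    return genre.split()

print(normalize("Romantic comedy"))
print(normalize("Slasher"))
print(normalize("Action/Adventure"))
```

Because each genre name is split into tokens at the end, a compound name such as a romantic comedy contributes two labels rather than one.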

def create_list_genres(x):
  """
  Filters and cleans the genres present in the genre column
  Args: JSON string taken from the genre column
  Returns: Set of cleaned genre tokens
  """
  regex = re.compile('((film)s?)|((movie)s?)|((cinema)s?)|((piece)s?)|((period)s?)', re.IGNORECASE)
  x = regex.sub('', x)
  temp_res = list(json.loads(x).values())
  # translation table mapping every punctuation character to a space, built once
  trans_table = str.maketrans({elem: ' ' for elem in string.punctuation})
  res = []
  for elem in temp_res:
      elem = elem.lower()
      
      # black-and-white movies turned to a single category grayscale
      regex_phrase = re.compile('black-and-white', re.IGNORECASE)
      elem = regex_phrase.sub('grayscale', elem)
      
      ##########################################################################################################
      
      # replace punctuation with spaces
      elem = elem.translate(trans_table)
      
      ##########################################################################################################
      
      # combine chinese, indian, japanese movies into one category
      regex_region = re.compile('(chinese)|(japanese)|(bollywood)|(filipino)|(tollywood)|(bengali)', re.IGNORECASE)
      elem = regex_region.sub('asian', elem)
      
      # noir movies are basically crime, drama movies that are generally grayscale
      regex_phrase = re.compile('noir', re.IGNORECASE)
      elem = regex_phrase.sub('crime drama', elem)
      
      # anime and animated cartoons are basically animations
      regex_translate = re.compile(r'(anime)|(animat\S+)', re.IGNORECASE)
      elem = regex_translate.sub('animation', elem)
      
      # docudrama is a portmanteau of documentary and drama
      regex_translate = re.compile('docudrama', re.IGNORECASE)
      elem = regex_translate.sub('documentary drama', elem)
      
      # capture different versions of representing musical movies into a single genre
      regex_translate = re.compile(r'(musical)|(rock(\S+)?)|(concert)|(operetta)|(hip hop)', re.IGNORECASE)
      elem = regex_translate.sub('music', elem)
      
      # experimental movies are basically art films (ahead of their time)
      regex_translate = re.compile('experimental', re.IGNORECASE)
      elem = regex_translate.sub('art', elem)
      
      # spy movies involve elements of thrill, crime and action
      regex_translate = re.compile('spy', re.IGNORECASE)
      elem = regex_translate.sub('crime thriller action', elem)
      
      # mockumentary is a portmanteau of a mockery and a documentary
      regex_translate = re.compile('mockumentary', re.IGNORECASE)
      elem = regex_translate.sub('comedy documentary', elem)
      
      # zombie and slasher movies are a subset of horror movies
      regex_translate = re.compile('(zombie)|(slasher)', re.IGNORECASE)
      elem = regex_translate.sub('horror', elem)
      
      # combine the different ways to reference adult movies to a single category
      regex_adult = re.compile(r'(adult)|(sex(\S+)?)|(erotic(\S+)?)|(porn(\S+)?)', re.IGNORECASE)
      elem = regex_adult.sub('porn', elem)
      
      # swashbucklers are movies that combine adventure and action scenes
      regex_translate = re.compile('swashbuckler', re.IGNORECASE)
      elem = regex_translate.sub('adventure action', elem)
      
      # science fiction is collapsed into the single token scifi
      regex_translate = re.compile(r'science\s+fiction', re.IGNORECASE)
      elem = regex_translate.sub('scifi', elem)
      
      # parody and satire are a subset of comedy movies
      regex_translate = re.compile('(parody)|(satire)', re.IGNORECASE)
      elem = regex_translate.sub('comedy', elem)
      
      # suspense and whodunit are synonymous with mystery movies
      regex_translate = re.compile('(suspense)|(whodunit)', re.IGNORECASE)
      elem = regex_translate.sub('mystery', elem)
      
      # the various ways of representing romance movies are condensed into a single category
      regex_similar_genres = re.compile(r'roman\S+')
      elem = regex_similar_genres.sub('romance', elem)
      
      # political movies are collapsed into a single category
      regex_similar_genres = re.compile(r'politic\S+')
      elem = regex_similar_genres.sub('political', elem)
      
      # biographies are cast into a single category
      regex_similar_genres = re.compile(r'biograph\S+')
      elem = regex_similar_genres.sub('biographical', elem)
      
      # psychological movies turned into a single category
      regex_similar_genres = re.compile(r'psycholo\S+')
      elem = regex_similar_genres.sub('psychological', elem)
      
      ###########################################################################################################
      
      distinct_genres = elem.split()
      res.extend(distinct_genres)
      
  final_res = [singularize(genre) for genre in res]
  final_res = set(final_res)
  return final_res

def get_frequent_genres(df, num_of_genres):
  """
  Get most frequently occurring genres
  Args:
  1) Metadata Dataframe
  2) Number of genres expected
  Returns:
  1) List of the most frequent genres
  """
  df[9] = df[8].apply(lambda x: create_list_genres(x))
  temp = df[9].explode()
  temp = pd.crosstab(temp.index, temp)
  genre_list = list(temp.sum(axis = 0).sort_values(ascending = False).head(num_of_genres).index)
  return genre_list

def retain_frequent_genres(movie_genres, genre_list):
  """
  Retains only those genres that are present in the genre list
  Args: 
  1) Movie genres filtered and cleaned
  2) List of permissible genres
  Returns:
  1) Filtered list of genres that are present in the genre list
  """
  res = []
  for genre in movie_genres:
      if genre in genre_list:
          res.append(genre)
          
  if len(res) == 0:
      return None
  return res

genre_list = get_frequent_genres(metadata_df, 35)

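# drop leftover tokens that are not genres; unlike list.remove inside a
# try block, this still removes 's' when 'of' is absent
genre_list = [genre for genre in genre_list if genre not in ('of', 's')]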
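
The frequency count behind `get_frequent_genres` can also be pictured without a cross-tabulation. The sketch below is an alternative formulation using `collections.Counter`, not the method used above. The genre sets are made up, and the helper name `most_frequent` is hypothetical.

```python
from collections import Counter

# Hypothetical cleaned genre sets, one per movie (illustrative data only).
movie_genres = [
    {"drama", "romance"},
    {"drama", "crime"},
    {"comedy"},
    {"drama", "comedy"},
]

def most_frequent(genre_sets, k):
    """Return the k genres that appear in the most movies."""
    counts = Counter(genre for genres in genre_sets for genre in genres)
    return [genre for genre, _ in counts.most_common(k)]

top_two = most_frequent(movie_genres, 2)
print(top_two)
```

Since each movie's genres form a set, a genre is counted at most once per movie, so the counts are numbers of movies rather than numbers of mentions.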

4. Convert to Appropriate Form

The metadata dataframe is converted to a one-hot encoded form suitable for our model.

metadata_df[10] = metadata_df[9].apply(lambda x: retain_frequent_genres(x, genre_list))
metadata_df = metadata_df[metadata_df[10].notnull()]
create_label_columns = metadata_df[10].explode()
create_label_columns = pd.crosstab(create_label_columns.index, create_label_columns)
create_label_columns
movie_genre_df = pd.DataFrame()
movie_genre_df['Movie_ID'] = metadata_df[0]
movie_genre_df['Movie_Name'] = metadata_df[2]
movie_genre_df
movie_genres_df = pd.concat([movie_genre_df, create_label_columns], axis = 1)
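
To see what the explode and crosstab steps do, here is a small sketch on made-up data. The dataframe `toy` and its contents are hypothetical. The point is only to show how a column of genre lists becomes one 0/1 column per genre.

```python
import pandas as pd

# Hypothetical movies with their retained genres (illustrative data only).
toy = pd.DataFrame(
    {"genres": [["drama", "romance"], ["comedy"], ["drama"]]},
    index=[10, 11, 12],
)

# One row per (movie, genre) pair; the index repeats the movie's row label.
exploded = toy["genres"].explode()

# Cross-tabulating the index against the genre gives one count column per genre.
one_hot = pd.crosstab(exploded.index, exploded)

print(one_hot)
```

The resulting table keeps the original row labels as its index, which is what lets `pd.concat` with `axis = 1` line the label columns up with the movie IDs and names.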

5. Load Plot Summaries

We will load the plot summaries dataframe, which serves as input to our model.

summary_df = pd.read_csv(os.path.join(PATH, 'plot_summaries.txt'), sep = '\t', header = None)
summary_df.columns = ['Movie_ID','Movie_Summary']

6. Merge the Data

Finally, the dataframes are merged to produce the model inputs together with the labels used for training.

preprocessed_df = pd.merge(movie_genres_df, summary_df, how = 'inner', on = 'Movie_ID')

column_order = list(preprocessed_df.columns[:2])
column_order.append(preprocessed_df.columns[-1])
column_order.extend(list(preprocessed_df.columns[2:-1]))

preprocessed_df = preprocessed_df[column_order]
preprocessed_df.iloc[:, 3:].describe()
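
An inner merge keeps only the movies present in both tables, so movies without a summary (or summaries without labels) drop out. A minimal sketch on made-up data follows. The two small dataframes are hypothetical and stand in for the label and summary tables.

```python
import pandas as pd

# Hypothetical label and summary tables (illustrative data only).
labels = pd.DataFrame({"Movie_ID": [1, 2, 3], "drama": [1, 0, 1]})
summaries = pd.DataFrame(
    {"Movie_ID": [2, 3, 4], "Movie_Summary": ["A heist.", "A trial.", "A voyage."]}
)

# Inner join: keep only movies that have both labels and a summary.
merged = pd.merge(labels, summaries, how="inner", on="Movie_ID")

print(merged)
print(f"kept {len(merged)} of {len(labels)} labelled movies")
```

Comparing the row counts before and after the merge is a quick way to check how many labelled movies were lost for lack of a summary.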

7. Write to a CSV File

The final dataframe is stored in a CSV file for convenient reuse.

preprocessed_df.to_csv(os.path.join(PATH, 'summary_and_labels.csv'), index = False)
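
A quick way to see why `index = False` is passed: without it, pandas writes the row index as an extra unnamed column, which then shows up when the file is read back. The sketch below round-trips a tiny made-up dataframe through an in-memory buffer instead of a real file. The helper name `round_trip` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical miniature of the final dataframe (illustrative data only).
toy = pd.DataFrame({"Movie_ID": [1, 2], "drama": [1, 0]})

def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it straight back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

without_index = round_trip(toy, index=False)
with_index = round_trip(toy)  # pandas default is index=True

print(list(without_index.columns))
print(list(with_index.columns))
```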

8. Conclusion

With a solid foundation in data preprocessing, we can now move on to identifying genres from movie summaries with NLP. In this post, we extracted the most common genres from the metadata and converted them to a one-hot encoded form. We then merged the cleaned genre data with the movie summaries to create the dataset that we'll use for model training.