Your Movie Partner: An ML-Based Movie Recommendation System

Published in Analytics Vidhya, May 15, 2021

I recently built an application called “Your Movie Partner”, which uses content-based filtering to recommend movies and classifies viewer reviews on the TMDB site as positive or negative.

Here is the detailed explanation of the project.

Aim of the Project

We Indians are fond of movies. We spend a lot of time and money on the entertainment that movies provide. Since the advent of OTT platforms, that love for movies has only increased.

In 2017, OTT platforms generated revenue of Rs. 2019 crore, and it is estimated that this might rise to Rs. 6000 crore by 2022. It is now up to the OTT owners to make the experience better. One way to do this is to suggest movies that the user might find interesting. Giants like Amazon and Netflix are already doing this.

The aim of this project is to build a movie recommendation system using Artificial Intelligence and build a functional website which can be used by movie lovers.

Introduction

This application has two main features:

1.Recommending movies to users based on the movie they have searched for.

2.Classifying viewer reviews on the TMDB website as positive or negative for the movie they have searched for.

1.Recommending movies to the users based on the movie they have searched

Recommendation systems are one of the most powerful applications of Artificial Intelligence. Here are some applications of recommendation engines.

1.E-commerce giants like Amazon and Flipkart use recommendation engines to show products the user might end up buying.

2.Social media platforms like Instagram, Facebook and YouTube use your activity to show posts that make you stay on the platform.

3.OTT platforms like Amazon Prime and Netflix use their own formulas to recommend movies that the user might like.

Recommendation systems clearly improve the user experience and have a promising future. There are many ways to recommend items to a user; in fact, every company has its own way of doing this.

In this project we use content-based similarity to suggest movies to the user. However, there are various other methods, which are explained in the reference link given above.

Content-based Filtering is a Machine Learning technique that uses similarities in features to make decisions. This technique is often used in recommender systems, which are algorithms designed to advertise or recommend things to users based on knowledge accumulated about the user.

How does it decide which item is most similar to the item the user likes?

There are various ways to measure similarity between movies. Let us examine two of them, cosine similarity and Levenshtein distance, and see how they can be used in our application.

Similarity score

A similarity score is a numerical value between zero and one that indicates how similar two items are. It is obtained by comparing the text details of the two items, so it is a measure of how alike those text details are. It can be computed using cosine similarity.

Cosine similarity

Cosine similarity is a metric used to measure how similar documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.
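To make the size-independence point concrete, here is a minimal sketch that computes cosine similarity over simple word-count vectors using only the Python standard library. The function name cosine_similarity and the sample texts are made up for illustration; the application itself may build its vectors differently.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # The dot product only needs the words that both texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


short_text = "action hero saves city"
long_text = " ".join([short_text] * 10)  # same content, ten times longer
other_text = "romantic comedy in paris"

print(cosine_similarity(short_text, long_text))   # same direction, so about 1.0
print(cosine_similarity(short_text, other_text))  # no shared words, so 0.0
```

The long text is far from the short one in Euclidean terms, yet the two point in the same direction, so their cosine similarity is at the maximum.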

Levenshtein Distance

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

We use a hybrid of both methods in our application. The primary reason for this is to improve accuracy.
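The definition above translates directly into a short dynamic-programming routine. The sketch below is a standard textbook implementation in plain Python, not necessarily the code or library the application uses.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Because it works on raw characters, this metric is well suited to comparing titles, for example tolerating a small typo in the movie name a user types.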

2.Classifying user sentiments using Natural Language Processing

In layman's terms, natural language processing is what helps us talk to the computer. There is an immense amount of research happening in this field currently. Big companies like Google are investing heavily in it.

We use concepts of natural language processing to classify user reviews as positive or negative.

You might be wondering how we use the words entered by the user. Well, this is the magic of natural language processing. We convert the words to vectors and use those vectors as the features for classification.

There are many ways to convert words to vectors. Two of them are bag of words and the TF-IDF vectorizer. A plain bag of words only counts occurrences and gives every word equal weight, so very common words can dominate. We therefore use the TF-IDF vectorizer, which down-weights words that appear in most documents.

TF-IDF vectorizer

TF-IDF for a word in a document is calculated by multiplying two different metrics:

1.The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of the times a word appears in a document. The frequency can also be adjusted by the length of the document, or by the raw frequency of the most frequent word in the document.

2.The inverse document frequency of the word across a set of documents. This measures how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and calculating the logarithm.

So, if the word is very common and appears in many documents, this number will approach 0. The rarer the word, the larger the number becomes.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
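The calculation described above can be sketched in a few lines of plain Python. This version uses length-normalised term frequency and the unsmoothed logarithm of (total documents / documents containing the word). The function name tf_idf and the sample reviews are made up for illustration, and library vectorizers typically apply smoothing, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency (normalised by document length) times
            # inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Made-up reviews, for illustration only.
reviews = [
    "the movie was great",
    "the movie was boring",
    "the plot was great fun",
]
for review, scores in zip(reviews, tf_idf(reviews)):
    print(review, "->", scores)
```

Words such as "the" and "was" appear in every review, so their score is 0, while a word that appears in only one review, such as "boring", gets the highest weight in that review.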

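Naive Bayes classifier

A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification tasks. The classifier is based on Bayes' theorem, and it is called naive because it assumes that the features (here, the words of a review) are independent of each other given the class.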
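As a rough illustration of the idea, here is a tiny multinomial Naive Bayes text classifier written from scratch with add-one smoothing. It works on raw word counts rather than TF-IDF vectors to keep the sketch short. The class name, the training reviews and the labels are all made up for this example; a real project would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of texts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_texts = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_texts in self.label_counts.items():
            # log P(label) + sum of log P(word | label); logs avoid underflow.
            score = math.log(n_texts / total_texts)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training reviews, for illustration only.
train_texts = [
    "loved the acting and the story",
    "great movie with a great cast",
    "boring plot and terrible acting",
    "a dull and terrible movie",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great story"))          # positive
print(model.predict("terrible and boring"))  # negative
```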

Architecture

The above picture briefly explains the architecture of the application. Here is a detailed explanation of the same.

When the user enters the name of a movie:

1.The details regarding the movie are fetched using the TMDB API. TMDB is a popular site that has detailed information about movies. We use its API service to get the required details.

2.Viewer reviews are obtained by web scraping the reviews on the TMDB site. These reviews are converted into vectors using the TF-IDF vectorizer (described in the previous section) and are then classified as positive or negative by a model trained on previously labeled reviews.

3.Movies are recommended to the user by comparing the Levenshtein distance between the current movie and the movies present in the TMDB dataset. The ones with the smallest Levenshtein distance are suggested.
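Step 1 can be sketched as follows. The article does not show which TMDB endpoints the application calls, so this assumes the TMDB v3 movie search endpoint with an api_key query parameter. The helper names build_search_url and search_movie, and the YOUR_API_KEY placeholder, are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TMDB_BASE_URL = "https://api.themoviedb.org/3"


def build_search_url(title, api_key):
    """Build the TMDB v3 movie-search URL for a title (assumed endpoint)."""
    query = urlencode({"api_key": api_key, "query": title})
    return f"{TMDB_BASE_URL}/search/movie?{query}"


def search_movie(title, api_key):
    """Return the first search result for a title, or None if there is none."""
    with urlopen(build_search_url(title, api_key)) as response:
        payload = json.load(response)
    results = payload.get("results", [])
    return results[0] if results else None


# Building the URL needs no network access; YOUR_API_KEY is a placeholder.
print(build_search_url("The Dark Knight", "YOUR_API_KEY"))
```

Calling search_movie requires a valid API key and network access, so only the URL construction is exercised here.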

Technologies and concepts used

1.Similarity score and Levenshtein distance

2.Flask for the backend

3.HTML, CSS and JS for the front end

4.Beautiful Soup library for web scraping

5.Naive Bayes classifier for classifying user reviews.

6.TMDB API to get the data regarding movies.

These are some of the important ones; however, many more are used. Please check the requirements.txt file in the GitHub repository.

Methodology

1.Create an account at https://www.themoviedb.org/, click on the API link in the left-hand sidebar of your account settings and fill in all the details to apply for an API key. We use this API to get our data.

2.Prepare a sample reviews.txt which can be used to classify viewer reactions fetched from the TMDB API.

3.Create a model sentiment.pynb for classifying user sentiments based on the above reviews.txt. The classifier used is a Naive Bayes classifier.

4.For the front end, design the UI using home.html and reccomend.html.

5.Write the AJAX calls and the rules for retrieving data using JavaScript. Find the file reccomend.js in the GitHub repo below.

6.Now write the main.py file to complete the backend.

This is just an outline. There are many other things to be done to complete the application. Find the code in the GitHub repository at the end of the article.