Restaurant Recommendation System based on the content in reviews

Jasneek Chugh
12 min read · May 8, 2021

Creating a recommendation system using content-based filtering in Python

It is very common to hang out with family, friends, and coworkers when lunch or dinner time comes around. As users of recommendation applications, people care most about whether they will like a restaurant. In the past, people obtained restaurant suggestions from friends. Although this method is straightforward and friendly, it has some severe limitations. First, recommendations from friends are limited to the places they have visited before, so the user cannot learn about places their friends rarely visit. Besides that, there is a chance the user simply will not like the place a friend recommends.

Bengaluru, India is one such city, with more than 12,000 restaurants serving dishes from all over the world. With new restaurants opening every day, the industry hasn't saturated yet and demand keeps increasing. In spite of this growing demand, it has become difficult for new restaurants to compete with established ones, since most of them serve the same food. Bengaluru is the IT capital of India, and many people there depend mainly on restaurant food because they don't have time to cook for themselves. With such overwhelming demand for restaurants, it has become important to study the demography of a location. We will use a dataset from Zomato containing data for Bangalore city.

The aim is to create a content-based recommender system: when we enter a restaurant name, the system looks at the reviews of other restaurants, recommends restaurants with similar reviews, and sorts them from the highest rated.

The main beneficiaries of this recommendation system are tourists who are new to a city, since most tourists love to visit famous restaurants during their stay. It can also be used heavily by locals, to discover new restaurants recommended based on their activity.

Let’s Get Started

For our analysis, and to measure the similarity between user reviews so we can recommend similar restaurants, we need a data source that provides the most significant features: restaurant name, rating, cost, and reviews.

Therefore, we will use the Zomato Bangalore dataset for our analysis and draw conclusions using the content-filtering method. This dataset consists of restaurants in Bangalore, India, collected from Zomato. Zomato is an online food delivery app, and Bengaluru is one of the most digitally enabled cities in India; a huge share of the city's population uses Zomato to find its next meal of the day.

Here I will be using content-based filtering, so let us first understand what content-based filtering is.

Content-Based Filtering: This method uses only information about the description and attributes of the items a user has previously consumed to model the user's preferences. In other words, these algorithms try to recommend items similar to those the user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended.[2]
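To make the idea concrete, here is a minimal toy sketch (not from the original post; the restaurant names and descriptions are made up) showing the core mechanic: describe each item by text, vectorize the descriptions, and recommend the item most similar to one the user already liked.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical items, for illustration only
items = {
    "Restaurant A": "north indian curry naan friendly staff",
    "Restaurant B": "south indian dosa filter coffee quick service",
    "Restaurant C": "north indian biryani kebab great ambience",
}
names = list(items.keys())

# Turn each description into a tf-idf vector and compare all pairs
vectors = TfidfVectorizer().fit_transform(items.values())
sim = cosine_similarity(vectors)

# If the user liked "Restaurant A", rank the other items by similarity to it
liked = names.index("Restaurant A")
ranked = sorted(
    ((sim[liked, i], names[i]) for i in range(len(names)) if i != liked),
    reverse=True,
)
print(ranked)  # "Restaurant C" (shares "north indian") should rank first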

A recommender system has to decide between two methods for information delivery when providing the user with recommendations:

  • Exploitation. The system chooses documents similar to those for which the user has already expressed a preference.
  • Exploration. The system chooses documents where the user profile does not provide evidence to predict the user’s reaction.

Data Set

The CSV data file was collected from the Kaggle website. This dataset contains reviews of restaurants from Zomato. The data contains ~51,000 records and 17 columns. Reviews include user information, ratings, and the plain text of the review. There are approximately 6,500 unique restaurants in our data.

The dataset has attributes such as URL, address, rate, votes, location, cuisines, approximate cost, and reviews_list, but we will mainly make use of features like URL, cuisines, rate, and reviews_list.

I already notice some issues with the data that I'll need to address before analyzing it: removing duplicate values, removing NaN values, changing column names, transforming data, and cleaning the text columns. This is an extensive dataset that most likely requires quite a lot of cleaning.

Importing and Loading Libraries

The first thing to do is import all the libraries below for our analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
import re       # used later to strip URLs from the reviews
import string   # used later to strip punctuation from the reviews
import warnings
warnings.filterwarnings('ignore')

from nltk.corpus import stopwords  # run nltk.download('stopwords') once beforehand
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Pandas and NumPy are used for data preprocessing and cleaning. Seaborn, Plotly, and Matplotlib create the visual graphics and bar plots for the dataset. Since we will also be cleaning text data (the reviews), we use the nltk and sklearn libraries (plus the standard re and string modules) for that.

Loading the data

zomato_data = pd.read_csv("/Users/jasneekchugh/Desktop/DS_NJIT/IS688-WebMining/restaurant_reccomendation/zomato.csv")
zomato_df = zomato_data.copy()  # work on a copy, keep the raw data intact
zomato_df.head(2)
First two records of the data

Columns description

  1. url contains the URL of the restaurant on the Zomato website
  2. address contains the address of the restaurant in Bengaluru
  3. name contains the name of the restaurant
  4. online_order whether online ordering is available in the restaurant or not
  5. book_table whether the table booking option is available or not
  6. rate contains the overall rating of the restaurant out of 5
  7. votes contains the total number of ratings for the restaurant as of the above-mentioned date
  8. phone contains the phone number of the restaurant
  9. location contains the neighbourhood in which the restaurant is located
  10. rest_type restaurant type
  11. dish_liked dishes people liked in the restaurant
  12. cuisines food styles, separated by commas
  13. approx_cost(for two people) contains the approximate cost of a meal for two people
  14. reviews_list list of tuples containing reviews for the restaurant, each tuple holding a rating and the review text
  15. menu_item contains a list of menu items available in the restaurant
  16. listed_in(type) type of meal
  17. listed_in(city) contains the neighbourhood in which the restaurant is listed

Getting a view of the data and the values in each column

Many columns have null values, some columns are not required for our analysis, and we also need to clean the column names and the values of the text columns.
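A quick inspection sketch makes these issues visible before we start cleaning:

# Quick look at column dtypes, null counts per column, and duplicate rows
zomato_df.info()
print(zomato_df.isnull().sum().sort_values(ascending=False))
print("Duplicate rows:", zomato_df.duplicated().sum())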

Data Preparation/Data Cleaning

First, we need to clean the data to make it into a more usable format.

Operations performed: dropping the unnecessary columns, NaN values, and duplicates, and changing the column names and datatypes wherever required.

# Dropping columns we won't use: "phone", "dish_liked", "menu_item"
zomato_df = zomato_df.drop(['phone', 'dish_liked', 'menu_item'], axis=1)
# Remove the NaN values from the dataset
zomato_df.dropna(how='any', inplace=True)
# Removing the duplicates
print(zomato_df.duplicated().sum())  # count them first
zomato_df.drop_duplicates(inplace=True)
# Changing the column names
zomato_df = zomato_df.rename(columns={'approx_cost(for two people)': 'cost',
                                      'listed_in(type)': 'type',
                                      'listed_in(city)': 'city'})
# Removing '/5' from rate (and dropping the 'NEW' and '-' placeholder ratings)
zomato_df = zomato_df.loc[zomato_df.rate != 'NEW']
zomato_df = zomato_df.loc[zomato_df.rate != '-'].reset_index(drop=True)
remove_slash = lambda x: x.replace('/5', '') if isinstance(x, str) else x
zomato_df.rate = zomato_df.rate.apply(remove_slash).str.strip().astype('float')
# Cleaning cost: strip the thousands separator (e.g. '1,200' -> 1200), then cast to float
zomato_df['cost'] = zomato_df['cost'].astype(str)
zomato_df['cost'] = zomato_df['cost'].apply(lambda x: x.replace(',', ''))
zomato_df['cost'] = zomato_df['cost'].astype(float)

Cleaned Data

The shape of the data has changed after the cleaning: there are no null records, and we now have clean rate and cost columns as well.
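A short sanity-check sketch to confirm this:

# Sanity check after cleaning: shape, remaining nulls, and column dtypes
print("Shape after cleaning:", zomato_df.shape)
print("Nulls remaining:", zomato_df.isnull().sum().sum())
print(zomato_df[['rate', 'cost']].dtypes)  # both should now be float64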

Data after cleaning

Data Transformation

Let’s compute Mean Rating for each restaurant

## Computing Mean Rating
restaurants = list(zomato_df['name'].unique())
zomato_df['Mean Rating'] = 0.0
for restaurant in restaurants:
    # .loc avoids the chained-assignment warning the original indexing triggers
    mask = zomato_df['name'] == restaurant
    zomato_df.loc[mask, 'Mean Rating'] = zomato_df.loc[mask, 'rate'].mean()

# Scaling the mean rating values into the 1-5 range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(1, 5))
zomato_df[['Mean Rating']] = scaler.fit_transform(zomato_df[['Mean Rating']]).round(2)

Printing the mean ratings for the restaurants; we will use these later in our analysis.
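One way to print them (a small sketch):

# One row per restaurant, with its raw rate and scaled mean rating
zomato_df[['name', 'rate', 'Mean Rating']].drop_duplicates(subset='name').head(10)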

Text Preprocessing and Cleaning

We will be using the 'reviews_list' and 'cuisines' features to create the recommender system, so we need to prepare and clean the text in those columns.

Operations performed: lowercasing, removal of punctuation, removal of stopwords, removal of URLs, and spelling correction (sketched after the code block below).

## Lowercasing
zomato_df["reviews_list"] = zomato_df["reviews_list"].str.lower()

## Removal of punctuation
PUNCT_TO_REMOVE = string.punctuation

def remove_punctuation(text):
    """Custom function to remove the punctuation."""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

zomato_df["reviews_list"] = zomato_df["reviews_list"].apply(remove_punctuation)

## Removal of stopwords
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """Custom function to remove the stopwords."""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

zomato_df["reviews_list"] = zomato_df["reviews_list"].apply(remove_stopwords)

## Removal of URLs
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

zomato_df["reviews_list"] = zomato_df["reviews_list"].apply(remove_urls)

Printing the cleaned reviews_list and cuisines columns.

zomato_df[['reviews_list', 'cuisines']][:5]
Cleaned Review list and Cuisines

Exploratory Data Analysis

  • Types of restaurant
  • Distribution of Restaurant Rating
  • Top 10 Rated Restaurants
  • Top 10 Reviewed Restaurants
# Most famous restaurant chains in Bangalore
plt.figure(figsize=(8,5))
chains = zomato_df['name'].value_counts()[:10]
sns.barplot(x=chains, y=chains.index, palette='deep')
plt.title("Most famous restaurant chains in Bangalore")
plt.xlabel("Number of outlets")

# Types of restaurant
counts = zomato_df["rest_type"].value_counts()[:10]
p = counts.sort_values().plot.barh(figsize=(8,5), fontsize=18)
p.set_xlabel("Number of Restaurants", fontsize=18)
p.set_ylabel("Restaurant Type", fontsize=18)
p.set_title("Types of Restaurant", fontsize=20)

# Distribution of restaurant ratings
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 5))
sns.distplot(zomato_df.rate, kde=False, color='g', ax=ax, bins=20)
ax.axvline(zomato_df.rate.mean(), 0, 1, color='r', label='Mean')
ax.legend()
ax.set_ylabel('Count', size=20)
ax.set_xlabel('Rate', size=20)
ax.set_title('Distribution (count) of restaurant ratings', size=20)

# Top 10 rated restaurants
df_rating = zomato_df.drop_duplicates(subset='name')
df_rating = df_rating.sort_values(by='Mean Rating', ascending=False).head(10)
plt.figure(figsize=(7,5))
sns.barplot(data=df_rating, x='Mean Rating', y='name', palette='RdBu')  # column is 'name'
plt.title('Top 10 Rated Restaurants')
EDA on some of the columns of the Restaurant dataset

The exploratory data analysis charts above were based on the most important columns of our dataset. We can easily see the top restaurants by reviews and votes, and the most famous chains.

From the distribution of restaurant ratings, we can see that on average the restaurants in our dataset have a rating between 3.5 and 4.0; there are very few restaurants rated below 3.0.

EDA- Word Frequency Distribution

def get_top_words(column, top_nu_of_words, nu_of_word):
    """Return the top_nu_of_words most frequent n-grams in a text column.
    nu_of_word is an ngram_range tuple, e.g. (2, 2) for bigrams."""
    vec = CountVectorizer(ngram_range=nu_of_word, stop_words='english')
    bag_of_words = vec.fit_transform(column)
    # Total count of each n-gram across all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:top_nu_of_words]

Top 15 two-word frequencies for cuisines

# Top 15 two-word frequencies for cuisines
lst = get_top_words(zomato_df['cuisines'], 15, (2, 2))
df_words = pd.DataFrame(lst, columns=['Word', 'Count'])
plt.figure(figsize=(7,6))
sns.barplot(data=df_words, x='Count', y='Word')
plt.title('Word Couple Frequency for Cuisines')

Here we can see that the favourite cuisines among the people of Bangalore are 'North Indian', 'Indian Chinese', and 'Fast Food'.

CONTENT-BASED RECOMMENDER SYSTEM

TF-IDF Matrix (Term Frequency - Inverse Document Frequency Matrix)

The TF-IDF method is used to quantify words and compute weights for them. In other words, it represents each word (or pair of words, etc.) with a number so that we can use mathematics in our recommender system. Put simply, the higher the TF-IDF score (weight), the rarer and more important the term, and vice versa.
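A tiny sketch (toy documents of my own, not from the dataset) shows this weighting in action: a term that appears in every document gets a lower weight than a term unique to one.

# "food" appears in all three documents, so it gets a lower tf-idf
# weight than "biryani", which appears only in the third one.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good food", "bad food", "biryani food"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Weights of each term in the third document ("biryani food")
for term, col in sorted(vec.vocabulary_.items()):
    print(term, round(X[2, col], 3))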

Cosine similarity is a metric used to determine how similar two documents are, irrespective of their size.

# Sampling a share of the data keeps the TF-IDF matrix manageable in memory.
# (Assumption: df_percent was created along these lines; the exact fraction
# is not shown in the original post.)
df_percent = zomato_df.sample(frac=0.5)

df_percent.set_index('name', inplace=True)
indices = pd.Series(df_percent.index)

# Creating the tf-idf matrix over the cleaned review text
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_percent['reviews_list'])

Here, tfidf_matrix is the matrix containing each word and its TF-IDF score with regard to each document, or item in this case. Stop words are simply words that add no significant value to our system, like 'an', 'is', and 'the', and hence are ignored by the system.[3]

Now, we have a representation of every item in terms of its description. Next, we need to calculate the relevance or similarity of one document to another.
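Before computing similarities, it can help to check the matrix dimensions (a quick sketch; the exact numbers depend on the sample taken above):

# Each row is one restaurant's review text, each column a word or bigram
print(tfidf_matrix.shape)  # (restaurants sampled, vocabulary size)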

Calculating Cosine Similarity

The formula for cosine similarity: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the tf-idf vectors of two documents.

We want to calculate the cosine similarity of each item with every other item in the dataset. So we just pass the matrix as an argument.[4]

cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

Here we have calculated the cosine similarity of each item with every other item in the dataset.
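A note on why linear_kernel works here: TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the plain dot product already equals cosine similarity. A quick sketch to verify:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Because the tf-idf rows are unit-length, the linear kernel (dot product)
# and cosine similarity coincide; this check should pass.
assert np.allclose(cosine_similarities,
                   cosine_similarity(tfidf_matrix, tfidf_matrix))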

Making a Recommendation

So here comes the part where we finally get to see our recommender system in action.[3]

def recommend(name, cosine_similarities=cosine_similarities):
    recommend_restaurant = []

    # Find the index of the restaurant entered
    idx = indices[indices == name].index[0]

    # Order all restaurants by cosine similarity, from the biggest value
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)

    # Extract the top 30 most similar restaurant indexes
    # (position 0 is the queried restaurant itself)
    top30_indexes = list(score_series.iloc[0:31].index)

    # Names of the top 30 restaurants
    for each in top30_indexes:
        recommend_restaurant.append(list(df_percent.index)[each])

    # Creating the new data set to show similar restaurants
    df_new = pd.DataFrame(columns=['cuisines', 'Mean Rating', 'cost'])

    # Collect the top 30 similar restaurants with some of their columns
    # (pd.concat instead of .append, which newer pandas has removed)
    for each in recommend_restaurant:
        row = df_percent[['cuisines', 'Mean Rating', 'cost']][df_percent.index == each].sample()
        df_new = pd.concat([df_new, row])

    # Drop restaurants with identical cuisines/rating/cost (this also removes
    # the queried restaurant) and keep only the top 10 by the highest rating
    df_new = df_new.drop_duplicates(subset=['cuisines', 'Mean Rating', 'cost'], keep=False)
    df_new = df_new.sort_values(by='Mean Rating', ascending=False).head(10)

    print('TOP %s RESTAURANTS LIKE %s WITH SIMILAR REVIEWS: ' % (str(len(df_new)), name))

    return df_new

RESULTS

Querying recommendations for three restaurants:
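In code, the three queries look like this (matching the screenshots below):

recommend('Marwa Restaurant')
recommend('Canopy')
recommend('Red Chilliez')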

For Restaurant “Marwa Restaurant”

Printing the info about ‘Marwa Restaurant’
Recommendation for ‘Marwa Restaurant’

For Restaurant “Canopy”

Recommendation for ‘Canopy Restaurant’

For Restaurant “Red Chilliez”

Recommendation for ‘Red Chilliez’

EVALUATION

To check whether our recommendation system is recommending restaurants properly, we can inspect the results it returns. For example, in the case of the ‘Red Chilliez’ restaurant, we can see that the restaurant has a mean rating of 3.26, the cost is 550 rupees, and the cuisine types are ‘North Indian, South Indian, Chinese, Seafood’.

Looking at the results, the recommended restaurants also cost around 550 rupees, have ratings close to 3.26, and offer similar cuisine types. Therefore, most of our recommendations are on target.

Another way is to look for content in the ground truth that has not come from the user's historical data. This content is then compared with the non-historical content in the recommendations using standard metrics like RMSE or Mean Average Precision, which gives us an idea of how good our model is at recommending items.[2] A hypothetical sketch of such a check follows.
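The sketch below shows what a precision@k computation could look like, assuming we had held-out ground-truth sets of relevant restaurants; this dataset ships no such labels, so the values are illustrative only.

# Hypothetical precision@k: the fraction of the top-k recommendations
# that appear in a held-out set of relevant restaurants.
def precision_at_k(recommended, relevant, k=10):
    top_k = recommended[:k]
    return sum(1 for r in top_k if r in relevant) / k

# Illustrative values only; no labeled ground truth exists here
recs = ['Restaurant A', 'Restaurant B', 'Restaurant C', 'Restaurant D']
truth = {'Restaurant A', 'Restaurant C'}
print(precision_at_k(recs, truth, k=4))  # 0.5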

LIMITATIONS and BUGS Encountered

  • The dataset mostly contained restaurants with “North Indian” cuisine, so for most queries you will see at least one “North Indian” restaurant among the recommendations.
  • The dataset has the same restaurant names with different values in the cuisines column. This makes the data somewhat inconsistent and also affects the recommendation system.
  • Limited content: the text in the reviews doesn't always contain enough information to discriminate restaurants precisely, so the recommendations themselves risk being imprecise.

CONCLUSION

Content-based filtering is a great model to start with for recommending restaurants, but it lacks variety: most of the recommended restaurants will be of the same nature. It works best when we want to find similar restaurants, whereas users can have very diverse tastes and may like different types of restaurants with completely different cuisines. The good part is that you don't need other users' data, so there is no cold-start problem (i.e., not having sufficient data to recommend items), which is normally faced by a new application that is still gathering data.

REFERENCES

[1] https://medium.com/analytics-vidhya/recommendation-system-content-based-part-1-8f5ac093127a

[2] https://towardsdatascience.com/recommendation-systems-models-and-evaluation-84944a84fb8e

[3] https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/

[4] https://www.youtube.com/watch?v=i4a0Of22QRg
