How Are Dishes Clustered Together Based on Their Ingredients?

Jasneek Chugh
Web Mining [IS688, Spring 2021]
9 min read · Mar 14, 2021


A simple implementation of clustering dishes and finding similarities between them based on the ingredients they share.

Being an international student in the US, one of the biggest challenges I have faced is cooking. In my early days, it took me a lot of time to prepare food. For every meal, I would ask myself, “what should I make?” and “what ingredients do I need?” I used to be fascinated by junk food, but ever since I started cooking for myself, I have been eating healthier and paying attention to the ingredients that go into my meals.

Also, during my short cooking experience, I have realized that most of our Indian dishes share the same ingredients. So I wondered: is this the case with other cuisines as well? How close are two recipes to each other? What are the most common ingredients? How can we cluster the dishes?

So, today we will answer the above questions using widely used techniques such as Cosine Similarity and Dimensionality Reduction.

Let’s Get Started

For our analysis, and to measure the similarity between dishes, we need a data source that provides the most significant features: a list of ingredients, the dish name, and the cooking recipe.

After a lot of exploration, I chose to use the Recipes5k dataset by the University of Barcelona.

Recipes5k is a dataset for ingredients recognition with over 4,800 unique recipes composed of an image and the corresponding list of ingredients. It contains a total of around 3,200 unique ingredients (10 per recipe on average). Each recipe represents an alternative way to prepare one of the 101 food types contained in Food101. Hence, it captures at the same time the intra- and inter-class variability of cooking recipes.

I already noticed some issues with the data that need to be addressed before analyzing it: the unique recipes must be mapped to dish names, there are duplicate ingredients, and some dishes are not mapped to any recipe. This is an extensive dataset that requires quite a lot of cleaning.

When you download the dataset you will see a lot of files available, but for our analysis purpose we will only use 3 files:

train_images.txt

classes_Recipes5k.txt

ingredients_Recipes5k.txt

Importing and Loading Libraries

The first thing you need to do is import all the libraries below for our analysis.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE
from IPython.display import display, HTML
import itertools

Data Preparation

First, we need to load, map, and clean the data to make it into a more usable format. Since the required data lives in different files, we need to map the files to each other as well.

Loading the data

Reading the Raw data files

img_name contains the list of image URLs. To get the names of the dishes, we will extract them from these paths.

recipe_name_df contains the list of unique name identifiers for each of the images.

ingredients contains the list of ingredients for each recipe.

We need to extract the dish name from the image URLs stored in ‘train_images.txt’.
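
Roughly, this loading and extraction step looks like the sketch below. The one-entry-per-line file layout and the ‘dish/recipe/image’ path structure are assumptions on my part:

def read_lines(path):
    # each of the three .txt files is assumed to hold one entry per line
    with open(path) as f:
        return [line.strip() for line in f]

img_paths = read_lines("train_images.txt")
recipe_names = read_lines("classes_Recipes5k.txt")
ingredient_lists = read_lines("ingredients_Recipes5k.txt")

# the dish name is assumed to be the first component of the image path,
# e.g. 'sushi/5_spicy_tuna_roll/0.jpg' -> 'sushi'
img_name = pd.DataFrame({"img_path": img_paths})
img_name["Dish_Name"] = img_name["img_path"].str.split("/").str[0]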

There are 3,409 records but, as you can see, there are duplicate dish names.

We need to get the unique names, and then we will map the dishes to the recipes.

Unique Dish Name

There are only 101 unique dishes in our dataset. We will be mapping Recipes to Dishes.

Mappings

Mapping Unique Recipe Names with Ingredients

Mapping Unique recipes and dishes

After all the required mappings, the dataset has 4826 rows and 3 columns.
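
A sketch of these mappings, assuming the recipe and ingredient files are aligned line by line and that each recipe identifier begins with its dish name:

# classes_Recipes5k.txt and ingredients_Recipes5k.txt are assumed to be
# aligned line by line, so the two columns can be placed side by side
dataset = pd.DataFrame({
    "Recipe_Name": recipe_names,
    "Ingredients": ingredient_lists,
})

# map each recipe to a dish via the leading dish token of its identifier,
# keeping only dishes that actually appear among the training images
dataset["Dish_Name"] = dataset["Recipe_Name"].str.split("/").str[0]
unique_dishes = img_name["Dish_Name"].unique()
dataset.loc[~dataset["Dish_Name"].isin(unique_dishes), "Dish_Name"] = np.nan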

Handling Null values

As you can see, there are 499 missing dishes in our data.

The best way to deal with this is to drop the records that don’t have a dish name, as shown below.

Dropping the null values
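
Assuming the merged frame is called dataset, as in the sketches above, the drop is a one-liner:

# drop the 499 recipes whose dish name could not be mapped
dataset = dataset.dropna(subset=["Dish_Name"]).reset_index(drop=True)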

Final Cleaned Data

Cleaned Data

Exploratory Data Analysis

Top 5 Dishes having maximum Recipes in the dataset

Steak has the maximum number of recipes. Pizza, lasagna, and guacamole have roughly equal numbers of recipes, around 55 each.

Top 10 Most Common Ingredients among dishes

Distribution of Ingredients of Dishes

Salt, oil, egg, onion, butter, pepper, sugar, cheese, garlic, and flour are the 10 most common ingredients. Salt is the only ingredient used in over 50% of recipes, while flour appears in just over 20% of them. Salt has to be there in almost every dish; without it, a dish tastes bland.
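
These counts can be reproduced with a few lines of pandas. The column names follow the sketches above, and the comma-separated ingredient format is an assumption:

# recipes per dish: Steak, Pizza, Lasagna, Guacamole, ...
top_dishes = dataset["Dish_Name"].value_counts().head(5)

# ingredients are assumed to be one comma-separated string per recipe
ingredient_series = dataset["Ingredients"].str.split(",").explode().str.strip()
top_ingredients = ingredient_series.value_counts().head(10)
print(top_dishes, top_ingredients, sep="\n")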

PCA and t-SNE representation of dishes

One-hot encoded vector of Ingredients

To find similarities between dishes and cluster recipes using their ingredients, we will represent a recipe by a one-hot encoded vector of its ingredients. We will be establishing a vocabulary of ingredients using a method ‘DictVectorizer’ provided in the sklearn library[6].

DictVectorizer transforms lists of feature-value mappings to vectors. This transformer turns lists of mappings (dict-like objects) of feature names to feature values into Numpy arrays for use with scikit-learn estimators.

When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on.

#function to convert a list of ingredients into a dictionary
def convert_to_dict(lst):
    d = {}  #empty dict
    for ingre in lst:
        d[ingre] = 1
    return d

#(assumed step) build the per-recipe dicts from the ingredient strings
dataset["bagofwords"] = dataset["Ingredients"].str.split(",").apply(convert_to_dict)

#one-hot encoding
vector_dict = DictVectorizer(sparse=False)
X = vector_dict.fit_transform(dataset["bagofwords"].tolist())
y = dataset.Dish_Name.astype("category").cat.codes
A new array of shape 680 by 680

PCA- Principal Component Analysis

PCA is a technique for reducing the number of dimensions in a dataset while retaining most of the information. It uses the correlations between dimensions to provide a minimal set of variables that keeps the maximum amount of variation, i.e., information about how the original data is distributed [4].

There are 100 dishes in the cleaned dataset. We will use only a subset of 10 dishes for this analysis to get a clearer visual.

Performing kernel PCA with a cosine kernel as the similarity measure.
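
Roughly, the fitting step looks like this. The subset selection (the first 10 dish codes) and the variable names are assumptions:

# restrict the one-hot matrix to a 10-dish subset
dish_name_lst = dataset["Dish_Name"].astype("category").cat.categories
subset_codes = np.arange(10)  # hypothetical: the first 10 dish codes
mask = np.isin(y, subset_codes)
X_subset, y_subset = X[mask], np.asarray(y)[mask]

# kernel PCA with a cosine kernel on the one-hot ingredient vectors
x_pca = KernelPCA(n_components=2, kernel="cosine").fit_transform(X_subset)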

# Plotting the clusters from the first two principal components
plot_pca = pd.DataFrame(data=x_pca[:, :2], columns=["PC1", "PC2"])
plot_pca["Dish"] = dish_name_lst[y_subset].tolist()
sns.lmplot("PC1", "PC2", data=plot_pca, palette="Paired", legend=True,
           hue="Dish", fit_reg=False)
The cluster of Dishes formed by PCA

From the PCA plot, we can observe the presence of some clusters, but dishes such as lasagna, nachos, and pizza overlap.

There is no clean separation into clusters.

t-SNE: t-Distributed Stochastic Neighbor Embedding

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that we can visualize it [2].

t-SNE constructs a probability distribution over the high-dimensional samples in such a way that similar samples have a high likelihood of being picked, while dissimilar points have an extremely small likelihood of being picked. If you want a detailed explanation, I highly recommend watching the StatQuest video on YouTube.

The “perplexity” parameter says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbours each point has. The perplexity value has a complex effect on the resulting pictures.
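
The embedding step itself is short. The perplexity value below is an assumed starting point (see [3] for tuning guidance):

# t-SNE on the same 10-dish subset of one-hot vectors
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_subset)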

#plotting t-SNE clusters
plot_tsne = pd.DataFrame(data=X_tsne[:, :2], columns=["x", "y"])
plot_tsne["Dish"] = dish_name_lst[y_subset].tolist()
sns.lmplot("x", "y", data=plot_tsne, palette="Paired", legend=True,
           hue="Dish", fit_reg=False)
The cluster of Dishes using t-SNE

The t-SNE projection is much clearer than PCA. Clusters are more visible, and we get far better insights in comparison to PCA. PCA with a cosine kernel does fairly well for this subset of dishes, but as the number of dishes increases, it becomes impossible to distinguish the clusters.

t-SNE differs from PCA by preserving only small pairwise distances (local similarities), whereas PCA is concerned with preserving large pairwise distances to maximize the variance.

Cosine Similarity

I use cosine similarity from scikit-learn to calculate the similarity between the recipes [5]. The idea is to measure the cosine of the angle between two vectors projected in a multi-dimensional space; for non-negative vectors like ours, the cosine lies between 0 and 1 [1].

The cosine of the angle between vectors A and B: similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

Recipes can be compared by summing the one-hot encoded vectors of their ingredients and computing the cosine similarity between the resulting vectors. Hence, with this method, the question “How close are two recipes to each other?” reduces to evaluating the cosine similarity of their vector representations. A value close to 1 indicates high similarity between two recipes, whereas a value close to 0 indicates no similarity. More information can be found on Wikipedia [1].
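
A sketch of the lookup, with one extra import; the helper name and the column names are hypothetical:

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between all recipe vectors
sim = cosine_similarity(X)  # shape: (n_recipes, n_recipes)

def top_similar(recipe_idx, n=10):
    scores = sim[recipe_idx].copy()
    scores[recipe_idx] = -1.0  # exclude the recipe itself
    top_idx = np.argsort(scores)[::-1][:n]
    return dataset.iloc[top_idx][["Recipe_Name", "Dish_Name"]].assign(
        similarity=scores[top_idx])

Calling top_similar with the row index of a sushi, paella, or onion-ring recipe produces tables like the ones below.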

Top 10 Recipes similar to Sushi

Recipes similar to Sushi

Top 10 Recipes similar to Paella

Recipes similar to Paella

Top 10 Recipes similar to Onion Rings

Recipes similar to Onion Rings

Limitations and Problems faced

  1. One of the major limitations of this analysis is that the dataset is small and has relatively few features. Going forward, we could collect more data covering more dishes and recipes.
  2. A lot of cleaning was required before the analysis. Besides the cleaning itself, we also had to map ingredients to recipes, correct ingredient names, and deal with duplicate values.
  3. The “perplexity” parameter in t-SNE can be misleading and can affect the resulting graph, although t-SNE is fairly robust to changes in perplexity. This blog helped me find an appropriate perplexity value [3].

Conclusion and Summary

Through our analysis, we established similarities between dishes and their mutual influences, and characterized each with a set of distinctive components. Salt, oil, egg, and onion are among the top ingredients used in most dishes. We represented recipes as one-hot encoded ingredient vectors and saw how clusters form under two dimensionality-reduction techniques, PCA and t-SNE. t-SNE formed better and clearer clusters than PCA.

Finally, we saw how close recipes are to each other. For that, we used cosine similarity and returned the top 10 recipes similar to Onion Rings, Sushi, and Paella, separately.

REFERENCES

[1] https://en.wikipedia.org/wiki/Cosine_similarity

[2] https://towardsdatascience.com/t-sne-python-example-1ded9953f26

[3] https://distill.pub/2016/misread-tsne/

[4] https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

[5] https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a

[6] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
