Comparison of Major Supermarkets in Australia — Part 3 (Fuzzy matching Woolworths and Coles products)
Fuzzy matching notebook file (here), outputs folder (here)
What is Fuzzy matching?
Fuzzy matching, also known as approximate string matching, is a technique to match a string with a pattern, approximately, rather than exactly. The developers of FuzzyWuzzy — a Python package for fuzzy matching, explains why they created the package in their blog:
This is a very similar problem to matching up identical products from different supermarkets. There are products which have the same individual words in their product names but they are ordered differently.
E.g.
- Coles — Helga’s Light Rye Bread 680g
- Woolworths — Helga’s Bread Light Rye 680g
Using the token sort approach, the string in question is tokenized (i.e. each word becomes a token) and the tokens are sorted alphabetically, and then joining them back into a string. So, both of the above would become, ‘680g Bread Helga’s Light Rye’, thereby giving us a token sort ratio score of 100.
The more the differences in the two strings, the lower the ratio.
E.g.
- “Hoyts Cinnamon Sticks 15g”
vs
“Hoyt’s Cinnamon Sticks 15g”
token_sort_ratio = 98 - “Cadbury Curly Wurly Bar 21.5g”
vs
“Cadbury Curly Wurly 21.5g”
token_sort_ratio = 92 - “Pureharvest Organic Apple & Pear Juice Bottle 1L”
vs
“Pureharvest Organic Oat Milk 1L”
token_sort_ratio = 70
My Process of Matching Products
Using RapidFuzz — the faster FuzzyWuzzy
Instead of using FuzzyWuzzy, which is written almost entirely in Python, there is another far superior package that serves the exact same purpose, but is many orders of magnitude faster. The creator even uses the same syntax as FuzzyWuzzy so that anyone familiar with FuzzyWuzzy can easily switch over. This package that I opted to use is called RapidFuzz. It’s written mostly in C++ and has a lot of algorithmic improvements, making it much faster than FuzzyWuzzy. Since, my objective is to crossmatch 21,500+ Coles products with a list of 23,000+ Woolworths items and find the one item with the highest token sort ratio, I will need all the optimisation and speed I can possibly get!
I tested the performance difference by matching the IGA product list (1779 items) against the entire product list of Woolworths (23,461 items). Using FuzzyWuzzy, the operation took 455secs, compared with 47secs when using RapidFuzz. Almost a 10x improvement in speed!
Matching Products
- Datasets from Coles and Woolworths were loaded into Pandas DataFrames. Custom functions were created for extracting the string from the Brand_Product_Size column with the best token sort ratio score. When we created this column in the previous stage for this very matching process. This ensure that the matched product also matches the correct packaging. It would be ridiculous to compare a 3L bottle of milk with a 1L one with the same brand and product name.
- Two new columns were created, WOW Product Match and Match Score. The output from the custom function is a tuple — (product name, ratio score). The product name was assigned to the WOW Product Match column, and the token sort ratio went into the Match Score column. A cut-off score of 70 was imposed for the token sort ratio because I had no intention on manually matching and confirming the accuracy of every match for every single item in Coles’ product list. The goal was to get a sufficiently sized sample to draw some inferences.
- This new dataset was then outputted to a CSV file.
- Using MS Excel, a new column (Manual Match) was inserted. The spreadsheet was sorted by Match Score from highest to lowest. The rows with a value of 100 for Manual Match were left alone, the rest were manually confirmed whether it was indeed the correct match by inserting ‘Y’ into that column.
If this was a real project for determining pricing competitiveness intended for corporate strategy decisions, it would be ideal to make use of ̶t̶h̶e̶ ̶a̶r̶m̶y̶ ̶o̶f̶ ̶s̶l̶a̶v̶e̶s̶ interns and graduates to manually comb over the competitor’s product list to determine if a match exists and, if so, input that product into the ‘product match’ column. I only completed around 5000 matches, which is approximately just over 1/5th of the entire Coles products list — which should be sufficient, statistically speaking, to draw some meaningful insights from the price comparison analysis in the next part.