- Priyanka P. Pattnaik

# tfidf-matcher: the SUPER-FAST string matching package

If you work in the field of data, you can picture yourself as a person searching for the correct data in a large dataset. It is crucial and very time-consuming work. In artificial intelligence, getting your data right clears the first hurdle. This blog is about one of my projects, where I needed to find matching strings in a large database while keeping the time complexity in check.

In layman's terms, time complexity is the total time taken to get the output from a given piece of work. More formally, the **time complexity** of an algorithm describes how the time the program needs to run to completion grows with the size of its input. It is most commonly expressed using big O notation, which is an asymptotic notation.
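To make the big O idea concrete, here is a minimal sketch of brute-force matching, where every query string is compared with every lookup string. The function name `brute_force_match` and the sample strings are made up for illustration, and `difflib.SequenceMatcher` from the standard library is only a stand-in for a pairwise fuzzy scorer; it is not the scorer used in my project.

```python
from difflib import SequenceMatcher


def brute_force_match(queries, lookup):
    """Compare every query with every lookup string.

    The number of comparisons is len(queries) * len(lookup),
    i.e. O(n * m), which is what makes this approach slow on large data.
    """
    comparisons = 0
    results = []
    for query in queries:
        best, best_score = None, -1.0
        for candidate in lookup:
            comparisons += 1
            score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = candidate, score
        results.append((query, best))
    return results, comparisons


# Hypothetical sample data, for illustration only.
queries = ["tech mahindra ltd", "bput"]
lookup = ["Tech Mahindra Limited", "BPUT", "CET Bhubaneswar"]

results, comparisons = brute_force_match(queries, lookup)
print(results)
print("comparisons:", comparisons)
```

With 2 queries and 3 lookup strings this does 6 comparisons. With 10,000 queries against 100,000 lookup strings the same loop would do a billion, which is why the pairwise approach struggles on large datasets.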

In my last blog, I did this work with the fuzzy-wuzzy package. The results were good, but the server took a long time, especially on a large dataset. So I searched for alternatives and found this amazing package.

**Installation:** `pip install tfidf-matcher`

Before finding matches in a dataset, we need to sort it, because matching is easier when the dataset is sorted. So our first task is to deal with the dataset.
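As a rough sketch of this preparation step, the snippet below tidies and sorts a column of names with pandas. The column name `name` and the sample values are assumptions for illustration; in practice the data would come from your own file, for example via `pd.read_csv`.

```python
import pandas as pd

# Hypothetical in-memory data standing in for a real dataset.
df = pd.DataFrame(
    {"name": ["  Tech Mahindra ", "BPUT", None, "BPUT", "CET Bhubaneswar"]}
)

names = (
    df["name"]
    .dropna()                 # drop missing values
    .str.strip()              # remove leading/trailing whitespace
    .drop_duplicates()        # keep each name once
    .sort_values()            # sort alphabetically
    .reset_index(drop=True)
)

lookup = names.tolist()
print(lookup)
```

The resulting `lookup` list is a plain Python list of strings, which is a convenient shape to hand to a matching step.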

- **Import your dataset using pandas.**
- Use *n-grams* for cleaning the strings and for making contiguous sequences of n items from each one.
- Make the items into a TF-IDF matrix by using *from sklearn.feature_extraction.text import TfidfVectorizer*.
- Fit a K-NearestNeighbours model to the sparse matrix.
- Vectorize the list of strings to be matched and pass it into the KNN model to calculate the cosine distance, by using *import tfidf_matcher as tm* and calling the matcher function with *tm.matcher()*.
- Match it with your lookup data.
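The steps above can be sketched directly with scikit-learn. This is a minimal illustration under my own assumptions, not the package's own code: the helper `ngrams`, the n-gram length of 3, and the sample strings are all made up for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def ngrams(text, n=3):
    """Clean a string and split it into overlapping character n-grams."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())   # keep letters, digits, spaces
    cleaned = re.sub(r"\s+", " ", cleaned).strip()      # collapse repeated spaces
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


# Hypothetical sample data, for illustration only.
lookup = [
    "Tech Mahindra Limited",
    "Biju Patnaik University of Technology",
    "College of Engineering and Technology",
]
queries = ["tech mahindra ltd", "biju patnaik univ. of technology"]

# Build the TF-IDF matrix from the n-grams of the lookup strings.
vectorizer = TfidfVectorizer(analyzer=ngrams)
lookup_matrix = vectorizer.fit_transform(lookup)

# Fit a nearest-neighbours model on the sparse matrix, using cosine distance.
knn = NearestNeighbors(n_neighbors=1, metric="cosine")
knn.fit(lookup_matrix)

# Vectorize the strings to be matched and query the model.
distances, indices = knn.kneighbors(vectorizer.transform(queries))

# Turn cosine distance into a similarity score (1 means identical n-gram profile).
matches = [
    (query, lookup[int(idx[0])], round(float(1 - dist[0]), 2))
    for query, idx, dist in zip(queries, indices, distances)
]

for query, match, confidence in matches:
    print(f"{query!r} -> {match!r} (confidence {confidence})")
```

The lookup strings are vectorized once and each query is then answered by a nearest-neighbour search over the sparse matrix, instead of scoring every pair of strings one by one in Python.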

In my work, I got 9859 matched rows from a dataset in a few seconds, so it is indeed very quick. The matches created with this method are really good, and the ratio gives us a way to check the matching percentage for each row. The biggest advantage is speed.

Brought to You by-

COE-AI (CET-BBSR): An initiative by CET-BBSR, Tech Mahindra, and BPUT to provide solutions to real-world problems through ML and IoT