[Memo] Text matching in a pandas DataFrame using flashtext

Overview

--I adopted a library called flashtext for text matching on a large CSV file with pandas, for the following reasons.
--Its own algorithm makes keyword matching fast, as an alternative to regular expressions, even on large volumes of data.
--Depending on the scale of the data, Python's re module can be faster.
--The notation is simple.
--It offers a rich set of text-processing features.

--GitHub
--This memo covers only the basic usage.
--For the full API, check the official documentation.
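To illustrate why a dictionary-based matcher can stay fast as the keyword list grows, here is a minimal sketch of trie-based keyword matching in plain Python. It is a simplified illustration of the general idea, not flashtext's actual implementation; the names build_trie, extract, and is_word_char are hypothetical. The text is scanned once, and each position walks the trie instead of trying every keyword in turn. An equivalent re pattern is built at the end for comparison.

```python
import re

def build_trie(keyword_dict):
    """Build a character trie; the '_end_' key marks a complete keyword
    and stores the clean name (the dictionary key) to report for it."""
    trie = {}
    for clean_name, keywords in keyword_dict.items():
        for keyword in keywords:
            node = trie
            for ch in keyword.lower():
                node = node.setdefault(ch, {})
            node['_end_'] = clean_name
    return trie

def is_word_char(ch):
    # Characters that cannot sit directly next to a matched keyword
    return ch.isalnum() or ch == '_'

def extract(trie, text):
    """Scan the text once and return the clean names of keywords
    that appear as whole words (case-insensitive, longest match)."""
    text = text.lower()
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only start at the beginning of a word
        if i > 0 and is_word_char(text[i - 1]):
            i += 1
            continue
        node, j, match = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only if the keyword ends at a word boundary
            if '_end_' in node and (j == n or not is_word_char(text[j])):
                match = (node['_end_'], j)
        if match:
            found.append(match[0])
            i = match[1]
        else:
            i += 1
    return found

keyword_dict = {'front': ['html', 'css'], 'back': ['python']}
trie = build_trie(keyword_dict)
sample = "HTML, CSS and Python; not pythonic"
trie_hits = extract(trie, sample)

# The same search written with the re module, for comparison
all_keywords = [k for ks in keyword_dict.values() for k in ks]
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, all_keywords)) + r')\b',
                     re.IGNORECASE)
regex_hits = pattern.findall(sample)

print(trie_hits)
print(regex_hits)
```

With only a handful of keywords, as here, the compiled regular expression is likely to be at least as fast; the trie approach is expected to pay off as the keyword list grows, so it is worth timing both on your own data.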

Installation

--Install it with the following command.

pip install flashtext

Sample code

--The following code reads a sample CSV into a DataFrame and computes a simple count of keyword matches per row.

import pandas as pd
from flashtext import KeywordProcessor

# Keyword dictionary: each key is the name reported for any of its listed keywords
keyword_dict = {
'front': ['html', 'javascript','css'],
'back': ['php','python','ruby'],
'db': ['mysql','postgres','mongo']
}

# Initialize the keyword processor
keyword_processor = KeywordProcessor()

# Register the keywords
keyword_processor.add_keywords_from_dict(keyword_dict)

# Load the sample CSV
df = pd.read_csv("sample.csv")

# Count the keyword matches in each row and store the result in a new column.
# Example: match against the "contents" column of sample.csv.
df['all_count'] = df['contents'].apply(lambda x: len(keyword_processor.extract_keywords(x)))

# Show the first 3 rows
df.head(3)

Reference

--Documentation

https://medium.com/better-programming/using-pythons-flashtext-library-to-find-keywords-in-text-data-f6cdf9c018ee