--I adopted a library called flashtext for text matching on a large CSV file with pandas, for the following reasons.
  --Its own algorithm keeps keyword matching, a job often done with regular expressions, fast even on large volumes of data.
  --Depending on the data size and the number of keywords, Python's re module can be faster.
  --The API is simple.
  --It supports a range of text-processing operations.
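To give a feel for why dictionary-style matching can stay fast as the keyword list grows, here is a minimal pure-Python sketch of a trie lookup. flashtext's documentation describes its approach as trie-based, but this is only a simplified illustration and not the library's actual implementation: the names `build_trie` and `extract` are hypothetical, and tokens are split on whitespace only, so punctuation is not handled.

```python
def build_trie(keyword_dict):
    """Build a nested-dict trie mapping each keyword to its category name."""
    trie = {}
    for category, words in keyword_dict.items():
        for word in words:
            node = trie
            for ch in word.lower():
                node = node.setdefault(ch, {})
            node["_end_"] = category  # marks the end of a complete keyword
    return trie


def extract(text, trie):
    """Return the category of every whole-word keyword found in text."""
    found = []
    for token in text.lower().split():
        node = trie
        for ch in token:
            node = node.get(ch)
            if node is None:  # no keyword continues with this character
                break
        else:
            # The whole token was consumed; it counts only if a keyword ends here.
            if "_end_" in node:
                found.append(node["_end_"])
    return found


trie = build_trie({"back": ["php", "python", "ruby"], "db": ["mysql", "mongo"]})
result = extract("I use Python and MySQL daily", trie)
print(result)
```

Each token is checked by walking the trie one character at a time, so the cost per token depends on the token's length rather than on how many keywords are registered.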
--Install it with the following command.
pip install flashtext
--The code below reads a sample CSV into a DataFrame and adds a simple keyword count.
import pandas as pd
from flashtext import KeywordProcessor
# Keywords to match, grouped by category
keyword_dict = {
'front': ['html', 'javascript','css'],
'back': ['php','python','ruby'],
'db': ['mysql','postgres','mongo']
}
# Initialize the keyword processor
keyword_processor = KeywordProcessor()
# Register the keywords
keyword_processor.add_keywords_from_dict(keyword_dict)
# Load the sample CSV
df = pd.read_csv("sample.csv")
# Count the matches and add a column holding the total number of keyword hits per row.
# Example: match against the data in the "contents" column of sample.csv.
df['all_count'] = df['contents'].apply(lambda x: len(keyword_processor.extract_keywords(x)))
# Show the first 3 rows
df.head(3)