[Memo] Text matching in pandas data frame using flashtext

Overview

--When doing text matching on a large CSV with pandas, I adopted a library called flashtext for the following reasons.
  --Its own algorithm makes keyword matching fast even on large volumes of data. (It matches fixed keywords; it is not a regular expression engine.)
  --Depending on the data volume and the number of keywords, Python's re module can still be faster.
  --The notation is simple.
  --It offers a good range of text processing features.
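
flashtext's speed is usually attributed to a trie-based keyword lookup rather than a regular expression engine. The following is a minimal pure-Python sketch of that idea, not flashtext's actual code: the names build_trie, is_word_char, extract and the "_end_" marker are hypothetical and exist only for illustration.

```python
def build_trie(keyword_dict):
    """Build a character trie; a terminal node stores its category under "_end_"."""
    trie = {}
    for category, words in keyword_dict.items():
        for word in words:
            node = trie
            for ch in word.lower():
                node = node.setdefault(ch, {})
            node["_end_"] = category
    return trie


def is_word_char(ch):
    """Characters that count as part of a word (assumed: letters, digits, underscore)."""
    return ch.isalnum() or ch == "_"


def extract(trie, text):
    """Return the categories of keywords that appear in text as whole words."""
    text = text.lower()
    found = []
    i, n = 0, len(text)
    while i < n:
        # Only start matching at the first character of a word.
        if not is_word_char(text[i]) or (i > 0 and is_word_char(text[i - 1])):
            i += 1
            continue
        node, j = trie, i
        match_end, match_cat = None, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only if the keyword ends exactly at a word boundary.
            if "_end_" in node and (j == n or not is_word_char(text[j])):
                match_end, match_cat = j, node["_end_"]
        if match_cat is not None:
            found.append(match_cat)
            i = match_end
        else:
            i += 1
    return found


trie = build_trie({
    "back": ["php", "python", "ruby"],
    "db": ["mysql", "postgres", "mongo"],
})
result = extract(trie, "Python and MySQL, not pythonic")
print(result)
```

Because the text is walked once and each step is a dictionary lookup, the cost in this sketch depends mainly on the length of the text and not on how many keywords are registered. That is the usual explanation for why this kind of matcher pulls ahead of a large regex alternation as the keyword list grows.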

Installation

--Install it with the following command.

pip install flashtext

Sample code

--The following reads a sample CSV into a DataFrame and computes a simple keyword count.

import pandas as pd
from flashtext import KeywordProcessor

# Keywords to match, grouped by category.
# extract_keywords() returns the category name (the dict key) for each hit.
keyword_dict = {
    'front': ['html', 'javascript', 'css'],
    'back': ['php', 'python', 'ruby'],
    'db': ['mysql', 'postgres', 'mongo']
}

# Initialize the processor (case-insensitive by default)
keyword_processor = KeywordProcessor()

# Register the keywords
keyword_processor.add_keywords_from_dict(keyword_dict)

# Load the sample CSV
df = pd.read_csv("sample.csv")

# Count the matches and store the result in a new column.
# Example: match against each value in the "contents" column of sample.csv.
df['all_count'] = df['contents'].apply(lambda x: len(keyword_processor.extract_keywords(x)))

# Show the first 3 rows
df.head(3)


Reference

--Documentation
