The other day, I was asked to anonymize data at work and had a lot of trouble. Until then, most of the data I received had already been anonymized by the time it reached me, so I had almost never handled data that was not anonymized.
When I looked into it, I found that hashing is one way to do this, but there was little information that assumed data in data frame format. Since the data is confidential, you may have to work in an offline environment, so the code below assumes the libraries are already installed.
Reflecting on what went wrong, I would like to summarize the code here for next time.
In R, hashing uses the fastdigest library. You only need to specify the CSV to import and the names of the columns you want to hash. You can check the results and work interactively, one step at a time.
hashing.R
#Library installation
install.packages("fastdigest")
#Library import
library("fastdigest")
#Create your own hashing function using fastdigest
hash_algo <- function(data){
  x <- paste("abc", data)
  x <- fastdigest(x)
  return(x)
}
#Read data
path <- ""
df <- read.csv(path, header=TRUE, stringsAsFactors=FALSE)
#Confirmation of reading
str(df)
#Store the column names you want to hash (fill in the placeholders)
hash_list <- c("", "",...)
#Hashing (apply the hash_algo function defined above to each value)
for (col in hash_list){
  df[[col]] <- sapply(df[[col]], hash_algo, USE.NAMES=FALSE)
}
#Export (row.names=FALSE avoids writing an extra row-number column)
write.csv(df, "hashed.csv", row.names=FALSE)
In Python, hash using pandas and hashlib. Here I want to finish quickly with a script: decide the column names you want to hash in advance, then specify the path and run the code.
hashing.py
import os
import sys
import pandas as pd
import hashlib
#Store the column names you want to hash
hash_list = []
#Hashing rule function
def hash_algo(data):
    #The more complex this rule is, the harder the hash is to reverse
    x = "abc" + str(data)
    x = hashlib.sha256(x.encode("utf-8")).hexdigest()
    return x
#From reading to output
def hashing(path, hash_list=hash_list):
    #Read data
    df = pd.read_csv(path, encoding='utf-8')
    #Hashing process
    for col in hash_list:
        df[col] = df[col].map(hash_algo)
    #Export file next to the input as <name>_hashed.csv
    outpath = os.path.dirname(path)
    outfilename = os.path.splitext(os.path.basename(path))[0] + "_hashed.csv"
    #os.path.join also works when path has no directory part
    df.to_csv(os.path.join(outpath, outfilename), index=False)
if __name__ == "__main__" :
    hashing(sys.argv[1])
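If pandas is not available in the offline environment, roughly the same thing can probably be done with the standard library alone. The following is only a minimal sketch under that assumption: the names `SALT`, `hash_value` and `hash_columns`, and the sample column names, are made up for illustration, and the salt-prefix rule simply mirrors the `hash_algo` function above.

```python
import csv
import hashlib
import io

SALT = "abc"  # assumption: same fixed prefix as hash_algo above; choose your own


def hash_value(value, salt=SALT):
    """Return the SHA-256 hex digest of salt + value."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def hash_columns(src, dst, columns, salt=SALT):
    """Copy CSV rows from src to dst, hashing only the named columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = hash_value(row[col], salt)
        writer.writerow(row)


# Small in-memory demo; the data and column names are hypothetical.
src = io.StringIO("name,age\nAlice,30\nBob,25\nAlice,41\n")
dst = io.StringIO()
hash_columns(src, dst, ["name"])
rows = list(csv.DictReader(io.StringIO(dst.getvalue())))
```

For real files, pass objects opened with `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` buffers. Note that the csv module reads every value as a raw string, so the digests can differ from the pandas version for columns that pandas would parse as numbers (for example, values with leading zeros).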
Please let me know if there is a smarter way.