The other day, I was asked to anonymize data at work and had a lot of trouble. Until then, most of the data I received had already been anonymized, so I had almost never handled data that was not.

When I looked it up, I found that hashing is one way to do this, but there was little information that assumed data in data frame format. Because the data is confidential, the work may have to be done in an offline environment, so this article assumes the libraries are already installed.

Here I summarize the code for next time, taking into account what I learned along the way.

How to do it in R

Hashing is done with the fastdigest library. You only need to specify the CSV file to import and the names of the columns you want to hash. Because R is interactive, you can check the result of each step as you work.

`hashing.R`


# Install the library (skip this if it is already installed, e.g. offline)
install.packages("fastdigest")
# Load the library
library("fastdigest")

# Hashing function built on fastdigest: prepend a fixed string, then hash
hash_algo <- function(data){
    x <- paste("abc", data)
    x <- fastdigest(x)
    return(x)
}

# Read the data
path <- ""
df <- read.csv(path, header = TRUE)
# Check that the data was read correctly
str(df)

# Names of the columns you want to hash
hash_list <- c("", "",...)

# Hash each listed column with hash_algo (the function defined above)
for (col in hash_list){
    df[, col] <- sapply(df[, col], hash_algo, USE.NAMES = FALSE)
}

# Export
write.csv(df, "hashed.csv")

How to do it in Python

Hashing uses pandas and hashlib. In Python, I wanted to get the job done quickly with a script. Decide in advance which column names you want to hash, then pass the path of the CSV file and run the code.

`hashing.py`


import os
import sys
import pandas as pd
import hashlib

# Names of the columns you want to hash
hash_list = []

# Hashing rule
def hash_algo(data):
    # The more complex this rule is, the harder it is to reverse the hash
    x = "abc" + str(data)
    x = hashlib.sha256(x.encode("utf-8")).hexdigest()
    return x

# Read, hash, and export
def hashing(path, hash_list=hash_list):
    # Read the data
    df = pd.read_csv(path, encoding='utf-8')

    # Hash each listed column
    for col in hash_list:
        df[col] = df[col].map(hash_algo)

    # Export next to the input file as <name>_hashed.csv
    outpath = os.path.dirname(path)
    outfilename = os.path.splitext(os.path.basename(path))[0] + "_hashed.csv"
    # os.path.join also works when the path has no directory part
    df.to_csv(os.path.join(outpath, outfilename), index=False)

if __name__ == "__main__":
    hashing(sys.argv[1])
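
Since the work may have to happen offline, pandas might not always be available. As a rough sketch under that assumption, the same hashing rule can be applied with only the standard library `csv` module. The function name `hash_csv` and the sample columns `name` and `age` are made up for illustration. Values are read as plain strings here, so the digests may differ from the pandas version for columns that pandas parses as numbers or missing values.

```python
import csv
import hashlib
import io


def hash_algo(data):
    # Same rule as hashing.py: fixed prefix "abc" + value, then SHA-256
    x = "abc" + str(data)
    return hashlib.sha256(x.encode("utf-8")).hexdigest()


def hash_csv(infile, outfile, hash_list):
    """Copy CSV rows from infile to outfile, hashing the columns in hash_list."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in hash_list:
            row[col] = hash_algo(row[col])
        writer.writerow(row)


# Hypothetical sample data; the column names are assumptions for this demo
src = io.StringIO("name,age\nalice,30\nbob,25\nalice,30\n")
dst = io.StringIO()

# Hash only the "name" column; "age" is copied unchanged
hash_csv(src, dst, ["name"])
result = dst.getvalue()
print(result)
```

Identical input values produce identical digests, so the hashed column can still be used for grouping or joining. For real files, `hash_csv` would be given file objects opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory buffers.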

Please let me know if there is a smarter way.
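
One possible variation, offered as a sketch and not as the method used above: instead of prepending a fixed string such as "abc", each value can be hashed with a secret key using Python's standard `hmac` module. The function name `keyed_hash` and the key values below are hypothetical and only serve the example.

```python
import hashlib
import hmac


def keyed_hash(data, key):
    # HMAC-SHA256 of the value, keyed with a secret key given as bytes
    return hmac.new(key, str(data).encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; a real key should be kept secret
secret_key = b"replace-with-a-secret-key"

a = keyed_hash("alice", secret_key)
b = keyed_hash("alice", secret_key)
c = keyed_hash("alice", b"another-key")

print(a == b)  # same value and same key give the same digest
print(a == c)  # a different key gives a different digest
```

With this approach, someone who does not know the key cannot simply hash candidate values and compare them against the output. In the script above it could be applied per column with something like `df[col].map(lambda v: keyed_hash(v, secret_key))`.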
