Hashing data in R and Python

The other day I was asked to anonymize some data at work and struggled with it. Almost all the data I receive has already been anonymized by the time it reaches me, so I had hardly ever had to do it myself.

When I looked into it, hashing turned out to be a common approach, but I found little information that assumed the data was in data frame form. Because the data is confidential, the work may have to be done in an offline environment, so the code below assumes the required libraries are already installed.
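
To make the idea concrete before the full scripts, here is a minimal sketch of what hashing a value means in this context. It uses only Python's standard hashlib; the `pseudonymize` name, the `SALT` constant, and the sample names are made up for illustration.

```python
import hashlib

SALT = "abc"  # hypothetical fixed prefix; choose your own and keep it private


def pseudonymize(value):
    """Return the SHA-256 hex digest of the salted string form of value."""
    return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()


customers = ["alice", "bob", "alice"]  # made-up sample values
hashed = [pseudonymize(name) for name in customers]

# Equal inputs map to equal digests, so grouping and joining still work.
print(hashed[0] == hashed[2])  # True
print(hashed[0] == hashed[1])  # False
print(len(hashed[0]))          # 64 (hex characters)
```

The original values no longer appear in the output, but rows that shared a value before hashing still share one afterwards.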

With those lessons in mind, I am summarizing the code here so that it is ready for next time.

How to do it in R

Hashing is done with the fastdigest library. All you need to specify is the CSV file to import and the names of the columns you want to hash. In R you can run the steps one at a time interactively and check the result as you go.

hashing.R


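# Install the library (skip this if it is already installed, e.g. offline)
install.packages("fastdigest")
# Load the library
library("fastdigest")

# Custom hashing function built on fastdigest
hash_algo <- function(data) {
    # Prepend a fixed string before hashing
    x <- paste("abc", data)
    x <- fastdigest(x)
    return(x)
}

# Read the data
path <- ""
df <- read.csv(path, header = TRUE)
# Check what was read
str(df)

# Names of the columns to hash (replace the placeholders with real names)
hash_list <- c("", "",...)

# Hash each listed column with hash_algo
# USE.NAMES = FALSE keeps the original values from being attached as names
for (col in hash_list) {
    df[, col] <- sapply(df[, col], hash_algo, USE.NAMES = FALSE)
}

# Export without the row-number column
write.csv(df, "hashed.csv", row.names = FALSE)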

How to do it in Python

Hashing is done with pandas and hashlib. In Python I wanted to get the job done quickly with a script: decide in advance which columns to hash, pass the file path as an argument, and run the code.

hashing.py


import os
import sys
import hashlib

import pandas as pd

# Names of the columns to hash
hash_list = []

# Hashing rule
def hash_algo(data):
    # The more elaborate the rule here, the harder it is to reverse the hash
    x = "abc" + str(data)
    x = hashlib.sha256(x.encode("utf-8")).hexdigest()
    return x

# From reading the input to writing the output
def hashing(path, hash_list=hash_list):
    # Read the data
    df = pd.read_csv(path, encoding="utf-8")

    # Hash each listed column
    for col in hash_list:
        df[col] = df[col].map(hash_algo)

    # Write <input name>_hashed.csv next to the input file
    outpath = os.path.dirname(path)
    outfilename = os.path.splitext(os.path.basename(path))[0] + "_hashed.csv"
    # os.path.join also works when path has no directory part
    df.to_csv(os.path.join(outpath, outfilename), index=False)

if __name__ == "__main__":
    hashing(sys.argv[1])
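
If pandas happens not to be installed in the offline environment, roughly the same job can be sketched with the standard library alone. This is only an illustration under that assumption: `hash_value`, `hash_csv`, and the in-memory sample data are hypothetical names and values, not part of the script above.

```python
import csv
import hashlib
import io


def hash_value(value, salt="abc"):
    """Salted SHA-256 hex digest, following the same rule as hash_algo."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def hash_csv(src, dst, columns):
    """Copy CSV rows from src to dst, hashing the named columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = hash_value(row[col])
        writer.writerow(row)


# In-memory demo with made-up data; real use would pass opened files.
src = io.StringIO("id,name\n1,alice\n2,bob\n")
dst = io.StringIO()
hash_csv(src, dst, ["name"])
print(dst.getvalue())
```

Note that the csv module hashes each value as the raw text in the file, whereas pandas may parse a column as numbers first, so the two approaches are not guaranteed to produce identical digests for numeric columns.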

Please let me know if there is a smarter way.
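
One variation that may be worth considering: a short fixed prefix such as "abc" is easy to guess, and values with few possibilities (names, phone numbers) can then be recovered by hashing every candidate. A keyed hash (HMAC) with a secret stored separately from the data makes that harder. The sketch below uses the standard hmac module; `SECRET_KEY` and `keyed_hash` are hypothetical names, and how the key is stored is left open.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from somewhere outside the data set.
SECRET_KEY = b"replace-with-a-long-random-secret"


def keyed_hash(value, key=SECRET_KEY):
    """HMAC-SHA256 hex digest of the string form of value."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


# Same key and value give the same digest; a different key gives another one.
print(keyed_hash("alice") == keyed_hash("alice"))                # True
print(keyed_hash("alice") == keyed_hash("alice", key=b"other"))  # False
```

`keyed_hash` could be dropped in wherever hash_algo is applied to a column.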
