Format the data for analysis

In reality, people's attention and interest in data preparation, calculation, and visualization is about 1: 4: 5, while the ratio in practice is about 6: 2: 2. This is just an experience, but is it safe to say that most of the analysis is a preparation for setting up the data in a computable state? That's why today I'll write a series of techniques that I often use in Python, which emphasizes usability as a glue language.

Loading and unloading JSON objects

The content isn't particularly new, but JSON format conversion is a common pre-process. I will leave it as a memo so that I will not forget it.

In Python, JSON-formatted data structures can be handled by import json. When loaded, it becomes a dictionary (hash in other languages) data format, which can also be in JSON format.

Loading JSON data from a CSV file

Assume a CSV file separated by key and value. Suppose that the value part contains text data in JSON format serialized by another system. To load this, you can split the string by specifying the delimiter as follows, and load the JSON as a dictionary type object with the json.load method.

import json

file = open(self.filename, 'r')
for line in file:
    key, value = line.rstrip().split(",")
    dic = json.loads(value)

Write a JSON object to a file

On the other hand, when writing a dictionary type object to a file or standard output in JSON format, use json.dumps as follows.

import json

json_obj = json.dumps(dic)
print(json_obj)

It's easy.

Use of arguments and instance variables

The arguments passed to the Python script are stored in sys.argv.

import sys

if __name__ == '__main__':
    argsmin = 1
    if len(sys.argv) > argsmin:
        some_instance = SomeClass(sys.argv)
        some_instance.some_method()

If you pass it when initializing the instance, you can use it by storing the argument in the instance variable.

class SomeClass:
    def __init__(self, args):
        self.filename = args[1]

Private method

Python customarily assumes that a method is a private method by prefixing it with a _.

    def self._some_method(self):
        ...

Sure, you can't call it some_instance.some_method. But you can actually call it explicitly with some_instance._some_method. It's habitual and not functionally private.

Use two __ to prevent it from being called functionally.

    def self.__some_method(self):
        ...

However, even with this, there is a tricky way to call it, and the testability of the private method will be reduced, so I do not recommend it very much. The fact that you can call a private method, in other words, makes it easier when testing.

Testing framework

There are many different Python testing frameworks, but nose is relatively easy to use. First, consider a method that behaves as follows.

import factorial
factorial.factorial(10)
#=> 3628800

Code to be tested

Let's say this Python code is named factorial.py and the method implementation looks like this.

def factorial(n):
    if n==1:
        return 1
    else:
        return n * factorial(n-1)

Test code

To test this, create a file named test_factorial.py and write the test code as follows:

from nose.tools import * #Loading the testing framework
from factorial import *  #Loading code under test

def test_factorial(): #Factorial method test
    i=10
    e=3628800
    eq_(e,factorial(i)) #Verification

The above is equivalent to doing eq_ (3628800, factorial (10)) The eq_ method verifies that the values are equal.

Run the test

After implementing the test code, issue the nosetests command from the shell.

$ nosetests
.
----------------------------------------------------------------------
Ran 1 test in 0.008s

OK

Summary

It's easy to spend a lot of time preparing steady data. This time, I have summarized the techniques often used in such work as a memorandum.

Detailed Python techniques required for data shaping (1)