What is Pandas?

Pandas is a library for processing many kinds of data, centered on a tabular data structure called DataFrame. Because a DataFrame resembles a database table, you can get started quickly if you know SQL. It is familiar to anyone who analyzes data with Python.

How was it introduced?

Much of this is hearsay, but I believe it came into the development team roughly as follows.

- Hybridization of on-premises and cloud progressed, and database storage became more and more distributed.
- Data flow management became an issue, so Luigi, which lets you build data flows in Python, was introduced.
- Initially, Luigi was supposed to be responsible mainly for input and output to database storage.
- Since the team's common language is Scala, the plan was to factor the logic out and implement it properly there.
- An environment was set up where Luigi could connect to each database storage.
- Simple transfer and report data flows were organized in Luigi.
- Because this was convenient, migration and modification progressed, processing such as filtering, joining, and aggregation gradually crept in, and Pandas came to be used as a matter of course.
- Before anyone noticed, some batch processing depended on Pandas, and we ran into various pitfalls.

Main pitfalls :scream:

Missing value problem

Since the missing value NaN is treated as a float, the moment a missing value gets into an int column, the entire column is cast to float. Corrupted type information tends to cause problems, especially when the data is written to a database.

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([0, 1, 2])
>>> s[2]
2
>>> s[1] = np.nan  # NaN is a float, so the whole column is upcast
>>> s[2]
2.0

http://pandas.pydata.org/pandas-docs/stable/missing_data.html
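As a hedged sketch only: the environment described here (Pandas 0.17) did not have it, but newer pandas releases (0.24 and later, as far as I know) provide a nullable integer dtype named "Int64". Assuming such a version is available, a column with a missing value can keep its integer type. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

# Default behavior: a missing value forces the whole column to float64.
s_float = pd.Series([0, np.nan, 2])

# Nullable integer dtype (assumes a pandas version that supports "Int64"):
# the missing entry is stored as pd.NA and the other values stay integers.
s_nullable = pd.Series([0, None, 2], dtype="Int64")

print(s_float.dtype)     # float64
print(s_nullable.dtype)  # Int64
```

Whether the database driver in use maps this dtype to an integer column is a separate question that would need checking against the actual write path.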

Reference problem

With just a little index manipulation, you can end up unsure whether you are holding a view or a copy (!?)

def do_something(df, value):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value       # We don't know whether this will modify df or not!
    return foo

http://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing

In this case, no amount of testing can guarantee quality. A warning may be emitted at runtime, but the only real remedy is to call the copy method explicitly wherever things look suspicious ...
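The explicit-copy workaround can be sketched roughly as follows. The function name select_and_extend and the sample column values are made up for illustration; only DataFrame.copy itself comes from the text above.

```python
import pandas as pd

def select_and_extend(df, value):
    # Explicit copy: foo owns its own data, so the assignment below
    # cannot modify the caller's df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo

df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = select_and_extend(df, 0)

print(list(df.columns))      # the original frame has no 'quux' column
print(list(result.columns))
```

The cost is an extra allocation on every call, which is part of why this feels like a workaround and not a fix.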

Sudden death

Looking at the logs of one batch, it dies about 1% of the time. Most of the failures are memory related, and core dumps pile up. It also hangs sometimes.

*** glibc detected *** /usr/local/anaconda/bin/python: free(): invalid pointer:

Fatal Python error: GC object already tracked

＿人人人人人人＿
＞ Sudden death ＜
￣Y^Y^Y^Y^Y￣

Since this is a Python 2.7 and Pandas 0.17 environment, updating might solve it ....

What to do in the future :thinking:

For new development going forward, the policy is to avoid using Pandas together with Luigi as much as possible. After all, Pandas is meant for analysis, and using it in batch processing was not a good idea ...

However, even for analytical purposes, I personally consider the reference problem fatal, so I will use Spark when I want a DataFrame in the future. Although Spark code can be written in statically typed Scala, note that compile-time checks do not cover the schema operations that matter most. There is also frameless, a library built on cats, but it is only a proof of concept.

By the way, Luigi assumes that each task is idempotent and produces a single output, so it may not suit every data flow you want to build. Also, Spotify, the developer of Luigi, appears to have moved to Google Cloud Dataflow and is developing scio, a Scala wrapper library ....

Scio - A Scala API for Google Cloud Dataflow & Apache Beam

Python Pandas is not suitable for batch processing