Python Advent Calendar 2020 Day 25
This is the serverless era, but I think there are still many systems that run batches on on-premises servers or cloud instances (IaaS).
This article assumes batches that run on a server. The first half summarizes points of batch design based on anti-patterns; the second half gives tips for batch development in Python.
**\* The content of this article is just one example of a way of thinking; not every idea here will fit every system.**
Batch processing is a processing method that executes a series of operations on a set of data without user interaction. The term goes back to the era of mainframe (general-purpose) computers.
To process data in batches, Unix-like systems commonly run jobs at a specified date and time with cron. A batch itself is also called a job. Large-scale systems manage many jobs, so a dedicated job management server with job scheduling software is often built to manage them.
In principle, depending on the system, you could operate manually without batch processing. In reality, though, batches are indispensable for meeting cost, processing time, and reliability requirements. There are also cases where the need was not recognized during requirement definition but emerged after the system went into operation.
Batch design therefore requires taking a bird's-eye view of the entire system and keeping in mind the basic points and anti-patterns described below.
-[x] Give variables easy-to-understand names (avoid `x`, `i`, etc.) to improve maintainability
-[x] Keep each method simple; don't combine multiple processes in one
-[x] Separate settings that change per environment (e.g. DB connections) from the executable, using a config file
-[x] Factor general-purpose processing such as logging into util modules as needed
-[x] When scheduling jobs with cron etc., design with batch overrun (a job running past its time window) in mind
-[x] Create created_at and updated_at columns when registering or updating DB records
-[x] Choose the commit interval within a transaction according to the data volume, weighing throughput against rollback cost
-[x] Build with reruns in mind (keep the recovery procedure simple)
Below are batch-processing anti-patterns I have run into each time I suddenly took over the operation of a system.
--Log files that are never rotated
A batch written in Python logs via the logging module. However, because no log rotation is done on the program side, output keeps accumulating in the same log file forever.
**If you know Linux's log rotation tooling such as logrotate, you can solve this with OS-side configuration alone, without implementing anything in the program.**
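Alternatively, if you do want to handle rotation inside the program, the standard library's `logging.handlers.RotatingFileHandler` can rotate by size. A minimal sketch (the file name `batch.log`, size limit, and format string are illustrative):

```python
import logging
from logging.handlers import RotatingFileHandler

# Rotate once the file reaches ~1 MB, keeping 5 old files
# (batch.log.1 .. batch.log.5).
handler = RotatingFileHandler("batch.log", maxBytes=1_000_000, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("batch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("batch started")
```

With this in place, old logs age out automatically instead of one file growing without bound.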
--Multiple copies of the same config
The directory structure below holds batches a, b, and c. Each batch has a different purpose, but the DB settings used for data linkage are identical. The URL of the webhook notified when an exception such as a program error occurs is also written in every config, and they are all the same.
.
|-- a
| `-- config
|-- b
| `-- config
`-- c
`-- config
**For example, when a server migration changes the webhook URL, you have to rewrite every one of them.**
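A simple remedy is to factor the shared settings into a single module that every batch imports. A minimal sketch, with illustrative names (`common/config.py`, `WEBHOOK_URL`):

```python
# common/config.py -- single source of truth shared by batches a, b, c
DB_HOST = "db.example.com"
DB_PORT = 5432
WEBHOOK_URL = "https://hooks.example.com/notify"  # change it in one place only
```

Each batch then does `from common.config import DB_HOST, WEBHOOK_URL`, and a migration means editing one file.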
--Batch overrun
There was a batch that starts and stops an instance. One day, starting and stopping took longer than usual, probably because AWS as a whole was under load. The preceding batch therefore ran long, overlapped with the following batch, and the batch processing failed.
**The batch design was insufficient: it did not account for overrun caused by unexpected system anomalies.**
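One common guard against overrun is a lock file: if the previous run still holds it, the new run skips instead of overlapping. A minimal single-host sketch (the file name `my_batch.lock` is illustrative; job management software handles this more robustly):

```python
import os

LOCK_FILE = "my_batch.lock"  # example path; use a fixed, writable location

def acquire_lock() -> bool:
    """Create the lock file atomically; fail if a previous run still holds it."""
    try:
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release_lock() -> None:
    os.remove(LOCK_FILE)

if acquire_lock():
    try:
        pass  # batch processing goes here
    finally:
        release_lock()  # always release, even if processing raised
else:
    print("previous run still in progress; skipping this cycle")
```

Skipping (or alerting) on overlap is usually safer than letting two runs mutate the same data at once.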
--A DB where you can't tell when updates happened
We had to investigate a DB that was updated by batch processing, but the investigation was difficult because update times were not recorded. When a batch registers or updates rows, adding created_at and updated_at columns to the schema makes failure investigation much easier.
**If you can't tell when data was registered or updated, operability and maintainability drop sharply.**
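As a sketch of the idea, here is the created_at/updated_at pattern with the standard `sqlite3` module (the table, columns, and values are illustrative):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE users (
           id INTEGER PRIMARY KEY,
           name TEXT NOT NULL,
           created_at TEXT NOT NULL,
           updated_at TEXT NOT NULL
       )"""
)

# On insert, set both timestamps.
now = datetime.now(timezone.utc).isoformat()
conn.execute(
    "INSERT INTO users (name, created_at, updated_at) VALUES (?, ?, ?)",
    ("alice", now, now),
)

# On update, touch updated_at so an investigation can tell when the row changed.
later = datetime.now(timezone.utc).isoformat()
conn.execute(
    "UPDATE users SET name = ?, updated_at = ? WHERE name = ?",
    ("alice2", later, "alice"),
)
conn.commit()
```

Many databases can also populate these columns automatically with defaults or triggers, which removes the risk of a code path forgetting to set them.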
--A batch that notifies about everything
Notifying info messages for events nobody needs to confirm, and notifying every error caused by a temporary communication or connection failure, is meaningless to whoever takes over the operation. Worse, unnecessary error notifications become the boy who cried wolf: once they are routinely ignored, they actively harm system operation.
**Notify only errors that affect the service, or messages the operator actually needs to confirm.**
--A batch designed without extensibility
Batch processing time affects the service, but when the data volume grows after launch, sequential processing hits a limit unless parallel processing and extensibility were taken into consideration.
**If tuning can absorb the growth there is no problem, but a design that ignores extensibility has a large impact later.**
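As one illustration, `concurrent.futures` lets a batch move from sequential to parallel processing with little code change; `process_record` and the worker count here are illustrative placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def process_record(record: int) -> int:
    # Placeholder for per-record work (e.g. transform + DB write).
    return record * 2

records = list(range(100))

# max_workers is a tuning knob: the same code scales by raising it,
# without restructuring the batch.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_record, records))
```

For CPU-bound work, `ProcessPoolExecutor` is the analogous choice; the point is that designing around an independent per-record function keeps the door open for parallelism later.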
--Unnecessary libraries installed with pip
A server migration occurred, and pip freeze was run on the source server to create requirements.txt. Installing from that requirements.txt on the new server raised errors about libraries that were never actually used.
**Do not install unnecessary libraries.**
When the operator in charge of the system changes through a handover, **document maintenance** becomes all the more important for keeping a person-dependent system running.
Suppose the system is supposed to send alerts to notify the operator when something goes wrong. If those alerts are not set up properly when you suddenly take over operation, you cannot even tell that a failure happened, let alone triage whether a message is info or error.
If there is an error, you look for the batch's log file. But since you don't know where it is written, you investigate using whatever clues you have. The scary part is that sometimes the batch writes no log at all.
When the person in charge of system operation changes, it is desirable to have documents that give at least a whole-system understanding, such as a system-wide batch schedule and a batch list. The more data linkage there is, the more caution is required.
There are various ways to develop in Python; the following is one example of improving development efficiency.
Creating a Python image with Docker in your local environment and mounting the directory containing the source will improve development efficiency.
First, create a Dockerfile and build it. Next, start a container with the docker run command, mounting the directory containing the source.
--Creating a Dockerfile
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "python", "./your-daemon-or-script.py" ]
--Build
$ docker build -t python3/test .
--Starting the container
$ docker run -v <source directory>:/batch -it python3/test /bin/bash
**After that, you can edit the program files stored in the mounted source directory with your editor.**
Tips Here are some tips for batch development in Python.
config If you have multiple environments such as development and production, creating a file such as config.py and importing it keeps the settings from becoming complicated.
sys.argv is the list of command-line arguments passed to the Python script: argv[0] is the script name, and argv[1] holds the first argument. Using sys.argv, you can select the environment as follows; checking the argument's value is a common approach.
import sys

args = sys.argv
env = args[1]
if env == 'local':
    pass  # load local settings here
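Building on this, a config.py can map the argument to one settings block per environment. A sketch in which the module layout and keys (`db_host`, `db_name`) are illustrative:

```python
# config.py (illustrative): one settings dict per environment
SETTINGS = {
    "local": {"db_host": "localhost", "db_name": "app_dev"},
    "prod":  {"db_host": "db.example.com", "db_name": "app"},
}

def load(env: str) -> dict:
    """Return the settings for the named environment, or fail loudly."""
    if env not in SETTINGS:
        raise ValueError(f"unknown environment: {env}")
    return SETTINGS[env]
```

The batch would then call something like `settings = config.load(sys.argv[1])`, and an unknown environment name fails immediately instead of running with the wrong DB.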
I once wrote a batch that outputs a log, and it errored because the log file did not exist.
You can prevent this by adding a step that creates the log file when it is missing.
import os

log_file = config.base_dir + 'log/batch.log'
if not os.path.exists(log_file):
    with open(log_file, 'w') as f:
        f.write('')
If an item is not in the exclude list (exclude_list), add it to the list to be processed.
if item_id not in exclude_list:
    stock_list.append({"item_id": item_id})
Extract the DB result (result_set) from the dictionary and add it to the list.
for row in result_set:
    row_dict = {"id": row[0], "name": row[1], "age": row[2]}
    target_list.append(row_dict)
With parallel lists, the elements can fall out of alignment, so it is generally safer to use a dictionary.
for (z, x, y) in zip(list1, list2, list3):
    temp_list.append([z, x, y])
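As the text above suggests, keying related values by id avoids depending on three lists staying aligned. A small sketch with illustrative data:

```python
# Parallel lists can silently mis-align if one is filtered or re-sorted;
# a dict per record keeps related fields together under one key.
ids = [1, 2, 3]
names = ["a", "b", "c"]
ages = [10, 20, 30]

records = {
    item_id: {"name": name, "age": age}
    for item_id, name, age in zip(ids, names, ages)
}
```

Looking up `records[2]` now returns the whole record, and dropping or reordering one entry cannot shift the others' fields.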
uuid Generating UUIDs is easy with the standard library.
import uuid

def make_sys_id():
    return str(uuid.uuid4())
#Execution example
>>> make_sys_id()
'ac441afe-fc2d-4ebb-a9cf-18a49c77ec71'
hash Hash with MD5 to obtain a hash value (fine for fingerprinting, but MD5 should not be used for security purposes).
import hashlib
serialized = 'hoge'
md5 = hashlib.md5(serialized.encode('utf-8')).hexdigest()
#Execution example
>>> print(md5)
ea703e7aa1efda0064eaa507d9e8ab7e
An example of date handling.
import datetime
from dateutil.relativedelta import relativedelta
#Today's date
today_tmp = datetime.date.today()
today = today_tmp.strftime('%Y%m%d')
>>> print(today)
20201225
#Tomorrow date
tomorrow_tmp = today_tmp + datetime.timedelta(days=1)
#Yesterday date
yesterday_tmp = today_tmp - datetime.timedelta(days=1)
#Tomorrow's date one month ago
one_month_before = tomorrow_tmp - relativedelta(months=1)
one_month_before = one_month_before.strftime('%Y%m%d')
>>> print(one_month_before)
20201126
#Yesterday's date one month later
one_month_later = yesterday_tmp + relativedelta(months=1)
one_month_later = one_month_later.strftime('%Y%m%d')
>>> print(one_month_later)
20210124
When you want to set the batch's exit code according to the processing result in a try/except. There are other ways to return an exit code as well.
import os

# Example
try:
    # processing goes here (example: DB registration)
    os._exit(0)   # normal termination
except Exception:
    os._exit(99)  # abnormal termination
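As one of those other ways, sys.exit is generally preferable to os._exit: it raises SystemExit, so finally blocks and atexit handlers still run, whereas os._exit terminates the process immediately. A sketch (the SystemExit is caught here only to show the code that cron or a job scheduler would observe):

```python
import sys

def main() -> int:
    try:
        # processing goes here (example: DB registration)
        return 0   # normal termination
    except Exception:
        return 99  # abnormal termination

try:
    # In a real batch this would simply be the entry point: sys.exit(main())
    sys.exit(main())
except SystemExit as exc:
    exit_code = exc.code  # the value the shell / scheduler sees as $?
```

Returning the code from main() also makes the batch's success path directly unit-testable.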
Create your own exception class and raise it. This is indispensable for try/except handling in batch processing.
import logging
import traceback

class BatchError(Exception):
    def __init__(self, m):
        self.message = m

    def __str__(self):
        return self.message

# Example
try:
    pass  # processing goes here (example: DB connection)
except Exception:
    e = traceback.format_exc()
    logging.error(e)
    logging.error('Processing will end because it cannot connect to the DB')
    raise BatchError("DB connection failure")
An example of how to debug. My personal recommendation is the pysnooper library.
pysnooper
import pysnooper
It's easy to use: decorate the function you want to debug with @pysnooper.snoop(). When you run the batch, details such as the contents of variables are output.
pprint pprint is useful when you want to view JSON-like nested data in a readable layout.
from pprint import pprint
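For example, pprint wraps a nested structure across lines once it exceeds the given width (the data below is illustrative):

```python
from pprint import pprint

data = {"items": [{"id": 1, "name": "apple"}, {"id": 2, "name": "pear"}],
        "total": 2}

# Unlike print(), pprint wraps and indents nested containers
# once the output would exceed `width` characters.
pprint(data, width=40)
```

This is handy when eyeballing API responses or DB rows during batch debugging.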
Depending on the OS environment, Japanese output may fail in Python 3 when the console encoding is ANSI (e.g. cp932 on Japanese Windows). Re-wrapping stdout fixes it:
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
As you learn Python, note that most introductory books don't cover the modern ways of writing it. To improve at Python, it is essential to catch up with new information yourself.
f-strings (formatted string literals) were added in Python 3.6. Substitutions that previously required the format() method can now be written directly inside string literals.
>>> word = "WORLD"
>>> f'HELLO {word}'
'HELLO WORLD'
>>> from datetime import datetime
>>> today = datetime(year=2020, month=5, day=6)
>>> f"{today:%B %d, %Y}"
'May 06, 2020'
Python is a dynamically typed language, but type annotations have been available since Python 3.5. The following code raises an error because, by the nature of Python, mismatched types cannot be combined.
>>> def test(word):
... return 'Hello' + word
...
>>> test(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in test
TypeError: can only concatenate str (not "int") to str
You can improve the maintainability of your code with type annotations. In the following case, the parameter name and the return value are both expected to be str. Note that these are only annotations: no error checking is performed at runtime (a static checker such as mypy is needed for that).
>>> def greeting(name: str) -> str:
... return 'Hello' + name
...
>>> greeting("apple")
'Hello apple'
dataclasses dataclasses was added in Python 3.7. It provides a decorator and functions that automatically add special methods to user-defined classes, so you no longer need to write the conventional __init__ boilerplate.
>>> class Animal:
... def __init__(self, type, age, name):
... self.type = type
... self.age = age
... self.name = name
...
>>> cat = Animal("cat", 0, "Tulle")
>>> print(cat.type, cat.age, cat.name)
cat 0 Tulle
You can easily do the same with dataclasses.
>>> from dataclasses import dataclass
>>> @dataclass
... class Animal:
... type: str
... age: int
... name: str
...
>>> cat = Animal("cat", 0, "Tulle")
>>> print(cat)
Animal(type='cat', age=0, name='Tulle')
In the coming era, systems will be built on container technology, so the approach to batch design will change as well.
However, no matter how much advanced technology such as serverless you use, if operation is not considered, only problems and technical debt remain.
Technology is just a means. What matters is designing appropriate batches so that the service can continue, as a business, without disruption.