I'm currently analyzing Apache logs for my research. I had been using a module called apache-log-parser, but it suddenly occurred to me to question whether I really needed it, so I looked into it. When all I want from each line is the IP address, which is faster: a regular expression, or parsing the line with the module? I was curious, so I measured it.
The input is one day's worth of log (about 45MB, 184087 lines). This time, only the IP address is extracted and printed.
The first is the regular expression method.
sample_regex.py
# coding: utf-8
# A program that checks which is faster, IP address search using regular expressions or modules
import os
import re
import sys
import time

if __name__ == "__main__":
    start = time.time()
    # open() does not expand "~", so expand it explicitly
    log_dir = os.path.expanduser("~/apache_log_analysis/log_data/")
    # Dotted-quad IPv4 address, each octet in the range 0-255
    re_ip_addr = re.compile(r"((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))")
    with open(os.path.join(log_dir, sys.argv[1])) as f:
        for line in f.readlines():
            # search() returns None when the line contains no IP address
            ip_addr = re_ip_addr.search(line)
            if ip_addr is not None:
                print(ip_addr.group())
            else:
                print("no IP address found")
    elapsed_time = time.time() - start
    print("elapsed_time:{0}[sec]".format(elapsed_time))
    print("exit")
The result was 2.10073304176 [sec]! After running it several more times, most runs came in at about 0.9 [sec], which also seems to be the fastest it gets.
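Since the timing varies from run to run, one option is to repeat the measurement and keep the fastest run. Below is a minimal sketch using the standard `timeit` module. The sample lines and the helper name `extract_ips` are made up for illustration and are not taken from my real log, so the number it prints is not comparable to the results above.

```python
import re
import timeit

# Same IPv4 pattern as in sample_regex.py
RE_IP_ADDR = re.compile(
    r"((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))"
)

# Made-up sample lines standing in for the real log file
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Jan/2015:00:00:01 +0900] "GET / HTTP/1.1" 200 512 "-" "curl/7.0"',
    '198.51.100.7 - - [01/Jan/2015:00:00:02 +0900] "GET /a.html HTTP/1.1" 404 128 "-" "curl/7.0"',
] * 1000


def extract_ips(lines):
    """Return the first IPv4 address found on each line (hypothetical helper)."""
    found = []
    for line in lines:
        match = RE_IP_ADDR.search(line)
        if match is not None:
            found.append(match.group())
    return found


if __name__ == "__main__":
    # Repeat the measurement five times and keep the fastest run
    timings = timeit.repeat(lambda: extract_ips(SAMPLE_LINES), repeat=5, number=1)
    print("best of 5: {0:.4f}[sec]".format(min(timings)))
```

Taking the minimum rather than the average is a design choice: slower runs usually reflect other activity on the machine rather than the code itself.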
Next is the method using modules.
sample_module.py
# coding: utf-8
# A program that checks which is faster, IP address search using regular expressions or modules
import os
import sys
import time

import apache_log_parser

if __name__ == "__main__":
    start = time.time()
    # open() does not expand "~", so expand it explicitly
    log_dir = os.path.expanduser("~/apache_log_analysis/log_data/")
    # The parser extracts every field of the format, not just %h
    parser = apache_log_parser.make_parser('%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"')
    with open(os.path.join(log_dir, sys.argv[1])) as f:
        for line in f.readlines():
            try:
                log_data = parser(line)
                print(log_data['remote_host'])
            except Exception:
                # For example, the line did not match the format
                print("failed to parse line")
    elapsed_time = time.time() - start
    print("elapsed_time:{0}[sec]".format(elapsed_time))
    print("exit")
The result was 78.4286789894 [sec]! It was so slow that I checked the program many times to see whether I had made a mistake, wondering what was going on.
Thinking about it, this is a natural result: when the module is used, all the other fields are parsed as well. Even so, I was surprised at how slow it was. When I looked at the module's source, it was written to be general enough for wide use, so the slowness made sense.
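To illustrate why parsing everything costs more, here is a rough sketch that compares matching only the leading address with capturing and post-processing every field. The full-line regex and the helpers `host_only` and `all_fields` are my own hand-written stand-ins, not the code that apache_log_parser actually runs, and the sample line is invented. It only shows the general idea that more work per line takes more time.

```python
import re
import timeit
from datetime import datetime

# Invented sample line in combined log format
LINE = ('203.0.113.5 - - [01/Jan/2015:00:00:01 +0900] '
        '"GET /index.html HTTP/1.1" 200 512 "-" "curl/7.0"')

# Narrow approach: only the leading address is matched
RE_HOST_ONLY = re.compile(r"^(\S+)")

# Broad approach: every field is captured (a stand-in,
# NOT the regex that apache_log_parser builds internally)
RE_ALL_FIELDS = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def host_only(line):
    """Extract just the remote host (hypothetical helper)."""
    return RE_HOST_ONLY.match(line).group(1)


def all_fields(line):
    """Extract every field, post-process some, then return the host (hypothetical helper)."""
    fields = RE_ALL_FIELDS.match(line).groupdict()
    # Extra per-line work of the kind a general-purpose parser might do
    fields["time"] = datetime.strptime(fields["time"][:20], "%d/%b/%Y:%H:%M:%S")
    fields["method"], fields["url"], fields["protocol"] = fields["request"].split(" ")
    return fields["host"]


if __name__ == "__main__":
    for func in (host_only, all_fields):
        best = min(timeit.repeat(lambda: func(LINE), repeat=5, number=10000))
        print("{0}: {1:.4f}[sec] per 10000 lines".format(func.__name__, best))
```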
From now on, rather than relying too much on a module, I think that when implementing something myself would be faster, it is better to pull just the necessary parts from the module's source and use only those.
I learned that modules are convenient, but you can't rely on them too much.
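As one example of using only the necessary part: with the log format string used in sample_module.py, `%h` is the first field, so the address can be taken by splitting off the first space-separated token. This is a minimal sketch, assuming every line really starts with the client address. `remote_host` is a hypothetical helper name and the sample line is invented. Unlike the regular expression, it does not check that the token is a valid IPv4 address.

```python
def remote_host(line):
    """Return the first space-separated field of a log line (hypothetical helper).

    Assumes the log format starts with %h.
    """
    return line.split(" ", 1)[0]


if __name__ == "__main__":
    # Invented sample line in combined log format
    sample = ('203.0.113.5 - - [01/Jan/2015:00:00:01 +0900] '
              '"GET / HTTP/1.1" 200 512 "-" "curl/7.0"')
    print(remote_host(sample))
```

I have not timed this against my real log, so whether it beats the regular expression there is still something to measure.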