Hello. I'm Tanaka from NETS1.

This is a continuation of "I want to identify the alert email. - XXXXX is a wildcard! -".

Nobody seemed to be signed up for the last day, so I decided to write this.

An email that is not judged correctly shows up

One day I received an alert email like this.

Dec 14 12:59:12 app001 2020/12/14 12:59:12.449 app001 ERROR ERROR002 ecnxeci-1349 Great thing happened. sugoi[10223]Error in Oh, what!

Looking at the list, this is clearly a case that calls for a phone call.

Error code	Error message	Response
ERROR002	app00x ERROR ERROR002 xxxxx Great thing happened. XXXXX[10223]Error in	Phone call
ERROR002	Great thing happened. XXXXX[YYYYY]Error in	Email
ERROR002	Great thing happened.	Ignore

But looking at the result produced by the code from the previous post ...

ecn has been judged as delete. Ideally the whole of ecnxeci-1349 would be judged as replace, but the x inside ecnxeci-1349 was mistakenly matched against the xxxxx part on the list side.
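The misjudgment can be reproduced in isolation. The following is a minimal sketch that feeds only the two fragments to difflib; the variable names are mine and are not part of the original code.

```python
import difflib

mail_part = 'ecnxeci-1349'   # random-looking ID taken from the mail
manual_part = 'xxxxx'        # wildcard as it is written in the list

matcher = difflib.SequenceMatcher(None, mail_part, manual_part)
opcodes = matcher.get_opcodes()

# Print each opcode in the same "tag : mail | manual" layout used in this post
for tag, i1, i2, j1, j2 in opcodes:
    print(f'{tag} : {mail_part[i1:i2]} | {manual_part[j1:j2]}')
```

The single x in the mail is paired with the first x of the wildcard, which splits the ID into a delete, an equal, and a replace instead of one replace.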

Doing my best to deal with it

Approach

If an x gets matched, just make it something other than x!

Easy to say. All you have to do is replace the x that matched xxxxx with something other than x. But how do you tell that an x is an unexpected match against a wildcard ...

Wildcard conditions

If you don't define what a wildcard is, you can't judge it as a wildcard.

A single x or X ... Example: the x in app00x
A run of the same letter ... Example: xxxxxx, yyyyyy

I think this is what a wildcard looks like when it is written so that people can easily read it. So a span is treated as a wildcard if difflib evaluates it as replace and the string on the manual (list) side meets the conditions above.
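As a rough sketch of this definition, the two conditions can be written as a predicate and combined with the replace tag from difflib. The helper names looks_like_wildcard and wildcard_spans are hypothetical, and requiring letters via isalpha is my own reading of "the same letter"; the implementation later in this post only checks that the characters are identical.

```python
import difflib

def looks_like_wildcard(text):
    """Hypothetical helper: a lone x/X, or one letter repeated two or more times."""
    if len(text) == 1:
        return text.lower() == 'x'
    # Assumption: runs must be alphabetic, so '1111' is not a wildcard
    return len(text) > 1 and text.isalpha() and len(set(text)) == 1

def wildcard_spans(mail, manual):
    """Return (mail_text, manual_text) pairs that difflib tags as replace
    and whose manual side satisfies the wildcard conditions."""
    matcher = difflib.SequenceMatcher(None, mail, manual)
    return [(mail[i1:i2], manual[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag == 'replace' and looks_like_wildcard(manual[j1:j2])]

spans = wildcard_spans('app001 ERROR', 'app00x ERROR')
print(spans)
```

Here the only wildcard span is the 1 in app001 against the x in app00x.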

Determining if a wildcard is an unexpected match

As for how to judge this, I decided to use the evaluation value (ratio) from difflib.

1. For each part judged to be a wildcard, replace the manual side with the string from the mail side and output the evaluation value.
2. Check whether any string evaluated as equal meets the wildcard conditions.
3. If one does, replace it on the mail side with some other string and re-evaluate.
4. If the re-evaluated value is higher than the value from step 1, judge it as an unexpected match.

In this example,

 : Dec 14 12:59:12 app001 2020/12/14 12:59:12.449  |
equal : app00 | app00
replace : 1 | x
equal :  ERROR ERROR002  |  ERROR ERROR002
delete : ecn |
equal : x | x
replace : eci-1349 | xxxx
equal :Great thing happened.|Great thing happened.
replace : sugoi | XXXXX
equal : [10223]Error in| [10223]Error in
 :Oh, what!|

From this result, step 1 creates a new evaluation string as shown below and outputs its evaluation value.

app001 ERROR ERROR002 xeci-1349 Great thing happened. sugoi[10223]Error in

At this point the xeci-1349 part of course does not match the string on the mail side (ecnxeci-1349), so the evaluation value does not reach the maximum of 1.0. Next, in steps 2 and 3, the x in the equal : x | x part is replaced with some other character (here, i), which gives the following mail-side string. (The unneeded beginning and end of the email are removed in pre-processing.)

app001 ERROR ERROR002 ecnieci-1349 Great thing happened. sugoi[10223]Error in

This mail string is then re-evaluated from step 1. In step 1 of the re-evaluation, the following evaluation string is newly created and its evaluation value is output.

app001 ERROR ERROR002 ecnieci-1349 Great thing happened. sugoi[10223]Error in

This time the evaluation string is identical to the mail-side string, so the re-evaluation value is the maximum, 1.0. Because the evaluation value went up, we can tell that the match was unexpected.
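The same comparison can be tried on just the ID fragment. This is a simplified sketch under my own assumptions, not the implementation from this post: ratio_with_wildcards_filled and looks_like_wildcard are hypothetical names, and the mail-side x is swapped with str.replace instead of being located through the equal opcode.

```python
import difflib

def looks_like_wildcard(text):
    """Hypothetical helper: a lone x/X, or one character repeated."""
    if len(text) == 1:
        return text.lower() == 'x'
    return len(text) > 1 and len(set(text)) == 1

def ratio_with_wildcards_filled(mail, manual):
    """Step 1: copy the mail text into every wildcard span of the manual, then score."""
    filled = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, mail, manual).get_opcodes():
        if tag == 'replace' and looks_like_wildcard(manual[j1:j2]):
            filled.append(mail[i1:i2])      # wildcard span: take the mail text
        else:
            filled.append(manual[j1:j2])    # everything else: keep the manual text
    return difflib.SequenceMatcher(None, mail, ''.join(filled)).ratio()

mail = 'ecnxeci-1349'
manual = 'xxxxx'

before = ratio_with_wildcards_filled(mail, manual)         # manual becomes 'xeci-1349'
masked_mail = mail.replace('x', 'i')                       # steps 2-3, simplified
after = ratio_with_wildcards_filled(masked_mail, manual)   # manual becomes 'ecnieci-1349'
unexpected_match = after > before                          # step 4

print(before, after, unexpected_match)
```

Before masking, 9 of the 21 characters in the two strings pair up, so the value is 18/21. After masking nothing in the mail matches xxxxx any more, the whole ID is treated as one wildcard span, and the value rises to 1.0.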

Trying to implement it

The main part is the same as before, so it is omitted.

import difflib

def search_space(message):
    '''Return the positions right after each space (position 0 is included so the start of the message is also tried)'''
    space_pos = [0]
    for index, c in enumerate(message):
        if c == ' ':
            space_pos.append(index + 1)
    return space_pos

def is_replacement(string):
    '''Return True if string looks like a wildcard: a lone x/X, or one character repeated'''
    # A single character counts only if it is x or X
    if len(string) <= 1:
        return string.lower() == 'x'

    # Longer strings count only if every character is the same
    first_char = string[0]
    for char in string:
        if char != first_char:
            return False
    return True

def diff_analyzer(skip_seek, msg, man_msg):
    fix = 0
    fix_man_msg = man_msg
    fix_msg = msg
    opcodes = [('', 0, skip_seek, 0, 0)]
    finish_seek = skip_seek  # fallback in case difflib returns no opcodes (both strings empty)

    seq = difflib.SequenceMatcher(None, msg, man_msg)
    ratio = seq.ratio()

    for tag, i1, i2, j1, j2 in seq.get_opcodes():
        fj1 = j1 + fix
        fj2 = j2 + fix

        if tag == 'replace':
            # If the manual side is a wildcard, overwrite it with the mail-side text and relabel the span with a new tag (fix_equal)
            if is_replacement(fix_man_msg[fj1:fj2]):
                fix_man_msg = fix_man_msg[:fj1] + msg[i1:i2] + fix_man_msg[fj2:]
                fix = fix + i2 - i1 - (fj2 - fj1)
                tag = 'fix_equal'
        elif tag == 'equal':
            # Assume that a random string on the message side may happen to match an x on the manual side.
            # Force a change whenever a single non-space character is equal (if the change was wrong, the evaluation value will drop on re-evaluation).
            # When there is such a match, replace the message side with some other character.
            if (fj2 - fj1 == 1 and fix_man_msg[fj1:fj2] != ' ') or is_replacement(fix_man_msg[fj1:fj2]):
                replace_msg = ''
                for letter in msg[i1:i2]:
                    # Add 100 to the code point to get a different character (very crude)
                    replace_msg += chr(ord(letter) + 100)
                fix_msg = fix_msg[:i1] + replace_msg + fix_msg[i2:]

        opcodes.append((tag, skip_seek + i1, skip_seek + i2, j1, j2))
        finish_seek = skip_seek + i2

    # If any wildcard was replaced, recompute only the ratio
    if fix_man_msg != man_msg:
        ratio = difflib.SequenceMatcher(None, msg, fix_man_msg).ratio()

    # Re-evaluate when there may have been an unexpected match
    # If the re-evaluation shows it was not an unexpected match, discard fix_msg
    if fix_msg != msg:
        f_seek, f_opcodes, f_ratio = diff_analyzer(skip_seek, fix_msg, man_msg)
        print(f_ratio, ':', fix_msg)
        print(ratio, ':', msg)
        if ratio < f_ratio:
            finish_seek = f_seek
            opcodes = f_opcodes
            ratio = f_ratio
        else:
            fix_msg = msg

    return (finish_seek, opcodes, ratio)

def check_message_by_difflib(manual, message):

    space_pos = search_space(message)

    ratio = 0
    # Try each position after a space as the start of the message
    for i in space_pos:
        msg = message[i:]
        delete_flag = False

        # If the last opcode is a delete, cut that tail off before evaluating
        tag, i1, i2, j1, j2 = difflib.SequenceMatcher(None, msg, manual).get_opcodes()[-1]
        if tag == 'delete':
            msg = msg[:i1]
            delete_flag = True

        finish_seek, tmp_opcodes, tmp_ratio = diff_analyzer(i, msg, manual)

        if ratio <= tmp_ratio:
            if delete_flag:
                tmp_opcodes.append(('', finish_seek, len(message), 0, 0))
            ratio = tmp_ratio
            opcodes = tmp_opcodes

    return opcodes, ratio
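
To illustrate only the "try every position after a space" idea in check_message_by_difflib, here is a stripped-down sketch. It leaves out the wildcard handling and the trimming of a trailing delete, and the names word_starts and best_start are hypothetical.

```python
import difflib

def word_starts(message):
    """Index 0 plus the index right after every space."""
    return [0] + [i + 1 for i, c in enumerate(message) if c == ' ']

def best_start(message, manual):
    """Return (offset, ratio) for the suffix of message that scores highest."""
    scored = [(difflib.SequenceMatcher(None, message[i:], manual).ratio(), i)
              for i in word_starts(message)]
    ratio, offset = max(scored)
    return offset, ratio

offset, ratio = best_start('Dec 14 app001 ERROR', 'app00x ERROR')
print(offset, ratio)
```

Skipping the date prefix gives the best score because ratio() divides by the combined length of both strings, so every unmatched leading character lowers the value.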

Test

... (omitted) ...

 : Dec 14 12:59:12 app001 2020/12/14 12:59:12.449  |
equal : app00 | app00
fix_equal : 1 | x
equal :  ERROR ERROR002  |  ERROR ERROR002
fix_equal : ecnxeci-1349 | xxxxx
equal :Great thing happened.|Great thing happened.
fix_equal : sugoi | XXXXX
equal : [10223]Error in| [10223]Error in
 :Oh, what!|

ecnxeci-1349 is now evaluated as a wildcard part (fix_equal), which looks good. But I wondered whether this really affects only the wildcard parts, so I also compared the following strings.

Mail side: app001 ERROR fix_data.sh error
Manual side: app00x ERROR boxdata.sh error

Execution result

0.8813559322033898 : app001 ERROR fiÜ_data.sh error
0.9152542372881356 : app001 ERROR fix_data.sh error
0.7307692307692307 : ERROR fiÜ_data.sh error
0.7692307692307693 : ERROR fix_data.sh error
0.5652173913043478 : fiÜ_data.sh error
0.6086956521739131 : fix_data.sh error
app001 ERROR fix_data.sh error

 :  |
equal : app00 | app00
fix_equal : 1 | x
equal :  ERROR  |  ERROR
replace : fi | bo
equal : x | x
delete : _ |
equal : data.sh error | data.sh error

The x is still correctly judged as equal : x | x. Looking at the first and second lines, the evaluation value after masking the x drops, as intended. Looking good.
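The first two values in the execution result can be checked by hand, since SequenceMatcher.ratio() is 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The sketch below assumes the manual side has already had its wildcard x (in app00x) filled in from the mail; matched_chars is a hypothetical helper.

```python
import difflib

# Manual side after the x in app00x has been filled in from the mail
filled_manual = 'app001 ERROR boxdata.sh error'
mail = 'app001 ERROR fix_data.sh error'
# Same masking as in diff_analyzer: shift the code point of x by 100
masked_mail = 'app001 ERROR fi' + chr(ord('x') + 100) + '_data.sh error'

def matched_chars(a, b):
    """Total size of all matching blocks difflib finds between a and b."""
    blocks = difflib.SequenceMatcher(None, a, b).get_matching_blocks()
    return sum(block.size for block in blocks)

plain = difflib.SequenceMatcher(None, mail, filled_manual).ratio()
masked = difflib.SequenceMatcher(None, masked_mail, filled_manual).ratio()

print(matched_chars(mail, filled_manual), plain)
print(matched_chars(masked_mail, filled_manual), masked)
```

The two strings are 30 and 29 characters long, so T is 59. With the x intact 27 characters match (54/59); with the x masked only 26 match (52/59), which is why the masked version loses and is discarded.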

Judging by the evaluation value

I think main should be changed so that the judgment result is the manual entry with an evaluation value of 1.0 and the longest match. Like this.

mail = 'Dec 14 12:59:12 app001 2020/12/14 12:59:12.449 app001 ERROR ERROR002 ecnxeci-1349 Great thing happened. sugoi[10223]Error in Oh, what!'
manual1 = 'app00x ERROR ERROR002 xxxxx Great thing happened. XXXXX[10223]Error in'
manual2 = 'Great thing happened. XXXXX[YYYYY]Error in'
manual3 = 'Great thing happened.'
manuals = [manual1, manual2, manual3]

max_match_length = 0
result = 'Not applicable'
for manual in manuals:
    opcodes, ratio = check_message_by_difflib(manual, mail)
    match_length = sum([opcode[2] - opcode[1] for opcode in opcodes if opcode[0] == 'fix_equal' or opcode[0] == 'equal'])

    if ratio == 1:
        if max_match_length < match_length:
            max_match_length = match_length
            result = manual

print('result:' + result)