I want to scrape Joe Sandbox malware analysis reports with BeautifulSoup and Python to extract **PowerShell command lines**.
Joe Sandbox is a site that analyzes malware and outputs a report: https://www.joesandbox.com
There are several editions of Joe Sandbox; the one called Cloud Basic lets you analyze malware for free. Reports produced by Cloud Basic are published, so you can also read other people's analysis reports. The Web API is available in the other editions, but apparently not in Cloud Basic.
For more details, see the following articles.
How to do dynamic malware analysis with Joe Sandbox: https://qiita.com/hanzawak/items/ec665e0f96dc65f3def3
Web scraping from malware dynamic analysis site with BeautifulSoup + Python: https://qiita.com/hanzawak/items/0a8a26d1ebe62b84f847
Get the PowerShell command line from the JoeSandbox Cloud Basic analysis report.
A naive extraction mixes the PowerShell command lines executed by malware with those executed by legitimate files. To tell them apart, I also collect the score that Joe Sandbox assigns to indicate how malicious the sample is.
The score is taken from the detection overview at the top of the report.
The PowerShell command line is taken from the startup section of the report, which lists process names and command lines.
In the example used here, the following is extracted:
C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -noP -sta -w 1 -enc (omitted)
The results are written to a text file as shown below, one line per command line found. The fields are "report number, score, PowerShell command line", separated by commas.
ReportNumber,DetectionScore,PowerShellCommandLine
236546,56,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -noP -sta -w 1 -enc (omitted)
236547,99,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -NoP -NonI -W Hiden -Exec Bypass (omitted)
236548,10,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe (omitted)
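Because a command line can itself contain commas, anything that reads this file back should split each line on the first two commas only. The following is a minimal sketch of that idea, keeping only high-scoring entries. The names `parse_line` and `filter_by_score`, the sample lines, and the threshold of 40 are assumptions for illustration and are not part of the original script.

```python
def parse_line(line):
    """Split one output line into (report_number, score, command_line).

    Returns None for the header row and for ERROR rows, which carry no numeric score.
    """
    # Split on the first two commas only, so commas inside the command line survive.
    parts = line.rstrip('\n').split(',', 2)
    if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
        return None
    return int(parts[0]), int(parts[1]), parts[2]


def filter_by_score(lines, min_score=40):
    """Keep parsed rows whose score is at least min_score (40 is an assumed threshold)."""
    rows = (parse_line(line) for line in lines)
    return [row for row in rows if row is not None and row[1] >= min_score]


# Hypothetical sample lines in the output format described above.
sample = [
    'ReportNumber,DetectionScore,PowerShellCommandLine',
    '236546,56,powershell.exe -noP -sta -w 1',
    '236548,10,powershell.exe -c Write-Host a,b',
    '236549,ERROR:list index out of range',
]
print(filter_by_score(sample))
```

Splitting with a limit keeps the format unchanged; the alternative would be to quote the command-line field when writing.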
The snippets below form a working script when joined together.
In addition to BeautifulSoup, import requests, os, and re.
import requests
from bs4 import BeautifulSoup
import os
import re
Define from and to values so that information can be extracted from a range of reports. For testing, both are set to the same report number here.
report_num_from = 236543
report_num_to = 236543
The report URL looks like the following. The "236543" part appears to be the report number, so I loop over this number. https://www.joesandbox.com/analysis/236543/0/html
def extract_powershell_command(report_num_from, report_num_to):
    for reoprt_num in range(report_num_from, report_num_to + 1):
        ps_cmdline = []
        try:
            target_url = 'https://www.joesandbox.com/analysis/' + str(reoprt_num) + '/0/html'
            response = requests.get(target_url)
            soup = BeautifulSoup(response.text, 'lxml')
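The loop over report numbers can also be isolated into a small generator, which makes it easy to pause between requests. This is only a sketch: `iter_report_urls`, `BASE_URL`, the `delay_seconds` default, and the injectable `sleep` parameter are hypothetical names chosen for illustration; only the URL pattern comes from the text above.

```python
import time

# URL pattern taken from the report URL shown above.
BASE_URL = 'https://www.joesandbox.com/analysis/{}/0/html'


def iter_report_urls(first, last, delay_seconds=1.0, sleep=time.sleep):
    """Yield (report_number, url) pairs, pausing between consecutive reports."""
    for index, report_num in enumerate(range(first, last + 1)):
        if index > 0:
            # Wait before moving on to the next report; no wait before the first one.
            sleep(delay_seconds)
        yield report_num, BASE_URL.format(report_num)


# Print the URLs without waiting (delay set to 0 for the demonstration).
for number, url in iter_report_urls(236543, 236545, delay_seconds=0):
    print(number, url)
```

Passing `sleep` as a parameter is a design choice that lets the pause be replaced in tests; in normal use the default `time.sleep` applies.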
The score is shown at the top of the report page. In this example the score is 56, and this is the number we want.
In the page source it sits in the table with the id detection-details-overview-table.
It's a bit rough, but the following gets the score.
            detection_score = 0
            table = soup.findAll("table", {"id":"detection-details-overview-table"})[0]
            rows = table.findAll("tr")
            detection_score = re.sub(r'.+>(.+)</td></tr>', r'\1', str(rows[0]))
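One thing to keep in mind with `re.sub` here: when the pattern does not match, it returns the input string unchanged, so the whole row HTML would end up in the score column. The sketch below swaps in `re.search` and returns an integer or `None` instead. The helper name `parse_score`, the stricter digits-only pattern, and the sample row markup are assumptions for illustration; the real report markup may differ.

```python
import re

# Assumed shape: the score is the number in the last cell of the row.
SCORE_PATTERN = re.compile(r'>\s*(\d+)\s*</td>\s*</tr>')


def parse_score(row_html):
    """Return the integer score found in a table row's HTML, or None if there is none."""
    match = SCORE_PATTERN.search(row_html)
    return int(match.group(1)) if match else None


# Hypothetical row markup, for illustration only.
sample_row = '<tr><td>Score:</td><td>56</td></tr>'
print(parse_score(sample_row))
print(parse_score('<tr><td>no score here</td></tr>'))
```

In the script this would be called as `parse_score(str(rows[0]))`, with the caller deciding what to write when the result is `None`.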
Consider a report that includes a PowerShell command line.
I want the part that follows cmdline for powershell.exe.
Looking at the page source, the data appears to be stored in a table-like block made up of li elements.
From the li elements, I take the ones that contain powershell.exe, strip the wbr tags, and extract the text that follows cmdline.
There may be a smarter way, but I did the following:
            startup = soup.find('div', id='startup1')
            for line in startup.findAll('li'):
                if 'powershell.exe' in str(line):
                    tmp = str(line).replace('<wbr>', '').replace('</wbr>', '')
                    cmdline = re.sub(r'.+ cmdline: (.*) MD5: <span.+', r'\1', tmp)
                    ps_cmdline.append(str(reoprt_num) + ',' + detection_score + ',' + cmdline)
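To see what the cleanup and the regular expression do, the step can be tried on a single string outside the scraper. The sketch below uses `re.search` rather than `re.sub`, so that an li element without a cmdline field yields `None` instead of its raw HTML. The helper name `extract_cmdline` and the sample markup are hypothetical; the real report markup may differ.

```python
import re

# Same anchors as the expression above: the text between ' cmdline: ' and ' MD5: <span'.
CMDLINE_PATTERN = re.compile(r' cmdline: (.*) MD5: <span')


def extract_cmdline(li_html):
    """Return the command line embedded in an li element's HTML, or None if absent."""
    # Remove the word-break tags that are scattered through long paths.
    cleaned = li_html.replace('<wbr>', '').replace('</wbr>', '')
    match = CMDLINE_PATTERN.search(cleaned)
    return match.group(1) if match else None


# Hypothetical li markup, for illustration only.
sample_li = (r'<li>powershell.exe (PID: 1 cmdline: C:\WINDOWS<wbr></wbr>\powershell.exe'
             r' -noP -sta MD5: <span>0</span>)</li>')
print(extract_cmdline(sample_li))
print(extract_cmdline('<li>powershell.exe</li>'))
```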
Exception handling is included as well, at least for now.
At the end of each iteration, the save_file function (described later) is called.
        except IndexError as e:
            ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))
        except Exception as e:
            ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))
        finally:
            save_file(ps_cmdline)
This part writes the results to a file. It does not need to be a separate function, but I plan to replace file output with something else later, so I made it a function that is easy to swap out. It creates `output.txt` in the same folder and appends to it.
def save_file(ps_cmdline):
    with open('./output.txt', 'a') as f:
        if os.stat('./output.txt').st_size == 0:
            f.write('ReportNumber,DetectionScore,PowerShellCommandLine\n')
        for x in ps_cmdline:
            f.write(str(x) + "\n")
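If the output ever needs to be opened in a spreadsheet or parsed strictly, one option is the standard `csv` module, which quotes fields that contain commas. The following is a sketch of that alternative, not the method used in this article: `save_rows`, `HEADER`, the file name, and the sample rows are hypothetical, and it takes rows as lists of fields rather than pre-joined strings.

```python
import csv
import os
import tempfile

HEADER = ['ReportNumber', 'DetectionScore', 'PowerShellCommandLine']


def save_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(HEADER)
        writer.writerows(rows)


# Demonstration in a temporary folder with hypothetical rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')
    save_rows(path, [[236546, 56, 'powershell.exe -c Write-Host a,b']])
    save_rows(path, [[236547, 99, 'powershell.exe -NoP']])
    with open(path, newline='', encoding='utf-8') as f:
        rows_read = list(csv.reader(f))
print(rows_read)
```

The header check happens before the file is opened, so a second call appends data without repeating the header.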
Here is the code so far joined together, with comments added. Needless to say, keep the range specified by from and to moderate. Because I wrote and ran it in Jupyter Notebook, it has the following form.
import requests
from bs4 import BeautifulSoup
import os
import re

report_num_from = 236547
report_num_to = 236547

def extract_powershell_command(report_num_from, report_num_to):
    """
    Extract PowerShell Command from JoeSandbox analysis result.

    Parameters
    ----------
    report_num_from : int
        First report number to analyze
    report_num_to : int
        Last report number to analyze
    """
    for reoprt_num in range(report_num_from, report_num_to + 1):
        ps_cmdline = []
        try:
            target_url = 'https://www.joesandbox.com/analysis/' + str(reoprt_num) + '/0/html'
            response = requests.get(target_url)
            soup = BeautifulSoup(response.text, 'lxml')

            # Check JoeSandbox Detection Score (Maybe score above 40 is malicious)
            detection_score = 0
            table = soup.findAll("table", {"id":"detection-details-overview-table"})[0]
            rows = table.findAll("tr")
            detection_score = re.sub(r'.+>(.+)</td></tr>', r'\1', str(rows[0]))

            startup = soup.find('div', id='startup1') # 'startup1' is a table with ProcessName & CommandLine
            for line in startup.findAll('li'):
                if 'powershell.exe' in str(line):
                    tmp = str(line).replace('<wbr>', '').replace('</wbr>', '')
                    cmdline = re.sub(r'.+ cmdline: (.*) MD5: <span.+', r'\1', tmp)
                    ps_cmdline.append(str(reoprt_num) + ',' + detection_score + ',' + cmdline)
        # Report number does not exist
        except IndexError as e:
            ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))
        except Exception as e:
            ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))
        finally:
            save_file(ps_cmdline)

def save_file(ps_cmdline):
    """
    Save the extraction results to a file.
    File I/O is a function because it may change.

    Parameters
    ----------
    ps_cmdline : list of str
        List of output lines (report number, score, command line).
    """
    with open('./output.txt', 'a') as f:
        if os.stat('./output.txt').st_size == 0:
            f.write('ReportNumber,DetectionScore,PowerShellCommandLine\n')
        for x in ps_cmdline:
            f.write(str(x) + "\n")

extract_powershell_command(report_num_from, report_num_to)
Running the above code will give you the following results:
ReportNumber,DetectionScore,PowerShellCommandLine
236547,56,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -noP -sta -w 1 -enc SQBmACgAJABQAFMAVgBFAFIAcwBJAG8AbgBUAGEAYgBMAGUALgBQA (omitted)