The content this time is almost the same as the previous post, Using Azure ML Python SDK 3: Writing the output to Blob storage - Part 1; the difference is that the input becomes a folder instead of a file. That alone is not very interesting, so I will also attach a sample that actually passes some packages to the container.
Since we specify a folder rather than a specific file as input this time, it is assumed that there are multiple files under work2/input/ in the figure below. Other than that, the setup is unchanged: a remote virtual machine and a Jupyter Notebook.
- Remote virtual machine (hereinafter "compute cluster", using Azure ML terminology)
To check the Python SDK version
```python
import azureml.core
print("SDK version:", azureml.core.VERSION)
```
The folder structure of the Notebook remains the same.
script2.2.py collects the file names of the files saved under work2/input/ into a CSV file and saves it in work2/output/output1/. The subfolder output1 exists only to demonstrate creating a folder from within script2.2.py.
The procedure in HelloWorld2.2.ipynb is the same as last time and is as follows.
We will go through the steps in order, as before.
Load packages
Load the packages.
```python
from azureml.core import Workspace, Experiment, Dataset, Datastore, ScriptRunConfig, Environment
from azureml.data import OutputFileDatasetConfig
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies

workspace = Workspace.from_config()
```
Specifying the compute cluster
Specify the compute cluster.
```python
aml_compute_target = "demo-cpucluster1"  # <== The name of the cluster being used

try:
    aml_compute = ComputeTarget(workspace, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("no compute target with the specified name found")
Specify input and output folders
demostore is the name of the datastore registered in the Azure ML Workspace. The file path inside the datastore's blob container is passed to the Dataset class.
```python
ds = Datastore(workspace, 'demostore')
input_data = Dataset.File.from_files(ds.path('work2/input/')).as_named_input('input_ds').as_mount()
output = OutputFileDatasetConfig(destination=(ds, 'work2/output'))
```
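As a side note (not part of the original notebook), the same input and output could also be configured with download and upload semantics instead of mounting; the variable names below are just illustrative variants:
```python
# Variation: download the input files to the compute node instead of mounting them.
input_data_dl = Dataset.File.from_files(ds.path('work2/input/')) \
                            .as_named_input('input_ds') \
                            .as_download()

# Variation: upload the output folder explicitly when the run completes.
output_up = OutputFileDatasetConfig(destination=(ds, 'work2/output')).as_upload(overwrite=True)
```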
Specifying the container environment
As mentioned at the beginning, this time I actually specified some packages. They are listed only as an example of how to specify packages and have no particular meaning here.
```python
myenv = Environment("myenv")
myenv.docker.enabled = True
myenv.python.conda_dependencies = CondaDependencies.create(pip_packages=[
    'azureml-defaults',
    'opencv-python-headless',
    'numpy',
    'pandas',
    'tensorflow',
    'matplotlib',
    'Pillow'
])
```
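If you want to reuse this environment in later experiments, it could also be registered to the workspace and retrieved by name. This step is optional and not part of the original procedure:
```python
# Optional: register the environment so it can be reused later by name.
myenv.register(workspace=workspace)

# Retrieve the registered environment in a later notebook or run.
myenv = Environment.get(workspace=workspace, name="myenv")
```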
Specifying the executable file name
Specify the folder containing the set of scripts to be executed remotely in source_directory. In script, specify the name of the script file that serves as the entry point for remote execution.
In remote execution, all files and subdirectories in source_directory are passed to the container, so be careful not to place unnecessary files.
Pass input_data with the argument name datadir and output with the argument name output.
We also specify the compute cluster name in compute_target and pass the Environment instance myenv in environment.
```python
src = ScriptRunConfig(source_directory='script_folder2',
                      script='script2.2.py',
                      arguments=['--datadir', input_data, '--output', output],
                      compute_target=aml_compute,
                      environment=myenv)
```
Run the experiment
Run the script.
```python
exp = Experiment(workspace, 'work-test')
run = exp.submit(config=src)
```
This cell finishes asynchronously, so if you want to wait for the execution to complete, run the following statement.
```python
%%time
run.wait_for_completion(show_output=True)
```
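After the run completes, the CSV written to the datastore can be pulled down for a quick check. A minimal sketch, assuming a local target folder such as ./local_output (the path is arbitrary):
```python
# Pull the output folder written by the run back from the blob datastore.
ds.download(target_path='./local_output',
            prefix='work2/output/output1',
            overwrite=True)
```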
script2.2.py
The contents of the script that is executed remotely. As mentioned above, it collects the file names of the files saved under work2/input/ into a data frame and saves it as outfile.csv in work2/output/output1/. The output1/ subfolder is created inside this script.
```python
import argparse
import os
import cv2
import numpy as np
import pandas as pd
import math
import tensorflow as tf
import PIL
import matplotlib

parser = argparse.ArgumentParser()
parser.add_argument('--datadir', type=str, help="data directory")
parser.add_argument('--output', type=str, help="output")
args = parser.parse_args()

print("Argument 1: %s" % args.datadir)
print("Argument 2: %s" % args.output)
print("cv2: %s" % cv2.__version__)
print("numpy: %s" % np.__version__)
print("pandas: %s" % pd.__version__)
print("tensorflow: %s" % tf.__version__)
print("matplotlib: %s" % matplotlib.__version__)
print("PIL: %s" % PIL.PILLOW_VERSION)

file_dict = {}
file_dict_df = pd.DataFrame([])
i = 0

for fname in next(os.walk(args.datadir))[2]:
    print('processing', fname)
    i += 1
    infname = os.path.join(args.datadir, fname)

    file_dict['num'] = i
    file_dict['file name'] = fname
    file_dict_df = file_dict_df.append([file_dict])

os.makedirs(args.output + '/output1', exist_ok=True)
outfname = os.path.join(args.output, 'output1/outfile.csv')
file_dict_df.to_csv(outfname, index=False, encoding='shift-JIS')
```
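If you download the output as sketched earlier, the CSV can then be inspected locally. Note that the script writes it with Shift-JIS encoding, so the same encoding should be used when reading it back (the local path below simply follows the earlier download sketch and may differ in your setup):
```python
import pandas as pd

# Read the CSV back with the same encoding the script used when writing it.
# The exact local path depends on where the output folder was downloaded.
df = pd.read_csv('./local_output/work2/output/output1/outfile.csv', encoding='shift-jis')
print(df)
```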
What do you think? Through parts 1 to 4 of Using Azure ML Python SDK, you can now understand the basic operation of the Azure ML Python SDK. Next time, I would like to introduce pipelines.
- Using Azure ML Python SDK 1: Using dataset as input - Part 1
- [Using Azure ML Python SDK 2: Using dataset as input - Part 2](https://qiita.com/notanaha/items/30d57590c92b03bc953c)
- [Using Azure ML Python SDK 3: Using dataset as input - Part 1](https://qiita.com/notanaha/items/d22ba02b9cc903d281b6)
- Azure/MachineLearningNotebooks