Description
Describe the bug or the issue that you are facing
I'm trying to implement hyperparameter tuning in the default training pipeline by setting up a sweep job. After running the deploy-model-training-pipeline workflow, it errors out during the run-model-training-pipeline / run-pipeline step.
Steps/Code to Reproduce
- Replace the train_model job in mlops/azureml/train/pipeline.yml with the following sweep job, re-indented here because the comment box dropped my YAML indentation (see also the note on the primary metric right after the YAML):
train_model:
  name: train_model
  display_name: train-model
  type: sweep
  trial:
    code: ../../../data-science/src
    command: >-
      python train.py
      --train_data ${{inputs.train_data}}
      --model_output ${{outputs.model_output}}
      --regressor__n_estimators ${{search_space.regressor__n_estimators}}
    environment: azureml:taxi-train-env@latest
  inputs:
    train_data: ${{parent.jobs.prep_data.outputs.train_data}}
  outputs:
    model_output: ${{parent.outputs.trained_model}}
  sampling_algorithm: random
  search_space:
    regressor__n_estimators:
      type: choice
      values: [100, 200]
  objective:
    goal: minimize
    primary_metric: train_mse
  limits:
    max_total_trials: 4
    max_concurrent_trials: 2
    timeout: 7200
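One thing to note about this configuration: Azure ML resolves objective.primary_metric by name against the metrics the trial script logs through MLflow, so train.py has to log a metric named exactly train_mse (which the revised main below does). A minimal illustration, with made-up numbers:

import mlflow
from sklearn.metrics import mean_squared_error

# Illustrative values only; in train.py the MSE comes from the trained model.
mse = mean_squared_error([3.0, 2.5, 4.1], [2.8, 2.7, 4.0])

# The metric name must match objective.primary_metric ("train_mse") exactly,
# otherwise the sweep has nothing to rank trials by.
mlflow.log_metric("train_mse", mse)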
- Revise the main function in data-science/src/train.py as follows (the matching parse_args is sketched after the code):
def main(args):
    '''Read train dataset, train model, save trained model'''

    # Read train data
    train_data = pd.read_parquet(Path(args.train_data))

    # Split the data into input(X) and output(y)
    y_train = train_data[TARGET_COL]
    X_train = train_data[NUMERIC_COLS + CAT_NOM_COLS + CAT_ORD_COLS]

    # Train a Random Forest Regression Model with the training set
    model = RandomForestRegressor(n_estimators=args.regressor__n_estimators,
                                  bootstrap=args.regressor__bootstrap,
                                  max_depth=args.regressor__max_depth,
                                  max_features=args.regressor__max_features,
                                  min_samples_leaf=args.regressor__min_samples_leaf,
                                  min_samples_split=args.regressor__min_samples_split,
                                  random_state=0)

    # log model hyperparameters
    mlflow.log_param("model", "RandomForestRegressor")
    mlflow.log_param("n_estimators", args.regressor__n_estimators)
    mlflow.log_param("bootstrap", args.regressor__bootstrap)
    mlflow.log_param("max_depth", args.regressor__max_depth)
    mlflow.log_param("max_features", args.regressor__max_features)
    mlflow.log_param("min_samples_leaf", args.regressor__min_samples_leaf)
    mlflow.log_param("min_samples_split", args.regressor__min_samples_split)

    # Train model with the train set
    model.fit(X_train, y_train)

    # Predict using the Regression Model
    yhat_train = model.predict(X_train)

    # Evaluate Regression performance with the train set
    r2 = r2_score(y_train, yhat_train)
    mse = mean_squared_error(y_train, yhat_train)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_train, yhat_train)

    # log model performance metrics
    mlflow.log_metric("train r2", r2)
    mlflow.log_metric("train_mse", mse)
    mlflow.log_metric("train rmse", rmse)
    mlflow.log_metric("train mae", mae)

    # Visualize results
    plt.scatter(y_train, yhat_train, color='black')
    plt.plot(y_train, y_train, color='blue', linewidth=3)
    plt.xlabel("Real value")
    plt.ylabel("Predicted value")
    plt.savefig("regression_results.png")
    mlflow.log_artifact("regression_results.png")

    # Save the model
    mlflow.sklearn.save_model(sk_model=model, path="model")

    from distutils.dir_util import copy_tree

    # copy subdirectory example
    from_directory = "model"
    to_directory = args.model_output
    copy_tree(from_directory, to_directory)
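For context, this assumes parse_args() in train.py already exposes the regressor hyperparameters as command-line arguments, since main() references them via args; the sweep only overrides --regressor__n_estimators, so the rest keep their defaults. A rough sketch (the defaults shown are illustrative placeholders, not necessarily what the repo ships with):

import argparse

def parse_args():
    '''Parse the arguments main() expects; defaults below are placeholders.'''
    parser = argparse.ArgumentParser("train")
    parser.add_argument("--train_data", type=str, help="Path to the training data")
    parser.add_argument("--model_output", type=str, help="Path to write the trained model to")

    # Regressor hyperparameters: the sweep overrides only n_estimators,
    # the others fall back to these defaults.
    parser.add_argument("--regressor__n_estimators", type=int, default=500)
    parser.add_argument("--regressor__bootstrap", type=int, default=1)
    parser.add_argument("--regressor__max_depth", type=int, default=10)
    parser.add_argument("--regressor__max_features", type=str, default="auto")
    parser.add_argument("--regressor__min_samples_leaf", type=int, default=4)
    parser.add_argument("--regressor__min_samples_split", type=int, default=5)

    return parser.parse_args()

if __name__ == "__main__":
    # Calls the main() defined above.
    args = parse_args()
    main(args)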
- Run .github/workflows/tf-gha-deploy-infra.yml in GitHub Actions
- Run .github/workflows/deploy-model-training-pipeline-classical.yml in GitHub Actions
- The run errors out during run-model-training-pipeline / run-pipeline with the following message (see the note after the log):
Run run_id=$(az ml job create --file /home/runner/work/Azure_mlops_v2_demo/Azure_mlops_v2_demo/mlops/azureml/train/pipeline.yml --resource-group rg-mlopsv2-0040dev --workspace-name mlw-mlopsv2-0040dev --query name -o tsv)
Class WorkspaceHubOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
ERROR: Failed to find referenced source for input binding $parent.jobs.train_model.outputs.model_output
Error: Process completed with exit code 1.
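The binding named in the error corresponds to the downstream jobs in pipeline.yml that consume ${{parent.jobs.train_model.outputs.model_output}}, so presumably one of those references no longer resolves once train_model is a sweep node. A small hypothetical helper for listing every such reference before re-running the pipeline (the path assumes you run it from the repo root):

from pathlib import Path

# List every line in pipeline.yml that references the train_model output,
# to see which bindings the sweep node has to satisfy.
pipeline_yml = Path("mlops/azureml/train/pipeline.yml")

for lineno, line in enumerate(pipeline_yml.read_text().splitlines(), start=1):
    if "train_model.outputs.model_output" in line:
        print(f"{lineno:4d}: {line.strip()}")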
Expected Output
The .github/workflows/deploy-model-training-pipeline-classical.yml workflow executes with no errors.
Versions
I'm using GitHub Actions; I created my own repository following your guide and created a new dev branch.
Which platform are you using for deploying your infrastructure?
GitHub Actions (GitHub)
If you mentioned Others, please mention which platform you are using?
No response
What are you using for deploying your infrastructure?
Terraform
Are you using Azure ML CLI v2 or Azure ML Python SDK v2?
Azure ML CLI v2
Describe the example that you are trying to run?
Pre built examples from Tabular