Conversation

@HCharlie
Owner

Issue #, if available:

Description of changes:

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@HCharlie HCharlie force-pushed the MLX-1269 branch 3 times, most recently from ef346b1 to 3112627 on May 2, 2024 11:45
benieric and others added 26 commits December 4, 2024 04:38
* add in-process mode for DJL server

* fix format

* add inference_spec as a member of DJL

* add the validations for model server

* fix typo

* fix test assertion

* add unit-testing

* have a common server for inprocess mode

* fix failing tests

* add support to torchserve

* fix tests to include torchserve servers

* use custom inference_spec code instead of HF pipelines

* fix tests for app.py

* fix unit test failure

* fix format

* use schema_builder for serialization and deserialization

* remove task field

* remove unused import
* Base model trainer (aws#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (aws#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (aws#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (aws#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (aws#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (aws#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (aws#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Change to make Model Trainer return a Model Object

* Fix

* Cleanup

* Support intelligent parameters (aws#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (aws#1541)

* Cleanup ModelTrainer (aws#1542)

* General image builder (aws#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Latest Container Image (aws#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Cleanup ModelTrainer code (aws#1552)

* Updates

* feat: add pre-processing and post-processing logic to inference_spec (aws#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Add Distributed Training Support Model Trainer (aws#1536)

* Add path to set Additional Settings in ModelTrainer (aws#1555)

* Updates

* Mask Sensitive Env Logs in Container (aws#1568)

* Cleanup PR

* Codestyle fixes

* Update logic to use model parameter instead of model_path

* Fixes

* Fixes

* Tests

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>
Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>
* Base model trainer (aws#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (aws#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (aws#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (aws#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (aws#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (aws#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (aws#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Support intelligent parameters (aws#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (aws#1541)

* Cleanup ModelTrainer (aws#1542)

* General image builder (aws#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Latest Container Image (aws#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Cleanup ModelTrainer code (aws#1552)

* feat: add pre-processing and post-processing logic to inference_spec (aws#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Add Distributed Training Support Model Trainer (aws#1536)

* Add path to set Additional Settings in ModelTrainer (aws#1555)

* Support building image from Dockerfile

* Fix test

* Fix test

* Rename functions

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>
Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>
* Base model trainer (aws#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (aws#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (aws#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (aws#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (aws#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (aws#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (aws#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Support intelligent parameters (aws#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (aws#1541)

* Cleanup ModelTrainer (aws#1542)

* Initial Prototype

* General image builder (aws#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Unified deploying in ModelBuilder

* Latest Container Image (aws#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Address PR comments

* Address Codestyle errors

* Cleanup ModelTrainer code (aws#1552)

* Black format

* Codestyle changes

* Codestyle changes

* from __future__ import absolute_import

* DocString formatting

* Black formatting

* Address PR comments

* Notebook changes and fixes

* feat: add pre-processing and post-processing logic to inference_spec (aws#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Add Distributed Training Support Model Trainer (aws#1536)

* Add path to set Additional Settings in ModelTrainer (aws#1555)

* Checkstyle Fixes

* Address PR comments

* Fixes

* Merge Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Update Docstring

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>
* Base model trainer (aws#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (aws#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (aws#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (aws#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (aws#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (aws#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (aws#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Support intelligent parameters (aws#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (aws#1541)

* Cleanup ModelTrainer (aws#1542)

* General image builder (aws#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Latest Container Image (aws#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Cleanup ModelTrainer code (aws#1552)

* Single container local mode training

* Add wait argument

* Implement helper functions

* Add helper functions

* Fix bugs

* Fix codestyle

* feat: add pre-processing and post-processing logic to inference_spec (aws#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Fix test and codestyle

* Add Distributed Training Support Model Trainer (aws#1536)

* Add tests

* Add path to set Additional Settings in ModelTrainer (aws#1555)

* Added example notebook

* Fix codestyle

* Address comments

* resolve merge conflict

* Support multi container local training (aws#1576)

* Fix codestyle

* Mask Sensitive Env Logs in Container (aws#1568)

* Fix bug in script mode setup ModelTrainer (aws#1575)

* Support multi container local training

* Merge branch 'single_container_local_training' into multi_container_local_training

* Update unit tests

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>

* Remove LocalTrainingJob class

* Bypass pydantic check

* Add example

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>
Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>
* add inference morpheus nbs

* update the in process notebook
…#1583)

* Fix: move the functionality from latest_container_image to retrieve

* address some comments from Gokul and add unit test

* remove extra functions and rewrite the test

* fix unit test

* fix for other unit test

* unit test fix

* fix unit test: add one more condition

* more unit tests fix

* remove redundant files

---------

Co-authored-by: Chad Chiang <chadchc@amazon.com>
Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>
* Fix: move the functionality from latest_container_image to retrieve

* address some comments from Gokul and add unit test

* remove extra functions and rewrite the test

* fix unit test

* fix for other unit test

* unit test fix

* fix unit test: add one more condition

* more unit tests fix

* remove redundant files

* remove the special condition and fix the unit test

---------

Co-authored-by: Chad Chiang <chadchc@amazon.com>
Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>
* Notebooks update for Bugbash

* Testing and updates

* Testing and updates

* Addressed comments

* Fix

* Fix
ci and others added 30 commits April 11, 2025 01:19
* Fix deepdiff dependencies

* trigger tests
* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* documentation: Removed a line about the Python version requirements of the training script, which could mislead users. The training script can use the latest version supported by the framework_version of the container

* feature: Enabled update_endpoint through model_builder

* fix: fix unit test, black-check, pylint errors

* fix: fix black-check, pylint errors

* fix: Added handler for pipeline variable while creating process job

* fix: Added handler for pipeline variable while creating process job

* Revert the PR changes: aws#5122, due to issue https://t.corp.amazon.com/P223568185/overview

* Fix: fix the issue, https://t.corp.amazon.com/P223568185/communication

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* fix: tgi image uri unit tests

* fix: black-format and flake8 failures

* fix: parse

* fix: print statement

---------

Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>
…aws#5123)

* clean up

* bump maxdepth for doc/api/training to fix readthedocs

* change maxdepth for readthedocs rendering doc/api/training page

* change maxdepth for readthedocs rendering doc/api/training page

* change maxdepth for readthedocs rendering doc/api/training page
* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* change: Allow telemetry only in supported regions

* documentation: Removed a line about the Python version requirements of the training script, which could mislead users. The training script can use the latest version supported by the framework_version of the container

* feature: Enabled update_endpoint through model_builder

* fix: fix unit test, black-check, pylint errors

* fix: fix black-check, pylint errors

* fix: Added handler for pipeline variable while creating process job

* fix: Added handler for pipeline variable while creating process job

* Revert the PR changes: aws#5122, due to issue https://t.corp.amazon.com/P223568185/overview

* Fix: fix the issue, https://t.corp.amazon.com/P223568185/communication

* Revert PR 5122 changes, due to issues with other processor codeflows

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
Co-authored-by: Zhaoqi <jzhaoqwa@amazon.com>
…ws#5144)

* add s3 uri check to modeltrainer data source

* update ModelTrainer to support s3 uri and tar.gz file as source_dir

* black-format

* add unit and integ tests

* update logic and unit test to raise value error if the file is not .tar.gz
…image. (aws#5143)

* feature:support custom workflow deployment in ModelBuilder using SMD image. (aws#1661)

* feature:support custom workflow deployment in ModelBuilder using SMD inference image.

* Rename test case and pass session.

* Address PR comments.

* Tweak resource cleanup logic in integ test.

* Fixing CodeBuild integ test failures.

* Renamed integ test.

* Remove unused integ test, restore once GA.

---------

Co-authored-by: Joseph Zhang <cjz@amazon.com>

* Cache client as instance attribute in @property decorator. (aws#1668)

* Remove @property decorator from ABC definition.

* Cache client as instance attribute in @property.

* Fix flake8 issue.

---------

Co-authored-by: Joseph Zhang <cjz@amazon.com>

* Bugfixes from e2e testing. (aws#1670)

* Fix Albatross Inference component tests

* trigger integ tests

---------

Co-authored-by: cj-zhang <32367995+cj-zhang@users.noreply.github.com>
Co-authored-by: Joseph Zhang <cjz@amazon.com>
Co-authored-by: Pravali Uppugunduri <upravali@amazon.com>
Co-authored-by: adishaa <adishaa@amazon.com>
…5146)

* Fix Flake8 Violations

* Add Owner ID check for bucket with path when prefix is provided

**Description**

Previously we called head_bucket to perform the owner ID check, but this does not account for cases where the S3 path is supplied through a prefix.

This change ensures that directory-level permissions are supported.
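
The idea in this description can be sketched roughly as follows. This is an illustration only, not the code from this change: the function name `check_bucket_access`, the `RecordingClient` stand-in, and the choice of `list_objects_v2` for the prefix case are all assumptions. The only real API surface it relies on is that boto3's S3 `head_bucket` and `list_objects_v2` both accept an `ExpectedBucketOwner` argument.

```python
from typing import Any, Optional


def check_bucket_access(
    s3_client: Any,
    bucket: str,
    expected_owner: str,
    prefix: Optional[str] = None,
) -> None:
    """Hypothetical helper: verify the bucket owner, scoped to a prefix if given."""
    if prefix:
        # Assumed approach: list at most one key under the prefix, so the caller
        # can pass the check with permissions limited to that prefix.
        s3_client.list_objects_v2(
            Bucket=bucket,
            Prefix=prefix,
            MaxKeys=1,
            ExpectedBucketOwner=expected_owner,
        )
    else:
        # No prefix: fall back to the bucket-level check described above.
        s3_client.head_bucket(Bucket=bucket, ExpectedBucketOwner=expected_owner)


class RecordingClient:
    """Stand-in for a boto3 S3 client; records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls = []

    def head_bucket(self, **kwargs):
        self.calls.append(("head_bucket", kwargs))
        return {}

    def list_objects_v2(self, **kwargs):
        self.calls.append(("list_objects_v2", kwargs))
        return {"KeyCount": 0}


client = RecordingClient()
check_bucket_access(client, "my-bucket", "123456789012", prefix="team-a/models/")
check_bucket_access(client, "my-bucket", "123456789012")
```

With a real client, a mismatched owner would surface as a `ClientError` from either call; the stub here only shows which call is chosen in each case.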

**Testing Done**
Tested through unit tests, integ tests and manual testing through the installation file.

Yes

* Address PR comment

* Codestyle fixes

* Minor fix

* Codestyle fixes

* Fix Unit tests
* chore: add huggingface images

* chore: add tei 1.6 image

* chore: add tei 1.6.0 to tei mapping in tests
aws#5098)

Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.13.2 to 2.20.3.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
- [Commits](mlflow/mlflow@v2.13.2...v2.20.3)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.13.2 to 2.20.3.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
- [Commits](mlflow/mlflow@v2.13.2...v2.20.3)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-version: 2.20.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.3.2 to 1.5.1.
- [Release notes](https://github.com/scikit-learn/scikit-learn/releases)
- [Commits](scikit-learn/scikit-learn@1.3.2...1.5.1)

---
updated-dependencies:
- dependency-name: scikit-learn
  dependency-version: 1.5.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Improve error logging and documentation for issue 4007

* Add hyperlink to RTDs