Mlx 1269 #1

HCharlie · 2024-03-25T20:22:01Z

Issue #, if available:

Description of changes:

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the CONTRIBUTING doc
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
I used the commit message format described in CONTRIBUTING
I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
I have checked that my tests are not configured for a specific region or account (if appropriate)
I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

… and deployment configs (aws#1572)

* add in-process mode for DJL server * fix format * add inference_spec as a member of DJL * add the validations for model server * fix typo * fix test assertion * add unit-testing * have a common server for inprocess mode * fix failing tests * add support to torchserve * fix tests to include torchserve servers * use custom inference_spec code instead of HF pipelines * fix tests for app.py * fix unit test failure * fix format * use schema_builder for serialization and deserialization * remove task field * remove unused import

* Base model trainer (aws#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (aws#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (aws#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (aws#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (aws#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (aws#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (aws#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Change to make Model Trainer return a Model Object * Fix * Cleanup * Support intelligent parameters (aws#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (aws#1541) * Cleanup ModelTrainer (aws#1542) * General image builder (aws#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Latest Container Image (aws#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Cleanup ModelTrainer code (aws#1552) * Updates * feat: add 
pre-processing and post-processing logic to inference_spec (aws#1560) * add pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Add Distributed Training Support Model Trainer (aws#1536) * Add path to set Additional Settings in ModelTrainer (aws#1555) * Updates * Mask Sensitive Env Logs in Container (aws#1568) * Cleanup PR * Codestyle fixes * Update logic to use model parameter instead of model_path * Fixes * Fixes * Tests * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>

Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>

* Base model trainer (aws#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (aws#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (aws#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (aws#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (aws#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (aws#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (aws#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Support intelligent parameters (aws#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (aws#1541) * Cleanup ModelTrainer (aws#1542) * General image builder (aws#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Latest Container Image (aws#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Cleanup ModelTrainer code (aws#1552) * feat: add pre-processing and post-processing logic to inference_spec (aws#1560) * add 
pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Add Distributed Training Support Model Trainer (aws#1536) * Add path to set Additional Settings in ModelTrainer (aws#1555) * Support building image from Dockerfile * Fix test * Fix test * Rename functions --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com> Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>

* Base model trainer (aws#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (aws#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (aws#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (aws#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (aws#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (aws#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (aws#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Support intelligent parameters (aws#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (aws#1541) * Cleanup ModelTrainer (aws#1542) * Initial Prototype * General image builder (aws#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Unified deploying in ModelBuilder * Latest Container Image (aws#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Address PR comments * Address Codestyle errors * Cleanup ModelTrainer code 
(aws#1552) * Black format * Codestyle changes * Codestyle changes * from __future__ import absolute_import * DocString formatting * Black formatting * Address PR comments * Noteboook changes and fixes * feat: add pre-processing and post-processing logic to inference_spec (aws#1560) * add pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Add Distributed Training Support Model Trainer (aws#1536) * Add path to set Additional Settings in ModelTrainer (aws#1555) * Checkstyle Fixes * Address PR comments * Fixes * Merge Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Update Docstring --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>

* Base model trainer (aws#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (aws#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (aws#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (aws#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (aws#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (aws#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (aws#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Support intelligent parameters (aws#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (aws#1541) * Cleanup ModelTrainer (aws#1542) * General image builder (aws#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Latest Container Image (aws#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Cleanup ModelTrainer code (aws#1552) * Single container local mode training * Add wait argument * Implement helper funtions * Add helper 
functions * Fix bugs * Fix codestyle * feat: add pre-processing and post-processing logic to inference_spec (aws#1560) * add pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Fix test and codestyle * Add Distributed Training Support Model Trainer (aws#1536) * Add tests * Add path to set Additional Settings in ModelTrainer (aws#1555) * Added example notebook * Fix codestyle * Address comments * resolve merge conflict * Support multi container local training (aws#1576) * Fix codestyle * Mask Sensitive Env Logs in Container (aws#1568) * Fix bug in script mode setup ModelTrainer (aws#1575) * Support multi container local training * Merge branch 'single_container_local_training' into multi_container_local_training * Update unit tests --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> * Remove LocalTrainingJob class * Bypass pydantic check * Add example --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com> Co-authored-by: Pravali Uppugunduri <46845440+pravali96@users.noreply.github.com>

* add inference morpheus nbs * update the in process notebook

…#1583) * Fix: move the functionality from latest_container_image to retrieve * address some comments from Gokul and add unit test * remove extra functions and rewrite the test * fix unit test * fix for other unit test * unit test fix * fix unit test: add one more condition * more unit tests fix * remove redundant files --------- Co-authored-by: Chad Chiang <chadchc@amazon.com> Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>

* Fix: move the functionality from latest_container_image to retrieve * address some comments from Gokul and add unit test * remove extra functions and rewrite the test * fix unit test * fix for other unit test * unit test fix * fix unit test: add one more condition * more unit tests fix * remove redundant files * remove the special condition and fix the unit test --------- Co-authored-by: Chad Chiang <chadchc@amazon.com> Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>

* Notebooks update for Bugbash * Testing and updates * Testing and updates * Addressed comments * Fix * Fix

* Fix deepdiff dependencies * trigger tests

* change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * documentation: Removed a line about python version requirements of training script which can misguide users.Training script can be of latest version based on the support provided by framework_version of the container * feature: Enabled update_endpoint through model_builder * fix: fix unit test, black-check, pylint errors * fix: fix black-check, pylint errors * fix:Added handler for pipeline variable while creating process job * fix: Added handler for pipeline variable while creating process job * Revert the PR changes: aws#5122, due to issue https://t.corp.amazon.com/P223568185/overview * Fix: fix the issue, https://t.corp.amazon.com/P223568185/communication --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* fix: tgi image uri unit tests * fix: black-format and flake8 failures * fix: parse * fix: print statement --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>

…aws#5123) * clean up * bump maxdepth for doc/api/training to fix readthedocs * change maxdepth for readthedocs rendering doc/api/training page * change maxdepth for readthedocs rendering doc/api/training page * change maxdepth for readthedocs rendering doc/api/training page

* change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * change: Allow telemetry only in supported regions * documentation: Removed a line about python version requirements of training script which can misguide users.Training script can be of latest version based on the support provided by framework_version of the container * feature: Enabled update_endpoint through model_builder * fix: fix unit test, black-check, pylint errors * fix: fix black-check, pylint errors * fix:Added handler for pipeline variable while creating process job * fix: Added handler for pipeline variable while creating process job * Revert the PR changes: aws#5122, due to issue https://t.corp.amazon.com/P223568185/overview * Fix: fix the issue, https://t.corp.amazon.com/P223568185/communication * Revert PR 5122 changes, due to issues with other processor codeflows --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> Co-authored-by: Zhaoqi <jzhaoqwa@amazon.com>

…ws#5144) * add s3 uri check to modeltrainer data source * update ModelTrainer to support s3 uri and tar.gz file as source_dir * black-format * add unit and integ tests * update logic and unit test to raise value error if the file is not .tar.gz

…image. (aws#5143) * feature:support custom workflow deployment in ModelBuilder using SMD image. (aws#1661) * feature:support custom workflow deployment in ModelBuilder using SMD inference image. * Rename test case and pass session. * Address PR comments. * Tweak resource cleanup logic in integ test. * Fixing CodeBuild integ test failures. * Renamed integ test. * Remove unused integ test, restore once GA. --------- Co-authored-by: Joseph Zhang <cjz@amazon.com> * Cache client as instance attribute in property@ decorator. (aws#1668) * Remove property@ decorator from ABC definition. * Cache client as instance attribute in @Property. * Fix flake8 issue. --------- Co-authored-by: Joseph Zhang <cjz@amazon.com> * Bugfixes from e2e testing. (aws#1670) * Fix Alabtross Inference component tests * trigger integ tests --------- Co-authored-by: cj-zhang <32367995+cj-zhang@users.noreply.github.com> Co-authored-by: Joseph Zhang <cjz@amazon.com> Co-authored-by: Pravali Uppugunduri <upravali@amazon.com>

…ws#5149) Co-authored-by: Namrata Madan <nmmadan@amazon.com>

Co-authored-by: adishaa <adishaa@amazon.com>

…5146) * Fix Flake8 Violations * Add Owner ID check for bucket with path when prefix is provided **Description** Previously we called head_bucket to perform the owner ID check, but this doesn't cover cases where the S3 path is provided through a prefix. This change makes sure that directory-level permissions are supported. **Testing Done** Tested through unit tests, integ tests and manual testing through the installation file. Yes * Address PR comment * Codestyle fixes * Minor fix * Codestyle fixes * Fix Unit tests
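
The owner ID check described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual implementation: the helper names `bucket_owner_matches` and `_error_code` are hypothetical, and the sketch assumes a boto3-style S3 client whose `head_bucket` and `list_objects_v2` calls accept `ExpectedBucketOwner` and raise an exception carrying a `response["Error"]["Code"]` entry.

```python
from typing import Any, Optional


def _error_code(exc: Exception) -> str:
    """Extract the S3 error code from a botocore-style ClientError, if present."""
    response = getattr(exc, "response", None) or {}
    return str(response.get("Error", {}).get("Code", ""))


def bucket_owner_matches(
    s3_client: Any, bucket: str, expected_owner: str, prefix: Optional[str] = None
) -> bool:
    """Hypothetical helper: True if `bucket` (optionally under `prefix`) is
    reachable and owned by `expected_owner`, False on an access-denied reply."""
    try:
        if prefix:
            # Prefix given: probe the prefix itself, so permissions granted
            # only below that path are still honoured.
            s3_client.list_objects_v2(
                Bucket=bucket,
                Prefix=prefix,
                MaxKeys=1,
                ExpectedBucketOwner=expected_owner,
            )
        else:
            # No prefix: a bucket-level check is enough.
            s3_client.head_bucket(Bucket=bucket, ExpectedBucketOwner=expected_owner)
    except Exception as exc:
        # S3 answers an owner mismatch or a missing permission with 403.
        if _error_code(exc) in ("403", "AccessDenied"):
            return False
        raise  # Anything else (missing bucket, network error) is not our call.
    return True
```

With boto3 this would be called as `bucket_owner_matches(boto3.client("s3"), "my-bucket", "111122223333", prefix="models/")`, where the bucket name, account ID, and prefix are placeholders.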

* chore: add huggingface images * chore: add tei 1.6 image * chore: add tei 1.6.0 to tei mapping in tests

aws#5098) Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.13.2 to 2.20.3. - [Release notes](https://github.com/mlflow/mlflow/releases) - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md) - [Commits](mlflow/mlflow@v2.13.2...v2.20.3) --- updated-dependencies: - dependency-name: mlflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.13.2 to 2.20.3. - [Release notes](https://github.com/mlflow/mlflow/releases) - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md) - [Commits](mlflow/mlflow@v2.13.2...v2.20.3) --- updated-dependencies: - dependency-name: mlflow dependency-version: 2.20.3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.3.2 to 1.5.1. - [Release notes](https://github.com/scikit-learn/scikit-learn/releases) - [Commits](scikit-learn/scikit-learn@1.3.2...1.5.1) --- updated-dependencies: - dependency-name: scikit-learn dependency-version: 1.5.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Improve error logging and documentation for issue 4007 * Add hyperlink to RTDs

HCharlie force-pushed the MLX-1269 branch from 60a2a03 to 2f5345a Compare March 25, 2024 21:26

HCharlie force-pushed the MLX-1269 branch 3 times, most recently from ef346b1 to 3112627 Compare May 2, 2024 11:45

benieric and others added 26 commits December 4, 2024 04:38

Add path to set Additional Settings in ModelTrainer (aws#1555)

ae3e3d2

Mask Sensitive Env Logs in Container (aws#1568)

18d3cda

Fix bug in script mode setup ModelTrainer (aws#1575)

869b75f

Feature: ModelBuilder supports HuggingFace Models with benchmark data…

81ecffa

… and deployment configs (aws#1572)

Simplify Config Class Names and DistributedRunner structures (aws#1573)

63027f5

Remove ignored files

fff8cdd

Pass hyperparameters as CLI args (aws#1577)

694f8e9

Add Support for Training Recipes (aws#1565)

f28f814

Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com>

Use exact python path in trainer template (aws#1584)

03a3ac7

Add recipes examples (aws#1582)

856b192

update notebooks (aws#1588)

5821b1a

update notebooks (aws#1592)

eced212

Bug fixes (aws#1596)

7b564ad

Update ModelTrainer Notebooks (aws#1597)

0ebc6d8

add inference morpheus nbs (aws#1594)

3bb4fbb

* add inference morpheus nbs * update the in process notebook

Add bugbash bootstrapping (aws#1598)

107f8e9

Notebooks update for Bugbash (aws#1595)

26e19f2

* Notebooks update for Bugbash * Testing and updates * Testing and updates * Addressed comments * Fix * Fix

Add Rich Logging to Model Builder (aws#1604)

c807e0d

Fix: codestyles (aws#1606)

47d2c66

ci and others added 30 commits April 11, 2025 01:19

prepare release v2.243.1

2bb8c78

update development version to v2.243.2.dev0

2f86ad9

Fix deepdiff dependencies (aws#5128)

99b1b81

* Fix deepdiff dependencies * trigger tests

fix: tgi image uri unit tests (aws#5127)

92efc09

* fix: tgi image uri unit tests * fix: black-format and flake8 failures * fix: parse * fix: print statement --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com>

prepare release v2.243.2

29bdeb4

update development version to v2.243.3.dev0

27e5208

change: update image_uri_configs 04-11-2025 07:18:19 PST

ba6323f

change: update image_uri_configs 04-15-2025 07:18:10 PST

f225b85

change: update image_uri_configs 04-16-2025 07:18:18 PST

6b96afa

update pr test to deprecate py38 and add py312 (aws#5133)

79c4ddd

update readme to reflect py312 upgrade

ba559e6

prepare release v2.243.3

57f483d

update development version to v2.243.4.dev0

201500c

chore: add huggingface images (aws#5142)

15cb303

fix: pin mamba version to 24.11.3-2 to avoid inconsistent test runs (a…

0dae5c9

…ws#5149) Co-authored-by: Namrata Madan <nmmadan@amazon.com>

Add model server timeout (aws#5151)

a896bc6

Co-authored-by: adishaa <adishaa@amazon.com>

prepare release v2.244.0

87372db

update development version to v2.244.1.dev0

85056eb

chore: Add tei 1.6.0 image (aws#5145)

bb803c9

* chore: add huggingface images * chore: add tei 1.6 image * chore: add tei 1.6.0 to tei mapping in tests

Improve error logging and documentation for issue 4007 (aws#5153)

e747b03

* Improve error logging and documentation for issue 4007 * Add hyperlink to RTDs

Merge branch 'master' into MLX-1269

c9d8ced


45 participants
