Add methods to create data generation specs from files #310
base: master
    return result

@staticmethod
def fromDict(options):
Make sure to have explicit tests for this, covering the following use cases:
1 - with simple options
2 - with composite (object-valued) options
See the examples on the following page for object-valued options, e.g. DateRange and Distribution objects.
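The two test cases requested above could look roughly like the sketch below. It is only an illustration under stated assumptions: `DateRangeStub`, `OBJECT_KINDS`, `resolve_option`, and the `"kind"` tag are hypothetical stand-ins, not dbldatagen APIs, and the PR may encode object-valued options differently.

```python
from dataclasses import dataclass


@dataclass
class DateRangeStub:
    """Hypothetical stand-in for an object-valued option such as a date range."""
    begin: str
    end: str
    interval: str


# Hypothetical registry mapping a "kind" tag to the class to construct.
OBJECT_KINDS = {"DateRange": DateRangeStub}


def resolve_option(value):
    """Build an object from a tagged dict; pass simple values through unchanged."""
    if isinstance(value, dict) and "kind" in value:
        kwargs = {k: v for k, v in value.items() if k != "kind"}
        return OBJECT_KINDS[value["kind"]](**kwargs)
    return value


def test_simple_options():
    # Case 1: only scalar values, so nothing should change.
    options = {"colName": "id", "minValue": 1, "maxValue": 10}
    resolved = {k: resolve_option(v) for k, v in options.items()}
    assert resolved == options


def test_composite_options():
    # Case 2: one option is itself an object described by a nested dict.
    options = {
        "colName": "event_ts",
        "dataRange": {
            "kind": "DateRange",
            "begin": "2024-01-01 00:00:00",
            "end": "2024-12-31 23:59:59",
            "interval": "days=1",
        },
    }
    resolved = {k: resolve_option(v) for k, v in options.items()}
    assert isinstance(resolved["dataRange"], DateRangeStub)
    assert resolved["dataRange"].interval == "days=1"


test_simple_options()
test_composite_options()
```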
dbldatagen/data_generator.py
Outdated
    return DataGenerator(**options)

@staticmethod
def fromFile(path):
Don't add `fromFile` as a method, as `open` does not support reading a file from a Databricks workspace or DBFS.
dbldatagen/data_generator.py
Outdated
    raise ValueError("File type must be '.json' or '.yml'")

@staticmethod
def fromJson(path):
Rather than taking a path, pass a string containing the definition to the method.
The calling code should be responsible for loading the string;
it could come from DBFS, from a database, or from Unity Catalog.
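A minimal sketch of the string-based approach, using only the standard library. `spec_from_json` is a hypothetical helper, and the `generator`/`columns` layout of the definition is an assumption modelled on the variable names in the diff; the real method would presumably return a `DataGenerator` rather than a tuple.

```python
import json


def spec_from_json(text):
    """Parse a JSON definition string into (generator options, column options).

    Hypothetical helper: it takes the definition itself, not a path, so the
    caller decides where the string is loaded from.
    """
    spec = json.loads(text)
    generator = spec.get("generator", {})
    columns = spec.get("columns", [])
    return generator, columns


# The definition is an ordinary string; the option names are illustrative.
definition = """
{
  "generator": {"name": "test_spec", "rows": 1000, "randomSeed": 42},
  "columns": [
    {"colName": "id", "colType": "long"},
    {"colName": "code", "colType": "int", "minValue": 1, "maxValue": 10}
  ]
}
"""

generator, columns = spec_from_json(definition)
```

Because the function never touches the filesystem, the same call works whether the string was read from DBFS, a table, or a literal in a test.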
dbldatagen/data_generator.py
Outdated
    return DataGenerator.fromDict(generator).withColumns(columns)

@staticmethod
def fromYaml(path):
Rather than taking a path, pass a string containing the definition to the method.
The calling code should be responsible for loading the string;
it could come from DBFS, from a database, or from Unity Catalog.
SQL expression.
To enforce the dependency, you must use the `baseColumn` attribute to indicate the dependency.

Creating data generation specs from files
This should be creating data specs from string-based YAML or JSON.
We should also have the capability to write to JSON and YAML.
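The write direction could be sketched as follows, again with the standard library only. `spec_to_json` and `spec_from_json` are hypothetical names, and the `generator`/`columns` layout is an assumption; a YAML variant would be analogous but needs a third-party parser.

```python
import json


def spec_to_json(generator, columns):
    """Serialize generator and column options to a JSON string (hypothetical helper)."""
    return json.dumps({"generator": generator, "columns": columns},
                      indent=2, sort_keys=True)


def spec_from_json(text):
    """Inverse of spec_to_json: parse the string back into option dictionaries."""
    spec = json.loads(text)
    return spec.get("generator", {}), spec.get("columns", [])


# Illustrative option values; a round trip should return them unchanged.
generator = {"name": "test_spec", "rows": 1000, "randomSeed": 42}
columns = [{"colName": "id", "colType": "long"}]

text = spec_to_json(generator, columns)
round_tripped = spec_from_json(text)
```

A round-trip check like this is a cheap test that reading and writing stay in sync.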
@ronanstokes-db the code is done. I will update the docs.
tests/test_quick_tests.py
Outdated
    assert gen_from_dict.randomSeed == dg_spec.get("randomSeed")

def test_generation_from_file(self):
    path = "tests/files/test_generator_spec.json"
If we use string-based APIs, they'll be more general. You can also simply write the definitions as multi-line strings rather than requiring separate data files.
Proposed changes
Added several methods to support creating `DataGenerator` and `ColumnGenerationSpec` objects from Python dictionaries and JSON/YAML files.
Types of changes
What types of changes does your code introduce to dbldatagen?
Put an `x` in the boxes that apply
Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help!
This is simply a reminder of what we are going to look for before merging your code.
Further comments
I added several methods:
- `withColumns` adds `ColumnGenerationSpec` objects via a list of dictionaries; it iteratively passes the dictionary values as arguments to `withColumn`
- `fromDict` creates a `DataGenerator` from a dictionary by passing the values as arguments to the constructor
- `fromJson` allows users to create a `DataGenerator` and add `ColumnGenerationSpecs` from a JSON file
- `fromYaml` allows users to create a `DataGenerator` and add `ColumnGenerationSpecs` from a YAML file
- `fromFile` wraps both `fromJson` and `fromYaml` into a single API
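The behaviour described for `withColumns` and `fromDict` can be illustrated with a small stand-in class. `GeneratorSketch` is hypothetical and only records options; it is not the PR's implementation, and the constructor parameters shown are assumptions chosen for the example.

```python
class GeneratorSketch:
    """Hypothetical stand-in for DataGenerator that records options only."""

    def __init__(self, name=None, rows=0, randomSeed=None):
        self.name = name
        self.rows = rows
        self.randomSeed = randomSeed
        self.columns = []

    def withColumn(self, colName, colType="string", **kwargs):
        # Record the column options; a real generator would build a column spec.
        self.columns.append({"colName": colName, "colType": colType, **kwargs})
        return self

    def withColumns(self, columns):
        # Pass each dictionary's entries as keyword arguments to withColumn.
        for options in columns:
            self.withColumn(**options)
        return self

    @staticmethod
    def fromDict(options):
        # Pass the dictionary's entries as keyword arguments to the constructor.
        return GeneratorSketch(**options)


gen = GeneratorSketch.fromDict({"name": "test_spec", "rows": 100, "randomSeed": 42})
gen.withColumns([
    {"colName": "id", "colType": "long"},
    {"colName": "code", "colType": "int", "minValue": 1, "maxValue": 10},
])
```

Returning `self` from `withColumns` keeps the chained form shown in the diff, `fromDict(generator).withColumns(columns)`, working.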