DOC-2975: Added new page 'Load from Apache Iceberg' [4.3] #875
base: 4.3.0-dev
Conversation
Overall lgtm. But there are some partials that can be referenced and reused, e.g., https://github.com/tigergraph/server-docs/blob/4.2/modules/data-loading/pages/load-from-cloud.adoc?plain=1#L32
| :toclevels: 4
| = Load from Apache Iceberg
|
| In version *4.3*, TigerGraph introduces a connector to load data from an *Apache Iceberg*. This connector allows you to load data stored in *AWS S3* or *MinIO* buckets that are managed by an *Iceberg REST Catalog*.
Original:
| In version *4.3*, TigerGraph introduces a connector to load data from an *Apache Iceberg*. This connector allows you to load data stored in *AWS S3* or *MinIO* buckets that are managed by an *Iceberg REST Catalog*.
Suggested:
| TigerGraph 4.3 adds *Apache Iceberg* to its collection of high-speed built-in connectors. This connector allows you to load data stored in *AWS S3* or *MinIO* buckets that are managed by an *Iceberg REST Catalog*.
Reasons for the change:
- We want to emphasize that we are expanding an existing family, not adding something that is totally new. Writing our documentation as though everything is new is one reason why it is difficult to read: it reads like a large number of individual features rather than a smooth, logical fabric.
- Avoid talking about TigerGraph the company; talk about TigerGraph DB the product. Notice how the meaning of "TigerGraph" differs between the two versions. If this were marketing material, talking about the company would be fine, but this is technical documentation, not marketing.
- We do not say "an Iceberg", just as we do not say "a Spark".
| This guide shows how to connect your data source, create a loading job, and manage it effectively.
|
| == Build Your Graph Foundation
Original:
| == Build Your Graph Foundation
Suggested:
| == Build Your Graph Schema
Both TigerGraph and Iceberg use the term "schema". "Foundation" is not a standard term for either product, so why introduce a new concept that isn't needed?
| aws.s3.access_key: admin,
| aws.s3.secret_key: password,
| aws.client.region: us-east-1,
| tasks.max: 2
In our current documentation, tasks.max is a filename parameter and can only be configured when defining a FILENAME.
Is the ability to define tasks.max in a DATA SOURCE a new or old feature? Does it apply to all connectors or just Iceberg?
Does it apply to all filename parameters or just one (or a few)? Which ones?
This is the problem with documenting by "example". You actually create more questions than you answer. The better approach is to briefly explain the feature (or option) and then show an example. Or, show an example and then explain it. At some point, provide all the important details for what is/isn't supported.
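To illustrate the contrast, this is roughly how `tasks.max` appears in the current documentation, as a per-FILENAME parameter rather than a DATA_SOURCE setting (a sketch; the object names here are illustrative, not from the PR):

```gsql
// tasks.max as a filename parameter, per the existing docs --
// distinct from the tasks.max inside the DATA_SOURCE JSON shown above.
DEFINE FILENAME f_person = "$s1:{query: 'SELECT personId, id, gender FROM iceberg_connector.person', tasks.max: 2}";
```

If both placements are now supported, the page should say which one wins when they conflict.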
| LOAD f1 TO VERTEX person VALUES ($0, $1, $2);
| }
|
| - *8 Tasks from Job*: Use 8 tasks for larger data.
Why these numbers, 2 and 8?
1 is the default, right? Is there a global max?
| [source,gsql]
| CREATE LOADING JOB loadSocialNet FOR GRAPH socialNet {
| DEFINE FILENAME f1 = "$s1:SELECT personId, id, gender FROM iceberg_connector.person WHERE gender = 'male'";
Is iceberg_connector a built-in name, or is it a user-defined name? If it is user-defined, where and when would it be defined?
| Create a loading job to turn Iceberg data into your graph’s vertices and edges. It involves defining data sources and mapping the data.
|
| === Quick Loading Job Example
You should include the Iceberg schema. A user cannot really learn from a loading example if you are only showing the schema of one half (either the source schema or the target graph schema). We need to see both schemas.
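For instance, the example could pair the two schemas like this (a sketch only; the Iceberg column types are assumptions inferred from the SELECT list in the PR, not taken from it):

```gsql
// Source side (Iceberg table, shown for reference as SQL DDL):
//   CREATE TABLE iceberg_connector.person (personId BIGINT, id STRING, gender STRING);
//
// Target side (graph schema the loading job maps into):
CREATE VERTEX person (PRIMARY_ID personId UINT, id STRING, gender STRING)
CREATE GRAPH socialNet (person)
```

With both halves visible, the reader can see exactly how `$0, $1, $2` in the LOAD statement line up with the source columns.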
| CREATE DATA_SOURCE s1 = """
| {
| type: iceberg,
| iceberg.catalog.type: rest,
| iceberg.catalog.uri: http://rest:8181,
| aws.s3.endpoint: http://minio:9000,
| aws.s3.access_key: accesskey,
| aws.s3.secret_key: password,
| aws.client.region: us-east-1
| }""" FOR GRAPH socialNet
Other TigerGraph connectors let you put this configuration JSON in a file. I assume that is also supported here?
If this is in a file, I suspect the triple quotes are omitted, but I'm not sure.
@pingxieTG Could you please clarify this point?
Yes, it supports a JSON file as well. It is no different from the way other data sources are created. @Tushar-TG-14
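Then the page should probably show the file-based form too. Assuming it follows the same pattern as the other connectors (the file path below is illustrative), something like:

```gsql
// Same configuration, but read from a JSON file instead of an inline string.
// Assumption: the file contains the bare JSON object, without triple quotes.
CREATE DATA_SOURCE s1 = "/home/tigergraph/iceberg_config.json" FOR GRAPH socialNet
```

A one-line note confirming whether the triple quotes are omitted in the file would settle the question raised above.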
| LOAD f1 TO VERTEX person VALUES ($0, $1, $2);
| }
|
| == Define Your Data Files
This section confused me because it repeats things that were shown in the examples above.
After reading further, I eventually realized that you were breaking down the full loading process step by step, which was all lumped together in the earlier examples.
Please add introductory or transitional sentences, and some sort of main heading such as "Connector Setup and Loading - Step by Step" to tell the user where you are leading them. Otherwise, the reader doesn't understand how the sections relate to one another.
| [source,gsql]
| DEFINE FILENAME query_person = "$s1:SELECT personId, id, gender FROM iceberg_connector.person";
| DEFINE FILENAME bq_inline_json = "$s1:myfile.json";
| DEFINE FILENAME query_person = "$s1:{query: 'SELECT personId, id, gender FROM iceberg_connector.person WHERE gender = 'male'', tasks.max: 2}";
This example reuses the filename object name query_person, which was already used above. It should be changed to be unique.
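For example, renaming the third definition (the new name here is just a suggestion):

```gsql
DEFINE FILENAME query_person = "$s1:SELECT personId, id, gender FROM iceberg_connector.person";
DEFINE FILENAME bq_inline_json = "$s1:myfile.json";
DEFINE FILENAME query_male_person = "$s1:{query: 'SELECT personId, id, gender FROM iceberg_connector.person WHERE gender = 'male'', tasks.max: 2}";
```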
| == Connect Your Data Source
|
| Configure a data source object to connect TigerGraph to your Apache Iceberg storage (S3 or MinIO). This involves specifying connection details using JSON.
"Using JSON" is very vague. If you say that, I expect some explanation pretty quickly; otherwise you leave me hanging. So it's actually better not to say that yet.
The structure of the existing data connection documentation is logical; it just needs to be streamlined.
The logic:
- Create a Data Source Object (instead of Connect Your Data Source)
** Specify configuration parameters in a JSON object
*** use a small JSON example to show the format and how it can be specified either inline or in a separate file
*** show the tables of all the configuration parameters
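Concretely, the "small JSON example" in that outline could be as minimal as this (trimmed from the full example earlier in the PR; values are the sample values, not defaults):

```
{
  type: iceberg,
  iceberg.catalog.type: rest,
  iceberg.catalog.uri: http://rest:8181,
  aws.s3.endpoint: http://minio:9000
}
```

Then the parameter tables can carry the full list, and the inline-vs-file note covers how this object is supplied.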