diff --git a/pages/data-migration.mdx b/pages/data-migration.mdx
index c3478a1b4..911a2102f 100644
--- a/pages/data-migration.mdx
+++ b/pages/data-migration.mdx
@@ -15,7 +15,7 @@ instance. Whether your data is structured in files, relational databases, or other graph databases, Memgraph provides the flexibility to integrate and analyze your data efficiently.
-Memgraph supports file system imports like CSV files, offering efficient and
+Memgraph supports file system imports like Parquet and CSV files, offering efficient and
structured data ingestion. **However, if you want to migrate directly from another data source, you can use the [`migrate` module](/advanced-algorithms/available-algorithms/migrate)** from Memgraph MAGE
@@ -31,6 +31,11 @@ In order to learn all the pre-requisites for importing data into Memgraph, check
## File types
+### Parquet files
+
+Parquet files can be imported efficiently from the local disk and from S3-compatible storage using the
+[LOAD PARQUET clause](/querying/clauses/load-parquet).
+
### CSV files
CSV files provide a simple and efficient way to import tabular data into Memgraph
@@ -262,4 +267,4 @@ nonsense or sales pitch, just tech. /> - \ No newline at end of file +
diff --git a/pages/data-migration/_meta.ts b/pages/data-migration/_meta.ts
index 454863979..fc73210b5 100644
--- a/pages/data-migration/_meta.ts
+++ b/pages/data-migration/_meta.ts
@@ -1,6 +1,7 @@ export default { "best-practices": "Best practices", "csv": "CSV", + "parquet": "PARQUET", "json": "JSON", "cypherl": "CYPHERL", "migrate-from-neo4j": "Migrate from Neo4j",
diff --git a/pages/data-migration/best-practices.mdx b/pages/data-migration/best-practices.mdx
index 2d8e742a9..c5384b82a 100644
--- a/pages/data-migration/best-practices.mdx
+++ b/pages/data-migration/best-practices.mdx
@@ -572,4 +572,4 @@ For more information about `Delta` objects, check the information on the [IN_MEMORY_TRANSACTIONAL storage mode](/fundamentals/storage-memory-usage#in-memory-transactional-storage-mode-default). - \ No newline at end of file +
diff --git a/pages/data-migration/parquet.mdx b/pages/data-migration/parquet.mdx
new file mode 100644
index 000000000..39d0b39c4
--- /dev/null
+++ b/pages/data-migration/parquet.mdx
@@ -0,0 +1,247 @@
---
title: Import data from Parquet files
description: Leverage Parquet files in Memgraph operations. Our detailed guide simplifies the process for an enhanced graph computing journey.
---

import { Callout } from 'nextra/components'
import { Steps } from 'nextra/components'
import { Tabs } from 'nextra/components'

# Import data from Parquet files

The data from Parquet files can be imported using the [`LOAD PARQUET` Cypher clause](#load-parquet-cypher-clause) from the local disk and from S3-compatible storage.

## `LOAD PARQUET` Cypher clause

The `LOAD PARQUET` clause uses a background thread that reads column batches, assembles batches of 64K rows, and puts them on a queue from which the main thread pulls the data. The main thread then reads row by row from the queue, binds the contents of each parsed row to the specified variable, and either populates an empty database or appends new data to an existing dataset.

### `LOAD PARQUET` clause syntax

The syntax of the `LOAD PARQUET` clause is:

```cypher
LOAD PARQUET FROM <parquet-location> ( WITH CONFIG configs=configMap )? AS <variable-name>
```

- `<parquet-location>` is a string specifying the location of the Parquet file. Without an `s3://` prefix, it refers to a path on the local disk; with an `s3://` prefix, it pulls the file with the specified URI from S3-compatible storage. There are no restrictions on where in your file system the file can be located, as long as the path is valid (i.e., the file exists). If you are using Docker to run Memgraph, you will need to [copy the files from your local directory into the Docker container](/getting-started/first-steps-with-docker#copy-files-from-and-to-a-docker-container) where Memgraph can access them.
- `configMap` is an optional configuration map through which you can specify configuration options: `aws_region`, `aws_access_key`, `aws_secret_key` and `aws_endpoint_url`.
  - `aws_region`: The region in which your S3 service is located.
  - `aws_access_key`: Access key used to connect to the S3 service.
  - `aws_secret_key`: Secret key used to connect to the S3 service.
  - `aws_endpoint_url`: Optional parameter that can be used to set the URL of the S3-compatible storage.
- `<variable-name>` is a symbolic name representing the variable to which the contents of the parsed row will be bound, enabling access to the row contents later in the query. The variable doesn't have to be used in any subsequent clause.

### `LOAD PARQUET` clause specificities

When using the `LOAD PARQUET` clause, please keep in mind:

- The parser parses values into their appropriate types, so you should get the same type as in the Parquet file. Types `BOOL`, `INT8`, `INT16`, `INT32`, `INT64`, `UINT8`, `UINT16`, `UINT32`, `UINT64`, `HALF_FLOAT`, `FLOAT`, `DOUBLE`, `STRING`, `LARGE_STRING`, `STRING_VIEW`, `DATE32`, `DATE64`, `TIME32`, `TIME64`, `TIMESTAMP`, `DURATION`, `DECIMAL128`, `DECIMAL256`, `BINARY`, `LARGE_BINARY`, `FIXED_SIZE_BINARY`, `LIST` and `MAP` are supported. Unsupported types will be saved as strings in Memgraph.

- Authentication parameters (`aws_region`, `aws_access_key`, `aws_secret_key` and `aws_endpoint_url`) can be provided in the `LOAD PARQUET` query using the `WITH CONFIG` construct, through environment variables (`AWS_REGION`, `AWS_ACCESS_KEY`, `AWS_SECRET_KEY` and `AWS_ENDPOINT_URL`), and through run-time database settings. To set authentication parameters through run-time settings, use the `SET DATABASE SETTING <setting-key> TO <value>;` query. The keys of these authentication parameters are `aws.access_key`, `aws.region`, `aws.secret_key` and `aws.endpoint_url`. An example of both approaches is shown at the end of this page.

- **The `LOAD PARQUET` clause is not a standalone clause**, meaning a valid query must contain at least one more clause, for example:

  ```cypher
  LOAD PARQUET FROM "/people.parquet" AS row
  CREATE (p:People) SET p += row;
  ```

  In this regard, the following query will throw an exception:

  ```cypher
  LOAD PARQUET FROM "/file.parquet" AS row;
  ```

  **Adding a `MATCH` or `MERGE` clause before `LOAD PARQUET`** allows you to match certain entities in the graph before running `LOAD PARQUET`, optimizing the process as matched entities do not need to be searched for every row in the Parquet file.

  However, the `MATCH` or `MERGE` clause can be used before the `LOAD PARQUET` clause only if it returns a single row. Returning multiple rows before calling the `LOAD PARQUET` clause will cause a Memgraph runtime error.

- **The `LOAD PARQUET` clause can be used at most once per query**, so queries like the one below will throw an exception:

  ```cypher
  LOAD PARQUET FROM "/x.parquet" AS x
  LOAD PARQUET FROM "/y.parquet" AS y
  CREATE (n:A {p1 : x, p2 : y});
  ```

### Increase import speed

The `LOAD PARQUET` clause will create relationships much faster, and consequently speed up data import, if you [create indexes](/fundamentals/indexes) on nodes or node properties once you import them:

```cypher
CREATE INDEX ON :Node(id);
```

If the `LOAD PARQUET` clause is merging data instead of creating it, create indexes before running the `LOAD PARQUET` clause.

The construct `USING PERIODIC COMMIT <number>` also improves the import speed because it optimizes some of the memory allocation patterns. In our benchmarks, this construct speeds up the execution by 25% to 35%.

```cypher
USING PERIODIC COMMIT 1024 LOAD PARQUET FROM "/x.parquet" AS x
CREATE (n:A {p1 : x.p1, p2 : x.p2});
```

You can also speed up import if you switch Memgraph to [**analytical storage mode**](/fundamentals/storage-memory-usage#storage-modes). In the analytical storage mode there are no ACID guarantees besides manually created snapshots. After import you can switch the storage mode back to transactional and enable ACID guarantees.

You can switch between modes within the session using the following query:

```cypher
STORAGE MODE IN_MEMORY_{TRANSACTIONAL|ANALYTICAL};
```

If you use `IN_MEMORY_ANALYTICAL` mode and have nodes and relationships stored in separate Parquet files, you can run multiple concurrent `LOAD PARQUET` queries to import data even faster. In order to achieve the best import performance, split your nodes and relationships files into smaller files and run multiple `LOAD PARQUET` queries in parallel. The key is to run all `LOAD PARQUET` queries which create nodes first. After that, run all `LOAD PARQUET` queries that create relationships.

### Import multiple Parquet files with distinct graph objects

In this example, the data is split across four files, and each file contains nodes of a single label or relationships of a single type.

{

Parquet files

} + + - [`people_nodes.parquet`](s3://download.memgraph.com/asset/docs/people_nodes.parquet) is used to create nodes labeled `:Person`.
The file contains the following data: + ```parquet + id,name,age,city + 100,Daniel,30,London + 101,Alex,15,Paris + 102,Sarah,17,London + 103,Mia,25,Zagreb + 104,Lucy,21,Paris + ``` +- [`restaurants_nodes.parquet`](s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet) is used to create nodes labeled `:Restaurants`.
The file contains the following data: + ```parquet + id,name,menu + 200,Mc Donalds,Fries;BigMac;McChicken;Apple Pie + 201,KFC,Fried Chicken;Fries;Chicken Bucket + 202,Subway,Ham Sandwich;Turkey Sandwich;Foot-long + 203,Dominos,Pepperoni Pizza;Double Dish Pizza;Cheese filled Crust + ``` + +- [`people_relationships.parquet`](s3://download.memgraph.com/asset/docs/people_relationships.parquet) is used to connect people with the `:IS_FRIENDS_WITH` relationship.
The file contains the following data: + ```parquet + first_person,second_person,met_in + 100,102,2014 + 103,101,2021 + 102,103,2005 + 101,104,2005 + 104,100,2018 + 101,102,2017 + 100,103,2001 + ``` +- [`restaurants_relationships.parquet`](s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet) is used to connect people with restaurants using the `:ATE_AT` relationship.
The file contains the following data: + ```parquet + PERSON_ID,REST_ID,liked + 100,200,true + 103,201,false + 104,200,true + 101,202,false + 101,203,false + 101,200,true + 102,201,true + ``` + + {

Import nodes

}

Each row will be parsed as a map, and the fields can be accessed using the property lookup syntax (e.g. `id: row.id`). Files can be imported directly from S3 or downloaded and then accessed from the local disk.

The following query will load the file row by row and create a new node for each row with properties based on the parsed row values:

```cypher
LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_nodes.parquet" AS row
CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
```

In the same manner, the following query will create new nodes for each restaurant:

```cypher
LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet" AS row
CREATE (n:Restaurant {id: row.id, name: row.name, menu: row.menu});
```

{

Create indexes

} + + Creating an [index](/fundamentals/indexes) on a property used to connect nodes + with relationships, in this case, the `id` property of the `:Person` nodes, + will speed up the import of relationships, especially with large datasets: + + ```cypher + CREATE INDEX ON :Person(id); + ``` + + {

Import relationships

} + The following query will create relationships between the people nodes: + + ```cypher + LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_relationships.parquet" AS row + MATCH (p1:Person {id: row.first_person}) + MATCH (p2:Person {id: row.second_person}) + CREATE (p1)-[f:IS_FRIENDS_WITH]->(p2) + SET f.met_in = row.met_in; + ``` + + The following query will create relationships between people and restaurants where they ate: + + ```cypher + LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet" AS row + MATCH (p1:Person {id: row.PERSON_ID}) + MATCH (re:Restaurant {id: row.REST_ID}) + CREATE (p1)-[ate:ATE_AT]->(re) + SET ate.liked = ToBoolean(row.liked); + ``` + + {

Final result

} + Run the following query to see how the imported data looks as a graph: + + ``` + MATCH p=()-[]-() RETURN p; + ``` + + ![](/pages/data-migration/csv/load_csv_restaurants_relationships.png) + +
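### Provide S3 credentials

The following is a minimal sketch of the two ways to pass S3 credentials described in the `LOAD PARQUET` clause specificities above. All bucket, region and key values are placeholders, and the inline variant assumes that the `configMap` from the syntax section is a regular Cypher map literal whose keys match the listed option names; adjust both to your environment.

```cypher
// Option 1 (assumed form): pass the credentials inline through the WITH CONFIG map.
// "my-bucket" and all credential values below are placeholders.
LOAD PARQUET FROM "s3://my-bucket/people.parquet"
WITH CONFIG configs={aws_region: "<region>",
                     aws_access_key: "<access-key>",
                     aws_secret_key: "<secret-key>"}
AS row
CREATE (p:Person) SET p += row;
```

```cypher
// Option 2: store the credentials once as run-time database settings,
// using the setting keys listed in the specificities section.
SET DATABASE SETTING "aws.region" TO "<region>";
SET DATABASE SETTING "aws.access_key" TO "<access-key>";
SET DATABASE SETTING "aws.secret_key" TO "<secret-key>";
```

Once the run-time settings are in place, subsequent `LOAD PARQUET FROM "s3://..."` queries can omit the `WITH CONFIG` map entirely.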
diff --git a/pages/database-management/authentication-and-authorization/role-based-access-control.mdx b/pages/database-management/authentication-and-authorization/role-based-access-control.mdx index e83499792..c4519e211 100644 --- a/pages/database-management/authentication-and-authorization/role-based-access-control.mdx +++ b/pages/database-management/authentication-and-authorization/role-based-access-control.mdx @@ -159,7 +159,7 @@ of the following commands: | Privilege to enforce [constraints](/fundamentals/constraints). | `CONSTRAINT` | | Privilege to [dump the database](/configuration/data-durability-and-backup#database-dump).| `DUMP` | | Privilege to use [replication](/clustering/replication) queries. | `REPLICATION` | -| Privilege to access files in queries, for example, when using `LOAD CSV` clause. | `READ_FILE` | +| Privilege to access files in queries, for example, when using `LOAD CSV` and `LOAD PARQUET` clauses. | `READ_FILE` | | Privilege to manage [durability files](/configuration/data-durability-and-backup#database-dump). | `DURABILITY` | | Privilege to try and [free memory](/fundamentals/storage-memory-usage#deallocating-memory). | `FREE_MEMORY` | | Privilege to use [trigger queries](/fundamentals/triggers). | `TRIGGER` | diff --git a/pages/database-management/configuration.mdx b/pages/database-management/configuration.mdx index 236ed476c..74accb95b 100644 --- a/pages/database-management/configuration.mdx +++ b/pages/database-management/configuration.mdx @@ -318,6 +318,10 @@ fallback to the value of the command-line argument. | hops_limit_partial_results | If set to `true`, partial results are returned when the hops limit is reached. If set to `false`, an exception is thrown when the hops limit is reached. The default value is `true`. | yes | | timezone | IANA timezone identifier string setting the instance's timezone. | yes | | storage.snapshot.interval | Define periodic snapshot schedule via cron expression ([crontab](https://crontab.guru/) format, an [Enterprise feature](/database-management/enabling-memgraph-enterprise)) or as a period in seconds. Set to empty string to disable. | no | +| aws.region | AWS region in which your S3 service is located. | yes | +| aws.access_key | Access key used to READ the file from S3. | yes | +| aws.secret_key | Secret key used to READ the file from S3. | yes | +| aws.endpoint_url | URL on which S3 can be accessed (if using some other S3-compatible storage). | yes | All settings can be fetched by calling the following query: @@ -481,6 +485,19 @@ connections in Memgraph. | `--stream-transaction-retry-interval=500` | The interval to wait (measured in milliseconds) before retrying to execute again a conflicting transaction. | `[uint32]` | +### AWS + +This section contains the list of flags that are used when connecting to S3-compatible storage. + + +| Flag | Description | Type | +|--------------------------------------------|-------------------------------------------------------------------------------------------------------------|------------| +| `--aws-region` | AWS region in which your S3 service is located. | `[string]` | +| `--aws-access-key` | Access key used to READ the file from S3. | `[string]` | +| `--aws-secret-key` | Secret key used to READ the file from S3. | `[string]` | +| `--aws-endpoint-url` | URL on which S3 can be accessed (if using some other S3-compatible storage). | `[string]` | + + ### Other This section contains the list of all other relevant flags used within Memgraph. 
diff --git a/pages/help-center/faq.mdx b/pages/help-center/faq.mdx
index 943163c43..a7674c480 100644
--- a/pages/help-center/faq.mdx
+++ b/pages/help-center/faq.mdx
@@ -212,11 +212,11 @@ us](https://memgraph.com/enterprise-trial) for more information.
### What is the fastest way to import data into Memgraph?
-Currently, the fastest way to import data is from a CSV file with a [LOAD CSV
-clause](/data-migration/csv). Check out the [best practices for importing
+Currently, the fastest way to import data is from a Parquet file with a [LOAD PARQUET
+clause](/data-migration/parquet). Check out the [best practices for importing
data](/data-migration/best-practices).
-[Other import methods](/data-migration) include importing data from JSON and CYPHERL files,
+[Other import methods](/data-migration) include importing data from CSV, JSON and CYPHERL files,
migrating from relational databases, or connecting to a data stream.
### How to import data from MySQL or PostgreSQL?
@@ -226,11 +226,11 @@ You can migrate from [MySQL](/data-migration/migrate-from-rdbms) or
### What file formats does Memgraph support for import?
-You can import data from [CSV](/data-migration/csv),
+You can import data from [CSV](/data-migration/csv), [PARQUET](/data-migration/parquet),
[JSON](/data-migration/json) or [CYPHERL](/data-migration/cypherl) files.
CSV files can be imported in on-premise instances using the [LOAD CSV
-clause](/data-migration/csv), and JSON files can be imported using a
+clause](/data-migration/csv), PARQUET files can be imported using the [LOAD PARQUET clause](/data-migration/parquet), and JSON files can be imported using a
[json_util](/advanced-algorithms/available-algorithms/json_util) module from the MAGE library. On a Cloud instance, data from CSV and JSON files can be imported only from a remote address.
diff --git a/pages/index.mdx b/pages/index.mdx
index f01b5d965..d04ef148d 100644
--- a/pages/index.mdx
+++ b/pages/index.mdx
@@ -165,6 +165,10 @@ JSON files, and import data using queries within a CYPHERL file. title="JSON" href="/data-migration/json" /> + - \ No newline at end of file +
diff --git a/pages/querying/query-plan.mdx b/pages/querying/query-plan.mdx
index 532867e67..9ad3ae3a2 100644
--- a/pages/querying/query-plan.mdx
+++ b/pages/querying/query-plan.mdx
@@ -241,6 +241,7 @@ The following table lists all the operators currently supported by Memgraph:
| `IndexedJoin` | Performs an indexed join of the input from its two input branches. |
| `Limit` | Limits certain rows from the pull chain. |
| `LoadCsv` | Loads CSV file in order to import files into the database. |
+| `LoadParquet` | Loads Parquet file in order to import files into the database. |
| `Merge` | Applies merge on the input it received. |
| `Once` | Forms the beginning of an operator chain with "only once" semantics. The operator will return false on subsequent pulls. |
| `Optional` | Performs optional matching. |