Skip to content

Conversation

@ghukill
Copy link
Collaborator

@ghukill ghukill commented Nov 5, 2025

NOTE: this PR builds on #18. Once the other PR is complete, this PR will get pointed to main and merged.

Purpose and background context

Small PR to support a JSONLines files as input for the CLI command create-embeddings.

While normally the CLI will read directly from the TIMDEX dataset via TDA, it can be optionally invoked to read records from as JSONLines file that simulates the columns/fields that would come from TDA. This is helpful for testing and will be helpful for ironing out the parallel processing work.

How can a reviewer manually see the effects of these changes?

Set env vars:

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_PATH=/tmp/te-model

If not already done, ensure model is downloaded locally:

uv run --env-file .env embeddings --verbose download-model

Create embeddings with entirely local input + output files:

uv run --env-file .env embeddings --verbose create-embeddings \
--input-jsonl tests/fixtures/cli_inputs/test-3-records.jsonl \
--strategy full_record \
--output-jsonl output/use137.jsonl

Note: S3 uris are supported for both --input-jsonl and --ouput-jsonl via smart_open.

Includes new or updated dependencies?

YES: smart_open[s3]

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Comment on lines 226 to 257
# read input records from TIMDEX dataset (default)
# or a JSONLines file
if input_jsonl:
with (
smart_open.open(input_jsonl, "r") as file_obj, # type: ignore[no-untyped-call]
jsonlines.Reader(file_obj) as reader,
):
timdex_records = iter(list(reader))

else:
if not dataset_location or not run_id:
raise click.UsageError(
"Both '--dataset-location' and '--run-id' are required arguments "
"when reading input records from the TIMDEX dataset."
)

# init TIMDEXDataset
timdex_dataset = TIMDEXDataset(dataset_location)

# query TIMDEX dataset for an iterator of records
timdex_records = timdex_dataset.read_dicts_iter(
columns=[
"timdex_record_id",
"run_id",
"run_record_offset",
"transformed_record",
],
run_id=run_id,
where=f"""run_record_offset >= {run_record_offset}""",
limit=record_limit,
action="index",
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point, I think @ehanson8 floated the idea of refactoring this out to a utility function. I considered that again here now that we handle two different possible inputs.

For the time being, I'd propose to keep here as-is, because there are no other known use cases for these inputs getting handled or used. Until then, it feels reasonable that this CLI handles the input + output handling.

Why these changes are being introduced:

Initially, the CLI command create-embeddings only supported reading input records
from the TIMDEX dataset via TDA.  While this is likely the way we'll get input
records, supporting a JSONLines file as input is helpful for testing.

How this addresses that need:
* Adds a new --input-jsonl argument that reads a JSONLines file and uses
those rows as input for creating embeddings.
* Args --dataset-location and --run-id are required when --input-jsonl
is not set.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-137
@ghukill ghukill force-pushed the USE-137-JSONL-input branch from 7216c9e to b898882 Compare November 5, 2025 20:29
@ghukill ghukill requested a review from a team November 6, 2025 16:10
@ghukill ghukill marked this pull request as ready for review November 6, 2025 16:10
@jonavellecuerdo jonavellecuerdo self-requested a review November 7, 2025 14:08
Copy link

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected! ✨

@ghukill ghukill merged commit 9006ec2 into USE-136-implement-create-embeddings Nov 7, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants