Conversation

@jacobwgillespie
Member

WIP 🙂

dvdksn and others added 30 commits August 17, 2023 07:36
`ADD --checksum` and git URL support graduated in v1.6,
so remove the references to the labs channel since
they are now in stable.

Signed-off-by: David Karlsson <35727626+dvdksn@users.noreply.github.com>
(cherry picked from commit dcc83f8)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
…: 0xe: operation not permitted`

The error is known to happen when buildkitd is executed inside a
container without `--oci-worker-no-process-sandbox`.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 591478d)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
We can't avoid squashing even after just fixing up whiteout timestamps;
squashing is still needed to apply the `touch`-ed timestamps across multiple `RUN` instructions.

Squashing will no longer be needed if we can merge PR 3560.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 0758355)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: CrazyMax <crazy-max@users.noreply.github.com>
(cherry picked from commit 5b8f962)
Signed-off-by: Chris Goller <goller@gmail.com>
We want to track the time saved of cached layers
over time but the public layer digest was not stable
from run to run.

This update adds a new stable digest to the cli
via the progress control.

The core addition was to notify progress during CacheMap
lookup.  There absolutely seem to be race conditions!

For some reason if I add any logic, such as filtering
for only those vertices that are completed, the progress
with a stable digest is _NOT_ sent to the client.

This makes me queasy.  More buildkit mysteries here.

Signed-off-by: Chris Goller <goller@gmail.com>
Co-authored-by: Jacob Gillespie <jacobwgillespie@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
When pushing large images to multiple registries
it could take a long time because the code sent
the data serially.

This now pushes to multiple registries in parallel.
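
As a rough illustration of the serial-to-parallel change described above, here is a minimal sketch using only the Go standard library. It is not the code from this commit: `pushFunc`, `pushAll`, `errDemo`, and the registry names are hypothetical stand-ins, and the real push logic is replaced by a callback.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// pushFunc is a hypothetical stand-in for whatever pushes an image to one registry.
type pushFunc func(registry string) error

// errDemo is a made-up error used only to demonstrate failure handling.
var errDemo = errors.New("demo push failure")

// pushAll starts one goroutine per registry and waits for all of them.
// It returns the joined errors, or nil if every push succeeded.
func pushAll(registries []string, push pushFunc) error {
	errs := make([]error, len(registries))
	var wg sync.WaitGroup
	for i, r := range registries {
		wg.Add(1)
		go func(i int, r string) {
			defer wg.Done()
			errs[i] = push(r) // each goroutine writes only its own slot
		}(i, r)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when all entries are nil
}

func main() {
	registries := []string{"registry-a.example", "registry-b.example"}
	err := pushAll(registries, func(r string) error {
		if r == "registry-b.example" {
			return errDemo
		}
		fmt.Println("pushed to", r)
		return nil
	})
	fmt.Println("result:", err)
}
```

With this shape the total time is bounded by the slowest registry rather than the sum of all pushes, at the cost of higher peak bandwidth use.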

Signed-off-by: Chris Goller <goller@gmail.com>
The pushing of cached layers to S3 is serial
and quite slow.  It appears that we can make it parallel
for a large increase in speed.

The goal is to make the S3 cache a usable backup
to local disk caching.

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
We have seen performance issues seemingly related to
the session not closing.  We want to use profiling to
capture some of the root causes.

Because profiling can be expensive, we only turn it on
when the env vars, PROFILER_TOKEN, PROFILER_ENDPOINT,
and PROFILER_PROJECT_ID are set.
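
A minimal sketch of the opt-in check described above, assuming that "set" means "non-empty" for all three variables. The `profilerConfig` type and `profilerConfigFromEnv` function are hypothetical names for illustration; the actual profiler start-up call is left out because the source does not name it.

```go
package main

import (
	"fmt"
	"os"
)

// profilerConfig holds the three values named in the commit message.
type profilerConfig struct {
	Token     string
	Endpoint  string
	ProjectID string
}

// profilerConfigFromEnv reads the variables through getenv (os.Getenv in
// production) and reports whether profiling should be enabled.
// Assumption: all three must be non-empty.
func profilerConfigFromEnv(getenv func(string) string) (profilerConfig, bool) {
	cfg := profilerConfig{
		Token:     getenv("PROFILER_TOKEN"),
		Endpoint:  getenv("PROFILER_ENDPOINT"),
		ProjectID: getenv("PROFILER_PROJECT_ID"),
	}
	enabled := cfg.Token != "" && cfg.Endpoint != "" && cfg.ProjectID != ""
	return cfg, enabled
}

func main() {
	if cfg, ok := profilerConfigFromEnv(os.Getenv); ok {
		fmt.Println("profiling enabled for project", cfg.ProjectID)
	} else {
		fmt.Println("profiling disabled")
	}
}
```

Passing the lookup function in as a parameter keeps the check easy to test without touching the process environment.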

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
To allow images to be exported, all layers need to be unlazied.
Unlazying means making sure that all layers actually exist.
Typically, laziness means that base images may not have been downloaded.

This fixes an oversight in the logic.  It turns out that this
piece of code both creates an image in an "image store"
and unlazies all layers for the content store.

The OCI worker does not have an image store.  Thus, the logic
of this function accidentally excluded OCI workers from unlazying
their layers for the content store.

Signed-off-by: Chris Goller <goller@gmail.com>
We are seeing slow container push and are wondering if
the global pool of concurrent layer pushes is causing
a slowdown.

This PR addresses both concurrent fetch and pulls as the
same global variable controls the semaphore weight.
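
To make the concern concrete, here is a generic counting-semaphore sketch. It is not BuildKit's implementation: `runLimited` is a hypothetical helper showing how a single limit caps how many operations run at once, which is why one shared limit can hold back unrelated pushes and pulls.

```go
package main

import (
	"fmt"
	"sync"
)

// runLimited runs every task while allowing at most limit to run at once.
// It returns the highest number of tasks observed running concurrently.
// limit must be at least 1, otherwise the first acquire blocks forever.
func runLimited(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
		peak    int
	)
	for _, task := range tasks {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while limit tasks are running
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()
			task()
			mu.Lock()
			running--
			mu.Unlock()
		}(task)
	}
	wg.Wait()
	return peak
}

func main() {
	tasks := make([]func(), 8)
	for i := range tasks {
		tasks[i] = func() {}
	}
	fmt.Println("peak concurrency:", runLimited(3, tasks))
}
```

If every session shares one such semaphore, a session with many layers can occupy all slots; a per-session or larger limit trades that contention for more simultaneous connections.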

Signed-off-by: Chris Goller <goller@gmail.com>
Compressing large layers such as from ML models
takes a lot of time.

With parallel gzip we can get much better compression
performance.

In testing with a 22GB LLM, compression previously took 117 seconds,
whereas with pgzip it took 18 seconds (with 128 cores).

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
When exporting images to pull with the depot cli
the manifest and config may have been garbage collected.

This returns the index, config, and manifest as annotations
within the solve response to work around garbage collection,
as the files are not removed until after solve returns.

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Computing the SHA256 of new layers now takes the most
time when creating an exportable container.

This SIMD implementation of SHA256 is significantly faster
than the standard library implementation, particularly for ARM.

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
The manifest, config, and layers of newly created images
could be garbage collected before they are pulled by the CLI.

This new lease will inhibit GC until the lease is removed or
one hour has passed.

The lease ID can be accessed by the returned annotation
`depot/session.id`.

Co-authored-by: Jacob Gillespie <jacobwgillespie@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Under higher load and multiple merging sessions
walking provenance had panics dereferencing a nil op.

Ops are by default nil until getEdge is called.
Somehow getEdge was not being called but provenance was.

It appears that the code before the refactor
did lock while walking.  That lock has been added back, which
also removes other state issues.

Signed-off-by: Chris Goller <goller@gmail.com>
goller and others added 15 commits August 17, 2023 07:51
On the assumption that mergeTo is causing inconsistent graph
state, a new feature flag `DEPOT_DISABLE_MERGE_TO` is added
to selectively turn off the merging functionality.

Signed-off-by: Chris Goller <goller@gmail.com>
Requires `--oci-worker-binary /usr/bin/nvidia-container-runtime`
and `--oci-worker-gpu` from the nvidia-container-toolkit.

Signed-off-by: Chris Goller <goller@gmail.com>
Previously, manifests and configs were sent as annotations
on the image index. This caused failures when pushing to GCR
with oci-mediatypes set to false.

Now a new export image option, `depot.export.image.version` has
been added.  If the version is set to `2` then we send back manifest
and config in the solve response.  If not set or `1` we'll continue
to send via annotations for backwards compatibility.

The response manifests and configs are a base64-encoded JSON array stored
at the key `depot.export.image.version`.
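
A client reading that value could decode it roughly as follows. This is a sketch under assumptions: it uses standard (not URL-safe) base64, and because the element schema is not given here, each array element is kept as raw JSON. `decodeExportPayload` is a hypothetical helper name, and the payload in `main` is made up.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// decodeExportPayload decodes a base64 string that wraps a JSON array.
// Each element is returned undecoded so the caller can apply its own schema.
func decodeExportPayload(encoded string) ([]json.RawMessage, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("decode base64: %w", err)
	}
	var items []json.RawMessage
	if err := json.Unmarshal(raw, &items); err != nil {
		return nil, fmt.Errorf("parse JSON array: %w", err)
	}
	return items, nil
}

func main() {
	// Made-up payload for illustration only.
	encoded := base64.StdEncoding.EncodeToString([]byte(`[{"manifest":{},"config":{}}]`))
	items, err := decodeExportPayload(encoded)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("entries:", len(items))
}
```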

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
The labels for an existing snapshot may not always exist.

Signed-off-by: Chris Goller <goller@gmail.com>
This panic was creating log spam for expected
user failures.

Instead we log and exit 1 now.

Signed-off-by: Chris Goller <goller@gmail.com>
Currently, it is awkward to check if buildkitd is alive.
This adds the standard gRPC health check; it always responds
that buildkitd is healthy.

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
This is a backport of the change I upstreamed for stargz.

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
goller and others added 3 commits September 5, 2023 17:58
We have seen bolt corruption every so often.
The metadata boltdb was never closed, which could
potentially lead to corruption.

Signed-off-by: Chris Goller <goller@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
Co-authored-by: Jacob Gillespie <jacobwgillespie@gmail.com>
Signed-off-by: Chris Goller <goller@gmail.com>
jacobwgillespie and others added 2 commits September 6, 2023 18:14
Signed-off-by: Chris Goller <goller@gmail.com>
We want to know the original build and stable digests
associated with each snapshot and blob.

The goal was to have a minimally invasive patch.
That's a fancy way of saying I used context to propagate
the creator vertex and the stable digests into the metadata
storage.

The creator vertex will only ever be set when the data is
first created.

The stable digests are created within each `Op`. However,
those ops are written in such a way that they may not have
access to the digest later in their execution.
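
The context-based propagation described above can be sketched with the standard `context` package. Everything named here is hypothetical (`creatorKey`, `withCreatorVertex`, `recordCreator`, the `metadata` map, and its `"creator"` key); the sketch only shows the pattern of carrying a value through a context and writing it once, when the data is first created.

```go
package main

import (
	"context"
	"fmt"
)

// creatorKey is an unexported key type, so other packages cannot collide with it.
type creatorKey struct{}

// rootCtx stands in for the context of an incoming solve request.
var rootCtx = context.Background()

// withCreatorVertex returns a child context carrying the creator vertex.
func withCreatorVertex(ctx context.Context, vertex string) context.Context {
	return context.WithValue(ctx, creatorKey{}, vertex)
}

// creatorVertexFrom extracts the creator vertex, if one was attached.
func creatorVertexFrom(ctx context.Context) (string, bool) {
	v, ok := ctx.Value(creatorKey{}).(string)
	return v, ok
}

// metadata is a stand-in for the real metadata storage.
type metadata map[string]string

// recordCreator stores the creator vertex only if none has been recorded yet,
// mirroring "only ever set when the data is first created".
func recordCreator(ctx context.Context, md metadata) {
	if _, exists := md["creator"]; exists {
		return
	}
	if v, ok := creatorVertexFrom(ctx); ok {
		md["creator"] = v
	}
}

func main() {
	md := metadata{}
	recordCreator(withCreatorVertex(rootCtx, "vertex-a"), md)
	recordCreator(withCreatorVertex(rootCtx, "vertex-b"), md) // ignored: already set
	fmt.Println("creator:", md["creator"])
}
```

Using the context avoids changing function signatures along the call path, which is what keeps the patch small; the trade-off is that the dependency becomes implicit and is easy to drop if a new context is created along the way.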

Signed-off-by: Chris Goller <goller@gmail.com>

6 participants