Xvc for DVC Users
DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.
Note that this document refers mostly to Xvc v0.4 and DVC 2.30. Both commands are in development, and similarities and differences may change considerably.
Similarities
The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage the data, pipelines and experiments of an ML project.
Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC. Xvc has the same optional and recommended reliance on Git.
Both commands hash file content to detect changes in files.
Both use directed acyclic graphs (DAGs) to represent pipelines.
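As a minimal sketch of why a DAG suits pipelines, the snippet below orders the steps of a made-up pipeline so that every step runs after its dependencies. The step names and the `run_order` helper are hypothetical and only illustrate the idea; neither tool is implemented this way as far as this document states.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
}


def run_order(steps):
    """Return step names in an order that respects every dependency."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(steps).static_order())


order = run_order(pipeline)
print(order)
```

A cycle between steps makes such an ordering impossible, which is why pipelines are restricted to acyclic graphs.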
Conceptual Differences
- What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
- What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word with a different meaning increases the mental effort needed to describe and understand it.
- In DVC, there is a 1-1 correspondence between dvc.yaml files in a repository and the pipelines. In Xvc, pipelines are more abstract. They are defined with the xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
- DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git.
- Cache type (or rather, recheck type), that is, whether a file in the repository is linked to its cached version by copying, reflink, symlink or hardlink, is determined repository-wide in DVC. You can have all your cache links as symlinks, or all as hardlinks, etc. Xvc tracks this per file: you can have one file symlinked to the cache, another file copied from the cache, etc.
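The per-file recheck type in the last item can be pictured with a small sketch. The `recheck` helper, the file names and the `plan` mapping below are hypothetical and are not Xvc's implementation; the sketch assumes a POSIX-like file system, and reflinks are left out because the Python standard library has no portable call for them.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, target: Path, method: str) -> None:
    """Place `cached` at `target` using the given method (hypothetical helper)."""
    if target.exists() or target.is_symlink():
        target.unlink()
    if method == "copy":
        shutil.copy2(cached, target)
    elif method == "symlink":
        os.symlink(cached, target)
    elif method == "hardlink":
        os.link(cached, target)
    else:
        raise ValueError(f"unsupported method: {method}")


def demo() -> dict:
    """Recheck one cached file three ways and report what each target is."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cached = root / "cache-entry"
        cached.write_text("data")
        # Per-file choice: every workspace path gets its own method.
        plan = {"a.txt": "copy", "b.txt": "symlink", "c.txt": "hardlink"}
        result = {}
        for name, method in plan.items():
            target = root / name
            recheck(cached, target, method)
            result[name] = (target.is_symlink(), target.read_text())
        return result


print(demo())
```

With a repository-wide setting, `plan` would collapse to a single method applied to every path.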
Command Differences
❗ Note that some of the Xvc commands described here are still under development.
- While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for exposition at first, as these two are analogous. However, giving the same name also hides some important details that are more difficult to emphasize later (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes).
- dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
- dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can be used against any other ignore filename (.gitignore, .ignore, .fooignore, ...).
- dvc checkout is replaced by xvc file recheck. There is a --recheck-as option in several Xvc commands that tells whether to check out as symlink, hardlink, reflink or copy.
- dvc commit is replaced by xvc file carry-in.
- There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with -c options in each run. You can also supply all configuration from the environment. See Configuration.
- dvc dag is replaced by xvc pipeline dag. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art, only a Graphviz representation.
- dvc data status and dvc status can be replaced by xvc file list. The Xvc version doesn't provide information about pipelines or the remotes.
- There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point.
- There is no command similar to dvc diff in Xvc.
- There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text.
- Currently, there are no commands corresponding to the dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
- dvc fetch is replaced by xvc file bring --no-recheck.
- Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.
- Instead of dvc gc to "garbage-collect" files, you can use xvc file delete with various options.
- There is no corresponding command for dvc get-url in Xvc. You can use wget or curl instead.
- Currently there is no command to replace dvc get, dvc import, and dvc import-url. URL dependencies are to be supported eventually with a different mechanism.
- Instead of hooks like those installed by dvc install, Xvc issues Git commands itself if the git.auto_commit, git.auto_stage configuration options are set.
- There is no corresponding command for dvc list-url.
- dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented but are on the roadmap.
- Currently, there is no params/metrics tracking/diff similar to dvc params, dvc metrics or dvc plots commands in Xvc.
- dvc move is replaced by xvc file move.
- dvc push is replaced by xvc file send.
- dvc pull is replaced by xvc file bring.
- There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.
- The dvc remote set of commands is replaced by the xvc storage set of commands. You can use xvc storage new for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.
- There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.
- Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run again.
- xvc root is for the same purpose as dvc root.
- dvc run (which defines a stage in a DVC pipeline and immediately runs it) can be replaced by the xvc pipeline set of commands: xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify dependencies of a step, xvc pipeline step output to specify outputs of a step, and xvc pipeline run to run this pipeline.
- Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.
- There is no need for dvc protect or dvc unprotect commands in Xvc. "Cache type" is not a repository-wide option. If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as. If you want identical files copied to one directory and linked in another, xvc file copy can help.
- DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and automatically invalidates the step if the URL/file has changed.
- DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
Technical Differences
- DVC is written in Python. Xvc is written in Rust.
- DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
- DVC tracks file/directory changes in separate .dvc files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.
- DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (xvc-ecs) in its core.
- DVC remotes are identical to their cache in structure, and multiple DVC repositories can use the same remote by mixing files. This leads to inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated.
- DVC considers directories as file-equivalent entities to track with .dvc files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
- DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
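The hashing difference in the list above can be sketched with Python's standard hashlib module. The `ALGORITHMS` table and the helper names are hypothetical, and the sketch is not how either tool computes digests. BLAKE3, the Xvc default, is not available in hashlib, so only the other algorithms named above appear here.

```python
import hashlib

# Algorithms named above that the standard library provides.
# BLAKE3 is not in hashlib, so it is left out of this sketch.
ALGORITHMS = {
    "md5": hashlib.md5,
    "blake2s": hashlib.blake2s,
    "sha2-256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
}


def content_digest(data: bytes, algorithm: str) -> str:
    """Hex digest of `data` under the configured algorithm."""
    return ALGORITHMS[algorithm](data).hexdigest()


def has_changed(old_digest: str, data: bytes, algorithm: str) -> bool:
    """Content counts as changed when its digest differs from the recorded one."""
    return content_digest(data, algorithm) != old_digest


recorded = content_digest(b"first version", "sha2-256")
print(has_changed(recorded, b"first version", "sha2-256"))
print(has_changed(recorded, b"second version", "sha2-256"))
```

Digests from different algorithms are not comparable, so a recorded digest is only meaningful together with the algorithm that produced it.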