Xvc for DVC Users
DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.
Note that this document refers mostly to Xvc v0.4 and DVC 2.30. Both commands are in development, and similarities and differences may change considerably.
Similarities
The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage the data, pipelines, and experiments of an ML project.
Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC. Xvc has the same optional and recommended reliance on Git.
Both commands hash file content to detect changes in files.
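As a rough illustration of content-based change detection, the following Python sketch records a digest for a file and compares it later. It is not how either tool is implemented; the function names `content_digest` and `has_changed` and the file name `data.csv` are hypothetical, and SHA-256 from the standard library is used only because it is readily available.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file's content, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def has_changed(path: Path, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """A file counts as changed when its current digest differs from the recorded one."""
    return content_digest(path, algorithm) != recorded_digest


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"  # hypothetical tracked file
    data.write_text("a,b\n1,2\n")
    recorded = content_digest(data)  # digest stored at tracking time

    unchanged = has_changed(data, recorded)  # content is still the same
    data.write_text("a,b\n1,3\n")  # edit the file
    changed = has_changed(data, recorded)  # digest no longer matches

    print(unchanged, changed)
```

The comparison depends only on file content, so renaming the file or touching its timestamp would not register as a change in this sketch.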
Both of these use DAGs to represent pipelines.
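To make the DAG idea concrete, here is a minimal Python sketch that orders the steps of a made-up pipeline so that every step runs after its dependencies. The step names are hypothetical and the code uses Python's standard `graphlib` module, not anything from DVC or Xvc.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a step, each value is the set of steps it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"train", "evaluate"},
}

# A topological order guarantees that dependencies come before their dependents.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)

# A pipeline with a cycle is not a DAG and cannot be ordered.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

Representing a pipeline this way also makes invalid definitions easy to reject: a cyclic dependency raises an error instead of producing an order.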
Conceptual Differences
- What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
- What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word in a different meaning increases the mental effort to describe and understand.
- In DVC, there is a 1-1 correspondence between `dvc.yaml` files in a repository and the pipelines. In Xvc, pipelines are more abstract. They are defined with the `xvc pipeline` family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
- DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with `dvc add`, DVC creates a `.dvc` file next to it. Xvc only creates a `.xvc/` directory in the repository root and only updates `.gitignore` files to hide tracked files from Git.
- Cache type (or rather, recheck type), that is, whether a file in the repository is linked to its cached version by copying, reflink, symlink, or hardlink, is determined repository-wide in DVC. You can have all your cache links as symlinks, or all as hardlinks, etc. Xvc tracks these per file: you can have one file symlinked to the cache, another file copied from the cache, etc.
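The per-file recheck idea from the last item can be sketched in Python as follows. This is an illustration under stated assumptions, not Xvc's implementation: the `recheck` function, the cache path `cache/abc123`, and the workspace paths are all hypothetical. Reflinks are left out because the standard library has no portable call for them, and creating symlinks may require extra privileges on Windows.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, target: Path, how: str) -> None:
    """Place the cached file at `target` as a copy, a symlink, or a hardlink."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if how == "copy":
        shutil.copy2(cached, target)
    elif how == "symlink":
        os.symlink(cached, target)
    elif how == "hardlink":
        os.link(cached, target)
    else:
        raise ValueError(f"unsupported recheck type: {how}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache" / "abc123"  # hypothetical cache entry
    cached.parent.mkdir()
    cached.write_bytes(b"model weights")

    # Each workspace file gets its own recheck type instead of one repository-wide setting.
    choices = {"a.bin": "copy", "b.bin": "symlink", "c.bin": "hardlink"}
    workspace = root / "workspace" / "models"
    for name, how in choices.items():
        recheck(cached, workspace / name, how)

    is_symlink = (workspace / "b.bin").is_symlink()
    same_inode = os.stat(workspace / "c.bin").st_ino == os.stat(cached).st_ino
    copy_is_separate = os.stat(workspace / "a.bin").st_ino != os.stat(cached).st_ino
    all_readable = all((workspace / n).read_bytes() == b"model weights" for n in choices)
    print(is_symlink, same_inode, copy_is_separate, all_readable)
```

All three files read back the same content; they differ only in how they relate to the cached copy, which is the choice made per file here.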
Command Differences
❗ Note that some of the Xvc commands described here are still under development.
- While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both `git push` and `dvc push` commands may look beneficial for exposition at first, as these two are analogous. However, giving the same name also hides some important details that are more difficult to emphasize later. (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes.)
- `dvc add` can be replaced by `xvc file track`. `dvc add` creates a `.dvc` file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
- `dvc check-ignore` can be replaced by `xvc check-ignore`. The Xvc version can be used against any other ignore filename (`.gitignore`, `.ignore`, `.fooignore`, ...).
- `dvc checkout` is replaced by `xvc file recheck`. There is a `--recheck-as` option in several Xvc commands that tells whether to check out as symlink, hardlink, reflink or copy.
- `dvc commit` is replaced by `xvc file carry-in`.
- There is no command similar to `dvc config`. You can either edit the configuration files, or modify configuration with `-c` options in each run. You can also supply all configuration from the environment. See Configuration.
- `dvc dag` is replaced by `xvc pipeline dag`. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art, only a Graphviz representation.
- `dvc data status` and `dvc status` can be replaced by `xvc file list`. The Xvc version doesn't provide information about pipelines or remotes.
- There is no command similar to `dvc destroy` in Xvc. There will be an `xvc deinit` command at some point.
- There is no command similar to `dvc diff` in Xvc.
- There is no command similar to `dvc doctor` or `dvc version`. Version information should be visible in the help text.
- Currently, there are no commands corresponding to the `dvc exp` set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
- `dvc fetch` is replaced by `xvc file bring --no-recheck`.
- Instead of freezing "pipeline stages" as in `dvc freeze`, and unfreezing with `dvc unfreeze`, `xvc pipeline step update --changed [never|always|by_dependencies]` can be used to specify if/when to run a pipeline step.
- Instead of `dvc gc` to "garbage-collect" files, you can use `xvc file delete` with various options.
- There is no corresponding command for `dvc get-url` in Xvc. You can use `wget` or `curl` instead.
- Currently there is no command to replace `dvc get`, `dvc import`, and `dvc import-url`. URL dependencies are to be supported eventually with a different mechanism.
- Instead of `dvc install`-like hooks, Xvc issues Git commands itself if the `git.auto_commit` and `git.auto_stage` configuration options are set.
- There is no corresponding command for `dvc list-url`.
- `dvc list` is replaced by `xvc file list` for local paths. Its remote capabilities are not implemented but are on the roadmap.
- Currently, there is no params/metrics tracking/diff similar to the `dvc params`, `dvc metrics` or `dvc plots` commands in Xvc.
- `dvc move` is replaced by `xvc file move`.
- `dvc push` is replaced by `xvc file send`.
- `dvc pull` is replaced by `xvc file bring`.
- There are no commands similar to `dvc queue` for experiments in Xvc. Experiment tracking will probably be handled differently.
- The `dvc remote` set of commands is replaced by the `xvc storage` set of commands. You can use `xvc storage new` for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of `dvc remote modify`, you can use `xvc storage remove` and `xvc storage new`.
- There is no single command to replace `dvc remove`. For files, you can use `xvc file delete`. For pipeline steps, you can use `xvc pipeline step remove`.
- Instead of `dvc repro`, Xvc has `xvc pipeline run`. If you want to reproduce a pipeline, you can use `xvc pipeline run` again.
- `xvc root` is for the same purpose as `dvc root`.
- `dvc run` (that defines a stage in a DVC pipeline and immediately runs it) can be replaced by the `xvc pipeline` set of commands: `xvc pipeline new` for a new pipeline, `xvc pipeline step new` for a new step in the pipeline, `xvc pipeline step dependency` to specify dependencies of a step, `xvc pipeline step output` to specify outputs of a step, and `xvc pipeline run` to run this pipeline.
- Instead of `dvc stage add`, we have `xvc pipeline step new`. For `dvc stage list`, we have `xvc pipeline step list`.
- There is no need for `dvc protect` or `dvc unprotect` commands in Xvc. "Cache type" is not a repository-wide option. If you want to track a certain directory as symlink, and another as hardlink, you can do so with `xvc file recheck --as`. If you want identical files copied to one directory and linked in another, `xvc file copy` can help.
- DVC needs `dvc update` for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading, and automatically invalidates the step if the URL/file has changed.
- DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
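The one-to-one replacements listed above can be collected into a small lookup table, for example when translating shell history or scripts by hand. The Python sketch below restates only the mappings given in this section; the `suggest` helper is hypothetical, the table is not exhaustive, and it maps command names only, without translating arguments or options.

```python
# Direct replacements named in the list above (command names only).
DVC_TO_XVC = {
    "dvc add": "xvc file track",
    "dvc check-ignore": "xvc check-ignore",
    "dvc checkout": "xvc file recheck",
    "dvc commit": "xvc file carry-in",
    "dvc dag": "xvc pipeline dag",
    "dvc data status": "xvc file list",
    "dvc status": "xvc file list",
    "dvc fetch": "xvc file bring --no-recheck",
    "dvc gc": "xvc file delete",
    "dvc list": "xvc file list",
    "dvc move": "xvc file move",
    "dvc push": "xvc file send",
    "dvc pull": "xvc file bring",
    "dvc repro": "xvc pipeline run",
    "dvc root": "xvc root",
    "dvc stage add": "xvc pipeline step new",
    "dvc stage list": "xvc pipeline step list",
}

# Commands the list above describes as having no counterpart at the time of writing.
NO_EQUIVALENT = {
    "dvc config", "dvc destroy", "dvc diff", "dvc doctor", "dvc version",
    "dvc exp", "dvc get-url", "dvc get", "dvc import", "dvc import-url",
    "dvc list-url", "dvc params", "dvc metrics", "dvc plots", "dvc queue",
}


def suggest(command_line: str) -> str:
    """Suggest an Xvc counterpart for a DVC command line (hypothetical helper)."""
    words = command_line.split()
    # Try the longest prefix first so "dvc stage add" wins over any shorter match.
    for n in range(len(words), 1, -1):
        prefix = " ".join(words[:n])
        if prefix in DVC_TO_XVC:
            return DVC_TO_XVC[prefix]
        if prefix in NO_EQUIVALENT:
            return "no direct equivalent"
    return "not covered by this table"


print(suggest("dvc push -r origin"))
print(suggest("dvc stage add -n train"))
print(suggest("dvc diff HEAD~1"))
```

Commands with a many-to-one or conditional mapping, such as `dvc run`, `dvc remove`, and `dvc remote`, are left out of the table on purpose because they need the explanation given in the list.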
Technical Differences
- DVC is written in Python. Xvc is written in Rust.
- DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
- DVC tracks file/directory changes in separate `.dvc` files. Xvc tracks them in `.json` files in `.xvc/store`. There is no 1-1 correspondence between these files and the directory structure.
- DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (`xvc-ecs`) in its core.
- DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This leads to inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated.
- DVC considers directories as file-equivalent entities to track with `.dvc` files pointing to `.json` files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
- DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
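The deduplication trade-off between a shared remote and per-repository storage directories can be illustrated with a toy model in Python. This is not the actual on-disk layout of either tool: the repository names and file contents are made up, and SHA-256 stands in for whichever digest the tool uses. Objects are keyed by digest alone in the shared layout and by repository plus digest in the per-repository layout.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content digest used as the object key in this toy model."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical repositories; both contain one identical file.
repos = {
    "repo-a": [b"shared dataset", b"only in a"],
    "repo-b": [b"shared dataset", b"only in b"],
}

# Shared layout: objects are keyed by digest only, so identical files collapse into one.
shared_layout = {digest(c) for files in repos.values() for c in files}

# Per-repository layout: the repository is part of the key, so identical files are stored twice.
per_repo_layout = {(repo, digest(c)) for repo, files in repos.items() for c in files}

print(len(shared_layout), len(per_repo_layout))
```

In this model the shared layout stores three objects and the per-repository layout stores four, which is the duplication described above, in exchange for keeping each repository's files separate.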