Xvc for DVC Users

DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.

Note that this document refers mostly to Xvc v0.4 and DVC 2.30. Both commands are in development, and similarities and differences may change considerably.

Similarities

The two tools have similar purposes and are alternatives to each other. Both aim to manage the data, pipelines, and experiments of an ML project.

Both utilities work on top of Git. DVC became more tightly bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC. Xvc has the same optional and recommended reliance on Git.

Both tools hash file content to detect changes in files.
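As a rough illustration of content hashing for change detection, here is a minimal Python sketch. It is not how either tool is implemented. The function names and the sample data are hypothetical, and it defaults to SHA-256 only because that algorithm ships with Python's standard library (BLAKE3 does not).

```python
import hashlib


def content_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` for a hashlib algorithm name."""
    return hashlib.new(algorithm, data).hexdigest()


def has_changed(recorded_digest: str, data: bytes, algorithm: str = "sha256") -> bool:
    """Compare the current content against a previously recorded digest."""
    return content_digest(data, algorithm) != recorded_digest


# Hypothetical file content, recorded once and then edited.
original = b"id,label\n1,cat\n2,dog\n"
recorded = content_digest(original)
edited = original + b"3,bird\n"

print(has_changed(recorded, original))  # False: same bytes, same digest
print(has_changed(recorded, edited))    # True: content differs
```

Because only the digest is compared, renaming a file or touching its timestamp does not count as a change in this sketch; only different bytes do.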

Both represent pipelines as directed acyclic graphs (DAGs).
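To make the DAG idea concrete, the following sketch orders the steps of a small pipeline so that every step runs after its dependencies. The step names are hypothetical, and the code uses Python's standard graphlib module (Python 3.9 or later) rather than anything from DVC or Xvc.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"evaluate", "preprocess"},
}


def run_order(dag):
    """Return the steps in an order where dependencies always come first."""
    return list(TopologicalSorter(dag).static_order())


def is_valid_pipeline(dag):
    """A pipeline is valid only if its dependency graph has no cycle."""
    try:
        run_order(dag)
        return True
    except CycleError:
        return False


order = run_order(pipeline)
print(order)
print(is_valid_pipeline({"a": {"b"}, "b": {"a"}}))  # False: a and b depend on each other
```

The acyclic requirement is what makes a run order exist at all: a cycle would mean a step has to wait for itself.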

Conceptual Differences

  • What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
  • What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word with a different meaning increases the mental effort needed to describe and understand it.
  • In DVC, there is a 1-1 correspondence between dvc.yaml files in a repository and the pipelines. In Xvc, pipelines are more abstract. They are defined with the xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
  • DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git.
  • Cache type (or rather, recheck type) determines whether a file in the repository is linked to its cached version by copy, reflink, symlink, or hardlink. In DVC, it is set repository-wide: all cache links are symlinks, or all are hardlinks, and so on. Xvc tracks it per file, so you can have one file symlinked to the cache and another copied from it.
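The per-file recheck idea from the last item can be sketched with plain filesystem calls. This is an illustration only, not Xvc's implementation: the recheck function, the cache layout, and the file names are hypothetical, and reflinks are left out because Python's standard library has no portable call for them. Creating symlinks may require extra privileges on Windows.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, target: Path, method: str) -> None:
    """Place `cached` at `target` as a copy, symlink, or hardlink."""
    if target.exists() or target.is_symlink():
        target.unlink()
    if method == "copy":
        shutil.copy2(cached, target)
    elif method == "symlink":
        os.symlink(cached, target)
    elif method == "hardlink":
        os.link(cached, target)
    else:
        raise ValueError(f"unknown recheck method: {method}")


def link_kind(cached: Path, target: Path) -> str:
    """Report how `target` is connected to `cached`."""
    if target.is_symlink():
        return "symlink"
    if os.path.samefile(cached, target):
        return "hardlink"
    return "copy"


def demo() -> dict:
    """Recheck one cached file three different ways in the same directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        cache.mkdir()
        cached = cache / "abc123"  # hypothetical cache entry
        cached.write_bytes(b"model weights")
        methods = {"a.bin": "copy", "b.bin": "symlink", "c.bin": "hardlink"}
        for name, method in methods.items():
            recheck(cached, root / name, method)
        return {name: link_kind(cached, root / name) for name in methods}


print(demo())
```

A repository-wide setting would correspond to calling recheck with the same method for every file; the per-file approach simply lets the method vary by path.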

Command Differences

❗Note that some of the Xvc commands described here are still under development.

  • While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for exposition at first, as the two are analogous. However, giving them the same name also hides some important details that are more difficult to emphasize later (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes).
  • dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
  • dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can also be used with any other ignore filename (.gitignore, .ignore, .fooignore, ...).
  • dvc checkout is replaced by xvc file recheck. There is a --recheck-as option in several Xvc commands that tells whether to check out as symlink, hardlink, reflink or copy.
  • dvc commit is replaced by xvc file carry-in.
  • There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with -c options in each run. You can also supply all configuration from the environment. See Configuration.
  • dvc dag is replaced by xvc pipeline dag. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art, only a Graphviz representation.
  • dvc data status and dvc status can be replaced by xvc file list. The Xvc version doesn't provide information about pipelines or remotes.
  • There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point.
  • There is no command similar to dvc diff in Xvc.
  • There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text.
  • Currently, there are no commands corresponding to the dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
  • dvc fetch is replaced by xvc file bring --no-recheck.
  • Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.
  • Instead of dvc gc to "garbage-collect" files, you can use xvc file delete with various options.
  • There is no corresponding command for dvc get-url in Xvc. You can use wget or curl instead.
  • Currently, there is no command to replace dvc get, dvc import, or dvc import-url. URL dependencies are to be supported eventually with a different mechanism.
  • Instead of installing Git hooks as dvc install does, Xvc issues Git commands itself if the git.auto_commit and git.auto_stage configuration options are set.
  • There is no corresponding command for dvc list-url.
  • dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented yet, but they are on the roadmap.
  • Currently, Xvc has no params/metrics tracking or diff similar to the dvc params, dvc metrics, or dvc plots commands.
  • dvc move is replaced by xvc file move.
  • dvc push is replaced by xvc file send.
  • dvc pull is replaced by xvc file bring.
  • There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.
  • The dvc remote set of commands is replaced by the xvc storage set of commands. You can use xvc storage new to add new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.
  • There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.
  • Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run again.
  • xvc root is for the same purpose as dvc root.
  • dvc run (which defines a stage in a DVC pipeline and immediately runs it) can be replaced by the xvc pipeline set of commands: xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify the dependencies of a step, xvc pipeline step output to specify the outputs of a step, and xvc pipeline run to run the pipeline.
  • Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.
  • There is no need for dvc protect or dvc unprotect commands in Xvc. "Cache type" is not a repository-wide option. If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as. If you want identical files copied to one directory and linked in another, xvc file copy can help.
  • DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading, and automatically invalidates the step if the URL/file has changed.
  • DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
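The one-to-one replacements in the list above can be collected into a small lookup table. The following Python sketch is a convenience for readers migrating scripts, not part of either tool. The helper name xvc_equivalent is hypothetical, and the table only names the replacement command: options and arguments are not translated, since they differ between the tools.

```python
from typing import Optional

# DVC subcommand -> Xvc replacement, taken from the list above.
DVC_TO_XVC = {
    "add": "xvc file track",
    "check-ignore": "xvc check-ignore",
    "checkout": "xvc file recheck",
    "commit": "xvc file carry-in",
    "dag": "xvc pipeline dag",
    "data status": "xvc file list",
    "status": "xvc file list",
    "fetch": "xvc file bring --no-recheck",
    "gc": "xvc file delete",
    "list": "xvc file list",
    "move": "xvc file move",
    "push": "xvc file send",
    "pull": "xvc file bring",
    "repro": "xvc pipeline run",
    "root": "xvc root",
    "stage add": "xvc pipeline step new",
    "stage list": "xvc pipeline step list",
}


def xvc_equivalent(dvc_command: str) -> Optional[str]:
    """Return the Xvc command for a DVC command line, or None if none is listed."""
    tokens = dvc_command.split()
    if not tokens or tokens[0] != "dvc":
        raise ValueError("expected a command starting with 'dvc'")
    # Try two-word subcommands first so "stage add" is matched as a whole.
    for size in (2, 1):
        key = " ".join(tokens[1:1 + size])
        if key in DVC_TO_XVC:
            return DVC_TO_XVC[key]
    return None


print(xvc_equivalent("dvc push"))       # xvc file send
print(xvc_equivalent("dvc stage add"))  # xvc pipeline step new
print(xvc_equivalent("dvc diff"))       # None: no similar command in Xvc
```

Commands that map to a family of Xvc commands (such as dvc run or dvc remote) are left out of the table on purpose, because a single string cannot describe the replacement.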

Technical Differences

  • DVC is written in Python. Xvc is written in Rust.
  • DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
  • DVC tracks file/directory changes in separate .dvc files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.
  • DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (xvc-ecs) in its core.
  • DVC remotes are identical to the DVC cache in structure, and multiple DVC repositories can use the same remote by mixing their files in it. This leads to inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated.
  • DVC considers directories as file-equivalent entities, and tracks them with .dvc files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
  • DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
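The deduplication trade-off between a shared, content-addressed remote and per-repository storage directories can be illustrated with a toy model. This sketch does not reproduce the actual on-disk layout of either tool. The repository names and file contents are hypothetical, and SHA-256 stands in for the real hash algorithms.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address for a blob (SHA-256 as a stand-in)."""
    return hashlib.sha256(data).hexdigest()


def shared_layout(repos: dict) -> set:
    """One store for all repositories: objects are keyed by content only."""
    return {digest(blob) for blobs in repos.values() for blob in blobs}


def per_repo_layout(repos: dict) -> set:
    """One directory per repository: objects are keyed by (repository, content)."""
    return {(name, digest(blob)) for name, blobs in repos.items() for blob in blobs}


# Two hypothetical repositories that share one identical file.
repos = {
    "repo-a": [b"dataset v1", b"model"],
    "repo-b": [b"dataset v1"],
}

shared = shared_layout(repos)
separate = per_repo_layout(repos)
print(len(shared))    # 2 objects: the shared file is stored once
print(len(separate))  # 3 objects: the shared file is stored per repository
```

The shared layout saves space across repositories, while the per-repository layout keeps each repository's files separate at the cost of storing identical content more than once.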