Introduction to Xvc

Xvc is a command line utility to track large files with Git, define dependencies between files so that commands run only when these dependencies change, and run experiments by making small changes in these files for later comparison. It's used mostly in Machine Learning scenarios where data and model files are large, code files depend on these, and experiments must be compared via various metrics.

Xvc can upload tracked files, with their exact versions, to S3 and compatible cloud storages and retrieve them later. This allows you to delete them from the project to save space when they are not needed and get them back when needed. This facility can also be used for sharing these files. You can just clone the Git repository and get only the necessary Xvc-tracked files.

Xvc tracks files, directories and other elements by calculating their digests. These digests are used as addresses to store and find them in the storages. When you make a change to a file, it gets a new digest and the changed version has a new address. This makes sure that all versions can be retrieved on demand.
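The idea of using a digest as an address can be sketched in a few lines of Python. This is only an illustration, not how Xvc lays out its cache: BLAKE2s from the standard library stands in for the digest function, and the `store` helper and its two-character subdirectory layout are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(data: bytes) -> str:
    # BLAKE2s from the standard library stands in for Xvc's digest here.
    return hashlib.blake2s(data).hexdigest()


def store(cache_root: Path, data: bytes) -> Path:
    """Save data under a path derived from its digest and return that path."""
    digest = content_digest(data)
    # Hypothetical layout: first two hex characters as a subdirectory.
    target = cache_root / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    v1 = store(cache, b"selfie, first edit")
    v2 = store(cache, b"selfie, second edit")
    v1_again = store(cache, b"selfie, first edit")
    # Changed content lands at a new address; the old version stays retrievable.
    different_addresses = v1 != v2
    same_address_for_same_content = v1 == v1_again
    old_version = v1.read_bytes()
```

Because the address depends only on the bytes, storing a changed file never overwrites an earlier version.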

Xvc can be used as a make replacement to build multi-file projects with complex dependencies. Unlike make, which detects file changes with timestamps, Xvc checks the files via their content. This reduces false positives in invalidation.
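A small Python sketch shows the kind of false positive a timestamp check produces. It assumes nothing about Xvc internals; SHA-256 from the standard library is used only as a convenient content digest, and `main.c` is a made-up file.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def digest_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "main.c"
    source.write_text("int main(void) { return 0; }\n")

    old_mtime = source.stat().st_mtime
    old_digest = digest_of(source)

    # Simulate `touch`: bump the modification time without changing the bytes.
    os.utime(source, (old_mtime + 60, old_mtime + 60))

    timestamp_says_changed = source.stat().st_mtime > old_mtime
    content_says_changed = digest_of(source) != old_digest
```

A timestamp-based tool would rebuild here, while a content-based check sees that nothing has changed.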

Xvc pipelines are used to define steps to reach a set of outputs. These steps have commands to run and may (or may not) produce intermediate outputs that other steps depend on. As of now, Xvc pipelines allow steps to depend on other steps, other pipelines, text and binary files, directories, globs that select a subset of files, certain lines in a file, certain regular expression results, URLs, and (hyper)parameter definitions in YAML, JSON or TOML files. More dependency types like environment variables, database tables and queries, S3 buckets, REST query results, generic CLI command results, Bitcoin wallets, and Jupyter notebook cells are in the plans.

For example, Xvc can be used to create a pipeline that depends on certain files in a directory via a glob, and a parameter in a YAML file to update a machine learning model. The same feature can be used to build software when the code or artifacts used in the software change. This allows binary outputs (as well as code inputs) to be tracked in Xvc. Instead of building everything from scratch in a new Git clone, a software project can reuse tracked artifacts and rebuild only the portions that require it. Binary distributions become much simpler.
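To make the glob and parameter dependency in this example more concrete, here is a rough Python sketch. It is not Xvc's implementation: the helper names are hypothetical, SHA-256 stands in for the digest, and a JSON parameter file is used instead of YAML so that the sketch needs only the standard library.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def glob_digest(root: Path, pattern: str) -> str:
    """Digest of the names and contents of all files matching a glob."""
    h = hashlib.sha256()
    for path in sorted(root.glob(pattern)):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(path.read_bytes())
    return h.hexdigest()


def param_digest(params_file: Path, key: str) -> str:
    """Digest of a single value, so edits to other keys are ignored."""
    value = json.loads(params_file.read_text())[key]
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


def snapshot(root: Path) -> tuple:
    return (
        glob_digest(root, "data/*.png"),
        param_digest(root / "params.json", "learning_rate"),
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "cat1.png").write_bytes(b"fake image bytes")
    (root / "data" / "notes.txt").write_text("not selected by the glob")
    params = root / "params.json"
    params.write_text(json.dumps({"learning_rate": 0.01, "epochs": 10}))

    before = snapshot(root)

    # Neither change touches what the step depends on.
    (root / "data" / "notes.txt").write_text("edited")
    params.write_text(json.dumps({"learning_rate": 0.01, "epochs": 20}))
    unrelated_changes_invalidate = snapshot(root) != before

    # A new file that matches the glob does.
    (root / "data" / "cat2.png").write_bytes(b"another fake image")
    new_match_invalidates = snapshot(root) != before
```

The step would rerun only when the snapshot differs, which is why narrow dependencies save work.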

This book is used as the documentation of the project. Like Xvc itself, it is a work in progress and may contain outdated information. Please report any errors and bugs at https://github.com/iesahin/xvc, as for the rest of the project.

Comparison with other tools

There are many similar tools for managing large files on Git and for managing machine learning pipelines and experiments. Most ML-oriented tools are provided as SaaS and are in a different vein than Xvc.

Similar tools for file management on Git are the following:

  • git-annex: One of the earliest and most successful projects to manage large files on Git. It supports a large number of remote storage types, as well as adding other utilities as backends, similar to xvc storage new generic. It features an assistant aimed at making common use cases easier. It uses SHA-256 as the single digest option and uses symlinks as a cache type. It doesn't have data pipeline features.
  • git-lfs: It uses Git internals to track binary files. It requires server support for remote storages and allows only Git remotes to be used for binary file storage. It uses the same digest function Git uses (SHA-1 by default). It uses the .gitattributes mechanism to track certain files by default. It doesn't have data pipeline features.
  • dvc: It uses YAML files in the working directory to track file content. It uses MD5 sums. It supports different cache types, but the chosen type applies to all the files in the repository. It has experiment tracking features, data pipelines and a SaaS GUI.

I have done some preliminary benchmarks to measure the time to add files. I added 70,000 files with a single command. xvc file track (0.3.1) finished in 19 seconds, git lfs track '*.png' ; git add 'data/images/**/*.png' in 56 seconds, dvc add data/images in 80 seconds and git-annex add data/images in around 11 minutes. Note that these measurements are affected by output behavior, and commands may gain some speed by turning off the default terminal output. Some finer benchmarks will be provided in the future, when Xvc is optimized.

Installation

Rust

Linux

macOS

Windows

Configuration

Configuration Files

Configure with Environment Variables

Changing configuration for a command

Get Started with Xvc

Xvc is a multipurpose tool. Its features can be used by professionals with various roles. If you're working with data, you can benefit from Xvc data management features.

Xvc for Everyone

Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).

🐇 Hello tortoise. How are you? Let's take a selfie. Do you take selfies? I have lots of them. Terabytes of them.

🐢 I don't have many selfies, you know. I don't change quickly and the scenery changes even less often.

🐇 I see. I have terabytes of them, but can't find a good solution to store them. How do you store your documents? I know you have documents, lots and lots of them.

🐢 I track them with Git to track my evolving thoughts on text files. Images are different. I think it's not a good idea to keep images on Git, but there is a tool for that.

🐇 What kind of tool? Not Git, but something different?

🐢 It's called Xvc. You can keep track of your selfies with it. You can back them up, and get them as needed.

🐇 Tell me more about it. I have a directory in my home, ~/Selfies and I have thousands of them. How will I start?

🐢 Xvc can be used as a standalone tool, but it works better with Git. You can just type

$ git init
$ xvc init

to start working with Xvc.

🐇 It looks easy but I heard that Git is complicated. Will I need to learn it?

🐢 Ah, no. If you're not willing to learn Git, you can just let Xvc handle that. By default, it handles all Git operations about the changes it makes. If you want to share your files with someone, you may need to learn how to manage a repository.

🐇 How do I track my files?

🐢 You use the xvc file track command. Do you have directories in ~/Selfies?

🐇 Yep. I have. Lots of them.

🐢 Do you want to track all of them?

🐇 Almost all. Some of them are so private that I want to hide even from Xvc.

🐢 You can use a .xvcignore file to list them. Xvc ignores the files you list in .xvcignore.

🐇 How do I add others? Could you give an example?

🐢 If you have a folder for today's selfies, type this in ~/Selfies

$ xvc file track today/

and Xvc will track everything in that directory.

🐇 Oh, that's easy. If I want to track everything not ignored, I can type xvc file track then.

🐢 You're a quick learner.

After a brief period, 🐇 went home and added the files.

🐇 Now, I want to learn how to share my selfies.

🐢 Xvc can store file contents in another location. First you must set up a storage. Do you use AWS S3?

🐇 Yes. I have buckets there. I want to keep my selfies in my rabbit-hole.

🐢 You can configure Xvc to use it with the xvc storage new s3 command. You'll specify the region and bucket, and Xvc will prepare it.

🐇 types

$ xvc storage new s3 --name selfies --region eu-lepus-1 --bucket rabbit-hole

🐢 Now, you can send your files there with xvc file send --to selfies.

🐇 Is that all?

🐢 You will also need to push your Git files to another place. Do you have a GitHub account?

🐇 Ah, yeah, I have.

🐢 Now create a repository for your selfies. We will configure Git to use it as origin.

$ git remote add origin https://github.com/🐇/selfies
$ git push --set-upstream origin main

Now, you can share your selfies with your friends.

🐇 Cool, but how does Xvc know my AWS password? Does it share my passwords?

🐢 No, never. You must allow your friends to read that bucket of yours. Xvc reads the credentials from AWS configuration, either from the file or the environment variables.

🐇 How will they get my files?

🐢 First, they must clone the repository.

$ git clone https://github.com/🐇/selfies

Then, they can get all files with:

$ cd selfies
$ xvc file get .

🐇 Oh, cool, they don't have to run xvc init again, right?

🐢 No, they don't. Xvc should be initialized only once per repository. When you have new selfies, you can share them with:

$ xvc file track
$ xvc file send --to selfies
$ git push

and your friends can receive the changes with

$ git pull
$ xvc file get

🐇 The order of these commands is important, it seems.

🐢 Yep. You add to Xvc first. Xvc automatically commits the changes to Git. Then you push Git changes to remote. Your friends first pull these changes, then get the actual files.

🐇 Thank you tortoise. Let me get back to my hole.

Xvc for Data

Xvc for Machine Learning

Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).

🐇 Ah, hello tortoise. How are you? I began to work as a machine learning engineer, you know? I'll be the fastest.

🐢 You're quick as always, hare. How is your job going so far?

🐇 It's good. We have lots and lots of data. We have models. We have scripts to create those models. We have notebooks full of experiments. That's all good stuff. We'll solve the hare intelligence problem.

🐢 Sounds cool. Aren't you losing yourself in all these, though?

🐇 From time to time we have those moments. Some models work with some data, some experiments require some kind of preprocessing, some data changed since we started to work with it and now we have multiple versions.

🐢 I see. I began to use a tool called Xvc. It may be of use to you.

🐇 What does it do?

🐢 It keeps track of all the stuff you mentioned. Data, models, scripts. It can also detect when data changes and run the scripts associated with that data.

🐇 That sounds like something we need. My boss wanted me to build a pipeline for cat pictures. He runs a contest for cat pictures. Every time he finds a new cat picture he likes, we have to update the model.

🐢 He must have lots of cat pictures.

🐇 He has. He sometimes finds higher resolution versions and replaces older pictures. He has terabytes of cat pictures.

🐢 How do you keep track of those versions?

🐇 We don't. We have a disk for cat pictures. He puts everything there and we train models with it.

🐢 You can use Xvc to version those files. You can go back and forth in time, or have different branches. It's based on Git.

🐇 I know, but Git is for code files, right? I never found a good way to store image files in Git. It stores everything.

🐢 Yep. Git keeps all history in each repository. Better to keep those terabytes of images away from Git. Otherwise, you'll have terabytes of cat pictures in each clone you use. Xvc helps there. It tracks contents of data files separately from Git. Image files are not put into Git objects, and they are not duplicated in all repositories.

🐇 You know, I'm not interested in details. Tell me how this works.

🐢 Ok. When you go back to the cat picture directory, create a Git repository, and initialize Xvc immediately.

$ git init
...
$ xvc init
? 0

🐇 No messages?

🐢 Xvc is of the silent type of Unix commands. It follows the "no news is good news" principle. We use ? 0 to indicate the command return code. 0 means success. If you want more output, you can add -v as a flag. Increase the number of -vs to increase the detail.

🐇 So -vvvvvvvvvvvvvvv will show which atoms interact in disk while running Xvc?

🐢 It may work, try that next time. Now, you can add your cat pictures to Xvc. Xvc makes copies of tracked files by default. I assume you have a large collection. Better to make everything symlinks for now. We can change how specific files are linked to cache later.

$ xvc -v file track --cache-type symlink .
...

🐇 Does it track everything that way?

🐢 Yes. If you want to track only particular files or directories, you can replace . with their names.

🐇 What's the best cache type for me?

🐢 If your file system supports it, reflink seems the best way to me. It's like a symlink but makes a copy when your file changes. Most of the widely used file systems don't support it, though. If your files are read-only and you don't have many links to the same files, you can use hardlink. If they are likely to change, you can use copy. If there are many links to the same files, better to use symlink.

🐇 So, symlinks are not the best? Why did you select it?

🐢 I suspect most of the files in your cat pictures are duplicates. Xvc stores only one copy of these in cache and links all occurrences in the workspace to this copy. This is called deduplication. There are limits to the number of hardlinks, so I recommended symlinks. They are also more visible: you can see they are links. Hardlinks are harder to detect.

🐇 Ah, when I type ls -l, they all show the cache location now.

🐢 If you have a models/ directory and want to track them as copies, you can tell Xvc:

$ xvc file track --cache-type copy models/

It replaces previous symlinks with the copies of the files only in models/.

🐇 Can I have my data read only and models writable?

🐢 You can. Xvc keeps track of each file's cache-type separately. Data can stay in read-only symlinks, and models can be copied so they can be updated and stored as different versions.
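What 🐢 describes, one cached copy per distinct content and a separate link type per file, can be pictured with a short Python sketch. Everything here is an assumption made for illustration: the `track` function, the flat cache directory and the SHA-256 names are hypothetical and do not reflect Xvc's actual cache layout.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def track(cache: Path, workspace_file: Path, recheck_as: str) -> Path:
    """Move a file's content into the cache, then link or copy it back."""
    digest = hashlib.sha256(workspace_file.read_bytes()).hexdigest()
    cached = cache / digest
    if cached.exists():
        workspace_file.unlink()  # duplicate content: keep the one cached copy
    else:
        shutil.move(str(workspace_file), str(cached))
    if recheck_as == "symlink":
        os.symlink(cached, workspace_file)
    elif recheck_as == "copy":
        shutil.copyfile(cached, workspace_file)
    else:
        raise ValueError(f"unsupported recheck type: {recheck_as}")
    return cached


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    cache.mkdir()
    (root / "a.jpg").write_bytes(b"same cat")
    (root / "b.jpg").write_bytes(b"same cat")
    (root / "model.bin").write_bytes(b"weights")

    cached_a = track(cache, root / "a.jpg", "symlink")
    cached_b = track(cache, root / "b.jpg", "symlink")
    track(cache, root / "model.bin", "copy")

    cached_files = len(list(cache.iterdir()))
    a_is_link = (root / "a.jpg").is_symlink()
    model_is_link = (root / "model.bin").is_symlink()
    duplicates_share_cache = cached_a == cached_b
```

The two identical pictures end up as links to a single cached file, while the model stays an ordinary, writable copy.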

🐇 I also have scripts. What should I do with them?

🐢 Are you using Git for them?

🐇 Yep. They are in a separate repository. I think I can use the same repository now.

🐢 You can. Better to keep them in the same repository. They can be versioned with the data they use and models they produce. You can use standard Git commands to track them. If you track a file with Git, Xvc doesn't track it. It stays away from it.

🐇 You said we can create pipelines with Xvc as well. I created a multi-stage pipeline for cat picture models. It's like this:

graph LR
    cats["data/cats/"] --> pp-train["preprocess.py --train data/pp-train/"]
    pp-train --> train["train.py"]
    params["params.yaml"] --> train
    cat-ratings["cat-ratings.txt"] --> train
    train --> model["models/model.bin"]
    cats --> pp-test["preprocess.py --test data/pp-test/"]
    model --> test["test.py"]
    pp-test --> test
    test --> metrics["metrics.json"]
    test --> best-model["best-model.json"]
    best-model --> deploy["deploy.sh"]

🐢 It looks like a fairly complex pipeline. You can create a pipeline definition for it. For each separate command we'll have a step. How many different commands do you have?

🐇 A preprocess --train command, a preprocess --test command, a train command, a test command and a deploy command. Five.

🐢 Do you need more than one pipeline? Maybe you would like to put deployment to another pipeline?

🐇 No, I don't think so. I may have in the future.

🐢 Xvc has a default pipeline. We'll use it for now. If you need more pipelines, you can create them with xvc pipeline new.

🐇 How do I create steps for the commands?

🐢 Let's create the steps at once. Each step requires a name and a command.

$ xvc pipeline step new --step-name preprocess-train --command 'python3 src/preprocess.py --train data/cats data/pp-train/'
$ xvc pipeline step new --step-name preprocess-test --command 'python3 src/preprocess.py --test data/cats data/pp-test/'
$ xvc pipeline step new --step-name train --command 'python3 src/train.py data/pp-train/'
$ xvc pipeline step new --step-name test --command 'python3 src/test.py data/pp-test/ metrics.json'
$ xvc pipeline step new --step-name deploy --command 'python3 deploy.py models/model.bin /var/server/files/model.bin'

🐇 How do we define dependencies?

🐢 You can have many different types of dependencies. All are defined with the xvc pipeline step dependency command. You can set up direct dependencies between steps; if one is invalidated, its dependents also run. You can set up file dependencies; if the file changes, the step is invalidated and needs to run. There are other, more detailed dependencies like parameter dependencies, which take a file in JSON or YAML format and check whether a value has changed. There are also regular expression dependencies. For example, if you have a piece of code in your training script that you change to update the parameters, you can define a regex dependency on it.

🐇 It looks like I can use this for CSV files as well.

🐢 Yes. If your step depends not on the whole CSV file, but only on specific rows, you can use regex dependencies. You can also specify line numbers of a file to depend on.
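A rough way to picture regex and line dependencies is to digest only the selected lines, so edits elsewhere in the file go unnoticed. The Python sketch below is an illustration under that assumption; the function names and the sample ratings are made up, and Xvc may compute these differently.

```python
import hashlib
import re


def regex_digest(text: str, pattern: str) -> str:
    """Digest of only the lines that match the pattern."""
    rx = re.compile(pattern)
    h = hashlib.sha256()
    for line in text.splitlines():
        if rx.search(line):
            h.update(line.encode())
            h.update(b"\n")
    return h.hexdigest()


def lines_digest(text: str, first: int, last: int) -> str:
    """Digest of a 1-based, inclusive range of lines."""
    selected = text.splitlines()[first - 1:last]
    return hashlib.sha256("\n".join(selected).encode()).hexdigest()


ratings = "5,tabby.jpg\n3,grumpy.jpg\n5,kitten.jpg\n"
baseline = regex_digest(ratings, r"^5,")

# Editing a 3-star row leaves the matching lines untouched.
edited_other_row = ratings.replace("3,grumpy.jpg", "2,grumpy.jpg")
other_row_invalidates = regex_digest(edited_other_row, r"^5,") != baseline

# Adding a 5-star row changes the matching lines.
added_five_star = ratings + "5,sphynx.jpg\n"
five_star_invalidates = regex_digest(added_five_star, r"^5,") != baseline
```

Only the second edit would invalidate a step that depends on the 5-star rows.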

🐇 My preprocess.py script depends on the data/cats directory. My train.py script depends on params.yaml for some hyperparameters, and reads 5-star ratings from cat-contest.csv. I want to deploy when the newly produced model is better than the older one by checking best-model.json. My deployment script doesn't update the deployment if the new model is not the best.

🐢 Let's see. For each step, you can use a single command to define its dependencies. For preprocess.py, you'll depend on the data directory and the script itself. We want to run the step when the script changes. It's like this:

$ xvc pipeline step dependency --step-name preprocess-train --directory data/cats --file src/preprocess.py
$ xvc pipeline step dependency --step-name preprocess-test --directory data/cats --file src/preprocess.py
$ xvc pipeline step dependency --step-name train --directory data/pp-train --file src/train.py --param 'params.yaml::learning_rate' --regex 'cat-contest.csv:/^5,.*'
$ xvc pipeline step dependency --step-name test --directory models/ --directory data/pp-test/
$ xvc pipeline step dependency --step-name deploy --file best-model.json 

You must also define the outputs these steps produce, so when an output is missing or a dependency is newer than the output, the step will need to rerun.

$ xvc pipeline step output --step-name preprocess-train --directory data/pp-train
$ xvc pipeline step output --step-name preprocess-test --directory data/pp-test
$ xvc pipeline step output --step-name train --directory models/
$ xvc pipeline step output --step-name test --file metrics.json  --file best-model.json
$ xvc pipeline step output --step-name deploy --file /var/server/files/model.bin
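The rerun rule for outputs can be sketched as a small Python function. This is a simplified picture, not Xvc's logic: it assumes dependencies are compared by content digest rather than by timestamp, in line with the content-based checks described in the introduction, and the `needs_run` helper and its `recorded` mapping are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_run(dependencies, outputs, recorded):
    """True if an output is missing or a dependency differs from its recorded digest.

    `recorded` maps dependency paths to the digests seen at the last run.
    """
    if any(not out.exists() for out in outputs):
        return True
    return any(recorded.get(dep) != file_digest(dep) for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    script = root / "train.py"
    model = root / "model.bin"
    script.write_text("print('training')\n")

    first_time = needs_run([script], [model], recorded={})

    # Pretend the step ran: the output exists and the digest is recorded.
    model.write_bytes(b"weights")
    recorded = {script: file_digest(script)}
    up_to_date = needs_run([script], [model], recorded)

    script.write_text("print('training harder')\n")
    after_edit = needs_run([script], [model], recorded)

    script.write_text("print('training')\n")
    model.unlink()
    after_output_removed = needs_run([script], [model], recorded)
```

Declaring outputs matters for the last case: without them, a deleted model would go unnoticed.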

🐇 These commands become too long to type. You know, I'm a lazy hare and don't like to type much. Is there an easier way?

🐢 You can try source <(xvc aliases) in your Bash or Zsh, and get a bunch of aliases for these commands. xvc pipeline step output becomes xvcpso, xvc pipeline step dependency becomes xvcpsd, etc. You can see the whole list:

$ xvc aliases
alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfc='xvc file checkout'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'

🐇 Oh, there are many more commands.

🐢 Yep. More to come as well. If you want to edit the pipelines you created in YAML, you can use xvc pipeline export and after making the changes, you can use xvc pipeline import.

🐇 I don't need to delete the pipeline to rewrite everything, then?

🐢 You can export a pipeline, edit and import with a different name to test. When you want to run them, you specify their names.

🐇 Ah, yeah, that's the most important part. How do I run?

🐢 xvc pipeline run, or xvcpr. It takes the name of the pipeline and runs it. It sorts the steps and checks if there are any cycles. The steps mustn't have cycles, otherwise it's an infinite loop, and computers don't like infinite loops like turtles do. Xvc runs steps in parallel if they don't depend on each other.

🐇 So, if I have multiple preprocessing steps that don't depend on each other, they can run in parallel?

🐢 Yeah, they run in parallel. For example in your pipeline preprocess-train and preprocess-test can run in parallel, because they don't depend on each other.
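Sorting the steps, rejecting cycles and finding steps that can run together is a classic topological sort. The Python sketch below uses the standard library's graphlib on the hare's five steps. It only illustrates the idea; the `parallel_batches` helper is hypothetical and says nothing about how Xvc schedules work internally.

```python
from graphlib import CycleError, TopologicalSorter

# Each step maps to the set of steps it depends on, following the hare's pipeline.
steps = {
    "preprocess-train": set(),
    "preprocess-test": set(),
    "train": {"preprocess-train"},
    "test": {"train", "preprocess-test"},
    "deploy": {"test"},
}


def parallel_batches(graph):
    """Group steps into batches whose members can run at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the steps form a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(steps)

try:
    parallel_batches({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The first batch holds both preprocessing steps, and the remaining steps follow one at a time.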

🐇 Cool. I want to see the pipeline we created.

🐢 You can see it with xvc pipeline dag (xvcpd). It prints a mermaid.js diagram that you can paste to your files.

🐇 Better to have an image of this, maybe.

🐢 I'll inform the developer about it. Please tell him anything you'd like to see in the tool on GitHub or via email. He's extremely introverted but tries to be a nice guy.

🐇 Ah, ok, I'll write to him about this.

Xvc for Software Development

Xvc for DVC Users

DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.

Note that this document refers mostly to Xvc v0.4 and DVC 2.30. Both commands are in development, and similarities and differences may change considerably.

Similarities

The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage data, pipelines and experiments of an ML project.

Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC. Xvc has the same optional and recommended reliance on Git.

Both of these commands hash file content to detect changes in files.

Both of these use DAGs to represent pipelines.

Conceptual Differences

  • What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
  • What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word in a different meaning increases the mental effort to describe and understand.
  • In DVC, there is a 1-1 correspondence between dvc.yaml files in a repository and the pipelines. In Xvc, pipelines are more abstract. They are defined with xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
  • DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git.
  • Cache type (or rather, recheck type), that is, whether a file in the repository is linked to its cached version by copy, reflink, symlink or hardlink, is determined repository-wide in DVC. You can either have all your cache links as symlinks, or hardlinks, etc. Xvc tracks these per file; you can have one file symlinked to cache, another file copied from cache, etc.

Command Differences

❗ Note that some of the Xvc commands described here are still under development.

  • While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for exposition at first, as these two are analogous. However, giving the same name also hides some important details that are more difficult to emphasize later (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes).
  • dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
  • dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can be used against any other ignore filename (.gitignore, .ignore, .fooignore...).
  • dvc checkout is replaced by xvc file recheck. There is a --recheck-as option in several Xvc commands that tells whether to check out as symlink, hardlink, reflink or copy.
  • dvc commit is replaced by xvc file carry-in.
  • There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with -c options in each run. You can also supply all configuration from the environment. See Configuration.
  • dvc dag is replaced by xvc pipeline dag. DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art, only Graphviz representation.
  • dvc data status and dvc status can be replaced by xvc file list. Xvc version doesn't provide information about pipelines, or the remotes.
  • There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point.
  • There is no command similar to dvc diff in Xvc.
  • There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text.
  • Currently, there are no commands corresponding to dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
  • dvc fetch is replaced by xvc file bring --no-recheck.
  • Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.
  • Instead of dvc gc to "garbage-collect" files, you can use xvc file delete with various options.
  • There is no corresponding command for dvc get-url in Xvc. You can use wget or curl instead.
  • Currently there is no command to replace dvc get and dvc import, and dvc import-url. URL dependencies are to be supported eventually with a different mechanism.
  • Instead of hooks like those dvc install sets up, Xvc issues Git commands itself if the git.auto_commit and git.auto_stage configuration options are set.
  • There is no corresponding command for dvc list-url.
  • dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented but on the roadmap.
  • Currently, there is no params/metrics tracking/diff similar to dvc params, dvc metrics or dvc plots commands in Xvc.
  • dvc move is replaced by xvc file move.
  • dvc push is replaced by xvc file send.
  • dvc pull is replaced by xvc file bring.
  • There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.
  • dvc remote set of commands are replaced by xvc storage set of commands. You can use xvc storage new for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.
  • There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.
  • Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run again.
  • xvc root is for the same purpose as dvc root.
  • dvc run (that defines a stage in DVC pipeline and immediately runs it) can be replaced by xvc pipeline set of commands. xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify dependencies of a step, xvc pipeline step output to specify outputs of a step and xvc pipeline run to run this pipeline.
  • Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.
  • There is no need for dvc protect or dvc unprotect commands in Xvc. "Cache type" is not a repository-wide option. If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as. If you want identical files copied to one directory and linked in another, xvc file copy can help.
  • DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and automatically invalidates the step if the URL/file has changed.
  • DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
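The never, always and by_dependencies values listed above for xvc pipeline step update --changed can be read as a small decision rule. The Python sketch below states one plausible reading of those three values. It is an assumption about their meaning rather than a description of Xvc's code, and the `Changed` enum and `should_run` function are hypothetical names.

```python
from enum import Enum


class Changed(Enum):
    NEVER = "never"
    ALWAYS = "always"
    BY_DEPENDENCIES = "by_dependencies"


def should_run(policy: Changed, dependencies_changed: bool) -> bool:
    """Decide whether a step runs, given its policy and the state of its dependencies."""
    if policy is Changed.NEVER:
        return False  # comparable to a frozen stage in DVC
    if policy is Changed.ALWAYS:
        return True
    return dependencies_changed


frozen_runs = should_run(Changed.NEVER, dependencies_changed=True)
forced_runs = should_run(Changed.ALWAYS, dependencies_changed=False)
default_runs = should_run(Changed("by_dependencies"), dependencies_changed=True)
```

Under this reading, setting a step to never plays the role of dvc freeze, and switching it back to by_dependencies plays the role of dvc unfreeze.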

Technical Differences

  • DVC is written in Python. Xvc is written in Rust.
  • DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
  • DVC tracks file/directory changes in separate .dvc files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.
  • DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (xvc-ecs) in its core.
  • DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This leads to inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated.
  • DVC considers directories as file-equivalent entities to track with .dvc files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
  • DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
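To get a feel for the configurable digest mentioned in the list above, here is a small Python sketch using hashlib. It covers only the three algorithms that Python's standard library provides. BLAKE3, the Xvc default, is not in hashlib and is left out. The `ALGORITHMS` table and its key names are made up for this illustration and are not Xvc configuration values.

```python
import hashlib

# Standard-library counterparts of three of the algorithms named above.
# BLAKE3, the Xvc default, is not in hashlib and is left out of this sketch.
ALGORITHMS = {
    "blake2s": hashlib.blake2s,
    "sha2-256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
}


def digest(data: bytes, algorithm: str = "blake2s") -> str:
    """Hex digest of data with the selected algorithm."""
    return ALGORITHMS[algorithm](data).hexdigest()


content = b"cat picture bytes"
digests = {name: digest(content, name) for name in ALGORITHMS}
```

All three produce 256-bit digests, so they are interchangeable as addresses, but a repository has to stick to one of them for the addresses to stay comparable.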

Xvc for Git Annex Users

Xvc for Git LFS Users

How-To Guides

How to Compile Xvc

Why would you compile?

  • You want to use Xvc on a platform for which we don't distribute a binary.
  • You want a smaller binary size by removing features that you don't use.
  • You like your software compiled.
  • It's easier for you to install with cargo than by other means.
  • You want to fix a bug for yourself.
  • You want to contribute!

Install Rust

You must have Rust installed on your system.

If you have a sensible terminal on your system:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Otherwise, refer to the other installation methods page.

Clone the repository

Clone the repository from Emre's GitHub repository.

$ git clone https://github.com/iesahin/xvc -b latest

The latest tag refers to the latest stable release. If you're willing to fight with compilation errors, you can also use main branch directly.

Compile without default features

Xvc with Git Branches

When you're working with multiple branches in Git, you may ask Xvc to check out a branch and commit to another branch. These operations are performed at the beginning and at the end of Xvc operations. You can use the --from-ref and --to-branch options to check out a Git reference before an Xvc operation, and commit the results to a certain Git branch.

Checkout and commit operations sandwich Xvc operations.

graph LR
   checkout["git checkout $REF"] --> xvc
   xvc["xvc operation"] --> stash["git stash --staged"]
   stash --> branch["git checkout -b $TO_BRANCH"]
   branch --> commit["git add .xvc && git commit"]

If --from-ref is not given, the initial git checkout is not performed. Xvc operates in the current branch. This is the default behavior.

$ git init --initial-branch=main
...
$ xvc init
? 0
$ ls
data.txt

$ xvc --to-branch data-file file track data.txt
Switched to a new branch 'data-file'

$ git branch
* data-file
  main

$ git status -s
$ xvc file list data.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


If you return to the main branch, you'll see the file is tracked by neither Git nor Xvc.

$ git checkout main
...
$ xvc file list data.txt
FX          19 [..]          c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:           0


$ git status -s
?? data.txt

Now, we'll add a step to the default pipeline to get an uppercase version of the data. We want this to work only in the data-file branch.

$ xvc --from-ref data-file pipeline step new --step-name to-uppercase --command 'cat data.txt | tr a-z A-Z > uppercase.txt'
Switched to branch 'data-file'

$ xvc pipeline step dependency --step-name to-uppercase --file data.txt 
$ xvc pipeline step output --step-name to-uppercase --output-file uppercase.txt

Note that the xvc pipeline step dependency and xvc pipeline step output commands don't need the --from-ref and --to-branch options, as they already run in the data-file branch.

Now, we want to have this new version of the data available only in the uppercase branch.

$ xvc --from-ref data-file --to-branch uppercase pipeline run
Already on 'data-file'
Switched to a new branch 'uppercase'

$ git branch
  data-file
  main
* uppercase

You can use this for experimentation: whenever you want to run a pipeline and keep its results in another Git branch, use --to-branch.

$ xvcpr --from-ref data-file --to-branch another-uppercase
$ git branch
* another-uppercase
  uppercase
  data-file
  main

The pipeline always runs, because uppercase.txt is always missing in the data-file branch. It's stored only in the branch you specify with --to-branch.

Command Reference

Synopsis

$ xvc --help
Xvc CLI to manage data and ML pipelines

Usage: xvc [OPTIONS] <COMMAND>

Commands:
  file          File and directory management commands
  init          Initialize an Xvc project
  pipeline      Pipeline management commands
  storage       Storage (cloud) management commands
  root          Find the root directory of a project
  check-ignore  Check whether files are ignored with `.xvcignore`
  aliases       Print command aliases to be sourced in shell files
  help          Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...             Output verbosity. Use multiple times to increase the output detail
      --quiet                  Suppress all output
      --debug                  Turn on all logging to $TMPDIR/xvc.log
  -C <WORKDIR>                 Set working directory for the command. It doesn't create a new shell, or change the directory [default: .]
  -c, --config <CONFIG>        Configuration options set from the command line in the form section.key=value You can use multiple times
      --no-system-config       Ignore system configuration file
      --no-user-config         Ignore user configuration file
      --no-project-config      Ignore project configuration file (.xvc/config)
      --no-local-config        Ignore local (gitignored) configuration file (.xvc/config.local)
      --no-env-config          Ignore configuration options obtained from environment variables
      --skip-git               Don't run automated Git operations for this command. If you want to run git commands yourself all the time, you can set `git.auto_commit` and `git.auto_stage` options in the configuration to False
      --from-ref <FROM_REF>    Checkout the given Git reference (branch, tag, commit etc.) before performing the Xvc operation. This runs `git checkout <given-value>` before running the command
      --to-branch <TO_BRANCH>  If given, create (or checkout) the given branch before committing results of the operation. This runs `git checkout --branch <given-value>` before committing the changes
  -h, --help                   Print help
  -V, --version                Print version

Subcommands

  • file: File and directory management commands
  • init: Initialize an Xvc project
  • pipeline: Pipeline management commands
  • storage: Storage (cloud) management commands
  • root: Find the root directory of a project
  • check-ignore: Check whether files are ignored with .xvcignore
  • aliases: Print command aliases to be sourced in shell files

xvc init

Synopsis

$ xvc init --help
Initialize an Xvc project

Usage: xvc init [OPTIONS]

Options:
      --path <PATH>  Path to the directory to be initialized. (default: current directory)
      --no-git       Don't require Git
      --force        Create the repository even if already initialized. Overwrites the current .xvc directory Resets all data and guid, etc
  -h, --help         Print help
  -V, --version      Print version

Examples

To initialize a blank Xvc repository, initialize Git first and run xvc init.

$ cd my-project-1
$ git init
...
$ xvc init

The command doesn't print anything upon success.

If you want to initialize

File Management

Synopsis

$ xvc file --help
File and directory management commands

Usage: xvc file [OPTIONS] <COMMAND>

Commands:
  track     Add file and directories to Xvc
  hash      Get digest hash of files with the supported algorithms
  recheck   Get files from cache by copy or *link
  carry-in  Carry (commit) changed files to cache
  copy      Copy from source to another location in the workspace
  list      List tracked and untracked elements in the workspace
  send      Send (push, upload) files to external storages
  bring     Bring (download, pull, fetch) files from external storages
  help      Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...         Verbosity level. Use multiple times to increase command output detail
      --quiet              Suppress error messages
  -C <WORKDIR>             Set the working directory to run the command as if it's in that directory [default: .]
  -c, --config <CONFIG>    Configuration options set from the command line in the form section.key=value
      --no-system-config   Ignore system config file
      --no-user-config     Ignore user config file
      --no-project-config  Ignore project config (.xvc/config)
      --no-local-config    Ignore local config (.xvc/config.local)
      --no-env-config      Ignore configuration options from the environment
  -h, --help               Print help
  -V, --version            Print version

Subcommands

  • track: Begin tracking (add) files with XVC
  • hash: Calculate hash of given file
  • recheck: Copy/link files in the cache to the workspace (checkout)
  • carry-in: Carry (commit) changed files to cache
  • copy: Copy files in the workspace to another location
  • list: List files tracked with XVC
  • send: Send (push) files to remote
  • bring: Bring (pull) files from remote

xvc file track

Purpose

xvc file track registers any kind of file with Xvc for version tracking.

Synopsis

$ xvc file track --help
Add file and directories to Xvc

Usage: xvc file track [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to track

Options:
      --cache-type <CACHE_TYPE>
          How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --no-commit
          Do not copy/link added files to the file cache

      --text-or-binary <TEXT_OR_BINARY>
          Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)

      --force
          Add targets even if they are already tracked

      --no-parallel
          Don't use parallelism

  -h, --help
          Print help (see a summary with '-h')

Examples

By default, the command works like git add followed by git commit.

$ xvc file track my-large-image.jpeg

You can track directories with the same command.

$ xvc file track my-large-directory/

You can specify more than one target in a single command.

$ xvc file track my-large-image.jpeg my-large-directory

Caveats

  • This command doesn't distinguish between symbolic links and hardlinks. Links are followed, and any broken links may cause errors.

  • Under the hood, Xvc tracks only files, not directories. Directories are treated as collections of paths. It doesn't matter whether you track a directory or the files in it separately.

Technical Details

  • Detecting changes in files and directories employs different kinds of associated digests. If a file has a different metadata digest, its content digest is calculated. If the file's content digest has changed, the file is considered changed. A directory that contains a different set of files, or files with changed content, is considered changed.
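
The two-stage check described above can be sketched in Python. This is only an illustration, not Xvc's implementation: treating file size and modification time as the metadata, using SHA-256 as the digest, and the names record and has_changed are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def metadata_digest(path):
    # Assumption: "metadata" means file size and modification time.
    st = os.stat(path)
    return hashlib.sha256(f"{st.st_size}:{st.st_mtime_ns}".encode()).hexdigest()


def content_digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def record(path):
    """Store both digests, as a tracker would when a file is first added."""
    return {"metadata": metadata_digest(path), "content": content_digest(path)}


def has_changed(path, recorded):
    # Cheap check first: with identical metadata the content is never read.
    if metadata_digest(path) == recorded["metadata"]:
        return False
    # Metadata differs, so the content digest decides.
    return content_digest(path) != recorded["content"]


def change_detection_demo():
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "data.txt")
        with open(path, "w") as f:
            f.write("Oh, data, my, data\n")
        recorded = record(path)
        untouched = has_changed(path, recorded)
        os.utime(path, ns=(1, 1))  # new timestamp, same content
        touched = has_changed(path, recorded)
        with open(path, "w") as f:
            f.write("Oh, deetee, my, deetee\n")
        edited = has_changed(path, recorded)
        return untouched, touched, edited
```

In this sketch, a file whose timestamp changes but whose content stays the same is not reported as changed, which is the point of checking the content digest after the metadata digest.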

xvc file list

Synopsis

$ xvc file list --help
List tracked and untracked elements in the workspace

Usage: xvc file list [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to list.
          
          If not supplied, lists all files under the current directory.

Options:
  -f, --format <FORMAT>
          A string for each row of the output table
          
          The following are the keys for each row:
          
          - {{acd8}}:  actual content digest from the workspace file. First 8 digits.
          
          - {{acd64}}:  actual content digest. All 64 digits.
          
          - {{aft}}:  actual file type. Whether the entry is a file (F), directory (D), symlink (S), hardlink (H) or reflink (R).
          
          - {{asz}}:  actual size. The size of the workspace file in bytes. It uses MB, GB and TB to represent sizes larger than 1MB.
          
          - {{ats}}:  actual timestamp. The timestamp of the workspace file.
          
          - {{name}}: The name of the file or directory.
          
          - {{cst}}:  cache status. One of "=", ">", "<", "X", or "?" to show whether the file timestamp is the same as the cached timestamp, newer, older, not cached or not tracked.
          
          - {{rcd8}}:  recorded content digest stored in the cache. First 8 digits.
          
          - {{rcd64}}:  recorded content digest stored in the cache. All 64 digits.
          
          - {{rct}}:  recorded cache type. Whether the entry is linked to the workspace as a copy (C), symlink (S), hardlink (H) or reflink (R).
          
          - {{rsz}}:  recorded size. The size of the cached content in bytes. It uses MB, GB and TB to represent sizes larger than 1MB.
          
          - {{rts}}:  recorded timestamp. The timestamp of the cached content.
          
          The default format can be set with file.list.format in the config file.

  -s, --sort <SORT>
          Sort criteria.
          
          It can be one of none (default), name-asc, name-desc, size-asc, size-desc, ts-asc, ts-desc.
          
          The default option can be set with file.list.sort in the config file.

      --no-summary
          Don't show total number and size of the listed files.
          
          The default option can be set with file.list.no_summary in the config file.

  -h, --help
          Print help (see a summary with '-h')

Examples

For these examples, we'll create a directory tree with five directories, each having five files.

$ xvc-test-helper create-directory-tree --directories 5 --files 5 --fill 23

$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
├── dir-0002
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
├── dir-0003
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
├── dir-0004
│   ├── file-0001.bin
│   ├── file-0002.bin
│   ├── file-0003.bin
│   ├── file-0004.bin
│   └── file-0005.bin
└── dir-0005
    ├── file-0001.bin
    ├── file-0002.bin
    ├── file-0003.bin
    ├── file-0004.bin
    └── file-0005.bin

[..] directories, 25 files

The xvc file list command works only in Xvc repositories. As we haven't initialized a repository yet, it lists nothing.

$ xvc file list 

Let's initialize the repository.

$ git init
...

$ xvc init

Now it lists all files and directories.

$ xvc file list --sort name-asc
FX         107 [..]          ce9fcf30 .gitignore
FX         130 [..]          ac46bf74 .xvcignore
DX         224 [..]                   dir-0001
FX        1001 [..]          189fa49f dir-0001/file-0001.bin
FX        1002 [..]          8c079454 dir-0001/file-0002.bin
FX        1003 [..]          2856fe70 dir-0001/file-0003.bin
FX        1004 [..]          3640687a dir-0001/file-0004.bin
FX        1005 [..]          e23e79a0 dir-0001/file-0005.bin
DX         224 [..]                   dir-0002
FX        1001 [..]          189fa49f dir-0002/file-0001.bin
FX        1002 [..]          8c079454 dir-0002/file-0002.bin
FX        1003 [..]          2856fe70 dir-0002/file-0003.bin
FX        1004 [..]          3640687a dir-0002/file-0004.bin
FX        1005 [..]          e23e79a0 dir-0002/file-0005.bin
DX         224 [..]                   dir-0003
FX        1001 [..]          189fa49f dir-0003/file-0001.bin
FX        1002 [..]          8c079454 dir-0003/file-0002.bin
FX        1003 [..]          2856fe70 dir-0003/file-0003.bin
FX        1004 [..]          3640687a dir-0003/file-0004.bin
FX        1005 [..]          e23e79a0 dir-0003/file-0005.bin
DX         224 [..]                   dir-0004
FX        1001 [..]          189fa49f dir-0004/file-0001.bin
FX        1002 [..]          8c079454 dir-0004/file-0002.bin
FX        1003 [..]          2856fe70 dir-0004/file-0003.bin
FX        1004 [..]          3640687a dir-0004/file-0004.bin
FX        1005 [..]          e23e79a0 dir-0004/file-0005.bin
DX         224 [..]                   dir-0005
FX        1001 [..]          189fa49f dir-0005/file-0001.bin
FX        1002 [..]          8c079454 dir-0005/file-0002.bin
FX        1003 [..]          2856fe70 dir-0005/file-0003.bin
FX        1004 [..]          3640687a dir-0005/file-0004.bin
FX        1005 [..]          e23e79a0 dir-0005/file-0005.bin
Total #: 32 Workspace Size:       26432 Cached Size:           0


With the default output format, the first two letters show the path type and cache type, respectively.

For example, if you track dir-0001 as copy, the first letter is F for the files and D for the directories. The second letter is C for files, meaning the file is a copy of the cached file, and X for directories, meaning they are not in the cache. Similar to Git, Xvc tracks only files; directories are considered collections of files.

$ xvc file track dir-0001/

$ xvc file list dir-0001/
FC        1005 [..] e23e79a0 e23e79a0 dir-0001/file-0005.bin
FC        1004 [..] 3640687a 3640687a dir-0001/file-0004.bin
FC        1003 [..] 2856fe70 2856fe70 dir-0001/file-0003.bin
FC        1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC        1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 5 Workspace Size:        5015 Cached Size:        5015


If you add another set of files as hardlinks to the cached copies, it will print the second letter as H.

$ xvc file track dir-0002 --cache-type hardlink

$ xvc file list dir-0002
FH        1005 [..] e23e79a0 e23e79a0 dir-0002/file-0005.bin
FH        1004 [..] 3640687a 3640687a dir-0002/file-0004.bin
FH        1003 [..] 2856fe70 2856fe70 dir-0002/file-0003.bin
FH        1002 [..] 8c079454 8c079454 dir-0002/file-0002.bin
FH        1001 [..] 189fa49f 189fa49f dir-0002/file-0001.bin
Total #: 5 Workspace Size:        5015 Cached Size:        5015


Note that hardlinks are detected as F, because they are regular files that share an inode in the file system under alternative paths.
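
To see why a hardlink looks like a regular file, you can inspect one with Python's standard library. This is a small sketch that assumes a POSIX file system supporting hardlinks; the file names are made up for the example and have nothing to do with Xvc's cache layout.

```python
import os
import tempfile


def hardlink_demo():
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cached.bin")        # stands in for the cached file
        workspace = os.path.join(root, "workspace.bin")  # stands in for the workspace path
        with open(cached, "wb") as f:
            f.write(b"x" * 1001)
        os.link(cached, workspace)  # a second path for the same inode
        a, b = os.stat(cached), os.stat(workspace)
        return {
            "same_inode": (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino),
            "link_count": b.st_nlink,
            "is_symlink": os.path.islink(workspace),
            "is_regular_file": os.path.isfile(workspace),
        }
```

Both paths report the same inode and nothing marks one of them as "the link", so a tool that looks at the path type alone sees an ordinary file.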

Symbolic links are typically reported as SS in the first letters. It means they are symbolic links on the file system and their cache type is also symbolic links.

$ xvc file track dir-0003 --cache-type symlink

$ xvc file list dir-0003
SS        [..] [..] e23e79a0          dir-0003/file-0005.bin
SS        [..] [..] 3640687a          dir-0003/file-0004.bin
SS        [..] [..] 2856fe70          dir-0003/file-0003.bin
SS        [..] [..] 8c079454          dir-0003/file-0002.bin
SS        [..] [..] 189fa49f          dir-0003/file-0001.bin
Total #: 5 Workspace Size:         900 Cached Size:        5015


R represents reflinks, although not all file systems support them.

Globs

You may use globs to list files.

$ xvc file list 'dir-*/*-0001.bin' 
FX        1001 [..]          189fa49f dir-0005/file-0001.bin
FX        1001 [..]          189fa49f dir-0004/file-0001.bin
SS        [..] [..] 189fa49f          dir-0003/file-0001.bin
FH        1001 [..] 189fa49f 189fa49f dir-0002/file-0001.bin
FC        1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 5 Workspace Size:        4184 Cached Size:        1001


Note that all these files are identical. They are cached once, and only one of them takes space in the cache.
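
The deduplication noted above follows from addressing content by its digest. Here is a toy sketch of the idea, assuming SHA-256 and an in-memory dictionary in place of Xvc's real digest algorithm and on-disk cache; the class and method names are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed cache: one blob per distinct digest."""

    def __init__(self):
        self.blobs = {}    # digest -> content
        self.records = {}  # workspace path -> digest

    def track(self, path, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # stored only the first time
        self.records[path] = digest
        return digest

    def cached_size(self):
        return sum(len(blob) for blob in self.blobs.values())


def dedup_demo():
    store = ContentStore()
    content = b"\x17" * 1001  # identical bytes under five different paths
    for i in range(1, 6):
        store.track(f"dir-{i:04d}/file-0001.bin", content)
    return len(store.records), len(store.blobs), store.cached_size()
```

Five paths are recorded, but only one blob of 1001 bytes is stored.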

You can also use multiple targets as globs.

$ xvc file list '*/*-0001.bin' '*/*-0002.bin' 
FX        1002 [..]          8c079454 dir-0005/file-0002.bin
FX        1001 [..]          189fa49f dir-0005/file-0001.bin
FX        1002 [..]          8c079454 dir-0004/file-0002.bin
FX        1001 [..]          189fa49f dir-0004/file-0001.bin
SS        [..] [..] 8c079454          dir-0003/file-0002.bin
SS        [..] [..] 189fa49f          dir-0003/file-0001.bin
FH        1002 [..] 8c079454 8c079454 dir-0002/file-0002.bin
FH        1001 [..] 189fa49f 189fa49f dir-0002/file-0001.bin
FC        1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC        1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 10 Workspace Size:        8372 Cached Size:        2003


Sorting

You may sort xvc file list output by name, by modification time and by file size.

Use --sort option to specify the sort criteria.

$ xvc file list --sort name-desc dir-0001/
FC        1005 [..] e23e79a0 e23e79a0 dir-0001/file-0005.bin
FC        1004 [..] 3640687a 3640687a dir-0001/file-0004.bin
FC        1003 [..] 2856fe70 2856fe70 dir-0001/file-0003.bin
FC        1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC        1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 5 Workspace Size:        5015 Cached Size:        5015


$ xvc file list --sort name-asc dir-0001/
FC        1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
FC        1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC        1003 [..] 2856fe70 2856fe70 dir-0001/file-0003.bin
FC        1004 [..] 3640687a 3640687a dir-0001/file-0004.bin
FC        1005 [..] e23e79a0 e23e79a0 dir-0001/file-0005.bin
Total #: 5 Workspace Size:        5015 Cached Size:        5015
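
The sort criteria combine a field and a direction. A minimal sketch of how such a criterion could be applied to listing rows follows; the row shape (dictionaries with name, size and ts keys) is an assumption for the example, not Xvc's internal representation.

```python
def sort_rows(rows, criterion="none"):
    """Sort rows by a criterion such as 'name-asc' or 'size-desc'."""
    if criterion == "none":
        return list(rows)
    field, _, direction = criterion.rpartition("-")
    if field not in ("name", "size", "ts") or direction not in ("asc", "desc"):
        raise ValueError(f"unknown sort criterion: {criterion}")
    return sorted(rows, key=lambda row: row[field], reverse=(direction == "desc"))


# Hypothetical rows: sizes follow the listing above, timestamps are made up.
ROWS = [
    {"name": "dir-0001/file-0002.bin", "size": 1002, "ts": 20},
    {"name": "dir-0001/file-0001.bin", "size": 1001, "ts": 30},
    {"name": "dir-0001/file-0003.bin", "size": 1003, "ts": 10},
]
```

For example, sort_rows(ROWS, "size-desc") puts file-0003.bin first, while sort_rows(ROWS, "ts-asc") orders the same rows by their timestamps.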


Column Format

You can specify the columns that the command prints.

For example, if you only want to see the file names, use {{name}} as the format string.

The following command sorts the files by their workspace size and prints their size and name.

$ xvc file list --format '{{asz}} {{name}}' --sort size-desc dir-0001/
       1005 dir-0001/file-0005.bin
       1004 dir-0001/file-0004.bin
       1003 dir-0001/file-0003.bin
       1002 dir-0001/file-0002.bin
       1001 dir-0001/file-0001.bin
Total #: 5 Workspace Size:        5015 Cached Size:        5015


If you want to compare the recorded (cached) hashes with the actual hashes in the workspace, you can use the {{acd8}} {{rcd8}} {{name}} format string.

$ xvc file list --format '{{acd8}} {{rcd8}} {{name}}' --sort ts-asc dir-0001
189fa49f 189fa49f dir-0001/file-0001.bin
8c079454 8c079454 dir-0001/file-0002.bin
2856fe70 2856fe70 dir-0001/file-0003.bin
3640687a 3640687a dir-0001/file-0004.bin
e23e79a0 e23e79a0 dir-0001/file-0005.bin
Total #: 5 Workspace Size:        5015 Cached Size:        5015


If `{{acd8}}` or `{{acd64}}` is not present in the format string, Xvc doesn't calculate these hashes. If you have a large number of files and the default format (which includes actual content hashes) runs slowly, you may customize it to exclude these columns.
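
The idea of skipping the digest when no column asks for it can be sketched as follows. This is an illustration under assumptions, not Xvc's code: it supports only the name, asz, acd8 and acd64 keys, uses SHA-256 as the digest, and the STATS counter exists only to make the skipped work visible.

```python
import hashlib
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
STATS = {"digests": 0}  # counts how often content is actually hashed


def actual_digest(content):
    STATS["digests"] += 1
    return hashlib.sha256(content).hexdigest()


def render_row(fmt, name, content):
    wanted = set(PLACEHOLDER.findall(fmt))
    values = {"name": name, "asz": str(len(content))}
    # Hash the content only if a digest column is requested.
    if wanted & {"acd8", "acd64"}:
        digest = actual_digest(content)
        values["acd64"] = digest
        values["acd8"] = digest[:8]
    # Keys this sketch doesn't know are left as written.
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), fmt)
```

Rendering '{{asz}} {{name}}' leaves STATS["digests"] untouched, while a format containing '{{acd8}}' increases it by one per row.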

If you want to get a quick glimpse of what needs to be carried in or rechecked, you can use the cache status {{cst}} column.

$ xvc-test-helper generate-random-file --size 100 --filename dir-0001/a-new-file.bin

$ xvc file list --format '{{cst}} {{name}}' dir-0001/
= dir-0001/file-0005.bin
= dir-0001/file-0004.bin
= dir-0001/file-0003.bin
= dir-0001/file-0002.bin
= dir-0001/file-0001.bin
X dir-0001/a-new-file.bin
Total #: 6 Workspace Size:        5115 Cached Size:        5015


The cache status column shows = for files that are unchanged with respect to the cache, X for untracked files, > for files that have a newer version in the cache, and < for files that have a newer version in the workspace. The comparison is made between the recorded timestamp and the actual timestamp with an accuracy of 1 second.

xvc file hash

Synopsis

$ xvc file hash --help
Get digest hash of files with the supported algorithms

Usage: xvc file hash [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...  Files to process

Options:
  -a, --algorithm <ALGORITHM>
          Algorithm to calculate the hash. One of blake3, blake2, sha2, sha3. All algorithm variants produce 32-bytes digest
      --text-or-binary <TEXT_OR_BINARY>
          For "text" remove line endings before calculating the digest. Keep line endings if "binary". "auto" (default) detects the type by checking 0s in the first 8Kbytes, similar to Git [default: auto]
  -h, --help
          Print help
  -V, --version
          Print version
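
The --text-or-binary behavior described in the help text can be sketched like this. The sketch assumes that "line endings" means CRLF and LF and uses SHA-256 (a 32-byte SHA-2 digest) for illustration; the function names are hypothetical and the real command may differ in details.

```python
import hashlib

SNIFF_BYTES = 8192  # the "first 8Kbytes" mentioned in the help text


def looks_binary(data):
    # A NUL byte near the start of the file marks it as binary.
    return b"\x00" in data[:SNIFF_BYTES]


def file_digest(data, text_or_binary="auto"):
    mode = text_or_binary
    if mode == "auto":
        mode = "binary" if looks_binary(data) else "text"
    if mode == "text":
        # Assumption: line endings are CRLF and LF; both are dropped.
        data = data.replace(b"\r\n", b"").replace(b"\n", b"")
    elif mode != "binary":
        raise ValueError(f"unknown mode: {text_or_binary}")
    return hashlib.sha256(data).hexdigest()
```

With this rule, a text file gets the same digest whether it was saved with Unix or Windows line endings, while the binary digest of the two versions differs.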

xvc file checkout

This is an alias of xvc file recheck. Please see that page for examples.

Synopsis

$ xvc file checkout --help
Get files from cache by copy or *link

Usage: xvc file recheck [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to recheck

Options:
      --cache-type <CACHE_TYPE>
          How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --no-parallel
          Don't use parallelism

      --force
          Force even if target exists

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

xvc file recheck

Synopsis

$ xvc file recheck --help
Get files from cache by copy or *link

Usage: xvc file recheck [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to recheck

Options:
      --cache-type <CACHE_TYPE>
          How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --no-parallel
          Don't use parallelism

      --force
          Force even if target exists

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

This command has an alias xvc file checkout if you feel more at home with Git terminology.

Examples

Rechecking is analogous to git checkout. It copies or links a cached file to the workspace.

Start by tracking a file.

$ git init
...
$ xvc init

$ xvc file track data.txt

$ ls -l
total[..]
-rw-rw-rw- [..] data.txt

Once you have added the file to the cache, you can delete the workspace copy.

$ rm data.txt
$ ls -l
total[..]

Then, recheck the file. By default, it makes a copy of the file.

$ xvc file recheck data.txt

$ ls -l
total [..]
-rw-rw-rw- [..] data.txt

Xvc updates the cache type if the file is not changed.

$ xvc file recheck data.txt --as symlink

$ ls -l data.txt
l[..] data.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
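
The link target in the listing above suggests how the cache path is derived from the content digest: the 64-digit digest is split into directories of 3, 3 and 58 digits. The sketch below rebuilds the path from that observation only. That b3 names the digest algorithm and that 0 is a fixed file stem are guesses from this single example, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def cache_path(digest_hex, extension, algorithm_dir="b3"):
    # Layout inferred from the listing: <algorithm>/<3 digits>/<3 digits>/<rest>/0.<ext>
    parts = (digest_hex[:3], digest_hex[3:6], digest_hex[6:])
    return PurePosixPath(".xvc", algorithm_dir, *parts, f"0.{extension}")


# The digest of data.txt, as shown in the listing above.
DATA_TXT_DIGEST = "c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496"
```

Calling cache_path(DATA_TXT_DIGEST, "txt") reproduces the path below [CWD] in the symlink target.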

Symlinks and hardlinks are read-only. You can delete the symlink and replace it with an updated copy, as perl -i does below.

$ perl -i -pe 's/a/ee/g' data.txt

$ xvc file recheck data.txt --as copy
[ERROR] data.txt has changed on disk. Either carry in, force, or delete the target to recheck. 

$ rm data.txt

$ xvc -vv file recheck data.txt --as hardlink
[INFO] [HARDLINK] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt -> [CWD]/data.txt

$ ls -l
total[..]
-r--r--r-- [..] data.txt

Note that, as files in the cache are kept read-only, hardlinks and symlinks are also read-only. Files rechecked as copies are made read-write explicitly.

Xvc supports reflinks, but the underlying file system must also support them. Otherwise, Xvc falls back to a copy.

$ rm -f data.txt
$ xvc file recheck data.txt --as reflink

The above command will create a read-only link on macOS APFS and a copy on ext4 or NTFS file systems.

xvc file carry-in

Copies the file changes to cache.

Synopsis

$ xvc file carry-in --help
Carry (commit) changed files to cache

Usage: xvc file carry-in [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Files/directories to add

Options:
      --text-or-binary <TEXT_OR_BINARY>
          Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)

      --force
          Carry in targets even their content digests are not changed.
          
          This removes the file in cache and re-adds it.

      --no-parallel
          Don't use parallelism

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

The carry-in command works only in Xvc repositories.

$ git init
...
$ xvc init

We first track a file.

$ xvc file track data.txt

$ xvc file list data.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


We update the file with a command.

$ perl -i -pe 's/a/ee/g' data.txt

$ cat data.txt
Oh, deetee, my, deetee

$ xvc file list data.txt
FC          23 [..] c85f3e81 e37c686a data.txt
Total #: 1 Workspace Size:          23 Cached Size:          19


Note that the size of the file has increased, as we replace each a with an ee.

$ xvc file carry-in data.txt

$ xvc file list data.txt
FC          23 [..] e37c686a e37c686a data.txt
Total #: 1 Workspace Size:          23 Cached Size:          19


xvc file send

Synopsis

$ xvc file send --help
Send (push, upload) files to external storages

Usage: xvc file send [OPTIONS] --remote <REMOTE> [TARGETS]...

Arguments:
  [TARGETS]...  Targets to send/push/upload to storage

Options:
  -r, --remote <REMOTE>  Storage name or guid to send the files
      --force            Force even if the files are already present in the storage
  -h, --help             Print help

xvc file bring

Synopsis

$ xvc file bring --help
Bring (download, pull, fetch) files from external storages

Usage: xvc file bring [OPTIONS] --storage <STORAGE> [TARGETS]...

Arguments:
  [TARGETS]...
          Targets to bring from the storage

Options:
  -s, --storage <STORAGE>
          Storage name or guid to send the files

      --force
          Force even if the files are already present in the workspace

      --no-recheck
          Don't recheck (checkout) after bringing the file to cache.
          
          This makes the command similar to `git fetch` in Git. It just updates the cache, and doesn't copy/link the file to workspace.

      --recheck-as <RECHECK_AS>
          Recheck (checkout) the file in one of the four alternative ways. (See `xvc file recheck`) and [CacheType]

  -h, --help
          Print help (see a summary with '-h')

This is yet to be implemented. Please see https://github.com/iesahin/xvc/issues/177 for progress.

xvc file copy

Synopsis

$ xvc file copy --help
Copy from source to another location in the workspace

Usage: xvc file copy [OPTIONS] <SOURCE> <DESTINATION>

Arguments:
  <SOURCE>
          Source file, glob or directory within the workspace.
          
          If the source ends with a slash, it's considered a directory and all files in that directory are copied.
          
          If the number of source files is more than one, the destination must be a directory.

  <DESTINATION>
          Location we copy file(s) to within the workspace.
          
          If the target ends with a slash, it's considered a directory and created if it doesn't exist.
          
          If the number of source files is more than one, the destination must be a directory.

Options:
      --cache-type <CACHE_TYPE>
          How the targets should be rechecked: One of copy, symlink, hardlink, reflink.
          
          Note: Reflink uses copy if the underlying file system doesn't support it.

      --force
          Force even if target exists

      --no-recheck
          Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
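
The rules for SOURCE and DESTINATION in the help text can be sketched as a small planning function. This is a simplified illustration: it treats only a trailing slash as marking a directory (it doesn't check the file system), takes an already expanded list of source paths, and the function name is hypothetical.

```python
import posixpath


def plan_copy(sources, destination):
    """Map each source path to the path it would be copied to."""
    if not sources:
        raise ValueError("no source files")
    dest_is_dir = destination.endswith("/")
    if len(sources) > 1 and not dest_is_dir:
        # More than one source file requires a directory destination.
        raise ValueError("multiple sources need a directory destination")
    if dest_is_dir:
        return {s: posixpath.join(destination, posixpath.basename(s)) for s in sources}
    return {sources[0]: destination}
```

For example, plan_copy(["data.txt", "data2.txt"], "another-set/") maps both files into another-set/, while the same sources with data9.txt as the destination raise an error.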

Examples

This command is used to copy a set of files to another location in the workspace.

By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.

xvc file copy works only with the tracked files.

$ git init
...
$ xvc init

$ xvc file track data.txt

$ ls -l
total[..]
-rw-rw-rw-  [..] data.txt

Once you add the file to the cache, you can copy the file to another location.

$ xvc file copy data.txt data2.txt

$ ls
data.txt
data2.txt

Note that multiple copies of the same content don't add to the cache size.

$ xvc file list data.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size:          19 Cached Size:          19


$ xvc file list 'data*'
FC          19 [..] c85f3e81 c85f3e81 data2.txt
FC          19 [..] c85f3e81 c85f3e81 data.txt
Total #: 2 Workspace Size:          38 Cached Size:          19


Xvc can change the destination file's recheck method.

$ xvc file copy data.txt data3.txt --as symlink

$ ls -l
total[..]
-rw-rw-rw-  1 [..] data.txt
-rw-rw-rw-  1 [..] data2.txt
lrwxr-xr-x  1 [..] data3.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt

You can create views of your data by copying it to another location.

$ xvc file copy 'd*' another-set/ --as hardlink

$ xvc file list another-set/
FH          19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data.txt
Total #: 3 Workspace Size:          57 Cached Size:          19


If the targets you specify have changed, the copy operation is cancelled. Either recheck the old versions or carry in the new versions.

$ perl -i -pe 's/a/ee/g' data.txt

$ xvc file copy data.txt data5.txt

You can copy files without them being in the workspace if they are in the cache.

$ rm -f data.txt

$ xvc file copy data.txt data6.txt

$ ls -l data6.txt
-rw-rw-rw-  [..] data6.txt

You can also skip rechecking. In this case, Xvc won't create any copies in the workspace, and the files don't need to be available in the cache. They will still be listed by the xvc file list command.

$ xvc file copy data.txt data7.txt --no-recheck

$ ls
another-set
data2.txt
data3.txt
data5.txt
data6.txt

$ xvc file list
XC             [..] c85f3e81          data7.txt
FC          19 [..] c85f3e81 c85f3e81 data6.txt
FC          19 [..] c85f3e81 c85f3e81 data5.txt
SS        [..] [..] c85f3e81          data3.txt
FC          19 [..] c85f3e81 c85f3e81 data2.txt
XC             [..] c85f3e81          data.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH          19 [..] c85f3e81 c85f3e81 another-set/data.txt
DX         160 [..]                   another-set
FX         130 [..]          ac46bf74 .xvcignore
FX         619 [..]          [..] .gitignore
Total #: 12 Workspace Size:        1203 Cached Size:          19


Later, you can recheck them to the workspace.

$ xvc file recheck data7.txt

$ ls -l data7.txt
-rw-rw-rw-  [..] data7.txt

Data-Model Pipelines

Synopsis

$ xvc pipeline --help
Pipeline management commands

Usage: xvc pipeline [OPTIONS] <COMMAND>

Commands:
  new     Create a new pipeline
  update  Rename, change dir or set a pipeline as default
  delete  Delete a pipeline
  run     Run a pipeline
  list    List all pipelines
  dag     Generate a dot or mermaid diagram for the pipeline
  export  Export the pipeline to a YAML or JSON file to edit
  import  Import the pipeline from a file
  step    Step creation, dependency, output commands
  help    Print this message or the help of the given subcommand(s)

Options:
  -n, --name <NAME>  Name of the pipeline this command applies
  -h, --help         Print help

xvc pipeline new

Synopsis

$ xvc pipeline new --help
Create a new pipeline

Usage: xvc pipeline new [OPTIONS] --name <NAME>

Options:
  -n, --name <NAME>        Name of the pipeline this command applies to
  -w, --workdir <WORKDIR>  Default working directory
      --set-default        Set this pipeline as default
  -h, --help               Print help

xvc pipeline list

Synopsis

$ xvc pipeline list --help
List all pipelines

Usage: xvc pipeline list

Options:
  -h, --help  Print help

xvc pipeline step

Synopsis

$ xvc pipeline step --help
Step creation, dependency, output commands

Usage: xvc pipeline step <COMMAND>

Commands:
  new         Add a new step
  update      Update step options
  dependency  Add a dependency to a step
  output      Add an output to a step
  show        Print step configuration
  help        Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

xvc pipeline step new

Purpose

Create a new step in the pipeline.

Synopsis

$ xvc pipeline step new --help
Add a new step

Usage: xvc pipeline step new [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>  Name of the new step
  -c, --command <COMMAND>      Step command to run
      --when <WHEN>            When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
  -h, --help                   Print help

Examples

Caveats

xvc pipeline step dependency

Purpose

Define a dependency for an existing step in the pipeline.

Synopsis

$ xvc pipeline step dependency --help
Add a dependency to a step

Usage: xvc pipeline step dependency [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>    Name of the step to add the dependency to
      --file <FILES>             Add a file dependency to the step. Can be used multiple times
      --step <STEPS>             Add a step dependency to a step. Can be used multiple times. Steps are referred with their names
      --pipeline <PIPELINES>     Add a pipeline dependency to a step. Can be used multiple times. Pipelines are referred with their names
      --directory <DIRECTORIES>  Add a directory dependency to the step. Can be used multiple times
      --glob <GLOBS>             Add a glob dependency to the step. Can be used multiple times
      --param <PARAMS>           Add a parameter dependency to the step in the form filename.yaml::model.units . Can be used multiple times
      --regex <REGEXPS>          Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times
      --line <LINES>             Add a line dependency in the form filename.txt::123-234
  -h, --help                     Print help

Examples
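
A minimal sketch of how a step and its dependencies might be declared, using only options listed in the synopses above. The step name, command, and file names (preprocess, preprocess.py, data/raw.csv, params.yaml) are hypothetical. The commands are wrapped in a function so that they can be reviewed first and then called from inside an Xvc repository.

```shell
# Hypothetical helper: declares a step named "preprocess" and attaches
# dependencies to it. All step, file, and parameter names are assumptions.
define_preprocess_step() {
    # Create the step together with the command it runs.
    xvc pipeline step new --step-name preprocess --command 'python3 preprocess.py'

    # Add a file dependency to the step.
    xvc pipeline step dependency --step-name preprocess --file data/raw.csv

    # Add a parameter dependency in the filename.yaml::key form.
    xvc pipeline step dependency --step-name preprocess --param 'params.yaml::model.units'
}
```

Each dependency is added with a separate invocation here to keep the lines short. The help text states that these options can be used multiple times.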

Caveats

xvc pipeline step output

Purpose

Define an output (file, metric, or image) for an existing step in the pipeline.

Synopsis

$ xvc pipeline step output --help
Add an output to a step

Usage: xvc pipeline step output [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>    Name of the step to add the output to
      --output-file <FILES>      Add a file output to the step. Can be used multiple times
      --output-metric <METRICS>  Add a metric output to the step. Can be used multiple times
      --output-image <IMAGES>    Add an image output to the step. Can be used multiple times
  -h, --help                     Print help

Examples

Caveats

xvc pipeline step show

Purpose

Print the configuration of a step in the pipeline.

Synopsis

$ xvc pipeline step show --help
Print step configuration

Usage: xvc pipeline step show --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>  Name of the step to show
  -h, --help                   Print help

Examples

Caveats

xvc pipeline step update

Purpose

Update the command or the running condition of an existing step.

Synopsis

$ xvc pipeline step update --help
Update step options

Usage: xvc pipeline step update [OPTIONS] --step-name <STEP_NAME>

Options:
  -s, --step-name <STEP_NAME>  Name of the step to update. The step should already be defined
  -c, --command <COMMAND>      Step command to run
      --when <WHEN>            When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
  -h, --help                   Print help

Examples

Caveats

xvc pipeline run

Synopsis

$ xvc pipeline run --help
Run a pipeline

Usage: xvc pipeline run [OPTIONS]

Options:
  -n, --name <NAME>  Name of the pipeline to run
  -h, --help         Print help

xvc pipeline delete

Synopsis

$ xvc pipeline delete --help
Delete a pipeline

Usage: xvc pipeline delete --name <NAME>

Options:
  -n, --name <NAME>  Name or GUID of the pipeline to be deleted
  -h, --help         Print help

xvc pipeline export

Synopsis

$ xvc pipeline export --help
Export the pipeline to a YAML or JSON file to edit

Usage: xvc pipeline export [OPTIONS]

Options:
  -n, --name <NAME>      Name of the pipeline to export
      --file <FILE>      File to write the pipeline. Writes to stdout if not set
      --format <FORMAT>  Output format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
  -h, --help             Print help

xvc pipeline import

Synopsis

$ xvc pipeline import --help
Import the pipeline from a file

Usage: xvc pipeline import [OPTIONS]

Options:
  -n, --name <NAME>      Name of the pipeline to import. If not set, the name from the file is used
      --file <FILE>      File to read the pipeline. Use stdin if not specified
      --format <FORMAT>  Input format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
      --overwrite        Overwrite the pipeline even if the name already exists
  -h, --help             Print help

xvc pipeline update

Synopsis

$ xvc pipeline update --help
Rename, change dir or set a pipeline as default

Usage: xvc pipeline update [OPTIONS]

Options:
  -n, --name <NAME>        Name of the pipeline this command applies to
      --rename <RENAME>    Rename the pipeline to
      --workdir <WORKDIR>  Set the working directory
      --set-default        set this pipeline default
  -h, --help               Print help

xvc pipeline dag

Synopsis

$ xvc pipeline dag --help
Generate a dot or mermaid diagram for the pipeline

Usage: xvc pipeline dag [OPTIONS]

Options:
  -n, --name <NAME>      Name of the pipeline to generate the diagram
      --file <FILE>      Output file. Writes to stdout if not set
      --format <FORMAT>  Format for graph. Either dot or mermaid [default: dot]
  -h, --help             Print help

Storage management commands (xvc storage)

Purpose

Xvc can keep tracked content in storages, which can be on the local file system or in the cloud. The xvc storage commands configure, list, and remove these storages.

Synopsis

$ xvc storage --help
Storage (cloud) management commands

Usage: xvc storage <COMMAND>

Commands:
  list    List all configured storages
  remove  Remove a storage configuration
  new     Configure a new storage
  help    Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

xvc storage list

Purpose

List all configured storages with their names and GUIDs.

Synopsis

$ xvc storage list --help
List all configured storages

Usage: xvc storage list

Options:
  -h, --help  Print help

Examples

List all storages in the repository:

$ xvc storage list 

Caveats

This command reads only the local configuration and doesn't try to connect to the storages. A storage that appears in the list is not guaranteed to be reachable for push or pull operations.

xvc storage remove

Purpose

Remove unused or inaccessible storages from the configuration.

Synopsis

$ xvc storage remove --help
Remove a storage configuration.

This doesn't delete any files in the storage.

Usage: xvc storage remove --name <NAME>

Options:
      --name <NAME>
          Name of the storage to be deleted

  -h, --help
          Print help (see a summary with '-h')

Caveats

xvc storage new

Synopsis

$ xvc storage new --help 
Configure a new storage

Usage: xvc storage new <COMMAND>

Commands:
  local          Add a new local storage
  generic        Add a new generic storage
  rsync          Add a new rsync storages
  s3             Add a new S3 storage
  minio          Add a new Minio storage
  digital-ocean  Add a new Digital Ocean storage
  r2             Add a new R2 storage
  gcs            Add a new Google Cloud Storage storage
  wasabi         Add a new Wasabi storage
  help           Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

xvc storage new local

Purpose

Create a new storage reachable from the local filesystem. It lets you keep tracked file contents in a different directory for backup or sharing purposes.

Synopsis

$ xvc storage new local --help
Add a new local storage

A local storage is a directory accessible from the local file system. Xvc will use common file operations for this directory without accessing the network.

Usage: xvc storage new local --path <PATH> --name <NAME>

Options:
      --path <PATH>
          Directory (outside the repository) to be set as a storage

  -n, --name <NAME>
          Name of the storage.
          
          Recommended to keep this name unique to refer easily.

  -h, --help
          Print help (see a summary with '-h')

Examples

Create a new Xvc backup storage on a path

$ xvc storage new local --name backup --path /media/bigdisk/backups/my-project-xvc

Caveats

--name NAME is not checked for uniqueness, but you should use unique storage names so that you can refer to them later.

--path PATH should be accessible for writing and shouldn't already exist.

Technical Details

The command creates PATH and a new file under it called .xvc-guid. The file contains the unique identifier for this storage. The same identifier is also recorded in the project.

A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}. {{REPO_ID}} is the unique identifier for the repository created during xvc init. Hence if you use a common storage for different Xvc projects, their files are kept under different directories. There is no inter-project deduplication.
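
The mapping described above can be sketched as a small shell function. It illustrates the path layout only and is not Xvc's implementation. The storage path, repository ID, and cache path below are made-up values.

```shell
# Sketch: compute where a cache file lands in a local storage.
# $1: storage PATH
# $2: REPO_ID
# $3: cache path relative to .xvc/ (that is, HASH_PREFIX/CACHE_PATH)
storage_path_for() {
    printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Hypothetical values, for illustration only.
storage_path_for /media/bigdisk/backups/my-project-xvc my-repo-id hash-prefix/cache-path
```

Two projects with different REPO_ID values map the same cache path to different directories, which is why nothing is deduplicated across projects.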

xvc storage new generic

Purpose

Create a new storage that uses shell commands to send and retrieve cache files. It lets you keep tracked files in any service that can be driven from the command line.

Synopsis

$ xvc storage new generic --help
Add a new generic storage.

⚠️ Please note that this is an advanced method to configure storages. You may damage your repository and local and remote files with incorrect configurations.

Please see https://docs.xvc.dev/ref/xvc-storage-new-generic.html for examples and make necessary backups before continuing.

Usage: xvc storage new generic [OPTIONS] --name <NAME> --init <INIT_COMMAND> --list <LIST_COMMAND> --download <DOWNLOAD_COMMAND> --upload <UPLOAD_COMMAND> --delete <DELETE_COMMAND>

Options:
  -n, --name <NAME>
          Name of the storage.
          
          Recommended to keep this name unique to refer easily.

  -i, --init <INIT_COMMAND>
          Command to initialize the storage. This command is run once after defining the storage.
          
          You can use {URL} and {STORAGE_DIR}  as shortcuts.

  -l, --list <LIST_COMMAND>
          Command to list the files in storage
          
          You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.

  -d, --download <DOWNLOAD_COMMAND>
          Command to download a file from storage.
          
          You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.

  -u, --upload <UPLOAD_COMMAND>
          Command to upload a file to storage.
          
          You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.

  -D, --delete <DELETE_COMMAND>
          The delete command to remove a file from storage You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options

  -M, --processes <MAX_PROCESSES>
          Number of maximum processes to run simultaneously
          
          [default: 1]

      --url <URL>
          You can set a string to replace {URL} placeholder in commands

      --storage-dir <STORAGE_DIR>
          You can set a string to replace {STORAGE_DIR} placeholder in commands

  -h, --help
          Print help (see a summary with '-h')

You can use the following placeholders in your commands. They are replaced with the actual values at runtime, and the commands are run with concrete paths.

  • {URL}: The content of the --url option (default "").
  • {STORAGE_DIR}: The content of the --storage-dir option (default "").
  • {RELATIVE_CACHE_PATH}: The portion of the cache path after .xvc/.
  • {ABSOLUTE_CACHE_PATH}: The absolute local path of the cache element.
  • {RELATIVE_CACHE_DIR}: The portion of the directory that contains the file after .xvc/.
  • {ABSOLUTE_CACHE_DIR}: The absolute local path of the directory that contains the cache element.
  • {XVC_GUID}: The repository GUID, used in storages to tell the elements of different repositories apart.
  • {FULL_STORAGE_PATH}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_PATH}.
  • {FULL_STORAGE_DIR}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_DIR}.
  • {LOCAL_GUID_FILE_PATH}: The local path of the file that contains the GUID of the storage. Used only in the --init option.
  • {STORAGE_GUID_FILE_PATH}: The path in the storage that should contain the GUID of the storage. Used only in the --init option.
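
To make the composition of these placeholders concrete, the following sketch expands a command template with sed. It only illustrates the concatenation rules listed above and is not how Xvc performs the substitution internally. All values (storage directory, GUID, cache path) are hypothetical. The sketch follows the concatenation exactly as written, so the storage directory ends with a slash. Whether Xvc inserts separators itself is not stated here.

```shell
# Sketch: expand a subset of the placeholders in a command template.
# Arguments: template, URL, STORAGE_DIR, XVC_GUID, RELATIVE_CACHE_PATH
# The values must not contain '|', '&' or '\', which are special to sed here.
expand_template() {
    template=$1 url=$2 storage_dir=$3 guid=$4 rel_path=$5
    # FULL_STORAGE_PATH is the concatenation given in the list above.
    full_path="${url}${storage_dir}${guid}/${rel_path}"
    printf '%s\n' "$template" | sed \
        -e "s|{FULL_STORAGE_PATH}|${full_path}|g" \
        -e "s|{URL}|${url}|g" \
        -e "s|{STORAGE_DIR}|${storage_dir}|g" \
        -e "s|{XVC_GUID}|${guid}|g" \
        -e "s|{RELATIVE_CACHE_PATH}|${rel_path}|g"
}

# Hypothetical values: an empty URL and a local storage directory.
expand_template 'rm -f {FULL_STORAGE_PATH}' '' '/tmp/my-xvc-storage/' 'guid-1234' 'cache/path/0'
```

With these values the template expands to rm -f /tmp/my-xvc-storage/guid-1234/cache/path/0.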

Examples

Create a generic storage in the same filesystem

You can create a storage that's using shell commands to send and receive files to another location in the file system.

There are two variables that you can use in the commands. For a storage in the same file system, --url could be blank and --storage-dir could be the location you want to define.

$ xvc storage new generic \
    --url "" \
    --storage-dir $HOME/my-xvc-storage \
    ...

You need to specify the commands for the following operations:

  • init: The command that's used to create the directory that will be used as a storage. It should also copy XVC_STORAGE_GUID_FILENAME (currently .xvc-guid) to that location. This file is used to identify the location as an Xvc storage.

$ xvc storage new generic \
    ... \
    --init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}' \
    ...

Note that if the command doesn't contain {LOCAL_GUID_FILE_PATH} and {STORAGE_GUID_FILE_PATH} variables, it won't be run and Xvc will report an error.

  • list: This operation should list all files under {URL}{STORAGE_DIR}. The list is filtered through a regex that matches the format of the paths. Hence, even if the command lists all files in the storage, Xvc considers only the relevant paths.

Each path should be listed on a separate line.

$ xvc storage new generic \
    ... \
    --list 'ls -1 {URL}{STORAGE_DIR}' \
    ...

  • upload: The command that copies a file from the local cache to the storage. Normally, it uses the {ABSOLUTE_CACHE_PATH} variable. For the local file system, we also need to create a directory before copying.

$ xvc storage new generic \
    ... \
    --upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}' \
    ...

  • download: This command is used to copy from the storage to the local cache. It must create the local cache directory as well.

$ xvc storage new generic \
    ... \
    --download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}' \
    ...

  • delete: This operation is used to delete the file in the storage. It shouldn't touch the local file in any way, otherwise you may lose data.

$ xvc storage new generic \
    ... \
    --delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}' \
    ...

In total, the command you write is the following. It defines all operations of this storage. The --name option is required by the synopsis above, and its value here is only an example.

$ xvc storage new generic \
    --name my-generic-storage \
    --url "" \
    --storage-dir $HOME/my-xvc-storage \
    --init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}' \
    --list 'ls -1 {URL}{STORAGE_DIR}' \
    --upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}' \
    --download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}' \
    --delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
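
Before handing such commands to Xvc, it can help to try their effect by hand. The sketch below performs the same mkdir, cp, and rm sequence as the upload, download, and delete commands above, with the placeholders replaced by hypothetical paths in a temporary directory. It doesn't call Xvc at all.

```shell
# Stand-ins for the placeholders; every path here is hypothetical.
work=$(mktemp -d)
ABSOLUTE_CACHE_DIR="$work/repo/.xvc/cache-dir"
ABSOLUTE_CACHE_PATH="$ABSOLUTE_CACHE_DIR/0"
FULL_STORAGE_DIR="$work/my-xvc-storage/repo-guid/cache-dir"
FULL_STORAGE_PATH="$FULL_STORAGE_DIR/0"

# A file in the local cache to work with.
mkdir -p "$ABSOLUTE_CACHE_DIR"
echo 'tracked content' > "$ABSOLUTE_CACHE_PATH"

# upload: create the storage directory, then copy from the cache.
mkdir -p "$FULL_STORAGE_DIR" && cp "$ABSOLUTE_CACHE_PATH" "$FULL_STORAGE_PATH"

# Simulate a missing local cache, then download the file back.
rm -rf "$ABSOLUTE_CACHE_DIR"
mkdir -p "$ABSOLUTE_CACHE_DIR" && cp "$FULL_STORAGE_PATH" "$ABSOLUTE_CACHE_PATH"
restored=$(cat "$ABSOLUTE_CACHE_PATH")

# delete: remove the copy in the storage only; the local cache file stays.
rm -f "$FULL_STORAGE_PATH" ; rmdir "$FULL_STORAGE_DIR"
```

After the last step the local cache file still exists and the storage directory is gone, which is the behavior the delete operation is expected to have.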

Create a storage using rclone

Create a storage using rsync

Rsync is available on all popular platforms for copying file contents. Xvc can use it to maintain a storage if you already have a working rsync setup.

We need to define operations for init, upload, download, list and delete with rsync or ssh. Some of the commands need ssh to perform operations, like creating a directory. We'll use placeholders for paths.

Because the rsync URL format differs slightly from the SSH one, we will define the commands verbosely.

Suppose you want to use your account at user@example.com to store your Xvc files. You want to store the files under /home/user/my-xvc-storage.

We assume you have configured public key authentication for your account. Xvc doesn't receive user input during storage operations, and can't receive your password during runs.

We first define these as our --url and --storage-dir options.

$ xvc storage new generic \
    --url user@example.com \
    --storage-dir '/home/user/my-xvc-storage' \
    ...

The initialization command must create this directory and copy the storage GUID file to its respective location.

$ xvc storage new generic \
    ... \
    --init "ssh {URL} 'mkdir -p {STORAGE_DIR}' ; rsync -av '{LOCAL_GUID_FILE_PATH}' '{URL}:{STORAGE_GUID_FILE_PATH}'"

Note the use of : in the rsync command. As rsync doesn't currently support ssh:// URLs, we use a form of URL that's compatible with both ssh and rsync. It may be possible to use && between the ssh and rsync commands, but we want to copy the GUID file even if the first command fails, so we use ; instead.
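
The difference between the two separators can be checked with any pair of commands. In this sketch, false stands in for a failing ssh call and echo for the rsync call. Neither is part of Xvc.

```shell
# Make sure errexit is off so that the failing command doesn't end the script.
set +e

# With ';' the second command runs even though the first one fails.
with_semicolon=$(false ; echo 'guid file copied')

# With '&&' the second command is skipped when the first one fails.
with_and=$(false && echo 'guid file copied')

echo "';'  -> '${with_semicolon}'"
echo "'&&' -> '${with_and}'"
```

The first variable holds the echoed text and the second one is empty.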

Caveats

Technical Details

The paths in list commands are filtered through a regex. They are matched against the {REPO_GUID}/{RELATIVE_CACHE_DIR}/0 pattern and only the {RELATIVE_CACHE_DIR} portion is reported. Any line that doesn't conform to this pattern is ignored. You can use any listing command that returns a recursive file list, and only the elements that match the pattern are considered.
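
The filtering can be approximated with sed as follows. This is a sketch under assumptions: the regex that Xvc actually uses is not shown here and may be stricter about the shape of the directory part, while this one accepts any non-empty path and any leading prefix. The GUID and paths in the usage lines are hypothetical.

```shell
# Sketch: keep only lines of the form REPO_GUID/RELATIVE_CACHE_DIR/0 and
# print the RELATIVE_CACHE_DIR part. Reads the listing from stdin.
# $1: repository GUID (assumed to contain no regex metacharacters)
filter_listing() {
    repo_guid=$1
    sed -n "s|^.*${repo_guid}/\(.\{1,\}\)/0\$|\1|p"
}

# Hypothetical listing: only the first line matches the pattern.
printf '%s\n' \
    'repo-guid/aa/bbb/ccc/0' \
    'repo-guid/notes.txt' \
    'unrelated/file.txt' | filter_listing repo-guid
```

Only aa/bbb/ccc is printed. The other two lines are dropped without an error, which matches the behavior described above.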

xvc storage new s3

Purpose

Configure an S3 (or a compatible) service as an Xvc storage.

Synopsis

$ xvc storage new s3 --help
Add a new S3 storage

Usage: xvc storage new s3 [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

      --bucket-name <BUCKET_NAME>
          S3 bucket name

      --region <REGION>
          AWS region

  -h, --help
          Print help (see a summary with '-h')

Examples

xvc storage new gcs

Purpose

Configure a Google Cloud Storage service as an Xvc storage.

Synopsis

$ xvc storage new gcs --help
Add a new Google Cloud Storage storage

Usage: xvc storage new gcs [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --bucket-name <BUCKET_NAME>
          Bucket name

      --region <REGION>
          Region of the server, e.g., europe-west3

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

xvc storage new minio

Purpose

Create a new Xvc storage on a MinIO instance. It lets you store tracked file contents on a MinIO server.

Synopsis

$ xvc storage new minio --help
Add a new Minio storage

Usage: xvc storage new minio [OPTIONS] --name <NAME> --endpoint <ENDPOINT> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --endpoint <ENDPOINT>
          Minio server url in the form https://myserver.example.com:9090

      --bucket-name <BUCKET_NAME>
          Bucket name

      --region <REGION>
          Region of the server

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Credentials

Xvc doesn't store any credentials. Xvc gets server credentials from two environment variables: XVC_STORAGE_ACCESS_KEY_ID and XVC_STORAGE_SECRET_KEY. You must supply the credentials in these two environment variables before running any command that connects to the storage.

These environment variables can hold the user name and password for the MinIO server. If you have created service accounts, you can use their access and secret keys instead.

$ export XVC_STORAGE_ACCESS_KEY_ID=myname
$ export XVC_STORAGE_SECRET_KEY=mypassword
$ xvc storage new minio --name minio-storage --endpoint 'http://example.com:9001' --bucket-name xvc-bucket --region us-east-1 --storage-prefix my-project
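
Since Xvc reads the credentials only from the environment, a script can check them before running any command that connects to the storage. The function below is a hypothetical helper, not part of Xvc.

```shell
# Sketch: fail early when the credentials are missing from the environment.
# Returns 0 when both variables are set and non-empty, 1 otherwise.
check_xvc_storage_credentials() {
    if [ -z "${XVC_STORAGE_ACCESS_KEY_ID:-}" ] || [ -z "${XVC_STORAGE_SECRET_KEY:-}" ]; then
        echo 'Set XVC_STORAGE_ACCESS_KEY_ID and XVC_STORAGE_SECRET_KEY first.' >&2
        return 1
    fi
}
```

A script would call check_xvc_storage_credentials first and run the storage command only when it returns successfully.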

Examples

You can create a new Minio storage by supplying the credentials and required parameters.

$ export XVC_STORAGE_ACCESS_KEY_ID=myname
$ export XVC_STORAGE_SECRET_KEY=mypassword
$ xvc storage new minio --name minio-storage --endpoint 'http://example.com:9001' --bucket-name xvc-bucket --region us-east-1 --storage-prefix my-project

After defining the storage, you can push, fetch, and pull files with xvc file push and xvc file pull commands.

Caveats

--name NAME is not verified to be unique, but you should use unique storage names so that you can refer to them later. You can also use the storage GUIDs listed by xvc storage list to refer to storages.

You must have a valid connection to the server.

Xvc uses Minio API port (9001, by default) to connect to the server. Ensure that it's accessible.

Because of the underlying library, Xvc tries to connect to http://xvc-bucket.example.com:9001 if you give http://example.com:9001 as the endpoint and xvc-bucket as the bucket name. Keep this in mind when your server runs at a fixed URL. If your MinIO server is at http://minio.example.com:9001, you may want to supply http://example.com:9001 as the endpoint and minio as the bucket name to form the correct URL. This behavior may change in the future.
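
The URL formation described in this caveat can be sketched as a shell function. It only reproduces the rule stated above (bucket name prefixed to the host of the endpoint) and doesn't query Xvc or the underlying library. The function name is hypothetical.

```shell
# Sketch: print the URL that is contacted for a given endpoint and bucket,
# following the rule described above.
# $1: endpoint with scheme, e.g. http://example.com:9001
# $2: bucket name
effective_minio_url() {
    endpoint=$1 bucket=$2
    scheme=${endpoint%%://*}        # part before "://"
    host_and_port=${endpoint#*://}  # part after "://"
    printf '%s://%s.%s\n' "$scheme" "$bucket" "$host_and_port"
}

effective_minio_url http://example.com:9001 xvc-bucket
effective_minio_url http://example.com:9001 minio
```

The two calls print http://xvc-bucket.example.com:9001 and http://minio.example.com:9001, the two cases mentioned above.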

Technical Details

This command requires Xvc to be compiled with the minio feature, which is on by default. It uses Rust async features via the rust-s3 crate, and may add some bulk to the binary. If you want to compile Xvc without these features, please refer to the How to Compile Xvc document.

The command creates a .xvc-guid file at http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/.xvc-guid. The file contains the unique identifier for this storage. The same identifier is also recorded in the project.

A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}. {{REPO_ID}} is the unique identifier for the repository created during xvc init. Hence if you use a common storage for different Xvc projects, their files are kept under different directories. There is no inter-project deduplication.

xvc storage new r2

Purpose

Configure Cloudflare R2 as an Xvc storage.

Synopsis

$ xvc storage new r2 --help
Add a new R2 storage

Usage: xvc storage new r2 [OPTIONS] --name <NAME> --account-id <ACCOUNT_ID> --bucket-name <BUCKET_NAME>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --account-id <ACCOUNT_ID>
          R2 account ID

      --bucket-name <BUCKET_NAME>
          Bucket name

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

xvc storage new wasabi

Purpose

Configure a Wasabi service as an Xvc storage.

Synopsis

$ xvc storage new wasabi --help
Add a new Wasabi storage

Usage: xvc storage new wasabi [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --bucket-name <BUCKET_NAME>
          Bucket name

      --endpoint <ENDPOINT>
          Endpoint for the server, complete with the region if there is
          
          e.g. for eu-central-1 region, use s3.eu-central-1.wasabisys.com as the endpoint.
          
          [default: s3.wasabisys.com]

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

xvc storage new digital-ocean

Purpose

Configure a Digital Ocean Spaces service as an Xvc storage.

Synopsis

$ xvc storage new digital-ocean --help
Add a new Digital Ocean storage

Usage: xvc storage new digital-ocean [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>

Options:
  -n, --name <NAME>
          Name of the storage
          
          This must be unique among all storages of the project

      --bucket-name <BUCKET_NAME>
          Bucket name

      --region <REGION>
          Region of the server

      --storage-prefix <STORAGE_PREFIX>
          You can set a directory in the bucket with this prefix
          
          [default: ]

  -h, --help
          Print help (see a summary with '-h')

Examples

Utilities

xvc root

Purpose

Show the root directory of the Xvc project, where .xvc/ resides.

Synopsis

$ xvc root --help
Find the root directory of a project

Usage: xvc root [OPTIONS]

Options:
      --absolute  Show absolute path instead of relative
  -h, --help      Print help

Examples

xvc root can be used in scripts to make paths relative to the Xvc project root.

By default, it shows the relative path.

$ xvc root
..

When you supply --absolute, it prints the absolute path.

$ xvc root --absolute
/home/user/my-xvc-project/
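
As a sketch of the scripting use mentioned above, the function below runs a command from the project root regardless of the directory the script is started in. The function name is hypothetical. It relies only on the --absolute option shown in the synopsis.

```shell
# Sketch: run the given command with the Xvc project root as the working
# directory. The caller's working directory is left unchanged because the
# cd happens in a subshell.
in_xvc_root() {
    root=$(xvc root --absolute) || return 1
    ( cd "$root" && "$@" )
}
```

For example, in_xvc_root ls lists the contents of the project root from any subdirectory of the project.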

xvc check-ignore

Purpose

Check whether a path is ignored or whitelisted by Xvc.

Synopsis

$ xvc check-ignore --help
Check whether files are ignored with `.xvcignore`

Usage: xvc check-ignore [OPTIONS] [TARGETS]...

Arguments:
  [TARGETS]...
          Targets to check. If no targets are provided, they are read from stdin

Options:
  -d, --details
          Show the exclude patterns along with each target path. A series of lines are printed in this format: <path/to/.xvcignore>:<line_num>:<pattern> <target_path>

      --ignore-filename <IGNORE_FILENAME>
          Filename that contains ignore rules
          
          This can be set to .gitignore to test whether Git and Xvc work the same way.
          
          [default: .xvcignore]

  -n, --non-matching
          Include the target paths which don’t match any pattern in the --details list. All fields in each line, except for <target_path>, will be empty. Has no effect without --details

  -h, --help
          Print help (see a summary with '-h')

Examples

By default it checks the files supplied from stdin.

$ xvc check-ignore
my-dir/my-file

If you supply paths from the CLI, they are checked instead.

$ xvc check-ignore my-dir/my-file another-dir/another-file

If you want to find out which .xvcignore file ignores (or whitelists) a certain path, you can use --details.

$ xvc check-ignore --details my-dir/my-file another-dir/another-file

The .xvcignore file format is identical to the .gitignore file format. This utility can be used to check ignore rules in other files as well. You can specify an alternative ignore filename with the --ignore-filename option. The command below behaves like git check-ignore and should give the same results.

$ xvc check-ignore --ignore-filename .gitignore 

xvc aliases

Synopsis

$ xvc aliases --help
Print command aliases to be sourced in shell files

Usage: xvc aliases

Options:
  -h, --help  Print help

Examples

You can include aliases in interactive shells.

$ eval "$(xvc aliases)"
$ pvc --help
Pipeline management commands

Usage: xvc pipeline [OPTIONS] <COMMAND>

Commands:
  new     Add a new pipeline
  update  Rename, change dir or set a pipeline default
  delete  Delete a pipeline
  run     Run a pipeline
  list    List all pipelines
  dag     Generate mermaid diagram for the pipeline
  export  Export the pipeline to a YAML, TOML or JSON file
  import  Import the pipeline from a file
  step    Step management commands
  help    Print this message or the help of the given subcommand(s)

Options:
  -n, --name <NAME>  Name of the pipeline this command applies to
  -h, --help         Print help information

If you add the above line to your .bashrc or .zshrc, these aliases will always be available.

You can get a list of aliases.

$ xvc aliases

alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'

If there are aliases that you'd rather not use with Xvc, you can unalias them.

This command is not implemented yet. Please see https://github.com/iesahin/xvc/issues/176 for its progress.

Rust API

xvc

See https://docs.rs/xvc/ for latest version of the Xvc API

xvc-config

See https://docs.rs/xvc-config/ for latest version of the Xvc API

xvc-core

See https://docs.rs/xvc-core/ for latest version of the Xvc API

xvc-ecs

xvc-file

See https://docs.rs/xvc-file/ for latest version of the Xvc API

xvc-logging

See https://docs.rs/xvc-logging/ for latest version of the Xvc API

xvc-pipeline

See https://docs.rs/xvc-pipeline/ for latest version of the Xvc API

xvc-storage

See https://docs.rs/xvc-storage/ for latest version of the Xvc API

xvc-walker

See https://docs.rs/xvc-walker/ for latest version of the Xvc API

Xvc Architecture

The malleability of the material (bits and bytes) we're working with leads to difficulties in architecting software. Unlike real architecture, bits and bytes don't bring natural restrictions. It's not possible to build skyscrapers with mud bricks, and our material is much more malleable. There are so many options and so many ways to solve problems that it's easy to sink into technical mud with the decisions we make.

Software developers created a set of architectural principles to overcome this lack of limits. Most of these principles are bogus. They are not tested in the field. We seldom have software that's still perfectly maintainable after ten years. Usually, reading and understanding the code is more difficult than coming up with a new solution and rewriting it.

In this chapter, we describe the problems, assumptions, and solutions in Xvc's intended domain. It's a work in progress but should give you ideas about the intentions behind decisions.

After two decades, I (un)learned a few basic principles regarding software development.

  • Object Oriented Programming doesn't work. Mixing data and functions (methods) isn't a good way to write programs. It leads to artificial layers and structures that become burdensome in the long run. It forces the developer to think about both the data and functionality at the same time. This makes reasoning and solving the problem harder than it should be.

  • Data structures are more important than algorithms. Using a few distinct, well-thought-out data structures is more important than creating the best algorithm. Algorithms are replaceable locally without much peripheral impact. Modifying data structures usually requires updates to all related elements.

  • DRY is overrated. It may be a good principle after you write the first version. However, during the actual development phase, it's not a good idea to try not to repeat yourself. What parts of the program repeat, what parts rhyme, and what should be abstracted can be seen after we write the whole. Trying to apply abstract principles to exploratory development hinders the ability to solve problems as plainly as possible.

  • More errors are made in the name of abstraction than the reverse. Abstractions don't always help. They usually distribute a single functionality across arbitrary layers. In the age of LSP, it's easier to find repeating functionality and merge or rewrite it than to fix incorrect assumptions about abstractions. Problems with repeating code are obvious and easier to fix than problems with abstractions.

  • Vertical architecture is more important than horizontal architecture. Vertical architecture means the lower the number of layers between the user and their intention, the better. If the user wants to copy a file, creating a layer of abstract classes to make this more modular doesn't result in more resilient software. If you want to detect whether you're in a Git repository, checking for the presence of the .git directory is simpler than creating a few abstract classes that work for more than one SCM and implementing abstract methods for them. The architecture shouldn't try to satisfy abstract patterns; it should make the path between the user's action and its effect as direct as possible.

Xvc Modules (Crates)

Xvc is composed of modules that can be tested and used independently. The core module is in the middle of the architecture. Lower-level crates interface with the OS and convert what they read into data structures. Higher levels use these data structures to implement functionality.

For example, the xvc-walker crate interfaces with directories, paths and ignore rules, and serves a set of paths with their metadata. The xvc-file crate uses these to check whether a file has changed.

  • logging: Logger definitions and debugging macros.
  • walker: A file system directory walker that checks ignore files. It can also notify the changes in the directory via channels after the initial traversal.
  • config: Configuration framework that loads configuration from various levels (Default, System, User, Project, Environment) and merges these with command line options for each module.
  • ecs: The entity-component system responsible for saving and loading state of all data structures, along with their associations and queries.
  • storage: Commands and functionality to configure external (local or cloud) locations to store file content.
  • core: Xvc specific data structures and utilities. All user level modules use this module for shared functionality.

  • file: Commands to track files and utilities around file management.
  • pipeline: Commands to define data pipelines as DAGs and run them.

The current dependency graph where lower-level modules are used directly is this:

graph TD 

xvc --> xvc-file
xvc --> xvc-pipeline

xvc-file --> xvc-config
xvc-file --> xvc-core
xvc-file --> xvc-ecs
xvc-file --> xvc-logging
xvc-file --> xvc-walker
xvc-file --> xvc-storage

xvc-pipeline --> xvc-config 
xvc-pipeline --> xvc-core
xvc-pipeline --> xvc-ecs
xvc-pipeline --> xvc-logging
xvc-pipeline --> xvc-walker

xvc-config --> xvc-walker
xvc-config --> xvc-logging

xvc-ecs --> xvc-logging

xvc-core --> xvc-config
xvc-core --> xvc-logging
xvc-core --> xvc-walker
xvc-core --> xvc-ecs

xvc-walker --> xvc-logging

After the crate interfaces are stabilized, all lower-level functions will be reused from xvc-core. It will provide the basic Xvc API. In this case, the graph will be simplified.

graph TD 

xvc --> xvc-file
xvc --> xvc-pipeline

xvc-file --> xvc-core

xvc-pipeline --> xvc-core

xvc-config --> xvc-walker
xvc-config --> xvc-logging

xvc-ecs --> xvc-logging

xvc-core --> xvc-config
xvc-core --> xvc-logging
xvc-core --> xvc-walker
xvc-core --> xvc-ecs
xvc-core --> xvc-storage

xvc-walker --> xvc-logging

Any improvement in user-level API will be done higher than xvc-core levels. Any improvement in lower-level modules will be done in dependencies of xvc-core.

Goals

Xvc is a CLI MLOps tool to track file, data, pipeline, experiment, and model versions.

It has the following goals:

  • Enable users to track any kind of file in Git, including large binary, data and model files.
  • Enable users to get a subset of these files.
  • Enable users to remove files from the workspace temporarily and retrieve them from the cache.
  • Enable users to upload and download these files to and from a central server.
  • Enable users to run pipelines composed of commands.
  • Enable users to invalidate pipelines partially.
  • Enable users to run a pipeline or arbitrary commands as experiments, and to store and retrieve them.

Xvc users are data and machine learning professionals that need to track large amounts of data. They also want to run arbitrary commands on this data when it changes. Their goal is to produce better machine learning models and better suited data for their problems.

We have three quality goals:

  • Robustness: The system should be robust for basic operations.
  • Performance: The overall system performance must be within the ballpark of usual commands like b3sum or cp.
  • Availability: The system must run on all major operating systems.

Xvc users work with large amounts of data. They want to depend on Xvc for basic operations like tracking file versions, and uploading these to a central location.

They don't want to wait too long for these operations on common hardware.

They would like to download their data to any system running various operating systems.

Xvc Cache

The cache is where Xvc copies the files it tracks.

It's located under the .xvc directory.

Instead of the file tree that's normally used to address files, it uses the content digests of files to organize them.

In a standard file hierarchy, we have files in paths like /home/iesahin/Photos/my-photo.png. Xvc doesn't use such a tree in its cache. It uses paths like .xvc/b3/a12/b45/d789a...f54/0.png to refer to files.

Producing the cache path from the file's content means that the cache path changes when the file is updated. For example, if you save another photo on top of my-photo.png, the first version will be lost in the workspace. However, as these two versions produce different digests, they can be stored in different locations in the cache.

There are 4 parts of this cache path.

.xvc part is the standard directory xvc init command creates. It resides in the root folder of your project.

b3/ denotes the digest type of the content digest. Xvc supports more than one algorithm to calculate content digests. The HashAlgorithm enum (https://docs.rs/xvc-core/0.4.0/xvc_core/types/hashalgorithm/enum.HashAlgorithm.html) shows which algorithms are supported. Each of these algorithms has a 2-letter prefix.

  • b3 → BLAKE3
  • b2 → BLAKE2s
  • s3 → SHA2-256
  • s2 → SHA3-256

Note that all these digest algorithms produce digests of 256 bits (32 bytes). A digest is written as 64 hexadecimal digits. In order to keep the total path length shorter, Xvc currently requires digests to be 32 bytes in length.

The third part of the cache path is these 64 hexadecimal digits in the form a12/b45/d789...f54/. The 64 digits are split into directories to keep the number of entries under a single directory lower. Had Xvc put all cache elements in a single directory, it could lead to degraded performance in some file systems. With this arrangement, b3/ can contain at most 4096 directories, each of which can contain at most 4096 directories. With a uniform distribution and good hash algorithms, there won't be more than about 4000 elements per directory until the cache holds roughly 68 billion (4096³) files.

The fourth part is the 0.png part, that's the file itself with the same extension but with 0 as the basename. Xvc uses digest as a directory instead of file name. There may be times when the file in the cache should be used manually, on remote storages for example. The extension is kept for this reason, to make sure that the OS recognizes the file type correctly.

The rename to 0 means that this is the whole file. In the future, when Xvc supports splitting large files to transfer to remotes, all parts of the file will be put into this directory.

Storages also use the same cache structure, with an added GUID part to use a single storage for multiple projects.
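The four parts described above can be put together in a few lines. The following is a minimal sketch, not Xvc's actual implementation: the function name cache_path and its parameters are hypothetical, and it assumes the digest is already available as 32 bytes.

```rust
use std::path::PathBuf;

/// Builds a cache path of the form `.xvc/b3/a12/b45/<58 hex digits>/0.png`.
/// `prefix` is the two-letter algorithm prefix and `extension` is the file
/// extension without the dot. Hypothetical helper for illustration only.
pub fn cache_path(prefix: &str, digest: &[u8; 32], extension: &str) -> PathBuf {
    // 32 bytes become 64 lowercase hexadecimal digits.
    let hex: String = digest.iter().map(|b| format!("{:02x}", b)).collect();

    let mut path = PathBuf::from(".xvc");
    path.push(prefix);
    // Two levels of 3 hex digits each: at most 4096 entries per level.
    path.push(&hex[0..3]);
    path.push(&hex[3..6]);
    // The remaining 58 digits name the directory that holds the file.
    path.push(&hex[6..]);
    // The file keeps its extension but gets 0 as its basename.
    if extension.is_empty() {
        path.push("0");
    } else {
        path.push(format!("0.{}", extension));
    }
    path
}
```

Because the path depends only on the digest and the extension, two versions of the same workspace file end up in different cache directories.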

The Architecture of Xvc Entity Component System

Xvc uses an entity component system (ECS) at its core. ECS architecture is popular in game development, but hasn't found popularity in other areas. It's an alternative to Object-Oriented Programming.

There are a few basic notions of ECS architecture. Although it may differ in other frameworks, Xvc assumes the following:

  • An entity is a neutral way of tracking components and their relationships. It doesn't contain any semantics other than being an entity. An entity in Xvc is an atomic integer tuple. (XvcEntity)

  • A component is a bundle of associated data about these entities. All semantics of entities are described through components. Xvc uses components to keep track of different aspects of file system objects, dependencies, remotes, etc.

  • A system is where the components are created and modified. Xvc considers all modules that interact with components as separate systems.

Suppose you want to track a new file in Xvc. Xvc creates a new entity for this file and associates the path (XvcPath) with this entity. It checks the file metadata, creates an instance of XvcMetadata, and associates it with this entity. If this object is committed to the Xvc cache, an XvcDigest struct is also associated with the entity.

The difference from OOP is that there is no basic or main object. If you want to work only with digests and want to find the workspace paths associated with them, you can write a function (system) that starts from XvcDigest records and collects the associated paths. If you want to get only the files larger than a certain size, you can work with XvcMetadata, filter them and get the paths later. In contrast, in an OOP setting, this kind of data is associated with paths, and when you want to do such operations, you need to load paths and all their associations first. The OOP way of doing things is usually against the principle of locality.

The whole idea is to be flexible for further changes. As of now, Xvc doesn't have different notions of data and models. It doesn't have different functionality for files that are models or data. In the future, however, when this is added, an XvcModel component will be created and associated with the same entity as an XvcPath. It will allow working with some paths as model files, but it doesn't require paths to be known beforehand. There may be other metadata associated with models, like features or version, that are more important. There may be some models without a file system path, maybe living only in memory or in the cloud. Those kinds of models might be checked by verifying whether the model has a corresponding XvcPath component or not.

In contrast, OOP would define this either by inheritance (a model is a path) or containment (a model has a path). When you select any of these, it becomes a relationship that must be maintained indefinitely. When you only have an integer that identifies these components, it's much easier to describe models without a path later. There is no predefined relationship between paths and models.

The architecture is approximately similar to database modeling. Components are in-memory tables, albeit they are small and mostly contain a few fields. Entities are sequential primary keys. Systems are insert, query and update mechanisms.

Stores

An XvcStore in its basic definition is a map structure between XvcEntity and a component type T. It has facilities for persistence, iteration, search and filtering. It can be considered a system in the usual ECS sense.

Loading and Saving Stores

As our goal is to track data files with Git, stores save and load binary files' metadata to text files. Instead of storing the binary data itself in Git, Xvc stores information about these files to track whether they are changed.
By default, this metadata is persisted as JSON, so component types must be serializable. As they are almost always composed of basic types that serde supports, this doesn't pose a difficulty in usage. The JSON files are then committed to Git.

Note that, there are usually multiple branches in Git repositories. Also multiple users may work on the same branch.

When these text files are reused by the stores, they are modified and this may lead to merge conflicts. We don't want our users to deal with merge conflicts with entities and components in text files. This also makes it possible to use binary formats like MessagePack in the future.

Suppose user A made a change in XvcStore<XvcPath> by adding a few files. Another user B made another change to the project, by adding another set of files in another copy of the project. This will lead to merge conflicts:

  • XvcEntity counter will have different values in A and B's repositories.
  • XvcStore<XvcPath> will have different records in A and B's repositories.

Instead of saving and loading monolithic files, XvcStore saves and loads event logs. There are two kinds of events in a store:

  • Add(XvcEntity, T): Adds an element T to a store.
  • Remove(XvcEntity): Removes the element with entity id.

These events are saved into files. When the store is loaded, all files after the last full snapshot are loaded and replayed.

When you add an item to a store, it saves the Add event to a log. These events are then put into a vector. A BTreeMap is also created from this vector.

When an item is deleted, a Remove event is added to the event vector. While loading, stores remove the elements with Remove events from the BTreeMap, so the final set of elements doesn't contain the removed item.
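Replaying such a log can be sketched as follows. This is an illustration under assumptions rather than the code in xvc-ecs: the Entity alias stands in for XvcEntity, its integers are assumed to be unsigned, and the function name replay is hypothetical.

```rust
use std::collections::BTreeMap;

/// Stand-in for XvcEntity: a pair of 64-bit integers (assumed unsigned here).
pub type Entity = (u64, u64);

/// The two kinds of events a store records.
#[derive(Debug, Clone, PartialEq)]
pub enum Event<T> {
    Add(Entity, T),
    Remove(Entity),
}

/// Replays the events in order and returns the resulting entity-to-component map.
pub fn replay<T: Clone>(events: &[Event<T>]) -> BTreeMap<Entity, T> {
    let mut map = BTreeMap::new();
    for event in events {
        match event {
            // An Add event inserts (or overwrites) the component of an entity.
            Event::Add(entity, value) => {
                map.insert(*entity, value.clone());
            }
            // A Remove event drops the entity from the final map.
            Event::Remove(entity) => {
                map.remove(entity);
            }
        }
    }
    map
}
```

Since the log is only appended to, two users who add different files produce different log files instead of conflicting edits to a single file.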

The second problem with multiple branches is duplicate entities in separate branches. Xvc uses a counter to generate unique entity ids. When a store is loaded, it checks the last entity id in the event log and uses it as the starting point for the counter. But using this counter as is causes duplicate values in different branches. Xvc solves this by adding a random value to these counter values.

Since v0.5, XvcEntity is a tuple of 64-bit integers. The first is loaded from the disk and is an atomic counter. The second is a random value that is renewed at every command invocation. Therefore we have a unique entity id for every run, that's also sortable by the first value. Easy sorting with integers is sometimes required for stable lists.
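A generator for such ids could look like the sketch below. EntityGenerator and its methods are hypothetical names, the integers are assumed to be unsigned, and the random value is passed in by the caller so that the example needs nothing beyond the standard library.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical generator for (counter, random) entity ids.
pub struct EntityGenerator {
    counter: AtomicU64,
    salt: u64,
}

impl EntityGenerator {
    /// `last_counter` is the highest counter value loaded from disk.
    /// `salt` should be a fresh random value for every command invocation.
    pub fn new(last_counter: u64, salt: u64) -> Self {
        Self {
            counter: AtomicU64::new(last_counter),
            salt,
        }
    }

    /// Returns the next entity id.
    pub fn next_entity(&self) -> (u64, u64) {
        // fetch_add returns the previous value, so add 1 to get the new one.
        let n = self.counter.fetch_add(1, Ordering::SeqCst) + 1;
        (n, self.salt)
    }
}
```

Two branches that both continue from the same counter value still produce different ids as long as their random parts differ, and tuples compare by their first element first, which keeps the ids sortable.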

Inverted Index

Stores also have an inverted index for quick lookup. It stores the value of T as the key and a list of entities that correspond to this key. For example, when we have a path that we stored, it's a single operation to get the corresponding XvcEntity, and after this, all recorded metadata about this path is available.

All search, iteration and filtering functionality is performed using these two internal maps.

In summary, a store has four components.

  • An immutable log of previous events: Vec<Event<T>>
  • A mutable log of current events: Vec<Event<T>>
  • A mutable map of the current data: BTreeMap<XvcEntity, T>
  • A mutable map of the entities from values: BTreeMap<T, Vec<XvcEntity>>

Note that when two branches perform the same operation, the event logs will be different, as the random part of XvcEntity is different. When two branches merge, the inverted index may contain conflicting values. In this case, a fsck command is used to merge the store files and resolve conflicting entity ids.

Insert, update and delete operations affect the mutable log and the maps. Queries, iteration and other non-destructive operations are done with the maps. When loading, all log files are merged into the immutable log. No standard operation touches the loaded event logs; all modifications to them are done outside of the normal workflow. When saving, only the mutable log is saved. Note that events can only be added to the log; they are never removed. (See xvc fsck --merge-stores for merging store files.)
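The four parts of a store and the way operations touch them can be sketched as follows. This is a simplified model, not the XvcStore type itself: the struct and method names are hypothetical, the Entity alias stands in for XvcEntity, and persistence is left out.

```rust
use std::collections::BTreeMap;

pub type Entity = (u64, u64);

#[derive(Debug, Clone, PartialEq)]
pub enum Event<T> {
    Add(Entity, T),
    Remove(Entity),
}

pub struct Store<T: Ord + Clone> {
    previous_events: Vec<Event<T>>, // loaded from disk, never modified
    current_events: Vec<Event<T>>,  // produced in this run, saved at the end
    map: BTreeMap<Entity, T>,
    index: BTreeMap<T, Vec<Entity>>, // inverted index: value -> entities
}

impl<T: Ord + Clone> Store<T> {
    /// Rebuilds both maps by replaying the events loaded from disk.
    pub fn from_events(previous_events: Vec<Event<T>>) -> Self {
        let mut store = Store {
            previous_events: Vec::new(),
            current_events: Vec::new(),
            map: BTreeMap::new(),
            index: BTreeMap::new(),
        };
        for event in &previous_events {
            store.apply(event);
        }
        store.previous_events = previous_events;
        store
    }

    fn apply(&mut self, event: &Event<T>) {
        match event {
            Event::Add(entity, value) => {
                self.forget(entity); // drop a stale value of this entity, if any
                self.map.insert(*entity, value.clone());
                self.index.entry(value.clone()).or_default().push(*entity);
            }
            Event::Remove(entity) => self.forget(entity),
        }
    }

    fn forget(&mut self, entity: &Entity) {
        if let Some(old) = self.map.remove(entity) {
            let now_empty = match self.index.get_mut(&old) {
                Some(list) => {
                    list.retain(|e| e != entity);
                    list.is_empty()
                }
                None => false,
            };
            if now_empty {
                self.index.remove(&old);
            }
        }
    }

    /// Inserts update the maps and append to the mutable log only.
    pub fn insert(&mut self, entity: Entity, value: T) {
        let event = Event::Add(entity, value);
        self.apply(&event);
        self.current_events.push(event);
    }

    pub fn remove(&mut self, entity: Entity) {
        let event = Event::Remove(entity);
        self.apply(&event);
        self.current_events.push(event);
    }

    pub fn get(&self, entity: &Entity) -> Option<&T> {
        self.map.get(entity)
    }

    /// Looks up entities by value through the inverted index.
    pub fn entities_for(&self, value: &T) -> &[Entity] {
        self.index.get(value).map(|v| v.as_slice()).unwrap_or(&[])
    }

    /// The events that would be written to a new log file when saving.
    pub fn pending_events(&self) -> &[Event<T>] {
        &self.current_events
    }

    pub fn loaded_event_count(&self) -> usize {
        self.previous_events.len()
    }
}
```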

Relationship Stores

XvcStore keeps component-per-entity. Each component is a flat structure that doesn't refer to other components.

Xvc also has relation stores that represent relationships between entities and components. Similar to the database Entity-Relationship model, there are three kinds of relationship stores:

R11Store<T, U> keeps two sets of components associated with the same entity. It represents a 1-1 relationship between T and U. It contains two XvcStores for each component type. These two stores are indexed with the same XvcEntity values. For example, an R11Store<XvcPath, XvcMetadata> keeps track of path metadata for the identical XvcEntity keys.

R1NStore<T, U> keeps parent-child relationships. It represents a 1-N relationship between T and U. On top of two XvcStores, this one keeps track of relationships with a third XvcStore<XvcEntity>. It lists which Us are children of which Ts. For example, a value of XvcPipeline can have multiple XvcSteps. These are represented with R1NStore<XvcPipeline, XvcStep>. This struct has parent-to-child and child-to-parent functions that can be used to get the children of a parent, or the parent of a child element.

The third type is RMNStore<T, U>. This one keeps arbitrary number of relationships between T and U. Any number of Ts may correspond to any number of Us. This type of store keeps the relationships in two XvcStore<XvcEntity>'s.
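As an illustration of the 1-N case, the sketch below models a parent-child store with three maps. OneToMany and its methods are hypothetical names and plain BTreeMaps stand in for XvcStore; the real R1NStore API may differ.

```rust
use std::collections::BTreeMap;

pub type Entity = (u64, u64);

/// Hypothetical model of a 1-N store: two component maps plus a third map
/// from child entity to parent entity.
pub struct OneToMany<T, U> {
    pub parents: BTreeMap<Entity, T>,
    pub children: BTreeMap<Entity, U>,
    pub child_parents: BTreeMap<Entity, Entity>,
}

impl<T, U> OneToMany<T, U> {
    pub fn new() -> Self {
        Self {
            parents: BTreeMap::new(),
            children: BTreeMap::new(),
            child_parents: BTreeMap::new(),
        }
    }

    /// Records `child` as a child of `parent`.
    pub fn insert(&mut self, parent: Entity, parent_value: T, child: Entity, child_value: U) {
        self.parents.entry(parent).or_insert(parent_value);
        self.children.insert(child, child_value);
        self.child_parents.insert(child, parent);
    }

    /// Parent-to-child lookup: all children of a parent.
    pub fn children_of(&self, parent: Entity) -> Vec<(&Entity, &U)> {
        self.child_parents
            .iter()
            .filter(|(_, p)| **p == parent)
            .filter_map(|(c, _)| self.children.get(c).map(|u| (c, u)))
            .collect()
    }

    /// Child-to-parent lookup: the single parent of a child, if any.
    pub fn parent_of(&self, child: Entity) -> Option<(&Entity, &T)> {
        let parent = self.child_parents.get(&child)?;
        self.parents.get_key_value(parent)
    }
}
```

A pipeline entity with two step entities would be stored as one entry in parents, two entries in children and two entries in child_parents.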

Comparisons

In order to avoid unnecessary work, we need to find differences across versions. What has changed between the previous version and this version of type T?

Xvc is built bottom up, with vertical, long functions that do one thing. For example, xvc file track is written separately from xvc file checkout, and the commonalities emerged after these implementations. We consider implementation a form of planning. We didn't start from traits and try to fit everything to these. Instead, we began from concrete enums and structs, saw that some of these share common functionality, and decided to group this common functionality as a trait after implementing several concrete functions.

We saw the diff pattern across all functionality. In xvc pipeline, we need to detect changes in dependencies to decide whether to invalidate them. In xvc file, we need to detect changes in files and directories to decide whether they should be committed to the cache.

It's easy to make comparison/subtraction when the data types are numeric. For a signed integer, you can get a single numeric value as diff with diff = a - b. For complex data structures, representing the change is usually not straightforward.

We keep track of everything in the repository in stores. These serialize a type T to a file, and get it back when needed. The diff pattern works with these types. Sometimes, there happens to be no record of something we have in the repository. Sometimes, we have only the record, and not the actual thing on disk. The diff should also handle this.

Instead of trying to come up with some wizardry, in the end we decided to represent this with five conditions.

  • Identical: When two things of the same type T are equal. Nothing has changed between the actual version and its record.

  • RecordMissing { actual: T }: We have something on disk, but can't find the respective record. For example, a new file is added to the disk and xvc file track detects it for the first time. The action is usually creating a record from actual: T.

  • ActualMissing { record: T }: We found a record in the store, but the corresponding file is not there. This happens when a tracked file is deleted but the store still keeps its record.

  • Difference { record: T, actual: T }: There is a record, but the actual data isn't identical. When a tracked file is changed, and its content hash now returns another digest, this can be reflected with Difference.

  • Skipped: When the comparison seems unnecessary. For example, if we know a file hasn't changed by checking its metadata. In this case, we don't calculate its content digest and set it to Skipped.

These five conditions are represented in DeltaField type.

As an entity may have more than one component, a comparison may require multiple DeltaFields. For example, we may want to compare an XvcPath, to see whether it has changed. This requires comparing its XvcMetadata, its ContentDigest if it's a file, its CollectionDigest if it's a directory, etc. There are various such Delta types.
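The five conditions map naturally to an enum. The sketch below mirrors the variant names given above, but the compare function is hypothetical, and so is its choice of returning Skipped when neither a record nor an actual value exists.

```rust
/// The five outcomes of comparing a record with the actual value.
#[derive(Debug, Clone, PartialEq)]
pub enum DeltaField<T> {
    Identical,
    RecordMissing { actual: T },
    ActualMissing { record: T },
    Difference { record: T, actual: T },
    Skipped,
}

/// Hypothetical helper: compares an optional record with an optional actual value.
pub fn compare<T: PartialEq>(record: Option<T>, actual: Option<T>) -> DeltaField<T> {
    match (record, actual) {
        // Assumption of this sketch: with nothing on either side, there is
        // nothing to compare, so the comparison is reported as skipped.
        (None, None) => DeltaField::Skipped,
        (None, Some(actual)) => DeltaField::RecordMissing { actual },
        (Some(record), None) => DeltaField::ActualMissing { record },
        (Some(record), Some(actual)) if record == actual => DeltaField::Identical,
        (Some(record), Some(actual)) => DeltaField::Difference { record, actual },
    }
}
```

Callers can then match on the result to decide on an action, for example creating a record for RecordMissing or updating the cache for Difference.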

Comparing files

Files are compared with several aspects. We assume their relative path (XvcPath) doesn't change. Other features like XvcMetadata, ContentDigest, etc. could be modified and are tracked.

The following struct is used to compare two files:

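/// Differences between the recorded and the actual state of a tracked file.
pub struct FileDelta {
    /// Change in the file system metadata of the file.
    pub delta_md: DeltaField<XvcMetadata>,
    /// Change in the digest calculated from the file content.
    pub delta_content_digest: DeltaField<ContentDigest>,
    /// Change in the digest calculated from the metadata.
    pub delta_metadata_digest: DeltaField<MetadataDigest>,
    /// Change in how the file is checked out from the cache, e.g. Copy or Hardlink.
    pub delta_cache_type: DeltaField<CacheType>,
    /// Change in whether the file is treated as text or binary.
    pub delta_text_or_binary: DeltaField<DataTextOrBinary>,
}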

When the user first starts tracking a file, all delta fields have the value RecordMissing. Each contains the actual value on disk. These are recorded to stores.

When they edit the file, its delta_md changes. Xvc checks whether the delta_content_digest has also changed.

When the user wants to check out the file in a different cache_type, for example changing the workspace version from Copy to Hardlink, delta_cache_type field contains a Difference value.

Comparing directories

A directory is considered as a collection of paths.

Its comparison is based on the (non-ignored) paths it contains.

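/// Differences between the recorded and the actual state of a tracked directory.
pub struct DirectoryDelta {
    /// Change in the directory's own metadata, such as size and modification time.
    pub delta_xvc_metadata: DeltaField<XvcMetadata>,
    // The remaining fields are generated from the paths the directory contains.
    pub delta_collection_digest: DeltaField<CollectionDigest>,
    pub delta_metadata_digest: DeltaField<MetadataDigest>,
    pub delta_content_digest: DeltaField<ContentDigest>,
}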

We record the size and modification time of the directories as well. When these change, they are reflected in the delta_xvc_metadata field.

The other fields are generated from the paths the directory contains.

Storages

Xvc uses storages to store content of the files. These storages are different from Git remotes. They don't contain Git history of a repository, but they can store contents of the files tracked by Xvc.

A storage uses the same content-addresses used in Xvc cache to store the files. For example, if there is a file in Xvc repository that points to /b3/1886572424...defa/0.png in local cache, this path will be used to identify the content in storage as well.

Additionally, Xvc stores storage event logs that list which operations are performed on that storage. By using these event logs, it's possible to identify what has gone on with storages without checking the file lists. These event logs are also shared with the other users, and a user can identify which files are present in a storage even without a connection.

Basic Operations

All storages should support the following operations:

  • Init to initialize a storage.
  • List to list the files available in the storage.
  • Send to upload files from local cache to a storage.
  • Receive to download files from a storage to local cache.
  • Delete to delete files from a storage.

All these operations record a distinct event to the event log.

Events record the event, guid of the storage and the event content.

Event contents are like the following:

  • Init creates the necessary directories and the guid file in a storage
  • List includes the listing retrieved from the storage. Once a list is retrieved from the storage, it's available for local operations. The most recent list is the starting point to determine the files available in a storage.
  • Send event contains the affected paths. These paths are added to storage file list.
  • Receive event contains the affected paths. These paths are added to storage file list.
  • Delete event contains the paths of the deleted files, possibly several at once. These paths are removed from storage file list.
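The way a file list follows from these events can be sketched as below. The StorageEvent enum and the known_files function are hypothetical simplifications: real events also carry the storage guid, and paths are plain strings here.

```rust
use std::collections::BTreeSet;

/// Simplified storage events; each variant carries only the affected paths.
pub enum StorageEvent {
    Init,
    List(Vec<String>),
    Send(Vec<String>),
    Receive(Vec<String>),
    Delete(Vec<String>),
}

/// Derives the set of files known to be in a storage from its event log,
/// without contacting the storage.
pub fn known_files(events: &[StorageEvent]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for event in events {
        match event {
            StorageEvent::Init => {}
            // A listing replaces what we knew: it is the new starting point.
            StorageEvent::List(paths) => {
                files = paths.iter().cloned().collect();
            }
            // Sent and received paths are added to the file list.
            StorageEvent::Send(paths) | StorageEvent::Receive(paths) => {
                files.extend(paths.iter().cloned());
            }
            // Deleted paths are removed from the file list.
            StorageEvent::Delete(paths) => {
                for path in paths {
                    files.remove(path);
                }
            }
        }
    }
    files
}
```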

Storage types

Local Storages

A local storage is a directory in the local file system. It may be a mount point shared with others, or another disk that you use for backups and sharing.

  • Init uses std::fs::copy to copy the GUID file to the appropriate directory
  • List uses std::fs::read_dir.
  • Send uses std::fs::copy with rayon.
  • Receive uses std::fs::copy with rayon.
  • Delete uses std::fs::remove_file with rayon.

Generic Storages

These storages define commands for each of the operations listed above. They allow running external programs such as rsync, rclone, or s5cmd. For such storages, commands for the above operations must be defined, and they will be run in separate processes.

This storage type offloads the responsibility of exact operations to the user.

The user is expected to supply values for the following variables:

  • {URL}: The url for the storage. This can be anything the commands to send/receive/list will accept. It's to build the paths with minor repeats.

  • {STORAGE_DIR}: The directory in the storage. You can specify it separately from {URL}.

  • {PATH}: This is set by Xvc for each individual command. It's a path relative to the local cache directory.

  • {PROCESS_POOL_SIZE}: This value is used to set the number of processes to perform operations. Setting this to 1 makes all operations sequential.

  • List Command: A command to list the {URL}. For example, with rsync: rsync --list-only {URL}{STORAGE_DIR}

  • Send Command: A command to send a file to {URL}{STORAGE_DIR}. It can use {URL} and should use {PATH} in the command. An example may be rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}

  • Receive Command: A command to receive a file from a storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: rsync -a {URL}{STORAGE_DIR}{PATH} {PATH}

  • Delete Command: A command to delete a file from the storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: ssh {URL} "rm {STORAGE_DIR}{PATH}"

Generic storages use these commands to create multiple processes to send/receive/delete files. It's not as fast as using other types because of the overhead involved, but its flexibility is useful.
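Filling in the variables amounts to plain text substitution before the command is handed to a process. The sketch below shows the idea; expand_command is a hypothetical name, and how Xvc actually performs the substitution and splits the result into arguments is not shown here.

```rust
/// Replaces the {URL}, {STORAGE_DIR} and {PATH} placeholders in a command
/// template. Hypothetical helper for illustration.
pub fn expand_command(template: &str, url: &str, storage_dir: &str, path: &str) -> String {
    template
        .replace("{URL}", url)
        .replace("{STORAGE_DIR}", storage_dir)
        .replace("{PATH}", path)
}
```

With the send template rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}, a URL of user@example.com:, a storage directory of /backup/ and a path of b3/abc/0.png (all made-up values), the result is rsync -a b3/abc/0.png user@example.com:/backup/b3/abc/0.png.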

Git and Xvc

Xvc aims to fill the gap Git leaves for certain workflows. These workflows involve large binary data that shouldn't be replicated in each repository.

Xvc tracks all its metadata on top of Git. In most cases, Xvc assumes the presence of a Git repository where the user tracks the history, text files, and metadata. However, the relationship between these should be clear and separate.

Xvc doesn't (and shouldn't) use Git more than a user could use manually. Our aim is not to replace Git operations with Xvc operations or tamper with the internal structure of the Git repository. When Xvc uses Git to track ECS or other metadata, the operations must be separate and sandwich Xvc operations.

  • Any Git operation that involves checking out commits, branches, tags, or other references must come before any Xvc operation. As Xvc relies on the files tracked by Git, resuming any state for Xvc operations should be complete before these operations start.

  • Xvc helps to stage and commit certain files in .xvc/ to Git. By default, any state-changing operation in Xvc adds a commit to Git.

  • Xvc also helps to store this changed metadata in a new or existing branch. In this case, a checkout must be done before Xvc records the files.

sequenceDiagram
    User ->> Xvc: xvc --from-ref my-branch --to-branch another-branch file track large-dir/
    Xvc ->> Git: git checkout my-branch
    Git ->> Xvc: branch = my-branch
    Xvc->> xvc-file: track large-dir/
    xvc-file ->> Xvc: Ok. Saved the stores and metadata.
    Xvc ->> Git: Do we have user staged files?
    Git ->> Xvc: Yes. This and this.
    Xvc ->> Git: Stash them. 
    Git ->> Xvc: Stashed user staged files. 
    Xvc ->> Git: git checkout -b another-branch
    Git ->> Xvc: branch = another-branch
    Xvc ->> Git: git add .xvc/
    Git ->> Xvc: added .xvc/
    Xvc ->> Git: git commit -m "Commit after xvc file track"
    Xvc ->> Git: Unstash files that we have stashed

Note that if the user has some already staged files, these are stashed and unstashed to the requested branch. This is a side effect of doing xvc commit operations on behalf of the user. The other option is to report an error and quit if the user has the --to-branch option set. The behavior may change in the future. For the time being, we will keep this stash-unstash operation for the user files.

One other issue is the library that we're going to use. I checked several options when I was writing auto-commit functionality.

At that time, I found that the number of Git operations for each Xvc operation is less than five. These can be done by creating a Git process. The libraries are not 100% identical to Git in features. Even the most widely used one, libgit2, doesn't provide shallow clones, and it's not possible to use git stash --staged with it.

The second reason for this is explainability. Instead of trying to explain to the user what we are doing with Git, we can report the commands we are running. The library interfaces are different from Git CLI. They need to be learned before reading the code. Using Git CLI is more dependable, observable, and understandable than trying to come up with a set of library calls.
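Running Git as a process and reporting the exact command can be sketched with the standard library alone. The helper names below are hypothetical, and the use of git -C to select the repository is an assumption of this example, not necessarily how Xvc invokes Git.

```rust
use std::path::Path;
use std::process::Command;

/// Builds a `git` invocation that runs in `repo_root`. The command is only
/// constructed here; call `.status()` or `.output()` on it to run it.
pub fn git_command(repo_root: &Path, args: &[&str]) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("-C").arg(repo_root).args(args);
    cmd
}

/// Renders the command line as text, so it can be shown to the user.
pub fn describe(cmd: &Command) -> String {
    let mut parts = vec![cmd.get_program().to_string_lossy().into_owned()];
    parts.extend(cmd.get_args().map(|a| a.to_string_lossy().into_owned()));
    parts.join(" ")
}
```

The same value serves both purposes: the text from describe is what the user sees, and the Command is what actually runs, so the report can't drift from the operation.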

Concepts

  • Digest: A digest is a 32-byte numeric sequence to identify a file, content or any other data. Xvc uses different algorithms to generate this sequence.
  • Associated Digest: This is a specific kind of digest associated with an entity. An entity can have more than one digest, like a content digest or a metadata digest. Xvc uses these different kinds of digests to avoid unnecessary digest calculations.

Digest

A numerical summary of an entity. In Xvc, digests are 32 bytes long and are produced by BLAKE3 by default.

See Associated Digest for different types of digests.

Associated Digest

There may be multiple digests associated with an entity like a path, directory or dependency. An associated digest is any one of the digests associated with an entity.

Metadata Digest

Files and directories have metadata. Metadata shows information about the creation, modification, and access times of the file, or its size. Metadata is OS dependent in most cases. Xvc abstracts file and directory metadata with the XvcMetadata struct. The metadata digest represents this abstraction in 32 bytes to compare changes in files and directories.

Content Digest

The content digest of a file is calculated from the data it contains. It is a 32-byte value derived from the content. When the content changes, the result of this calculation also changes.

Collection Digest

Some entities in Xvc are composed of multiple elements. Examples are directories (composed of files), file lines, regex filter results, SQL query results, etc. Instead of trying to compare all elements, Xvc creates a 32-byte digest of the collection with the same conditions. For example, when a new file is added to a directory, its collection digest also changes. This is used to keep track of changed directories more easily than by moving members around.

Development

Code and Documentation Conventions

  • Xvc is spelled capitalized in documentation. It's Xvc, not XVC, not xvc.