Introduction to Xvc
Xvc is a command line utility to track large files with Git, define dependencies between files to run commands when only these dependencies change, and run experiments by making small changes in these files for later comparison. It's used mostly in Machine Learning scenarios where data and model files are large, code files depend on these and experiments must be compared via various metrics.
Xvc can use S3 and compatible cloud storages to upload tracked files with their exact version and can retrieve these later. This allows you to delete them from the project when they are not needed to save space, and get them back when needed. This facility can also be used for sharing these files. You can just clone the Git repository and get only the necessary Xvc-tracked files.
Xvc tracks files, directories and other elements by calculating their digests. These digests are used as addresses to store them and find their locations in the storages. When you make a change to a file, it gets a new digest and the changed version has a new address. This makes sure that all versions can be retrieved on demand.
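As a rough illustration of content addressing, the sketch below computes a digest of a file's bytes and stores the file under a path derived from that digest. It is not Xvc's actual cache layout: the function names, the two-level directory split, and the use of BLAKE2s from Python's standard library are assumptions made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """Return the hex BLAKE2s digest of a file's content."""
    hasher = hashlib.blake2s()
    with path.open("rb") as f:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def store(path: Path, cache_root: Path) -> Path:
    """Copy a file into a hypothetical cache, addressed by its digest."""
    digest = content_digest(path)
    # Hypothetical layout: first two hex characters as a subdirectory
    target = cache_root / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copy2(path, target)
    return target


def demo() -> tuple[Path, Path]:
    """Store two versions of one file and return both cache addresses."""
    workdir = Path(tempfile.mkdtemp())
    data = workdir / "data.txt"
    cache = workdir / "cache"
    data.write_text("first version\n")
    first = store(data, cache)
    data.write_text("second version\n")
    second = store(data, cache)
    return first, second
```

Calling `demo()` returns two different cache paths, and both versions stay readable, which is the property that lets every version be retrieved later.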
Xvc can be used as a make replacement to build multi-file projects with complex dependencies. Unlike make, which detects file changes with timestamps, Xvc checks files by their content. This reduces false positives in invalidation.
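The difference between the two strategies can be sketched in a few lines. The example below is only an illustration, not how make or Xvc are implemented: the helper names are made up, and BLAKE2s stands in for whichever digest is configured. It bumps a file's modification time without touching its content, then asks both checks whether the file changed.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def mtime_changed(path: Path, recorded_mtime_ns: int) -> bool:
    """Timestamp-based check: any change of mtime counts as a change."""
    return path.stat().st_mtime_ns != recorded_mtime_ns


def content_changed(path: Path, recorded_digest: str) -> bool:
    """Content-based check: only different bytes count as a change."""
    return hashlib.blake2s(path.read_bytes()).hexdigest() != recorded_digest


def demo() -> tuple[bool, bool]:
    path = Path(tempfile.mkdtemp()) / "source.c"
    path.write_text("int main(void) { return 0; }\n")
    recorded_mtime_ns = path.stat().st_mtime_ns
    recorded_digest = hashlib.blake2s(path.read_bytes()).hexdigest()
    # Simulate `touch`: move the timestamp forward, keep the content
    bumped = recorded_mtime_ns + 10_000_000_000
    os.utime(path, ns=(bumped, bumped))
    return (
        mtime_changed(path, recorded_mtime_ns),
        content_changed(path, recorded_digest),
    )
```

Here the timestamp check reports a change while the content check does not, which is the kind of false positive the paragraph above refers to.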
Xvc pipelines are used to define steps to reach a set of outputs. These steps have commands to run and may (or may not) produce intermediate outputs that other steps depend on. As of now, Xvc pipelines allow steps to depend on other steps, other pipelines, text and binary files, directories, globs that select a subset of files, certain lines in a file, certain regular expression results, URLs, and (hyper)parameter definitions in YAML, JSON or TOML files. More dependency types like environment variables, database tables and queries, S3 buckets, REST query results, generic CLI command results, Bitcoin wallets, and Jupyter notebook cells are planned.
For example, Xvc can be used to create a pipeline that depends on certain files in a directory via a glob, and a parameter in a YAML file to update a machine learning model. The same feature can be used to build software when the code or artifacts used in the software change. This allows binary outputs (as well as code inputs) to be tracked in Xvc. Instead of building everything from scratch in a new Git clone, a software project can rebuild only the portions that require it and reuse the rest. Binary distributions become much simpler.
This book is used as the documentation of the project. Like Xvc itself, it is a work in progress and may contain outdated information. Please report any errors and bugs at https://github.com/iesahin/xvc, as with the rest of the project.
Comparison with other tools
There are many similar tools for managing large files on Git, and for managing machine learning pipelines and experiments. Most ML-oriented tools are provided as SaaS and work in a different vein than Xvc.
Similar tools for file management on Git are the following:
git-annex
: One of the earliest and most successful projects to manage large files on Git. It supports a large number of remote storage types, as well as adding other utilities as backends, similar to xvc storage new generic. It features an assistant aimed at making common use cases easier. It uses SHA-256 as the single digest option and uses symlinks as a cache type. It doesn't have data pipeline features.
git-lfs
: It uses Git internals to track binary files. It requires server support for remote storages and allows only Git remotes to be used for binary file storage. It uses the same digest function Git uses (SHA-1 by default). It uses the .gitattributes mechanism to track certain files by default. It doesn't have data pipeline features.
dvc
: Uses YAML files in the working directory to track file content. It uses MD5 sums. It can use different cache types, applied to all the files in the repository. It has experiment tracking features, data pipelines and a SaaS GUI.
I have done some preliminary benchmarks to measure the time to add files. I added 70,000 files with a single command. xvc file track (0.3.1) finished in 19 seconds, git lfs track '*.png' ; git add 'data/images/**/*.png' in 56 seconds, dvc add data/images in 80 seconds and git-annex add data/images in around 11 minutes. Note that these measurements are affected by output behavior, and commands may gain some speed by turning off the default terminal output. Finer benchmarks will be provided in the future, when Xvc is optimized.
Installation
Rust
Linux
macOS
Windows
Configuration
Configuration Files
Configure with Environment Variables
Changing configuration for a command
Get Started with Xvc
Xvc is a multipurpose tool. Its features can be used by professionals with various roles. If you're working with data, you can benefit from Xvc data management features.
Xvc for Everyone
Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).
🐇 Hello tortoise. How are you? Let's take a selfie. Do you take selfies? I have lots of them. Terabytes of them.
🐢 I don't have many selfies, you know. I don't change quickly and the scenery changes even less often.
🐇 I see. I have terabytes of them, but can't find a good solution to store them. How do you store your documents? I know you have documents, lots and lots of them.
🐢 I track them with Git to track my evolving thoughts on text files. Images are different. I think it's not a good idea to keep images on Git, but there is a tool for that.
🐇 What kind of tool? Not Git, but something different?
🐢 It's called Xvc. You can keep track of your selfies with it. You can back them up, and get them as needed.
🐇 Tell me more about it. I have a directory in my home, ~/Selfies, and I have thousands of them. How do I start?
🐢 Xvc can be used as a standalone tool, but it works better with Git. You can just type
$ git init
$ xvc init
to start working with Xvc.
🐇 It looks easy but I heard that Git is complicated. Will I need to learn it?
🐢 Ah, no. If you're not willing to learn Git, you can just let Xvc handle that. By default, it handles all Git operations about the changes it makes. If you want to share your files with someone, you may need to learn how to manage a repository.
🐇 How do I track my files?
🐢 You use the xvc file track command. Do you have directories in ~/Selfies?
🐇 Yep. I have. Lots of them.
🐢 Do you want to track all of them?
🐇 Almost all. Some of them are so private that I want to hide even from Xvc.
🐢 You can use the .xvcignore file to list them. Xvc ignores the files you list in .xvcignore.
🐇 How do I add others? Could you give an example?
🐢 If you have a folder for today's selfies, type this in ~/Selfies
$ xvc file track today/
and Xvc will track everything in that directory.
🐇 Oh, that's easy. If I want to track everything not ignored, I can type xvc file track then.
🐢 You're a quick learner.
After a brief period, 🐇 went home and added the files.
🐇 Now, I want to learn how to share my selfies.
🐢 Xvc can store file contents in another location. First you must set up a storage. Do you use AWS S3?
🐇 Yes. I have buckets there. I want to keep my selfies in my rabbit-hole.
🐢 You can configure Xvc to use it with the xvc storage new s3 command. You'll specify the region and bucket, and Xvc will prepare it.
🐇 types
$ xvc storage new s3 --name selfies --region eu-lepus-1 --bucket rabbit-hole
🐢 Now, you can send your files there with xvc file send --to selfies.
🐇 Is that all?
🐢 You will also need to push your Git files to another place. Do you have a GitHub account?
🐇 Ah, yeah, I have.
🐢 Now create a repository for your selfies. We will configure Git to use it as origin.
$ git remote add origin https://github.com/🐇/selfies
$ git push --set-upstream origin main
Now, you can share your selfies with your friends.
🐇 Cool, but how does Xvc know my AWS password? Does it share my passwords?
🐢 No, never. You must allow your friends to read that bucket of yours. Xvc reads the credentials from AWS configuration, either from the file or the environment variables.
🐇 How will they get my files?
🐢 First, they must clone the repository.
$ git clone https://github.com/🐇/selfies
Then, they can get all files with:
$ cd selfies
$ xvc file get .
🐇 Oh, cool, they don't have to run xvc init again, right?
🐢 No, they don't. Xvc should be initialized only once per repository. When you have new selfies, you can share them with:
$ xvc file track
$ git push
and your friends can receive the changes with
$ git pull
$ xvc file get
🐇 The order of these commands is important, it seems.
🐢 Yep. You add to Xvc first. Xvc automatically commits the changes to Git. Then you push Git changes to remote. Your friends first pull these changes, then get the actual files.
🐇 Thank you tortoise. Let me get back to my hole.
Xvc for Data
Xvc for Machine Learning
Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).
🐇 Ah, hello tortoise. How are you? I began to work as a machine learning engineer, you know? I'll be the fastest.
🐢 You're quick as always, hare. How is your job going so far?
🐇 It's good. We have lots and lots of data. We have models. We have scripts to create those models. We have notebooks full of experiments. That's all good stuff. We'll solve the hare intelligence problem.
🐢 Sounds cool. Aren't you losing yourself in all these, though?
🐇 From time to time we have those moments. Some models work with some data, some experiments require some kind of preprocessing, some data changed since we started to work with it and now we have multiple versions.
🐢 I see. I began to use a tool called Xvc. It may be of use to you.
🐇 What does it do?
🐢 It keeps track of all the stuff you mentioned: data, models, scripts. It can also detect when data has changed and run the scripts associated with that data.
🐇 That sounds like something we need. My boss wanted me to build a pipeline for cat pictures. He runs a contest for cat pictures. Every time he finds a new cat picture he likes, we have to update the model.
🐢 He must have lots of cat pictures.
🐇 He has. He sometimes finds higher resolution versions and replaces older pictures. He has terabytes of cat pictures.
🐢 How do you keep track of those versions?
🐇 We don't. We have a disk for cat pictures. He puts everything there and we train models with it.
🐢 You can use Xvc to version those files. You can go back and forth in time, or have different branches. It's based on Git.
🐇 I know, but Git is for code files, right? I never found a good way to store image files in Git. It stores everything.
🐢 Yep. Git keeps all history in each repository. Better to keep those terabytes of images away from Git. Otherwise, you'll have terabytes of cat pictures in each clone you use. Xvc helps there. It tracks the contents of data files separately from Git. Image files are not put into Git objects, and they are not duplicated in all repositories.
🐇 You know, I'm not interested in details. Tell me how this works.
🐢 Ok. When you go back to the cat picture directory, create a Git repository, and initialize Xvc immediately.
$ git init
...
$ xvc init
? 0
🐇 No messages?
🐢 Xvc is the silent type of Unix command. It follows the "no news is good news" principle. We use ? 0 to indicate the command return code. 0 means success. If you want more output, you can add -v as a flag. Increase the number of -vs to increase the detail.
🐇 So -vvvvvvvvvvvvvvv will show which atoms interact on the disk while running Xvc?
🐢 It may work, try that next time. Now, you can add your cat pictures to Xvc. Xvc makes copies of tracked files by default. I assume you have a large collection. Better to make everything symlinks for now. We can change how specific files are linked to cache later.
$ xvc -v file track --cache-type symlink .
...
🐇 Does it track everything that way?
🐢 Yes. If you want to track only particular files or directories, you can replace . with their names.
🐇 What's the best cache type for me?
🐢 If your file system supports it, reflink seems the best way to me. It's like a symlink but makes a copy when your file changes. Most of the widely used file systems don't support it, though. If your files are read only and you don't have many links to the same files, you can use hardlink. If they are likely to change, you can use copy. If there are many links to the same files, it's better to use symlink.
🐇 So, symlinks are not the best? Why did you select it?
🐢 I suspect most of the files in your cat pictures are duplicates. Xvc stores only one copy of these in the cache and links all occurrences in the workspace to this copy. This is called deduplication. There are limits to the number of hardlinks, so I recommended symlinks. They are also more visible: you can see they are links. Hardlinks are harder to detect.
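A small aside for the curious: the deduplication 🐢 describes can be sketched as below. This is only an illustration under assumptions, not Xvc's implementation. The function name and the flat cache layout are made up, BLAKE2s is used because it ships with Python, and the cache directory is assumed to live outside the workspace.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def dedup_with_symlinks(workspace: Path, cache: Path) -> int:
    """Keep one cached copy per distinct content and symlink every file to it.

    Returns the number of distinct contents in the cache.
    """
    cache.mkdir(parents=True, exist_ok=True)
    # Materialize the listing first, because files are replaced while looping
    for path in sorted(workspace.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        digest = hashlib.blake2s(path.read_bytes()).hexdigest()
        cached = cache / digest
        if cached.exists():
            path.unlink()  # duplicate content: drop the extra copy
        else:
            os.replace(path, cached)  # first occurrence: move into the cache
        path.symlink_to(cached)
    return sum(1 for _ in cache.iterdir())


def demo() -> tuple[int, str]:
    root = Path(tempfile.mkdtemp())
    workspace, cache = root / "cats", root / "cache"
    workspace.mkdir()
    (workspace / "a.png").write_bytes(b"same cat")
    (workspace / "b.png").write_bytes(b"same cat")
    (workspace / "c.png").write_bytes(b"another cat")
    distinct = dedup_with_symlinks(workspace, cache)
    return distinct, (workspace / "b.png").read_bytes().decode()
```

Three files go in, two cached copies come out, and every file in the workspace still reads as before through its link.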
🐇 Ah, when I type ls -l, they all show the cache location now.
🐢 If you have a models/ directory and want to track them as copies, you can tell Xvc:
$ xvc file track --cache-type copy models/
It replaces previous symlinks with copies of the files only in models/.
🐇 Can I have my data read only and models writable?
🐢 You can. Xvc keeps track of each file's cache-type separately. Data can stay in read-only symlinks, and models can be copied so they can be updated and stored as different versions.
🐇 I have also scripts, what should I do with them?
🐢 Are you using Git for them?
🐇 Yep. They are in a separate repository. I think I can use the same repository now.
🐢 You can. Better to keep them in the same repository. They can be versioned with the data they use and models they produce. You can use standard Git commands to track them. If you track a file with Git, Xvc doesn't track it. It stays away from it.
🐇 You said we can create pipelines with Xvc as well. I created a multi-stage pipeline for cat picture models. It's like this:
graph LR
    cats["data/cats/"] --> pp-train["preprocess.py --train data/pp-train/"]
    pp-train --> train["train.py"]
    params["params.yaml"] --> train
    cat-ratings["cat-ratings.txt"] --> train
    train --> model["models/model.bin"]
    cats --> pp-test["preprocess.py --test data/pp-test/"]
    model --> test["test.py"]
    pp-test --> test
    test --> metrics["metrics.json"]
    test --> best-model["best-model.json"]
    best-model --> deploy["deploy.sh"]
🐢 It looks like a fairly complex pipeline. You can create a pipeline definition for it. For each separate command we'll have a step. How many different commands do you have?
🐇 A preprocess --train command, a preprocess --test command, a train command, a test command and a deploy command. Five.
🐢 Do you need more than one pipeline? Maybe you would like to put deployment to another pipeline?
🐇 No, I don't think so. I may have in the future.
🐢 Xvc has a default pipeline. We'll use it for now. If you need more pipelines, you can create them with xvc pipeline new.
🐇 How do I create steps for commands?
🐢 Let's create the steps at once. Each step requires a name and a command.
$ xvc pipeline step new --name preprocess-train --command 'python3 src/preprocess.py --train data/cats data/pp-train/'
$ xvc pipeline step new --name preprocess-test --command 'python3 src/preprocess.py --test data/cats data/pp-test/'
$ xvc pipeline step new --name train --command 'python3 src/train.py data/pp-train/'
$ xvc pipeline step new --name test --command 'python3 src/test.py data/pp-test/ metrics.json'
$ xvc pipeline step new --name deploy --command 'python3 deploy.py models/model.bin /var/server/files/model.bin'
🐇 How do we define dependencies?
🐢 You can have many different types of dependencies. All are defined by the xvc pipeline step dependency command. You can set up direct dependencies between steps: if one is invalidated, its dependents also run. You can set up file dependencies: if the file changes, the step is invalidated and needs to run. There are other, more detailed dependencies, like parameter dependencies, which take a file in JSON or YAML format and check whether a value has changed. There are also regular expression dependencies. For example, if you have a piece of code in your training script that you change to update the parameters, you can define a regex dependency.
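To make the idea of a regex dependency concrete, here is a minimal sketch. It assumes that such a dependency can be modeled as a digest over only the lines that match the pattern; whether Xvc computes it this way is an assumption, and the function name and sample data are invented for the example.

```python
import hashlib
import re


def regex_dependency_digest(text: str, pattern: str) -> str:
    """Digest only the lines of `text` that match `pattern`."""
    regex = re.compile(pattern)
    matching = [line for line in text.splitlines() if regex.search(line)]
    return hashlib.blake2s("\n".join(matching).encode()).hexdigest()


# Hypothetical ratings file: "<stars>,<picture>"
before = "5,tom.png\n3,felix.png\n"
unrelated_edit = "5,tom.png\n4,felix.png\n"
related_edit = "5,tom.png\n3,felix.png\n5,garfield.png\n"

# Only lines starting with "5," are part of the dependency
same = regex_dependency_digest(before, r"^5,") == regex_dependency_digest(
    unrelated_edit, r"^5,"
)
different = regex_dependency_digest(before, r"^5,") != regex_dependency_digest(
    related_edit, r"^5,"
)
```

Editing a line that does not match leaves the digest untouched, so a step depending on it would not be invalidated. Adding a matching line changes the digest.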
🐇 It looks like I can use this for CSV files as well.
🐢 Yes. If your step depends not on the whole CSV file but only on specific rows, you can use regex dependencies. You can also specify line numbers of a file to depend on.
🐇 My preprocess.py script depends on the data/cats directory. My train.py script depends on params.yaml for some hyperparameters, and reads 5 Star ratings from cat-contest.txt. I want to deploy when the newly produced model is better than the older one by checking best-model.json. My deployment script doesn't update the deployment if the new model is not the best.
🐢 Let's see. For each step, you can use a single command to define its dependencies. For preprocess.py, you'll depend on the data directory and the script itself. We want to run the step when the script changes. It's like this:
$ xvc pipeline step dependency --step-name preprocess-train --directory data/cats --file src/preprocess.py
$ xvc pipeline step dependency --step-name preprocess-test --directory data/cats --file src/preprocess.py
$ xvc pipeline step dependency --step-name train --directory data/pp-train --file src/train.py --param 'params.yaml::learning_rate' --regex 'cat-contest.csv:/^5,.*'
$ xvc pipeline step dependency --step-name test --directory models/ --directory data/pp-test/
$ xvc pipeline step dependency --step-name deploy --file best-model.json
You must also define the outputs these steps produce, so when an output is missing or a dependency is newer than the output, the step will need to rerun.
$ xvc pipeline step output --step-name preprocess-train --directory data/pp-train
$ xvc pipeline step output --step-name preprocess-test --directory data/pp-test
$ xvc pipeline step output --step-name train --directory models/
$ xvc pipeline step output --step-name test --file metrics.json --file best-model.json
$ xvc pipeline step output --step-name deploy --file /var/server/files/model.bin
🐇 These commands become too long to type. You know, I'm a lazy hare and don't like to type much. Is there an easier way?
🐢 You can try source <(xvc aliases) in your Bash or Zsh, and get a bunch of aliases for these commands. xvc pipeline step output becomes xvcpso, xvc pipeline step dependency becomes xvcpsd, etc. You can see the whole list:
$ xvc aliases
alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfc='xvc file checkout'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'
🐇 Oh, there are many more commands.
🐢 Yep. More to come as well. If you want to edit the pipelines you created in YAML, you can use xvc pipeline export and, after making the changes, xvc pipeline import.
🐇 I don't need to delete the pipeline to rewrite everything, then?
🐢 You can export a pipeline, edit and import with a different name to test. When you want to run them, you specify their names.
🐇 Ah, yeah, that's the most important part. How do I run?
🐢 xvc pipeline run, or xvcpr. It takes the name of the pipeline and runs it. It sorts the steps and checks whether there are any cycles. The steps mustn't have cycles; otherwise it's an infinite loop, and computers don't like infinite loops like turtles do. Xvc runs steps in parallel if they don't depend on each other.
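Another aside: sorting steps, rejecting cycles, and finding steps that can run together is a classic topological sort. The sketch below uses Python's standard graphlib module on the five steps of 🐇's pipeline. It shows the general technique only, not Xvc's scheduler. The dependency sets are read off the diagram earlier in this dialogue, and the function name is made up.

```python
from graphlib import CycleError, TopologicalSorter

# Each step maps to the set of steps it depends on
PIPELINE = {
    "preprocess-train": set(),
    "preprocess-test": set(),
    "train": {"preprocess-train"},
    "test": {"train", "preprocess-test"},
    "deploy": {"test"},
}


def parallel_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches whose members can run at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the steps form a cycle
    batches = []
    while sorter.is_active():
        # Every step returned here has all of its dependencies finished
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def has_cycle(graph: dict[str, set[str]]) -> bool:
    """Report whether the graph contains a dependency cycle."""
    try:
        parallel_batches(graph)
    except CycleError:
        return True
    return False
```

For `PIPELINE`, the first batch contains both preprocessing steps, followed by train, test and deploy one at a time. A graph such as `{"a": {"b"}, "b": {"a"}}` is rejected as a cycle.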
🐇 So, if I have multiple preprocessing steps that don't depend on each other, they can run in parallel?
🐢 Yeah, they run in parallel. For example, in your pipeline, preprocess-train and preprocess-test can run in parallel, because they don't depend on each other.
🐇 Cool. I want to see the pipeline we created.
🐢 You can see it with xvc pipeline dag (xvcpd). It prints a mermaid.js diagram that you can paste into your files.
🐇 Better to have an image of this, maybe.
🐢 I'll inform the developer about it. Please tell him anything you'd like to see in the tool on GitHub or via email. He's extremely introverted but tries to be a nice guy.
🐇 Ah, ok, I'll write to him about this.
Xvc for Software Development
Xvc for DVC Users
DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.
Note that this document refers mostly to Xvc v0.4 and DVC 2.30. Both commands are in development, and similarities and differences may change considerably.
Similarities
The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage the data, pipelines and experiments of an ML project.
Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC. Xvc has the same optional and recommended reliance on Git.
Both commands hash file content to detect changes in files.
Both of these use DAGs to represent pipelines.
Conceptual Differences
- What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
- What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word in a different meaning increases the mental effort to describe and understand.
- In DVC, there is a 1-1 correspondence between dvc.yaml files in a repository and the pipelines. In Xvc, pipelines are more abstract. They are defined with the xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
- DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git.
- Cache type (or rather recheck type), that is, whether a file in the repository is linked to its cached version by copying, reflink, symlink or hardlink, is determined repository-wide in DVC. You can either have all your cache links as symlinks, or hardlinks, etc. Xvc tracks these per file: you can have one file symlinked to cache, another file copied from cache, etc.
Command Differences
❗ Note that some of the Xvc commands described here are still under development.
- While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for exposition at first, as these two are analogous. However, giving the same name also hides some important details that are more difficult to emphasize later. (e.g. DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes.)
- dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
- dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can be used against any other ignore filename (.gitignore, .ignore, .fooignore, ...).
- dvc checkout is replaced by xvc file recheck. There is a --recheck-as option in several Xvc commands that tells whether to check out as symlink, hardlink, reflink or copy.
- dvc commit is replaced by xvc file carry-in.
- There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with -c options in each run. You can also supply all configuration from the environment. See Configuration.
- dvc dag is replaced by xvc pipeline dag. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art, only Graphviz representation.
- dvc data status and dvc status can be replaced by xvc file list. The Xvc version doesn't provide information about pipelines or the remotes.
- There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point.
- There is no command similar to dvc diff in Xvc.
- There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text.
- Currently, there are no commands corresponding to the dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
- dvc fetch is replaced by xvc file bring --no-recheck.
- Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.
- Instead of dvc gc to "garbage-collect" files, you can use xvc file delete with various options.
- There is no corresponding command for dvc get-url in Xvc. You can use wget or curl instead.
- Currently there is no command to replace dvc get, dvc import, and dvc import-url. URL dependencies are to be supported eventually with a different mechanism.
- Instead of dvc install like hooks, Xvc issues Git commands itself if the git.auto_commit and git.auto_stage configuration options are set.
- There is no corresponding command for dvc list-url.
- dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented but on the roadmap.
- Currently, there is no params/metrics tracking/diff similar to the dvc params, dvc metrics or dvc plots commands in Xvc.
- dvc move is replaced by xvc file move.
- dvc push is replaced by xvc file send.
- dvc pull is replaced by xvc file bring.
- There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.
- The dvc remote set of commands is replaced by the xvc storage set of commands. You can use xvc storage new for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.
- There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.
- Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run again.
- xvc root is for the same purpose as dvc root.
- dvc run (that defines a stage in a DVC pipeline and immediately runs it) can be replaced by the xvc pipeline set of commands: xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify dependencies of a step, xvc pipeline step output to specify outputs of a step and xvc pipeline run to run this pipeline.
- Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.
- There is no (need) for dvc protect or dvc unprotect commands in Xvc. "Cache type" is not a repository-wide option. If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as. If you want identical files copied to one directory and linked in another, xvc file copy can help.
- DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and invalidates the step if the URL/file has changed automatically.
- DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
Technical Differences
- DVC is written in Python. Xvc is written in Rust.
- DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
- DVC tracks file/directory changes in separate .dvc files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.
- DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (xvc-ecs) in its core.
- DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This leads to inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated.
- DVC considers directories as file-equivalent entities to track with .dvc files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
- DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
Xvc for Git Annex Users
Xvc for Git LFS Users
How-To Guides
How to Compile Xvc
Why would you compile?
- You want to use Xvc on a platform for which we don't distribute a binary.
- You want a smaller binary size by removing features that you don't use.
- You like your software compiled.
- It's easier for you to install with cargo than by other means.
- Fix a bug for yourself.
- Contribute!
Install Rust
You must have Rust installed on your system.
If you have a sensible terminal on your system:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Otherwise refer to other installation methods page.
Clone the repository
Clone the repository from Emre's GitHub repository.
$ git clone https://github.com/iesahin/xvc -b latest
The latest tag refers to the latest stable release. If you're willing to fight with compilation errors, you can also use the main branch directly.
Compile without default features
Xvc with Git Branches
When you're working with multiple branches in Git, you may ask Xvc to checkout a branch and commit to another branch.
These operations are performed at the beginning, and at the end of Xvc operations.
You can use the --from-ref and --to-branch options to check out a Git reference before an Xvc operation, and commit the results to a certain Git branch.
Checkout and commit operations sandwich Xvc operations.
graph LR
    checkout["git checkout $REF"] --> xvc
    xvc["xvc operation"] --> stash["git stash --staged"]
    stash --> branch["git checkout --branch $TO_BRANCH"]
    branch --> commit["git add .xvc && git commit"]
If --from-ref is not given, the initial git checkout is not performed.
Xvc operates in the current branch.
This is the default behavior.
$ git init --initial-branch=main
...
$ xvc init
? 0
$ ls
data.txt
$ xvc --to-branch data-file file track data.txt
Switched to a new branch 'data-file'
$ git branch
* data-file
main
$ git status -s
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
If you return to the main branch, you'll see the file is tracked by neither Git nor Xvc.
$ git checkout main
...
$ xvc file list data.txt
FX 19 [..] c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 0
$ git status -s
?? data.txt
Now, we'll add a step to the default pipeline to get an uppercase version of the data. We want this to work only in data
$ xvc --from-ref data-file pipeline step new --step-name to-uppercase --command 'cat data.txt | tr a-z A-Z > uppercase.txt'
Switched to branch 'data-file'
$ xvc pipeline step dependency --step-name to-uppercase --file data.txt
$ xvc pipeline step output --step-name to-uppercase --output-file uppercase.txt
Note that the xvc pipeline step dependency and xvc pipeline step output commands don't need the --from-ref and --to-branch options, as they already run in the data-file branch.
Now, we want to have this new version of the data available only in the uppercase branch.
$ xvc --from-ref data-file --to-branch uppercase pipeline run
Already on 'data-file'
Switched to a new branch 'uppercase'
$ git branch
data-file
main
* uppercase
You can use this for experimentation.
Whenever you want to run a pipeline and keep the results in another Git branch, give that branch with --to-branch.
$ xvc --from-ref data-file --to-branch another-uppercase pipeline run
$ git branch
* another-uppercase
uppercase
data-file
main
The pipeline always runs, because uppercase.txt is always missing in the data-file branch.
It's stored only in the resulting branch you give with --to-branch.
Command Reference
Synopsis
$ xvc --help
Xvc CLI to manage data and ML pipelines
Usage: xvc [OPTIONS] <COMMAND>
Commands:
file File and directory management commands
init Initialize an Xvc project
pipeline Pipeline management commands
storage Storage (cloud) management commands
root Find the root directory of a project
check-ignore Check whether files are ignored with `.xvcignore`
aliases Print command aliases to be sourced in shell files
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Output verbosity. Use multiple times to increase the output detail
--quiet Suppress all output
--debug Turn on all logging to $TMPDIR/xvc.log
-C <WORKDIR> Set working directory for the command. It doesn't create a new shell, or change the directory [default: .]
-c, --config <CONFIG> Configuration options set from the command line in the form section.key=value You can use multiple times
--no-system-config Ignore system configuration file
--no-user-config Ignore user configuration file
--no-project-config Ignore project configuration file (.xvc/config)
--no-local-config Ignore local (gitignored) configuration file (.xvc/config.local)
--no-env-config Ignore configuration options obtained from environment variables
--skip-git Don't run automated Git operations for this command. If you want to run git commands yourself all the time, you can set `git.auto_commit` and `git.auto_stage` options in the configuration to False
--from-ref <FROM_REF> Checkout the given Git reference (branch, tag, commit etc.) before performing the Xvc operation. This runs `git checkout <given-value>` before running the command
--to-branch <TO_BRANCH> If given, create (or checkout) the given branch before committing results of the operation. This runs `git checkout --branch <given-value>` before committing the changes
-h, --help Print help
-V, --version Print version
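The -c/--config option above takes values in the form section.key=value and can be given multiple times. As a minimal sketch of how such overrides might be parsed and applied on top of another configuration source, the Python below uses hypothetical helper names (parse_override, apply_overrides); splitting on the first dot and the "last value wins" rule are assumptions of this sketch, not a description of Xvc's implementation.

```python
def parse_override(option: str) -> tuple[str, str, str]:
    """Split a 'section.key=value' string into (section, key, value)."""
    name, sep, value = option.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {option!r}")
    # Assumption: the section is everything before the first dot.
    section, dot, key = name.partition(".")
    if not dot or not section or not key:
        raise ValueError(f"expected section.key in {option!r}")
    return section, key, value


def apply_overrides(config: dict, options: list[str]) -> dict:
    """Return a copy of config with each override applied in order."""
    merged = {section: dict(values) for section, values in config.items()}
    for option in options:
        section, key, value = parse_override(option)
        merged.setdefault(section, {})[key] = value
    return merged


# Hypothetical project configuration; the key names appear in the --skip-git help text.
project = {"git": {"auto_commit": "True", "auto_stage": "True"}}
effective = apply_overrides(
    project, ["git.auto_commit=False", "file.list.sort=name-asc"]
)
print(effective)
```

The original dictionary is left untouched, which mirrors the idea that command line options apply to a single invocation only.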
Subcommands
- file: File and directory management commands
- init: Initialize an Xvc project
- pipeline: Pipeline management commands
- storage: Storage (cloud) management commands
- root: Find the root directory of a project
- check-ignore: Check whether files are ignored with .xvcignore
- aliases: Print command aliases to be sourced in shell files
xvc init
Synopsis
$ xvc init --help
Initialize an Xvc project
Usage: xvc init [OPTIONS]
Options:
--path <PATH> Path to the directory to be intialized. (default: current directory)
--no-git Don't require Git
--force Create the repository even if already initialized. Overwrites the current .xvc directory Resets all data and guid, etc
-h, --help Print help
-V, --version Print version
Examples
To initialize a blank Xvc repository, initialize Git first and run xvc init.
$ cd my-project-1
$ git init
...
$ xvc init
? 0
The command doesn't print anything upon success.
If you want to initialize
File Management
Synopsis
$ xvc file --help
File and directory management commands
Usage: xvc file [OPTIONS] <COMMAND>
Commands:
track Add file and directories to Xvc
hash Get digest hash of files with the supported algorithms
recheck Get files from cache by copy or *link
carry-in Carry (commit) changed files to cache
copy Copy from source to another location in the workspace
list List tracked and untracked elements in the workspace
send Send (push, upload) files to external storages
bring Bring (download, pull, fetch) files from external storages
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Verbosity level. Use multiple times to increase command output detail
--quiet Suppress error messages
-C <WORKDIR> Set the working directory to run the command as if it's in that directory [default: .]
-c, --config <CONFIG> Configuration options set from the command line in the form section.key=value
--no-system-config Ignore system config file
--no-user-config Ignore user config file
--no-project-config Ignore project config (.xvc/config)
--no-local-config Ignore local config (.xvc/config.local)
--no-env-config Ignore configuration options from the environment
-h, --help Print help
-V, --version Print version
Subcommands
- track: Begin tracking (add) files with Xvc
- hash: Calculate hash of given file
- recheck: Copy/link files in the cache to the workspace (checkout)
- carry-in: Carry (commit) changed files to cache
- copy: Copy files in the workspace to another location
- list: List files tracked with Xvc
- send: Send (push) files to remote
- bring: Bring (pull) files from remote
xvc file track
Purpose
xvc file track is used to register any kind of file to Xvc for tracking versions.
Synopsis
$ xvc file track --help
Add file and directories to Xvc
Usage: xvc file track [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to track
Options:
--cache-type <CACHE_TYPE>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--no-commit
Do not copy/link added files to the file cache
--text-or-binary <TEXT_OR_BINARY>
Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)
--force
Add targets even if they are already tracked
--no-parallel
Don't use parallelism
-h, --help
Print help (see a summary with '-h')
Examples
By default, the command runs similar to git add and git commit.
$ xvc file track my-large-image.jpeg
You can track directories with the same command.
$ xvc file track my-large-directory/
You can specify more than one target in a single command.
$ xvc file track my-large-image.jpeg my-large-directory
Caveats
- This command doesn't discriminate between symbolic links and hardlinks. Links are followed and any broken links may cause errors.
- Under the hood, Xvc tracks only files, not directories. Directories are considered path collections. It doesn't matter whether you track a directory or the files in it separately.
Technical Details
- Detecting changes in files and directories employs different kinds of associated digests. If a file has a different metadata digest, its content digest is calculated. If the file's content digest has changed, the file is considered changed. A directory that contains a different set of files, or files with changed content, is considered changed.
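The two-stage check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Xvc's code: the metadata digest is assumed to cover file size and modification time, sha256 stands in for the configured algorithm, and the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def metadata_digest(path: str) -> str:
    """Digest of cheap-to-read metadata (assumed: size and modification time)."""
    st = os.stat(path)
    return hashlib.sha256(f"{st.st_size}:{st.st_mtime_ns}".encode()).hexdigest()


def content_digest(path: str) -> str:
    """Digest of the file's bytes; sha256 is a stand-in algorithm."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def has_changed(path: str, recorded: dict) -> bool:
    """Read the content only when the metadata digest differs from the record."""
    if metadata_digest(path) == recorded["metadata"]:
        return False
    return content_digest(path) != recorded["content"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("first version\n")
    record = {"metadata": metadata_digest(path), "content": content_digest(path)}
    unchanged = has_changed(path, record)

    os.utime(path, ns=(0, 0))  # new timestamp, same bytes
    touched = has_changed(path, record)

    with open(path, "w") as f:
        f.write("second version\n")
    edited = has_changed(path, record)

print(unchanged, touched, edited)
```

A file whose timestamp changed but whose bytes did not is reported as unchanged, which is the case that content-based checks handle and timestamp-only checks do not.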
xvc file list
Synopsis
$ xvc file list --help
List tracked and untracked elements in the workspace
Usage: xvc file list [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to list.
If not supplied, lists all files under the current directory.
Options:
-f, --format <FORMAT>
A string for each row of the output table
The following are the keys for each row:
- {{acd8}}: actual content digest from the workspace file. First 8 digits.
- {{acd64}}: actual content digest. All 64 digits.
- {{aft}}: actual file type. Whether the entry is a file (F), directory (D), symlink (S), hardlink (H) or reflink (R).
- {{asz}}: actual size. The size of the workspace file in bytes. It uses MB, GB and TB to represent sizes larger than 1MB.
- {{ats}}: actual timestamp. The timestamp of the workspace file.
- {{name}}: The name of the file or directory.
- {{cst}}: cache status. One of "=", ">", "<", "X", or "?" to show whether the file timestamp is the same as the cached timestamp, newer, older, not cached or not tracked.
- {{rcd8}}: recorded content digest stored in the cache. First 8 digits.
- {{rcd64}}: recorded content digest stored in the cache. All 64 digits.
- {{rct}}: recorded cache type. Whether the entry is linked to the workspace as a copy (C), symlink (S), hardlink (H) or reflink (R).
- {{rsz}}: recorded size. The size of the cached content in bytes. It uses MB, GB and TB to represent sizes larged than 1MB.
- {{rts}}: recorded timestamp. The timestamp of the cached content.
The default format can be set with file.list.format in the config file.
-s, --sort <SORT>
Sort criteria.
It can be one of none (default), name-asc, name-desc, size-asc, size-desc, ts-asc, ts-desc.
The default option can be set with file.list.sort in the config file.
--no-summary
Don't show total number and size of the listed files.
The default option can be set with file.list.no_summary in the config file.
-h, --help
Print help (see a summary with '-h')
Examples
For these examples, we'll create a directory tree with five directories, each having five files.
$ xvc-test-helper create-directory-tree --directories 5 --files 5 --fill 23
$ tree
.
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0002
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0003
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0004
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
└── dir-0005
├── file-0001.bin
├── file-0002.bin
├── file-0003.bin
├── file-0004.bin
└── file-0005.bin
[..] directories, 25 files
The xvc file list command works only in Xvc repositories. As we didn't initialize a repository yet, it lists nothing.
$ xvc file list
Let's initialize the repository.
$ git init
...
$ xvc init
Now it lists all files and directories.
$ xvc file list --sort name-asc
FX 107 [..] ce9fcf30 .gitignore
FX 130 [..] ac46bf74 .xvcignore
DX 224 [..] dir-0001
FX 1001 [..] 189fa49f dir-0001/file-0001.bin
FX 1002 [..] 8c079454 dir-0001/file-0002.bin
FX 1003 [..] 2856fe70 dir-0001/file-0003.bin
FX 1004 [..] 3640687a dir-0001/file-0004.bin
FX 1005 [..] e23e79a0 dir-0001/file-0005.bin
DX 224 [..] dir-0002
FX 1001 [..] 189fa49f dir-0002/file-0001.bin
FX 1002 [..] 8c079454 dir-0002/file-0002.bin
FX 1003 [..] 2856fe70 dir-0002/file-0003.bin
FX 1004 [..] 3640687a dir-0002/file-0004.bin
FX 1005 [..] e23e79a0 dir-0002/file-0005.bin
DX 224 [..] dir-0003
FX 1001 [..] 189fa49f dir-0003/file-0001.bin
FX 1002 [..] 8c079454 dir-0003/file-0002.bin
FX 1003 [..] 2856fe70 dir-0003/file-0003.bin
FX 1004 [..] 3640687a dir-0003/file-0004.bin
FX 1005 [..] e23e79a0 dir-0003/file-0005.bin
DX 224 [..] dir-0004
FX 1001 [..] 189fa49f dir-0004/file-0001.bin
FX 1002 [..] 8c079454 dir-0004/file-0002.bin
FX 1003 [..] 2856fe70 dir-0004/file-0003.bin
FX 1004 [..] 3640687a dir-0004/file-0004.bin
FX 1005 [..] e23e79a0 dir-0004/file-0005.bin
DX 224 [..] dir-0005
FX 1001 [..] 189fa49f dir-0005/file-0001.bin
FX 1002 [..] 8c079454 dir-0005/file-0002.bin
FX 1003 [..] 2856fe70 dir-0005/file-0003.bin
FX 1004 [..] 3640687a dir-0005/file-0004.bin
FX 1005 [..] e23e79a0 dir-0005/file-0005.bin
Total #: 32 Workspace Size: 26432 Cached Size: 0
With the default output format, the first two letters show the path type and cache type, respectively.
For example, if you track dir-0001 as copy, the first letter is F for files and D for directories. The second letter is C for files, meaning the file is a copy of the cached file, and X for directories, meaning they are not in the cache. Similar to Git, Xvc tracks only files, and directories are considered collections of files.
$ xvc file track dir-0001/
$ xvc file list dir-0001/
FC 1005 [..] e23e79a0 e23e79a0 dir-0001/file-0005.bin
FC 1004 [..] 3640687a 3640687a dir-0001/file-0004.bin
FC 1003 [..] 2856fe70 2856fe70 dir-0001/file-0003.bin
FC 1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC 1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 5 Workspace Size: 5015 Cached Size: 5015
If you add another set of files as hardlinks to the cached copies, it will print the second letter as H.
$ xvc file track dir-0002 --cache-type hardlink
$ xvc file list dir-0002
FH 1005 [..] e23e79a0 e23e79a0 dir-0002/file-0005.bin
FH 1004 [..] 3640687a 3640687a dir-0002/file-0004.bin
FH 1003 [..] 2856fe70 2856fe70 dir-0002/file-0003.bin
FH 1002 [..] 8c079454 8c079454 dir-0002/file-0002.bin
FH 1001 [..] 189fa49f 189fa49f dir-0002/file-0001.bin
Total #: 5 Workspace Size: 5015 Cached Size: 5015
Note that hardlinks are detected as F, because they are actually files that share an inode in the file system under alternative paths.
Symbolic links are typically reported as SS in the first two letters.
It means they are symbolic links on the file system and their cache type is also symbolic link.
$ xvc file track dir-0003 --cache-type symlink
$ xvc file list dir-0003
SS [..] [..] e23e79a0 dir-0003/file-0005.bin
SS [..] [..] 3640687a dir-0003/file-0004.bin
SS [..] [..] 2856fe70 dir-0003/file-0003.bin
SS [..] [..] 8c079454 dir-0003/file-0002.bin
SS [..] [..] 189fa49f dir-0003/file-0001.bin
Total #: 5 Workspace Size: 900 Cached Size: 5015
R represents reflinks, although not all file systems support them.
Globs
You may use globs to list files.
$ xvc file list 'dir-*/*-0001.bin'
FX 1001 [..] 189fa49f dir-0005/file-0001.bin
FX 1001 [..] 189fa49f dir-0004/file-0001.bin
SS [..] [..] 189fa49f dir-0003/file-0001.bin
FH 1001 [..] 189fa49f 189fa49f dir-0002/file-0001.bin
FC 1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 5 Workspace Size: 4184 Cached Size: 1001
Note that all these files are identical. They are cached once, and only one of them takes space in the cache.
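Why identical files take space only once can be illustrated with a toy content-addressed store. The class below is a hypothetical in-memory model, with sha256 as a stand-in digest and made-up file contents; it only shows the principle that content is stored under its digest, so tracking the same bytes under several paths adds one entry.

```python
import hashlib


class ContentCache:
    """Toy in-memory cache keyed by content digest."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # digest -> content
        self.paths: dict[str, str] = {}    # workspace path -> digest

    def track(self, path: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()  # stand-in digest algorithm
        self.blobs.setdefault(digest, content)        # stored once per digest
        self.paths[path] = digest
        return digest

    def cached_size(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


cache = ContentCache()
payload = bytes(1001)  # hypothetical 1001-byte content, identical in every directory
for n in range(1, 6):
    cache.track(f"dir-000{n}/file-0001.bin", payload)

print(len(cache.paths), cache.cached_size())
```

Five paths point to a single stored blob, so the cached size stays at the size of one file.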
You can also use multiple targets as globs.
$ xvc file list '*/*-0001.bin' '*/*-0002.bin'
FX 1002 [..] 8c079454 dir-0005/file-0002.bin
FX 1001 [..] 189fa49f dir-0005/file-0001.bin
FX 1002 [..] 8c079454 dir-0004/file-0002.bin
FX 1001 [..] 189fa49f dir-0004/file-0001.bin
SS [..] [..] 8c079454 dir-0003/file-0002.bin
SS [..] [..] 189fa49f dir-0003/file-0001.bin
FH 1002 [..] 8c079454 8c079454 dir-0002/file-0002.bin
FH 1001 [..] 189fa49f 189fa49f dir-0002/file-0001.bin
FC 1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC 1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 10 Workspace Size: 8372 Cached Size: 2003
Sorting
You may sort xvc file list output by name, by modification time and by file size.
Use the --sort option to specify the sort criteria.
$ xvc file list --sort name-desc dir-0001/
FC 1005 [..] e23e79a0 e23e79a0 dir-0001/file-0005.bin
FC 1004 [..] 3640687a 3640687a dir-0001/file-0004.bin
FC 1003 [..] 2856fe70 2856fe70 dir-0001/file-0003.bin
FC 1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC 1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
Total #: 5 Workspace Size: 5015 Cached Size: 5015
$ xvc file list --sort name-asc dir-0001/
FC 1001 [..] 189fa49f 189fa49f dir-0001/file-0001.bin
FC 1002 [..] 8c079454 8c079454 dir-0001/file-0002.bin
FC 1003 [..] 2856fe70 2856fe70 dir-0001/file-0003.bin
FC 1004 [..] 3640687a 3640687a dir-0001/file-0004.bin
FC 1005 [..] e23e79a0 e23e79a0 dir-0001/file-0005.bin
Total #: 5 Workspace Size: 5015 Cached Size: 5015
Column Format
You can specify the columns that the command prints.
For example, if you only want to see the file names, use {{name}} as the format string.
The following command sorts all files by their sizes in the workspace, and prints their size and name.
$ xvc file list --format '{{asz}} {{name}}' --sort size-desc dir-0001/
1005 dir-0001/file-0005.bin
1004 dir-0001/file-0004.bin
1003 dir-0001/file-0003.bin
1002 dir-0001/file-0002.bin
1001 dir-0001/file-0001.bin
Total #: 5 Workspace Size: 5015 Cached Size: 5015
If you want to compare the recorded (cached) hashes and actual hashes in the workspace, you can use the {{acd8}} {{rcd8}} {{name}} format string.
$ xvc file list --format '{{acd8}} {{rcd8}} {{name}}' --sort ts-asc dir-0001
189fa49f 189fa49f dir-0001/file-0001.bin
8c079454 8c079454 dir-0001/file-0002.bin
2856fe70 2856fe70 dir-0001/file-0003.bin
3640687a 3640687a dir-0001/file-0004.bin
e23e79a0 e23e79a0 dir-0001/file-0005.bin
Total #: 5 Workspace Size: 5015 Cached Size: 5015
If `{{acd8}}` or `{{acd64}}` is not present in the format string, Xvc doesn't calculate these hashes. If you have a large number of files and the default format (which includes actual content hashes) runs slowly, you may customize it to exclude these columns.
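The way a format string selects columns, and why leaving out the actual content digest keys can save work, can be sketched as below. The helper names are hypothetical and the substitution rule (unknown keys left untouched) is an assumption of this sketch rather than documented behavior.

```python
import re

KEY = re.compile(r"\{\{(\w+)\}\}")


def required_keys(fmt: str) -> set[str]:
    """Column keys referenced by a format string."""
    return set(KEY.findall(fmt))


def needs_content_digest(fmt: str) -> bool:
    """True only when an actual-content-digest column is requested."""
    return bool(required_keys(fmt) & {"acd8", "acd64"})


def render(fmt: str, row: dict[str, str]) -> str:
    """Substitute each {{key}} with the row's value; unknown keys are left as-is."""
    return KEY.sub(lambda m: row.get(m.group(1), m.group(0)), fmt)


row = {"asz": "1005", "name": "dir-0001/file-0005.bin"}
print(render("{{asz}} {{name}}", row))
print(needs_content_digest("{{asz}} {{name}}"))
print(needs_content_digest("{{acd8}} {{rcd8}} {{name}}"))
```

A caller can check needs_content_digest once per invocation and skip reading file contents entirely when it returns False.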
If you want to get a quick glimpse of what needs to be carried in or rechecked, you can use the cache status {{cst}} column.
$ xvc-test-helper generate-random-file --size 100 --filename dir-0001/a-new-file.bin
$ xvc file list --format '{{cst}} {{name}}' dir-0001/
= dir-0001/file-0005.bin
= dir-0001/file-0004.bin
= dir-0001/file-0003.bin
= dir-0001/file-0002.bin
= dir-0001/file-0001.bin
X dir-0001/a-new-file.bin
Total #: 6 Workspace Size: 5115 Cached Size: 5015
The cache status column shows = for files that are unchanged relative to the cache, X for untracked files, > for files with a newer version in the cache, and < for files with a newer version in the workspace. The comparison is made between the recorded timestamp and the actual timestamp with an accuracy of 1 second.
xvc file hash
Synopsis
$ xvc file hash --help
Get digest hash of files with the supported algorithms
Usage: xvc file hash [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]... Files to process
Options:
-a, --algorithm <ALGORITHM>
Algorithm to calculate the hash. One of blake3, blake2, sha2, sha3. All algorithm variants produce 32-bytes digest
--text-or-binary <TEXT_OR_BINARY>
For "text" remove line endings before calculating the digest. Keep line endings if "binary". "auto" (default) detects the type by checking 0s in the first 8Kbytes, similar to Git [default: auto]
-h, --help
Print help
-V, --version
Print version
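The options above can be illustrated with a small Python sketch. Several details are assumptions: the sniffed prefix is taken as 8 * 1024 bytes, "remove line endings" is read as dropping CR and LF bytes, blake2 is mapped to the blake2s variant, and blake3 is left out because it is not in Python's standard library. The function names are hypothetical.

```python
import hashlib

SNIFF_BYTES = 8 * 1024  # "first 8Kbytes" in the help text; exact count assumed


def is_binary(data: bytes) -> bool:
    """Treat content as binary if a NUL byte appears in the sniffed prefix."""
    return b"\x00" in data[:SNIFF_BYTES]


def digest(data: bytes, algorithm: str = "sha2", mode: str = "auto") -> bytes:
    """Return a 32-byte digest of data for the given algorithm and mode."""
    if mode == "auto":
        mode = "binary" if is_binary(data) else "text"
    if mode == "text":
        # One reading of "remove line endings": drop CR and LF bytes.
        data = data.replace(b"\r", b"").replace(b"\n", b"")
    hashers = {
        "sha2": hashlib.sha256,
        "sha3": hashlib.sha3_256,
        "blake2": hashlib.blake2s,  # assumed variant; 32-byte output by default
    }
    return hashers[algorithm](data).digest()


unix = digest(b"a\nb\n")
windows = digest(b"a\r\nb\r\n")
print(unix == windows)
print([len(digest(b"x", name)) for name in ("sha2", "sha3", "blake2")])
```

In text mode, the same text saved with Unix or Windows line endings yields the same digest, while binary mode keeps the two distinct.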
xvc file checkout
This is an alias of xvc file recheck.
Please see that page for examples.
Synopsis
$ xvc file checkout --help
Get files from cache by copy or *link
Usage: xvc file recheck [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to recheck
Options:
--cache-type <CACHE_TYPE>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--no-parallel
Don't use parallelism
--force
Force even if target exists
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
xvc file recheck
Synopsis
$ xvc file recheck --help
Get files from cache by copy or *link
Usage: xvc file recheck [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to recheck
Options:
--cache-type <CACHE_TYPE>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--no-parallel
Don't use parallelism
--force
Force even if target exists
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
This command has an alias, xvc file checkout, if you feel more at home with Git terminology.
Examples
Rechecking is analogous to git checkout. It copies or links a cached file to the workspace.
Start by tracking a file.
$ git init
...
$ xvc init
$ xvc file track data.txt
$ ls -l
total[..]
-rw-rw-rw- [..] data.txt
Once you added the file to the cache, you can delete the workspace copy.
$ rm data.txt
$ ls -l
total[..]
Then, recheck the file. By default, it makes a copy of the file.
$ xvc file recheck data.txt
$ ls -l
total [..]
-rw-rw-rw- [..] data.txt
Xvc updates the cache type if the file is not changed.
$ xvc file recheck data.txt --as symlink
$ ls -l data.txt
l[..] data.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
Symlinks and hardlinks are read-only.
You can delete the symlink and replace it with an updated copy, as perl -i does below.
$ perl -i -pe 's/a/ee/g' data.txt
$ xvc file recheck data.txt --as copy
[ERROR] data.txt has changed on disk. Either carry in, force, or delete the target to recheck.
$ rm data.txt
$ xvc -vv file recheck data.txt --as hardlink
[INFO] [HARDLINK] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt -> [CWD]/data.txt
$ ls -l
total[..]
-r--r--r-- [..] data.txt
Note that, as files in the cache are kept read-only, hardlinks and symlinks are also read-only. Files rechecked as copy are made read-write explicitly.
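The difference between the recheck methods can be sketched with Python's standard library. This is an illustration only: the recheck function, the file names and the permission values are hypothetical, and it assumes a POSIX-like file system where symlinks and hardlinks are available.

```python
import os
import shutil
import stat
import tempfile


def recheck(cache_path: str, target: str, method: str = "copy") -> None:
    """Materialize a cached file at target by copy, symlink, or hardlink."""
    if os.path.lexists(target):
        raise FileExistsError(f"{target} exists; delete it first")
    if method == "copy":
        shutil.copyfile(cache_path, target)
        os.chmod(target, 0o644)  # copies are made writable explicitly
    elif method == "symlink":
        os.symlink(os.path.abspath(cache_path), target)
    elif method == "hardlink":
        os.link(cache_path, target)
    else:
        raise ValueError(f"unknown method: {method}")


def is_writable(path: str) -> bool:
    """Check the owner-write bit of the file the path resolves to."""
    return bool(os.stat(path).st_mode & stat.S_IWUSR)


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "0.txt")
    with open(cached, "w") as f:
        f.write("cached content\n")
    os.chmod(cached, 0o444)  # cache entries are kept read-only

    results = {}
    for method in ("copy", "symlink", "hardlink"):
        target = os.path.join(tmp, f"data-{method}.txt")
        recheck(cached, target, method)
        results[method] = is_writable(target)
    os.chmod(cached, 0o644)  # restore before the temporary directory is removed

print(results)
```

The symlink and the hardlink both resolve to the read-only cache entry, so only the copy reports as writable.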
Reflinks are supported by Xvc, but the underlying file system should also support them.
Otherwise it uses copy.
$ rm -f data.txt
$ xvc file recheck data.txt --as reflink
The above command will create a read only link in macOS APFS and a copy in ext4 or NTFS file systems.
xvc file carry-in
Copies changed files to the cache.
Synopsis
$ xvc file carry-in --help
Carry (commit) changed files to cache
Usage: xvc file carry-in [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to add
Options:
--text-or-binary <TEXT_OR_BINARY>
Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)
--force
Carry in targets even their content digests are not changed.
This removes the file in cache and re-adds it.
--no-parallel
Don't use parallelism
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
The carry-in command works in Xvc repositories.
$ git init
...
$ xvc init
We first track a file.
$ xvc file track data.txt
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
We update the file with a command.
$ perl -i -pe 's/a/ee/g' data.txt
$ cat data.txt
Oh, deetee, my, deetee
$ xvc file list data.txt
FC 23 [..] c85f3e81 e37c686a data.txt
Total #: 1 Workspace Size: 23 Cached Size: 19
Note that the size of the file has increased, as we replace each a with an ee.
$ xvc file carry-in data.txt
$ xvc file list data.txt
FC 23 [..] e37c686a e37c686a data.txt
Total #: 1 Workspace Size: 23 Cached Size: 19
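The size change from 19 to 23 bytes in the listings above can be checked with a few lines of Python. The original file content is an assumption here, inferred by reversing the substitution from the cat output; sha256 stands in for the digest algorithm, so the digests will not match the ones printed by Xvc.

```python
import hashlib

# Assumed original content, inferred by reversing the substitution shown above.
original = "Oh, data, my, data\n"
updated = original.replace("a", "ee")  # same effect as perl's s/a/ee/g

print(len(original.encode()), len(updated.encode()))
print(updated, end="")

# A changed file gets a different content digest (sha256 as a stand-in).
before = hashlib.sha256(original.encode()).hexdigest()
after = hashlib.sha256(updated.encode()).hexdigest()
print(before != after)
```

Each of the four a characters becomes two characters, which accounts for the four extra bytes.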
xvc file send
Synopsis
$ xvc file send --help
Send (push, upload) files to external storages
Usage: xvc file send [OPTIONS] --remote <REMOTE> [TARGETS]...
Arguments:
[TARGETS]... Targets to send/push/upload to storage
Options:
-r, --remote <REMOTE> Storage name or guid to send the files
--force Force even if the files are already present in the storage
-h, --help Print help
xvc file bring
Synopsis
$ xvc file bring --help
Bring (download, pull, fetch) files from external storages
Usage: xvc file bring [OPTIONS] --storage <STORAGE> [TARGETS]...
Arguments:
[TARGETS]...
Targets to bring from the storage
Options:
-s, --storage <STORAGE>
Storage name or guid to send the files
--force
Force even if the files are already present in the workspace
--no-recheck
Don't recheck (checkout) after bringing the file to cache.
This makes the command similar to `git fetch` in Git. It just updates the cache, and doesn't copy/link the file to workspace.
--recheck-as <RECHECK_AS>
Recheck (checkout) the file in one of the four alternative ways. (See `xvc file recheck`) and [CacheType]
-h, --help
Print help (see a summary with '-h')
This is yet to be implemented. Please see https://github.com/iesahin/xvc/issues/177 for progress.
xvc file copy
Synopsis
$ xvc file copy --help
Copy from source to another location in the workspace
Usage: xvc file copy [OPTIONS] <SOURCE> <DESTINATION>
Arguments:
<SOURCE>
Source file, glob or directory within the workspace.
If the source ends with a slash, it's considered a directory and all files in that directory are copied.
If the number of source files is more than one, the destination must be a directory.
<DESTINATION>
Location we copy file(s) to within the workspace.
If the target ends with a slash, it's considered a directory and created if it doesn't exist.
If the number of source files is more than one, the destination must be a directory.
Options:
--cache-type <CACHE_TYPE>
How the targets should be rechecked: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--force
Force even if target exists
--no-recheck
Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
This command is used to copy a set of files to another location in the workspace.
By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.
xvc file copy works only with tracked files.
$ git init
...
$ xvc init
$ xvc file track data.txt
$ ls -l
total[..]
-rw-rw-rw- [..] data.txt
Once you add the file to the cache, you can copy the file to another location.
$ xvc file copy data.txt data2.txt
$ ls
data.txt
data2.txt
Note that multiple copies of the same content don't add to the cache size.
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ xvc file list 'data*'
FC 19 [..] c85f3e81 c85f3e81 data2.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 2 Workspace Size: 38 Cached Size: 19
Xvc can change the destination file's recheck method.
$ xvc file copy data.txt data3.txt --as symlink
$ ls -l
total[..]
-rw-rw-rw- 1 [..] data.txt
-rw-rw-rw- 1 [..] data2.txt
lrwxr-xr-x 1 [..] data3.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can create views of your data by copying it to another location.
$ xvc file copy 'd*' another-set/ --as hardlink
$ xvc file list another-set/
FH 19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
Total #: 3 Workspace Size: 57 Cached Size: 19
If the targets you specify are changed, copy operation is cancelled. Please either recheck old versions or carry in new versions.
$ perl -i -pe 's/a/ee/g' data.txt
$ xvc file copy data.txt data5.txt
You can copy files without them being in the workspace if they are in the cache.
$ rm -f data.txt
$ xvc file copy data.txt data6.txt
$ ls -l data6.txt
-rw-rw-rw- [..] data6.txt
You can also skip rechecking.
In this case, Xvc won't create any copies in the workspace, and you don't need them to be available in the cache.
They will be listed by the xvc file list command.
$ xvc file copy data.txt data7.txt --no-recheck
$ ls
another-set
data2.txt
data3.txt
data5.txt
data6.txt
$ xvc file list
XC [..] c85f3e81 data7.txt
FC 19 [..] c85f3e81 c85f3e81 data6.txt
FC 19 [..] c85f3e81 c85f3e81 data5.txt
SS [..] [..] c85f3e81 data3.txt
FC 19 [..] c85f3e81 c85f3e81 data2.txt
XC [..] c85f3e81 data.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
DX 160 [..] another-set
FX 130 [..] ac46bf74 .xvcignore
FX 619 [..] [..] .gitignore
Total #: 12 Workspace Size: 1203 Cached Size: 19
Later, you can recheck them to the workspace.
$ xvc file recheck data7.txt
$ ls -l data7.txt
-rw-rw-rw- [..] data7.txt
Data-Model Pipelines
Synopsis
$ xvc pipeline --help
Pipeline management commands
Usage: xvc pipeline [OPTIONS] <COMMAND>
Commands:
new Create a new pipeline
update Rename, change dir or set a pipeline as default
delete Delete a pipeline
run Run a pipeline
list List all pipelines
dag Generate a dot or mermaid diagram for the pipeline
export Export the pipeline to a YAML or JSON file to edit
import Import the pipeline from a file
step Step creation, dependency, output commands
help Print this message or the help of the given subcommand(s)
Options:
-n, --name <NAME> Name of the pipeline this command applies
-h, --help Print help
xvc pipeline new
Synopsis
$ xvc pipeline new --help
Create a new pipeline
Usage: xvc pipeline new [OPTIONS] --name <NAME>
Options:
-n, --name <NAME> Name of the pipeline this command applies to
-w, --workdir <WORKDIR> Default working directory
--set-default Set this pipeline as default
-h, --help Print help
xvc pipeline list
Synopsis
$ xvc pipeline list --help
List all pipelines
Usage: xvc pipeline list
Options:
-h, --help Print help
xvc pipeline step
Synopsis
$ xvc pipeline step --help
Step creation, dependency, output commands
Usage: xvc pipeline step <COMMAND>
Commands:
new Add a new step
update Update step options
dependency Add a dependency to a step
output Add an output to a step
show Print step configuration
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
xvc pipeline step new
Purpose
Create a new step in the pipeline.
Synopsis
$ xvc pipeline step new --help
Add a new step
Usage: xvc pipeline step new [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the new step
-c, --command <COMMAND> Step command to run
--when <WHEN> When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
-h, --help Print help
Examples
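As a minimal sketch of how the --when values in the synopsis could gate a step, the hypothetical function below maps each value to a run decision. It follows the help text only (always, never, by_dependencies) and is not taken from Xvc's source.

```python
def should_run(when: str, dependencies_changed: bool) -> bool:
    """Decide whether a step's command runs for a given --when value."""
    if when == "always":
        return True   # invalidate the step manually
    if when == "never":
        return False  # freeze the step
    if when == "by_dependencies":
        return dependencies_changed
    raise ValueError(f"unknown --when value: {when}")


for when in ("always", "never", "by_dependencies"):
    print(when, should_run(when, False), should_run(when, True))
```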
Caveats
xvc pipeline step dependency
Purpose
Define a dependency to an existing step in the pipeline.
Synopsis
$ xvc pipeline step dependency --help
Add a dependency to a step
Usage: xvc pipeline step dependency [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to add the dependency to
--file <FILES> Add a file dependency to the step. Can be used multiple times
--step <STEPS> Add a step dependency to a step. Can be used multiple times. Steps are referred with their names
--pipeline <PIPELINES> Add a pipeline dependency to a step. Can be used multiple times. Pipelines are referred with their names
--directory <DIRECTORIES> Add a directory dependency to the step. Can be used multiple times
--glob <GLOBS> Add a glob dependency to the step. Can be used multiple times
--param <PARAMS> Add a parameter dependency to the step in the form filename.yaml::model.units . Can be used multiple times
--regex <REGEXPS> Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times
--line <LINES> Add a line dependency in the form filename.txt::123-234
-h, --help Print help
Examples
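The idea behind line and regex dependencies, that a step reacts only to the selected part of a file, can be sketched as follows. The helper names are hypothetical, 1-based inclusive line ranges are an assumption, and sha256 stands in for the digest algorithm.

```python
import hashlib
import re


def select_lines(text: str, first: int, last: int) -> list[str]:
    """Lines first..last; 1-based and inclusive are assumptions of this sketch."""
    return text.splitlines()[first - 1:last]


def select_regex(text: str, pattern: str) -> list[str]:
    """Lines where the pattern matches."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]


def digest_of(lines: list[str]) -> str:
    """Digest of the selected lines only."""
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


v1 = "lr = 0.1\nepochs = 10\n# a comment\n"
v2 = "lr = 0.1\nepochs = 10\n# an edited comment\n"

# A dependency on lines 1-2 is unaffected by an edit to line 3.
lines_same = digest_of(select_lines(v1, 1, 2)) == digest_of(select_lines(v2, 1, 2))
# A dependency on lines starting with '#' sees the edit.
regex_same = digest_of(select_regex(v1, r"^#")) == digest_of(select_regex(v2, r"^#"))
print(lines_same, regex_same)
```

Only the dependency whose selection includes the edited line would invalidate its step.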
Caveats
xvc pipeline step output
Purpose
Define an output (file, metrics or plots) to an already existing step in the pipeline.
Synopsis
$ xvc pipeline step output --help
Add an output to a step
Usage: xvc pipeline step output [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to add the output to
--output-file <FILES> Add a file output to the step. Can be used multiple times
--output-metric <METRICS> Add a metric output to the step. Can be used multiple times
--output-image <IMAGES> Add an image output to the step. Can be used multiple times
-h, --help Print help
Examples
Caveats
xvc pipeline step show
Purpose
Print the steps of a pipeline.
Synopsis
$ xvc pipeline step show --help
Print step configuration
Usage: xvc pipeline step show --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to show
-h, --help Print help
Examples
Caveats
xvc pipeline step update
Purpose
Update the name, running condition, or command of a step.
Synopsis
$ xvc pipeline step update --help
Update step options
Usage: xvc pipeline step update [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to update. The step should already be defined
-c, --command <COMMAND> Step command to run
--when <WHEN> When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
-h, --help Print help
Examples
Caveats
xvc pipeline run
Synopsis
$ xvc pipeline run --help
Run a pipeline
Usage: xvc pipeline run [OPTIONS]
Options:
-n, --name <NAME> Name of the pipeline to run
-h, --help Print help
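Because steps can depend on other steps, running a pipeline implies ordering them so that each step comes after its dependencies. The sketch below shows one way to obtain such an order with Python's graphlib; the step names are hypothetical and this is not a description of how Xvc schedules steps.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each maps to the set of steps it depends on.
steps = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"evaluate", "preprocess"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(steps).static_order())
print(order)
```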
xvc pipeline delete
Synopsis
$ xvc pipeline delete --help
Delete a pipeline
Usage: xvc pipeline delete --name <NAME>
Options:
-n, --name <NAME> Name or GUID of the pipeline to be deleted
-h, --help Print help
xvc pipeline export
Synopsis
$ xvc pipeline export --help
Export the pipeline to a YAML or JSON file to edit
Usage: xvc pipeline export [OPTIONS]
Options:
-n, --name <NAME> Name of the pipeline to export
--file <FILE> File to write the pipeline. Writes to stdout if not set
--format <FORMAT> Output format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
-h, --help Print help
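The --format description implies a small fallback chain: an explicit format wins, then the file extension, then json. A minimal Python sketch of that chain follows. The function name guess_format is hypothetical, and two details are assumptions rather than documented behavior: that .yml counts as YAML, and that an unrecognized extension falls back to json.

```python
from pathlib import Path
from typing import Optional


def guess_format(file: Optional[str] = None, fmt: Optional[str] = None) -> str:
    """Return 'json' or 'yaml' following the fallback chain in the help text."""
    if fmt is not None:
        fmt = fmt.lower()
        if fmt not in ("json", "yaml"):
            raise ValueError(f"unsupported format: {fmt!r}")
        return fmt
    if file is not None:
        extension = Path(file).suffix.lower()
        if extension in (".yaml", ".yml"):  # .yml is an assumption
            return "yaml"
    # No file, no extension, or an unrecognized one: use the default.
    return "json"
```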
xvc pipeline import
Synopsis
$ xvc pipeline import --help
Import the pipeline from a file
Usage: xvc pipeline import [OPTIONS]
Options:
-n, --name <NAME> Name of the pipeline to import. If not set, the name from the file is used
--file <FILE> File to read the pipeline. Use stdin if not specified
--format <FORMAT> Input format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
--overwrite Overwrite the pipeline even if the name already exists
-h, --help Print help
xvc pipeline update
Synopsis
$ xvc pipeline update --help
Rename, change dir or set a pipeline as default
Usage: xvc pipeline update [OPTIONS]
Options:
-n, --name <NAME> Name of the pipeline this command applies to
--rename <RENAME> Rename the pipeline to
--workdir <WORKDIR> Set the working directory
--set-default set this pipeline default
-h, --help Print help
xvc pipeline dag
Synopsis
$ xvc pipeline dag --help
Generate a dot or mermaid diagram for the pipeline
Usage: xvc pipeline dag [OPTIONS]
Options:
-n, --name <NAME> Name of the pipeline to generate the diagram
--file <FILE> Output file. Writes to stdout if not set
--format <FORMAT> Format for graph. Either dot or mermaid [default: dot]
-h, --help Print help
Storage management commands (xvc storage)
Purpose
Xvc can keep tracked content in storages.
These can be on the local file system or in the cloud.
The xvc storage set of commands lets you configure, list, and delete these storages.
Synopsis
$ xvc storage --help
Storage (cloud) management commands
Usage: xvc storage <COMMAND>
Commands:
list List all configured storages
remove Remove a storage configuration
new Configure a new storage
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
xvc storage list
Purpose
List all configured storages with their names and GUIDs.
Synopsis
$ xvc storage list --help
List all configured storages
Usage: xvc storage list
Options:
-h, --help Print help
Examples
List all storages in the repository:
$ xvc storage list
Caveats
This command reads only the local configuration and doesn't try to connect to the storages. A storage that appears in the list is not guaranteed to be reachable for pull or push operations.
xvc storage remove
Purpose
Remove unused or inaccessible storages from the configuration.
Synopsis
$ xvc storage remove --help
Remove a storage configuration.
This doesn't delete any files in the storage.
Usage: xvc storage remove --name <NAME>
Options:
--name <NAME>
Name of the storage to be deleted
-h, --help
Print help (see a summary with '-h')
Caveats
xvc storage new
Synopsis
$ xvc storage new --help
Configure a new storage
Usage: xvc storage new <COMMAND>
Commands:
local Add a new local storage
generic Add a new generic storage
rsync Add a new rsync storages
s3 Add a new S3 storage
minio Add a new Minio storage
digital-ocean Add a new Digital Ocean storage
r2 Add a new R2 storage
gcs Add a new Google Cloud Storage storage
wasabi Add a new Wasabi storage
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
xvc storage new local
Purpose
Create a new storage reachable from the local filesystem. It lets you keep tracked file contents in a different directory for backup or sharing purposes.
Synopsis
$ xvc storage new local --help
Add a new local storage
A local storage is a directory accessible from the local file system. Xvc will use common file operations for this directory without accessing the network.
Usage: xvc storage new local --path <PATH> --name <NAME>
Options:
--path <PATH>
Directory (outside the repository) to be set as a storage
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
-h, --help
Print help (see a summary with '-h')
Examples
Create a new Xvc backup storage on a path
$ xvc storage new local --name backup --path /media/bigdisk/backups/my-project-xvc
Caveats
--name NAME is not checked for uniqueness, but you should use unique storage names so that you can refer to them later.
--path PATH should be writable and shouldn't already exist.
Technical Details
The command creates PATH and a new file under it called .xvc-guid.
The file contains the unique identifier for this storage.
The same identifier is also recorded in the project.
A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}.
{{REPO_ID}} is the unique identifier for the repository created during xvc init.
Hence if you use a common storage for different Xvc projects, their files are kept under different directories.
There is no inter-project deduplication.
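The mapping from a cache path to a storage path can be sketched in a few lines of Python. This is an illustration of the layout described above, not code from Xvc; the function name storage_path and the sample repository identifiers are hypothetical.

```python
from pathlib import PurePosixPath


def storage_path(storage_root: str, repo_id: str, cache_path: str) -> PurePosixPath:
    """Map '.xvc/<hash prefix>/<cache path>' to its place in a local storage."""
    parts = PurePosixPath(cache_path).parts
    if not parts or parts[0] != ".xvc":
        raise ValueError(f"not a cache path: {cache_path!r}")
    # Drop the leading '.xvc' and put the rest under PATH/REPO_ID.
    return PurePosixPath(storage_root, repo_id, *parts[1:])


# The same content tracked in two projects lands in two directories,
# which is why there is no inter-project deduplication.
first = storage_path("/media/bigdisk/backups", "repo-a", ".xvc/b3/abc/def/0123/0.png")
second = storage_path("/media/bigdisk/backups", "repo-b", ".xvc/b3/abc/def/0123/0.png")
```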
xvc storage new generic
Purpose
Create a new storage that uses shell commands to send and retrieve cache files. It lets you keep tracked files in any kind of service that can be used from the command line.
Synopsis
$ xvc storage new generic --help
Add a new generic storage.
⚠️ Please note that this is an advanced method to configure storages. You may damage your repository and local and remote files with incorrect configurations.
Please see https://docs.xvc.dev/ref/xvc-storage-new-generic.html for examples and make necessary backups before continuing.
Usage: xvc storage new generic [OPTIONS] --name <NAME> --init <INIT_COMMAND> --list <LIST_COMMAND> --download <DOWNLOAD_COMMAND> --upload <UPLOAD_COMMAND> --delete <DELETE_COMMAND>
Options:
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
-i, --init <INIT_COMMAND>
Command to initialize the storage. This command is run once after defining the storage.
You can use {URL} and {STORAGE_DIR} as shortcuts.
-l, --list <LIST_COMMAND>
Command to list the files in storage
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-d, --download <DOWNLOAD_COMMAND>
Command to download a file from storage.
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-u, --upload <UPLOAD_COMMAND>
Command to upload a file to storage.
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-D, --delete <DELETE_COMMAND>
The delete command to remove a file from storage You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options
-M, --processes <MAX_PROCESSES>
Number of maximum processes to run simultaneously
[default: 1]
--url <URL>
You can set a string to replace {URL} placeholder in commands
--storage-dir <STORAGE_DIR>
You can set a string to replace {STORAGE_DIR} placeholder in commands
-h, --help
Print help (see a summary with '-h')
You can use the following placeholders in your commands. They are replaced with actual values at runtime, and the commands are run with concrete paths.
- {URL}: The content of the --url option. (default "")
- {STORAGE_DIR}: The content of the --storage-dir option. (default "")
- {RELATIVE_CACHE_PATH}: The portion of the cache path after .xvc/.
- {ABSOLUTE_CACHE_PATH}: The absolute local path for the cache element.
- {RELATIVE_CACHE_DIR}: The portion of the directory that contains the file after .xvc/.
- {ABSOLUTE_CACHE_DIR}: The absolute local path of the directory that contains the file.
- {XVC_GUID}: The repository GUID, used in storages to tell the elements of different repositories apart.
- {FULL_STORAGE_PATH}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_PATH}
- {FULL_STORAGE_DIR}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_DIR}
- {LOCAL_GUID_FILE_PATH}: The local path of the file that contains the GUID of the storage. Used only in the --init option.
- {STORAGE_GUID_FILE_PATH}: The path in the storage where the GUID file should be placed. Used only in the --init option.
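To make the substitution concrete, here is a minimal Python sketch that derives the path placeholders and expands them in a command template. It is an illustration under stated assumptions, not Xvc's code: the function names placeholder_values and expand are hypothetical, the two GUID file placeholders are left out, and the concatenations follow the definitions above literally, so the sample --storage-dir value ends with a slash.

```python
import posixpath
import re
from typing import Dict


def placeholder_values(url: str, storage_dir: str, xvc_guid: str,
                       project_root: str, relative_cache_path: str) -> Dict[str, str]:
    """Build the placeholder table for one cache element."""
    relative_cache_dir = posixpath.dirname(relative_cache_path)
    return {
        "URL": url,
        "STORAGE_DIR": storage_dir,
        "XVC_GUID": xvc_guid,
        "RELATIVE_CACHE_PATH": relative_cache_path,
        "RELATIVE_CACHE_DIR": relative_cache_dir,
        "ABSOLUTE_CACHE_PATH": posixpath.join(project_root, ".xvc", relative_cache_path),
        "ABSOLUTE_CACHE_DIR": posixpath.join(project_root, ".xvc", relative_cache_dir),
        # Literal concatenations, as in the placeholder definitions.
        "FULL_STORAGE_PATH": f"{url}{storage_dir}{xvc_guid}/{relative_cache_path}",
        "FULL_STORAGE_DIR": f"{url}{storage_dir}{xvc_guid}/{relative_cache_dir}",
    }


def expand(command: str, values: Dict[str, str]) -> str:
    """Replace known {NAME} placeholders; leave unknown ones untouched."""
    return re.sub(r"\{([A-Z_]+)\}",
                  lambda m: values.get(m.group(1), m.group(0)),
                  command)


values = placeholder_values(
    url="",
    storage_dir="/home/user/my-xvc-storage/",
    xvc_guid="repo-guid",  # hypothetical repository GUID
    project_root="/home/user/project",
    relative_cache_path="b3/abc/def/0123/0.png",
)
upload = expand(
    "mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}",
    values,
)
```

Printing upload shows the concrete shell command that such a template would turn into for this one file.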
Examples
Create a generic storage in the same filesystem
You can create a storage that uses shell commands to send and receive files to another location in the file system.
Two options, --url and --storage-dir, set values that you can use in the commands.
For a storage in the same file system, --url can be blank and --storage-dir can be the location you want to use.
$ xvc storage new generic
--url ""
--storage-dir $HOME/my-xvc-storage
...
You need to specify the commands for the following operations:
init: The command that creates the directory to be used as a storage. It should also copy XVC_STORAGE_GUID_FILENAME (currently .xvc-guid) to that location. This file identifies the location as an Xvc storage.
$ xvc storage new generic
...
--init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}'
...
Note that if the command doesn't contain the {LOCAL_GUID_FILE_PATH} and {STORAGE_GUID_FILE_PATH} variables, it won't be run and Xvc will report an error.
list: This operation should list all files under {URL}{STORAGE_DIR}. The list is filtered through a regex that matches the format of the paths. Hence, even if the command lists all files in the storage, Xvc considers only the relevant paths.
All paths should be listed on separate lines.
$ xvc storage new generic
...
--list 'ls -1 {URL}{STORAGE_DIR}'
...
upload: The command that copies a file from the local cache to the storage. Normally, it uses the {ABSOLUTE_CACHE_PATH} variable. For the local file system, we also need to create a directory before copying.
$ xvc storage new generic
...
--upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}'
...
download: The command that copies a file from the storage to the local cache. It must create the local cache directory as well.
$ xvc storage new generic
...
--download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}'
...
delete: The command that deletes a file from the storage. It shouldn't touch the local file in any way, otherwise you may lose data.
$ xvc storage new generic
...
--delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
...
In total, the command you write is the following. It defines all operations of this storage. The required --name option is included with an example value.
$ xvc storage new generic \
--name my-generic-storage \
--url "" \
--storage-dir $HOME/my-xvc-storage \
--init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}' \
--list 'ls -1 {URL}{STORAGE_DIR}' \
--upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}' \
--download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}' \
--delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
Create a storage using rclone
Create a storage using rsync
Rsync is available on all popular platforms for copying file contents. Xvc can use it to maintain a storage if you already have a working rsync setup.
We need to define the init, upload, download, list and delete operations with rsync or ssh.
Some of the commands need ssh to perform operations such as creating a directory.
We'll use placeholders for paths.
As the rsync URL format is slightly different from the SSH one, we will define the commands verbosely.
Suppose you want to use your account at user@example.com to store your Xvc files.
You want to store the files under /home/user/my-xvc-storage.
We assume you have configured public key authentication for your account. Xvc doesn't accept user input during storage operations and can't ask for your password while running.
We first define these as our --url and --storage-dir options.
$ xvc storage new generic --url user@example.com
--storage-dir '/home/user/my-xvc-storage'
...
The initialization command must create this directory and copy the storage GUID file to its respective location.
$ xvc storage new generic
...
--init "ssh {URL} 'mkdir -p {STORAGE_DIR}' ; rsync -av '{LOCAL_GUID_FILE_PATH}' '{URL}:{STORAGE_GUID_FILE_PATH}'"
Note the use of : in the rsync command.
As rsync doesn't currently support ssh:// URLs, we use a URL form that's compatible with both ssh and rsync.
It may be possible to use && between the ssh and rsync commands, but we still want to copy the GUID file if the first command fails.
Caveats
Technical Details
The paths returned by list commands are filtered through a regex.
They are matched against the {REPO_GUID}/{RELATIVE_CACHE_DIR}/0 pattern and only the {RELATIVE_CACHE_DIR} portion is reported.
Any line that doesn't conform to this pattern is ignored.
You can use any listing command that returns a recursive file list, and only the elements that match the pattern are considered.
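A minimal Python sketch of such a filter is below. The regular expression is an assumption built from the cache layout described in the Xvc Cache chapter (a 2-character digest prefix followed by 3, 3 and 58 hexadecimal digits); it is not the expression Xvc uses. The names CACHE_LINE and relevant_cache_dirs, and the sample GUIDs, are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed shape: <guid>/<prefix>/<3 hex>/<3 hex>/<58 hex>/0[.extension]
CACHE_LINE = re.compile(
    r"(?:^|/)(?P<guid>[^/\s]+)/"
    r"(?P<cache_dir>[a-z0-9]{2}/[0-9a-f]{3}/[0-9a-f]{3}/[0-9a-f]{58})/"
    r"0(?:\.[^/\s]*)?$"
)


def relevant_cache_dirs(listing: Iterable[str], repo_guid: str) -> List[str]:
    """Keep the cache directories of one repository; ignore every other line."""
    result = []
    for line in listing:
        match = CACHE_LINE.search(line.strip())
        if match and match.group("guid") == repo_guid:
            result.append(match.group("cache_dir"))
    return result


sample_dir = "b3/a12/b45/" + "d" * 58
listing = [
    f"repo-guid/{sample_dir}/0.png",   # belongs to this repository
    f"other-guid/{sample_dir}/0.png",  # another repository in the same storage
    ".xvc-guid",                       # storage GUID file
    "README.md",                       # unrelated file
]
found = relevant_cache_dirs(listing, "repo-guid")
```

Because non-matching lines are dropped, a listing command can safely print more than the cache files.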
xvc storage new s3
Purpose
Configure an S3 (or a compatible) service as an Xvc storage.
Synopsis
$ xvc storage new s3 --help
Add a new S3 storage
Usage: xvc storage new s3 [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
--bucket-name <BUCKET_NAME>
S3 bucket name
--region <REGION>
AWS region
-h, --help
Print help (see a summary with '-h')
Examples
xvc storage new gcs
Purpose
Configure a Google Cloud Storage service as an Xvc storage.
Synopsis
$ xvc storage new gcs --help
Add a new Google Cloud Storage storage
Usage: xvc storage new gcs [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server, e.g., europe-west3
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
xvc storage new minio
Purpose
Create a new Xvc storage on a MinIO instance. It lets you store tracked file contents on a Minio server.
Synopsis
$ xvc storage new minio --help
Add a new Minio storage
Usage: xvc storage new minio [OPTIONS] --name <NAME> --endpoint <ENDPOINT> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--endpoint <ENDPOINT>
Minio server url in the form https://myserver.example.com:9090
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Credentials
Xvc doesn't store any credentials.
Xvc gets server credentials from two environment variables: XVC_STORAGE_ACCESS_KEY_ID and XVC_STORAGE_SECRET_KEY.
You must supply the credentials in these two environment variables before running any command that connects to the storage.
These environment variables can contain the user name and password for the Minio server. If you have created service accounts, you can also use their access and secret keys.
$ export XVC_STORAGE_ACCESS_KEY_ID=myname
$ export XVC_STORAGE_SECRET_KEY=mypassword
$ xvc storage new minio --name minio-storage --endpoint 'http://example.com:9001' --bucket-name xvc-bucket --region us-east-1 --storage-prefix my-project
Examples
You can create a new Minio storage by supplying the credentials and required parameters.
$ export XVC_STORAGE_ACCESS_KEY_ID=myname
$ export XVC_STORAGE_SECRET_KEY=mypassword
$ xvc storage new minio --name minio-storage --endpoint 'http://example.com:9001' --bucket-name xvc-bucket --region us-east-1 --storage-prefix my-project
After defining the storage, you can push, fetch, and pull files with the xvc file push and xvc file pull commands.
Caveats
--name NAME is not verified to be unique, but you should use unique storage names so that you can refer to them later.
You can also use the storage GUIDs listed by xvc storage list to refer to storages.
You must have a valid connection to the server.
Xvc uses the Minio API port (9001, by default) to connect to the server. Ensure that it's accessible.
Because of the underlying library, Xvc tries to connect to http://xvc-bucket.example.com:9001 if you give http://example.com:9001 as the endpoint and xvc-bucket as the bucket name.
You may need to take this into account when your server runs at a fixed URL.
If your Minio server is at http://minio.example.com:9001, you may want to supply http://example.com:9001 as the endpoint and minio as the bucket name to form the correct URL.
This behavior may change in the future.
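The URL rewriting described in this caveat can be reproduced with a few lines of Python. This is only a sketch of the observable behavior, not the underlying library's code; the function name bucket_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def bucket_url(endpoint: str, bucket_name: str) -> str:
    """Prefix the endpoint's host with the bucket name, keeping scheme and port."""
    parts = urlsplit(endpoint)
    host = f"{bucket_name}.{parts.netloc}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# The endpoint and bucket names below are the ones used in the caveat.
connected = bucket_url("http://example.com:9001", "xvc-bucket")
workaround = bucket_url("http://example.com:9001", "minio")
```

The second call shows the workaround: choosing the bucket name so that the combined host equals the address the server actually listens on.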
Technical Details
This command requires Xvc to be compiled with the minio feature, which is on by default.
It uses Rust async features via the rust-s3 crate, and may add some bulk to the binary.
If you want to compile Xvc without these features, please refer to the How to Compile Xvc document.
The command creates a .xvc-guid file at http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/.xvc-guid.
The file contains the unique identifier for this storage.
The same identifier is also recorded in the project.
A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}.
{{REPO_ID}} is the unique identifier for the repository created during xvc init.
Hence if you use a common storage for different Xvc projects, their files are kept under different directories.
There is no inter-project deduplication.
xvc storage new r2
Purpose
Configure Cloudflare R2 as an Xvc storage.
Synopsis
$ xvc storage new r2 --help
Add a new R2 storage
Usage: xvc storage new r2 [OPTIONS] --name <NAME> --account-id <ACCOUNT_ID> --bucket-name <BUCKET_NAME>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--account-id <ACCOUNT_ID>
R2 account ID
--bucket-name <BUCKET_NAME>
Bucket name
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
xvc storage new wasabi
Purpose
Configure a Wasabi service as an Xvc storage.
Synopsis
$ xvc storage new wasabi --help
Add a new Wasabi storage
Usage: xvc storage new wasabi [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--endpoint <ENDPOINT>
Endpoint for the server, complete with the region if there is
e.g. for eu-central-1 region, use s3.eu-central-1.wasabisys.com as the endpoint.
[default: s3.wasabisys.com]
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
xvc storage new digital-ocean
Purpose
Configure a Digital Ocean Spaces service as an Xvc storage.
Synopsis
$ xvc storage new digital-ocean --help
Add a new Digital Ocean storage
Usage: xvc storage new digital-ocean [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
Utilities
xvc root
Purpose
Show the root directory of the Xvc project, where .xvc/ resides.
Synopsis
$ xvc root --help
Find the root directory of a project
Usage: xvc root [OPTIONS]
Options:
--absolute Show absolute path instead of relative
-h, --help Print help
Examples
xvc root can be used in scripts to make paths relative to the Xvc project root.
By default, it shows the relative path.
$ xvc root
..
When you supply --absolute, it prints the absolute path.
$ xvc root --absolute
/home/user/my-xvc-project/
xvc check-ignore
Purpose
Check whether a path is ignored or whitelisted by Xvc.
Synopsis
$ xvc check-ignore --help
Check whether files are ignored with `.xvcignore`
Usage: xvc check-ignore [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Targets to check. If no targets are provided, they are read from stdin
Options:
-d, --details
Show the exclude patterns along with each target path. A series of lines are printed in this format: <path/to/.xvcignore>:<line_num>:<pattern> <target_path>
--ignore-filename <IGNORE_FILENAME>
Filename that contains ignore rules
This can be set to .gitignore to test whether Git and Xvc work the same way.
[default: .xvcignore]
-n, --non-matching
Include the target paths which don’t match any pattern in the --details list. All fields in each line, except for <target_path>, will be empty. Has no effect without --details
-h, --help
Print help (see a summary with '-h')
Examples
By default, it checks the files supplied from stdin.
$ xvc check-ignore
my-dir/my-file
If you supply paths from the CLI, they are checked instead.
$ xvc check-ignore my-dir/my-file another-dir/another-file
If you want to find out which .xvcignore file ignores (or whitelists) a certain path, you can use --details.
$ xvc check-ignore --details my-dir/my-file another-dir/another-file
The .xvcignore file format is identical to the .gitignore file format.
This utility can be used to check ignore rules in other files as well.
You can specify an alternative ignore filename with the --ignore-filename option.
The command below is equivalent to git check-ignore and should give the same results.
$ xvc check-ignore --ignore-filename .gitignore
xvc aliases
Synopsis
$ xvc aliases --help
Print command aliases to be sourced in shell files
Usage: xvc aliases
Options:
-h, --help Print help
Examples
You can include aliases in interactive shells.
$ eval "$(xvc aliases)"
$ pvc --help
Pipeline management commands
Usage: xvc pipeline [OPTIONS] <COMMAND>
Commands:
new Add a new pipeline
update Rename, change dir or set a pipeline default
delete Delete a pipeline
run Run a pipeline
list List all pipelines
dag Generate mermaid diagram for the pipeline
export Export the pipeline to a YAML, TOML or JSON file
import Import the pipeline from a file
step Step management commands
help Print this message or the help of the given subcommand(s)
Options:
-n, --name <NAME> Name of the pipeline this command applies to
-h, --help Print help information
If you add the above line to your .bashrc or .zshrc, these aliases will always be available.
You can get a list of aliases.
$ xvc aliases
alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'
If there are aliases that you'd rather not use with Xvc, you can unalias them.
This command is not implemented yet. Please see https://github.com/iesahin/xvc/issues/176 for its progress.
Rust API
xvc
See https://docs.rs/xvc/ for latest version of the Xvc API
xvc-config
See https://docs.rs/xvc-config/ for latest version of the Xvc API
xvc-core
See https://docs.rs/xvc-core/ for latest version of the Xvc API
xvc-ecs
xvc-file
See https://docs.rs/xvc-file/ for latest version of the Xvc API
xvc-logging
See https://docs.rs/xvc-logging/ for latest version of the Xvc API
xvc-pipeline
See https://docs.rs/xvc-pipeline/ for latest version of the Xvc API
xvc-storage
See https://docs.rs/xvc-storage/ for latest version of the Xvc API
xvc-walker
See https://docs.rs/xvc-walker/ for latest version of the Xvc API
Xvc Architecture
The malleability of the material (bits and bytes) we're working with leads to difficulties in architecting software. Unlike real architecture, bits and bytes don't bring natural restrictions. It's not possible to build skyscrapers with mud bricks, and our material is much more malleable. There are so many options and so many ways to solve problems that it's easy to sink into technical mud with the decisions we make.
Software developers created a set of architectural principles to overcome this lack of limits. Most of these principles are bogus. They are not tested in the field. We seldom have software that's still perfectly maintainable after ten years. Usually, reading and understanding the code is more difficult than coming up with a new solution and rewriting it.
In this chapter, we describe the problems, assumptions, and solutions in Xvc's intended domain. It's a work in progress but should give you ideas about the intentions behind decisions.
After two decades, I (un)learned a few basic principles regarding software development.
- Object Oriented Programming doesn't work. Mixing data and functions (methods) isn't a good way to write programs. It leads to artificial layers and structures that become burdensome in the long run. It forces the developer to think about both the data and functionality at the same time. This makes reasoning and solving the problem harder than it should be.
- Data structures are more important than algorithms. Using a few distinct, well-thought-out data structures is more important than creating the best algorithm. Algorithms are replaceable locally without much peripheral impact. Modifying data structures usually requires updates to all related elements.
- DRY is overrated. It may be a good principle after you write the first version. However, during the actual development phase, it's not a good idea to try not to repeat yourself. What parts of the program repeat, what parts rhyme, and what should be abstracted can be seen after we write the whole. Trying to apply abstract principles to exploratory development hinders the ability to solve problems as plainly as possible.
- More errors are made in the name of abstraction than the reverse. Abstractions don't always help. They usually distribute a single functionality across arbitrary layers. In the age of LSP, it's easier to find repeating functionality and merge or rewrite it than to fix incorrect assumptions about abstractions. Problems with repeating code are obvious and easier to fix than problems with abstractions.
- Vertical architecture is more important than horizontal architecture. Vertical architecture means the lower the number of layers between the user and their intention, the better. If the user wants to copy a file, creating a layer of abstract classes to make this more modular doesn't result in more resilient software. If we want to detect whether we're in a Git repository, checking the presence of the .git directory is simpler than creating a few abstract classes that work for more than one SCM, and implementing abstract methods for them. The architecture shouldn't try to satisfy abstract patterns; it should make the path between the user's action and its effect as direct as possible.
Xvc Modules (Crates)
Xvc is composed of modules that can be tested and used independently.
The core module is in the middle of the architecture.
Lower-level crates interface with the OS and convert what they read into data structures.
Higher levels use these data structures to implement functionality.
For example, the xvc-walker crate interfaces with directories, paths, and ignore rules, and serves a set of paths with their metadata.
The xvc-file crate uses these to check whether a file has changed.
- logging: Logger definitions and debugging macros.
- walker: A file system directory walker that checks ignore files. It can also notify the changes in the directory via channels after the initial traversal.
- config: Configuration framework that loads configuration from various levels (Default, System, User, Project, Environment) and merges these with command line options for each module.
- ecs: The entity-component system responsible for saving and loading state of all data structures, along with their associations and queries.
- storage: Commands and functionality to configure external (local or cloud) locations to store file content.
- core: Xvc specific data structures and utilities. All user level modules use this module for shared functionality.
- file: Commands to track files and utilities around file management.
- pipeline: Commands to define data pipelines as DAGs and run them.
The current dependency graph where lower-level modules are used directly is this:
graph TD
    xvc --> xvc-file
    xvc --> xvc-pipeline
    xvc-file --> xvc-config
    xvc-file --> xvc-core
    xvc-file --> xvc-ecs
    xvc-file --> xvc-logging
    xvc-file --> xvc-walker
    xvc-file --> xvc-storage
    xvc-pipeline --> xvc-config
    xvc-pipeline --> xvc-core
    xvc-pipeline --> xvc-ecs
    xvc-pipeline --> xvc-logging
    xvc-pipeline --> xvc-walker
    xvc-config --> xvc-walker
    xvc-config --> xvc-logging
    xvc-ecs --> xvc-logging
    xvc-core --> xvc-config
    xvc-core --> xvc-logging
    xvc-core --> xvc-walker
    xvc-core --> xvc-ecs
    xvc-walker --> xvc-logging
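One way to check that the current graph is a proper DAG, and to see an order in which the crates could be built or tested, is a topological sort. The Python sketch below copies the edges from the graph above into a dictionary. The names DEPENDENCIES and build_order are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# crate -> crates it uses directly, copied from the current graph
DEPENDENCIES: Dict[str, Set[str]] = {
    "xvc": {"xvc-file", "xvc-pipeline"},
    "xvc-file": {"xvc-config", "xvc-core", "xvc-ecs", "xvc-logging",
                 "xvc-walker", "xvc-storage"},
    "xvc-pipeline": {"xvc-config", "xvc-core", "xvc-ecs", "xvc-logging",
                     "xvc-walker"},
    "xvc-config": {"xvc-walker", "xvc-logging"},
    "xvc-ecs": {"xvc-logging"},
    "xvc-core": {"xvc-config", "xvc-logging", "xvc-walker", "xvc-ecs"},
    "xvc-walker": {"xvc-logging"},
}


def build_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the crates with every dependency before its dependents.

    graphlib raises CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(DEPENDENCIES)
```

In any valid order, xvc-logging comes before xvc-walker, which comes before xvc-config and xvc-core, and the xvc binary crate comes last.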
After the crate interfaces are stabilized, all lower-level functions will be reused from xvc-core.
It will provide the basic Xvc API.
In this case, the graph will be simplified.
graph TD
    xvc --> xvc-file
    xvc --> xvc-pipeline
    xvc-file --> xvc-core
    xvc-pipeline --> xvc-core
    xvc-config --> xvc-walker
    xvc-config --> xvc-logging
    xvc-ecs --> xvc-logging
    xvc-core --> xvc-config
    xvc-core --> xvc-logging
    xvc-core --> xvc-walker
    xvc-core --> xvc-ecs
    xvc-core --> xvc-storage
    xvc-walker --> xvc-logging
Any improvement in the user-level API will be done at levels higher than xvc-core.
Any improvement in lower-level modules will be done in the dependencies of xvc-core.
Goals
Xvc is a CLI MLOps tool to track file, data, pipeline, experiment, and model versions.
It has the following goals:
- Enable users to track any kind of file, including large binaries, data, and models, with Git.
- Enable users to get a subset of these files.
- Enable users to remove files from the workspace temporarily, and retrieve them from the cache.
- Enable users to upload and download these files to/from a central server.
- Enable users to run pipelines composed of commands.
- Enable users to invalidate pipelines partially.
- Enable users to run a pipeline or arbitrary commands as experiments, and store and retrieve them.
Xvc users are data and machine learning professionals that need to track large amounts of data. They also want to run arbitrary commands on this data when it changes. Their goal is to produce better machine learning models and better suited data for their problems.
We have three quality goals:
- Robustness: The system should be robust for basic operations.
- Performance: The overall system performance must be in the same ballpark as common commands like b3sum or cp.
- Availability: The system must run on all major operating systems.
Xvc users work with large amounts of data. They want to depend on Xvc for basic operations like tracking file versions, and uploading these to a central location.
They don't want to wait too long for these operations on common hardware.
They would like to download their data to systems running any of the major operating systems.
Xvc Cache
The cache is where Xvc copies the files it tracks.
It's located under the .xvc directory.
Instead of the file tree that's normally used to address files, it uses the content digests of files to organize them.
In a standard file hierarchy, we have files in paths like /home/iesahin/Photos/my-photo.png.
Xvc doesn't use such a tree in its cache.
It uses paths like .xvc/b3/a12/b45/d789a...f54/0.png to refer to files.
Because the cache path is produced from the content, it changes when the file is updated.
For example, if you save another photo on top of my-photo.png in a standard file hierarchy, the first version is lost.
In the cache, however, the two versions produce different digests, so they are stored in different locations.
There are four parts of this cache path.
The .xvc part is the standard directory that the xvc init command creates. It resides in the root folder of your project.
b3/ denotes the digest type of the content digest.
Xvc supports more than one algorithm to calculate content digests.
The HashAlgorithm enum (https://docs.rs/xvc-core/0.4.0/xvc_core/types/hashalgorithm/enum.HashAlgorithm.html) shows which algorithms are supported.
Each of these algorithms has a 2-letter prefix.
- b3 → BLAKE3
- b2 → BLAKE2s
- s2 → SHA2-256
- s3 → SHA3-256
Note that all these digest algorithms produce digests of 256 bits (32 bytes). Each digest is converted to 64 hexadecimal digits. In order to keep the total path length short, Xvc currently requires digests to be 32 bytes in length.
The third part of the cache path is these 64 hexadecimal digits in the form a12/b45/d789...f54/.
The 64 digits are split into directories to keep the number of entries under any single directory low.
Had Xvc put all cache elements in a single directory, it could lead to degraded performance in some file systems.
With this arrangement, b3/ can contain at most 4096 directories, each of which contains at most 4096 directories.
With a uniform distribution from a good hash algorithm, there won't be more than about 4000 elements per directory until the cache holds 68 billion files (4096³).
The fourth part is 0.png, which is the file itself with the same extension but with 0 as the basename.
Xvc uses the digest as a directory name instead of a file name.
There may be times when the file in the cache should be used manually, on remote storages for example.
The extension is kept for this reason, to make sure that the OS recognizes the file type correctly.
The rename to 0 means that this is the whole file.
In the future, when Xvc supports splitting large files to transfer to remotes, all parts of the file will be put into this directory.
Storages also use the same cache structure, with an added GUID part to use a single storage for multiple projects.
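The four-part layout described above can be sketched as a small function. This is only an illustration under stated assumptions: cache_path is a hypothetical helper, not Xvc's actual API, and it assumes the file has a non-empty extension.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not part of Xvc): builds the four-part cache path.
/// `prefix` is the 2-letter algorithm prefix such as "b3".
fn cache_path(prefix: &str, digest: &[u8; 32], extension: &str) -> PathBuf {
    // 32 bytes become 64 lowercase hexadecimal digits.
    let hex: String = digest.iter().map(|b| format!("{:02x}", b)).collect();

    let mut path = PathBuf::from(".xvc"); // part 1: the directory created by xvc init
    path.push(prefix); // part 2: digest type
    path.push(&hex[0..3]); // part 3: digits split 3 / 3 / 58
    path.push(&hex[3..6]);
    path.push(&hex[6..]);
    path.push(format!("0.{}", extension)); // part 4: whole file, original extension
    path
}
```

Three hexadecimal digits give 16³ = 4096 possible directory names per level, which is where the 4096 figure above comes from.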
The Architecture of Xvc Entity Component System
Xvc uses an entity component system (ECS) in its core. ECS architecture is popular in game development, but hasn't found much popularity in other areas. It's an alternative to Object-Oriented Programming.
There are a few basic notions of ECS architecture. Although it may differ in other frameworks, Xvc assumes the following:
- An entity is a neutral way of tracking components and their relationships. It doesn't contain any semantics other than being an entity. An entity in Xvc is an atomic integer tuple (XvcEntity).
- A component is a bundle of associated data about these entities. All semantics of entities are described through components. Xvc uses components to keep track of different aspects of file system objects, dependencies, remotes, etc.
- A system is where the components are created and modified. Xvc considers all modules that interact with components as separate systems.
Suppose you want to track a new file in Xvc.
Xvc creates a new entity for this file.
It associates the path (XvcPath) with this entity.
It checks the file metadata, creates an instance of XvcMetadata, and associates it with this entity.
If this object is committed to the Xvc cache, an XvcDigest struct is associated with the entity.
The difference from OOP is that there is no basic or main object.
If you want to work only with digests and want to find the workspace paths associated with them, you can write a function (system) that starts from XvcDigest records and collects the associated paths.
If you want to get only the files larger than a certain size, you can work with XvcMetadata, filter them and get the paths later.
In contrast, in an OOP setting, this kind of data is associated with paths, and when you want to do such operations, you need to load paths and all their associations first.
The OOP way of doing things is usually against the principle of locality.
The whole idea is to be flexible for further changes.
As of now, Xvc doesn't have separate notions of data and models.
It doesn't have different functionality for files that are models or data.
In the future, however, when this is added, an XvcModel component will be created and associated with the same entity as an XvcPath.
It will allow working with some paths as model files, but it doesn't require paths to be known beforehand.
There may be other metadata, like features or version, associated with models that are more important.
There may be some models without a file system path, maybe living only in memory or in the cloud.
Those kinds of models might be identified by checking whether the model has a corresponding XvcPath component or not.
In contrast, OOP would define this either by inheritance (a model is a path) or containment (a model has a path). When you select any of these, it becomes a relationship that must be maintained indefinitely. When you only have an integer that identifies these components, it's much easier to describe models without a path later. There is no predefined relationship between paths and models.
The architecture is roughly similar to database modeling. Components are in-memory tables, albeit small ones that mostly contain a few fields. Entities are sequential primary keys. Systems are insert, query and update mechanisms.
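The "filter by metadata first, get the paths later" query described above can be sketched with two plain maps standing in for component tables. The Entity alias and the function name are hypothetical, and real Xvc components are richer than a String and a u64.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for XvcEntity: a pair of integers.
type Entity = (u64, u64);

/// Returns the paths of entities whose size component exceeds `min_size`.
/// The query starts from the size table and touches the path table only
/// for the entities that pass the filter.
fn paths_larger_than(
    paths: &BTreeMap<Entity, String>,
    sizes: &BTreeMap<Entity, u64>,
    min_size: u64,
) -> Vec<String> {
    sizes
        .iter()
        .filter(|(_, size)| **size > min_size)
        // Entities without a path component are simply skipped.
        .filter_map(|(entity, _)| paths.get(entity).cloned())
        .collect()
}
```

Because the two tables are only linked by the entity key, an entity that has a size but no path (for example, a model living only in memory) needs no special handling.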
Stores
An XvcStore, in its basic definition, is a map structure between XvcEntity and a component type T.
It has facilities for persistence, iteration, search and filtering.
It can be considered a system in the usual ECS sense.
Loading and Saving Stores
As our goal is to track data files with Git, stores save and load binary files' metadata to and from text files.
Instead of storing the binary data itself in Git, Xvc stores information about these files to track whether they have changed.
By default, this metadata is persisted to JSON.
Component types must be serializable because of this.
Nevertheless, as they are almost always composed of basic types that serde supports, this doesn't pose a difficulty in usage.
The JSON files are then committed to Git.
Note that there are usually multiple branches in Git repositories. Multiple users may also work on the same branch.
When these text files are reused by the stores, they are modified, and this may lead to merge conflicts. We don't want our users to deal with merge conflicts in entity and component text files. Avoiding such conflicts by design also makes it possible to use binary formats like MessagePack in the future.
Suppose user A made a change in XvcStore<XvcPath> by adding a few files.
Another user B made another change to the project, by adding another set of files in another copy of the project.
This will lead to merge conflicts:
- The XvcEntity counter will have different values in A and B's repositories.
- XvcStore<XvcPath> will have different records in A and B's repositories.
Instead of saving and loading monolithic files, XvcStore saves and loads event logs.
There are two kinds of events in a store:
- Add(XvcEntity, T): Adds an element T to a store.
- Remove(XvcEntity): Removes the element with the given entity id.
These events are saved into files. When the store is loaded, all files after the last full snapshot are loaded and replayed.
When you add an item to a store, it saves the Add event to a log.
These events are then put into a vector.
A BTreeMap is also created from this vector.
When an item is deleted, a Remove event is added to the event vector.
While loading, stores remove the elements with Remove events from the BTreeMap.
So the final set of elements doesn't contain the removed item.
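A minimal sketch of this replay step follows. Event and replay are illustrative names rather than the types in Xvc's source, and Entity is a stand-in for XvcEntity.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for XvcEntity.
type Entity = (u64, u64);

/// The two kinds of events a store records.
#[derive(Debug, Clone, PartialEq)]
enum Event<T> {
    Add(Entity, T),
    Remove(Entity),
}

/// Replays an event log in order and returns the resulting map.
fn replay<T: Clone>(events: &[Event<T>]) -> BTreeMap<Entity, T> {
    let mut map = BTreeMap::new();
    for event in events {
        match event {
            Event::Add(entity, value) => {
                map.insert(*entity, value.clone());
            }
            Event::Remove(entity) => {
                map.remove(entity);
            }
        }
    }
    map
}
```

Since the map is derived entirely from the log, two users who append different events produce different log files rather than conflicting edits to one shared file.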
The second problem with multiple branches is duplicate entities in separate branches. Xvc uses a counter to generate unique entity ids. When a store is loaded, it checks the last entity id in the event log and uses it as the starting point for the counter. But using this counter as is causes duplicate values in different branches. Xvc solves this by adding a random value to these counter values.
Since v0.5, XvcEntity is a tuple of 64-bit integers. The first is loaded from the disk and is an atomic counter. The second is a random value that is renewed at every command invocation. Therefore we have a unique entity id for every run that's also sortable by the first value. Easy sorting with integers is sometimes required for stable lists.
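The counter-plus-random scheme can be sketched as below. EntityGenerator is a hypothetical name, and the random value is passed in as a parameter (salt) so the example stays within the standard library; how Xvc actually draws it is not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for XvcEntity: (counter, per-run random value).
/// The derived ordering compares the counter first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Entity(u64, u64);

/// Illustrative generator, not Xvc's actual type.
struct EntityGenerator {
    counter: AtomicU64,
    salt: u64,
}

impl EntityGenerator {
    /// `last_counter` stands for the value loaded from disk,
    /// `salt` for the random value renewed at every invocation.
    fn new(last_counter: u64, salt: u64) -> Self {
        Self { counter: AtomicU64::new(last_counter), salt }
    }

    /// Returns the next entity id for this run.
    fn next_entity(&self) -> Entity {
        let n = self.counter.fetch_add(1, Ordering::SeqCst) + 1;
        Entity(n, self.salt)
    }
}
```

Two branches that both continue from the same counter value still produce distinct ids as long as their random values differ.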
Inverted Index
Stores also have an inverted index for quick lookup.
The index stores each value of T as a key, together with the list of entities that correspond to this key.
For example, when we have a path that we stored, it's a single operation to get the corresponding XvcEntity, and after this, all recorded metadata about this path is available.
All search, iteration and filtering functionality is performed using these two internal maps.
In summary, a store has four components.
- An immutable log of previous events: Vec<Event<T>>
- A mutable log of current events: Vec<Event<T>>
- A mutable map of the current data: BTreeMap<XvcEntity, T>
- A mutable map of the entities from values: BTreeMap<T, Vec<XvcEntity>>
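As a rough illustration of the fourth component, the inverted index can be derived from the entity map as follows. The function name is hypothetical and Entity again stands in for XvcEntity.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for XvcEntity.
type Entity = (u64, u64);

/// Builds a value -> entities index from an entity -> value map.
fn build_inverted_index<T: Ord + Clone>(
    map: &BTreeMap<Entity, T>,
) -> BTreeMap<T, Vec<Entity>> {
    let mut index: BTreeMap<T, Vec<Entity>> = BTreeMap::new();
    for (entity, value) in map {
        // The same value may be recorded under more than one entity.
        index.entry(value.clone()).or_default().push(*entity);
    }
    index
}
```

A key that maps to more than one entity is the kind of conflicting value that can appear after two branches record the same path independently.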
Note that when two branches perform the same operation, the event logs will be different, as the random part of XvcEntity is different. When two branches merge, the inverted index may contain conflicting values. In this case, a fsck command is used to merge the store files and merge conflicting entity ids.
Insert, update and delete operations affect the mutable log and the maps.
Queries, iteration and similar non-destructive operations are done with the maps.
When loading, all log files are merged into the immutable log.
No standard operation touches the previously saved event logs.
All modifications to them are done outside of the normal workflow.
When saving, only the mutable log is saved.
Note that events can only be added to the log; they are not removed.
(See xvc fsck --merge-stores for merging store files.)
Relationship Stores
XvcStore keeps one component per entity.
Each component is a flat structure that doesn't refer to other components.
Xvc also has relation stores that represent relationships between entities and components. Similar to the database Entity-Relationship model, there are three kinds of relationship store:
R11Store<T, U> keeps two sets of components associated with the same entity.
It represents a 1-1 relationship between T and U.
It contains two XvcStores, one for each component type.
These two stores are indexed with the same XvcEntity values.
For example, an R11Store<XvcPath, XvcMetadata> keeps track of path metadata for the identical XvcEntity keys.
R1NStore<T, U> keeps parent-child relationships.
It represents a 1-N relationship between T and U.
On top of two XvcStores, this one keeps track of relationships with a third XvcStore<XvcEntity>.
It lists which Us are children of which Ts.
For example, a value of XvcPipeline can have multiple XvcSteps.
These are represented with R1NStore<XvcPipeline, XvcStep>.
This struct has parent-to-child and child-to-parent functions that can be used to get the children of a parent, or the parent of a child element.
The third type is RMNStore<T, U>.
This one keeps an arbitrary number of relationships between T and U.
Any number of Ts may correspond to any number of Us.
This type of store keeps the relationships in two XvcStore<XvcEntity>s.
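The 1-N case can be sketched with a single child-to-parent map, which plays the role of the third XvcStore<XvcEntity> described above. ParentChildIndex and its method names are hypothetical and don't mirror the actual R1NStore interface.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for XvcEntity.
type Entity = (u64, u64);

/// Illustrative 1-N index: each child entity points to one parent entity.
struct ParentChildIndex {
    child_to_parent: BTreeMap<Entity, Entity>,
}

impl ParentChildIndex {
    fn new() -> Self {
        Self { child_to_parent: BTreeMap::new() }
    }

    /// Records that `child` belongs to `parent`.
    fn insert(&mut self, parent: Entity, child: Entity) {
        self.child_to_parent.insert(child, parent);
    }

    /// Child-to-parent lookup: a child has at most one parent.
    fn parent_of(&self, child: Entity) -> Option<Entity> {
        self.child_to_parent.get(&child).copied()
    }

    /// Parent-to-child lookup: collects every child of `parent`.
    fn children_of(&self, parent: Entity) -> Vec<Entity> {
        self.child_to_parent
            .iter()
            .filter(|(_, p)| **p == parent)
            .map(|(c, _)| *c)
            .collect()
    }
}
```

With a pipeline entity as the parent and step entities as children, children_of returns the steps of a pipeline and parent_of returns the pipeline of a step.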
Comparisons
In order to avoid unnecessary work, we need to find differences across versions.
What has changed between the previous version and this version of type T?
Xvc is built bottom up, with vertical, long functions that do one thing.
For example, xvc file track is written separately from xvc file checkout, and the commonalities emerged after these implementations.
We consider implementation a form of planning.
We didn't start from traits and try to fit everything to these.
Instead we began from concrete enums and structs, saw that some of these share common functionality, and decided to group this common functionality as a trait after implementing several concrete functions.
We saw the diff pattern across all functionality.
In xvc pipeline, dependencies need to detect changes to decide whether to invalidate them.
In xvc file, files and directories need to detect changes to decide whether they should be committed to the cache.
It's easy to make comparison/subtraction when the data types are numeric.
For a signed integer, you can get a single numeric value as the diff with diff = a - b.
For complex data structures, representing the change is usually not straightforward.
We keep track of everything in the repository in stores.
These serialize a type T to a file, and get it back when needed.
The diff pattern works with these types.
Sometimes, there happens to be no record of something we have in the repository.
Sometimes, we have only the record, and not the actual thing on disk.
The diff should also handle this.
Instead of trying to come up with some wizardry, in the end we decided to represent this with five conditions.
- Identical: When two things of the same type T are equal. Nothing has changed between the actual version and its record.
- RecordMissing { actual: T }: We have something on disk, but can't find the respective record. For example, a new file is added to the disk and xvc file track detects it for the first time. The action is usually creating a record from actual: T.
- ActualMissing { record: T }: We found a record in the store, but the corresponding file is not there. For example, a tracked file is deleted, but the record is still kept.
- Difference { record: T, actual: T }: There is a record, but the actual data isn't identical. When a tracked file is changed, and its content hash now returns another digest, this can be reflected with Difference.
- Skipped: When the comparison seems unnecessary. For example, if we know a file hasn't changed by checking its metadata, we don't calculate its content digest and set it to Skipped.
These five conditions are represented in the DeltaField type.
As an entity may have more than one component, a comparison may require multiple DeltaFields.
For example, we may want to compare an XvcPath to see whether it has changed.
This requires comparing its XvcMetadata, its ContentDigest if it's a file, its CollectionDigest if it's a directory, etc.
There are various such Delta types.
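The five conditions can be sketched as an enum together with a comparison over an optional record and an optional actual value. Delta and compare are illustrative names; the real DeltaField may differ in detail. Mapping the "neither exists" case to Skipped is an assumption made here only to keep the function total.

```rust
/// Illustrative version of the five conditions described above.
#[derive(Debug, Clone, PartialEq)]
enum Delta<T> {
    Identical,
    RecordMissing { actual: T },
    ActualMissing { record: T },
    Difference { record: T, actual: T },
    Skipped,
}

/// Compares what the store recorded with what is found on disk.
fn compare<T: PartialEq>(record: Option<T>, actual: Option<T>) -> Delta<T> {
    match (record, actual) {
        // Assumption: nothing on either side means there is nothing to compare.
        (None, None) => Delta::Skipped,
        (None, Some(actual)) => Delta::RecordMissing { actual },
        (Some(record), None) => Delta::ActualMissing { record },
        (Some(record), Some(actual)) => {
            if record == actual {
                Delta::Identical
            } else {
                Delta::Difference { record, actual }
            }
        }
    }
}
```

A caller can then match on the result once, for example creating a record on RecordMissing and recalculating dependent values on Difference.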
Comparing files
Files are compared on several aspects.
We assume their relative path (XvcPath) doesn't change.
Other features like XvcMetadata, ContentDigest, etc. could be modified and are tracked.
The following struct is used to compare two files:
    pub struct FileDelta {
        pub delta_md: DeltaField<XvcMetadata>,
        pub delta_content_digest: DeltaField<ContentDigest>,
        pub delta_metadata_digest: DeltaField<MetadataDigest>,
        pub delta_cache_type: DeltaField<CacheType>,
        pub delta_text_or_binary: DeltaField<DataTextOrBinary>,
    }
When the user first starts tracking a file, all delta fields have the value RecordMissing.
Each contains the actual value on disk.
These are recorded to stores.
When they edit the file, its delta_md changes.
Xvc checks whether the delta_content_digest has also changed.
When the user wants to check out the file with a different cache_type, for example changing the workspace version from Copy to Hardlink, the delta_cache_type field contains a Difference value.
Comparing directories
A directory is considered as a collection of paths.
Its comparison is based on the (non-ignored) paths it contains.
    pub struct DirectoryDelta {
        pub delta_xvc_metadata: DeltaField<XvcMetadata>,
        pub delta_collection_digest: DeltaField<CollectionDigest>,
        pub delta_metadata_digest: DeltaField<MetadataDigest>,
        pub delta_content_digest: DeltaField<ContentDigest>,
    }
We record the size and modification time of the directories as well.
When these change, they are reflected in the delta_xvc_metadata field.
The other fields are generated from the paths the directory contains.
Storages
Xvc uses storages to store content of the files. These storages are different from Git remotes. They don't contain Git history of a repository, but they can store contents of the files tracked by Xvc.
A storage uses the same content addresses used in the Xvc cache to store the files.
For example, if there is a file in an Xvc repository that points to /b3/1886572424...defa/0.png in the local cache, this path will be used to identify the content in the storage as well.
Additionally, Xvc stores storage event logs that list which operations were performed on that storage. By using these event logs, it's possible to identify what has happened in a storage without checking the file lists. These event logs are also shared with the other users, so a user can identify which files are present in a storage even without a connection.
Basic Operations
All storages should support the following operations:
- Init to initialize a storage.
- List to list the files available in the storage.
- Send to upload files from the local cache to a storage.
- Receive to download files from a storage to the local cache.
- Delete to delete files from a storage.
All these operations record a distinct event to the event log.
Events record the event type, the GUID of the storage and the event content.
Event contents are like the following:
- Init creates the necessary directories and the GUID file in a storage.
- List includes the listing retrieved from the storage. Once a list is retrieved from the storage, it's available for local operations. The most recent list is the starting point to determine the files available in a storage.
- Send event contains the affected paths. These paths are added to the storage file list.
- Receive event contains the affected paths. These paths are added to the storage file list.
- Delete event contains the deleted paths, possibly multiple at once. These paths are removed from the storage file list.
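How a local file list might be derived from these events can be sketched as below. StorageEvent and storage_file_list are hypothetical names, the GUID is omitted, and the paths in any usage are made-up placeholders.

```rust
use std::collections::BTreeSet;

/// Illustrative storage events, carrying only the affected paths.
enum StorageEvent {
    Init,
    List(Vec<String>),
    Send(Vec<String>),
    Receive(Vec<String>),
    Delete(Vec<String>),
}

/// Replays the event log to find which files a storage is known to contain.
fn storage_file_list(events: &[StorageEvent]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for event in events {
        match event {
            // A freshly initialized storage has no files.
            StorageEvent::Init => files.clear(),
            // A listing replaces whatever was known before.
            StorageEvent::List(paths) => {
                files = paths.iter().cloned().collect();
            }
            // Sent and received paths are known to be present.
            StorageEvent::Send(paths) | StorageEvent::Receive(paths) => {
                files.extend(paths.iter().cloned())
            }
            StorageEvent::Delete(paths) => {
                for path in paths {
                    files.remove(path);
                }
            }
        }
    }
    files
}
```

Because the log is shared with other users, the same replay works offline on any clone of the repository.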
Storage types
Local Storages
A local storage is a directory in the local file system. It may be a mount point shared with others, or another disk that you use for backups and sharing.
- Init uses std::fs::copy to copy the GUID file to the appropriate directory.
- List uses std::fs::read_dir.
- Send uses std::fs::copy with rayon.
- Receive uses std::fs::copy with rayon.
- Delete uses std::fs::remove_file with rayon.
Generic Storages
These storages define commands for each of the operations listed above.
This allows running external programs such as rsync, rclone, and s5cmd.
For such storages, commands for the above operations must be defined, and they will be run in separate processes.
This storage type offloads the responsibility for the exact operations to the user.
The user is expected to supply the values of the following variables:
- {URL}: The URL for the storage. This can be anything the commands to send/receive/list will accept. It's there to build the paths with minimal repetition.
- {STORAGE_DIR}: You can specify the storage directory separately.
- {PATH}: This is set by Xvc for each individual command. It's a path relative to the local cache directory.
- {PROCESS_POOL_SIZE}: This value is used to set the number of processes that perform the operations. Setting this to 1 makes all operations sequential.
- List Command: A command to list the {URL}. For example, rsync --list-only {URL}{STORAGE_DIR}
- Send Command: A command to send a file to {URL}{STORAGE_DIR}. It can use {URL} and should use {PATH} in the command. An example may be rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}
- Receive Command: A command to receive a file from a storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: rsync -a {URL}{STORAGE_DIR}{PATH} {PATH}
- Delete Command: A command to delete a file from the storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: ssh {URL} "rm {STORAGE_DIR}{PATH}"
Generic storages use these commands to create multiple processes to send/receive/delete files. It's not as fast as using other types because of the overhead involved, but its flexibility is useful.
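The variable substitution in these command templates can be sketched with plain string replacement. expand_command is a hypothetical helper, not the function Xvc uses, and the URL and directory in any usage are placeholders.

```rust
/// Illustrative template expansion for a generic storage command.
/// Assumes the supplied values don't themselves contain `{...}` placeholders,
/// because the replacements are applied one after another.
fn expand_command(template: &str, url: &str, storage_dir: &str, path: &str) -> String {
    template
        .replace("{URL}", url)
        .replace("{STORAGE_DIR}", storage_dir)
        .replace("{PATH}", path)
}
```

For instance, expanding the send example above with a URL of user@example.com:, a storage directory of /srv/xvc/ and a cache-relative path yields one concrete rsync invocation per file, which is what gets run in a separate process.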
Git and Xvc
Xvc aims to fill the gap Git leaves for certain workflows. These workflows involve large binary data that shouldn't be replicated in each repository.
Xvc tracks all its metadata on top of Git. In most cases, Xvc assumes the presence of a Git repository where the user tracks the history, text files, and metadata. However, the relationship between these should be clear and separate.
Xvc doesn't (and shouldn't) use Git more than a user could use manually. Our aim is not to replace Git operations with Xvc operations or tamper with the internal structure of the Git repository. When Xvc uses Git to track ECS or other metadata, the operations must be separate and sandwich Xvc operations.
- Any Git operation that involves checking out commits, branches, tags, or other references must come before any Xvc operation. As Xvc relies on the files tracked by Git, resuming any state for Xvc operations should be complete before these operations start.
- Xvc helps to stage and commit certain files in .xvc/ to Git. By default, any state-changing operation in Xvc adds a commit to Git.
- Xvc also helps to store this changed metadata in a new or existing branch. In this case, a checkout must be done before Xvc records the files.
sequenceDiagram
    User ->> Xvc: xvc --from-ref my-branch --to-branch another-branch file track large-dir/
    Xvc ->> Git: git checkout my-branch
    Git ->> Xvc: branch = my-branch
    Xvc ->> xvc-file: track large-dir/
    xvc-file ->> Xvc: Ok. Saved the stores and metadata.
    Xvc ->> Git: Do we have user staged files?
    Git ->> Xvc: Yes. This and this.
    Xvc ->> Git: Stash them.
    Git ->> Xvc: Stashed user staged files.
    Xvc ->> Git: git checkout -b another-branch
    Git ->> Xvc: branch = another-branch
    Xvc ->> Git: git add .xvc/
    Git ->> Xvc: added .xvc/
    Xvc ->> Git: git commit -m "Commit after xvc file track"
    Xvc ->> Git: Unstash files that we have stashed
Note that if the user has some already staged files, these are stashed and unstashed to the requested branch.
This is a side effect of doing xvc commit operations on behalf of the user.
The other option is to report an error and quit if the user has the --to-branch option set.
The behavior may change in the future.
For the time being, we will keep this stash-unstash operation for the user files.
One other issue is the library that we're going to use. I checked several options when I was writing auto-commit functionality.
At that time, I decided that the number of Git operations for each Xvc operation is less than five.
These can be done by creating a Git process.
The libraries are not 100% identical in features.
Even the most widely used libgit2 doesn't provide shallow clones, and it's not possible to use git stash --staged with it.
The second reason for this is explainability. Instead of trying to explain to the user what we are doing with Git, we can report the commands we are running. The library interfaces are different from Git CLI. They need to be learned before reading the code. Using Git CLI is more dependable, observable, and understandable than trying to come up with a set of library calls.
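The "report the commands we are running" idea can be sketched as follows. Both functions are hypothetical and only cover the checkout, add and commit steps from the sequence diagram above; the stash and unstash steps are left out. run_git assumes a git executable is available on the PATH.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Plans the Git invocations that follow an Xvc operation.
/// Each inner vector holds the arguments of one `git` call.
fn auto_commit_plan(to_branch: Option<&str>, message: &str) -> Vec<Vec<String>> {
    let mut plan: Vec<Vec<String>> = Vec::new();
    if let Some(branch) = to_branch {
        plan.push(vec!["checkout".into(), "-b".into(), branch.into()]);
    }
    plan.push(vec!["add".into(), ".xvc/".into()]);
    plan.push(vec!["commit".into(), "-m".into(), message.into()]);
    plan
}

/// Runs one planned call in a separate Git process.
fn run_git(args: &[String]) -> io::Result<ExitStatus> {
    // Report the exact command so the user can reproduce it by hand.
    eprintln!("git {}", args.join(" "));
    Command::new("git").args(args).status()
}
```

Keeping the plan as plain argument lists means the same data is used both for reporting and for execution, so what the user sees is what was run.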
Concepts
- Digest: A digest is a 32-byte numeric sequence that identifies a file, content or any other data. Xvc uses different algorithms to generate this sequence.
- Associated Digest: This is a specific kind of digest associated with an entity. An entity can have more than one digest, like a content digest or a metadata digest. Xvc uses these different kinds of digests to avoid unnecessary digest calculations.
Digest
A numerical summary of an entity. In Xvc, digests are 32 bytes long and are produced by BLAKE3 by default.
See Associated Digest for different types of digests.
Associated Digest
There may be multiple digests associated with an entity like a path, directory or dependency. An associated digest is any one of these digests.
Metadata Digest
Files and directories have metadata.
Metadata shows information about the creation, modification and access times of the file, or its size.
Metadata is OS dependent in most cases.
Xvc abstracts file and directory metadata with the XvcMetadata struct.
The metadata digest represents this abstraction in 32 bytes to compare changes in files and directories.
Content Digest
The content digest of a file is calculated from the data it contains. It produces 32 bytes from the content. When the content changes, the result of this calculation also changes.
Collection Digest
Some entities in Xvc are composed of multiple elements. Examples are directories (composed of files), file lines, regex filter results, SQL query results etc. Instead of trying to compare all elements, Xvc creates a 32-byte digest of the whole collection. For example, when a new file is added to a directory, its collection digest also changes. This makes it easier to keep track of changed directories than comparing their members one by one.
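The idea of a collection digest can be sketched with the standard library alone. Note the loud assumption: DefaultHasher is used here only as a stand-in and produces a 64-bit value, whereas Xvc digests are 32 bytes and produced by algorithms such as BLAKE3. Sorting the members first is also an assumption of this sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative collection digest over member names.
/// Stand-in only: 64-bit DefaultHasher instead of a 32-byte digest.
fn collection_digest(members: &[&str]) -> u64 {
    // Sort so that the listing order doesn't affect the result.
    let mut sorted: Vec<&str> = members.to_vec();
    sorted.sort_unstable();

    let mut hasher = DefaultHasher::new();
    for member in sorted {
        member.hash(&mut hasher);
    }
    hasher.finish()
}
```

Comparing two such values is a single equality check, regardless of how many members the collection has.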
Development
Code and Documentation Conventions
- Xvc is spelled capitalized in documentation. It's Xvc, not XVC, not xvc.