Pipeline sharing
Nextflow seamlessly integrates with BitBucket [1], GitHub, and GitLab hosted code repositories and sharing platforms. This feature allows you to manage your project code in a more consistent manner or use other people’s Nextflow pipelines, published through BitBucket/GitHub/GitLab, in a quick and transparent way.
How it works
When you launch a script execution with Nextflow, it will look for a file with the pipeline name you’ve specified. If that file does not exist, it will look for a public repository with the same name on GitHub (unless otherwise specified). If it is found, the repository is automatically downloaded to your computer and executed. This repository is stored in the Nextflow home directory, which by default is the $HOME/.nextflow path, and is thus reused for any further executions.
Running a pipeline
To launch the execution of a pipeline project hosted in a remote code repository, you simply need to specify its qualified name or the repository URL after the run command. The qualified name is formed by two parts: the owner name and the repository name, separated by a / character.
In other words, if a Nextflow project is hosted, for example, in a GitHub repository at the address http://github.com/foo/bar, it can be executed by entering the following command in your shell terminal:
nextflow run foo/bar
or using the project URL:
nextflow run http://github.com/foo/bar
Note
In the first case, if your project is hosted on a service other than GitHub, you will need to specify this hosting service on the command line by using the -hub option, for example -hub bitbucket or -hub gitlab. In the second case, i.e. when using the project URL as the name, the -hub option is not needed.
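As a rough illustration of how a qualified name and the -hub option relate to a repository URL, the sketch below maps one to the other for the three public services. The repo_url helper is hypothetical and is not part of Nextflow, and the base URLs are those of the public GitHub, BitBucket and GitLab services; it does not claim to reproduce how Nextflow resolves names internally.

```shell
# Hypothetical helper: map a qualified name plus a hub to a repository URL.
# Illustration only; this is not Nextflow's own resolution code.
repo_url() {
    name="$1"              # qualified name, e.g. foo/bar
    hub="${2:-github}"     # hosting service; GitHub when omitted
    owner="${name%%/*}"    # text before the first /
    repo="${name#*/}"      # text after the first /
    case "$hub" in
        github)    base="https://github.com" ;;
        bitbucket) base="https://bitbucket.org" ;;
        gitlab)    base="https://gitlab.com" ;;
        *)         echo "unknown hub: $hub" >&2; return 1 ;;
    esac
    echo "$base/$owner/$repo"
}

repo_url foo/bar             # prints https://github.com/foo/bar
repo_url foo/bar bitbucket   # prints https://bitbucket.org/foo/bar
```

The second call corresponds to what you would request with nextflow run foo/bar -hub bitbucket.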
You can try this feature out by simply entering the following command in your shell terminal:
nextflow run nextflow-io/hello
It will download a trivial Hello example from the repository published at http://github.com/nextflow-io/hello and execute it on your computer.
If the owner part of the pipeline name is omitted, Nextflow will look among the pipelines you have already executed for one whose name matches the name specified. If none is found, it will try to download it using the organisation name defined by the environment variable NXF_ORG (which by default is nextflow-io).
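The fallback to NXF_ORG can be pictured with a small shell sketch. The qualify_name helper is hypothetical, and it deliberately skips the first step described above (searching the pipelines you have already executed); it only shows how the default organisation fills in a missing owner.

```shell
# Hypothetical helper: add the default organisation when no owner is given.
# It does not search previously executed pipelines, unlike Nextflow itself.
qualify_name() {
    name="$1"
    case "$name" in
        */*) echo "$name" ;;                           # owner already present
        *)   echo "${NXF_ORG:-nextflow-io}/$name" ;;   # fall back to NXF_ORG
    esac
}

(unset NXF_ORG; qualify_name hello)    # prints nextflow-io/hello
(NXF_ORG=cbcrg; qualify_name hello)    # prints cbcrg/hello
```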
Tip
To access a private repository, specify the access credentials by using the -user command line option; the program will then ask you to enter the password interactively. Private repository access credentials can also be defined in the SCM configuration file.
Handling revisions
Any Git branch, tag or commit ID defined in a project repository can be used to specify the revision that you want to execute when launching a pipeline, by adding the -r option to the run command line. So, for example, you could enter:
nextflow run nextflow-io/hello -r mybranch
or
nextflow run nextflow-io/hello -r v1.1
These commands execute two different project revisions, corresponding to the Git branch and tag with those names.
New in version 24.03.0-edge.
Nextflow downloads and locally maintains each explicitly requested Git branch, tag or commit ID in a separate directory path, which makes it possible to run multiple revisions of the same pipeline at the same time. Each downloaded revision is stored in a sister path to that of the default revision, with an extra suffix string :<revision id>.
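Assuming the default assets location shown elsewhere on this page ($HOME/.nextflow/assets), the following sketch prints where a project and one of its revisions would be expected on disk. The revision_path helper is hypothetical; it only applies the suffix convention described above and does not inspect what Nextflow has actually downloaded.

```shell
# Hypothetical helper: expected local path for a project revision,
# assuming the default assets directory and the ":<revision id>" suffix.
revision_path() {
    project="$1"          # qualified name, e.g. nextflow-io/hello
    revision="${2:-}"     # branch, tag or commit ID; empty for the default
    base="$HOME/.nextflow/assets/$project"
    if [ -n "$revision" ]; then
        echo "$base:$revision"
    else
        echo "$base"
    fi
}

revision_path nextflow-io/hello        # path of the default revision
revision_path nextflow-io/hello v1.1   # sister path for the v1.1 tag
```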
Warning
If you care about the reproducibility of your pipelines, you should explicitly refer to them by tag or commit ID, rather than by branch. This is because the same branch will point to different underlying commits over time, as pipeline development goes on. This caveat is particularly relevant in a scenario where multiple people manage and share the same local collection of pipelines.
Commands to manage projects
The following commands allow you to perform some basic operations that can be used to manage your projects.
Note
Nextflow is not meant to completely replace the Git tool. You may still need git to create new repositories or commit changes, etc.
Listing available projects
The list command allows you to list all the projects you have downloaded to your computer. For example:
nextflow list
This prints a list similar to the following one:
cbcrg/ampa-nf
cbcrg/piper-nf
nextflow-io/hello
nextflow-io/examples
Showing project information
By using the info command you can show information about a downloaded project. For example:
$ nextflow info hello
project name: nextflow-io/hello
repository  : http://github.com/nextflow-io/hello
local path  : $HOME/.nextflow/assets/nextflow-io/hello
main script : main.nf
revisions   :
P master (default)
  mybranch
P v1.1 [t]
  v1.2 [t]
Starting from the top, it shows: 1) the project name; 2) the Git repository URL; 3) the local path where the default project can be found (alternate revisions are in sister paths with an extra suffix :<revision id>); 4) the script that is executed when launched; 5) the list of available revisions, i.e. branches and tags. Tags are marked with a [t] on the right, and the locally pulled revisions are marked with a P on the left.
Pulling or updating a project
The pull command allows you to download a project from a GitHub repository or to update it if that repository has already been downloaded. For example:
nextflow pull nextflow-io/examples
Alternatively, you can use the repository URL as the name of the project to pull:
nextflow pull https://github.com/nextflow-io/examples
Downloaded pipeline projects are stored in the folder $HOME/.nextflow/assets on your computer.
Viewing the project code
The view command allows you to quickly show the content of the pipeline script you have downloaded. For example:
nextflow view nextflow-io/hello
Adding the -l option to the example above lists the content of the repository instead.
Cloning a project into a folder
The clone command allows you to copy a Nextflow pipeline project to a directory of your choice. For example:
nextflow clone nextflow-io/hello target-dir
If the destination directory is omitted, the specified project is cloned to a directory with the same name as the pipeline base name (e.g. hello) in the current folder.
The clone command can be used to inspect or modify the source code of a pipeline project. You can then commit and push back your changes by using the usual Git/GitHub workflow.
Deleting a downloaded project
Downloaded pipelines can be deleted by using the drop command, as shown below:
nextflow drop nextflow-io/hello
SCM configuration file
The file $HOME/.nextflow/scm allows you to centralise the security credentials required to access private project repositories on Bitbucket, GitHub and GitLab source code management (SCM) platforms or to manage the configuration properties of private server installations (of the same platforms).
The configuration properties for each SCM platform are defined inside the providers section. Properties for the same provider are grouped together with a common name and delimited with curly brackets, as in this example:
providers {
    <provider-name> {
        property = value
        // ...
    }
}
In the above template, replace <provider-name> with one of the “default” servers (i.e. bitbucket, github or gitlab) or a custom identifier representing a private SCM server installation.
New in version 20.10.0: A custom location for the SCM file can be specified using the NXF_SCM_FILE environment variable.
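As a minimal sketch of using a custom location, the commands below write a provider entry to a scratch file and point NXF_SCM_FILE at it. The temporary directory, the user name 'me' and the password 'my-secret' are placeholders chosen for illustration, and restricting the file permissions with chmod is a general precaution for files holding secrets rather than something this page requires.

```shell
# Keep credentials in a custom location (a temporary directory, for illustration).
scm_dir="$(mktemp -d)"
export NXF_SCM_FILE="$scm_dir/scm"

# Write a provider entry; 'me' and 'my-secret' are placeholder values.
cat > "$NXF_SCM_FILE" <<'EOF'
providers {
    github {
        user = 'me'
        password = 'my-secret'
    }
}
EOF

# Precaution (an assumption, not a Nextflow requirement):
# make the file readable and writable by its owner only.
chmod 600 "$NXF_SCM_FILE"

echo "SCM file written to $NXF_SCM_FILE"
```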
The following configuration properties are supported for each provider configuration:
providers.<provider>.user
User name required to access private repositories on the SCM server.
providers.<provider>.password
User password required to access private repositories on the SCM server.
providers.<provider>.token
Required only for private GitLab servers. Private API access token.
providers.<provider>.platform
Required only for private SCM servers. SCM platform name, either: github, gitlab or bitbucket.
providers.<provider>.server
Required only for private SCM servers. SCM server name including the protocol prefix, e.g. https://github.com.
providers.<provider>.endpoint
Required only for private SCM servers. SCM API endpoint URL, e.g. https://api.github.com (default: the same as providers.<provider>.server).
SCM providers
BitBucket credentials
Create a bitbucket entry in the SCM configuration file specifying your user name and app password, as shown below:
providers {
    bitbucket {
        user = 'me'
        password = 'my-secret'
    }
}
Note
App passwords are substitute passwords for a user account, which you can use for scripts and tool integrations in order to avoid putting your real password into configuration files. Learn more in the Bitbucket documentation on app passwords.
BitBucket Server credentials
BitBucket Server is a self-hosted Git repository and management platform.
Note
BitBucket Server uses a different API from the BitBucket cloud service. Make sure to use the right configuration whether you are using the cloud service or a self-hosted installation.
To access your local BitBucket Server, create an entry in the SCM configuration file as shown below:
providers {
    mybitbucket {
        platform = 'bitbucketserver'
        server = 'https://your.bitbucket.host.com'
        endpoint = 'https://your.bitbucket.host.com'
        user = 'your-user'
        password = 'your-password or your-token'
    }
}
GitHub credentials
Create a github entry in the SCM configuration file specifying your user name and access token, as shown below:
providers {
    github {
        user = 'your-user-name'
        password = 'your-personal-access-token'
    }
}
GitHub requires the use of a personal access token (PAT) in place of a password when accessing APIs. Learn more about PATs and how to create one in the GitHub documentation.
New in version 23.01.0-edge: Nextflow automatically uses the GITHUB_TOKEN environment variable to authenticate access to the GitHub repository if no credentials are provided via the scm file. This is especially useful when accessing pipeline code from a GitHub Action. Read more about token authentication in the GitHub documentation.
GitLab credentials
Create a gitlab entry in the SCM configuration file specifying the user name, password and your API access token, which can be found in your GitLab account page (sign in required). For example:
providers {
    gitlab {
        user = 'me'
        password = 'my-secret'
        token = 'YgpR8m7viH_ZYnC8YSe8'
    }
}
Tip
The GitLab token string can be used as the password value in the above setting. When doing that, the token field can be omitted.
Gitea credentials
Gitea is a Git repository server with GitHub-like GUI access. Since Gitea installation is quite easy, it is suitable for building a private development environment in your network. To access your Gitea server, you have to provide all the credential information below:
providers {
    mygitea {
        server = 'http://your-domain.org/gitea'
        endpoint = 'http://your-domain.org/gitea/api/v1'
        platform = 'gitea'
        user = 'your-user'
        password = 'your-password'
        token = 'your-api-token'
    }
}
See the Gitea documentation for how to enable API access on your server and how to issue a token.
Azure Repos credentials
Nextflow has built-in support for Azure Repos, a Git source code management service hosted in the Azure cloud. To access your Azure Repos with Nextflow, provide the repository credentials using the configuration snippet shown below:
providers {
    azurerepos {
        user = 'your-user-name'
        password = 'your-personal-access-token'
    }
}
Tip
The personal access token can be generated in the repository Clone Repository dialog.
AWS CodeCommit credentials
New in version 22.06.0-edge.
Nextflow supports AWS CodeCommit as a Git provider to access and share pipeline code.
To access your project hosted on AWS CodeCommit with Nextflow, provide the repository credentials using the configuration snippet shown below:
providers {
    my_aws_repo {
        platform = 'codecommit'
        user = '<AWS ACCESS KEY>'
        password = '<AWS SECRET KEY>'
    }
}
In the above snippet, replace <AWS ACCESS KEY> and <AWS SECRET KEY> with your AWS credentials, and my_aws_repo with a name of your choice.
Tip
The user and password are optional settings; if omitted, the AWS default credentials provider chain is used.
Then the pipeline can be accessed with Nextflow as shown below:
nextflow run https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/my-repo
In the above example, replace my-repo with your own repository. Note also that AWS CodeCommit has different URLs depending on the region in which you are working.
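Because the URL varies by region, it can help to build it from its parts. The sketch below follows the pattern of the example URL above; the codecommit_url helper is hypothetical, and the region and repository names are placeholders to replace with your own.

```shell
# Hypothetical helper: build an AWS CodeCommit HTTPS URL from a region
# and a repository name, following the pattern of the example above.
codecommit_url() {
    region="$1"   # AWS region, e.g. eu-west-1
    repo="$2"     # repository name, e.g. my-repo
    echo "https://git-codecommit.$region.amazonaws.com/v1/repos/$repo"
}

codecommit_url eu-west-1 my-repo
codecommit_url us-east-1 my-repo
```

The printed URL is what you would pass to nextflow run.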
Note
The support for protocols other than HTTPS is not available at this time.
Private server configuration
Nextflow is able to access repositories hosted on private BitBucket, GitHub, GitLab and Gitea server installations.
In order to use a private SCM installation, you will need to set the server name and access credentials in your SCM configuration file.
If, for example, the host name of your private GitLab server is gitlab.acme.org, you will need to have in the $HOME/.nextflow/scm file a configuration like the following:
providers {
    mygit {
        server = 'http://gitlab.acme.org'
        platform = 'gitlab'
        user = 'your-user'
        password = 'your-password'
        token = 'your-api-token'
    }
}
Then you will be able to run/pull a project with Nextflow using the following command line:
nextflow run foo/bar -hub mygit
Or, alternatively, using the Git clone URL:
nextflow run http://gitlab.acme.org/foo/bar.git
Note
You must also specify the server API endpoint URL if it differs from the server base URL. For example, for GitHub Enterprise V3, add endpoint = 'https://git.your-domain.com/api/v3'.
Warning
When accessing a private SCM installation over https from a server that uses a custom SSL certificate, you may need to import the certificate into your local Java keystore. Read more here.
Local repository configuration
Nextflow is also able to handle repositories stored in a local or shared file system. The repository must be created as a bare repository.
For example, given a bare repository stored at the path /shared/projects/foo.git, Nextflow is able to run it using the following syntax:
nextflow run file:/shared/projects/foo.git
See the Git documentation for more details about how to create and manage bare repositories.
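As a minimal sketch, a bare repository can be created with git init --bare. The example below uses a temporary directory as a stand-in for a shared file system path and only prints the matching Nextflow command; the location is an assumption made for illustration, and the repository is empty until a pipeline (with its main script) has been pushed to it.

```shell
# Scratch location standing in for a shared file system path.
shared="$(mktemp -d)"

# Create an empty bare repository (one without a working tree).
git init --bare --quiet "$shared/foo.git"

# Print the file: URL form that would be passed to Nextflow
# once a pipeline has been pushed to this repository.
echo "nextflow run file:$shared/foo.git"
```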
Publishing your pipeline
In order to publish your Nextflow pipeline to GitHub (or any other supported platform) and allow other people to use it, you only need to create a GitHub repository containing all your project script and data files. If you don’t know how to do it, follow this simple tutorial that explains how to create a GitHub repository.
Nextflow only requires that the main script in your pipeline project is called main.nf. A different name can be used by specifying the manifest.mainScript attribute in the nextflow.config file, which must be included in your project. For example:
manifest.mainScript = 'my_very_long_script_name.nf'
To learn more about this and other project metadata that can be defined in the Nextflow configuration file, read the Manifest section on the Nextflow configuration page.
Once you have uploaded your pipeline project to GitHub other people can execute it simply using the project name or the repository URL.
For example, if your GitHub account name is foo and you have uploaded a project into a repository named bar, the repository URL will be http://github.com/foo/bar and people will be able to download and run it by using either the command:
nextflow run foo/bar
or
nextflow run http://github.com/foo/bar
See the Running a pipeline section for more details on how to run Nextflow projects.
Manage dependencies
Computational pipelines are rarely composed of a single script. In real-world applications they depend on dozens of other components. These can be other scripts, databases, or applications compiled for a platform-native binary format.
External dependencies are the most common source of problems when sharing a piece of software, because the users need to have an identical set of tools and the same configuration to be able to use it. In many cases this has proven to be a painful and error prone process, that can severely limit the ability to reproduce computational results on a system other than the one on which it was originally developed.
Nextflow tackles this problem by integrating the GitHub, BitBucket and GitLab sharing platforms with Docker container technology.
The use of a code management system is important to keep together all the dependencies of your pipeline project and allows you to track the changes of the source code in a consistent manner.
Moreover, to guarantee that a pipeline is reproducible, it should be self-contained, i.e. it should ideally have no dependencies on the hosting environment. With Nextflow you can achieve this goal by following these methods:
Binary applications
Docker allows you to ship any binary dependencies that you may have in your pipeline to a portable image that is downloaded on-demand and can be executed on any platform where a Docker engine is installed.
In order to use it with Nextflow, create a Docker image containing the tools needed by your pipeline and make it available in the Docker Hub.
Then declare the name of the Docker image you have created in the nextflow.config file that you will include in your project. For example:
process.container = 'my-docker-image'
docker.enabled = true
In this way when you launch the pipeline execution, the Docker image will be automatically downloaded and used to run your tasks.
Read the Containers page to learn more on how to use containers with Nextflow.
This mix of technologies makes it possible to write self-contained and truly reproducible pipelines which require zero configuration and can be reproduced in any system having a Java VM and a Docker engine installed.
Bundling executables in the workflow
In most cases, software dependencies should be provided by the execution environment (container, conda/spack environment, or host-native modules).
In cases where you do not wish to modify the execution environment(s), executable scripts can be included in the bin/ directory in the workflow repository root. This can be useful to make changes that affect task execution across all environments with a single change.
To ensure your scripts can be made available to the task:
Write scripts in the bin/ directory (relative to the project repository root).
Specify a portable shebang (see the tip below for details).
Ensure the scripts are executable. For example:
chmod a+x bin/my_script.py
Tip
To maximize portability of your bundled script, it is recommended to avoid hard-coding the interpreter path in the shebang line.
For example, the shebang definitions #!/usr/bin/python and #!/usr/local/bin/python both hard-code specific paths to the python interpreter. To improve portability, rely on env to dynamically resolve the path to the interpreter. An example of the recommended approach is:
#!/usr/bin/env python
Using bundled executables in the workflow
Nextflow will automatically add the bin/ directory to the PATH environment variable, and the scripts will automatically be accessible in your pipeline without the need to specify an absolute path to invoke them.
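The effect can be reproduced by hand in a plain shell, which is a convenient way to check a bundled script before committing it. In the sketch below a temporary directory stands in for the workflow repository root and say_hello.sh is a hypothetical script name; extending PATH manually only imitates what Nextflow does for tasks.

```shell
# Scratch directory standing in for the workflow repository root.
project="$(mktemp -d)"
mkdir -p "$project/bin"

# A bundled script with a portable shebang (hypothetical script name).
cat > "$project/bin/say_hello.sh" <<'EOF'
#!/usr/bin/env sh
echo "Hello from a bundled script"
EOF

# Make the script executable, as required for bundled scripts.
chmod a+x "$project/bin/say_hello.sh"

# Imitate Nextflow by putting bin/ on PATH, then call the script by name.
PATH="$project/bin:$PATH"
say_hello.sh
```

If the last command fails with a permission or not-found error, the executable bit or the shebang line is the first thing to check.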
Utility code
Any Groovy scripts or JAR files in the lib directory will be automatically loaded and made available to your pipeline scripts. The lib directory is a useful way to provide utility code or external libraries without cluttering the pipeline scripts.
System environment
Any environment variable that may be required by the tools in your pipeline can be defined with the env scope in the nextflow.config file, which must be included in the root directory of your project. For example:
env {
    DELTA = 'foo'
    GAMMA = 'bar'
}
See the Configuration page to learn more about the Nextflow configuration file.
Resource manager
When using Nextflow you don’t need to write the code to parallelize your pipeline for a specific grid engine/resource manager because the parallelization is defined implicitly and managed by the Nextflow runtime. The target execution environment is parametrized and defined in the configuration file, thus your code is free from this kind of dependency.
Bootstrap data
Whenever your pipeline requires some files or datasets to carry out an initialization step, you can include this data in the pipeline repository itself and distribute it together with the code.
To reference this data in your pipeline script in a portable manner (i.e. without the need to use a static absolute path), use the implicit variable baseDir, which locates the base directory of your pipeline project.
For example, you can create a folder named dataset/ in your repository root directory and copy the required data file(s) there; you can then access this data in your script by writing:
sequences = file("$baseDir/dataset/sequences.fa")
sequences.splitFasta {
    println it
}
User inputs
Nextflow scripts can be easily parametrised to allow users to provide their own input data. Simply declare at the top of your script all the parameters it may require, as shown below:
params.my_input = 'default input file'
params.my_output = 'default output path'
params.my_flag = false
// ...
The actual parameter values can be provided on the command line when launching the script execution, by prefixing the parameter name with a double minus character, i.e. --, for example:
nextflow run <your pipeline> --my_input /path/to/input/file --my_output /other/path --my_flag true