Data Transfer and Backup on Remote Computers¶

This section covers the use of tools like scp , Rsync , sftp and Globus to make transfers and backups. While the examples place an emphasis on Compute Canada, they can be generalised to apply to any remote computer.

rsync-time-backup

rsync-time-backup is a utility available here that builds on top of Rsync .

It allows backups over SSH, backup resumes, uses hardlinks to avoid duplication and provides functionality to create full backups. In short, it can be used to create full snapshots, the frequency of which can be configured e.g. keep each hour of the last 24 hours, each day of the previous month and each month in the previous year.

You may find it useful to combine several techinques in the examples below and in the docmentation of this script.

Example: Backing up to Compute Canada Project Space¶

The project space is a great place for:

  • Frequent backups
  • Internal and external sharing

You may want to tarball collections of small files (multiple files < 100 MB) to keep the file count down and to keep the diretory uncluttered.

Note

This works if you set up the SSH key on the Compute Canada server

In this example, we are backing up a drive mounted to our system. In this instance, it is a NAS drive, althogh it could be a hard drive or any other external media.

                    $ rsync_tmbackup.sh /media/user/data cedar:/home/user/projects/def-pi/data/                  

You can replace cedar with graham or beluga as required

Example: Backing up to Remote Computer¶

Suppose you have an ssh machine computer.domain.com

                    $ rsync_tmbackup.sh /media/user/data user@computer.domain.com:/data/                  

Example: Backing up using an exclusion list¶

Suppose you do not want to backup certain directories, files or file types. You must create an exclusion list exclude-file.txt , for instance:

                                        secrets                    .                    txt                    folder1                    /*                    folder2                    folder3                    /                    scratch                    /                    *.                    mat                    *                  

This will exclude the file secrets.txt , copy the folder folder1 but not its contents, will not copy the folder folder2 or its contents, will not copy the subdirectory folder3/scratch/ or its contents, and will exclude all files with the .mat extension.

Local machine to external drive

                    $ rsync_tmbackup.sh /home/user /media/user/data/ excluded_patterns.txt                  

Compute Canada

                    $ rsync_tmbackup.sh /media/user/data cedar:/home/user/projects/def-pi/data/ excluded_patterns.txt                  

Example: Change the Expiration Strategy¶

From the README:

                    The default strategy is ``1:1 30:7 365:30``, which means:  * After 1 day, keep one backup every 1 day (1:1). * After 30 days, keep one backup every 7 days (30:7). * After 365 days, keep one backup every 30 days (365:30).                  

To change the strategy to: After 30 Days, keep one backup every 14 days,

STFP

From the Compute Canada Documentation:

                                    SFTP                  (                  Secure                  File                  Transfer                  Protocol                  )                  uses                  the                  SSH                  protocol                  to                  transfer                  files                  between                  machines                  which                  encrypts                  data                  being                  transferred                  .                

Unlike SCP , SFTP omes with an interactive prompt.

Dropping into the SFTP Prompt¶

                    $ sftp user@remote_hostname_or_ip_address                  

For instance,

                    $ sftp john@cedar.arc.ubc.ca                  

If you set up your SSH key on the remote computer, you won't even need a password.

If it worked, you should be in the prompt e.g.

                    $ sftp cedar Connected to cedar. sftp>                  

Exiting the SFTP Prompt¶

Help¶

Transferring File to Remote¶

                    sftp> put <local                    file or directory> <new name on remote                    [OPTIONAL]>                  

e.g.

                    sftp> put data.hdf5 Uploading data.hdf5 to /project/6006382/user/data.hdf5 data.hdf5                    100%   11GB                    100.3MB/s                    01:50                  
                    sftp> put data.hdf5 data_20181012.hdf5 Uploading data.hdf5 to /project/6006382/user/data_20181012.hdf5 data.hdf5                    100%   11GB                    100.3MB/s                    01:50                  

Transferring File from Remote¶

                    sftp> get <remote file or directory> <new name on                    local                    [OPTIONAL]>                  

e.g.

                    sftp> get data.hdf5 Fetching /project/6006382/user/data.hdf5 to data.hdf5 /project/6006382/user/data.hdf5                    100%   11GB                    100.1MB/s                    01:50                  
                    sftp> get data.hdf5 data_20181012.hdf5 Fetching /project/6006382/user/data.hdf5 to data_20181012.hdf5 /project/6006382/user/data.hdf5                    100%   11GB                    100.1MB/s                    01:50                  

SCP

From the Compute Canada Documentation:

                                    SCP                  stands                  for                  "Secure Copy"                  .                  Like                  SFTP                  it                  uses                  the                  SSH                  protocol                  to                  encrypt                  data                  being                  transferred                  .                  It                  does                  not                  support                  synchronization                  like                  Globus                  or                  rsync                  .                  Some                  examples                  of                  SCP                  use                  are                  shown                  here                  .                  SCP                  supports                  an                  option                  ,                  -                  r                  ,                  to                  recursively                  transfer                  a                  set                  of                  directories                  and                  files                  .                  We                  recommend                  against                  using                  scp                  -                  r                  to                  transfer                  data                  into                  /                  project                  because                  the                  setgid                  bit                  is                  turned                  off                  in                  the                  created                  directories                  ,                  which                  may                  lead                  to                  Disk                  quota                  exceeded                  or                  similar                  errors                  if                  files                  are                  later                  created                  there                  .                

Basic Usage¶

                    $ scp <location/file to copy from> <location/file to copy to>                  

Transferring Files¶

Suppose a folder in your current local working directory is as follows:

                    package/ ├── package │   ├── conf.py │   ├── __init__.py │   ├── models.py ├── LICENSE ├── README.md ├── setup.py └── tests     ├── test_interface     |   ├── tests.py     ├── test_models     ├── run_tests.py                  

Running this will only copy LICENSE , README.md and setup.py , and nothing in the other folders or subdirectories

                    $ scp package cedar:/home/user                  

Running this will copy everything

                    $ scp -r package cedar:/home/user                  

Note

The above examples will only work if you set up an ssh key on the remote computer

If using the full address of the remote computer, the equivalent examples are:

                    $ scp package username@cedar.computecanada.ca:/home/user                  
                    $ scp -r package username@cedar.computecanada.ca:/home/user                  

Transferring between two remote Computers

                    $ scp graham:/home/user cedar:/home/user                  
                    $ scp username@graham.computecanada.ca:/home/user username@cedar.computecanada.ca:/home/user                  

Globus¶

Option 1: Globus GUI¶

Tip

This can be extended to file transfers between personal endpoints and between Compute Canada servers

  1. Log in

../_images/1.PNG

  1. Click on Endpoints
  2. Click on Add an Endpoint then Globus Connect Personal
  3. Enter a name for your endpoint, e.g. My_workstation in this case. Check the This will be a high assurance endpoint box if dealing with highly sensitve data

../_images/4.PNG

  1. Generate the Setup Key and copy it to your clipboard.
  2. Using the links at the bottom of the page, install the Globus Connect Personal Client on your machine and follow the on screen instructions
  3. Run the client and paste your Setup Key, then click OK

../_images/4.PNG

  1. Add location(s) of data you want globus to be able to access by clicking on +.

../_images/8.PNG

Tip

  • Ticking Shareable will allow outbound transfers
  • Ticking Writable will allow inbound transfers
  1. Click Save when done
  2. Go back to your browser and click on the File Manager Tab
  3. Search for the Compute Canada server you want ot upoad the files to. In this example, we are using Graham.

../_images/11.PNG

  1. Log in with you compute canada credentials and click on Authenticate
  2. Your Home directory should now be displayed. Navigate to the folder you want to keep the data in. In this example I will use the globus_transfers directory in my home directory.
  3. Click on Transfer or Sync to... on the right side menu

../_images/14.PNG

  1. Click on Transfer or sync to... box. Click on Your Collections then on your desired endpoint's name.
  2. Navigate the directory structure on either endpoint and select the folder(s) or file(s) you want to transfer/sync to the other endpoint. Clicking on Transfer and Sync Options below, you can select a multitude of options for managing the content on the destination endpoint. Click on Start when done.

../_images/16.PNG

  1. You should see a message like: Transfer request submitted successfully. Task id: <TASK_ID> where <TASK_ID> is a system generated hash for your task.

../_images/17.PNG

18. The client on our endpoint will handle your transfer and send you an email when it is done. You can view the status of the transfer in the Activity tab 19. Looking at the filesystem on Graham upon completion, we can see that the file is indeed there:

                                        [<user>@gra-login2 globus_transfers]$ ls -lha total 555M drwxr-x---                    2                    <user> <user>                    3                    Jul                    18                    14:54 . drwx------                    6                    <user> <user>                    13                    Jul                    18                    14:38 .. -rw-r--r--                    1                    <user> <user> 550M Jul                    18                    15:02 2015_11_18_5_filtered0.1to10.mat                  

Option 2: Globus CLI¶

Globus offers a command line interface, which is useful for its convenience and for automating transfers and backups. Its documentation is available here.

Option 3: Archeion¶

Archeion can be downloaded here. Requirement: Globus personal must be set up on personal endpoints. It can be used to script transfers using python and provides functionality that handles authentication and transfer management.

Git-Annex¶

Git-Annex uses git to create an annex, which presents files to the user in a single directory structure, even though the individual files are distributed across multiple locations. It can also be confiogured to create a number of copies of a file distributed across different annexes. This enables users to remove a local copy while ensuring redundancies are available on other storage locations. It is also able to synchronise files across redundancies. File versions can be uniquely tracked and referenced using git changeset hashes.

It also comes with a webapp interface.

Note

Git-Annex is a powerful tool but requires knowledge of git, UNIX command line and careful scripting to use effectively.

Use Case Demo: Syncing files with Git-Annex using Linux CLI¶

The following demo was tested on Ubuntu 18.04 LTS

Note

sudo priviledges are required to install git-annex

Install the Git-Annex from NeuroDebian.

Create a repository in a location of your preference

                    $ mkdir annex $                    cd                    annex $ git init Initialized empty Git repository                    in                    /home/user/annex/.git/ $ git annex init init ok                  

Add file to the repository

                    $                    cd                    /home/user/annex/ $ cp ~/Pictures/121406.jpg ./ $ git annex add . add                    121406.jpg ok                    (recording state                    in                    git...)                    $ git commit -a -m added                    [master                    (root-commit)                    1259e1c]                    added                    1                    file changed,                    1                    insertion(+)                    create mode                    120000                    121406.jpg                  

Add a remote, in this case, an external hard drive called 'My Passport'

                    $                    cd                    /media/user/USB\                    DISK/ $ git clone /home/user/annex/ Cloning into                    'annex'...                    done. $                    cd                    annex                    # get the desktop and the hard drive to get to know each other                    $ git annex init $ git remote add desktop /home/user/annex $                    cd                    /home/user/annex $ git remote add harddrive1 /media/user/My\                    Passport/annex                  

Now let's add a bunch of files to the Desktop's Annex

And now let's add a bunch of other files to the Hard Drive's Annex

Looking at the contents of the Desktop annex, we see the following:

                    $ ls                    121406.jpg                    121419.jpg                    121421.jpg                  

Looking at the contents of the Hard drive annex, we see the following:

                    $ ls                    121406.jpg                    121415.jpg                    121420.jpg                  

Now we need to sync the files and make sure our annexes have the same contents

Now, looking at the contents of the Desktop annex, we see the following:

                    /home/user/annex$ ls                    121406.jpg                    121415.jpg                    121419.jpg                    121420.jpg                    121421.jpg                  

And also when looking at the contents of the Hard drive annex, we see the following:

                    /media/user/My Passport/annex$ ls                    121406.jpg                    121415.jpg                    121419.jpg                    121420.jpg                    121421.jpg                  

This can be automated as a cron job that syncs your files with your backups in regular intervals

Refer to the documentation to learn more about setting up ssh remotes, removing and transferring files and troubleshooting.

Downloading files from data respositories¶

FRDR¶

FRDR offers the option to download files using globus. Refer to the Globus GUI section above for instrutions on how to download files using Globus. Globus alkso provides the option of downloading the file(s) using a direct download link.

Direct Downloads¶

To download files using a direct downlaod link, for instance, via Dataverse, use wget or curl .

Example¶

To download the `Neurophotonics tutorial on making connectivity diagrams from Channelrhodopsin-2 stimulated data <>`_ dataset from Dataverse, using wget

                      $ wget https://dataverse.scholarsportal.info/api/access/datafile/77286?gbrecs=                      true                    

Alternatively, you can use curl

                      $ curl https://dataverse.scholarsportal.info/api/access/datafile/77286?gbrecs=                      true                      --output download.zip                      # or whatever you want to call the file you download. Keep in mind the file format.