Data management

When dealing with large amounts of data, it is recommended to follow a few simple principles. This page provides an introduction to managing your data properly.

Naming convention

Choosing meaningful file and folder names plays a big role in the readability of a dataset. Because the mass storage runs on Linux, it also enforces a set of mandatory naming rules.

Names can contain:

digits.
unaccented lowercase characters.
unaccented uppercase characters.
underscores and hyphens.

Names cannot contain:

whitespace characters.
punctuation other than underscore and hyphen.
diacritics such as accents, umlauts, tildes or cedillas.
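
The rules above can be checked automatically before data is copied to the mass storage. The following Python sketch is one possible check, not an official tool. The function name is_valid_name is hypothetical, and the sketch assumes that dots are tolerated so that file extensions pass, which the rules above do not state explicitly.

```python
import re

# Allowed characters according to the rules above: digits, unaccented
# lowercase and uppercase letters, underscores and hyphens.
# Assumption (not stated in the rules): dots are also accepted so that
# file extensions such as ".txt" or ".fastq.gz" pass the check.
_VALID_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_name(name: str) -> bool:
    """Return True if every character of `name` is allowed."""
    return _VALID_NAME.fullmatch(name) is not None
```

With this sketch, is_valid_name("sample_01-R1.fastq.gz") returns True, while "my file.txt" (whitespace) and "données.csv" (accented character) are both rejected.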

Organisation

While you are free to organise your data as you like inside your home directory and team or project directories, here are a few rules that can help you in the task of defining the ideal structure:

The names of files and folders should be meaningful and reflect their content.
Folders should be used to group files based on their context.
Each dataset should contain metadata files that describe its structure and content. For that purpose, it is recommended to use text files (.txt, .md, .json, .csv and .tsv are a good fit). These files can greatly increase the usability of the datasets in the long run.

Hint

Adding human-readable metadata files to a piece of data (be it a single file or a complex dataset) is, just like code documentation, a great time saver when the time comes to use or share said data. It also improves the FAIRness of the dataset by making it more reusable. As a rule of thumb, there is no such thing as “too much metadata”.
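
As an illustration, a metadata file can be generated with a few lines of Python. This is a minimal sketch under assumptions: the file name metadata.json, the function name write_metadata and the field names are hypothetical choices, not a schema required by the mass storage.

```python
import json
from datetime import date
from pathlib import Path


def write_metadata(dataset_dir, description, files):
    """Write a human-readable metadata.json file at the root of a dataset.

    `files` maps each file name to a short description of its content.
    The field names used here are illustrative, not a required schema.
    """
    metadata = {
        "description": description,
        "created": date.today().isoformat(),
        "files": files,
    }
    path = Path(dataset_dir) / "metadata.json"
    # indent=2 keeps the file readable in a plain text editor;
    # ensure_ascii=False keeps accented text in descriptions readable.
    path.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return path
```

JSON is used here because it is both human-readable and easy to parse. A plain .txt or .md file written by hand serves the same purpose.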

Maintenance

Note

Data maintenance tasks should be performed on a regular basis (at least quarterly). This helps improve the performance of the mass storage and helps keep the sharing of the available storage space fair.

To minimise storage usage, you should:

never duplicate files. The mass storage infrastructure already ensures that all files are securely backed up.
use hard and/or soft links when a file is needed in multiple directories.
avoid storing unnecessary files[1].
remove ROT files (Redundant, Obsolete, Trivial).
compress large text files.
group numerous small files in compressed archives.
archive old data that might be useful later.
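
The first two recommendations can be sketched in Python: find files with identical content, then replace a duplicate with a hard link. This is an illustrative sketch, not a tool provided by the mass storage. The function names are hypothetical, and it assumes that both files live on the same file system, which hard links require.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group regular files under `root` that have identical content."""
    by_digest = defaultdict(list)
    seen_inodes = set()
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and symbolic links; only compare real files.
        if path.is_file() and not path.is_symlink():
            stat = path.stat()
            inode = (stat.st_dev, stat.st_ino)
            if inode in seen_inodes:
                continue  # already hard-linked to a file seen earlier
            seen_inodes.add(inode)
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


def replace_with_hard_link(original, duplicate):
    """Replace `duplicate` with a hard link to `original`.

    The link is created under a temporary name first, so the duplicate
    is only replaced once the link exists.
    """
    duplicate = Path(duplicate)
    temporary = duplicate.with_name(duplicate.name + ".tmp_link")
    os.link(original, temporary)
    os.replace(temporary, duplicate)
```

It is safer to review the groups returned by find_duplicates before replacing anything, since files with identical content are not always meant to stay identical.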

Compression

Compressing files is very useful to save space on the mass storage, especially when dealing with large text files (e.g., sequencing data), or when storing hundreds/thousands of related small files.

For instance, an uncompressed sequencing file containing 30 million reads can take up to 7 GB of disk space, while its compressed version takes roughly 1.4 GB, a saving of 80%.

Moreover, most analysis software can read compressed files and, if not, it is always possible to uncompress them on the fly.
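
The following Python sketch illustrates both points with the standard gzip module: compressing a text file, then reading the compressed file on the fly without writing an uncompressed copy to disk. The function names are hypothetical, and gzip is only one of several possible formats.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source):
    """Compress `source` to `source.gz` and return the new path.

    The original file is left in place so that the result can be
    checked before the uncompressed copy is removed.
    """
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def count_lines(path):
    """Count the lines of a gzip-compressed text file on the fly."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def space_saving(original_size, compressed_size):
    """Return the fraction of space saved by compression."""
    return 1 - compressed_size / original_size
```

With the figures quoted above, space_saving(7.0, 1.4) gives about 0.8, which matches the 80% saving.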

Note

For more information on data compression tools, refer to the compression tools page.

Tip

Compressing a small file can yield a larger file because of the added compression information. It is recommended not to compress files smaller than 50 MB. Likewise, compressing an already compressed file will not save any space.
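
This rule of thumb can be expressed as a small Python check. It is a sketch under assumptions: the function name should_compress is hypothetical, 1 MB is counted as 10**6 bytes, and the list of extensions treated as already compressed is illustrative and not exhaustive.

```python
from pathlib import Path

# Threshold taken from the tip above: files smaller than 50 MB are not
# worth compressing. Assumption: 1 MB is counted as 10**6 bytes.
MIN_SIZE_BYTES = 50 * 10**6

# Assumption: an illustrative, non-exhaustive list of extensions that
# usually indicate already compressed content.
COMPRESSED_SUFFIXES = {".gz", ".bz2", ".xz", ".zip", ".zst"}


def should_compress(path, size_bytes=None):
    """Return True if compressing `path` is likely to save space.

    If `size_bytes` is not given, the size is read from the file itself.
    """
    path = Path(path)
    if path.suffix.lower() in COMPRESSED_SUFFIXES:
        return False  # compressing again will not save any space
    if size_bytes is None:
        size_bytes = path.stat().st_size
    return size_bytes >= MIN_SIZE_BYTES
```

For example, should_compress("reads.fastq", 7 * 10**9) returns True, whereas the same call with "reads.fastq.gz" returns False because the file is already compressed.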

If you need help or want advice on how to compress/uncompress your data, please contact the GIGA bioinformatics team.

Other resources

J. Biernaux, “Research data management and reproducibility”, GIGA doctoral school 2022