
DataHoarderCloud Organization

hashin
Hash File Index Standard

Revision 1

Pre-Release Draft v.0.9.0

Breaking changes to be expected until release

Date: June 13, 2019


Table of contents
1. Introduction
2. Scope
3. Terms and definitions
4. Clauses
4.1. Definition
4.1.1. Byte-String
4.1.2. Tier Requirements
4.1.2.1 Tier 1 Requirements
4.1.2.2 Tier 2 Requirements
4.1.2.3. Tier 3 Requirements
4.1.2.4. Additional Tier Requirements
4.2. Versions
4.2.0 Version 0
4.2.0.1. Byte-String
4.2.1. Version 1
4.2.1.1. Byte-String
4.2.1.2. Tier Requirements
4.2.1.2.1. Tier 1 Requirements
4.2.1.2.2. Tier 2 Requirements
4.2.1.2.3. Tier 3 Requirements
4.2.1.2.4. Tier 4 Requirements
4.2.1.3. Examples
4.3. Extensions
5. Annexes
5.1. Annex A (informative)
5.2 Annex B (normative)

1. Introduction
This document was developed in response to current file-indexing solutions
using different and incompatible methods of identifying files by a short
variable. There is no as-simple-as-possible standard to uniquely identify
individual files that ensures compatibility between services and allows for
independence of their users should one service become unavailable.
Additionally, none of the current standards encourages a mentality that allows
passive finding of file-duplicates between users.

2. Scope
This document specifies a hash-based index format.
It is applicable to individual files, both lossless and lossy, and zip-archives.
It is not applicable to collections of files or to archive types other than
zip.

3. Terms and definitions


For the purpose of this document, the following terms and definitions apply:

Lossless

Referring to compression where the original can be fully recovered.

Lossy

Referring to compression where the original cannot be fully recovered. The
resulting files can cause user-perceived file-duplication, because
hash-algorithms produce different hashes for differently compressed files.

Byte-String

Sequence of bytes.

Flag-Bits

Bits that serve solely to indicate a status.


Source-File

Referring to a file that is used to generate a hashin.

User-Perceived File-Duplicate

Two files which seem very similar or the same to a user but are in fact
different and thus produce different hashes.

hashin

An unchangeable byte-string consisting of a hash and additional variables
as defined in this document.

4. Clauses

4.1. Definition

The hashin (Hash File Index Standard) consists of two separate parts:

4.1.1. Byte-String

The Byte-String is the body of every hashin. It shall be in the following
format:

Part 1: Flag-bits defining its version.

Part 2: Tier requirement bits, if applicable in the relevant version.

Part 3: The hash of the content of the source-file as defined by the individual
version. [1]

Part 4: Bits defining the size of the source-file.

[1] The hash shall not be the result of a hash-algorithm that includes the
filename or any metadata of the source-file during the hashing-process.

4.1.2. Tier Requirements

The tier requirements are rules that define which category hashins shall fall
under, depending on their source-files. They also define how and in which
format source-files shall be hashed.

4.1.2.1 Tier 1 Requirements

Tier 1 requirements are aimed at source-files which are under little to no
risk of generating different hashins for user-perceived file-duplicates. They
should be used for the professional management of archives.

4.1.2.2 Tier 2 Requirements

Tier 2 requirements are aimed at source-files which are under high risk of
generating different hashins for user-perceived file-duplicates. They should
be used for casual management of archives.

4.1.2.3. Tier 3 Requirements

Tier 3 requirements are aimed at source-files which are considered
exotic and zip-archives containing a collection of files which are
dependent on each other.

4.1.2.4. Additional Tier Requirements

A version may add new requirement-lists or remove existing ones.


4.2. Versions

The versions are current revisions that shall define the exact specifications of
the standard. Every new version shall be backwards-compatible with all older
versions, provided the hash-algorithm stays the same between versions.

4.2.0 Version 0

4.2.0.1. Byte-String

The byte-string of version 0 shall consist of 304 bits, or 38 bytes, in total. It
shall be in the following format:

Part 1: 2 flag-bits, which shall always read 10.

Part 2: 256 hash-bits containing a sha-256 hash.

Part 3: 46 file-size bits, allowing a file-size up to 70,368,744,177,664 bytes
or ~70 TB.
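
As a non-normative illustration, the layout above could be packed as follows in
Python. The function name hashin_v0 and the most-significant-bit-first packing
order are assumptions of this sketch and are not defined by this document. The
three parts listed above add up to 2 + 256 + 46 = 304 bits, which is 38 bytes.

```python
import hashlib


def hashin_v0(content: bytes) -> bytes:
    """Pack a version 0 hashin: 2 flag-bits, 256 hash-bits, 46 size-bits."""
    size = len(content)
    if size >= 1 << 46:
        raise ValueError("file size does not fit into 46 bits")
    digest = int.from_bytes(hashlib.sha256(content).digest(), "big")
    # Assumption: the parts are concatenated most-significant bit first.
    value = (0b10 << (256 + 46)) | (digest << 46) | size
    return value.to_bytes(38, "big")


hashin = hashin_v0(b"Testdata")
print(len(hashin))  # 38
```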

4.2.0.2. Tier Requirements

Version 0 does not contain any tier requirements.

4.2.1. Version 1

4.2.1.1. Byte-String

The byte-string of version 1 shall consist of 312 bits, or 39 bytes, in total. It
shall be in the following format:

Part 1: 8 flag-bits, the first 6 defining the version, the last two defining the
tier requirement list it falls under. The first 6 bits shall always be 000000.
The last two shall read 00 in case of Tier 1 Requirements met, 01 in case
of Tier 2 Requirements met, 10 in case of Tier 3 Requirements met and 11
in case of Tier 4 Requirements met.

Part 2: 256 hash-bits containing a sha-256 hash.

Part 3: 48 file-size bits, allowing a file-size up to 281,474,976,710,656
bytes, or ~281 TB.
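
A minimal, non-normative Python sketch of this layout is given below. The
function name hashin_v1 and its tier parameter are hypothetical. The file-size
is written most-significant byte first, which is an assumption taken from the
example in 4.2.1.3.

```python
import hashlib


def hashin_v1(content: bytes, tier: int) -> bytes:
    """Pack a version 1 hashin: 8 flag-bits, 256 hash-bits, 48 size-bits."""
    if tier not in (1, 2, 3, 4):
        raise ValueError("tier must be 1, 2, 3 or 4")
    size = len(content)
    if size >= 1 << 48:
        raise ValueError("file size does not fit into 48 bits")
    # The version bits 000000 leave only the two tier bits in the first byte:
    # 00 = Tier 1, 01 = Tier 2, 10 = Tier 3, 11 = Tier 4.
    flags = bytes([tier - 1])
    digest = hashlib.sha256(content).digest()
    # Assumption: the file-size is stored big-endian in 6 bytes.
    return flags + digest + size.to_bytes(6, "big")


hashin = hashin_v1(b"Testdata", tier=3)
print(len(hashin))  # 39
```

The caller is expected to decide the tier beforehand; the sketch does not
inspect the source-file type.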
4.2.1.2. Tier Requirements

4.2.1.2.1. Tier 1 Requirements

Tier 1 shall consist of:

1. Executables, binary packages and disk images

2. Document formats

3. Lossless video formats

4. Lossless audio formats

5. Lossless image formats with contents of public interest or importance

The following source-file types are applicable:

.exe
.sh
.bin
.iso
.pdf
.epub
.csv
.doc
.docx
.odt
.xls
.xlsx
.ods
.flv
.flac
.png
.bmp
.tiff

Special rules:

Any source-image file shall be archived into a zip (as per specification
given in Tier 3 requirement list) containing a single file (itself) to apply
for Tier 1.
4.2.1.2.2. Tier 2 Requirements

Tier 2 shall consist of:

1. Lossy video formats

2. Lossy audio formats

3. Lossy image formats with contents of public interest or importance

The following source-file types are applicable:

.mp4
.mkv
.mp3
.wav
.jpg

Special rules:

Any source-image file shall be archived into a zip (as per specification
given in Tier 3 requirement list) containing a single file (itself) to apply
for Tier 2.

4.2.1.2.3. Tier 3 Requirements

Tier 3 shall consist of any source-files not applicable to Tier 1 and Tier
2, except images. Archives shall also be included, which shall be in the
zip-format only.

Special rules:

Zips shall be created without any metadata or timestamps and
compressed by gzip with the minimal necessary parameters to fulfill
this requirement. (TODO/Input needed: Discuss usage of archives)

Source-files which are not dependent on each other or fall into Tier 1
or Tier 2 should not be packed together into one archive. Additionally,
files which are dependent on each other but easily traceable (for
example, a small .exe to unpack one or more large .bin files) should be
hashed independently instead of packed together with one single
hash for all files.
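
The archive rule above is still open (see the TODO), so the following Python
sketch is only one possible reading of it. It assumes that "compressed by gzip"
means the DEFLATE method, and that "without timestamps" can be approximated by
a fixed timestamp, since the zip format always stores one. The function name
single_file_zip is hypothetical.

```python
import io
import zipfile


def single_file_zip(name: str, data: bytes) -> bytes:
    """Build a zip with one entry and no variable metadata."""
    # Assumption: 1980-01-01 00:00:00, the earliest zip timestamp, stands in
    # for "no timestamp".
    info = zipfile.ZipInfo(filename=name, date_time=(1980, 1, 1, 0, 0, 0))
    info.compress_type = zipfile.ZIP_DEFLATED
    info.create_system = 0  # do not record the creating platform
    info.external_attr = 0  # do not record file permissions
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(info, data)
    return buffer.getvalue()


first = single_file_zip("picture.png", b"example bytes")
second = single_file_zip("picture.png", b"example bytes")
print(first == second)  # True
```

Identical output is only expected for the same compression library and
settings; different implementations may compress the same input differently,
which is one reason the rule needs an exact specification.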

(TODO/Input needed: Specify more file-formats in the lists)


4.2.1.2.4. Tier 4 Requirements

Version 1 contains an additional requirements list solely dedicated to
image-formats with no public interest or importance.

Special rules:

Tier 4 source-files shall be hashed individually and without putting
them into an archive.
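
The tier lists of version 1 can be sketched as a lookup by file-extension, as
shown below in Python. This is a non-normative illustration: the names
classify_tier and public_interest are hypothetical, the image extensions are
limited to those listed in Tier 1 and Tier 2, and whether an image is of public
interest has to be supplied by the caller. The zip-wrapping required for Tier 1
and Tier 2 images is not performed here.

```python
from pathlib import PurePath

TIER1_EXT = {
    ".exe", ".sh", ".bin", ".iso", ".pdf", ".epub", ".csv", ".doc", ".docx",
    ".odt", ".xls", ".xlsx", ".ods", ".flv", ".flac", ".png", ".bmp", ".tiff",
}
TIER2_EXT = {".mp4", ".mkv", ".mp3", ".wav", ".jpg"}
# Assumption: only the image extensions named in the lists are recognized.
IMAGE_EXT = {".png", ".bmp", ".tiff", ".jpg"}


def classify_tier(filename: str, public_interest: bool = True) -> int:
    """Return the version 1 tier (1 to 4) for a source-file name."""
    ext = PurePath(filename).suffix.lower()
    if ext in IMAGE_EXT and not public_interest:
        return 4  # images with no public interest or importance
    if ext in TIER1_EXT:
        return 1
    if ext in TIER2_EXT:
        return 2
    return 3  # everything else, including zip-archives


print(classify_tier("testfile"))  # 3
print(classify_tier("holiday.jpg", public_interest=False))  # 4
```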

4.2.1.3. Examples

Example 1:

We have a source-file called "testfile" with the content "Testdata". The
length of the source-file is 8 bytes.

The version 1 hashin of this source-file is the following in hexadecimal:

02-89-71-5F-B8-A3-3F-2E-77-1E-66-D8-68-C1-C5-05-91-E8-75-93-F5-B9-
0D-37-0E-43-FA-DA-79-5B-E5-E4-3A-00-00-00-00-00-08

In parts:

02 hex, in binary: 0000 0010

0000 00 the identifier for version 1

10 the 2 flag-bits specifying that it falls in Tier 3 because the
source-file does not have a file-extension and thus does not fall into any
list of another tier and is not an archive.

89-71-5F-B8-A3-3F-2E-77-1E-66-D8-68-C1-C5-05-91-E8-75-93-F5-B9-0D-
37-0E-43-FA-DA-79-5B-E5-E4-3A hex, the sha-256 hash.

00-00-00-00-00-08 hex, in decimal: 8, the filesize of 8 Bytes.
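
The breakdown above can be reproduced with a short Python sketch that splits
the example byte-string into its parts. The function name parse_hashin_v1 and
the dictionary keys are hypothetical; the byte offsets follow from the 8, 256
and 48 bit parts of version 1.

```python
EXAMPLE = (
    "02-89-71-5F-B8-A3-3F-2E-77-1E-66-D8-68-C1-C5-05-91-E8-75-93-F5-B9-"
    "0D-37-0E-43-FA-DA-79-5B-E5-E4-3A-00-00-00-00-00-08"
)


def parse_hashin_v1(raw: bytes) -> dict:
    """Split a 39-byte version 1 hashin into its parts."""
    if len(raw) != 39:
        raise ValueError("a version 1 hashin is exactly 39 bytes long")
    flags = raw[0]
    return {
        "version_bits": flags >> 2,      # 000000 identifies version 1
        "tier": (flags & 0b11) + 1,      # 00 = Tier 1 ... 11 = Tier 4
        "sha256": raw[1:33].hex(),
        "size": int.from_bytes(raw[33:], "big"),
    }


parts = parse_hashin_v1(bytes.fromhex(EXAMPLE.replace("-", "")))
print(parts["version_bits"], parts["tier"], parts["size"])  # 0 3 8
```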


4.3. Extensions

Third parties may use any version greater than 0 and extend it with their own
attributes (which they shall put exclusively after the intact byte-string) as
long as they remain compatible with the original specification. They shall set
the last flag-bit to 1 and give their extension a name in the following
name-scheme (asterisks will from here on mark place-holders):

Hashin Version *version number* – Extension *author* v*version number*

The first version number shall be replaced with the version they are deriving
from and with which they are remaining compatible.

Author shall be replaced with the name of the individual, organization or
project creating the extension.

The second version number shall be replaced with the version number of the
extension developed by the third party.

Third parties specifically may not create standards with names in the following
name-scheme:

Hashin Version *version number*

or anything looking similar (for example v. instead of Version, version,
etc.) where version number consists exclusively of numbers and dots (for
example Hashin Version 1 or Hashin Version 2.50.6).

Third parties may create standards with names in the following name-scheme:

Hashin Version *Org* *R*

if Org consists of more than 5 letters (R is an arbitrary number of following
symbols, for example Hashin Version ACKR 4 or Hashin Version ACKR goohkr).

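
The extension name-scheme and the reserved name-scheme can be checked
mechanically, as the following non-normative Python sketch suggests. The
function names are hypothetical, and the regular expression is only one
possible interpretation of "anything looking similar"; it covers Version, Ver
and v, with or without a dot and in any letter case. The rule on the length of
Org is not checked.

```python
import re

# Assumption: "looking similar" is read as Version / Ver / v, optional dot,
# any letter case, followed only by digits and dots.
RESERVED = re.compile(r"^hashin\s+v(?:er(?:sion)?)?\.?\s*[0-9.]+$", re.IGNORECASE)


def extension_name(base_version: str, author: str, ext_version: str) -> str:
    """Format a third-party extension name in the name-scheme above."""
    # \u2013 is the dash used in the name-scheme of this clause.
    return f"Hashin Version {base_version} \u2013 Extension {author} v{ext_version}"


def is_reserved_name(name: str) -> bool:
    """Return True if the name is reserved for this organization."""
    return RESERVED.match(name.strip()) is not None


print(is_reserved_name("Hashin Version 2.50.6"))  # True
print(is_reserved_name("Hashin Version ACKR 4"))  # False
```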
5. Annexes

5.1. Annex A (informative)

It should be noted that in version 1 the motivation to place images in archives
is to avoid accidental inclusion of files which are likely of no interest
(example: private photos) and the motivation behind specifying the exact way of
archiving is to avoid user-perceived file-duplicates.

It should also be noted that version 0 is designed as a
lowest-possible-complexity baseline for version 1. It can thus be used for
separate standard expansions compatible with version 0, but alone should only
be used in local single-user environments. Additionally, it should be noted
that in case demands for sizes above 70 TB rise or different hash-algorithms
get adopted, versions 0.1, 0.2 and so on can be created to keep version 0
updated as a baseline.

Lastly, it should be noted that the rules for extension and derivative
name-scheming were made to avoid confusion between the specifications made by
third parties and this organization and to ensure maximal compatibility from
derived works, while still allowing third parties to use the Hashin name and
giving them the opportunity to continue developing the standard should this
organization cease to be.

5.2 Annex B (normative)

Anyone shall hereby be granted a noncommercial free-use license to any
version of hashin provided in this document if used with the naming-schemes
provided. Anyone shall hereby be granted the noncommercial rights to
modify and distribute, under the conditions laid out in this document, any
version of hashin provided in this document if providing a reference to hashin
and this organization in their derived work. This license and these rights are
unlimited and non-revocable. This organization, as well as its members and
any authors of or contributors to this document, shall not be held responsible
or legally liable for anyone's actions while using this standard or for any
resulting misuse.

Anyone shall hereby be granted a non-commercial, non-revocable license to
freely access and distribute this document. Anyone shall hereby be granted a
non-commercial, non-revocable license to freely modify this document,
provided it is made clear without a doubt that the editor does not represent
this organization, its members or any authors of or contributors to
this document, if this is the case, and a clear reference to this organization
is added in the derived work.
