Hashing background
AI systems depend on large data files.
TAIBOM-like assurance requires the hashing and attestation of these same large data sets.
From a security perspective, the choice of hashing algorithm determines the security properties of the solution.
From an engineering perspective, hashing is expensive, so performance needs to be considered.
Hashing specification
Requirement: TAIBOM should use SHA-256 or SHA3-256 for its hashing functions.
TODO: should we consider recommending additional hashing functions?
TODO: TAIBOM site overview - split into specification and engineering sections
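As an illustration of the requirement above, a minimal sketch of streaming either permitted digest over a large file using Python's standard hashlib; the chunk size and the file name are illustrative choices, not part of the specification:

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB reads; an illustrative choice, not mandated by the spec

def file_digest(path: str, algorithm: str = "sha256") -> str:
    """Stream the file through the hash so large files never sit fully in memory."""
    h = hashlib.new(algorithm)  # "sha256" or "sha3_256", per the requirement
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()

# Both permitted algorithms over the same (hypothetical) file:
# print(file_digest("weights.bin", "sha256"))
# print(file_digest("weights.bin", "sha3_256"))
```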
Hashing research: performance benchmarks
We should initiate research to establish the typical performance of hash creation and validation. This can probably be desk-based research. We will need some indicative control parameters:
- Hashing algorithm
- CPU speed
- File size
- Disk speed
Query: is hash creation slower than validation?
https://www.numberanalytics.com/blog/computational-complexity-hash-based-signatures
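A desk-based starting point could be a micro-benchmark like the sketch below (Python standard library only; the 256 MiB workload is an illustrative assumption). Note that this measures CPU-bound digest creation only, excluding disk speed. On the query above: for a plain hash, validation is the same computation plus a byte comparison, so creation and validation costs are essentially equal; the creation/verification asymmetry discussed in the link above applies to hash-based signatures rather than bare hashes.

```python
import hashlib
import os
import time

def hash_throughput(algorithm: str, total_mib: int = 256) -> float:
    """Rough single-core hashing throughput in MiB/s.
    CPU speed, file size and disk speed (excluded here) are the
    control parameters listed above."""
    block = os.urandom(1 << 20)  # 1 MiB of random input, reused each iteration
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(total_mib):
        h.update(block)
    h.hexdigest()
    return total_mib / (time.perf_counter() - start)

for alg in ("sha256", "sha3_256"):
    print(f"{alg}: {hash_throughput(alg):.0f} MiB/s")
```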
Hashing research: Industry best practice
We should initiate quick research on best practice in the following areas:
- Hardware optimisation methods: for extremely large data sets, are there hardware-based solutions that can speed up hashing?
- Software parallelisation methods: for extremely large data sets, are there recommended best practices for software configuration and use of parallelisation? (A sketch of one chunked approach follows this list.)
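SHA-256 over a single stream is inherently sequential, so software parallelism generally means splitting the file into chunks and hashing chunks concurrently. A minimal sketch, assuming only the Python standard library; the chunk size and the digest-of-digests combination step are illustrative choices that would themselves need to be standardised for interoperability:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 64 << 20  # 64 MiB chunks; tuning this is one of the open questions

def _hash_chunk(args: tuple[str, int]) -> bytes:
    # Each worker re-opens the file and hashes one fixed-size slice.
    path, offset = args
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(CHUNK_SIZE))
    return h.digest()

def parallel_digest(path: str, workers: int | None = None) -> str:
    size = os.path.getsize(path)
    jobs = [(path, off) for off in range(0, size, CHUNK_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        chunk_digests = list(pool.map(_hash_chunk, jobs))
    # Combine chunk digests into one root digest. NOTE: this value differs
    # from a plain SHA-256 of the whole file, so the chunking scheme must
    # be agreed before hashes are comparable across implementations.
    root = hashlib.sha256()
    for d in chunk_digests:
        root.update(d)
    return root.hexdigest()

# if __name__ == "__main__":
#     print(parallel_digest("weights.bin"))  # hypothetical file
```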
Hashing research: algorithmic approaches
Distinct from the computational parallelisation methods above, which address raw throughput, we should carry out some high-level analysis of hashing algorithmic approaches and data structures.
This should cover the following considerations:
- Recommendations on splitting large files into smaller files, and on chunk/cluster sizing
- How to create stable links between hashes
- Comparison of hash tree and hash graph methods (a minimal hash-tree sketch follows this list)
- Consideration of how a W3C VC can hold a stable, immutable reference to another VC
- Consideration of the integration of JSON-LD into W3C VCs, and how this relates to stable links
Fundamentally, do we need to distinguish between shallow links and deeply verifiable links?
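For the hash-tree option, a minimal sketch of building a Merkle root from per-chunk digests, assuming SHA-256 and a duplicate-last-node rule on odd levels; both are conventions that any specification would need to pin down:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunk_digests: list[bytes]) -> bytes:
    """Fold pairs of digests level by level until a single root remains.
    Any chunk can then be verified against the root with a proof of
    O(log n) sibling hashes, rather than re-hashing the whole data set."""
    if not chunk_digests:
        return _h(b"")
    level = list(chunk_digests)
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

A tree-shaped digest could also give a natural home for the "stable links" above: a VC could reference the root while per-chunk proofs pin individual pieces of the data set.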
Probabilistic Hashing
TODO: explore the use of probabilistic hashing as a practical audit method
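One possible reading of this, as an audit technique: rather than re-hashing an entire multi-gigabyte artefact on every check, re-hash a random sample of chunks against previously attested per-chunk digests. A sketch under that assumption; the chunk size, sample size, and the expected-digest map are all illustrative:

```python
import hashlib
import os
import random

CHUNK_SIZE = 1 << 20  # 1 MiB; illustrative

def sampled_audit(path: str, expected: dict[int, str], sample: int = 16) -> bool:
    """Re-hash a random subset of chunks against known per-chunk digests.
    A single tampered chunk escapes one audit with probability about
    1 - sample/total_chunks, which shrinks quickly over repeated audits."""
    total = (os.path.getsize(path) + CHUNK_SIZE - 1) // CHUNK_SIZE
    with open(path, "rb") as f:
        for i in random.sample(range(total), min(sample, total)):
            f.seek(i * CHUNK_SIZE)
            if hashlib.sha256(f.read(CHUNK_SIZE)).hexdigest() != expected[i]:
                return False
    return True
```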
Actions
| Action Item | Owner | Notes |
|---|---|---|
| Clean up and separate sections in spec | Nick Allott | Distinguish spec, background, and engineering notes. Still to do |
| Confirm whether SHA-256 is quantum-safe | Gareth | Added to doc |
| Investigate performance benchmarks for hashing | Ryan Baldwin | Use existing studies or run new tests |
| Share source discussing hash generation vs verification speed | Mark | Posted link in shared doc. DONE |
| Confirm definitively whether SHA-256 is non-parallelisable | Ryan Baldwin | Clarify implications and alternatives |
| Liaise hashing/addressing issues with W3C | Nick Allott | Especially around deep linking and hash-in-URI. DONE, waiting on response |
| Collect industry best practices on hashing, chunking, file handling | All | Check for optimised hardware, hashing libraries |
Draft content
Research links
NVIDIA hash maps
https://developer.nvidia.com/blog/maximizing-performance-with-massively-parallel-hash-maps-on-gpus/
Query: is there a serialisation standard?
SHA-3 Wikipedia
https://en.wikipedia.org/wiki/SHA-3
Interesting points on:
- performance
- hardware optimisation trade-offs
- parallelisation
NIST SP 800-185
http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf