5. Introduction Hashing Techniques Applications
HASHING
The idea of hashing is to distribute the entries of a dataset
across an array of buckets.
Given a key, the algorithm computes an index that
suggests where an entry can be found:
index = f(key, array_size)
Often this is done in two steps:
hash = hashfunc(key)
index = hash % array_size
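The two steps above can be sketched in Python. This is only an illustration: `hashfunc` here is a hypothetical polynomial string hash, and `bucket_index` is a name introduced for the example; neither comes from the slides.

```python
def hashfunc(key: str) -> int:
    """Illustrative polynomial hash (an assumption, not the slides' function)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def bucket_index(key: str, array_size: int) -> int:
    """Step 1: hash the key. Step 2: reduce the hash to a valid array index."""
    return hashfunc(key) % array_size
```

Whatever the key, the result always lies in the range 0 to array_size - 1, so it can be used directly as an array index.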
WHAT IS HASHING
A Hash Table
A data structure to implement an associative array.
A structure that can map keys to values.
Uses a hash function to compute an index into an array of
buckets or slots from which the correct value can be found.
HASH FUNCTION
A good hash function is crucial for good hash table performance.
A good one can be difficult to achieve.
A basic expectation is that the function would provide a
uniform distribution of hash values.
A non-uniform distribution increases the number of
collisions and the cost of resolving them.
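A small sketch of how the distribution affects collisions. Both hash functions and the `user0` ... `user99` keys are made up for this illustration; the point is only to compare how many keys land in the fullest bucket.

```python
from collections import Counter


def bucket_counts(keys, hash_fn, array_size):
    """Return how many keys land in each bucket under hash_fn."""
    counts = Counter(hash_fn(k) % array_size for k in keys)
    return [counts[i] for i in range(array_size)]


def first_char_hash(key):
    """Deliberately poor: only the first character influences the hash."""
    return ord(key[0])


def poly_hash(key):
    """Mixes every character, so similar keys spread across buckets."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


keys = [f"user{i}" for i in range(100)]
print(max(bucket_counts(keys, first_char_hash, 10)))  # 100: every key collides
print(max(bucket_counts(keys, poly_hash, 10)))        # 10: keys spread evenly
```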
SEPARATE CHAINING
Every bucket is independent and maintains a list of entries
with the same index.
Time for hash table operations depends on the time to
find the bucket (constant) and the time for list operations.
The technique is also called open hashing or closed
addressing.
In a good hash table every bucket has very few entries.
SEPARATE CHAINING WITH LINKED LISTS
Popular as they require basic data structures with simple
algorithms.
They can use simple hash functions that are unsuitable for
other methods.
Cost of the table operation depends on the size of the
selected bucket for the desired key.
The worst case scenario is when all the entries are
inserted into the same bucket.
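A minimal sketch of separate chaining. Assumptions: the class and method names are invented for this example, Python lists stand in for the linked lists, and Python's built-in `hash` plays the role of the hash function.

```python
class ChainedHashTable:
    """Hash table where each bucket holds its own chain of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Finding the bucket takes constant time.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key: extend the chain

    def get(self, key, default=None):
        # Cost is proportional to the length of this one chain.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False
```

With `num_buckets=1` every entry shares one chain, which reproduces the worst case described above: each operation degrades to a linear scan.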
TIME COMPLEXITY MEASURES
TABLE: Time Complexity Measures
                  Guarantee                     Average Case
Implementation    Search   Insert   Delete     Search      Insert      Delete
Unordered Array   N        N        N          N/2         N/2         N/2
Ordered Array     lg N     N        N          lg N        N/2         N/2
Unordered List    N        N        N          N/2         N           N/2
Ordered List      N        N        N          N/2         N/2         N/2
BST               N        N        N          1.39 lg N   1.39 lg N   ?
Randomized BST    7 lg N   7 lg N   7 lg N     1.39 lg N   1.39 lg N   1.39 lg N
OPEN ADDRESSING (CLOSED HASHING)
All entry records are stored in the bucket array itself.
Insertion of a new entry: The buckets are examined,
starting from the hashed-to slot and proceeding in some
probe sequence, until an unoccupied slot is found.
Searching: The buckets are scanned in the same
sequence, until the target entry is found, or an unused slot
is found, which indicates that there is no such key in the
table.
Open Addressing: Refers to the fact that the location (address)
of an entry is not fully determined by its hash value; it also
depends on which slots are already occupied.
Closed Hashing: Not to be confused with open hashing or
closed addressing, names reserved for separate chaining.
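Insertion and searching as described above can be sketched as follows. Assumptions: the probe sequence is linear with step 1, the class name is hypothetical, Python's built-in `hash` is the hash function, and deletion is left out because it needs extra bookkeeping.

```python
class LinearProbingTable:
    """Open addressing: all entries live in the slot array itself."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity   # None marks an unoccupied slot

    def _probe(self, key):
        # Start at the hashed-to slot and visit every slot once, in order.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None or slot[0] == key:
                self.slots[i] = (key, value)
                return
        raise OverflowError("table is full")

    def search(self, key, default=None):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:          # unused slot: the key is not in the table
                return default
            if slot[0] == key:
                return slot[1]
        return default                # scanned every slot without a match
```

Searching must follow the same probe sequence as insertion; otherwise an entry that was displaced by a collision would never be found.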
PROBE SEQUENCES
Linear Probing – A fixed interval between probes (usually
1).
Quadratic Probing – Interval between probes is increased
by adding the successive outputs of a quadratic polynomial
to the starting value given by the original computation.
Double Hashing – Interval between probes is computed by
another hash function.
Drawback: The number of stored entries cannot exceed
the number of slots in the bucket array.
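The three probe sequences can be compared with a short sketch. Assumptions: `h` is the slot given by the original hash computation, `m` is the number of slots, `n` is how many probes to list, and the function names and default constants are chosen only for this illustration.

```python
def linear_probe(h, m, n, step=1):
    """Fixed interval between probes (usually 1)."""
    return [(h + i * step) % m for i in range(n)]


def quadratic_probe(h, m, n, c1=0, c2=1):
    """Add successive outputs of the polynomial c1*i + c2*i^2 to the start."""
    return [(h + c1 * i + c2 * i * i) % m for i in range(n)]


def double_hash_probe(h1, h2, m, n):
    """Interval h2 comes from a second hash function; it must be nonzero."""
    return [(h1 + i * h2) % m for i in range(n)]


print(linear_probe(3, 8, 4))          # [3, 4, 5, 6]
print(quadratic_probe(3, 8, 4))       # [3, 4, 7, 4]
print(double_hash_probe(3, 5, 8, 4))  # [3, 0, 5, 2]
```

Note that the quadratic sequence revisits slot 4 here, so with some table sizes it does not reach every slot.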
LOAD FACTOR – A KEY STATISTIC
Number of entries divided by the number of buckets – n/k.
If this grows too large the hash table becomes slow.
Variance of the number of entries per bucket is important.
Two tables each have 1000 entries and 1000 buckets.
The first has one entry in each bucket and the second has all
the entries in one bucket.
Both have a load factor of 1, but hashing is not working in
the second hash table.
A low load factor is not especially beneficial.
As the load factor approaches 0, the proportion of unused
areas in the hash table increases.
This does not necessarily reduce the search cost.
This results in wasted memory.
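The two-table example can be checked numerically. This is a sketch: `bucket_stats` is a hypothetical helper that takes the number of entries in each bucket and returns the load factor n/k together with the variance of entries per bucket.

```python
from statistics import pvariance


def bucket_stats(bucket_sizes):
    """Return (load factor n/k, variance of entries per bucket)."""
    n = sum(bucket_sizes)        # total number of entries
    k = len(bucket_sizes)        # number of buckets
    return n / k, pvariance(bucket_sizes)


even = [1] * 1000                # one entry in each of 1000 buckets
skewed = [1000] + [0] * 999      # all 1000 entries in a single bucket

print(bucket_stats(even))        # load factor 1.0, variance 0
print(bucket_stats(skewed))      # load factor 1.0, variance 999
```

The load factor alone cannot tell the two tables apart; only the variance exposes the second table's single overloaded bucket.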
HOW DROPBOX KNOWS YOU ARE SHARING
COPYRIGHTED STUFF
Dropbox checks the hash of a shared file against a banned
list, and blocks the share if there is a match.
With a properly implemented hash function, running the
same exact file through the algorithm twice will return the
same identifier both times – but changing a file even
slightly completely changes the hash.
This identifier can be used to tell you if a file is exactly the
same as another file – but it is a one way street.
The hash couldn’t tell you what that original file is, without
you already knowing or having a copy of the file to
compare it to.
DROPBOX
When you upload a file to Dropbox, two things happen to it:
a hash is generated, and then the file gets encrypted to
keep any unauthorized user (be it a hacker or a Dropbox
employee) who somehow stumbles upon it sitting on Dropbox’s
servers from easily being able to open it up.
After a DMCA complaint is verified by Dropbox’s legal
team, Dropbox adds that file’s hash to a big blacklist of
hashes known to be those corresponding to files they can’t
legally allow to be shared. When you share a link to a file,
it checks that file’s hash against the blacklist.
If the file you are sharing is the exact same file that a
copyright holder complained about, it is blocked from being
shared with others. If it is something else – a new file, or
even a modified version of the same file – a hash-based
anti-infringement system should not have any idea what it
is looking at.
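A hash-based check of this kind can be sketched as below. Assumptions: SHA-256 from Python's standard `hashlib` stands in for whatever hash Dropbox actually uses (the slides do not say), and `file_hash` and `is_blocked` are hypothetical names for this example.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Identifier for the exact byte content of a file (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def is_blocked(data: bytes, banned_hashes: set) -> bool:
    """Block the share only if the file's hash is on the banned list."""
    return file_hash(data) in banned_hashes


banned = {file_hash(b"copyrighted song bytes")}
print(is_blocked(b"copyrighted song bytes", banned))   # True: exact same file
print(is_blocked(b"copyrighted song bytes!", banned))  # False: one byte added
```

The banned list stores only hashes, so it can flag an exact copy without holding the file itself.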
SUBTREE CACHING (IN SYMBOLIC REGRESSION)
[Figure: two parent expression trees, x * (tan y + z) and log (x + yz),
built from functions; subtrees are selected randomly for crossover.]
SUBTREE CACHING
Every subtree is evaluated and cached, along with its
evaluation.
As a new tree arrives, its subtrees are supposed to be
evaluated recursively.
Before evaluation, the cache is checked for an evaluation
of a matching subtree.
If found, the cached evaluation is reused. If not found, the new
subtree is evaluated and its evaluation is stored in the cache.
Improves performance by saving time on unnecessary
evaluations.
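A minimal sketch of subtree caching. Assumptions: expression trees are nested tuples such as `("+", "x", ("*", "y", "z"))`, variables are strings, a Python dict (itself a hash table) serves as the cache, and the cache is only valid for one fixed set of variable values. None of these representation choices come from the slides.

```python
import math

BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
UNARY = {"log": math.log, "tan": math.tan}


def evaluate(tree, env, cache):
    """Evaluate tree recursively, reusing cached subtree evaluations."""
    if tree in cache:                      # matching subtree already evaluated
        return cache[tree]
    if isinstance(tree, str):              # leaf: look up the variable
        value = env[tree]
    elif tree[0] in UNARY:                 # e.g. ("log", subtree)
        value = UNARY[tree[0]](evaluate(tree[1], env, cache))
    else:                                  # e.g. ("+", left, right)
        op, left, right = tree
        value = BINARY[op](evaluate(left, env, cache),
                           evaluate(right, env, cache))
    cache[tree] = value                    # store for later trees
    return value


env = {"x": 2.0, "y": 3.0, "z": 4.0}
cache = {}
evaluate(("log", ("+", "x", ("*", "y", "z"))), env, cache)
# A later tree sharing the subtree y * z finds it in the cache.
evaluate(("*", "x", ("*", "y", "z")), env, cache)
```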