Abstracts
A Trade-Off for
Worst-Case Efficient Dictionaries
Abstract: We consider dynamic dictionaries over the universe
U={0,1}w on a unit-cost RAM with word size w and a standard
instruction set, and present a linear space deterministic dictionary
accommodating membership queries in time (log log n)O(1)
and updates in time (log n)O(1), where n is the size of the
set stored. Previous solutions either had query time (log
n)Omega(1) or update time 2omega(sqrt(log n)) in
the worst case.
Low Redundancy in
Static Dictionaries with Constant Query Time
Abstract: A static dictionary is a data structure
storing subsets of a finite universe U, answering membership
queries. We show that on a unit cost RAM with word size Theta(log
|U|), a static dictionary for n-element sets with constant worst case
query time can be obtained using B+O(log log |U|)+o(n) bits of
storage, where B=ceiling(log2 binom(|U|,n) is the minimum
number of bits needed to represent all n-element subsets of U.
Deterministic
Dictionaries
Abstract: It is shown
that a static dictionary that offers constant-time access to n
elements with w-bit keys and occupies O(n) words of memory can be
constructed deterministically in O(n log n) time on a unit-cost RAM
with word length w and a standard instruction set including
multiplication. Whereas a randomized construction working in linear
expected time was known, the running time of the best previous
deterministic algorithm was Omega(n2). Using a standard
dynamization technique, the first deterministic dynamic dictionary
with constant lookup time and sublinear update time is derived. The
new algorithms are weakly nonuniform; i.e., they require access to a
fixed number of precomputed constants dependent on w. The main
technical tools employed are unit-cost error-correcting codes, word
parallelism, and derandomization using conditional expectations.
Fast Random Access to Wavelet Compressed Volumetric Data Using Hashing
Abstract: We present a new approach to lossy storage of
the coefficients of wavelet transformed data. While it is common to
store the coefficients of largest magnitude (and let all other
coefficients be zero), we allow a slightly different set of
coefficients to be stored. This brings into play a recently proposed
hashing technique that allows space efficient storage and very
efficient retrieval of coefficients. Our approach is applied to
compression of volumetric data sets. For the ``Visible Man'' volume we
obtain up to 80% improvement in compression ratio over previously
suggested schemes. Further, the time for accessing a random voxel is
considerably reduced.
Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions
Abstract: A new way of constructing (minimal) perfect hash
functions is described. The technique is a mixture of two well-known
ones, and yields simple as well as efficient hashing schemes. More
specifically, the overhead associated with resolving buckets in
two-level hashing schemes is reduced considerably. Evaluation of a
hash function requires just one multiplication and a few additions
apart from primitive bit operations. The number of accesses to memory
is two, one of which is to a fixed location. This improves the probe
performance of previous minimal perfect hashing schemes, and is shown
to be optimal. The hash function description ("program") for a set of
size n occupies O(n) words, and can be constructed in expected O(n)
time.
Dispersing Hash Functions
Abstract: A new
hashing primitive is introduced: dispersing hash functions. A family
of hash functions F is dispersing if, for any set S of a certain size
and random h in F, the expected value of |S|-|h[S]| is ``close'' to
the expectancy if h had been chosen at random from the set of all
functions.
We give tight, up to a logarithmic factor, upper and lower bounds on
the size of dispersing families. Such families previously studied, for
example universal families, are significantly larger than the smallest
dispersing families, making them less suitable for derandomization. We
present several applications of dispersing families to derandomization
(fast element distinctness, set inclusion, and static dictionary
initialization). Also, a tight relationship between dispersing
families and extractors, which may be of independent interest, is
exhibited.
We also investigate the related issue of program size for hash
functions which are nearly perfect. In particular, we exhibit a
dramatic increase in program size for hash functions more dispersing
than a random function.
On the Cell Probe
Complexity of Membership and Perfect Hashing
Abstract: We study two fundamental static data structure
problems, membership and perfect hashing, in Yao's cell probe
model. The first space and bit probe optimal upper bound is given for
the membership problem. We also give a new efficient membership scheme
where the query algorithm makes just one adaptive choice, and probes a
total of three words. A lower bound shows that two word probes
generally do not suffice, regardless of the adaptivity. For minimal
perfect hashing we show a tight bit probe lower bound, and give a
simple scheme achieving this performance, making just one adaptive
choice. Linear range perfect hashing is shown to be implementable with
the same number of bit probes, of which just one is adaptive. In
contrast, we establish that for sufficiently sparse sets, non-adaptive
perfect hashing needs exponentially more bit probes. This is the first
such separation of adaptivity and non-adaptivity.
Cuckoo Hashing
Abstract: We present a simple dictionary with worst case
constant lookup time, equaling the theoretical performance of the
classic dynamic perfect hashing scheme of Dietzfelbinger et
al. (Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput.,
23(4):738--761, 1994). The space usage is similar to that of binary
search trees, i.e., three words per key on average. Besides being
conceptually much simpler than previous dynamic dictionaries with
worst case constant lookup time, our data structure is interesting in
that it does not use perfect hashing, but rather a variant of open
addressing where keys can be moved back in their probe sequences. An
implementation inspired by our algorithm, but using weaker hash
functions, is found to be quite practical. It is competitive with the
best known dictionaries having an average case (but no nontrivial
worst case) guarantee.
Lossy
Dictionaries
Abstract: Bloom filtering is an important technique for
space efficient storage of a conservative approximation of a set
S. The set stored may have up to some specified number of ``false
positive'' members, but all elements of S are included. In this paper
we consider ``lossy dictionaries'' that are allowed to also have false
negatives. This relaxation allows a very fast and simple data
structure making almost optimal use of memory. Being more time
efficient than Bloom filters, we believe our data structure to be well
suited for replacing Bloom filters in some applications. Also, the
fact that our data structure supports information associated to keys
paves the way for new applications, as illustrated by an application
in lossy image compression.
Optimal Time-Space Trade-Offs for Non-Comparison-Based Sorting
Abstract: We study the fundamental problem of sorting n
integers of w bits on a unit-cost RAM with word size w, and in
particular consider the time-space trade-off (product of time and
space in bits) for this problem. For comparison-based algorithms, the
time-space complexity is known to be Theta(n2). A result of
Beame shows that the lower bound also holds for non-comparison-based
algorithms, but no algorithm has met this for time below the
comparison-based Omega(n lg n) lower bound.
We show that if sorting
within some time bound T' is possible, then time T =
O(T'+ n lg* n) can be achieved with high probability using
space S = O(n2/T+w), which is optimal. Given a
deterministic priority queue using amortized time t(n) per operation
and space nO(1), we provide a deterministic algorithm
sorting in time T = O(n (t(n) + lg* n)) with S =
O(n2/T+w). Both results require that w <=
n1-Omega(1). Using existing priority queues and sorting
algorithms, this implies that we can deterministically sort time-space
optimally in time Theta(T) for T>= n (lg lg n)2, and with
high probability for T>= n lg lg n.
Our results imply that recent
lower bounds for deciding element distinctness in o(n lg n) time are
nearly tight.
One Probe Search
Abstract: We consider dictionaries that perform lookups
by probing a single word of memory, knowing only the size of
the data structure. We describe a randomized dictionary where a lookup
returns the correct answer with probability 1-epsilon, and otherwise
returns "dont know". The lookup procedure uses an expander graph to
select the memory location to probe. Recent explicit expander
constructions are shown to yield space usage much smaller than what
would be required using a deterministic lookup procedure. Our data
structure supports efficient deterministic updates, exhibiting
new probabilistic guarantees on dictionary running
time.
Space Efficient Hash Tables With Worst Case Constant Access Time
Abstract: We generalize Cuckoo
Hashing to d-ary Cuckoo Hashing and show how this yields a
simple hash table data structure that stores n elements in
(1+epsilon)n memory cells, for any constant epsilon >
0. Assuming uniform hashing, accessing or deleting table entries
takes at most d=O(ln(1/epsilon)) probes and the expected
amortized insertion time is constant. This is the first dictionary
that has worst case constant access time and expected constant update
time, works with (1+epsilon)n space, and supports satellite
information. Experiments indicate that d=4 choices suffice for
epsilon around 0.03. We also describe a hash table data
structure using explicit constant time hash functions, using at most
d=O(ln2(1/epsilon)) probes in the worst case.
A corollary is an expected linear time algorithm for finding
maximum cardinality matchings in a rather natural model
of sparse random bipartite graphs.
Basic External Memory Data Structures
Abstract:
This paper gives an overview of the most important basic data
structures in external memory, introducing fundamental principles
for designing data structures that make efficient use of the memory
hierarchy. We show how to implement elementary abstract data
structures (queues, stacks, linked lists), B-trees, buffer trees
(including their use for priority queues), and hashing based
dictionaries.
Uniform Hashing in Constant Time and Linear Space
Abstract:
Many algorithms and data structures employing hashing have been
analyzed under the uniform hashing assumption, i.e., the
assumption that hash functions behave like truly random functions.
Starting with the discovery of universal hash functions, many
researchers have studied to what extent this theoretical ideal can
be realized by hash functions that do not take up too much space
and can be evaluated quickly. In this paper we present an almost
ideal solution to this problem: A hash function that, on any set of
n inputs, behaves like a truly random function with high
probability, can be evaluated in constant time on a RAM, and can be
stored in O(n) words, which is optimal. For many hashing schemes
this is the first hash function that makes their uniform hashing
analysis come true, with high probability, without incurring
overhead in time or space.
On Adaptive Integer Sorting
Abstract:
This paper considers integer sorting on a RAM. We show that adaptive sorting of a sequence with qn inversions is asymptotically equivalent to multisorting groups of at most q keys, and a total of n keys. Using the recent O(n sqrt(log log n)) expected time sorting of Han and Thorup on each set, we immediately get an adaptive expected sorting time of O(n sqrt(log log q)). Interestingly, for any positive constant epsilon, we show that multisorting and adaptive inversion sorting can be performed in linear time if q<= 2(log n)(1-epsilon). We also show how to asymptotically improve the running time of any traditional sorting algorithm on a class of inputs much broader than those with few inversions.
An Optimal Bloom Filter Replacement
Abstract:
This paper considers space-efficient data structures for storing an
approximation S' to a set S such that S is a superset of S' and any
element not in S belongs to S' with probability at most
epsilon. The Bloom filter data structure, solving this
problem, has found widespread use. Our main result is a new RAM data
structure that improves Bloom filters in several ways:
- The time for looking up an element in S' is O(1),
independent of epsilon.
- The space usage is within a lower order term of the lower bound.
- The data structure uses explicit hash function families.
- The data structure supports insertions and deletions on
S in amortized expected constant time.
The main technical ingredient is a succinct representation of
dynamic multisets. We also consider three recent generalizations of
Bloom filters.
Home page
:-)