public class SortedDataBag<E> extends AbstractDataBag<E>
This data bag will gather items in memory until a size threshold is passed, at which point it will write out all of the items to disk using the supplied serializer.
After adding is finished, call iterator() to set up the data bag for reading back items and iterating over them.
The iterator will retrieve the items in sorted order using the supplied comparator.
IMPORTANT: You may not add any more items after this call. You may subsequently call iterator() multiple times, which will give you a new iterator for each invocation. If you do not consume the entire iterator, you should call Iter.close(Iterator) to close any FileInputStreams associated with the iterator.
Additionally, make sure to call close() when you are finished to free any system resources (preferably in a finally block).
Implementation Notes: Data is stored in an ArrayList as it comes in. When it is time to spill, that data is sorted and written to disk. An iterator will read in each file and perform a merge-sort as the results are returned.
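The spill-and-merge behaviour described in the implementation notes can be illustrated with a minimal, self-contained sketch. This is not the SortedDataBag source: SpillingSortedBag and its nested Run class are hypothetical names, the count-based threshold stands in for a ThresholdPolicy, and plain text lines stand in for the supplied serializer, so items must not contain line breaks.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical stand-in for illustration only; works on Strings, one item per line. */
class SpillingSortedBag implements Closeable {
    /** One sorted run (a spill file or the in-memory remainder) and its current smallest item. */
    private static final class Run {
        final Iterator<String> source;
        String head;
        Run(Iterator<String> source) { this.source = source; this.head = source.next(); }
    }

    private final int threshold;                 // assumed policy: spill once this many items are buffered
    private final Comparator<String> comparator;
    private final List<String> memory = new ArrayList<>();
    private final List<File> spillFiles = new ArrayList<>();
    private final List<BufferedReader> openReaders = new ArrayList<>();
    private boolean finishedAdding = false;

    SpillingSortedBag(int threshold, Comparator<String> comparator) {
        this.threshold = threshold;
        this.comparator = comparator;
    }

    void add(String item) throws IOException {
        if (finishedAdding) throw new IllegalStateException("no adds after iterator()");
        memory.add(item);
        if (memory.size() >= threshold) spill();
    }

    /** Sorts the buffered items and writes them out as one sorted run. */
    private void spill() throws IOException {
        memory.sort(comparator);
        File run = File.createTempFile("bag-run", ".txt");
        run.deleteOnExit();
        Files.write(run.toPath(), memory);
        spillFiles.add(run);
        memory.clear();
    }

    /** Merges all sorted runs lazily; a priority queue always exposes the smallest head. */
    Iterator<String> iterator() throws IOException {
        finishedAdding = true;
        List<String> rest = new ArrayList<>(memory);   // copy so repeated calls stay independent
        rest.sort(comparator);
        PriorityQueue<Run> heads =
            new PriorityQueue<Run>((a, b) -> comparator.compare(a.head, b.head));
        if (!rest.isEmpty()) heads.add(new Run(rest.iterator()));
        for (File f : spillFiles) {                    // spill files are never empty
            BufferedReader reader = Files.newBufferedReader(f.toPath());
            openReaders.add(reader);
            heads.add(new Run(reader.lines().iterator()));
        }
        return new Iterator<String>() {
            @Override public boolean hasNext() { return !heads.isEmpty(); }
            @Override public String next() {
                Run run = heads.poll();
                if (run == null) throw new NoSuchElementException();
                String result = run.head;
                if (run.source.hasNext()) {            // advance this run and re-queue it
                    run.head = run.source.next();
                    heads.add(run);
                }
                return result;
            }
        };
    }

    /** Closes any readers still open and deletes the spill files. */
    @Override public void close() throws IOException {
        for (BufferedReader reader : openReaders) reader.close();
        openReaders.clear();
        for (File f : spillFiles) f.delete();
        spillFiles.clear();
        memory.clear();
    }
}
```

With a threshold of 3, adding seven items produces two spill files and leaves one item in memory; iterator() then merges the three runs in comparator order. As with the real class, a caller would wrap use of the bag in try/finally and call close() in the finally block.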
Constructor and Description |
---|
SortedDataBag(ThresholdPolicy<E> policy, SerializationFactory<E> serializerFactory, Comparator<? super E> comparator) |
Modifier and Type | Method and Description |
---|---|
void | add(E item): Add a tuple to the bag. |
void | close() |
void | flush() |
boolean | isDistinct(): Find out if the bag is distinct. |
boolean | isSorted(): Find out if the bag is sorted. |
Iterator<E> | iterator(): Returns an iterator over a set of elements of type E. |
public SortedDataBag(ThresholdPolicy<E> policy, SerializationFactory<E> serializerFactory, Comparator<? super E> comparator)
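public boolean isSorted()
Specified by: isSorted in interface DataBag
public boolean isDistinct()
Specified by: isDistinct in interface DataBag
public void add(E item)
Specified by: add in interface DataBag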
item - tuple to add.
public void flush()
public Iterator<E> iterator()
If you do not consume the entire iterator, call Iter.close(Iterator) to be sure any open file handles are closed.
public void close()
Licensed under the Apache License, Version 2.0