Class TDigest

  • Direct Known Subclasses:
    AbstractTDigest

    public abstract class TDigest
    extends Object
    Adaptive histogram based on something like streaming k-means crossed with Q-digest.
    The special characteristics of this algorithm are:
    a) smaller summaries than Q-digest
    b) works on doubles as well as integers.
    c) provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles
    d) fast
    e) simple
    f) test coverage > 90%
    g) easy to adapt for use with map-reduce
    • Constructor Summary

      Constructors 
      Constructor Description
      TDigest()  
    • Method Summary

      All Methods Static Methods Instance Methods Abstract Methods Concrete Methods 
      Modifier and Type Method Description
      abstract void add​(double x)
      Add a sample to this TDigest.
      abstract void add​(double x, int w)
      Adds a sample to a histogram.
      abstract void add​(TDigest other)
      Add all of the centroids of another TDigest to this one.
      abstract void asBytes​(ByteBuffer buf)
      Serialize this TDigest into a byte buffer.
      abstract void asSmallBytes​(ByteBuffer buf)
      Serialize this TDigest into a byte buffer.
      abstract int byteSize()
      Returns the number of bytes required to encode this TDigest using #asBytes().
      abstract double cdf​(double x)
      Returns the fraction of all points added which are <= x.
      abstract int centroidCount()
      The number of centroids currently in the TDigest.
      abstract Iterable<? extends Centroid> centroids()
      An iterable that lets you go through the centroids in ascending order by mean.
      protected void checkValue​(double x)  
      abstract void compress()
      Re-examines a t-digest to determine whether some centroids are redundant.
      abstract double compression()
      Returns the current compression factor.
      static ArrayDigest createArrayDigest​(double compression)
      Creates an ArrayDigest with default page size.
      static ArrayDigest createArrayDigest​(int pageSize, double compression)
      Creates an ArrayDigest with specified page size.
      static TDigest createTreeDigest​(double compression)
      Creates a TreeDigest.
      abstract boolean isRecording()  
      abstract double quantile​(double q)
      Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
      abstract TDigest recordAllData()
      Tell this TDigest to record the original data as much as possible for test purposes.
      abstract long size()
      Returns the number of points that have been added to this TDigest.
      abstract int smallByteSize()
      Returns the number of bytes required to encode this TDigest using #asSmallBytes().
    • Constructor Detail

      • TDigest

        public TDigest()
    • Method Detail

      • createArrayDigest

        public static ArrayDigest createArrayDigest​(double compression)
        Creates an ArrayDigest with default page size.
        Parameters:
        compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
        Returns:
        the ArrayDigest
      • createArrayDigest

        public static ArrayDigest createArrayDigest​(int pageSize,
                                                    double compression)
        Creates an ArrayDigest with specified page size.
        Parameters:
        pageSize - The internal page size to use. This should be about sqrt(10*compression)
        compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
        Returns:
        the ArrayDigest
      • createTreeDigest

        public static TDigest createTreeDigest​(double compression)
        Creates a TreeDigest. Going forward, ArrayDigests should be preferred to the TreeDigest since they are uniformly faster and require less memory while producing nearly identical results.
        Parameters:
        compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
        Returns:
        the TreeDigest
      • add

        public abstract void add​(double x,
                                 int w)
        Adds a sample to a histogram.
        Parameters:
        x - The value to add.
        w - The weight of this point.
      • checkValue

        protected final void checkValue​(double x)
      • compress

        public abstract void compress()
        Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space.
        The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression.
        This is a destructive operation that is not thread-safe.
      • size

        public abstract long size()
        Returns the number of points that have been added to this TDigest.
        Returns:
        The sum of the weights on all centroids.
      • cdf

        public abstract double cdf​(double x)
        Returns the fraction of all points added which are <= x.
      • quantile

        public abstract double quantile​(double q)
        Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
        Parameters:
        q - The desired fraction
        Returns:
        The value x such that cdf(x) == q
      • centroidCount

        public abstract int centroidCount()
        The number of centroids currently in the TDigest.
        Returns:
        The number of centroids
      • centroids

        public abstract Iterable<? extends Centroid> centroids()
        An iterable that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
        Returns:
        The centroids in the form of an Iterable.
      • compression

        public abstract double compression()
        Returns the current compression factor.
        Returns:
        The compression factor originally used to set up the TDigest.
      • byteSize

        public abstract int byteSize()
        Returns the number of bytes required to encode this TDigest using #asBytes().
        Returns:
        The number of bytes required.
      • smallByteSize

        public abstract int smallByteSize()
        Returns the number of bytes required to encode this TDigest using #asSmallBytes().
        Returns:
        The number of bytes required.
      • asBytes

        public abstract void asBytes​(ByteBuffer buf)
        Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • asSmallBytes

        public abstract void asSmallBytes​(ByteBuffer buf)
        Serialize this TDigest into a byte buffer. Some simple compression is used such as using variable byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can be reasonably used to store the centroid means.
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • recordAllData

        public abstract TDigest recordAllData()
        Tell this TDigest to record the original data as much as possible for test purposes.
        Returns:
        This TDigest so that configurations can be done in fluent style.
      • isRecording

        public abstract boolean isRecording()
      • add

        public abstract void add​(double x)
        Add a sample to this TDigest.
        Parameters:
        x - The data value to add
      • add

        public abstract void add​(TDigest other)
        Add all of the centroids of another TDigest to this one.
        Parameters:
        other - The other TDigest