Printing Datarrays

One of the most important ways to understand what’s going on in a labeled array is to see a pretty text representation of it. In Divisi2 I stole the __str__ method from PySparse to accomplish this, but NumPy arrays are more varied than PySparse matrices (where everything is two-dimensional and made of floats).

Can we build on NumPy’s str?

NumPy has provided somewhat-pretty text representations for a long time, but the code in numpy.core.arrayprint is:

  • difficult to extend

  • undocumented

  • kind of spaghetti, frankly

  • largely untouched for the last 13 years!

Its output can be aesthetically suboptimal in some cases. When printing large arrays of floats, for example, it will wrap every line like this:

[[  0.00000000e+00   1.00000000e-04   2.00000000e-04 ...,   4.70000000e-03
    4.80000000e-03   4.90000000e-03]
 [  5.00000000e-03   5.10000000e-03   5.20000000e-03 ...,   9.70000000e-03
    9.80000000e-03   9.90000000e-03]
 [  1.00000000e-02   1.01000000e-02   1.02000000e-02 ...,   1.47000000e-02
    1.48000000e-02   1.49000000e-02]
 ...,
 [  3.35000000e-01   3.35100000e-01   3.35200000e-01 ...,   3.39700000e-01
    3.39800000e-01   3.39900000e-01]
 [  3.40000000e-01   3.40100000e-01   3.40200000e-01 ...,   3.44700000e-01
    3.44800000e-01   3.44900000e-01]
 [  3.45000000e-01   3.45100000e-01   3.45200000e-01 ...,   3.49700000e-01
    3.49800000e-01   3.49900000e-01]]

The user can understand what that means, but it’ll be hard to stick labels on.

My conclusion is that it will be better to build this representation from the ground up.

The 2D pretty-printer

Screens are 2D, so everything is a variant of the 2D case. What we need is a class designed for printing strings in a grid. This class will then:

  • Find a formatter for the dtype of the matrix (the “cell formatter”).

  • Make an array (a string array? might as well) of equal-width string representations.

  • Attach row and column labels as the first row and column of the array.

  • Join everything together into a correctly-aligned, multi-line string.
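The four steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the name format_grid, its arguments, and the default "{:.3g}" cell formatter are hypothetical and not part of any existing code, and nested lists stand in for a real array.

```python
def format_grid(values, row_labels, col_labels, cell_format="{:.3g}".format):
    """Return an aligned, multi-line string for a 2D grid with labels."""
    # Step 1: apply the (hypothetical) cell formatter to every value.
    cells = [[cell_format(v) for v in row] for row in values]
    headers = [str(c) for c in col_labels]
    stubs = [str(r) for r in row_labels]
    # Step 2: pick one shared width so every data cell is equally wide.
    width = max(len(s) for row in cells + [headers] for s in row)
    stub_width = max(len(s) for s in stubs)
    # Step 3: labels become the first row and the first column.
    table = [[""] + headers] + [[stub] + row for stub, row in zip(stubs, cells)]
    # Step 4: pad each entry and join into one multi-line string.
    lines = []
    for first, *rest in table:
        padded = [first.ljust(stub_width)] + [s.rjust(width) for s in rest]
        lines.append(" ".join(padded))
    return "\n".join(lines)


text = format_grid([[1.0, 2.5], [10.5, 0.001]], ["a", "b"], ["x", "y"])
print(text)
```

Because the labels are just an extra row and column of strings, the joining step does not need to know which entries are labels and which are data.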

The width of each cell is a negotiation between the grid formatter and the cell formatter:

  • Cell: I can print these floats in 5 to 15 characters. More characters is better, of course.

  • Grid: I’ll give you 7.

  • Cell: Stingy bastard.
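Here is a rough sketch of how that conversation might look in code. Everything named here (FloatCell, negotiate, the 79-character line_width) is an assumption made for illustration: the cell formatter advertises a minimum and maximum width, the grid makes an offer based on how many columns must share a line, and the formatter then does its best inside the width it was given.

```python
class FloatCell:
    """Hypothetical float formatter that can work in 5 to 15 characters."""

    min_width = 5
    max_width = 15

    def format(self, x, width):
        # Spend as many significant digits as fit in the allotted width.
        for precision in range(width, 0, -1):
            text = f"{x:.{precision}g}"
            if len(text) <= width:
                break
        # If even one digit is too wide, the text overflows the cell.
        return text.rjust(width)


def negotiate(cell, n_columns, line_width=79):
    """Grid's side: split the line evenly, then respect the cell's limits."""
    # Leave one separating space between neighbouring cells.
    offer = (line_width + 1) // n_columns - 1
    return max(cell.min_width, min(offer, cell.max_width))


cell = FloatCell()
width = negotiate(cell, n_columns=10)  # the grid offers 7
print(repr(cell.format(0.123456789, width)))
```

A real version would probably also let the grid drop columns (and show dots) when even the minimum width does not fit, which this sketch ignores.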

Maybe this could be accomplished with “small”, “medium”, and “large” options for each formatter, allowing us to reuse arrayprint formatters:

  • float: large = high precision, medium = lower precision, small = lower precision and suppress_small

  • int: large = max number of digits, medium/small = exponential notation

  • str: large = maximum length, medium = truncate

  • bool: large = ‘ True’/‘False’, medium/small = ‘T’/‘-’ (to be visually distinct)
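As a sketch of what the size options might look like, the function below dispatches on plain Python types rather than on NumPy dtypes, and it stands in for (rather than reuses) the arrayprint formatters. The name format_cell, the precisions, the 1e-3 suppression threshold, and the 8-character truncation are all assumptions. The list above gives no “small” option for str, so the sketch simply reuses “medium” there.

```python
SUPPRESS_BELOW = 1e-3  # assumed cutoff for "suppress_small"
TRUNCATE_AT = 8        # assumed length for truncated strings


def format_cell(value, size="large"):
    """Format one value at size 'large', 'medium', or 'small'."""
    # bool is tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        if size == "large":
            return " True" if value else "False"
        return "T" if value else "-"
    if isinstance(value, int):
        # large shows every digit; smaller sizes use exponential notation.
        return str(value) if size == "large" else f"{value:.1e}"
    if isinstance(value, float):
        if size == "large":
            return f"{value:.8g}"
        if size == "small" and abs(value) < SUPPRESS_BELOW:
            return "0"
        return f"{value:.3g}"
    # Anything else is treated as a string; 'small' reuses 'medium' here.
    text = str(value)
    return text if size == "large" else text[:TRUNCATE_AT]


for size in ("large", "medium", "small"):
    print(size, [format_cell(v, size) for v in (True, 123456, 3.14159265, "labelled")])
```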

Brackets are _not_ printed (it’s too hard to work them in with the labels).

The 1D pretty-printer

It’s the 2D printer with only one row.
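In code, that could be as little as wrapping the values in a list before handing them over. The format_2d function below is a compact, hypothetical stand-in for the 2D printer (column labels only, str() as the cell formatter) so that the example runs on its own.

```python
def format_2d(rows, col_labels):
    """Compact stand-in for the 2D grid printer (no row labels)."""
    table = [[str(c) for c in col_labels]]
    table += [[str(v) for v in row] for row in rows]
    width = max(len(s) for line in table for s in line)
    return "\n".join(" ".join(s.rjust(width) for s in line) for line in table)


def format_1d(values, labels):
    # A 1D array is handed to the 2D printer as a single row.
    return format_2d([values], labels)


text = format_1d([3, 14, 159], ["a", "b", "c"])
print(text)
```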

The 3D pretty-printer

When people work with n-dimensional labeled data and n>2, what they often do is flatten it out into 2 dimensions. The rows are single data points, and the columns are all the indices followed by the value. Show a few of these rows from the beginning of the matrix, a row of dots, and a few from the end of the matrix.

Then put all those back into the grid-maker.
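A minimal sketch of that flattening, assuming row-major order and written against plain nested lists so it runs without NumPy. The names unravel, flatten_records, and edge are hypothetical; with a real ndarray, np.unravel_index could replace the helper. Only the first and last few positions are ever visited, so the cost does not grow with the size of the array.

```python
def unravel(position, shape):
    """Turn a flat, row-major position into one index per axis."""
    index = []
    for extent in reversed(shape):
        position, i = divmod(position, extent)
        index.append(i)
    return tuple(reversed(index))


def flatten_records(data, shape, axis_labels, edge=2):
    """List rows of [label, ..., label, value]; elide the middle with dots."""
    size = 1
    for extent in shape:
        size *= extent
    if size > 2 * edge:
        positions = list(range(edge)) + [None] + list(range(size - edge, size))
    else:
        positions = list(range(size))
    rows = []
    for position in positions:
        if position is None:
            rows.append(["..."] * (len(shape) + 1))
            continue
        index = unravel(position, shape)
        value = data
        for i in index:  # walk down one axis at a time
            value = value[i]
        labels = [str(axis_labels[axis][i]) for axis, i in enumerate(index)]
        rows.append(labels + [str(value)])
    return rows


data = [[[0, 1, 2], [3, 4, 5]], [[6, 7, 8], [9, 10, 11]]]
labels = [["a", "b"], ["x", "y"], ["p", "q", "r"]]
records = flatten_records(data, (2, 2, 3), labels)
for record in records:
    print(" ".join(s.rjust(3) for s in record))
```

The resulting rows are already lists of strings, so they can go straight into whatever grid-maker joins cells into aligned lines.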

If there are more than 30 or so dimensions, we are sad.