Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

name/title system #430

Closed
henryiii opened this issue Jul 17, 2020 · 18 comments · Fixed by #456
Closed

name/title system #430

henryiii opened this issue Jul 17, 2020 · 18 comments · Fixed by #456

Comments

@henryiii
Copy link
Member

henryiii commented Jul 17, 2020

I'd like to propose upstreaming @LovelyBuggies fantastic title/name system from Hist to boost-histogram. There are several reasons for this:

  • It has no dependencies
  • It can take part in DRAFT: Plotting (and later fitting) Protocol #423 in providing a consistent API for uproot and mplhep (maybe others) (mplhep.plothist(uproot.to_boost()) would show titles too, like Hist, and @jpivarski probably could add the same api directly to TH* objects in Uproot4)
  • It can make dict-indexing, projections, and axes read more clearly if names are used, allowing a high level of clarity as to which axis is being referenced.
  • It is completely optional, and does not affect the current usage at all
  • It would make conversions to/from hist easier, since currently names and titles get packed into a dict, which then has to be changed going between hist/boost-histogram (especially for uproot, since title and name have to be handled during the conversion)
  • We can optimize the storage internally, just keeping the strings directly in the metadata type, rather than storing python dicts and python strings

Quick example of usage:

h = bh.Histogram(
    bh.axis.Regular(10, -1, 1, name="x"),
    bh.axis.Regular(20, -2, 2, name="y"),
)
h.fill(x=np.random.normal(size=1_000_000), y=np.random.normal(size=1_000_000))

h.project("y");

h[{"x": sum, "y": bh.rebin(2)}]

h.axes["x"].label = "x [μm]"
h.axes["y"].label = "y [cm]"
mplhep.hist2dplot(h);

The proposed design, open for discussion:

  • Axes gain a new name parameter.
    • This probably should be a read-only parameter (which is not how Hist currently works)
    • When histograms are created, they make sure all names are ether empty or unique
    • Names are used anywhere axis numbers can be used.
      • Empty-name axes must be accessed by number.
  • Axes gain a new label parameter
    • Will override the name for plotting libraries, so titles can be attached to the histogram rather than added afterwords in the plot.
    • If the label is not set, the name is used if available.
  • The metadata slot in boost-histogram's C++ becomes a struct, with two strings and an arbitrary Python object.

@HDembinski, what do you think?

@HDembinski
Copy link
Member

It seems very useful and rather consistent. The use of names seems Pythonic. My only complaint is use of title instead of label. I would prefer label.

Some thoughts: why not go one step further and allow an axis named x to be accessed as h.axes.x?

@henryiii
Copy link
Member Author

I'm happy with either, I just picked up "title" from ROOT. We've already had one person (at least) comment that label was more Pythonic. (or, at least, more matplotlibathonic)

Some thoughts: why not go one step further and allow an axis named x to be accessed as h.axes.x?

Sure, that would be friendly/reasonable; that's how Pandas and Numpy's record arrays work. We can even add __dir__ to make tab completion work.

@henryiii
Copy link
Member Author

I was looking over the talk video and I literally called it label when first introducing it, even though it was .title in the code. 😆

@HDembinski
Copy link
Member

HDembinski commented Jul 18, 2020

Ok label it is :) Apart from being closer to matplotlib, I think it also makes more sense word-wise. Things have labels. Books have titles. Edit: I know these words are synonymous, I am referring to typical use.

@HDembinski
Copy link
Member

I thought some more about this, and then changed my mind. This feature is perfect for hist, but shouldn't be backported to boost-histogram. It doesn't really fit in, because boost-histogram does not assign any meaning to metadata and that should stay this way.

@henryiii
Copy link
Member Author

henryiii commented Jul 19, 2020

It's up to you, but just to be clear - metadata would be completely untouched from Python. The C++ struct would be touched, but that's an internal detail that is not visible to Python. From Python, there would be a metadata slot that acts exacty as it does now, and two new slots, a label and a name. You are free to use strings in metadata for names, and nothing changes. You don't have to use title or name. I wouldn't think about the item in the C++ backend at all - it's just a tool that currently holds a single py::object in it, which is the current python .metadata. Instead, we would assign .name, .label, and .metadata` to it. If it wasn't for the fact that we need to be able to recreate Python objects from C++ objects, we wouldn't need to use this mechanism, and it would just be a pair of added properties.

If we don't have this, then we lose a very pythonic named axis system - all downstream libraries will have to reinvent this like hist does if they want it. It would be very nice to have name as part of the universal API for how all histograms based on boost-histogram work.

And we lose the ability to have a universal "Histogram" API that can be produced by Uproot4, modified in boost-histogram, and then plotted in mplhep, since Hist has to assign meaning to the metadata if it wants to put a name and title in, since Axis only have the single metadata slot available. All plotting libraries, like mplhep and histoprint, would have to check to see if there was a "label" attribute (Hist), then check to see if metadata was a dict (bh), then check to see if it had a title field - and this was done, it would not leave metadata free to be anything for it to work in other libraries. If all histograms have a "label" field, that's just part of what a histogram is and that's all that has to be checked. If empty, no label.

Unlike normal Python libraries, subclasses are not allowed to add arbitrary Python properties, since they don't get passed through manipulations in C++; this forces metadata in Hist to not be the same structure as in boost-histogram. Currently, the conversion I (briefly) showed between Hist and boost-histogram doesn't work if you assign anything that's not a dict to boost-histogram's metadata.

There's no reason to make a rushed decision - after the last couple of weeks, I'm taking time off working on histograms until at least Tuesday. In fact, I tried not to open my computer today. Also, if you don't want both .label and .name, would you consider just one of them?


PS: The problem mentioned above - that we can't make and pass through arbitrary members, is one we've discussed before - __dict__ can't be used since it is not propagated. Maybe, though, we could make the C++ metadata hold two items - the contents of .metadata, and the contents of __dict__. That way, subclasses could add arbitrary members and pass them through transformations. Then the discussion about whether or not boost-histogram has a .name and .label will truly be one about whether you should be able to name and label axes in boost-histogram, and not about whether the C++ binding code changes.

@henryiii
Copy link
Member Author

Actually, I think our discussion in #400 was too restrictive - we were considering replacing metadata with __dict__ like storage, when in fact we could support both. We should support .metadata and metadata= because Boost.Histogram supports it. We should support passing __dict__ through transforms because not doing it is a restrictive limitation imposed on us by the fact we really are manipulating C++ objects behind the scenes instead of Python objects, and I think we can fix it. Then we can decide if we support .name and/or .label separately from discussing metadata, and if Hist/other libraries want to add more properties, they can, without having to rely on forcing structure to metadata, which is already very useful for storing actual metadata on axes, such as the ROOT metadata from uproot.

Then I would make the case, completely unrelated to .metadata now, that having a name= option enables nice Pythonic named access to axes (in a way C++ doesn't have, keyword fill would not be possible there, for example), and is completely optional.

I would then separately make the case that having the .title property would enable a PlottableHistogram Protocol across Scikit-HEP (since boost-histogram is a Scikit-HEP package too, unlike Boost.Histogram).

@HDembinski
Copy link
Member

I understand all this, but boost-histogram should stay close to the metal. It's the library where we have a bare metal histogram without convenience features, with a very small code base. It is the library where we innovate to make histogramming faster and add new exciting accumulators.

I am ok with downstream libraries re-inventing more convenient ways to access axis. Your proposal is not only about that, you also want to enable keyword-based filling. All this is useful, but goes against this bare-metal idea.

If other libraries want these conveniences, they can just build on hist instead of boost-histogram.

@HDembinski
Copy link
Member

Boost-histogram was originally conceived as a thin wrapper over C++, so that it is easy to innovate on the C++ side and then add according features to Python. The more complex the Python wrapper is, the harder this is.

Regarding metadata and __dict__, we should only support on of the two. Keeping both dicts and a metadata field is redundant and not a sharp design.

@HDembinski
Copy link
Member

HDembinski commented Jul 19, 2020

The metadata class in C++ can handle the Python dict. The axis constructor in Python could pass all keywords not recognized to the dict, so writing metadata=... would still work, but also name=..., foo=..., etc. Boost-Histogram would still assign no meaning to these fields.

@HDembinski
Copy link
Member

The problem with that design of course is that it is open for conflicts between fields that already exist on the axis or histogram, which was why funneling everything through the metadata field is the cleaner solution.

@HDembinski
Copy link
Member

HDembinski commented Jul 19, 2020

It is fine if boost-histogram doesn't have an explicit metadata field. The metadata is anyway optional in C++. Its whole purpose is to allow adding runtime metadata to an axis. We changed many other things to make it more Pythonic. The natural way in Python to add arbitrary metadata is the __dict__.

@henryiii
Copy link
Member Author

boost-histogram's scope is to be the best filling and manipulation library for histograms in Python. Hist's scope is to add plotting, shortcuts, simple access to other libraries, dependencies etc. Of all the current and planned features for Hist, name sticks out because it is part of filling and manipulation and not something that is easily added after the fact with simple subclasses.

I am ok with downstream libraries re-inventing more convenient ways to access axis.

This is non-trivial, and Hist has to integrate deeply with boost-histogram, since it has to modify the internal caching system to make NamedAxesTuple the item that gets cached instead of AxesTuple. It would be much easier to support, and would be much simpler code, if boost-histogram had the idea of names built-in. It should be telling that I had to add this and it caused a PR or two in boost-histogram, while @LovelyBuggies was able to do the other parts.

I'm perfectly fine with dropping name-based filling from boost-histogram, actually, and making that Hist only - that's a tiny bit magical, and that's easy for a downstream library to add. It's using names in the three places where integers currently already can be used (AxesTuple, projection, and dict indexing) that would be nice to upstream.

@henryiii
Copy link
Member Author

henryiii commented Jul 20, 2020

By the way, having a way to transmit other information besides the metadata through transforms would solve the biggest problem - and we could remove label/name from the PlottableHistogram - the official spec would lose the ability to attach labels to histograms in a uniform way, but it could be added optionally for some libraries/situations. I think it is not optimal, but I could live with it. I think name is a Pythonic addition that is not natural in C++ - in Python, you have inspection, and class __name__, and you can always access an object by name in locals/globals, a concept missing from C++ (for now).

Also, name should be a read only property - allowing it to be changed after an axis is created is not ideal/more error prone.

I'm not a fan of allowing arbitrary keywords in the constructor - misspelling keywords like undrflow would not give errors, and we would be limited in what we could do later - for example, Hist might add a flow= keyword to disable the flow in one keyword - and then moving to boost-histogram would cause this to silently do something different that would be hard to catch (just one possible example). (That's also the reasoning behind the suggestion of dropping the keyword fill for boost-histogram above, and making it hist only - though in that case, a mistake would throw an error). I think it would be best to allow metadata (and maybe name/label if those are added or as special cases), but any other item would have to be set after the construction. Subclasses would be free to do whatever they wanted.

build on hist instead of boost-histogram

Hist is not designed to be built upon; boost-histogram is. Hist is designed to be an end-user library, and hopefully one of several - I'd like to see if Physt could be built on boost-histogram too, for example. If sata acquisition/online picks up our tools, they should use boost-histogram, not Hist. I think we should still make boost-histogram the best it can be, rather than intentionally complicating Hist and making boost-histogram a sub-optimal library that only Hist can use. Hist is supposed to help boost-histogram, not limit/damage it.

@henryiii
Copy link
Member Author

PS: I do understand that if someone comes to boost-histogram, sees names, then goes to Boost.Histogram, they may wonder why it doesn't have names. I really don't think that will be a large number of people, I think the answer is easy, names and labels are part of the Histogram ecosystem in Python - there isn't one for C++, there isn't a Boost.Plot, and it's a language where looping yourself is encouraged. Two of the three uses (dict UHI and AxesTuple) don't exist in Boost.Histogram.

I'm focused on making the best possible histogram library for Python - I think the number of people wanting a great histogram library for Python outnumbers the number of people wanting a perfect clone of a C++ library, and I think there will always be a difference between a compiled language and a very different interpreted language. But I do understand if you really don't want to have this difference.

@henryiii
Copy link
Member Author

henryiii commented Sep 8, 2020

By the way, "h.name" will not be 1:1 when converting to Hist - .name has to be a read-only property, as changing the name after the histogram is created would allow multiple axes with the same name (which can be checked on creation of a histogram, but not afterwords on axis instances). Allowing arbitrary properties still is much better than before, since we can convert .name to ._name and then back. Actually the converting back might be tricky, as that code may sit inside boost-histogram...

If you did want to allow just name; that gives the nice integrated manipulation in AxesTuple and such, and then label is just a "normal" property now. ;)

@HDembinski
Copy link
Member

boost-histogram's scope is to be the best filling and manipulation library for histograms in Python.

No, boost-histogram's scope is to bring the power and flexibility of Boost.Histogram in C++ to Python with a minimal API.

@HDembinski
Copy link
Member

HDembinski commented Sep 29, 2020

Dynamically generating keywords in .fill and attributes to AxesTuple based on a name keyword to the axis is not Pythonic. Tuples are accesses by index and not keys and vice versa for dicts. Your enhanced AxisTuple is neither a tuple nor a dict. It is not Pythonic. I will not accept that in boost-histogram.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants