The easiest thing is to use the .value attribute of the HDF5 dataset (deprecated in later h5py releases; see Note 2 below).
>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf.get('dataset_name').value # `data` is now an ndarray.
You can also slice the dataset, which produces an actual ndarray with the requested data:
>>> hf['dataset_name'][:10] # produces ndarray as well
But keep in mind that in many ways the h5py dataset acts like an ndarray, so you can pass the dataset itself unchanged to most, if not all, NumPy functions. For example, this works just fine: np.mean(hf.get('dataset_name')).
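As a rough, self-contained sketch of the three access patterns above (slicing, reading everything, and passing the dataset straight to NumPy): the file path and the contents of 'dataset_name' are made up for illustration, and the file is written to a temporary directory first so the snippet can run on its own.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and data, created only so the example is runnable.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as hf:
    hf.create_dataset("dataset_name", data=np.arange(10.0))

with h5py.File(path, "r") as hf:
    dset = hf["dataset_name"]
    first_three = dset[:3]      # slicing reads only the requested elements into an ndarray
    everything = dset[()]       # reads the whole dataset into an ndarray
    mean_value = np.mean(dset)  # the dataset object itself is handed to NumPy
```

The arrays are copies held in memory, so they stay usable after the file is closed; the dataset object itself does not.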
EDIT:
I misunderstood the question originally. The problem isn't loading the numerical data, it's that the dataset actually contains HDF5 references. This is a strange setup, and it's kind of awkward to read in h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.
First, let's create a file and a temporary dataset:
>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10,))
Next, create a reference to it and store a few of them in a dataset.
>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)
Then you can read one of these back, in a circuitous way, by getting its name and then reading from the dataset that it references.
>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)
It's roundabout, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.
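To handle every reference rather than just one, a loop over the reference dataset is enough. The sketch below swaps in a different technique from the name lookup above: h5py's high-level API lets you index a file or group with a reference object (f[ref]) to get the referenced object directly. The dataset names 'a' and 'b' and the temporary file are assumptions made up for the example, and the reference dataset is filled element by element rather than from a tuple.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with two small datasets and a dataset of references to them.
path = os.path.join(tempfile.mkdtemp(), "refs.h5")
with h5py.File(path, "w") as f:
    a = f.create_dataset("a", data=np.arange(5))
    b = f.create_dataset("b", data=np.ones(3))
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_ds = f.create_dataset("data_refs", (2,), dtype=ref_dtype)
    ref_ds[0] = a.ref
    ref_ds[1] = b.ref

with h5py.File(path, "r") as f:
    ref_ds = f["data_refs"]
    arrays = []
    for i in range(ref_ds.shape[0]):
        target = f[ref_ds[i]]      # indexing the file with a reference returns the referenced dataset
        arrays.append(target[()])  # copy the referenced data into an ndarray
```

If the file only stores names you trust, the get_name approach shown earlier should give the same datasets; this variant just skips the intermediate name.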
Note:
The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example here). Getting the name and reading from that is much easier.
Note 2:
As stated in the release notes for h5py 2.1, the use of the Dataset.value property is deprecated and should be replaced by mydataset[...] or mydataset[()] as appropriate. Quoting the release notes:
"The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate."
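A minimal sketch of what that replacement looks like in practice, assuming a made-up file with one array dataset ('vector') and one scalar dataset ('scalar'). As far as I know the property was removed entirely in h5py 3.0, so on recent versions the indexing forms are the only option.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file holding one 1-D dataset and one scalar dataset.
path = os.path.join(tempfile.mkdtemp(), "value_migration.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("vector", data=np.arange(4))
    f.create_dataset("scalar", data=3.5)

with h5py.File(path, "r") as f:
    vector = f["vector"][()]       # whole array, in place of f["vector"].value
    vector_alt = f["vector"][...]  # same data via Ellipsis indexing
    scalar = f["scalar"][()]       # the stored scalar value
    scalar_arr = f["scalar"][...]  # the same value wrapped in a 0-dimensional ndarray
```

For array datasets the two forms are interchangeable; the difference only shows up for scalar datasets, where [()] is usually the closer match to what .value returned.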