thelper.data package

Dataset parsing/loading package.

This package contains classes and functions whose role is to fetch the data required to train, validate, and test a model. The thelper.data.utils.create_loaders() function contained herein is responsible for preparing the task and data loaders for a training session. This package also contains the base interfaces for dataset parsers.

Submodules

thelper.data.loaders module

Dataset loaders module.

This module contains a dataset loader specialization used to properly seed samplers and workers.

class thelper.data.loaders.DataLoader(*args, seeds=None, epoch=0, collate_fn=<function default_collate>, **kwargs)[source]

Bases: torch.utils.data.DataLoader

Specialized data loader used to load minibatches from a dataset parser.

This specialization handles the seeding of samplers and workers.

See torch.utils.data.DataLoader for more information on attributes/methods.

__init__(*args, seeds=None, epoch=0, collate_fn=<function default_collate>, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

sample_count
set_epoch(epoch=0)[source]

Sets the current epoch number in order to offset RNG states for the workers and the sampler.
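
The exact seeding scheme is internal to the framework, but the general idea can be sketched as follows; the offset scheme and seed dictionary keys below are assumptions for illustration, not thelper's actual code:

import random
import numpy as np
import torch

def make_worker_init_fn(seeds, epoch):
    """Hypothetical worker seeding: offset each RNG by the epoch and worker id so that
    every (epoch, worker) pair gets a distinct but reproducible state."""
    def worker_init_fn(worker_id):
        offset = epoch * 1000 + worker_id  # arbitrary offset scheme, for illustration only
        random.seed(seeds.get("random", 0) + offset)
        np.random.seed(seeds.get("numpy", 0) + offset)
        torch.manual_seed(seeds.get("torch", 0) + offset)
    return worker_init_fn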

class thelper.data.loaders.DataLoaderWrapper(loader, callback)[source]

Bases: thelper.data.loaders.DataLoader

Data loader wrapper used to transform all loaded samples with an external function.

This can be useful to convert the samples before the user gets to access them, or to upload them to a specific device.

The wrapped data loader should be compatible with thelper.data.loaders.DataLoader.
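
For example, assuming a train_loader was obtained beforehand (e.g. from thelper.data.utils.create_loaders()), a callback can push every loaded sample onto the GPU; the callback below is purely illustrative:

import torch
import thelper.data.loaders

def to_cuda(sample):
    # hypothetical callback: move all tensor values of the sample dictionary to the GPU
    return {k: (v.cuda() if isinstance(v, torch.Tensor) else v) for k, v in sample.items()}

wrapped_loader = thelper.data.loaders.DataLoaderWrapper(train_loader, to_cuda)
for sample in wrapped_loader:
    pass  # samples are already transformed by the callback when they reach the user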

__init__(loader, callback)[source]

Initialize self. See help(type(self)) for accurate signature.

class thelper.data.loaders.LoaderFactory(config)[source]

Bases: object

Factory used for preparing and splitting dataset parsers into usable data loader objects.

This class is responsible for parsing the parameters contained in the ‘loaders’ field of a configuration dictionary, instantiating the data loaders, and shuffling/splitting the samples. An example configuration is presented in thelper.data.utils.create_loaders().

__init__(config)[source]

Receives and parses the data configuration dictionary.

create_loaders(datasets, train_idxs, valid_idxs, test_idxs)[source]

Returns the data loaders for the train/valid/test sets based on a prior split.

This function essentially takes the dataset parser interfaces and indices maps, and instantiates data loaders that are ready to produce samples for training or evaluation. Note that the dataset parsers will be deep-copied in each data loader, meaning that they should ideally not contain a persistent loading state or a large buffer.

Parameters:
  • datasets – the map of dataset parsers, where each has a name (key) and a parser (value).
  • train_idxs – training data samples indices map.
  • valid_idxs – validation data samples indices map.
  • test_idxs – test data samples indices map.
Returns:

A three-element tuple containing the training, validation, and test data loaders, respectively.

get_base_transforms()[source]

Returns the (global) sample transformation operations parsed in the data configuration.

get_split(datasets, task)[source]

Returns the train/valid/test sample indices split for a given dataset (name-parser) map.

Note that the returned indices are unique, possibly shuffled, and never duplicated between sets. If the samples have a class attribute (i.e. the task is related to classification), the split will respect the initial distribution and apply the ratios within the classes themselves. For example, consider a dataset of three classes (\(A\), \(B\), and \(C\)) that contains 100 samples distributed as follows:

\[|A| = 50,\;|B| = 30,\;|C| = 20\]

If we require an 80%-10%-10% ratio distribution for the training, validation, and test loaders respectively, the resulting split will contain the following sample counts:

\[\text{training loader} = {40A + 24B + 16C}\]
\[\text{validation loader} = {5A + 3B + 2C}\]
\[\text{test loader} = {5A + 3B + 2C}\]

In the case of multi-label classification datasets, there is no guarantee that the classes will be balanced across the training/validation/test sets. Instead, for a given class list, the classes with fewer samples will be split first.

Parameters:
  • datasets – the map of datasets to split, where each has a name (key) and a parser (value).
  • task – a task object that should be compatible with all provided datasets (can be None).
Returns:

A three-element tuple containing the maps of the training, validation, and test sets respectively. These maps associate dataset names to a list of sample indices.
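
As an illustration of the class-aware behavior described above, here is a small standalone sketch of a stratified split (not the factory's actual implementation; class names and ratios are arbitrary):

import random

def stratified_split(indices_by_class, ratios=(0.8, 0.1, 0.1)):
    """Toy per-class split: shuffle each class, then cut it according to the ratios."""
    train, valid, test = [], [], []
    for label, idxs in indices_by_class.items():
        idxs = list(idxs)
        random.shuffle(idxs)
        n_train = round(ratios[0] * len(idxs))
        n_valid = round(ratios[1] * len(idxs))
        train += idxs[:n_train]
        valid += idxs[n_train:n_train + n_valid]
        test += idxs[n_train + n_valid:]
    return train, valid, test

# with |A|=50, |B|=30, |C|=20 and an 80%-10%-10% split, this yields 80/10/10 samples
splits = stratified_split({"A": range(50), "B": range(50, 80), "C": range(80, 100)})
print([len(s) for s in splits])  # [80, 10, 10]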

thelper.data.loaders.default_collate(batch, force_tensor=True)[source]

Puts each data field into a tensor with outer dimension batch size.

This function is copied from PyTorch’s torch.utils.data._utils.collate.default_collate, but additionally supports custom objects from the framework (such as bounding boxes). These will not be converted to tensors, and it will be up to the trainer to handle them accordingly.

See torch.utils.data.DataLoader for more information.
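
The behavior described above can be summarized with the following simplified sketch; it is illustrative only, and the real function handles many more element types:

import torch

def collate_sketch(batch):
    """Stacks tensors, recurses into dictionaries, and leaves custom objects
    (e.g. bounding boxes) untouched so that the trainer can handle them."""
    elem = batch[0]
    if isinstance(elem, torch.Tensor):
        return torch.stack(batch, dim=0)
    if isinstance(elem, dict):
        return {key: collate_sketch([b[key] for b in batch]) for key in elem}
    return list(batch)  # custom framework objects are returned as-is, in a plain list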

thelper.data.parsers module

Dataset parsers module.

This module contains dataset parser interfaces and base classes that define basic i/o operations so that the framework can automatically interact with training data.

class thelper.data.parsers.ClassificationDataset(class_names, input_key, label_key, meta_keys=None, transforms=None, deepcopy=False)[source]

Bases: thelper.data.parsers.Dataset

Classification dataset specialization interface.

This specialization receives some extra parameters in its constructor and automatically defines a thelper.tasks.classif.Classification task based on those. The derived class must still implement thelper.data.parsers.ClassificationDataset.__getitem__(), and it must still store its samples as dictionaries in self.samples to behave properly.

__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(class_names, input_key, label_key, meta_keys=None, transforms=None, deepcopy=False)[source]

Classification dataset parser constructor.

This constructor receives all extra arguments necessary to build a classification task object.

Parameters:
  • class_names – list of all class names (or labels) that will be associated with the samples.
  • input_key – key used to index the input data in the loaded samples.
  • label_key – key used to index the label (or class name) in the loaded samples.
  • meta_keys – list of extra keys that will be available in the loaded samples.
  • transforms – function or object that should be applied to all loaded samples in order to return the data in the requested transformed/augmented state.
  • deepcopy – specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.
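
A minimal (hypothetical) subclass could look like the following sketch; the class names, dictionary keys, and in-memory storage are purely illustrative, and the transform application is shown naively:

import copy
import thelper.data.parsers

class InMemoryClassifDataset(thelper.data.parsers.ClassificationDataset):
    """Hypothetical parser serving (image, label) pairs kept in memory."""

    def __init__(self, images, labels, transforms=None):
        super().__init__(class_names=["cat", "dog"], input_key="image", label_key="label",
                         meta_keys=["idx"], transforms=transforms)
        # samples must be stored as dictionaries in self.samples
        self.samples = [{"image": img, "label": lbl, "idx": idx}
                        for idx, (img, lbl) in enumerate(zip(images, labels))]

    def __getitem__(self, idx):
        sample = copy.copy(self.samples[idx])  # never modify the stored samples in-place
        if self.transforms is not None:
            sample = self.transforms(sample)  # transform application shown naively here
        return sample
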
class thelper.data.parsers.Dataset(transforms=None, deepcopy=False)[source]

Bases: torch.utils.data.Dataset

Abstract dataset parsing interface that holds a task and a list of sample dictionaries.

This interface addresses a shortcoming of PyTorch’s dataset interface (torch.utils.data.Dataset): the lack of identity associated with the components of a sample. In short, a data sample loaded by a dataset typically contains the input data that should be forwarded to a model as well as the expected prediction of the model (i.e. the ‘groundtruth’) that will be used to compute the loss. These two elements are typically paired in a tuple that can then be provided to the data loader for batching. Problems arise, however, when the model has multiple inputs or outputs, when the sample needs to carry supplemental metadata to simplify debugging, or when transformation operations need to be applied only to specific elements of the sample. Here, we fix this issue by specifying that all samples must be provided to data loaders as dictionaries. The keys of these dictionaries explicitly define which value(s) should be transformed, which should be forwarded to the model, which are the expected model predictions, and which are only used for debugging. The keys are defined via the task object that is generated by the dataset or specified via the configuration file (see thelper.tasks.utils.Task for more information).

To properly use this interface, a derived class must implement thelper.data.parsers.Dataset.__getitem__(), as well as provide proper task and samples attributes. The task attribute must derive from thelper.tasks.utils.Task, and samples must be an array-like object holding already-parsed information about the dataset samples (in dictionary format). The length of the samples array will automatically be returned as the size of the dataset in this interface. For class-based datasets, it is recommended to parse the classes in the dataset constructor so that external code can directly peek into the samples attribute to see their distribution without having to call __getitem__. This is done for example in thelper.data.loaders.LoaderFactory.get_split() to automatically rebalance classes without having to actually load the samples one by one, which speeds up the process dramatically.

Variables:
  • transforms – function or object that should be applied to all loaded samples in order to return the data in the requested transformed/augmented state.
  • deepcopy – specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.
  • samples – list of dictionaries containing the data that is ready to be forwarded to the data loader. Note that relatively costly operations (such as reading images from a disk or pre-transforming them) should be delayed until the thelper.data.parsers.Dataset.__getitem__() function is called, as they will most likely then be accomplished in a separate thread. Once loaded, these samples should never be modified by another part of the framework. For example, transformation and augmentation operations will always be applied to copies of these samples.
  • task – object used to define what keys are used to index the loaded data into sample dictionaries.
__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(transforms=None, deepcopy=False)[source]

Dataset parser constructor.

In order for derived datasets to be instantiated automatically by the framework from a configuration file, they must minimally accept a ‘transforms’ argument like the one shown here.

Parameters:
  • transforms – function or object that should be applied to all loaded samples in order to return the data in the requested transformed/augmented state.
  • deepcopy – specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.
deepcopy

specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.

samples

Returns the list of internal samples held by this dataset interface.

task

Returns the task object associated with this dataset interface.

transforms

Returns the transformation operations to apply to this dataset’s loaded samples.

class thelper.data.parsers.ExternalDataset(dataset, task, transforms=None, deepcopy=False, **kwargs)[source]

Bases: thelper.data.parsers.Dataset

External dataset interface.

This interface allows external classes to be instantiated automatically in the framework through a configuration file, as long as they themselves provide implementations for __getitem__ and __len__. This includes all derived classes of torch.utils.data.Dataset such as torchvision.datasets.ImageFolder, and the specialized versions such as torchvision.datasets.CIFAR10.

Note that for this interface to be compatible with our runtime instantiation rules, the constructor needs to receive a fully constructed task object. This object is currently constructed in thelper.data.utils.create_parsers() based on extra parameters; see the code there for more information.

Variables:
  • dataset_type – type of the internally instantiated or provided dataset object.
  • warned_dictionary – specifies whether the user was warned about missing keys in the output sample dictionaries.
__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(dataset, task, transforms=None, deepcopy=False, **kwargs)[source]

External dataset parser constructor.

Parameters:
  • dataset – fully qualified name of the dataset object to instantiate, or the dataset itself.
  • task – fully constructed task object providing key information for sample loading.
  • transforms – function or object that should be applied to all loaded samples in order to return the data in the requested transformed/augmented state.
  • deepcopy – specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.
class thelper.data.parsers.HDF5Dataset(root, subset='train', transforms=None)[source]

Bases: thelper.data.parsers.Dataset

HDF5 dataset specialization interface.

This specialization is compatible with the HDF5 packages made by the CLI’s “split” operation. The archives it loads contain pre-split datasets that can be reloaded without having to resplit their data. The archive also contains useful metadata, and a task interface.

Variables:
  • archive – file descriptor for the opened hdf5 dataset.
  • subset – hdf5 group section representing the targeted set.
  • target_args – list of decompression args required for each sample key.
  • source – source logstamp of the hdf5 dataset.
  • git_sha1 – framework git tag of the hdf5 dataset.
  • version – version of the framework that saved the hdf5 dataset.
  • orig_config – configuration used to originally generate the hdf5 dataset.
__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(root, subset='train', transforms=None)[source]

HDF5 dataset parser constructor.

This constructor receives the path to the HDF5 archive as well as a subset indicating which section of the archive to load. By default, it loads the training set.

close()[source]

Closes the internal HDF5 file.
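
Typical (hypothetical) usage, assuming an archive previously produced by the CLI’s “split” operation:

import thelper.data.parsers

# "sessions/dataset.hdf5" is a hypothetical archive path
dataset = thelper.data.parsers.HDF5Dataset("sessions/dataset.hdf5", subset="valid")
print(len(dataset), dataset.task)  # sample count and reloaded task interface
sample = dataset[0]                # sample dictionary for the first archived sample
dataset.close()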

class thelper.data.parsers.ImageDataset(root, transforms=None, image_key='image', path_key='path', idx_key='idx')[source]

Bases: thelper.data.parsers.Dataset

Image dataset specialization interface.

This specialization is used to parse simple image folders, and it does not fulfill the requirements of any specialized task constructors due to the lack of groundtruth data support. Therefore, it returns a basic task object (thelper.tasks.utils.Task) with no set value for the groundtruth key, and it cannot be used to directly train a model. It can however be useful when simply visualizing, annotating, or testing raw data from a simple directory structure.

__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(root, transforms=None, image_key='image', path_key='path', idx_key='idx')[source]

Image dataset parser constructor.

This constructor exposes some of the configurable keys used to index sample dictionaries.

class thelper.data.parsers.ImageFolderDataset(root, transforms=None, image_key='image', label_key='label', path_key='path', idx_key='idx')[source]

Bases: thelper.data.parsers.ClassificationDataset

Image folder dataset specialization interface for classification tasks.

This specialization is used to parse simple image subfolders, and it essentially replaces the very basic torchvision.datasets.ImageFolder interface with similar functionalities. It is used to provide a proper task interface as well as path metadata in each loaded packet for metrics/logging output.

__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(root, transforms=None, image_key='image', label_key='label', path_key='path', idx_key='idx')[source]

Image folder dataset parser constructor.

class thelper.data.parsers.SegmentationDataset(class_names, input_key, label_map_key, meta_keys=None, dontcare=None, transforms=None, deepcopy=False)[source]

Bases: thelper.data.parsers.Dataset

Segmentation dataset specialization interface.

This specialization receives some extra parameters in its constructor and automatically defines its task (thelper.tasks.segm.Segmentation) based on those. The derived class must still implement thelper.data.parsers.SegmentationDataset.__getitem__(), and it must still store its samples as dictionaries in self.samples to behave properly.

__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(class_names, input_key, label_map_key, meta_keys=None, dontcare=None, transforms=None, deepcopy=False)[source]

Segmentation dataset parser constructor.

This constructor receives all extra arguments necessary to build a segmentation task object.

Parameters:
  • class_names – list of all class names (or labels) that must be predicted in the image.
  • input_key – key used to index the input image in the loaded samples.
  • label_map_key – key used to index the label map in the loaded samples.
  • meta_keys – list of extra keys that will be available in the loaded samples.
  • dontcare – optional label value used to flag ‘dontcare’ regions in the label maps, which should be ignored (can be None).
  • transforms – function or object that should be applied to all loaded samples in order to return the data in the requested transformed/augmented state.
  • deepcopy – specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.
class thelper.data.parsers.SuperResFolderDataset(root, downscale_factor=2.0, rescale_lowres=True, center_crop=None, transforms=None, lowres_image_key='lowres_image', highres_image_key='highres_image', path_key='path', idx_key='idx', label_key='label')[source]

Bases: thelper.data.parsers.Dataset

Image folder dataset specialization interface for super-resolution tasks.

This specialization is used to parse simple image subfolders, and it essentially replaces the very basic torchvision.datasets.ImageFolder interface with similar functionalities. It is used to provide a proper task interface as well as path/class metadata in each loaded packet for metrics/logging output.

__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(root, downscale_factor=2.0, rescale_lowres=True, center_crop=None, transforms=None, lowres_image_key='lowres_image', highres_image_key='highres_image', path_key='path', idx_key='idx', label_key='label')[source]

Image folder dataset parser constructor.

thelper.data.pascalvoc module

PASCAL VOC dataset parser module.

This module contains a dataset parser used to load the PASCAL Visual Object Classes (VOC) dataset for semantic segmentation or object detection. See http://host.robots.ox.ac.uk/pascal/VOC/ for more info.

class thelper.data.pascalvoc.PASCALVOC(root, task='segm', subset='trainval', target_labels=None, download=False, preload=True, use_difficult=False, use_occluded=True, use_truncated=True, transforms=None, image_key='image', sample_name_key='name', idx_key='idx', image_path_key='image_path', gt_path_key='gt_path', bboxes_key='bboxes', label_map_key='label_map')[source]

Bases: thelper.data.parsers.Dataset

PASCAL VOC dataset parser.

This class can be used to parse the PASCAL VOC dataset for either semantic segmentation or object detection. The task object it exposes will be changed accordingly. In all cases, the 2012 version of the dataset will be used.

TODO: Add support for semantic instance segmentation.
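
A minimal (hypothetical) instantiation using the documented defaults could look like this; the root path is an assumption:

import thelper.data.pascalvoc

# "data/VOCdevkit" is a hypothetical local root path for the (2012) PASCAL VOC data
dataset = thelper.data.pascalvoc.PASCALVOC("data/VOCdevkit", task="segm",
                                           subset="trainval", preload=False)
sample = dataset[0]  # sample dictionary with the image, its name, and its groundtruth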

__getitem__(idx)[source]

Returns the data sample (a dictionary) for a specific (0-based) index.

__init__(root, task='segm', subset='trainval', target_labels=None, download=False, preload=True, use_difficult=False, use_occluded=True, use_truncated=True, transforms=None, image_key='image', sample_name_key='name', idx_key='idx', image_path_key='image_path', gt_path_key='gt_path', bboxes_key='bboxes', label_map_key='label_map')[source]

Dataset parser constructor.

In order for derived datasets to be instantiated automatically by the framework from a configuration file, they must minimally accept a ‘transforms’ argument like the one shown here.

Parameters:
  • transforms – function or object that should be applied to all loaded samples in order to return the data in the requested transformed/augmented state.
  • deepcopy – specifies whether this dataset interface should be deep-copied inside thelper.data.loaders.LoaderFactory so that it may be shared between different threads. This is false by default, as we assume datasets do not contain a state or buffer that might cause problems in multi-threaded data loaders.
decode_label_map(label_map)[source]

Returns a color image from a label indices map.

encode_label_map(label_map)[source]

Returns a map of label indices from a color image.

thelper.data.samplers module

Samplers module.

This module contains classes used for raw dataset rebalancing or augmentation.

All samplers here should aim to be compatible with PyTorch’s sampling interface (torch.utils.data.sampler.Sampler) so that they can be instantiated at runtime through a configuration file and used as the input of a data loader.

class thelper.data.samplers.FixedWeightSubsetSampler(indices, labels, weights, seeds=None, epoch=0)[source]

Bases: torch.utils.data.sampler.Sampler

Provides a rebalanced list of sample indices to use in a data loader.

Given a list of sample indices and the corresponding list of class labels, this sampler will produce a new list of indices that rebalances the distribution of samples according to a provided map of class weights.

Example configuration file:

# ...
# the sampler is defined inside the 'loaders' field
"loaders": {
    # ...
    # this field is completely optional, and can be omitted entirely
    "sampler": {
        # the type of the sampler we want to instantiate
        "type": "thelper.data.samplers.FixedWeightSubsetSampler",
        # the parameters passed to the sampler's constructor
        "params": {
            "weights": {
                # the weights must be provided using class name pairs
                "class_A": 0.1,
                "class_B": 5.0,
                "class_C": 1.0,
                # ...
            }
        },
        # specifies whether the sampler should receive class labels
        "pass_labels": true
    },
    # ...
},
# ...
Variables:
  • nb_samples – total number of samples to rebalance.
  • weights – weight map to use for sampling each class.
  • indices – copy of the original list of sample indices provided in the constructor.
  • seeds – dictionary of seeds to use when initializing RNG state.
  • epoch – epoch number used to reinitialize the RNG to an epoch-specific state.
__init__(indices, labels, weights, seeds=None, epoch=0)[source]

Receives sample indices, labels, and the per-class weight map used for rebalancing.

This function will validate all input arguments, parse and categorize samples according to labels, initialize rebalancing parameters, and determine sample counts for each valid class. Note that an empty list of indices is an acceptable input; the resulting object will also produce an empty list of samples when __iter__ is called.

Parameters:
  • indices – list of integers representing the indices of samples of interest in the dataset.
  • labels – list of labels tied to the list of indices (must be the same length).
  • weights – weight map to use for sampling each class.
  • seeds – dictionary of seeds to use when initializing RNG state.
  • epoch – epoch number used to reinitialize the RNG to an epoch-specific state.
set_epoch(epoch=0)[source]

Sets the current epoch number in order to offset the RNG state for sampling.

class thelper.data.samplers.SubsetRandomSampler(indices, seeds=None, epoch=0, scale=1.0)[source]

Bases: torch.utils.data.sampler.Sampler

Samples elements randomly from a given list of indices, without replacement.

This specialization handles seeding based on the epoch number, and scaling (via duplication/decimation) of samples.

Parameters:
  • indices (list) – a list of indices
  • seeds (dict) – dictionary of seeds to use when initializing RNG state.
  • epoch (int) – epoch number used to reinitialize the RNG to an epoch-specific state.
  • scale (float) – scaling factor used to increase/decrease the final number of samples.
__init__(indices, seeds=None, epoch=0, scale=1.0)[source]

Initialize self. See help(type(self)) for accurate signature.

set_epoch(epoch=0)[source]

Sets the current epoch number in order to offset the RNG state for sampling.

class thelper.data.samplers.SubsetSequentialSampler(indices)[source]

Bases: torch.utils.data.sampler.Sampler

Samples element indices sequentially, always in the same order.

Parameters:
  • indices (list) – a list of indices
__init__(indices)[source]

Initialize self. See help(type(self)) for accurate signature.

class thelper.data.samplers.WeightedSubsetRandomSampler(indices, labels, stype='uniform', scale=1.0, seeds=None, epoch=0)[source]

Bases: torch.utils.data.sampler.Sampler

Provides a rebalanced list of sample indices to use in a data loader.

Given a list of sample indices and the corresponding list of class labels, this sampler will produce a new list of indices that rebalances the distribution of samples according to a specified strategy. It can also optionally scale the dataset’s total sample count to avoid undersampling large classes as smaller ones get bigger.

The currently implemented strategies are:

  • random: will return a list of randomly picked samples based on the multinomial distribution of the initial class weights. This sampling is done with replacement, meaning that each index is picked independently of the already-picked ones.

  • uniform: will rebalance the dataset by normalizing the sample count of all classes, oversampling and undersampling as required to distribute all samples equally. All removed or duplicated samples are selected randomly without replacement whenever possible.

  • root: will rebalance the dataset by normalizing class weights using an n-th degree root. More specifically, for a list of initial class weights \(W^0=\{w_1^0, w_2^0, ... w_n^0\}\), we compute the adjusted weight \(w_i\) of each class via:

    \[w_i = \frac{\sqrt[n]{w_i^0}}{\sum_j\sqrt[n]{w_j^0}}\]

    Then, according to the new distribution of weights, all classes are oversampled and undersampled as required to reobtain the dataset’s total sample count (which may be scaled). All removed or duplicated samples are selected randomly without replacement whenever possible.

    Note that with the root strategy, if a very large root degree n is used, this strategy is equivalent to uniform. If the degree is one, the original weights will be used for sampling. The root strategy essentially provides a flexible solution to rebalance very uneven label sets where uniform over/undersampling would be too aggressive.

By default, this interface will try to keep the dataset size constant and balance oversampling with undersampling. If undersampling is undesired, the user can increase the total dataset size via a scale factor. Finally, note that the rebalanced list of indices is generated by this interface every time the __iter__ function is called, meaning two consecutive lists might not contain the exact same indices.
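
The root reweighting formula above can be illustrated with a few lines of numpy; this only demonstrates the math, not the sampler’s actual implementation:

import numpy as np

def root_weights(initial_weights, degree):
    """Applies the n-th degree root rebalancing to a list of initial class weights."""
    w = np.asarray(initial_weights, dtype=float) ** (1.0 / degree)
    return w / w.sum()

print(root_weights([0.5, 0.3, 0.2], degree=1))  # unchanged: [0.5, 0.3, 0.2]
print(root_weights([0.5, 0.3, 0.2], degree=3))  # flattened toward uniform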

Example configuration file:

# ...
# the sampler is defined inside the 'loaders' field
"loaders": {
    # ...
    # this field is completely optional, and can be omitted entirely
    "sampler": {
        # the type of the sampler we want to instantiate
        "type": "thelper.data.samplers.WeightedSubsetRandomSampler",
        # the parameters passed to the sampler's constructor
        "params": {
            "stype": "root3",
            "scale": 1.2
        },
        # specifies whether the sampler should receive class labels
        "pass_labels": true
    },
    # ...
},
# ...
Variables:
  • nb_samples – total number of samples to rebalance (i.e. scaled size of original dataset).
  • label_groups – map that splits all samples indices into groups based on labels.
  • stype – name of the rebalancing strategy to use.
  • indices – copy of the original list of sample indices provided in the constructor.
  • sample_weights – list of weights used for random sampling.
  • label_counts – number of samples in each class for the uniform and root strategies.
  • seeds – dictionary of seeds to use when initializing RNG state.
  • epoch – epoch number used to reinitialize the RNG to an epoch-specific state.
__init__(indices, labels, stype='uniform', scale=1.0, seeds=None, epoch=0)[source]

Receives sample indices, labels, rebalancing strategy, and dataset scaling factor.

This function will validate all input arguments, parse and categorize samples according to labels, initialize rebalancing parameters, and determine sample counts for each valid class. Note that an empty list of indices is an acceptable input; the resulting object will also produce an empty list of samples when __iter__ is called.

Parameters:
  • indices – list of integers representing the indices of samples of interest in the dataset.
  • labels – list of labels tied to the list of indices (must be the same length).
  • stype – rebalancing strategy given as a string. Should be either “random”, “uniform”, or “rootX”, where the ‘X’ is the degree to use in the root computation (float).
  • scale – scaling factor used to increase/decrease the final number of sample indices to generate while rebalancing.
  • seeds – dictionary of seeds to use when initializing RNG state.
  • epoch – epoch number used to reinitialize the RNG to an epoch-specific state.
set_epoch(epoch=0)[source]

Sets the current epoch number in order to offset the RNG state for sampling.

thelper.data.utils module

Dataset utility functions and tools.

This module contains utility functions and tools used to instantiate data loaders and parsers.

thelper.data.utils.create_hdf5(archive_path, task, train_loader, valid_loader, test_loader, compression=None, config_backup=None)[source]

Saves the samples loaded from train/valid/test data loaders into an HDF5 archive.

The loaded minibatches are decomposed into individual samples. The keys provided via the task interface are used to fetch elements (input, groundtruth, …) from the samples, and save them in the archive. The archive will contain three groups (train, valid, and test), and each group will contain a dataset for each element originally found in the samples.

Note that the compression operates at the sample level, not at the dataset level. This means that elements of each sample will be compressed individually, not as an array. Therefore, if you are trying to compress highly correlated samples (e.g. frames in a video sequence), this approach will be quite inefficient.

Parameters:
  • archive_path – path pointing where the HDF5 archive should be created.
  • task – task object that defines the input, groundtruth, and meta keys tied to elements that should be parsed from loaded samples and saved in the HDF5 archive.
  • train_loader – training data loader (can be None).
  • valid_loader – validation data loader (can be None).
  • test_loader – testing data loader (can be None).
  • compression – the compression configuration dictionary that will be parsed to determine how sample elements should be compressed. If a mapping is missing, that element will not be compressed.
  • config_backup – optional session configuration file that should be saved in the HDF5 archive.

Example compression configuration:

# the config is given as a dictionary
{
    # each field is a key that corresponds to an element in each sample
    "key1": {
        # the 'type' identifies the compression approach to use
        # (see thelper.utils.encode_data for more information)
        "type": "jpg",
        # extra parameters might be needed to encode the data
        # (see thelper.utils.encode_data for more information)
        "encode_params": {}
        # these parameters are packed and kept for decoding
        # (see thelper.utils.decode_data for more information)
        "decode_params": {"flags": "cv.IMREAD_COLOR"}
    },
    "key2": {
        # this explicitly means that no encoding should be performed
        "type": "none"
    },
    ...
    # if a key is missing, its elements will not be compressed
}
thelper.data.utils.create_loaders(config, save_dir=None)[source]

Prepares the task and data loaders for a model trainer based on a provided data configuration.

This function will parse a configuration dictionary and extract all the information required to instantiate the requested dataset parsers. Then, combining the task metadata of all these parsers, it will evenly split the available samples into three sets (training, validation, test) to be handled by different data loaders. These will finally be returned along with the (global) task object.

The configuration dictionary is expected to contain two fields: loaders, which specifies all parameters required for establishing the dataset split, shuffling seeds, and batch size (these are listed and detailed below); and datasets, which lists the dataset parser interfaces to instantiate as well as their parameters. For more information on the datasets field, refer to thelper.data.utils.create_parsers().

The parameters expected in the ‘loaders’ configuration field are the following:

  • <train_/valid_/test_>batch_size (mandatory): specifies the (mini)batch size to use in data loaders. If you get an ‘out of memory’ error at runtime, try reducing it.
  • <train_/valid_/test_>collate_fn (optional): specifies the collate function to use in data loaders. The default one is typically fine, but some datasets might require a custom function.
  • shuffle (optional, default=True): specifies whether the data loaders should shuffle their samples or not.
  • test_seed (optional): specifies the RNG seed to use when splitting test data. If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.
  • valid_seed (optional): specifies the RNG seed to use when splitting validation data. If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.
  • torch_seed (optional): specifies the RNG seed to use for torch-related stochastic operations (e.g. for data augmentation). If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.
  • numpy_seed (optional): specifies the RNG seed to use for numpy-related stochastic operations (e.g. for data augmentation). If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.
  • random_seed (optional): specifies the RNG seed to use for stochastic operations with python’s ‘random’ package. If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.
  • workers (optional, default=1): specifies the number of threads to use to preload batches in parallel; can be 0 (loading will be on main thread), or an integer >= 1.
  • pin_memory (optional, default=False): specifies whether the data loaders will copy tensors into CUDA-pinned memory before returning them.
  • drop_last (optional, default=False): specifies whether to drop the last incomplete batch or not if the dataset size is not a multiple of the batch size.
  • sampler (optional): specifies a type of sampler and its constructor parameters to be used in the data loaders. This can be used for example to help rebalance a dataset based on its class distribution. See thelper.data.samplers for more information.
  • augments (optional): provides a list of transformation operations used to augment all samples of a dataset. See thelper.transforms.utils.load_augments() for more info.
  • train_augments (optional): provides a list of transformation operations used to augment the training samples of a dataset. See thelper.transforms.utils.load_augments() for more info.
  • valid_augments (optional): provides a list of transformation operations used to augment the validation samples of a dataset. See thelper.transforms.utils.load_augments() for more info.
  • test_augments (optional): provides a list of transformation operations used to augment the test samples of a dataset. See thelper.transforms.utils.load_augments() for more info.
  • eval_augments (optional): provides a list of transformation operations used to augment the validation and test samples of a dataset. See thelper.transforms.utils.load_augments() for more info.
  • base_transforms (optional): provides a list of transformation operations to apply to all loaded samples. This list will be passed to the constructor of all instantiated dataset parsers. See thelper.transforms.utils.load_transforms() for more info.
  • train_split (optional): provides the proportion of samples of each dataset to hand off to the training data loader. These proportions are given in a dictionary format (name: ratio).
  • valid_split (optional): provides the proportion of samples of each dataset to hand off to the validation data loader. These proportions are given in a dictionary format (name: ratio).
  • test_split (optional): provides the proportion of samples of each dataset to hand off to the test data loader. These proportions are given in a dictionary format (name: ratio).
  • skip_verif (optional, default=True): specifies whether the dataset split should be verified if resuming a session by parsing the log files generated earlier.
  • skip_split_norm (optional, default=False): specifies whether the question about normalizing the split ratios should be skipped or not.
  • skip_class_balancing (optional, default=False): specifies whether the balancing of class labels should be skipped in case the task is classification-related.

Example configuration file:

# ...
"loaders": {
    "batch_size": 128,  # batch size to use in data loaders
    "shuffle": true,  # specifies that the data should be shuffled
    "workers": 4,  # number of threads to pre-fetch data batches with
    "train_sampler": {  # we can use a data sampler to rebalance classes (optional)
        # see e.g. 'thelper.data.samplers.WeightedSubsetRandomSampler'
        # ...
    },
    "train_augments": { # training data augmentation operations
        # see 'thelper.transforms.utils.load_augments'
        # ...
    },
    "eval_augments": { # evaluation (valid/test) data augmentation operations
        # see 'thelper.transforms.utils.load_augments'
        # ...
    },
    "base_transforms": { # global sample transformation operations
        # see 'thelper.transforms.utils.load_transforms'
        # ...
    },
    # optionally indicate how to resolve dataset loader task vs model task incompatibility if any
    # leave blank to get more details about each case during runtime if this situation happens
    "task_compat_mode": "old|new|compat",
    # finally, we define a 80%-10%-10% split for our data
    # (we could instead use one dataset for training and one for testing)
    "train_split": {
        "dataset_A": 0.8
        "dataset_B": 0.8
    },
    "valid_split": {
        "dataset_A": 0.1
        "dataset_B": 0.1
    },
    "test_split": {
        "dataset_A": 0.1
        "dataset_B": 0.1
    }
    # (note that the dataset names above are defined in the field below)
},
"datasets": {
    "dataset_A": {
        # type of dataset interface to instantiate
        "type": "...",
        "params": {
            # ...
        }
    },
    "dataset_B": {
        # type of dataset interface to instantiate
        "type": "...",
        "params": {
            # ...
        },
        # if it does not derive from 'thelper.data.parsers.Dataset', a task is needed:
        "task": {
            # this type must derive from 'thelper.tasks.Task'
            "type": "...",
            "params": {
                # ...
            }
        }
    },
    # ...
},
# ...
Parameters:
  • config – a dictionary that provides all required data configuration information under two fields, namely ‘datasets’ and ‘loaders’.
  • save_dir – the path to the root directory where the session directory should be saved. Note that this is not the path to the session directory itself, but its parent, which may also contain other session directories.
Returns:

A 4-element tuple that contains – 1) the global task object to specialize models and trainers with; 2) the training data loader; 3) the validation data loader; and 4) the test data loader.
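
A typical usage sketch (the configuration file path and session directory below are hypothetical):

import json
import thelper.data.utils

# load a session configuration following the format documented above
with open("config.json") as fd:  # hypothetical configuration file
    config = json.load(fd)

task, train_loader, valid_loader, test_loader = thelper.data.utils.create_loaders(
    config, save_dir="sessions")  # "sessions" is the parent directory for session outputs

for sample in train_loader:
    break  # each batch is a dictionary whose keys are defined by the returned task object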

thelper.data.utils.create_parsers(config, base_transforms=None)[source]

Instantiates dataset parsers based on a provided dictionary.

This function will instantiate dataset parsers as defined in a name-type-param dictionary. If multiple datasets are instantiated, this function will also verify their task compatibility and return the global task. The dataset interfaces themselves should be derived from thelper.data.parsers.Dataset, be compatible with thelper.data.parsers.ExternalDataset, or should provide a ‘task’ field specifying all the information related to sample dictionary keys and model i/o.

The provided configuration will be parsed for a ‘datasets’ dictionary entry. The keys in this dictionary are treated as unique dataset names and are used for lookups. The value associated to each key (or dataset name) should be a type-params dictionary that can be parsed to instantiate the dataset interface.

An example configuration dictionary is given in thelper.data.utils.create_loaders().

Parameters:
  • config – a dictionary that provides unique dataset names and parameters needed for instantiation under the ‘datasets’ field.
  • base_transforms – the transform operation that should be applied to all loaded samples, and that will be provided to the constructor of all instantiated dataset parsers.
Returns:

A 2-element tuple that contains – 1) the list of dataset interfaces/parsers that were instantiated; and 2) a task object compatible with all of those (see thelper.tasks.utils.Task for more information).

thelper.data.utils.get_class_weights(label_map, stype='linear', maxw=inf, minw=0.0, norm=True, invmax=False)[source]

Returns a map of label weights that may be adjusted based on a given rebalancing strategy.

Parameters:
  • label_map – map of index lists or sample counts tied to class labels.
  • stype – weighting strategy (‘uniform’, ‘linear’, or ‘rootX’). Using ‘uniform’ will provide a uniform map of weights. Using ‘linear’ will return the actual weights, unmodified. Using ‘rootX’ will rebalance the weights according to factor ‘X’. See thelper.data.samplers.WeightedSubsetRandomSampler for more information on these strategies.
  • maxw – maximum allowed weight value (applied after invmax, if required).
  • minw – minimum allowed weight value (applied after invmax, if required).
  • norm – specifies whether the returned weights should be normalized (default=True, i.e. normalized).
  • invmax – specifies whether to max-invert the weight vector (thus creating cost factors) or not. Not compatible with norm (it would return weights again instead of factors).
Returns:

Map of weights tied to class labels.
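
For example (the class labels and counts below are hypothetical):

import thelper.data.utils

# label map given as per-class sample counts (lists of sample indices are also accepted)
label_map = {"class_A": 50, "class_B": 30, "class_C": 20}
weights = thelper.data.utils.get_class_weights(label_map, stype="root2", norm=True)
# 'weights' maps each class label to a normalized weight adjusted by the square-root strategy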