thelper.data package

Dataset parsing/loading package.

This package contains classes and functions whose role is to fetch the data required to train, validate, and test a model. The thelper.data.utils.create_loaders() function contained herein is responsible for preparing the task and data loaders for a training session. This package also contains the base interfaces for dataset parsers.

Submodules

thelper.data.loaders module

Dataset loaders module.

This module contains a dataset loader specialization used to properly seed samplers and workers.

class thelper.data.loaders.LoaderFactory(config)[source]

Bases: object

Factory used for preparing and splitting dataset parsers into usable data loader objects.

This class is responsible for parsing the parameters contained in the ‘loaders’ field of a configuration dictionary, instantiating the data loaders, and shuffling/splitting the samples. An example configuration is presented in thelper.data.utils.create_loaders().

__init__(config)[source]

Receives and parses the data configuration dictionary.

create_loaders(datasets, train_idxs, valid_idxs, test_idxs)[source]

Returns the data loaders for the train/valid/test sets based on a prior split.

This function essentially takes the dataset parser interfaces and indices maps, and instantiates data loaders that are ready to produce samples for training or evaluation. Note that the dataset parsers will be deep-copied in each data loader, meaning that they should ideally not contain a persistent loading state or a large buffer.

Parameters
  • datasets – the map of dataset parsers, where each has a name (key) and a parser (value).

  • train_idxs – map of training sample indices.

  • valid_idxs – map of validation sample indices.

  • test_idxs – map of test sample indices.

Returns

A three-element tuple containing the training, validation, and test data loaders, respectively.
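For reference, here is a minimal usage sketch. It assumes the dataset parser map and global task were already obtained (e.g. via thelper.data.utils.create_parsers()), and that the constructor receives a configuration dictionary containing the 'loaders' field described above; whether it expects the full session configuration or only that sub-dictionary should be confirmed against __init__(). See get_split() below for how the index maps are produced.

# minimal sketch (see assumptions above); 'config' is a session configuration dictionary,
# and 'datasets'/'task' are assumed to come from thelper.data.utils.create_parsers(config)
import thelper.data.loaders
factory = thelper.data.loaders.LoaderFactory(config)  # assumed: a dict containing the 'loaders' field
train_idxs, valid_idxs, test_idxs = factory.get_split(datasets, task)
train_loader, valid_loader, test_loader = factory.create_loaders(datasets, train_idxs, valid_idxs, test_idxs)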

get_base_transforms()[source]

Returns the (global) sample transformation operations parsed in the data configuration.

get_split(datasets, task)[source]

Returns the train/valid/test sample indices split for a given dataset (name-parser) map.

Note that the returned indices are unique, possibly shuffled, and never duplicated between sets. If the samples have a class attribute (i.e. the task is related to classification), the split will respect the initial distribution and apply the ratios within the classes themselves. For example, consider a dataset of three classes (\(A\), \(B\), and \(C\)) that contains 100 samples distributed as follows:

\[|A| = 50,\;|B| = 30,\;|C| = 20\]

If we require an 80%-10%-10% ratio distribution for the training, validation, and test loaders respectively, the resulting split will contain the following sample counts:

\[\text{training loader} = {40A + 24B + 16C}\]
\[\text{validation loader} = {5A + 3B + 2C}\]
\[\text{test loader} = {5A + 3B + 2C}\]
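
The per-class arithmetic above can be reproduced with a small illustrative snippet (this is not the library's actual implementation, only a restatement of the example):

# per-class split arithmetic from the example above (illustration only)
class_counts = {"A": 50, "B": 30, "C": 20}
ratios = {"train": 0.8, "valid": 0.1, "test": 0.1}
split_counts = {set_name: {label: int(count * ratio) for label, count in class_counts.items()}
                for set_name, ratio in ratios.items()}
# split_counts == {"train": {"A": 40, "B": 24, "C": 16},
#                  "valid": {"A": 5, "B": 3, "C": 2},
#                  "test":  {"A": 5, "B": 3, "C": 2}}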

In the case of multi-label classification datasets, there is no guarantee that the classes will be balanced across the training/validation/test sets. Instead, for a given class list, the classes with fewer samples will be split first.

Parameters
  • datasets – the map of datasets to split, where each has a name (key) and a parser (value).

  • task – a task object that should be compatible with all provided datasets (can be None).

Returns

A three-element tuple containing the maps of the training, validation, and test sets respectively. These maps associate dataset names to a list of sample indices.

thelper.data.loaders.default_collate(batch, force_tensor=True)[source]

Puts each data field into a tensor with outer dimension batch size.

This function is copied from PyTorch’s torch.utils.data._utils.collate.default_collate, but additionally supports custom objects from the framework (such as bounding boxes). These will not be converted to tensors, and it will be up to the trainer to handle them accordingly.

See torch.utils.data.DataLoader for more information.
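
As a minimal sketch, the function can be passed directly as the collate_fn of a vanilla PyTorch loader; the tiny stand-in dataset below is made up for illustration.

import torch.utils.data
import thelper.data.loaders

# stand-in map-style dataset producing sample dictionaries (illustration only)
samples = [{"input": [float(idx)], "label": idx % 2} for idx in range(10)]
loader = torch.utils.data.DataLoader(samples, batch_size=4,
                                     collate_fn=thelper.data.loaders.default_collate)
for batch in loader:
    print(batch)  # each sample field is collated across the minibatch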

thelper.data.parsers module

Dataset parsers module.

This module contains dataset parser interfaces and base classes that define basic i/o operations so that the framework can automatically interact with training data.
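
As a rough sketch only (the base-class constructor signature, attribute names, and task import path below are assumptions to verify against thelper.data.parsers.Dataset and thelper.tasks), a custom parser typically exposes its samples as dictionaries through __getitem__ and advertises a task object describing the relevant keys:

# rough sketch of a custom parser; the base-class arguments, attribute names and the
# Classification task import path are assumptions to validate against the framework
import thelper.data.parsers
import thelper.tasks

class ToyParser(thelper.data.parsers.Dataset):
    """Toy parser exposing 100 samples as dictionaries (illustration only)."""

    def __init__(self, transforms=None):
        super().__init__(transforms=transforms)  # assumed constructor signature
        self.samples = [{"input": [float(idx)], "label": idx % 2} for idx in range(100)]
        # the task tells the framework which sample keys map to model inputs/groundtruth
        self.task = thelper.tasks.Classification(class_names=["even", "odd"],
                                                 input_key="input", label_key="label")

    def __getitem__(self, idx):
        return self.samples[idx]  # transform application omitted for brevity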

thelper.data.pascalvoc module

PASCAL VOC dataset parser module.

This module contains a dataset parser used to load the PASCAL Visual Object Classes (VOC) dataset for semantic segmentation or object detection. See http://host.robots.ox.ac.uk/pascal/VOC/ for more info.
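
Such a parser would typically be requested through the 'datasets' field described in thelper.data.utils.create_loaders(); the type name and parameters below are placeholders to check against the module's contents:

"datasets": {
    "pascalvoc": {
        # placeholder type/params; see thelper.data.pascalvoc for the actual class and arguments
        "type": "thelper.data.pascalvoc.PASCALVOC",
        "params": {
            # e.g. dataset root path, target task (segmentation vs detection), subset, ...
        }
    }
}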

thelper.data.samplers module

Samplers module.

This module contains classes used for raw dataset rebalancing or augmentation.

All samplers here should aim to be compatible with PyTorch’s sampling interface (torch.utils.data.sampler.Sampler) so that they can be instantiated at runtime through a configuration file and used as the input of a data loader.
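
As an illustration, a sampler is typically requested through the 'sampler' field of the 'loaders' configuration described in thelper.data.utils.create_loaders(); the exact constructor parameters depend on the sampler class:

"loaders": {
    # ...
    "sampler": {
        # type of sampler to instantiate at runtime
        "type": "thelper.data.samplers.WeightedSubsetRandomSampler",
        "params": {
            # sampler constructor parameters, if any (see the class documentation)
        }
    },
    # ...
}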

thelper.data.utils module

Dataset utility functions and tools.

This module contains utility functions and tools used to instantiate data loaders and parsers.

thelper.data.utils.create_hdf5(archive_path, task, train_loader, valid_loader, test_loader, compression=None, config_backup=None)[source]

Saves the samples loaded from train/valid/test data loaders into an HDF5 archive.

The loaded minibatches are decomposed into individual samples. The keys provided via the task interface are used to fetch elements (input, groundtruth, …) from the samples, and save them in the archive. The archive will contain three groups (train, valid, and test), and each group will contain a dataset for each element originally found in the samples.

Note that the compression operates at the sample level, not at the dataset level. This means that elements of each sample will be compressed individually, not as an array. Therefore, if you are trying to compress highly correlated samples (e.g. frames in a video sequence), this approach will yield poor compression ratios.

Parameters
  • archive_path – path pointing where the HDF5 archive should be created.

  • task – task object that defines the input, groundtruth, and meta keys tied to elements that should be parsed from loaded samples and saved in the HDF5 archive.

  • train_loader – training data loader (can be None).

  • valid_loader – validation data loader (can be None).

  • test_loader – testing data loader (can be None).

  • compression – the compression configuration dictionary that will be parsed to determine how sample elements should be compressed. If a mapping is missing, that element will not be compressed.

  • config_backup – optional session configuration file that should be saved in the HDF5 archive.

Example compression configuration:

# the config is given as a dictionary
{
    # each field is a key that corresponds to an element in each sample
    "key1": {
        # the 'type' identifies the compression approach to use
        # (see thelper.utils.encode_data for more information)
        "type": "jpg",
        # extra parameters might be needed to encode the data
        # (see thelper.utils.encode_data for more information)
        "encode_params": {}
        # these parameters are packed and kept for decoding
        # (see thelper.utils.decode_data for more information)
        "decode_params": {"flags": "cv.IMREAD_COLOR"}
    },
    "key2": {
        # this explicitly means that no encoding should be performed
        "type": "none"
    },
    ...
    # if a key is missing, its elements will not be compressed
}
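
A minimal call sketch follows; it assumes the task and loaders were obtained beforehand via thelper.data.utils.create_loaders(), and the archive path is hypothetical:

# minimal sketch; 'config' is a full session configuration dictionary (see create_loaders below)
import thelper.data.utils
task, train_loader, valid_loader, test_loader = thelper.data.utils.create_loaders(config)
thelper.data.utils.create_hdf5("data/packed_dataset.hdf5", task,  # hypothetical output path
                               train_loader, valid_loader, test_loader,
                               compression=None,  # or a compression config as shown above
                               config_backup=config)
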
thelper.data.utils.create_loaders(config, save_dir=None)[source]

Prepares the task and data loaders for a model trainer based on a provided data configuration.

This function will parse a configuration dictionary and extract all the information required to instantiate the requested dataset parsers. Then, combining the task metadata of all these parsers, it will split the available samples into three sets (training, validation, test) to be handled by different data loaders. These will finally be returned along with the (global) task object.

The configuration dictionary is expected to contain two fields: loaders, which specifies all parameters required for establishing the dataset split, shuffling seeds, and batch size (these are listed and detailed below); and datasets, which lists the dataset parser interfaces to instantiate as well as their parameters. For more information on the datasets field, refer to thelper.data.utils.create_parsers().

The parameters expected in the ‘loaders’ configuration field are the following:

  • <train_/valid_/test_>batch_size (mandatory): specifies the (mini)batch size to use in data loaders. If you get an ‘out of memory’ error at runtime, try reducing it.

  • <train_/valid_/test_>collate_fn (optional): specifies the collate function to use in data loaders. The default one is typically fine, but some datasets might require a custom function.

  • shuffle (optional, default=True): specifies whether the data loaders should shuffle their samples or not.

  • test_seed (optional): specifies the RNG seed to use when splitting test data. If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.

  • valid_seed (optional): specifies the RNG seed to use when splitting validation data. If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.

  • torch_seed (optional): specifies the RNG seed to use for torch-related stochastic operations (e.g. for data augmentation). If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.

  • numpy_seed (optional): specifies the RNG seed to use for numpy-related stochastic operations (e.g. for data augmentation). If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.

  • random_seed (optional): specifies the RNG seed to use for stochastic operations with python’s ‘random’ package. If no seed is specified, the RNG will be initialized with a device-specific or time-related seed.

  • workers (optional, default=1): specifies the number of worker processes used to preload batches in parallel; can be 0 (loading happens on the main process) or an integer >= 1.

  • pin_memory (optional, default=False): specifies whether the data loaders will copy tensors into CUDA-pinned memory before returning them.

  • drop_last (optional, default=False): specifies whether to drop the last incomplete batch or not if the dataset size is not a multiple of the batch size.

  • sampler (optional): specifies a type of sampler and its constructor parameters to be used in the data loaders. This can be used for example to help rebalance a dataset based on its class distribution. See thelper.data.samplers for more information.

  • augments (optional): provides a list of transformation operations used to augment all samples of a dataset. See thelper.transforms.utils.load_augments() for more info.

  • train_augments (optional): provides a list of transformation operations used to augment the training samples of a dataset. See thelper.transforms.utils.load_augments() for more info.

  • valid_augments (optional): provides a list of transformation operations used to augment the validation samples of a dataset. See thelper.transforms.utils.load_augments() for more info.

  • test_augments (optional): provides a list of transformation operations used to augment the test samples of a dataset. See thelper.transforms.utils.load_augments() for more info.

  • eval_augments (optional): provides a list of transformation operations used to augment the validation and test samples of a dataset. See thelper.transforms.utils.load_augments() for more info.

  • base_transforms (optional): provides a list of transformation operations to apply to all loaded samples. This list will be passed to the constructor of all instantiated dataset parsers. See thelper.transforms.utils.load_transforms() for more info.

  • train_split (optional): provides the proportion of samples of each dataset to hand off to the training data loader. These proportions are given in a dictionary format (name: ratio).

  • valid_split (optional): provides the proportion of samples of each dataset to hand off to the validation data loader. These proportions are given in a dictionary format (name: ratio).

  • test_split (optional): provides the proportion of samples of each dataset to hand off to the test data loader. These proportions are given in a dictionary format (name: ratio).

  • skip_verif (optional, default=True): specifies whether to skip verification of the dataset split when resuming a session; verification re-parses the log files generated earlier to confirm the split is unchanged.

  • skip_split_norm (optional, default=False): specifies whether the interactive prompt asking to normalize the split ratios should be skipped or not.

  • skip_class_balancing (optional, default=False): specifies whether the balancing of class labels should be skipped in case the task is classification-related.

Example configuration file:

# ...
"loaders": {
    "batch_size": 128,  # batch size to use in data loaders
    "shuffle": true,  # specifies that the data should be shuffled
    "workers": 4,  # number of threads to pre-fetch data batches with
    "train_sampler": {  # we can use a data sampler to rebalance classes (optional)
        # see e.g. 'thelper.data.samplers.WeightedSubsetRandomSampler'
        # ...
    },
    "train_augments": { # training data augmentation operations
        # see 'thelper.transforms.utils.load_augments'
        # ...
    },
    "eval_augments": { # evaluation (valid/test) data augmentation operations
        # see 'thelper.transforms.utils.load_augments'
        # ...
    },
    "base_transforms": { # global sample transformation operations
        # see 'thelper.transforms.utils.load_transforms'
        # ...
    },
    # optionally indicate how to resolve dataset loader task vs model task incompatibility if any
    # leave blank to get more details about each case during runtime if this situation happens
    "task_compat_mode": "old|new|compat",
    # finally, we define a 80%-10%-10% split for our data
    # (we could instead use one dataset for training and one for testing)
    "train_split": {
        "dataset_A": 0.8
        "dataset_B": 0.8
    },
    "valid_split": {
        "dataset_A": 0.1
        "dataset_B": 0.1
    },
    "test_split": {
        "dataset_A": 0.1
        "dataset_B": 0.1
    }
    # (note that the dataset names above are defined in the field below)
},
"datasets": {
    "dataset_A": {
        # type of dataset interface to instantiate
        "type": "...",
        "params": {
            # ...
        }
    },
    "dataset_B": {
        # type of dataset interface to instantiate
        "type": "...",
        "params": {
            # ...
        },
        # if it does not derive from 'thelper.data.parsers.Dataset', a task is needed:
        "task": {
            # this type must derive from 'thelper.tasks.Task'
            "type": "...",
            "params": {
                # ...
            }
        }
    },
    # ...
},
# ...
Parameters
  • config – a dictionary that provides all required data configuration information under two fields, namely ‘datasets’ and ‘loaders’.

  • save_dir – the path to the root directory where the session directory should be saved. Note that this is not the path to the session directory itself, but its parent, which may also contain other session directories.

Returns

A 4-element tuple that contains: 1) the global task object to specialize models and trainers with; 2) the training data loader; 3) the validation data loader; and 4) the test data loader.
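
For reference, a minimal usage sketch (the configuration file name and session directory are hypothetical):

import json
import thelper.data.utils
# 'config.json' is a hypothetical session configuration containing 'datasets' and 'loaders' fields
with open("config.json") as fd:
    config = json.load(fd)
task, train_loader, valid_loader, test_loader = thelper.data.utils.create_loaders(config, save_dir="sessions")
for batch in train_loader:
    pass  # each 'batch' is a minibatch dictionary ready to be consumed by a trainer/model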

thelper.data.utils.create_parsers(config, base_transforms=None)[source]

Instantiates dataset parsers based on a provided dictionary.

This function will instantiate dataset parsers as defined in a name-type-param dictionary. If multiple datasets are instantiated, this function will also verify their task compatibility and return the global task. The dataset interfaces themselves should be derived from thelper.data.parsers.Dataset, be compatible with thelper.data.parsers.ExternalDataset, or should provide a ‘task’ field specifying all the information related to sample dictionary keys and model i/o.

The provided configuration will be parsed for a ‘datasets’ dictionary entry. The keys in this dictionary are treated as unique dataset names and are used for lookups. The value associated to each key (or dataset name) should be a type-params dictionary that can be parsed to instantiate the dataset interface.

An example configuration dictionary is given in thelper.data.utils.create_loaders().

Parameters
  • config – a dictionary that provides unique dataset names and parameters needed for instantiation under the ‘datasets’ field.

  • base_transforms – the transform operation that should be applied to all loaded samples, and that will be provided to the constructor of all instantiated dataset parsers.

Returns

A 2-element tuple that contains: 1) the list of dataset interfaces/parsers that were instantiated; and 2) a task object compatible with all of those (see thelper.tasks.utils.Task for more information).
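
A minimal usage sketch, assuming 'config' is a dictionary with a 'datasets' field as described above:

import thelper.data.utils
# 'config' must contain a 'datasets' field; see the example in create_loaders() above
datasets, task = thelper.data.utils.create_parsers(config)
print(task)  # global task object compatible with all instantiated parsers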

See also

thelper.data.parsers.Dataset
thelper.data.parsers.ExternalDataset

thelper.data.utils.get_class_weights(label_map, stype='linear', maxw=inf, minw=0.0, norm=True, invmax=False)[source]

Returns a map of label weights that may be adjusted based on a given rebalancing strategy.

Parameters
  • label_map – map of index lists or sample counts tied to class labels.

  • stype – weighting strategy (‘uniform’, ‘linear’, or ‘rootX’). Using ‘uniform’ will provide a uniform map of weights. Using ‘linear’ will return the actual weights, unmodified. Using ‘rootX’ will rebalance the weights according to factor ‘X’. See thelper.data.samplers.WeightedSubsetRandomSampler for more information on these strategies.

  • maxw – maximum allowed weight value (applied after invmax, if required).

  • minw – minimum allowed weight value (applied after invmax, if required).

  • norm – specifies whether the returned weights should be normalized (default=True, i.e. normalized).

  • invmax – specifies whether to max-invert the weight vector (thus creating cost factors) or not. Not compatible with norm (it would return weights again instead of factors).

Returns

Map of weights tied to class labels.
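
A small illustrative call (the label names and counts are made up):

import thelper.data.utils
label_map = {"cat": 600, "dog": 300, "bird": 100}  # made-up per-class sample counts
weights = thelper.data.utils.get_class_weights(label_map, stype="linear", norm=True)
# per the strategy descriptions above, 'linear' with normalization should yield weights
# proportional to the counts (roughly {"cat": 0.6, "dog": 0.3, "bird": 0.1})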

See also

thelper.data.samplers.WeightedSubsetRandomSampler