Update proposal for S5TFDataLoader #14

rickwierenga · 2020-01-26T11:37:48Z

As @WilliamHYZhang and I discussed we need shuffling for data loaders. I would like to make a proposal to facilitate this.

We need to store an array of indices on a data loader. The type of Element in this array will be set by a type alias by the implementor of S5TFDataLoader which itself will need an associatedtype Index. An implementation of S5TFDataLoader needs to be able to load an element at this Index so we add the following line to the protocol:

func getElement(at index: Index) -> Tensor<DataType>

where DataType is the associated data type discussed in s5tf-team/datasets#14 (comment).

indices would be an array of row numbers in CSVDataLoader, an array of IDs in MNIST and comparable datasets and an array of paths in Imagenette for example.

We should add shuffled() in a protocol extension:

func shuffled() -> Self {
    var mutableSelf = self.copy()
    mutableSelf.indices = self.indices.shuffled()
    return mutableSelf
}

This uses copy() which should be another requirement of S5TFDataLoader. This will allow for more default implementations like the following:

func batched(_ batchSize: Int) -> Self {
    guard batchSize >= 1 else {
        fatalError("Batch size must be greater than or equal to 1")
    }

    guard batchSize <= count else {
        fatalError("Batch size equal to or smaller than the number of items.")
    }

    var mutableSelf = self.copy()
    mutableSelf. batchSize = batchSize
    return mutableSelf
}

getElement(at:) should be used by next. If we write a function createBatch(from indices): we can write next in advance too. Implementing a data loader would just involve writing a few basic functions because everything else will be handled by S5TFDataLoader. The implementor should not have to duplicate the next() functionality for every dataset because it will always be roughly the same code.

indices also allows for

var count: Int {
    return indices.count
}

This will save a lot of work in future implementations of datasets.

The text was updated successfully, but these errors were encountered:

rickwierenga · 2020-01-29T23:49:23Z

The ultimate goal here is to have just two requirements on S5TFDataLoader:

public protocol S5TFDataLoader: Sequence, IteratorProtocol where Element: S5TFBatch {
    func load() -> [Index]
    func createBatch(from indices: [Index]) -> Element
}

I have been experimenting with this on a local branch. We might need to make the current implementation of S5TFDataLoader a struct, rename it (I have been using S5TFDataIterator) and make its sole responsibility making data in data loaders accessible.

rickwierenga mentioned this issue Jan 26, 2020

Data augmentation proposal #15

Open

rickwierenga added the proposal Proposal for a new feature label Jan 26, 2020

rickwierenga mentioned this issue Jan 30, 2020

Data loader #28

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update proposal for S5TFDataLoader #14

Update proposal for S5TFDataLoader #14

rickwierenga commented Jan 26, 2020 •

edited

Loading

rickwierenga commented Jan 29, 2020

Update proposal for S5TFDataLoader #14

Update proposal for S5TFDataLoader #14

Comments

rickwierenga commented Jan 26, 2020 • edited Loading

rickwierenga commented Jan 29, 2020

rickwierenga commented Jan 26, 2020 •

edited

Loading