Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update proposal for S5TFDataLoader #14

Open
rickwierenga opened this issue Jan 26, 2020 · 1 comment
Open

Update proposal for S5TFDataLoader #14

rickwierenga opened this issue Jan 26, 2020 · 1 comment
Labels
proposal Proposal for a new feature

Comments

@rickwierenga
Copy link
Contributor

rickwierenga commented Jan 26, 2020

As @WilliamHYZhang and I discussed we need shuffling for data loaders. I would like to make a proposal to facilitate this.

We need to store an array of indices on a data loader. The type of Element in this array will be set by a type alias by the implementor of S5TFDataLoader which itself will need an associatedtype Index. An implementation of S5TFDataLoader needs to be able to load an element at this Index so we add the following line to the protocol:

func getElement(at index: Index) -> Tensor<DataType>

where DataType is the associated data type discussed in s5tf-team/datasets#14 (comment).

indices would be an array of row numbers in CSVDataLoader, an array of IDs in MNIST and comparable datasets and an array of paths in Imagenette for example.

We should add shuffled() in a protocol extension:

func shuffled() -> Self {
    var mutableSelf = self.copy()
    mutableSelf.indices = self.indices.shuffled()
    return mutableSelf
}

This uses copy() which should be another requirement of S5TFDataLoader. This will allow for more default implementations like the following:

func batched(_ batchSize: Int) -> Self {
    guard batchSize >= 1 else {
        fatalError("Batch size must be greater than or equal to 1")
    }

    guard batchSize <= count else {
        fatalError("Batch size equal to or smaller than the number of items.")
    }

    var mutableSelf = self.copy()
    mutableSelf. batchSize = batchSize
    return mutableSelf
}

getElement(at:) should be used by next. If we write a function createBatch(from indices): we can write next in advance too. Implementing a data loader would just involve writing a few basic functions because everything else will be handled by S5TFDataLoader. The implementor should not have to duplicate the next() functionality for every dataset because it will always be roughly the same code.

indices also allows for

var count: Int {
    return indices.count
}

This will save a lot of work in future implementations of datasets.

@rickwierenga rickwierenga added the proposal Proposal for a new feature label Jan 26, 2020
@rickwierenga
Copy link
Contributor Author

The ultimate goal here is to have just two requirements on S5TFDataLoader:

public protocol S5TFDataLoader: Sequence, IteratorProtocol where Element: S5TFBatch {
    func load() -> [Index]
    func createBatch(from indices: [Index]) -> Element
}

I have been experimenting with this on a local branch. We might need to make the current implementation of S5TFDataLoader a struct, rename it (I have been using S5TFDataIterator) and make its sole responsibility making data in data loaders accessible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Proposal for a new feature
Projects
None yet
Development

No branches or pull requests

1 participant