Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
815 views
in Technique[技术] by (71.8m points)

python - Creating a TimeseriesGenerator with multiple inputs

I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks, due to memory limits I cannot hold everything in memory after converting to sequences for the model.

This leads me to using a generator instead like the TimeseriesGenerator from Keras / Tensorflow. Problem is that if I try using the generator on all of my data stacked it would create sequences of mixed stocks, see the example below with a sequence of 5, here Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2"

Example of problem of creating sequences from independent stacked data

Instead what I would want is similar to this:

enter image description here

Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one

I explored the option of combining the generators like this SO suggests: How do I combine two keras generator functions, however this is not idea in the case of ~4000 generators.

I hope my question makes sense.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

So what I've ended up doing is to do all the preprocessing manually and save an .npy file for each stock containing the preprocessed sequences, then using a manually created generator I make batches like this:

class seq_generator():

  def __init__(self, list_of_filepaths):
    self.usedDict = dict()
    for path in list_of_filepaths:
      self.usedDict[path] = []

  def generate(self):
    while True: 
      path = np.random.choice(list(self.usedDict.keys()))
      stock_array = np.load(path) 
      random_sequence = np.random.randint(stock_array.shape[0])
      if random_sequence not in self.usedDict[path]:
        self.usedDict[path].append(random_sequence)
        yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

train_dataset = tf.data.Dataset.from_generator(seq_generator.generate(),
                                               output_types=(tf.float32, tf.float32), 
                                               output_shapes=(n_timesteps, n_features)) 

train_dataset = train_dataset.batch(batch_size)

Where list_of_filepaths is simply a list of paths to preprocessed .npy data.


This will:

  • Load a random stock's preprocessed .npy data
  • Pick a sequence at random
  • Check if the index of the sequence has already been used in usedDict
  • If not:
    • Append the index of that sequence to usedDict to keep track as to not feed the same data twice to the model
    • Yield the sequence

This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods from Tensorflows Dataset type.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...