Training an LSTM on a text corpus: how to preprocess the text data


26/12/2019 1:43 am  


I am trying to train an LSTM on a given training text, with the goal of generating new text from a seed sequence of words.

I'm not sure how to preprocess the text data (it is simply one big text file of speeches) and would greatly appreciate some advice. My current approach is to tokenize the text and build numerical sequences by assigning a unique integer to every token (word). Before that I do some data cleanup, such as removing certain characters and newlines and converting everything to lower case. The result is a dataset of integer indices corresponding to the words, which I can feed as input to the LSTM network.
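To make the steps above concrete, here is a minimal sketch of that pipeline in plain Python. The sample sentence, the choice of which characters to strip, the helper names (`clean`, `build_vocab`, `make_windows`) and the window length of 3 are all assumptions for illustration, not part of my actual setup.

```python
import re

def clean(text):
    # Lowercase, turn newlines into spaces, and replace everything except
    # letters, digits, apostrophes and spaces (an assumed cleanup rule).
    text = text.lower().replace("\n", " ")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r" +", " ", text).strip()

def build_vocab(tokens):
    # Give each distinct token an integer index in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def make_windows(ids, seq_len):
    # Slide a window over the indices: seq_len inputs, the next index is the target.
    return [(ids[i:i + seq_len], ids[i + seq_len])
            for i in range(len(ids) - seq_len)]

# Hypothetical two-line sample standing in for the speeches file.
sample = "We shall fight.\nWe shall never surrender!"
tokens = clean(sample).split()
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
pairs = make_windows(ids, seq_len=3)
```

Each entry of `pairs` is an (input sequence, next-word index) pair, which is the shape of data a next-word LSTM would typically be trained on. In a real setup the indices would usually go through an embedding layer rather than being fed in as raw numbers.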

Should I tokenize at the character level instead, or are words good enough? (For now it doesn't matter if the output is grammatically incorrect.)
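For comparison, a small sketch of what each choice does to the same text, using a made-up sample string rather than my real data. Word-level tokens give shorter sequences over a larger vocabulary; character-level tokens give longer sequences over a much smaller one.

```python
# Hypothetical cleaned sample; the real corpus would be the speeches file.
text = "we shall fight we shall never surrender"

# Word-level: split on whitespace.
word_tokens = text.split()
word_vocab = sorted(set(word_tokens))

# Character-level: every character (including the space) is a token.
char_tokens = list(text)
char_vocab = sorted(set(char_tokens))

print("words:", len(word_tokens), "tokens,", len(word_vocab), "distinct")
print("chars:", len(char_tokens), "tokens,", len(char_vocab), "distinct")
```

On a toy string the difference is small, but on a full corpus the word vocabulary usually grows into the thousands while the character vocabulary stays at a few dozen symbols.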

Is this even a reasonable approach for training an LSTM?


