Be able to extract random sentences from an ebook (.epub) of choice, remove any punctuation and capitalisation, and display the result to the user.
- A Python IDE I like isn’t installed on my Pi and I don’t really understand Linux.
- Once in Python I don’t really know where to start.
First, I installed Jupyter using this guide. This took more restarts than I realised it would, and I lost the first draft of this post.
Then I got an ebook of Treasure Island, along with a handful of other books for testing, from Project Gutenberg. Easy bits done!
After much messing around, I found this post on Medium, which gave me the functions I needed to extract the text from the book. With some keyboard mashing I got this to work, so it imports my book and creates a list of all of the sentences. I then added some functions to remove line breaks from the sentences, choose 15 sentences at random, remove the punctuation, and then create a pandas dataframe of the questions and answers.
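The processing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the Medium post or from my repository: it assumes the book text has already been extracted into a single string, and the function names (`split_sentences`, `strip_punctuation`, `make_challenges`) and the naive sentence-splitting rule are made up for the example.

```python
import random
import re
import string

# Curly quotes and ellipsis, written as escapes; ebooks often use these.
EXTRA_PUNCTUATION = "\u201c\u201d\u2018\u2019\u2026"


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    flat = re.sub(r"\s+", " ", text).strip()  # collapses line breaks into spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def strip_punctuation(sentence):
    """Lower-case a sentence and remove its punctuation."""
    # Turn em/en dashes into spaces so the words either side stay separate.
    spaced = re.sub(r"[\u2014\u2013]+", " ", sentence)
    table = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)
    cleaned = spaced.translate(table).lower()
    return re.sub(r"\s+", " ", cleaned).strip()


def make_challenges(text, n=15, seed=None):
    """Pick up to n random sentences; return question/answer rows."""
    rng = random.Random(seed)
    sentences = split_sentences(text)
    chosen = rng.sample(sentences, min(n, len(sentences)))
    return [{"question": strip_punctuation(s), "answer": s} for s in chosen]
```

The list of dictionaries returned by `make_challenges` can be passed straight to `pandas.DataFrame(rows)` to get a dataframe with `question` and `answer` columns.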
To get the data into a PDF I relied heavily on this post to use the packages Jinja and WeasyPrint. I had to set up an HTML template, and for the styling I used the same CSS file as in the post as I don’t really understand CSS.
Once that was done, I could run the code over my book and create my worksheets.
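As a rough idea of how Jinja and WeasyPrint fit together, here is a small sketch. It is not the template or code from the post I followed: the template markup, the `render_worksheet` function and its parameters are hypothetical, and it assumes each row is a dictionary with a `question` key and that both packages are installed.

```python
# Hypothetical worksheet template: one numbered item per question.
TEMPLATE = """<html>
<body>
  <h1>{{ title }}</h1>
  <ol>
  {% for row in rows %}
    <li>{{ row.question }}</li>
  {% endfor %}
  </ol>
</body>
</html>"""


def render_worksheet(rows, title, pdf_path, css_path=None):
    """Fill the template with Jinja, then write a PDF with WeasyPrint."""
    # Imported here so the template can be inspected without either package.
    from jinja2 import Template
    from weasyprint import CSS, HTML

    html = Template(TEMPLATE).render(title=title, rows=rows)
    document = HTML(string=html)
    if css_path:
        # Apply an external stylesheet, such as a CSS file borrowed from a tutorial.
        document.write_pdf(pdf_path, stylesheets=[CSS(filename=css_path)])
    else:
        document.write_pdf(pdf_path)
    return html
```

Calling something like `render_worksheet(rows, "Treasure Island", "worksheet.pdf", "style.css")` would then produce one worksheet, where the file names are placeholders.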
I’ve tested it on a number of books and it seems to work pretty well. There are some issues where a character speaks more than one sentence, which leaves unclosed quotation marks in the answers, but I’m happy with the result. Another issue is that it sometimes includes copyright notices in the questions; to get around that I just run the code again to generate a new set.
- had there been a breath of wind we should have fallen on the six mutineers who were left aboard with us slipped our cable and away to sea
- and who may you be and then as he saw the squires letter he seemed to me to give something almost like a start
The code to make these challenges, including the CSS and HTML templates, is on my GitHub page here.