Ngram Remix – Alice, Huck Finn, and the Kama Sutra

For the assignment I took the NgramGenerateText script and modified it to accept any number of source text files. My intention was to generate ngram text from several texts combined, and to be able to mix and match the source texts quickly and easily to see what interesting results might arise. My original idea was to generate a text from the Bible and the Koran combined.
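To make the idea concrete, here is a minimal sketch of how several source texts might be merged into one ngram model. It is not the NgramGenerateText script itself: the class `MultiSourceNgrams` and its methods are hypothetical names I made up for illustration, and I am assuming simple whitespace tokenization. Each source is added separately so that no ngram straddles the boundary between two texts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch; class and method names are not from the original script.
class MultiSourceNgrams {
    private final int n;
    // Maps each (n-1)-word prefix to every word that followed it in any source.
    private final Map<String, List<String>> followers = new HashMap<>();
    // Prefixes in insertion order, so a seeded Random gives repeatable output.
    private final List<String> prefixes = new ArrayList<>();

    MultiSourceNgrams(int n) {
        if (n < 2) throw new IllegalArgumentException("n must be at least 2");
        this.n = n;
    }

    // Adds one source text, split on whitespace (an assumed tokenization).
    void addSource(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return;
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            String prefix = String.join(" ", Arrays.copyOfRange(tokens, i, i + n - 1));
            List<String> list = followers.get(prefix);
            if (list == null) {
                list = new ArrayList<>();
                followers.put(prefix, list);
                prefixes.add(prefix);
            }
            list.add(tokens[i + n - 1]);
        }
    }

    // Returns the words seen after the given prefix, or an empty list.
    List<String> followersOf(String prefix) {
        List<String> list = followers.get(prefix);
        return list == null ? Collections.<String>emptyList()
                            : Collections.unmodifiableList(list);
    }

    // Generates exactly wordCount words, restarting from a random prefix
    // whenever the current prefix has no known follower.
    String generate(int wordCount, Random rng) {
        if (wordCount <= 0 || prefixes.isEmpty()) return "";
        List<String> out = new ArrayList<>();
        while (out.size() < wordCount) {
            String prefix = prefixes.get(rng.nextInt(prefixes.size()));
            out.addAll(Arrays.asList(prefix.split(" ")));
            List<String> next = followers.get(prefix);
            while (next != null && out.size() < wordCount) {
                out.add(next.get(rng.nextInt(next.size())));
                prefix = String.join(" ", out.subList(out.size() - (n - 1), out.size()));
                next = followers.get(prefix);
            }
        }
        return String.join(" ", out.subList(0, wordCount));
    }

    public static void main(String[] args) {
        MultiSourceNgrams model = new MultiSourceNgrams(3);
        model.addSource("the cat sat on the mat");
        model.addSource("the cat ran to the door");
        System.out.println(model.generate(8, new Random(42)));
    }
}
```

Because a prefix such as "the cat" collects followers from every source that contains it, the generator can hop from one text into another wherever they share a phrase, which is where the interesting mixes come from.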

I soon realized that I was probably getting unbalanced results, since the text of the Bible is so much longer than that of the Koran. I was also getting errors from Java about running out of memory, which I assumed had to do with the length of the source texts. To work around both issues, I manually cut the source text files down to a more manageable size. A more robust approach, which I would like to implement, would be to dynamically adjust the bias based on the lengths of the input texts, so that they are equally represented in the output.
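I have not implemented the length-based bias, but one way it might work is to give every token a weight of 1 divided by the token count of its source, so that each source carries the same total weight no matter how long it is. The sketch below is only an assumption about how that could look; `SourceBalancer`, `perTokenWeights`, and `weightedPick` are hypothetical names, and the counts in `main` are made-up numbers rather than the real lengths of any text.

```java
import java.util.Random;

// Hypothetical sketch of length-based balancing; not part of the original script.
class SourceBalancer {
    // Returns one weight per source: 1 / tokenCount, so that the weights of
    // all tokens in a source add up to 1 regardless of the source's length.
    static double[] perTokenWeights(int[] tokenCounts) {
        double[] weights = new double[tokenCounts.length];
        for (int i = 0; i < tokenCounts.length; i++) {
            if (tokenCounts[i] <= 0) {
                throw new IllegalArgumentException("token counts must be positive");
            }
            weights[i] = 1.0 / tokenCounts[i];
        }
        return weights;
    }

    // Picks an index with probability proportional to its weight.
    static int weightedPick(double[] weights, Random rng) {
        double total = 0;
        for (double w : weights) total += w;
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        // Made-up token counts for a long text and a short one.
        double[] w = perTokenWeights(new int[] {8000, 1500});
        System.out.println(w[0] * 8000 + " " + w[1] * 1500);
    }
}
```

In a generator, each candidate next word would carry the weight of the source it came from, and `weightedPick` would replace the plain random choice. This would avoid trimming the files by hand, though it would not help with the memory errors.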

The results proved to be uninteresting, I think because the texts are somewhat similar in nature to begin with. I went on to add a stop list to the tokenizer and used a for loop to generate stop tokens for the numbers 1-100, as I was getting a lot of verse numbers and such from these texts. I also omitted some punctuation that I just found annoying.
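The stop list described above can be sketched roughly as follows. This is a simplified stand-in rather than the code from my modified script: `StopListFilter` is a hypothetical name, and the punctuation tokens in it are just example choices, not necessarily the ones I actually dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a stop list applied after tokenizing.
class StopListFilter {
    // Builds a stop list containing "1" through "100" plus some punctuation.
    static Set<String> buildStopList() {
        Set<String> stop = new HashSet<>();
        for (int i = 1; i <= 100; i++) {
            stop.add(Integer.toString(i)); // verse numbers and similar
        }
        // Example punctuation only; the real list is a matter of taste.
        stop.addAll(Arrays.asList(":", ";", "(", ")"));
        return stop;
    }

    // Returns the tokens that are not on the stop list, in their original order.
    static List<String> filter(List<String> tokens, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stop.contains(token)) kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("1", "In", "the", "beginning", ";", "2", "And");
        System.out.println(filter(tokens, buildStopList()));
    }
}
```

Note that this only removes tokens that match exactly, so a number glued to other characters (for example a chapter-and-verse reference written as one token) would slip through unless the tokenizer splits it first.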

Here is a sample result using 3-grams and 100 words. I edited these results slightly to improve the flow:

Wounding, and of all the people of the same day the LORD commanded him and the people of Noah, and thy sons' garments, and the people go, and all the congregation and the moon as a lie against God. He said, I pray thee, and purple, and begat a son. And on the earth and the plague be spread in the morning prostrate in their Lord, " We believe in it. And we gave him back to thy estimation, even the memorial thereof, and they shall not be.

Next, I thought I would try mixing texts that differ more from each other. Here is the result of combining Alice in Wonderland and Edgar Allan Poe’s poetry (I edited it a bit to flow better):

Every syllable! It's the queerest thing about it, if I chose. The Dormouse sulkily remarked "If you didn t mean it! -- coward! -- tis well -- eh , stupid?" But worse still that death no immortality -- but, Angelo, than waste it in a great hurry. An enormous puppy was looking about me (as all men and with a sensual delight immeasurable.) I saw thee on thy bridal day -- oppress my mind with double loveliness. We must be the steady pressing down of the poem.

And finally, The Kama Sutra, Alice in Wonderland and Huckleberry Finn, together at last:

If we don't know how to make a saw out of the man should get hold of her husband, and in the coffin, and I was weakening, I was getting towards the sky. Twinkle, twinkle -- "Hold on, you know," said Alice. Reeling and writhing, of course. Anybody would. "I reckon it wouldn't have no trouble bout something, my boy ?" he says "Looky here, and she couldn't see no way." "No, sah -- nuffn else"

There were still a few annoying things about the formatting of the output that I could not figure out how to resolve in the code. Mainly this had to do with the way the script formatted punctuation: it always added a space between EVERY pair of tokens, even when the token was a period, exclamation point, quote, comma, etc. I ended up modifying the script to remove those extra spaces, but the formatting of quotes is still off.
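For what it's worth, the spacing fix can be sketched like this. It is a simplified illustration, not my actual change to the script: `TokenJoiner` is a hypothetical name, and it assumes punctuation marks arrive as separate one-character tokens. It deliberately leaves quotes alone, since deciding whether a quote opens or closes needs more context than a single token gives.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: join tokens with spaces, except before closing punctuation.
class TokenJoiner {
    // Punctuation that should attach to the previous token without a space.
    private static final String NO_SPACE_BEFORE = ".,!?;:";

    static String join(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            boolean attaches = token.length() == 1
                    && NO_SPACE_BEFORE.indexOf(token.charAt(0)) >= 0;
            // Add a separating space unless this is the first token
            // or a punctuation mark that attaches to the previous one.
            if (sb.length() > 0 && !attaches) sb.append(' ');
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "Hold", "on", ",", "you", "know", ",", "said", "Alice", ".");
        System.out.println(join(tokens)); // prints: Hold on, you know, said Alice.
    }
}
```

Handling quotes properly would probably mean tracking whether the generator is currently inside a quotation, which this sketch does not attempt.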
