Assignment 3 – Stop Tokenizer

This is the stop list I created for my stop tokenizer:

  • -
  • .
  • a
  • an
  • and
  • are
  • be
  • for
  • from
  • if
  • in
  • is
  • of
  • that
  • the
  • this
  • was
  • will
  • with

I tested the tokenizer on text that I pulled from news articles I found online. Basically, what I did was add punctuation marks and common function words (articles, prepositions, conjunctions, and a few auxiliary verbs) to the stop list.

Here is my code:

import java.util.Set;

import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
import com.aliasi.tokenizer.StopTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.CollectionUtils;
import com.lingpipe.book.tok.DisplayTokens;

public class StopTokenizer {
	/*
	 * The Indo-European tokenizer produces the tokens, the resulting tokens are
	 * converted to lower case, and then stop words are removed.
	 */

	public static void main(String[] args) {
		//String text = "This is a test of the emergency broadcast system.";
		String text = "If the projections are correct, 2012 will be the fourth and final year with a deficit over $1 trillion. When Mr. Obama took office in January 2009, the deficit for that year was projected to be — and ultimately was — $1.3 trillion. A similarly large shortfall followed for 2010. The president’s budget charts a decline from the trillion-dollar level after 2012 to a low of $607 billion in fiscal year 2015, before the annual deficits start inching up again in dollar terms.";

		// Custom stop set. Entries are lower case because the stop filter runs
		// after the lower-casing step.
		Set<String> stopSet = CollectionUtils.asSet(
				"with", "was", "this", "that", "from", "s", "'", ".", ",", "—", "-", "’",
				"in", "to", "if", "are", "will", "be", "and", "a", "an", "the", "of", "is");
		// Chain the factories: tokenize, then lower-case, then drop stop words.
		TokenizerFactory f1 = IndoEuropeanTokenizerFactory.INSTANCE;
		TokenizerFactory f2 = new LowerCaseTokenizerFactory(f1);
		TokenizerFactory f3 = new StopTokenizerFactory(f2, stopSet);
		// Alternative: use LingPipe's built-in English stop list instead.
		// TokenizerFactory f3 = new EnglishStopTokenizerFactory(f2);

		DisplayTokens.displayTokens(text,f3);
	}

}
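
For anyone who wants to see what the three stages do without having LingPipe on the classpath, here is a rough plain-Java sketch of the same idea. It is only an approximation: the regular expression is a simplification and is not how IndoEuropeanTokenizerFactory actually splits text, and the names StopFilterSketch and tokenize are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StopFilterSketch {
    // Assumption: a token is a run of letters/digits, or a single other
    // non-whitespace character. LingPipe's real tokenizer has more rules.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Tokenize, lower-case each token, and keep only tokens not in stopSet.
    static List<String> tokenize(String text, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (!stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopSet = Set.of("this", "is", "a", "of", "the", ".");
        String text = "This is a test of the emergency broadcast system.";
        // Prints: [test, emergency, broadcast, system]
        System.out.println(tokenize(text, stopSet));
    }
}
```

The order of the steps matters in the same way as in the LingPipe version: because lower-casing happens before the stop check, the stop set only needs lower-case entries.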

Actually, I’m not sure what stop lists are used by Google, Yahoo, Bing, or any other search engines, because I don’t believe this information is public. In fact, from my understanding, search engines no longer simply throw stop words away. You can see evidence of this by searching Google for “rat”, “a rat”, and “the rat”. You will notice that you get noticeably different results in each case, which would not happen if “a” and “the” were dropped before matching. The same holds true for Yahoo and Bing.
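
The reasoning behind that test can be shown in a few lines of plain Java. This is a hypothetical sketch (QueryStopSketch and normalize are made-up names, and it says nothing about how any real search engine works): if an engine removed "a" and "the" before matching, all three queries would reduce to the same token list, so they would be expected to return the same results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QueryStopSketch {
    // Lower-case the query, split on whitespace, and drop stop words.
    static List<String> normalize(String query, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String word : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty() && !stopSet.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopSet = Set.of("a", "the");
        for (String query : List.of("rat", "a rat", "the rat")) {
            // Each query normalizes to [rat] under this stop set.
            System.out.println(query + " -> " + normalize(query, stopSet));
        }
    }
}
```

Since the three queries are indistinguishable after stop-word removal, getting different results for them suggests the stop words are still being taken into account.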

