This is the stop list I created for my stop tokenizer:
- ‘
- “
- ’
- -
- —
- .
- a
- an
- and
- are
- be
- for
- from
- if
- in
- is
- of
- that
- the
- this
- was
- will
- with
I tested it on text pulled from news articles I found online. Basically, I added punctuation and common function words (articles, prepositions, conjunctions, and the like) to the stop list.
Here is my code:
import java.util.Set;
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
import com.aliasi.tokenizer.StopTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.CollectionUtils;
import com.lingpipe.book.tok.DisplayTokens;
public class StopTokenizer {

    /*
     * The Indo-European tokenizer splits the text into tokens, the tokens are
     * converted to lower case, and then stop words are removed.
     */
    public static void main(String[] args) {
        //String text = "This is a test of the emergency broadcast system.";
        String text = "If the projections are correct, 2012 will be the fourth and final year with a deficit over $1 trillion. When Mr. Obama took office in January 2009, the deficit for that year was projected to be — and ultimately was — $1.3 trillion. A similarly large shortfall followed for 2010. The president’s budget charts a decline from the trillion-dollar level after 2012 to a low of $607 billion in fiscal year 2015, before the annual deficits start inching up again in dollar terms.";

        // Stop words and punctuation to drop; all lower case because filtering runs after lowercasing
        Set<String> stopSet = CollectionUtils.asSet(
                "with", "was", "this", "that", "from", "s", "'", ".", ",", "—", "-", "’",
                "in", "to", "if", "are", "will", "be", "and", "a", "an", "the", "of", "is");

        // Build the chain: tokenize -> lower case -> remove stop words
        TokenizerFactory f1 = IndoEuropeanTokenizerFactory.INSTANCE;
        TokenizerFactory f2 = new LowerCaseTokenizerFactory(f1);
        TokenizerFactory f3 = new StopTokenizerFactory(f2, stopSet);

        // Could try EnglishStopTokenizerFactory instead here:
        // TokenizerFactory f3 = new EnglishStopTokenizerFactory(f2);

        DisplayTokens.displayTokens(text, f3);
    }
}
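For anyone without LingPipe on the classpath, the same lowercase-then-filter idea can be roughly sketched in plain Java. This is only an approximation: the class name StopFilterSketch and the regex tokenizer are my own assumptions, and the regex does not reproduce the rules of IndoEuropeanTokenizerFactory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the LingPipe chain above, using only the JDK.
public class StopFilterSketch {

    // Assumed tokenization rule (not LingPipe's): a token is either a run of
    // letters/digits or a single non-whitespace symbol.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|\\S");

    // Tokenize, lower-case each token, and keep only tokens not in stopSet.
    public static List<String> tokenize(String text, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (!stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopSet = Set.of("this", "is", "a", "of", "the", ".");
        String text = "This is a test of the emergency broadcast system.";
        // prints [test, emergency, broadcast, system]
        System.out.println(tokenize(text, stopSet));
    }
}
```

The stop set has to be lower case here for the same reason as in the LingPipe version: filtering happens after lowercasing, so "This" is compared as "this".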
Actually, I’m not sure which stop lists Google, Yahoo, Bing, or other search engines use, because I don’t believe that information is public. In fact, as I understand it, search engines don’t rely on stop lists anymore. You can see this for yourself by searching for “rat”, “a rat”, and “the rat” in Google: each query returns different results, which would not happen if “a” and “the” were simply dropped before matching. The same holds true for Yahoo and Bing.
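To make the reasoning behind that test concrete, here is a small plain-Java sketch of what a stop-listed engine would do to those three queries. The class QueryStopDemo and its whitespace splitting are hypothetical and say nothing about how any real search engine processes queries; the point is only that a stop list containing “a” and “the” would reduce all three queries to the same token list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration: queries that differ only in stop words
// become identical once the stop words are removed.
public class QueryStopDemo {

    // Lower-case the query, split on whitespace, and drop stop words.
    public static List<String> normalize(String query, Set<String> stopSet) {
        return Arrays.stream(query.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(t -> !t.isEmpty() && !stopSet.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stopSet = Set.of("a", "the");
        for (String query : List.of("rat", "a rat", "the rat")) {
            // each line ends with [rat]
            System.out.println(query + " -> " + normalize(query, stopSet));
        }
    }
}
```

Since the three normalized queries are identical, an engine that applied such a stop list would have no basis for ranking them differently.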