sentiment analysis - What exactly is an n-gram?


I found a previous question on SO: n-grams: explanation + 2 applications. The OP gave an example and asked if it was correct:

sentence: "i live in ny."  word level bigrams (2 n): "# i', "i live", "live in", "in ny", 'ny #' character level bigrams (2 n): "#i", "i#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#n", "ny", "y#"  when have array of n-gram-parts, drop duplicate ones , add counter each part giving frequency:  word level bigrams: [1, 1, 1, 1, 1] character level bigrams: [2, 1, 1, ...] 

Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that and didn't understand anything else that was said! I'm using LingPipe and following a tutorial that stated I should choose an nGram value between 7 and 12 – without stating why.

What is this nGram value, and how should I take it into account when using a tool like LingPipe?

Edit: this is the tutorial I'm following: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

N-grams are simply combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo and ox. You may also count the word boundary – that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.
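A tiny Python sketch of that idea (purely illustrative, not LingPipe's API; the function name is made up):

```python
def char_ngrams(word, n, boundary=True):
    # character-level n-grams of a single word; '#' marks the word boundary
    s = '#' + word + '#' if boundary else word
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("fox", 2, boundary=False))  # ['fo', 'ox']
print(char_ngrams("fox", 2))                  # ['#f', 'fo', 'ox', 'x#']
```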

You can do the same on the word level. As an example, the hello, world! text contains the following word-level bigrams: # hello, hello world, world #.
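And the word-level version (again just a sketch; punctuation is stripped and the text lowercased so the output matches the list above):

```python
import re

def word_ngrams(text, n):
    # lowercase, keep only word characters, and pad with '#' boundary markers
    words = ['#'] + re.findall(r"[\w']+", text.lower()) + ['#']
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Hello, world!", 2))  # ['# hello', 'hello world', 'world #']
```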

The basic point of n-grams is that they capture the language structure from a statistical point of view – which letter or word is likely to follow a given one. The longer the n-gram (the higher the n), the more context you have to work with. The optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences; on the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
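To see what a longer n buys you, compare character n-grams of a single word for a few values of n (plain Python again, using the same boundary convention):

```python
def char_ngrams(text, n):
    padded = '#' + text + '#'                 # '#' marks the word boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (2, 4, 8):
    print(n, char_ngrams("awesome", n))
# n=2: ['#a', 'aw', 'we', 'es', 'so', 'om', 'me', 'e#']  -> very local context
# n=4: ['#awe', 'awes', 'weso', 'esom', 'some', 'ome#']
# n=8: ['#awesome', 'awesome#']                          -> almost the whole word
```

Presumably that is also why the tutorial suggests a character nGram length between 7 and 12: n-grams of that size are long enough to span most whole words plus a boundary, yet short enough to still generalize across texts.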

