sentiment analysis - What exactly is an n-gram?
I found a previous question on SO: N-grams: explanation + 2 applications. The OP gave an example and asked whether it was correct:

Sentence: "I live in NY."

word level bigrams (2 for n): "# i", "i live", "live in", "in ny", "ny #"
character level bigrams (2 for n): "#i", "i#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#n", "ny", "y#"

When you have this array of n-gram parts, you drop the duplicates and add a counter for each part giving its frequency:

word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]

Someone in the answers section confirmed this was correct, but unfortunately I'm a bit lost beyond that, as I didn't understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose an n value between 7 and 12, without saying why.

What is a good n-gram value, and how should I take it into account when using a tool like LingPipe?
Edit: this is the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html
n-grams are simply all combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word "fox", all 2-grams (or "bigrams") are "fo" and "ox". You may also count the word boundary, which expands the list of 2-grams to "#f", "fo", "ox", and "x#", where "#" denotes a word boundary.
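To make that concrete, here is a minimal sketch in plain Python (LingPipe does this for you internally; the function name `char_ngrams` is just for illustration):

```python
def char_ngrams(word, n=2):
    """All length-n substrings of the word, with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("fox"))  # ['#f', 'fo', 'ox', 'x#']
```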
You can do the same on the word level. As an example, the text "Hello, world!" contains the following word-level bigrams: "# hello", "hello world", and "world #".
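A word-level version works the same way, just over a list of tokens instead of a string of characters. This is a sketch with a deliberately crude tokeniser (lowercase, keep only word characters), not how any particular library tokenises:

```python
import re

def word_ngrams(text, n=2):
    """Word-level n-grams, with '#' marking the sentence boundaries."""
    # crude tokenisation: lowercase, then keep runs of word characters
    words = ["#"] + re.findall(r"[\w']+", text.lower()) + ["#"]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Hello, world!"))  # ['# hello', 'hello world', 'world #']
```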
The basic point of n-grams is that they capture the language structure from a statistical point of view: which letter or word is likely to follow a given one. The longer the n-gram (the higher n is), the more context you have to work with. The optimum length depends on the application: if your n-grams are too short, you may fail to capture important differences; on the other hand, if they are too long, you may fail to capture the "general knowledge" and stick only to particular cases.
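Tying this back to the frequency counts in your question: what an n-gram model actually works with is the count of each gram, not the raw list. A minimal sketch over your "i live in ny" sentence, again in plain Python rather than LingPipe:

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Boundary-padded character n-grams of a single word."""
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# accumulate character-bigram frequencies over the whole sentence
counts = Counter()
for word in "i live in ny".split():
    counts.update(char_ngrams(word))

# '#i' occurs twice (from 'i' and 'in'); every other bigram occurs once
print(counts)
```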