NLP-related topics

Example applications

NLP key problems/techniques

  • Categorizing and tagging words
    Lexical categories (e.g., noun) and part-of-speech (POS) taggers (e.g., VBP for a present-tense verb, NN for a noun); n-gram tagging; transformation-based tagging (e.g., Brill tagging: guess the tag of each word, then go back and fix the mistakes)
    Sequence labeling and n-gram models
  • Information extraction
    Named Entity Recognition (NER) and Relation Detection (RD)
    NER seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names and locations. Names of genes and gene products, chemical compounds, and drugs are examples of named entities of great interest in bioinformatics. NER can be implemented with grammar-based approaches or ML approaches (token-level classification).
    NER and RD make it possible to identify interactions between proteins and drugs or between genes and diseases (BioNER and BioRD).
    E.g., from the sentence "BRCA1 gene causes predisposition to breast cancer and ovarian cancer", we get three tagged entities: BRCA1, Breast Cancer, and Ovarian Cancer. See the example (and sentence dependency tree).
  • Learning to classify texts
    Transform unstructured data into structured data; the rest then proceeds as in data mining of other structured data (clustering, classification, etc.).
    Bag-of-words (POS tags) (e.g., document-term matrix)
    Word2vec: a tool that takes a text corpus as input and produces word vectors (embeddings) as output, so that each word is represented as a vector. The resulting word vector file can be used as features in NLP and ML applications. Word vectors capture many linguistic regularities: for example, the vector operation vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') (see details). See a tutorial about word embeddings on TensorFlow.
  • Contextualized Language Models (LMs, e.g., transformers)
    Token or sentence level predictions
    Machine translation (e.g., English-to-German translation)
  • See more at https://www.nltk.org/book/ (NLP toolkit in Python)
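
The n-gram tagging idea listed under "Categorizing and tagging words" can be illustrated with its simplest case, a unigram tagger: each word receives the tag it carried most often in a tagged training corpus, and unseen words fall back to a default tag. The sketch below is a minimal, standard-library-only illustration, not the NLTK implementation; the tiny corpus, the function names, and the default tag NN are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical data, Penn Treebank-style tags).
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dogs", "NNS"), ("bark", "VBP")],
    [("the", "DT"), ("bark", "NN"), ("is", "VBZ"), ("rough", "JJ")],
    [("cats", "NNS"), ("bark", "VBP")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="NN"):
    """Tag each word with its most frequent training tag, else the default."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]


model = train_unigram_tagger(TRAIN)
tagged = tag(["The", "dog", "bark", "loudly"], model)
print(tagged)
```

In this toy corpus "bark" appears twice as VBP and once as NN, so the unigram tagger always picks VBP regardless of context. A bigram tagger would additionally condition on the previous tag, and a transformation-based tagger would start from output like this and apply correction rules.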

Miscellaneous

  • word cloud
  • token
  • stopwords
  • TextVectorization, tf_idf; TfidfVectorizer
  • datasketch: MinHash, MinHashLSHForest, MinHashLSH
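
The token, stopword, and tf-idf items above fit together as one small pipeline: split text into tokens, drop stopwords, then weight each remaining term by how frequent it is in a document and how rare it is across documents. The sketch below uses only the standard library and assumes the plain textbook weighting tf * log(N / df); library implementations such as TfidfVectorizer apply their own smoothing and normalisation, so their numbers will differ. The stopword list, documents, and function names are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}

DOCS = [
    "BRCA1 is a gene",
    "The gene causes predisposition to breast cancer",
    "Breast cancer and ovarian cancer",
]


def tokenize(text):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights


weights = tf_idf(DOCS)
for doc, w in zip(DOCS, weights):
    print(doc, "->", sorted(w, key=w.get, reverse=True))
```

In the first document, "brca1" outranks "gene" because "gene" also appears in the second document and is therefore less distinctive.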