
LtgOslo_Norsk

StephanOepen edited this page Nov 14, 2016 · 2 revisions

Background

LTG-internal notes on available data and software resources for the morpho-syntactic analysis of Norwegian text.

History

  • Norsk Ordbank

  • OBT

  • NorGram

  • NDT

  • INESS

  • UD

  • UDPipe

Segmentation

Segmentation of raw, running text comprises (a) splitting into ‘sentences’ (i.e. root-level utterances) and (b) tokenization into ‘word’-like units. There are two to three known de-facto standards: OBT-style, NorGram-style, and NDT-style.

The mtag tool is the first component in the OBT pipeline and implements sentence splitting, tokenization, and multi-tagging (morphological analysis). OBT tokens at times comprise multi-words, presumably for ‘words with spaces’ (e.g. for eksempel) and possibly for some names?

NDT tokenization (I believe) resembles PTB-style tokens, e.g. splitting at hyphens and slashes. StephanOepen has started to develop an NDT-compatible tokenizer in the REPP framework (through adaptation of PTB-style rules); he would also be interested in using REPP to devise a rule-based sentence splitter. One main advantage of REPP is its output of character offset pointers, which are often useful in applications.
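To make the idea of offset-bearing tokens concrete, here is a minimal Python sketch. It is not REPP and does not reproduce the NDT or PTB rule sets: the single regular expression `TOKEN` and the function `tokenize` are hypothetical names chosen for illustration, and the only behaviour assumed is the one mentioned above, i.e. splitting at hyphens, slashes, and other punctuation while recording character offsets into the original string.

```python
import re
from typing import List, Tuple

# Hypothetical pattern, not the NDT or REPP rule set: a token is either a run
# of word characters or a single character that is neither word nor whitespace.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) triples; start/end are offsets into text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]


if __name__ == "__main__":
    text = "Oslo-Bergen og/eller Trondheim."
    for token, start, end in tokenize(text):
        # The offsets always point back at the unmodified input string.
        assert text[start:end] == token
        print(token, start, end)
```

Because every token carries its start and end position, downstream annotations can be mapped back onto the raw text without re-aligning, which is the property that makes offset pointers useful in applications.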

The Punkt implementation in NLTK includes a pre-trained sentence splitting model for Norwegian.

The LAP team has been discussing integration of NorGram parsing, which would most likely make available the NorGram sentence splitter and tokenizer as stand-alone components; however, they require licensing of the proprietary XFST toolkit.

UDPipe implements a classification-based approach to splitting and tokenization and ships with pre-trained models for Norwegian. While the current release of the Norwegian UD treebank lacks information on whitespace boundaries, a forthcoming new version will address this problem.
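As an illustration of why whitespace information matters, the sketch below rebuilds the surface string of one sentence from CoNLL-U lines. It assumes the treebank follows the general UD convention of marking a token that is not followed by a space with `SpaceAfter=No` in the MISC column (column 10); whether and how the forthcoming Norwegian release does so is not confirmed here. The function name `detokenize` and the sample annotations are made up for the example, and multiword-token range lines are simply skipped as a simplification.

```python
from typing import List


def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from one CoNLL-U sentence using SpaceAfter=No."""
    parts: List[str] = []
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # simplification: skip ranges (1-2) and empty nodes (1.1)
        form, misc = cols[1], cols[9]
        parts.append(form)
        # A space follows the token unless MISC says otherwise.
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


# Made-up sample sentence; the annotation values are illustrative only.
sample = "\n".join("\t".join(row) for row in [
    ["1", "Det", "det", "PRON", "_", "_", "2", "expl", "_", "_"],
    ["2", "regner", "regne", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
])

if __name__ == "__main__":
    print(detokenize(sample))
```

Without the `SpaceAfter=No` marker on the second token, the same input would be rendered with a space before the full stop, so a tokenizer trained on such data cannot learn where tokens are and are not separated by whitespace.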

Morphology

By morphology we mean all sub-tasks of analyzing word-internal structure, including PoS tagging and lemmatization.

Syntax
