Conditional Random Fields (CRFs)

Introduction of CRF model resource
Training Tool

CRF with JavaNLP

I use the Stanford Named Entity Recognizer (JavaNLP) to train a model on data from ORCID (a Thai text corpus).
To train on ORCID in JavaNLP, I converted the data to a two-column format with one word per line.
Example data (fields are separated by a tab, \t):

Word  Label
ท่าน	OT
ผู้ฟัง	OT
คะ	OT


ความ	OT
สุขใจ	OT
ย่อม	OT
เป็น	OT
สิ่ง	OT
ที่	OT
คน	OT
เรา	OT
ปรารถนา	OT

**** NA marks a named entity
**** OT marks any other word
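
Before training, it can help to check that a file really follows this two-column layout. The following is a minimal Java sketch, not part of JavaNLP. The class ColumnFormatSketch and its methods are hypothetical names, and the sketch assumes that blank lines separate sentences (as the example above suggests) and that the file uses the TIS-620 encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for the word<TAB>label format; hypothetical, not a JavaNLP class. */
final class ColumnFormatSketch {

    /** One word with its label (NA or OT in the example data). */
    static final class Token {
        final String word;
        final String label;

        Token(String word, String label) {
            this.word = word;
            this.label = label;
        }
    }

    /** Splits the text into sentences; blank lines are assumed to be sentence boundaries. */
    static List<List<Token>> parse(String text) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\\r?\\n", -1)) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Expected word<TAB>label, got: " + line);
            }
            current.add(new Token(fields[0], fields[1]));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a column file; assumes the JDK provides the TIS-620 charset.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, Charset.forName("TIS-620"));
        System.out.println(parse(text).size() + " sentences read");
    }
}
```

Running it on the training file prints the number of sentences found, and a line that does not have exactly two tab-separated fields raises an exception that names the offending line.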

Command to train the model and run it on the test file

java -Duser.language=th -Duser.region=TH -Dfile.encoding=tis-620 -mx1000m \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -readerAndWriter edu.stanford.nlp.sequences.ColumnDocumentReaderAndWriter \
  -trainFile orcid_notag.crf -serializeTo orcid.ner.notag.gz \
  -testFile test_notag.crf -conllNoTags > output_no_tag

Parameters

  • -Duser.language=th -Duser.region=TH -Dfile.encoding=tis-620 ⇒ Java locale and file encoding settings for Thai
  • -mx1000m ⇒ maximum Java heap size (1000 MB)
  • edu.stanford.nlp.ie.crf.CRFClassifier ⇒ Java class to run
  • -readerAndWriter edu.stanford.nlp.sequences.ColumnDocumentReaderAndWriter ⇒ Java class that handles the input file format
  • -trainFile orcid_notag.crf ⇒ name of the training file
  • -serializeTo orcid.ner.notag.gz ⇒ name of the file the trained model is saved to
  • -testFile test_notag.crf ⇒ name of the test file
  • > output_no_tag ⇒ file that receives the output
  • -conllNoTags ⇒ parameter set used to train the model

Example output of JavaNLP

Word GoldLabel PredictLabel
พบ      OT      OT
เพียง   OT      OT
รายงาน  OT      OT
โรค     OT      OT
เอดส์   OT      OT
ครั้ง   OT      OT
แรก     OT      OT
ใน      OT      OT
สหรัฐอเมริกา    NA      NA


แอฟริกา NA      NA


ใน      OT      OT
ประเทศ  OT      OT
ไทย     NA      NA
เรา     OT      OT
พบ      OT      OT
ผู้ป่วย OT      OT
โรค     OT      OT
เอดส์   OT      OT
ราย     OT      OT
แรก     OT      OT
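
Because each output line carries both the gold label and the predicted label, the result can be scored per word. The following is a minimal Java sketch under stated assumptions: the class NerOutputScoreSketch is a hypothetical name, the columns are assumed to be separated by whitespace as in the example above, and the score is computed per word for one label (for example NA) rather than per multi-word entity.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative per-word scorer for "word gold predicted" lines; hypothetical helper. */
final class NerOutputScoreSketch {

    /** Returns {precision, recall, F1} for the given label, counted per word. */
    static double[] score(List<String> lines, String label) {
        int truePositive = 0;
        int falsePositive = 0;
        int falseNegative = 0;
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // blank separator lines and malformed lines are skipped
            }
            boolean gold = fields[1].equals(label);
            boolean predicted = fields[2].equals(label);
            if (gold && predicted) {
                truePositive++;
            } else if (predicted) {
                falsePositive++;
            } else if (gold) {
                falseNegative++;
            }
        }
        int predictedTotal = truePositive + falsePositive;
        int goldTotal = truePositive + falseNegative;
        double precision = predictedTotal == 0 ? 0.0 : (double) truePositive / predictedTotal;
        double recall = goldTotal == 0 ? 0.0 : (double) truePositive / goldTotal;
        double f1 = precision + recall == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // Placeholder words; real input would be the lines of output_no_tag.
        List<String> lines = Arrays.asList("w1 OT OT", "w2 NA NA", "", "w3 NA OT");
        System.out.println(Arrays.toString(score(lines, "NA")));
    }
}
```

A header line such as "Word GoldLabel PredictLabel" does not affect the counts, because neither of its label columns equals the label being scored.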

JavaNLP parameter sets (select one of the following)

  • conllNoTags
  • macro
  • goodCoNLL
  • notags

I selected conllNoTags and modified it to fit my data format:

// modified in edu.stanford.nlp.sequences.SeqClassifierFlags.java

          readerAndWriter = "edu.stanford.nlp.sequences.ColumnDocumentReaderAndWriter";
//          trainMap=testMap="word=0,answer=1";
          map="word=0,answer=1";
          useObservedSequencesOnly = true;
          // useClassFeature = true;
          useLongSequences = true;
          //useTaggySequences = true;
          useNGrams = true;
          usePrev = true;
          useNext = true;
          //useTags = true;
          useWordPairs = true;
          useSequences = true;
          useTitle=true;
          usePrevSequences = true;
          // noMidNGrams
          noMidNGrams = true;
          // reverse
          useReverse = false;
           // typeseqs3
          useTypeSeqs = true;
          useTypeSeqs2 = true;
          useTypeySequences = true;
          // wordtypes2 && known
          wordShape = WordShapeClassifier.WORDSHAPEDAN2USELC;
          // occurrence
          //useOccurrencePatterns = true;
          // realword
          useLastRealWord = true;
          useNextRealWord = true;
          // smooth
          sigma = 20.0;
          adaptSigma = 20.0;
           
          // normalize
          normalize = true;
          normalizeTimex = true;
          maxLeft = 2;
          useDisjunctive = true;
          disjunctionWidth = 4;
          useBoundarySequences = true;
          //useLemmas = true;  // no-op except for German
          //usePrevNextLemmas = true;  // no-op except for German
          //inputEncoding="iso-8859-1";
          inputEncoding="tis-620";
          // opt
          useQN = true;
          QNsize = 15;

For more detail, see the JavaNLP JavaDoc.

CRF with CRF++

I use CRF++ for the Thai word segmentation task.
The ORCID data is represented as a sequence of characters.
Example training data (fields are separated by a tab, \t):

character  type  label
ท	c	B
่	t	I
า	v1	I
น	c	I
ผ	c	B
ู	v1	I
้	t	I
ฟ	c	I
ั	v1	I
ง	c	I
ค	c	B
ะ	v1	I

Character types

  • c ⇒ consonant
  • v1 ⇒ vowel that cannot be the first character of a word
  • v2 ⇒ vowel that can be the first character of a word
  • t ⇒ tone mark
  • s ⇒ sign character
  • d ⇒ digit
  • o ⇒ other character (English letters etc.)
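
The training rows can be generated from word-segmented text by assigning each character a type and labelling the first character of every word B and the rest I. The following is a minimal Java sketch of that conversion. The class ThaiCharTypeSketch is a hypothetical name, and the Unicode ranges chosen for each type are my assumptions based on the Thai block (U+0E00 to U+0E7F), not the mapping used to build the original data. In main, the escapes spell the first word of the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative character typing and B/I labelling; the ranges below are assumptions. */
final class ThaiCharTypeSketch {

    /** Maps one character to a type code (c, v1, v2, t, s, d, o). */
    static String typeOf(char ch) {
        if (ch >= 0x0E01 && ch <= 0x0E2E) return "c";   // Thai consonants
        if (ch >= 0x0E40 && ch <= 0x0E44) return "v2";  // leading vowels, may start a word
        if ((ch >= 0x0E30 && ch <= 0x0E39) || ch == 0x0E45) return "v1"; // other vowels
        if (ch >= 0x0E48 && ch <= 0x0E4B) return "t";   // tone marks
        if ((ch >= 0x0E50 && ch <= 0x0E59) || (ch >= '0' && ch <= '9')) return "d";
        if (ch >= 0x0E00 && ch <= 0x0E7F) return "s";   // rest of the Thai block treated as signs
        return "o";                                     // everything else
    }

    /** Emits "character<TAB>type<TAB>label" rows, B for a word's first character and I otherwise. */
    static List<String> toRows(List<String> words) {
        List<String> rows = new ArrayList<>();
        for (String word : words) {
            for (int i = 0; i < word.length(); i++) {
                char ch = word.charAt(i);
                rows.add(ch + "\t" + typeOf(ch) + "\t" + (i == 0 ? "B" : "I"));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Four characters: consonant, tone mark, vowel, consonant.
        for (String row : toRows(Arrays.asList("\u0E17\u0E48\u0E32\u0E19"))) {
            System.out.println(row);
        }
    }
}
```

For the characters shown in the training example above, this mapping yields the same types (c, t, v1). Other characters, especially those placed in the s class, should be checked against the actual data.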

command for training data

crf_learn -f 3 -c 4.0 template orcid.seg model

Parameters (for more detail, see the CRF++ documentation)

  • template ⇒ feature template file
  • orcid.seg ⇒ training file
  • model ⇒ output model file
  • -f 3 ⇒ cut-off threshold for features. CRF++ uses only the features that occur at least NUM times in the training data (here NUM is 3). The default value is 1. When CRF++ is applied to large data, the number of unique features can reach several million, and this option is useful in such cases.
  • -c 4.0 ⇒ hyper-parameter C of the CRF. With a larger C value, the CRF tends to overfit the given training corpus, so this parameter controls the balance between overfitting and underfitting. The results are influenced significantly by this parameter. An optimal value can be found with held-out data or with a more general model selection method such as cross-validation.

command for testing data

crf_test -m model test.seg > result_parameter.txt

Parameters (for more detail, see the CRF++ documentation)

  • -m model ⇒ model from training process
  • test.seg ⇒ test data file
  • result_parameter.txt ⇒ result file
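
crf_test writes each input row back with the predicted tag appended as the last column, so the segmented words can be rebuilt from the result file. The following is a minimal Java sketch of that step. The class SegmentationResultSketch is a hypothetical name, and it assumes tab-separated rows whose first column is the character and whose last column is the predicted B or I tag, with blank lines between sequences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative word rebuilding from character rows with predicted B/I tags; hypothetical helper. */
final class SegmentationResultSketch {

    /** Joins characters into words; a predicted B tag or a blank line starts a new word. */
    static List<String> toWords(List<String> rows) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String row : rows) {
            if (row.trim().isEmpty()) {
                // Blank line: end of a sequence, so close the word in progress.
                if (current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
                continue;
            }
            String[] fields = row.split("\t");
            String predicted = fields[fields.length - 1];
            if (predicted.equals("B") && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(fields[0]);
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Placeholder rows in the layout: character, type, gold label, predicted label.
        List<String> rows = Arrays.asList("a\to\tB\tB", "b\to\tI\tI", "c\to\tB\tB");
        System.out.println(toWords(rows));
    }
}
```

Comparing the rebuilt words with the words obtained from the gold label column in the same way gives a word-level view of the segmentation errors.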
crf/main.txt · Last modified: 2014/10/31 06:44 (external edit)