Commit 934e9ad

Feature/Optional fine-grained POS (#11)

* Update README.md
* Fix #10

1 parent 4ce8e7b

File tree: 2 files changed, +60 −21 lines changed

* README.md
* spacy_udpipe/language.py
README.md

Lines changed: 32 additions & 12 deletions
````diff
@@ -14,7 +14,7 @@ Use the package manager [pip](https://pip.pypa.io/en/stable/) to install spacy-u
 pip install spacy-udpipe
 ```
 
-After installation, use `spacy_udpipe.download(lang)` to download the pre-trained model for the desired language.
+After installation, use `spacy_udpipe.download(<language ISO code>)` to download the pre-trained model for the desired language.
 
 ## Usage
 The loaded UDPipeLanguage class returns a spaCy [`Language` object](https://spacy.io/api/language), i.e., the nlp object you can use to process text and create a [`Doc` object](https://spacy.io/api/doc).
````
````diff
@@ -32,32 +32,42 @@ for token in doc:
     print(token.text, token.lemma_, token.pos_, token.dep_)
 
 ```
-As all attributes are computed once and set in the custom [`Tokenizer`](https://spacy.io/api/tokenizer), the `nlp.pipeline` is empty.
+As all attributes are computed once and set in the custom [`Tokenizer`](https://spacy.io/api/tokenizer), the `Language.pipeline` is empty.
+
+#### Loading a custom model
+The following code snippet demonstrates how to load a custom `UDPipe` model (for the Croatian language):
+```python
+import spacy_udpipe
+
+nlp = spacy_udpipe.load_from_path(lang="hr",
+                                  path="./custom_croatian.udpipe",
+                                  meta={"description": "Custom 'hr' model"})
+text = "Wikipedija je enciklopedija slobodnog sadržaja."
+
+doc = nlp(text)
+for token in doc:
+    print(token.text, token.lemma_, token.pos_, token.dep_)
+```
+This can be done for any of the languages supported by spaCy. For an exhaustive list, see [spaCy languages](https://spacy.io/usage/models#languages).
 
 ## Authors and acknowledgment
-Created by [Antonio Šajatović](http://github.com/asajatovic)
-during an internship at [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://takelab.fer.hr/).
+Created by [Antonio Šajatović](http://github.com/asajatovic) during an internship at [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://takelab.fer.hr/).
 
 ## Contributing
 Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
 
-Please make sure to update tests as appropriate.
-
-Tests are run automatically for each pull request on the master branch.
-To start the tests locally, just run [`pytest`](https://docs.pytest.org/en/latest/contents.html) in the root source directory.
+Please make sure to update tests as appropriate. Tests are run automatically for each pull request on the master branch. To start the tests locally, just run [`pytest`](https://docs.pytest.org/en/latest/contents.html) in the root source directory.
 
 ## License
-[MIT](https://choosealicense.com/licenses/mit/) © TakeLab
+[MIT](https://choosealicense.com/licenses/mit/) © Text Analysis and Knowledge Engineering Lab (TakeLab)
 
 ## Project status
 Maintained by [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://takelab.fer.hr/).
 
 ## Notes
 * All available pre-trained models are licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
 
-* All annotations match with Spacy's, except for token.tag_, which map from [CoNLL](https://universaldependencies.org/format.html) XPOS tag (language-specific part-of-speech tag), defined for each language separately by the corresponding [Universal Dependencies](https://universaldependencies.org/) treebank.
-
-* Full list of supported languages and models is available in [`languages.json`](https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/languages.json).
+* Full list of pre-trained models for supported languages is available in [`languages.json`](https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/languages.json).
 
 * This package exposes a `spacy_languages` entry point in its [`setup.py`](https://github.com/TakeLab/spacy-udpipe/blob/master/setup.py) so full support for serialization is enabled:
 ```python
````
````diff
@@ -71,3 +81,13 @@ Maintained by [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://tak
 nlp = spacy.load("./udpipe-spacy-model", udpipe_model=udpipe_model)
 
 ```
+* Known possible issues:
+  * Tag map
+
+    `Token.tag_` is a [CoNLL](https://universaldependencies.org/format.html) XPOS tag (language-specific part-of-speech tag), defined for each language separately by the corresponding [Universal Dependencies](https://universaldependencies.org/) treebank. Mappings between XPOS and Universal Dependencies POS tags should be defined in a `TAG_MAP` dictionary (located in language-specific `tag_map.py` files), along with optional morphological features. See [spaCy tag map](https://spacy.io/usage/adding-languages#tag-map) for more details.
+  * Syntax iterators
+
+    In order to extract `Doc.noun_chunks`, a proper syntax iterator implementation for the language of interest is required. For more details, please see [spaCy syntax iterators](https://spacy.io/usage/adding-languages#syntax-iterators).
+  * Other language-specific issues
+
+    A quick way to check language-specific defaults in [spaCy](https://spacy.io) is to visit [spaCy language support](https://spacy.io/usage/models#languages). Also, please see [spaCy language data](https://spacy.io/usage/adding-languages#language-data) for details regarding other language-specific data.
````
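The README's design note that the pipeline is empty because every token attribute is set once at tokenization time can be sketched with plain Python. `Token` and `make_doc` below are illustrative stand-ins, not part of the spacy-udpipe API:

```python
from dataclasses import dataclass

# Sketch of the "all attributes computed up front" design: the tokenizer
# fills in every attribute, so no later pipeline components are needed.
@dataclass
class Token:
    text: str
    lemma_: str
    pos_: str

def make_doc(words, lemmas, pos):
    # zip the parallel per-token attribute lists into finished tokens
    return [Token(w, l, p) for w, l, p in zip(words, lemmas, pos)]

doc = make_doc(["Wikipedija", "je", "enciklopedija"],
               ["Wikipedija", "biti", "enciklopedija"],
               ["PROPN", "AUX", "NOUN"])
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```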

spacy_udpipe/language.py

Lines changed: 28 additions & 9 deletions
````diff
@@ -6,36 +6,37 @@
 from spacy.language import Language
 from spacy.symbols import DEP, HEAD, LEMMA, POS, TAG
 from spacy.tokens import Doc
-
 from ufal.udpipe import (InputFormat, Model, OutputFormat, ProcessingError,
                          Sentence)
 
 from .util import get_defaults, get_path
 
 
-def load(lang):
+def load(lang, **kwargs):
     """Convenience function for initializing the Language class that
     mimics spacy.load.
 
     lang (unicode): ISO 639-1 language code or shorthand UDPipe model name.
+    kwargs: Optional config parameters.
     RETURNS (spacy.language.Language): The UDPipeLanguage object.
     """
     model = UDPipeModel(lang)
-    nlp = UDPipeLanguage(model)
+    nlp = UDPipeLanguage(model, **kwargs)
     return nlp
 
 
-def load_from_path(lang, path, meta=None):
+def load_from_path(lang, path, meta=None, **kwargs):
     """Convenience function for initializing the Language class and loading
     a custom UDPipe model via the path argument.
 
     lang (unicode): ISO 639-1 language code.
     path (unicode): Path to the UDPipe model.
     meta (dict): Meta-information about the UDPipe model.
+    kwargs: Optional config parameters.
     RETURNS (spacy.language.Language): The UDPipeLanguage object.
     """
     model = UDPipeModel(lang, path, meta)
-    nlp = UDPipeLanguage(model)
+    nlp = UDPipeLanguage(model, **kwargs)
     return nlp
 
 
````
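The two loaders above differ only in how the model is built; every extra keyword argument is forwarded unchanged to the Language class. A stdlib-only sketch of that forwarding pattern, with `StubModel` and `StubLanguage` as hypothetical stand-ins for `UDPipeModel` and `UDPipeLanguage`:

```python
# Sketch of the keyword-forwarding pattern introduced in this commit.
# StubModel and StubLanguage are illustrative stand-ins, not spacy-udpipe code.

class StubModel:
    def __init__(self, lang):
        self.lang = lang

class StubLanguage:
    def __init__(self, model, **kwargs):
        self.model = model
        # optional config parameters arrive untouched via **kwargs
        self.ignore_tag_map = kwargs.get("ignore_tag_map", False)

def load(lang, **kwargs):
    # mirrors load()/load_from_path(): build the model, then pass every
    # extra keyword straight through to the Language constructor
    model = StubModel(lang)
    return StubLanguage(model, **kwargs)

nlp = load("hr", ignore_tag_map=True)
print(nlp.ignore_tag_map)  # True
```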

````diff
@@ -57,6 +58,9 @@ def __init__(self, udpipe_model, meta=None, **kwargs):
         """
         self.udpipe = udpipe_model
         self.Defaults = get_defaults(udpipe_model._lang)
+        ignore_tag_map = kwargs.get("ignore_tag_map", False)
+        if ignore_tag_map:
+            self.Defaults.tag_map = {}  # workaround for ValueError: [E167]
         self.vocab = self.Defaults.create_vocab()
         self.tokenizer = UDPipeTokenizer(self.udpipe, self.vocab)
         self.pipeline = []
````
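The new `ignore_tag_map` flag simply empties the language defaults' tag map before the vocabulary is created. A minimal sketch of the idea, where `StubDefaults` is a hypothetical stand-in for spaCy's per-language defaults class:

```python
# Sketch of the ignore_tag_map workaround: drop the fine-grained XPOS->UPOS
# mapping before anything downstream consults it. StubDefaults is illustrative.

class StubDefaults:
    tag_map = {"Ncmsn": {"pos": "NOUN"}}  # language-specific tag map

def build_defaults(defaults, **kwargs):
    if kwargs.get("ignore_tag_map", False):
        defaults.tag_map = {}  # avoids E167-style tag-map mismatches
    return defaults

defaults = build_defaults(StubDefaults, ignore_tag_map=True)
```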
````diff
@@ -142,10 +146,25 @@ def __call__(self, text):
             else:
                 next_token = tokens[i + 1]
             spaces.append(not span.startswith(next_token.form))
-        attrs = [POS, TAG, DEP, HEAD]
-        array = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
-        doc = Doc(self.vocab, words=words,
-                  spaces=spaces).from_array(attrs, array)
+        try:
+            attrs = [POS, TAG, DEP, HEAD]
+            array = numpy.array(
+                list(zip(pos, tags, deps, heads)), dtype="uint64")
+            doc = Doc(self.vocab,
+                      words=words,
+                      spaces=spaces).from_array(attrs, array)
+        except ValueError as e:
+            if '[E167]' in str(e):
+                raise ValueError(
+                    "Could not properly assign morphology features. "
+                    f"Please update the tag map for '{self.model._lang}'"
+                    " language. See "
+                    "https://spacy.io/usage/adding-languages#tag-map "
+                    "for details. A quick workaround is to use the keyword "
+                    "argument ignore_tag_map=True when loading UDPipeLanguage."
+                )
+            else:
+                raise e
         # Overwrite lemmas separately to prevent overwriting by spaCy
         lemma_array = numpy.array([[lemma]
                                    for lemma in lemmas], dtype="uint64")
````
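The `try`/`except` added above translates spaCy's opaque `[E167]` `ValueError` into an actionable message. The same pattern in isolation, with `assign_attrs` as a hypothetical stand-in for `Doc.from_array` (the error text is illustrative):

```python
# Sketch of the error-translation pattern: catch the ValueError, detect the
# E167 error code in its message, and re-raise with actionable guidance.

def assign_attrs(tag_known):
    # stand-in for Doc.from_array; spaCy raises E167 when a tag is
    # missing from the language's tag map
    if not tag_known:
        raise ValueError("[E167] Unknown morphological feature")
    return "doc"

def safe_assign(tag_known, lang="hr"):
    try:
        return assign_attrs(tag_known)
    except ValueError as e:
        if "[E167]" in str(e):
            raise ValueError(
                f"Please update the tag map for '{lang}', or load the "
                "model with ignore_tag_map=True."
            ) from e
        raise  # unrelated ValueErrors propagate unchanged
```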
