Commit 934e9ad

Feature/Optional fine-grained POS (#11)

* Update README.md
* Fix #10

1 parent 4ce8e7b

File tree: 2 files changed, +60 −21 lines changed

* README.md
* spacy_udpipe/language.py
README.md

Lines changed: 32 additions & 12 deletions
````diff
@@ -14,7 +14,7 @@ Use the package manager [pip](https://pip.pypa.io/en/stable/) to install spacy-u
 pip install spacy-udpipe
 ```
 
-After installation, use `spacy_udpipe.download(lang)` to download the pre-trained model for the desired language.
+After installation, use `spacy_udpipe.download(<language ISO code>)` to download the pre-trained model for the desired language.
 
 ## Usage
 The loaded UDPipeLanguage class returns a spaCy [`Language` object](https://spacy.io/api/language), i.e., the nlp object you can use to process text and create a [`Doc` object](https://spacy.io/api/doc).
````
````diff
@@ -32,32 +32,42 @@ for token in doc:
     print(token.text, token.lemma_, token.pos_, token.dep_)
 
 ```
-As all attributes are computed once and set in the custom [`Tokenizer`](https://spacy.io/api/tokenizer), the `nlp.pipeline` is empty.
+As all attributes are computed once and set in the custom [`Tokenizer`](https://spacy.io/api/tokenizer), the `Language.pipeline` is empty.
+
+#### Loading a custom model
+The following code snippet demonstrates how to load a custom `UDPipe` model (for the Croatian language):
+```python
+import spacy_udpipe
+
+nlp = spacy_udpipe.load_from_path(lang="hr",
+                                  path="./custom_croatian.udpipe",
+                                  meta={"description": "Custom 'hr' model"})
+text = "Wikipedija je enciklopedija slobodnog sadržaja."
+
+doc = nlp(text)
+for token in doc:
+    print(token.text, token.lemma_, token.pos_, token.dep_)
+```
+This can be done for any of the languages supported by spaCy. For an exhaustive list, see [spaCy languages](https://spacy.io/usage/models#languages).
 
 ## Authors and acknowledgment
-Created by [Antonio Šajatović](http://github.com/asajatovic)
-during an internship at [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://takelab.fer.hr/).
+Created by [Antonio Šajatović](http://github.com/asajatovic) during an internship at [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://takelab.fer.hr/).
 
 ## Contributing
 Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
 
-Please make sure to update tests as appropriate.
-
-Tests are run automatically for each pull request on the master branch.
-To start the tests locally, just run [`pytest`](https://docs.pytest.org/en/latest/contents.html) in the root source directory.
+Please make sure to update tests as appropriate. Tests are run automatically for each pull request on the master branch. To start the tests locally, just run [`pytest`](https://docs.pytest.org/en/latest/contents.html) in the root source directory.
 
 ## License
-[MIT](https://choosealicense.com/licenses/mit/) © TakeLab
+[MIT](https://choosealicense.com/licenses/mit/) © Text Analysis and Knowledge Engineering Lab (TakeLab)
 
 ## Project status
 Maintained by [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://takelab.fer.hr/).
 
 ## Notes
 * All available pre-trained models are licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
 
-* All annotations match with Spacy's, except for token.tag_, which map from [CoNLL](https://universaldependencies.org/format.html) XPOS tag (language-specific part-of-speech tag), defined for each language separately by the corresponding [Universal Dependencies](https://universaldependencies.org/) treebank.
-
-* Full list of supported languages and models is available in [`languages.json`](https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/languages.json).
+* Full list of pre-trained models for supported languages is available in [`languages.json`](https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/languages.json).
 
 * This package exposes a `spacy_languages` entry point in its [`setup.py`](https://github.com/TakeLab/spacy-udpipe/blob/master/setup.py) so full support for serialization is enabled:
 ```python
````
````diff
@@ -71,3 +81,13 @@ Maintained by [Text Analysis and Knowledge Engineering Lab (TakeLab)](http://tak
 nlp = spacy.load("./udpipe-spacy-model", udpipe_model=udpipe_model)
 
 ```
+* Known possible issues:
+  * Tag map
+
+    `Token.tag_` is a [CoNLL](https://universaldependencies.org/format.html) XPOS tag (language-specific part-of-speech tag), defined for each language separately by the corresponding [Universal Dependencies](https://universaldependencies.org/) treebank. Mappings between XPOS and Universal Dependencies POS tags should be defined in a `TAG_MAP` dictionary (located in language-specific `tag_map.py` files), along with optional morphological features. See [spaCy tag map](https://spacy.io/usage/adding-languages#tag-map) for more details.
+  * Syntax iterators
+
+    In order to extract `Doc.noun_chunks`, a proper syntax iterator implementation for the language of interest is required. For more details, please see [spaCy syntax iterators](https://spacy.io/usage/adding-languages#syntax-iterators).
+  * Other language-specific issues
+
+    A quick way to check language-specific defaults in [spaCy](https://spacy.io) is to visit [spaCy language support](https://spacy.io/usage/models#languages). Also, please see [spaCy language data](https://spacy.io/usage/adding-languages#language-data) for details regarding other language-specific data.
````
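The README's design note that the pipeline is empty because every token attribute is set once at tokenization time can be sketched with plain Python. `Token` and `make_doc` below are illustrative stand-ins, not part of the spacy-udpipe API:

```python
from dataclasses import dataclass

# Sketch of the "all attributes computed up front" design: the tokenizer
# fills in every attribute, so no later pipeline components are needed.
@dataclass
class Token:
    text: str
    lemma_: str
    pos_: str

def make_doc(words, lemmas, pos):
    # zip the parallel per-token attribute lists into finished tokens
    return [Token(w, l, p) for w, l, p in zip(words, lemmas, pos)]

doc = make_doc(["Wikipedija", "je", "enciklopedija"],
               ["Wikipedija", "biti", "enciklopedija"],
               ["PROPN", "AUX", "NOUN"])
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```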

spacy_udpipe/language.py

Lines changed: 28 additions & 9 deletions
````diff
@@ -6,36 +6,37 @@
 from spacy.language import Language
 from spacy.symbols import DEP, HEAD, LEMMA, POS, TAG
 from spacy.tokens import Doc
-
 from ufal.udpipe import (InputFormat, Model, OutputFormat, ProcessingError,
                          Sentence)
 
 from .util import get_defaults, get_path
 
 
-def load(lang):
+def load(lang, **kwargs):
     """Convenience function for initializing the Language class that
     mimics spacy.load.
 
     lang (unicode): ISO 639-1 language code or shorthand UDPipe model name.
+    kwargs: Optional config parameters.
     RETURNS (spacy.language.Language): The UDPipeLanguage object.
     """
     model = UDPipeModel(lang)
-    nlp = UDPipeLanguage(model)
+    nlp = UDPipeLanguage(model, **kwargs)
     return nlp
 
 
-def load_from_path(lang, path, meta=None):
+def load_from_path(lang, path, meta=None, **kwargs):
     """Convenience function for initializing the Language class and loading
     a custom UDPipe model via the path argument.
 
     lang (unicode): ISO 639-1 language code.
     path (unicode): Path to the UDPipe model.
     meta (dict): Meta-information about the UDPipe model.
+    kwargs: Optional config parameters.
     RETURNS (spacy.language.Language): The UDPipeLanguage object.
     """
     model = UDPipeModel(lang, path, meta)
-    nlp = UDPipeLanguage(model)
+    nlp = UDPipeLanguage(model, **kwargs)
     return nlp
 
 
````
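The two loaders above differ only in how the model is built; every extra keyword argument is forwarded unchanged to the Language class. A stdlib-only sketch of that forwarding pattern, with `StubModel` and `StubLanguage` as hypothetical stand-ins for `UDPipeModel` and `UDPipeLanguage`:

```python
# Sketch of the keyword-forwarding pattern introduced in this commit.
# StubModel and StubLanguage are illustrative stand-ins, not spacy-udpipe code.

class StubModel:
    def __init__(self, lang):
        self.lang = lang

class StubLanguage:
    def __init__(self, model, **kwargs):
        self.model = model
        # optional config parameters arrive untouched via **kwargs
        self.ignore_tag_map = kwargs.get("ignore_tag_map", False)

def load(lang, **kwargs):
    # mirrors load()/load_from_path(): build the model, then pass every
    # extra keyword straight through to the Language constructor
    model = StubModel(lang)
    return StubLanguage(model, **kwargs)

nlp = load("hr", ignore_tag_map=True)
print(nlp.ignore_tag_map)  # True
```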

````diff
@@ -57,6 +58,9 @@ def __init__(self, udpipe_model, meta=None, **kwargs):
         """
         self.udpipe = udpipe_model
         self.Defaults = get_defaults(udpipe_model._lang)
+        ignore_tag_map = kwargs.get("ignore_tag_map", False)
+        if ignore_tag_map:
+            self.Defaults.tag_map = {}  # workaround for ValueError: [E167]
         self.vocab = self.Defaults.create_vocab()
         self.tokenizer = UDPipeTokenizer(self.udpipe, self.vocab)
         self.pipeline = []
````
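The new `ignore_tag_map` flag simply empties the language defaults' tag map before the vocabulary is created. A minimal sketch of the idea, where `StubDefaults` is a hypothetical stand-in for spaCy's per-language defaults class:

```python
# Sketch of the ignore_tag_map workaround: drop the fine-grained XPOS->UPOS
# mapping before anything downstream consults it. StubDefaults is illustrative.

class StubDefaults:
    tag_map = {"Ncmsn": {"pos": "NOUN"}}  # language-specific tag map

def build_defaults(defaults, **kwargs):
    if kwargs.get("ignore_tag_map", False):
        defaults.tag_map = {}  # avoids E167-style tag-map mismatches
    return defaults

defaults = build_defaults(StubDefaults, ignore_tag_map=True)
```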
````diff
@@ -142,10 +146,25 @@ def __call__(self, text):
             else:
                 next_token = tokens[i + 1]
             spaces.append(not span.startswith(next_token.form))
-        attrs = [POS, TAG, DEP, HEAD]
-        array = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
-        doc = Doc(self.vocab, words=words,
-                  spaces=spaces).from_array(attrs, array)
+        try:
+            attrs = [POS, TAG, DEP, HEAD]
+            array = numpy.array(
+                list(zip(pos, tags, deps, heads)), dtype="uint64")
+            doc = Doc(self.vocab,
+                      words=words,
+                      spaces=spaces).from_array(attrs, array)
+        except ValueError as e:
+            if '[E167]' in str(e):
+                raise ValueError(
+                    "Could not properly assign morphology features. "
+                    f"Please update the tag map for '{self.model._lang}'"
+                    " language. See "
+                    "https://spacy.io/usage/adding-languages#tag-map "
+                    "for details. A quick workaround is to use the keyword "
+                    "argument ignore_tag_map=True when loading UDPipeLanguage."
+                )
+            else:
+                raise e
         # Overwrite lemmas separately to prevent overwriting by spaCy
         lemma_array = numpy.array([[lemma]
                                    for lemma in lemmas], dtype="uint64")
````
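The `try`/`except` added above translates spaCy's opaque `[E167]` `ValueError` into an actionable message. The same pattern in isolation, with `assign_attrs` as a hypothetical stand-in for `Doc.from_array` (the error text is illustrative):

```python
# Sketch of the error-translation pattern: catch the ValueError, detect the
# E167 error code in its message, and re-raise with actionable guidance.

def assign_attrs(tag_known):
    # stand-in for Doc.from_array; spaCy raises E167 when a tag is
    # missing from the language's tag map
    if not tag_known:
        raise ValueError("[E167] Unknown morphological feature")
    return "doc"

def safe_assign(tag_known, lang="hr"):
    try:
        return assign_attrs(tag_known)
    except ValueError as e:
        if "[E167]" in str(e):
            raise ValueError(
                f"Please update the tag map for '{lang}', or load the "
                "model with ignore_tag_map=True."
            ) from e
        raise  # unrelated ValueErrors propagate unchanged
```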
