Saving and Loading
If you’ve been modifying the pipeline, vocabulary, vectors and entities, or made updates to the component models, you’ll eventually want to save your progress – for example, everything that’s in your nlp object. This means you’ll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.
All container classes, i.e. Language (nlp), Doc, Vocab and StringStore, have the following methods available:
Method | Returns | Example |
---|---|---|
to_bytes | bytes | data = nlp.to_bytes() |
from_bytes | object | nlp.from_bytes(data) |
to_disk | - | nlp.to_disk("/path") |
from_disk | object | nlp.from_disk("/path") |
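For example, a minimal sketch of round-tripping an nlp object through bytes and through a directory (the path is just a placeholder):

```python
import spacy

nlp = spacy.blank("en")            # any Language object works the same way
data = nlp.to_bytes()              # serialize the pipeline to a byte string
nlp.from_bytes(data)               # restore the state in place

nlp.to_disk("/tmp/example_nlp")    # write out to a directory
nlp.from_disk("/tmp/example_nlp")  # read it back in
```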
Serializing the pipeline
When serializing the pipeline, keep in mind that this will only save out the binary data for the individual components to allow spaCy to restore them – not the entire objects. This is a good thing, because it makes serialization safe. But it also means that you have to take care of storing the config, which contains the pipeline configuration and all the relevant settings.
Serialize
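A minimal sketch: hold on to the config alongside the binary data, since both are needed to restore the pipeline later.

```python
config = nlp.config          # the pipeline's config (language, components, settings)
bytes_data = nlp.to_bytes()  # binary data for the pipeline components
```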
Deserialize
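And the corresponding deserialization, reusing the config and bytes_data from above:

```python
import spacy

lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp = lang_cls.from_config(config)  # recreate the Language object and its pipeline
nlp.from_bytes(bytes_data)          # then load in the components' binary data
```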
This is also how spaCy does it under the hood when loading a pipeline: it loads the config.cfg containing the language and pipeline information, initializes the language class, creates and adds the pipeline components based on the config and then loads in the binary data. You can read more about this process here.
Serializing Doc objects efficiently
If you’re working with lots of data, you’ll probably need to pass analyses between machines, either to use something like Dask or Spark, or even just to save out work to disk. Often it’s sufficient to use the Doc.to_array functionality for this, and just serialize the numpy arrays – but other times you want a more general way to save and restore Doc objects.
The DocBin class makes it easy to serialize and deserialize a collection of Doc objects together, and is much more efficient than calling Doc.to_bytes on each individual Doc object. You can also control what data gets saved, and you can merge multiple DocBin objects together for easy map/reduce-style processing.
If store_user_data is set to True, the Doc.user_data will be serialized as well, which includes the values of extension attributes (if they’re serializable with msgpack).
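A minimal sketch of packing processed Doc objects into a DocBin and unpacking them again (the blank pipeline and example texts are placeholders; a trained pipeline works the same way):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["Some text", "Lots more text", "Even more text"]

doc_bin = DocBin(store_user_data=True)  # also keep Doc.user_data
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()  # pass this between machines or save it to disk

# Later, possibly in another process: restore the Doc objects
new_doc_bin = DocBin().from_bytes(bytes_data)
docs = list(new_doc_bin.get_docs(nlp.vocab))
```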
Using Pickle
When pickling spaCy’s objects like the Doc or the EntityRecognizer, keep in mind that they all require the shared Vocab (which includes the string to hash mappings, label schemes and optional vectors). This means that their pickled representations can become very large, especially if you have word vectors loaded, because it won’t only include the object itself, but also the entire shared vocab it depends on.
If you need to pickle multiple objects, try to pickle them together instead of separately. For instance, instead of pickling all pipeline components, pickle the entire pipeline once. And instead of pickling several Doc objects separately, pickle a list of Doc objects. Since they all share a reference to the same Vocab object, it will only be included once.
Pickling objects with shared data
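A rough sketch of the difference (the exact byte counts depend on the pipeline, vocab and vectors you have loaded):

```python
import pickle
import spacy

nlp = spacy.blank("en")
doc1 = nlp("Hello world")
doc2 = nlp("This is a test")

# Pickled separately, each Doc drags in its own copy of the shared Vocab
separate = len(pickle.dumps(doc1)) + len(pickle.dumps(doc2))

# Pickled together, the shared Vocab is only included once
together = len(pickle.dumps([doc1, doc2]))
```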
Implementing serialization methods
When you call nlp.to_disk, nlp.from_disk or load a pipeline package, spaCy will iterate over the components in the pipeline, check if they expose a to_disk or from_disk method and if so, call it with the path to the pipeline directory plus the string name of the component. For example, if you’re calling nlp.to_disk("/path"), the data for the named entity recognizer will be saved in /path/ner.
If you’re using custom pipeline components that depend on external data – for example, model weights or terminology lists – you can take advantage of spaCy’s built-in component serialization by making your custom component expose its own to_disk and from_disk or to_bytes and from_bytes methods. When an nlp object with the component in its pipeline is saved or loaded, the component will then be able to serialize and deserialize itself.
The following example shows a custom component that keeps arbitrary JSON-serializable data, allows the user to add to that data and saves and loads the data to and from a JSON file.
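A sketch of such a component, registered here under the hypothetical name "my_component":

```python
import json
from spacy.language import Language
from spacy.util import ensure_path

@Language.factory("my_component")
class CustomComponent:
    def __init__(self, nlp, name):
        self.name = name
        self.data = []

    def __call__(self, doc):
        # This component doesn't modify the Doc, it just carries data
        return doc

    def add(self, data):
        # Let the user add arbitrary JSON-serializable data
        self.data.extend(data)

    def to_disk(self, path, exclude=tuple()):
        # Called by nlp.to_disk with path = pipeline directory / component name
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        with (path / "data.json").open("w", encoding="utf8") as f:
            f.write(json.dumps(self.data))

    def from_disk(self, path, exclude=tuple()):
        # Called by nlp.from_disk with the same path the data was saved to
        with (path / "data.json").open("r", encoding="utf8") as f:
            self.data = json.loads(f.read())
        return self
```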
After adding the component to the pipeline and adding some data to it, we can serialize the nlp object to a directory, which will call the custom component’s to_disk method.
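For example (the output path is a placeholder):

```python
import spacy

nlp = spacy.blank("en")
my_component = nlp.add_pipe("my_component")
my_component.add(["hello world", "some more data"])
nlp.to_disk("/path/to/pipeline")  # calls CustomComponent.to_disk under the hood
```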
The contents of the directory would then look like this. CustomComponent.to_disk converted the data to a JSON string and saved it to a file data.json in its subdirectory:
Directory structure
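Roughly (the directory names depend on your component and pipeline):

```
/path/to/pipeline
├── my_component       # data serialized by CustomComponent.to_disk
│   └── data.json
├── config.cfg
└── ...                # data for the other components
```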
When you load the data back in, spaCy will call the custom component’s from_disk method with the given file path, and the component can then load the contents of data.json, convert them to a Python object and restore the component state. The same works for other types of data, of course – for instance, you could add a wrapper for a model trained with a different library like TensorFlow or PyTorch and make spaCy load its weights automatically when you load the pipeline package.
Using entry points
Entry points let you expose parts of a Python package you write to other Python packages. This lets one application easily customize the behavior of another, by exposing an entry point in its setup.py. For a quick and fun intro to entry points in Python, check out this excellent blog post.
spaCy can load custom functions from several different entry points to add pipeline component factories, language classes and other settings. To make spaCy use your entry points, your package needs to expose them and it needs to be installed in the same environment – that’s it.
Entry point | Description |
---|---|
spacy_factories | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
spacy_languages | Group of entry points for custom Language subclasses, keyed by language shortcut. |
spacy_lookups | Group of entry points for custom Lookups, including lemmatizer data. Used by spaCy’s spacy-lookups-data package. |
spacy_displacy_colors | Group of entry points of custom label colors for the displaCy visualizer. The key name doesn’t matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
Loading probability tables into existing models
You can load a probability table from spacy-lookups-data into an existing spaCy model like en_core_web_sm.
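A sketch, assuming the spacy-lookups-data package and the en_core_web_sm pipeline are installed:

```python
# pip install spacy-lookups-data
import spacy
from spacy.lookups import load_lookups

nlp = spacy.load("en_core_web_sm")
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))
```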
When training a model from scratch you can also specify probability tables in the config.cfg.
config.cfg (excerpt)
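For example, something along these lines, assuming spaCy’s built-in lookups data loader (the exact section and tables depend on your config):

```ini
[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_prob"]
```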
Custom components via entry points
When you load a pipeline, spaCy will generally use its config.cfg to set up the language class and construct the pipeline. The pipeline is specified as a list of strings, e.g. pipeline = ["tagger", "parser", "ner"]. For each of those strings, spaCy will call nlp.add_pipe and look up the name in all factories defined by the decorators @Language.component and @Language.factory. This means that you have to import your custom components before loading the pipeline.
Using entry points, pipeline packages and extension packages can define their own "spacy_factories", which will be loaded automatically in the background when the Language class is initialized. So if a user has your package installed, they’ll be able to use your components – even if they don’t import them!
To stick with the theme of this entry points blog post, consider the following custom spaCy pipeline component that prints a snake when it’s called:
snek.py
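A sketch of what snek.py might look like (the ASCII snake is a placeholder):

```python
from spacy.language import Language

SNEK = """
    ~~ sss ~~
   (  o o  )
~~~~ snek ~~~~
"""

@Language.component("snek")
def snek_component(doc):
    # Print the snake and return the Doc unchanged
    print(SNEK)
    return doc
```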
Since it’s a very complex and sophisticated module, you want to split it off into its own package so you can version it and upload it to PyPI. You also want your custom package to be able to define pipeline = ["snek"] in its config.cfg. For that, you need to be able to tell spaCy where to find the component "snek". If you don’t do this, spaCy will raise an error when you try to load the pipeline because there’s no built-in "snek" component. To add an entry to the factories, you can now expose it in your setup.py via the entry_points dictionary:
setup.py
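A minimal sketch, assuming the component above lives in a module called snek:

```python
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:snek_component"],
    },
)
```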
The same package can expose multiple entry points, by the way. To make them available to spaCy, all you need to do is install the package in your environment:
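For example, from the package directory:

```bash
pip install .    # or: pip install -e . while developing
```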
spaCy is now able to create the pipeline component "snek" – even though you never imported snek_component. When you save the nlp.config to disk, it includes an entry for your "snek" component and any pipeline you train with this config will include the component and know how to load it – if your snek package is installed.
Instead of making your snek component a simple stateless component, you could also make it a factory that takes settings. Your users can then pass in an optional config when they add your component to the pipeline and customize its appearance – for example, the snek_style.
setup.py
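Assuming the factory is implemented as a class called SnekFactory registered with @Language.factory("snek") (see the sketch further down), the entry point now points at the factory instead of the plain function:

```python
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:SnekFactory"],
    },
)
```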
The factory can also implement other pipeline component methods like to_disk and from_disk for serialization, or even update to make the component trainable. If a component exposes a from_disk method and is included in a pipeline, spaCy will call it on load. This lets you ship custom data with your pipeline package. When you save out a pipeline using nlp.to_disk and the component exposes a to_disk method, it will be called with the disk path.
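A sketch of such a factory, keeping the snek_style setting and adding serialization methods (the SNEKS dict mapping style names to ASCII snakes is a placeholder):

```python
from spacy.language import Language
from spacy.util import ensure_path

SNEKS = {"basic": "~~ snek ~~", "cute": "~~ <3 snek <3 ~~"}  # placeholder styles

@Language.factory("snek", default_config={"snek_style": "basic"})
class SnekFactory:
    def __init__(self, nlp: Language, name: str, snek_style: str):
        self.nlp = nlp
        self.snek_style = snek_style
        self.snek = SNEKS[self.snek_style]

    def __call__(self, doc):
        print(self.snek)
        return doc

    def to_disk(self, path, exclude=tuple()):
        # Called with the component's subdirectory when you run nlp.to_disk
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        with (path / "snek.txt").open("w", encoding="utf8") as f:
            f.write(self.snek)

    def from_disk(self, path, exclude=tuple()):
        # Called on load with the same path, so the data ships with the package
        with (path / "snek.txt").open("r", encoding="utf8") as f:
            self.snek = f.read()
        return self
```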
The above example will serialize the current snake in a snek.txt in the data directory. When a pipeline using the snek component is loaded, it will open the snek.txt and make it available to the component.
Custom language classes via entry points
To stay with the theme of the previous example and this blog post on entry points, let’s imagine you wanted to implement your own SnekLanguage class for your custom pipeline – but you don’t necessarily want to modify spaCy’s code to add a language. In your package, you could then implement the following custom language subclass:
snek.py
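A minimal sketch of such a subclass (the "snk" code and the defaults are made up for the example):

```python
from spacy.language import Language

class SnekDefaults(Language.Defaults):
    stop_words = {"sss", "hiss"}  # placeholder language data

class SnekLanguage(Language):
    lang = "snk"
    Defaults = SnekDefaults
```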
Alongside the spacy_factories, there’s also an entry point option for spacy_languages, which maps language codes to language-specific Language subclasses:
setup.py
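For example:

```python
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:SnekFactory"],
        "spacy_languages": ["snk = snek:SnekLanguage"],
    },
)
```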
In spaCy, you can then load the custom snk language and it will be resolved to SnekLanguage via the custom entry point. This is especially relevant for pipeline packages you train, which could then specify lang = snk in their config.cfg without spaCy raising an error because the language is not available in the core library.
Custom displaCy colors via entry points
If you’re training a named entity recognition model for a custom domain, you may end up training different labels that don’t have pre-defined colors in the displacy visualizer. The spacy_displacy_colors entry point lets you define a dictionary of entity labels mapped to their color values. It’s added to the pre-defined colors and can also overwrite existing values.
snek.py
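For example, a dict of label names to hex colors (SNEK is the label used below; HUMAN is just a second example label):

```python
displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"}
```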
Given the above colors, the entry point can be defined as follows. Entry points need to have a name, so we use the key colors. However, the name doesn’t matter and whatever is defined in the entry point group will be used.
setup.py
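For example:

```python
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_displacy_colors": ["colors = snek:displacy_colors"],
    },
)
```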
After installing the package, the custom colors will be used when visualizing text with displacy. Whenever the label SNEK is assigned, it will be displayed in #3dff74.
Saving, loading and distributing trained pipelines
After training your pipeline, you’ll usually want to save its state, and load it back later. You can do this with the Language.to_disk method:
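For example (the directory name is a placeholder):

```python
nlp.to_disk("./en_example_pipeline")
```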
The directory will be created if it doesn’t exist, and the whole pipeline data, meta and configuration will be written out. To make the pipeline more convenient to deploy, we recommend wrapping it as a Python package.
When you save a pipeline in spaCy v3.0+, two files will be exported: a config.cfg based on nlp.config and a meta.json based on nlp.meta.
- config: Configuration used to create the current nlp object, its pipeline components and models, as well as training settings and hyperparameters. Can include references to registered functions like pipeline components or model architectures. Given a config, spaCy is able to reconstruct the whole tree of objects and the nlp object. An exported config can also be used to train a pipeline with the same settings.
- meta: Meta information about the pipeline and the Python package, such as the author information, license, version, data sources and label scheme. This is mostly used for documentation purposes and for packaging pipelines. It has no impact on the functionality of the nlp object.
Generating a pipeline package
spaCy comes with a handy CLI command that will create all required files, and walk you through generating the meta data. You can also create the meta.json manually and place it in the data directory, or supply a path to it using the --meta flag. For more info on this, see the package docs.
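A sketch of the command (input and output directories are placeholders):

```bash
python -m spacy package /path/to/pipeline /path/to/output
```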
This command will create a pipeline package directory and will run python -m build in that directory to create a binary .whl file or .tar.gz archive of your package that can be installed using pip install. Installing the binary wheel is usually more efficient.
Directory structure
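Roughly, the output will look something like this (the exact names depend on your pipeline’s language, name and version):

```
/path/to/output
└── en_pipeline-1.0.0
    ├── dist                   # .whl / .tar.gz archives created by python -m build
    ├── en_pipeline            # Python package directory
    │   ├── __init__.py
    │   └── en_pipeline-1.0.0  # pipeline data (config.cfg, meta.json, components)
    ├── meta.json
    └── setup.py
```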
You can also find templates for all files in the cli/package.py source. If you’re creating the package manually, keep in mind that the directories need to be named according to the naming conventions of lang_name and lang_name-version.
Including custom functions and components
If your pipeline includes custom components, model architectures or other code, those functions need to be registered before your pipeline is loaded. Otherwise, spaCy won’t know how to create the objects referenced in the config. If you’re loading your own pipeline in Python, you can make custom components available just by importing the code that defines them before calling spacy.load. This is also how the --code argument to CLI commands works.
With the spacy package command, you can provide one or more paths to Python files containing custom registered functions using the --code argument.
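For example (file and directory names are placeholders):

```bash
python -m spacy package /path/to/pipeline /path/to/output --code functions.py
```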
The Python files will be copied over into the root of the package, and the package’s __init__.py will import them as modules. This ensures that functions are registered when the pipeline is imported, e.g. when you call spacy.load. A simple import is all that’s needed to make registered functions available.
Make sure to include all Python files that are referenced in your custom code, including modules imported by others. If your custom code depends on external packages, make sure they’re listed in the list of "requirements" in your meta.json. For the majority of use cases, registered functions should provide you with all customizations you need, from custom components to custom model architectures and lifecycle hooks. However, if you do want to customize the setup in more detail, you can edit the package’s __init__.py and the package’s load function that’s called by spacy.load.
Loading a custom pipeline package
To load a pipeline from a data directory, you can use spacy.load() with the local path. This will look for a config.cfg in the directory and use the lang and pipeline settings to initialize a Language class with a processing pipeline and load in the model data.
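For example (the path is a placeholder):

```python
import spacy

nlp = spacy.load("/path/to/pipeline")
```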
If you want to load only the binary data, you’ll have to create a Language class and call from_disk instead.
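A sketch, assuming an English pipeline whose data directory contains the binary component data:

```python
from spacy.lang.en import English

nlp = English().from_disk("/path/to/data")
```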