Questions & Answers

Where can I learn more about the linguistic model?

For questions regarding the dependency relations, you might want to consult the Stanford Dependencies website and especially the Stanford Dependencies manual (PDF).
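
If you just want a quick impression: Stanford Dependencies describe each sentence as a set of binary relations between its words. For the made-up example sentence "She likes dogs.", the output in the manual's notation looks like this:

    nsubj(likes-2, She-1)
    root(ROOT-0, likes-2)
    dobj(likes-2, dogs-3)

Here nsubj marks "She" as the nominal subject of "likes", dobj marks "dogs" as its direct object, and the numbers give the word positions in the sentence.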

Is there a size limit on the corpora?

Technically, there are limits imposed by the software we use, but the system has been tested with very large corpora (over 1 billion words). However, to conserve space, we currently impose a default limit of 100 MB (approx. 20 million words) per corpus. If you would like to use larger corpora, please contact us and we will arrange something.
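
As a rough rule of thumb, the default limit works out to about 5 bytes per word of plain text. Below is a minimal sketch in Java for checking whether a corpus file fits; the exact bytes-per-word ratio varies with language and encoding, so treat the estimate as approximate:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CorpusSizeCheck {
        // Default limit: 100 MB, approx. 20 million words at ~5 bytes per word.
        static final long LIMIT_BYTES = 100L * 1024 * 1024;

        public static void main(String[] args) throws Exception {
            long bytes = Files.size(Paths.get(args[0])); // size of the corpus file
            long approxWords = bytes / 5;                // rough bytes-per-word estimate
            System.out.printf("%,d bytes (~%,d words): %s the default limit%n",
                    bytes, approxWords, bytes <= LIMIT_BYTES ? "within" : "over");
        }
    }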

Can I share my parsed corpora with colleagues and/or students?

Such a feature is not implemented yet, but it may be added in the future if there is demand for it. Let us know if you are interested. For now, you will have to share your password, but keep in mind that everyone who knows your password can also delete your corpus.

I found an error in the analysis!

Unfortunately, this is to be expected. The parsers used here are stochastic: they assign to each sentence the structure they estimate to be the most likely one, and that estimate can be horribly wrong. There is nothing we can do about individual errors. (See also the explanation in the Stanford Parser FAQ.)

Is there a way to speed up processing?

For the Java-based parsers, you will soon be able to download a Java package that you can run on idle computers (such as Computer Lab machines outside of term time). It will automatically connect to our server and parse sentences. However, unless you have a substantial number of machines, the speed-up may not be worth the hassle, since we already use the Erlangen High Performance Computing machines. We may also make faster dependency parsers available in the future, so stay tuned.
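
The package is not available yet, but the general idea is a simple fetch-parse-submit loop. Here is a hypothetical sketch of such a client; the server URL, the endpoints, and the plain-text protocol are placeholders, not our actual interface:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ParseWorker {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            while (true) {
                // Ask the server for the next batch of unparsed sentences.
                HttpRequest fetch = HttpRequest.newBuilder(
                        URI.create("https://example.org/queue/next")).build();
                String batch = client.send(fetch,
                        HttpResponse.BodyHandlers.ofString()).body();
                if (batch.isEmpty()) break;   // queue drained, we are done

                String parsed = parse(batch); // run the locally bundled parser

                // Send the finished parses back to the server.
                HttpRequest submit = HttpRequest.newBuilder(
                        URI.create("https://example.org/queue/results"))
                        .POST(HttpRequest.BodyPublishers.ofString(parsed))
                        .build();
                client.send(submit, HttpResponse.BodyHandlers.discarding());
            }
        }

        static String parse(String sentences) {
            // Placeholder: the real package would invoke the bundled parser here.
            return sentences;
        }
    }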

What software do you use in your project?

We use a whole zoo of programs to process the corpora:

  • A Perl script for sentence splitting
  • MongoDB as the main storage of all corpora and metadata
  • HornetQ as the messaging middleware for the processing pipeline
  • The Open Corpus Workbench's Corpus Query Processor (CQP) to store the graph data and to run corpus queries (see the example query after this list)
  • SQLite to store query results for pagination and sorting
  • A range of taggers (HunPOS, SVMTool, Stanford Tagger, ...) and parsers (Stanford Parser, Berkeley Parser, Stanford CoreNLP, ...) for experiments of all sorts
  • A rule-based lemmatiser that closely follows the Lancaster LEMMINGS software
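
As an illustration of the querying step: CQP matches token sequences using regular expressions over the corpus annotations. Assuming a corpus encoded with the conventional word and pos attributes (an assumption, not a guarantee for every corpus), the following query finds an adjective followed by a common noun:

    [pos="JJ"] [pos="NNS?"];

The question mark makes the final S optional, so the pattern matches singular (NN) as well as plural (NNS) nouns.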