We have been developing two prototype systems for information extraction (IE) of knowledge about brain architecture (including brain structure, genetic makeup, and disease) from a large text corpus. To date, our corpus contains approximately 55,000 full-text journal articles.

The two systems operate on the same basic principles, and each currently relies on the Textpresso engine to annotate the full-text corpus with a set of semantic categories. Both systems let the user search the full text with queries that combine keyword and category criteria, and both return individually annotated sentences from the corpus, grouped by the articles from which they are drawn. Future work will include better filtering of the returned sentences to reduce false positives for particular use cases, e.g. searching for connections into or out of a particular brain region.
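
The query model described above can be illustrated with a small sketch. It is not the Textpresso or Lucene implementation used by either prototype; the `Sentence` record, the `search` function, and the category names (`brain_region`, `connection`) are hypothetical and chosen only for illustration. The sketch assumes that a sentence matches when it contains every requested keyword and carries every requested category, and it groups the matches by source article.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sentence:
    """One annotated sentence (hypothetical record, not a Textpresso type)."""
    article_id: str
    text: str
    categories: frozenset = field(default_factory=frozenset)


def search(sentences, keywords=(), categories=()):
    """Return matching sentences grouped by article.

    A sentence matches when its text contains every keyword
    (case-insensitive substring match) and its annotations include
    every requested category.
    """
    wanted_keywords = [k.lower() for k in keywords]
    wanted_categories = set(categories)
    grouped = defaultdict(list)
    for s in sentences:
        lowered = s.text.lower()
        has_keywords = all(k in lowered for k in wanted_keywords)
        has_categories = wanted_categories <= s.categories
        if has_keywords and has_categories:
            grouped[s.article_id].append(s)
    return dict(grouped)


# Toy corpus with made-up sentences and category labels.
corpus = [
    Sentence("article-1", "The amygdala projects to the hypothalamus.",
             frozenset({"brain_region", "connection"})),
    Sentence("article-1", "Animals were housed in standard cages.",
             frozenset()),
    Sentence("article-2", "Lesions of the amygdala impair fear conditioning.",
             frozenset({"brain_region"})),
]

hits = search(corpus, keywords=["amygdala"], categories=["connection"])
for article_id, matched in hits.items():
    print(article_id, [s.text for s in matched])
```

In this toy corpus, adding the `connection` category to the keyword `amygdala` excludes the sentence from `article-2`, which mentions the region but is not annotated as describing a connection. This is the kind of narrowing that a combined keyword and category query is meant to provide.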

The prototype systems differ in the details of their implementation, particularly in how the documents and sentences are indexed, and in their user interfaces. We are in the process of evaluating their relative performance. Both systems operate on the same full-text corpus.
Click here for the Textpresso-based system, which has been engineered to operate in parallel across several compute nodes.
Click here for the Lucene-based text mining prototype system.