The ART project produced SAPIENT, a tool for the sentence-based semantic annotation of papers. SAPIENT incorporates SSSplit, an XML-aware sentence splitter, which was also created within the ART project.


SAPIENT stands for "Semantic Annotation of Papers: Interface & ENrichment Tool". It is an annotation interface, implemented as a web application, that helps users annotate scientific papers in XML, sentence by sentence, with a set of concepts called General Scientific Concepts (GSCs). GSCs constitute the set of concepts essential for describing a scientific investigation. However, SAPIENT can also be used with other annotation schemas to annotate papers in XML sentence by sentence. SAPIENT also incorporates Oscar3 functionality, allowing the automatic annotation of chemical named entities.

You can read the FAQ

SAPIENT Software Files


SAPIENT Sentence Splitter (SSSplit) is an XML-aware sentence splitter which preserves XML markup and identifies sentences through the addition of in-line markup. We developed our own sentence splitter because widely available sentence splitters could not handle XML properly. The XML markup contains useful information about the document structure and formatting in the form of inline tags, and this information is important for determining the logical structure of the paper.

SSSplit is written in platform-independent Java (version 1.6), based on and extending open-source Perl code for handling plain text. To make our sentence splitter XML-aware, we translated the Perl regular expression rules into Java and modified them to make them compatible with the SciXML schema.
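To give a rough idea of what regular-expression-based, XML-aware splitting involves, here is a minimal sketch. It is not the SSSplit code (which is written in Java and uses a much larger rule set); it is a simplified Python illustration. The function names `split_sentences` and `mark_sentences`, the `<s sid="...">` wrapper element and the single boundary rule are all assumptions made for this example. The sketch treats every tag as an opaque unit, so punctuation inside attribute values is never taken for a sentence end, and it only splits outside inline elements so that the markup stays well formed.

```python
import re

# Matches either a whole XML tag (treated as an opaque unit) or a candidate
# sentence boundary: terminal punctuation, whitespace, and then (lookahead)
# an uppercase letter or the start of an opening tag.
# Assumption: attribute values never contain a literal ">".
TOKEN = re.compile(
    r"(?P<tag><[^>]+>)|(?P<end>[.!?])(?P<gap>\s+)(?=[A-Z]|<[A-Za-z])"
)


def split_sentences(paragraph_xml: str) -> list:
    """Split the content of one paragraph into sentences, keeping inline tags."""
    sentences = []
    start = 0   # index where the current sentence begins
    depth = 0   # nesting depth of open inline elements
    for m in TOKEN.finditer(paragraph_xml):
        tag = m.group("tag")
        if tag:
            if tag.startswith("</"):
                depth -= 1                      # closing tag
            elif not tag.endswith("/>") and tag[1] not in "!?":
                depth += 1                      # opening tag
            continue                            # never split inside markup
        if depth == 0:
            # Splitting inside an open element would break well-formedness,
            # so boundaries are only accepted at depth zero.
            sentences.append(paragraph_xml[start:m.end("end")])
            start = m.end("gap")
    tail = paragraph_xml[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def mark_sentences(paragraph_xml: str) -> str:
    """Wrap each sentence in a hypothetical <s sid="n"> element."""
    parts = split_sentences(paragraph_xml)
    return " ".join(
        '<s sid="{}">{}</s>'.format(i, s) for i, s in enumerate(parts, 1)
    )


if __name__ == "__main__":
    example = (
        'We grew <it>E. coli</it> at 37 C. <it>Results</it> are shown in '
        '<xref ref="Fig. S1"/>. They agree.'
    )
    print(mark_sentences(example))
```

On the example paragraph the sketch yields three sentences: the abbreviation in "E. coli" is not split because a lowercase letter follows, and the full stop inside the `ref` attribute is ignored because it sits inside a tag. A plain-text splitter applied to the same string would see the attribute value as ordinary text.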

SSSplit Software Files

For more details about SAPIENT and SSSplit, you can also refer to our BioNLP 2009 paper. Please cite this paper if you find SAPIENT or SSSplit useful: Liakata M., Q Claire and Soldatova L. N. (2009) Semantic Annotation of Papers: Interface and Enrichment Tool (SAPIENT). Proceedings of BioNLP 2009, Boulder, Colorado, pp 193--200.