I am far from writing the best code possible, but last week I spent some time on a half-decent Reader for my training data sets. I write my code in Java, since it’s the programming language I’m most fluent in. But first, a few lines about the pipeline.

I was encouraged to use the pipeline design from uimaFIT (org.apache.uima.fit.pipeline), which belongs to the well-known Apache UIMA project. My current setup looks like this:

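[code language="java"]
public class BasicPipeline {

    public static void main(String[] args) throws Exception {
        // Pipeline setup: the reader, followed by BreakIteratorSegmenter,
        // BaseLine and Evaluator, is handed to a SimplePipeline here.
    }
}
[/code]
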
A SimplePipeline does exactly what it sounds like: the components run one after another, and the data modified by one station is available to the stations that follow. In this setup the reader reads “full-twitter-train-c-A.tsv”, a training data set containing 9451 tweets. How the file is read depends on the Reader; normally it’s line by line, and that’s how I will do it, too. I will explain the arguments used in a later post.

It is important to understand that every document (here a phrase from a tweet or a full tweet, depending on the data set) is processed through the whole pipeline, and the next document follows the same way until no more documents are available.
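To make the processing order concrete, here is a minimal plain-Java sketch. It does not use the uimaFIT API; the class name `PipelineOrderSketch`, the `run` method and the two string stages are made up for illustration. It only shows that each document passes through all stages before the next document is touched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Plain-Java stand-in for a document-at-a-time pipeline (hypothetical, not the uimaFIT API). */
final class PipelineOrderSketch {

    /** Runs every stage on one document before moving on to the next document. */
    static List<String> run(List<String> documents,
                            List<UnaryOperator<String>> stages,
                            List<String> log) {
        List<String> results = new ArrayList<>();
        for (String document : documents) {
            String current = document;
            for (int i = 0; i < stages.size(); i++) {
                // Each stage sees the output of the stage before it.
                current = stages.get(i).apply(current);
                log.add("stage " + (i + 1) + " <- " + document);
            }
            results.add(current);
        }
        return results;
    }

    public static void main(String[] args) {
        // Two toy stages standing in for real analysis components.
        List<UnaryOperator<String>> stages = new ArrayList<>();
        stages.add(String::trim);
        stages.add(s -> s.toLowerCase(Locale.ROOT));

        List<String> log = new ArrayList<>();
        List<String> out = run(Arrays.asList("Trending Worldwide", "Positive Reviews"), stages, log);

        log.forEach(System.out::println);
        System.out.println(out);
    }
}
```

The log lists both stages for the first document before the second document appears, which is the order described above.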

The three AnalysisEngines that follow the reader are just for testing right now. The BreakIteratorSegmenter simply splits the reader output into tokens (“words”), as far as I understand.
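Judging by its name, the segmenter builds on Java’s own `java.text.BreakIterator`; that is my assumption, not something I have checked in its source. The sketch below is not the segmenter itself, just a rough illustration of word splitting with the JDK class. `TokenizerSketch` and `tokenize` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Splits text into word tokens with the JDK BreakIterator (illustrative only). */
final class TokenizerSketch {

    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // The iterator also reports the whitespace between words; skip those pieces.
            if (!candidate.trim().isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("trending worldwide");
        System.out.println("CAS contains " + tokens.size() + " tokens.");
    }
}
```

For a two-word phrase like “trending worldwide” this yields two tokens.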

The BaseLine is the heart of the program, still in VERY early development. It is supposed to make a hopefully correct guess about the document currently being processed. For my chosen task it should say whether a given phrase or tweet is positive, negative or neutral in sentiment. Right now… well, it just tags every given input as positive.
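Stripped of all UIMA plumbing, the current behaviour boils down to something like the following sketch. The names `AlwaysPositiveBaseline`, `Sentiment` and `classify` are invented for this example and are not taken from my actual BaseLine class.

```java
/** A deliberately naive baseline: every input is tagged as positive (illustrative names). */
final class AlwaysPositiveBaseline {

    /** The three classes the task distinguishes. */
    enum Sentiment { POSITIVE, NEGATIVE, NEUTRAL }

    /** Ignores the text entirely for now; a real classifier would inspect it. */
    static Sentiment classify(String text) {
        return Sentiment.POSITIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify("trending worldwide"));
    }
}
```

Anything smarter later on only has to beat this constant guess to be worth keeping.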

Finally, the Evaluator compares the guesses made by the BaseLine with the so-called gold standard that came with the raw data. I’ll come to this in a second. As with the BaseLine, there is a lot left to do here later in the project.

The current setup produces the following output. Of course there is a lot more of it; these are the last two phrases and the final evaluation.

Twitter phrase is: trending worldwide
CAS contains 2 tokens.
positive detected as positive
Twitter phrase is: positive reviews
CAS contains 2 tokens.
positive detected as positive

4780 out of 9451 are correct.
1796 of the tweets were unusable.
4780 out of 7655 available tweets are correct.

As you can see, right now every phrase gets its String, token count and guessed/correct sentiment printed to the console. I hope this shows the big advantage of pipelines: each station adds its part, and the later ones can build on it. Finally, once all data has been processed, the Evaluator’s collectionProcessComplete() method is called and prints the final result (the last three lines).
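The bookkeeping behind those three summary lines can be sketched in plain Java as follows. This is not my Evaluator class: `EvaluationSketch`, `process` and `summary` are hypothetical names, and marking an unusable tweet with a `null` gold label is an assumption made only for this example. `summary()` plays the role that collectionProcessComplete() has in the real component.

```java
/** Counts correct guesses and unusable tweets, then reports once at the end (illustrative only). */
final class EvaluationSketch {

    private int total;
    private int unusable;
    private int correct;

    /** Called once per document. Assumption: a null gold label marks an unusable tweet. */
    void process(String gold, String guess) {
        total++;
        if (gold == null) {
            unusable++;
            return;
        }
        if (gold.equals(guess)) {
            correct++;
        }
    }

    /** Stand-in for collectionProcessComplete(): summarise after the last document. */
    String summary() {
        int available = total - unusable;
        return correct + " out of " + total + " are correct.\n"
                + unusable + " of the tweets were unusable.\n"
                + correct + " out of " + available + " available tweets are correct.";
    }

    public static void main(String[] args) {
        EvaluationSketch evaluator = new EvaluationSketch();
        evaluator.process("positive", "positive");
        evaluator.process("negative", "positive");
        evaluator.process(null, "positive");
        System.out.println(evaluator.summary());
    }
}
```

With the real numbers from my run, 4780 correct out of 7655 available tweets is roughly 62%, and 4780 out of all 9451 is roughly 51%.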

As mentioned above and promised last week, I will write about the Reader in the coming days. If you have any questions or suggestions, feel free to post a comment below or send me an email.

A Data Science consultant working at Materna. He occasionally blogs about data and related topics here and is the host of the Dortmund Data Science Meetup.