Parsing data files is always a little difficult, since you can never be sure that your data is formatted properly. As I mentioned in earlier posts, I am currently creating a Reader for my training data. Here is how I am doing it.

If your input data is formatted perfectly, good for you! But as soon as somebody else (either a person or someone's algorithm) creates these data files… well, good luck, pray, hope for the best. Such problems can be avoided if you analyze your data and process it carefully.

Maybe I am exaggerating a little, but my Reader avoids some potential errors. Let me show you my code:

[code language=”java”]
public class Reader extends JCasCollectionReader_ImplBase {

    // some parameters (you might have spotted them in the pipeline post)
    public static final String PARAM_INPUT_FILE = "InputFile";
    @ConfigurationParameter(name = PARAM_INPUT_FILE, mandatory = true)
    private File inputFile;

    public static final String PARAM_TASK_TYPE = "taskType";
    @ConfigurationParameter(name = PARAM_TASK_TYPE, mandatory = true)
    private String taskType;

    // objects used across the full pipeline process
    private List<String> lines;
    private int currentLine;

    @Override
    public void initialize(UimaContext context)
            throws ResourceInitializationException {
        super.initialize(context);

        try {
            lines = FileUtils.readLines(inputFile);
            currentLine = 0;
        }
        catch (IOException e) {
            throw new ResourceInitializationException(e);
        }
    }

    @Override
    public Progress[] getProgress() {
        return new Progress[] {
            new ProgressImpl(currentLine, lines.size(), "lines")
        };
    }

    @Override
    public boolean hasNext() throws IOException, CollectionException {
        return currentLine < lines.size();
    }

    @Override
    public void getNext(JCas jcas) throws IOException, CollectionException {
        // Here shall be the important code
        currentLine++;
    }
}
[/code]

The Reader extends the abstract uimaFIT class JCasCollectionReader_ImplBase. Apart from the getNext( ) method, the parameters, and the private fields, almost every reader looks the same. First it is set up through its initialize( ) method.

The only possible error so far is a wrong location or name of the input file, which we simply have to expect to be correct. If it is not, the resulting IOException is caught and rethrown as a ResourceInitializationException. If everything is fine, the file is read line by line into a String List in memory, and the counter (currentLine) is set to 0.

hasNext( ) and getProgress( ) should be simple to understand just by reading the code. If you have questions about them, feel free to leave a comment.

The core piece is the getNext( ) method. Most of the work of designing a proper Reader is done right here. Let's check some potential error sources:

  1. Lines might not be formatted properly. Normally you use some split call to break the line up for further processing.
  2. The split count might be wrong. Your line could contain too many or too few separators, resulting in a wrong number of substrings (parts).
  3. Parts might contain invalid content. E.g. a part that is expected to be a natural number might contain letters or other symbols, or a String expected to hold at least 10 words might contain only 2.
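The three checks above can be sketched in isolation. This is a minimal, hedged example and not the actual Reader code; the class and method names (LineValidator, splitChecked, parseNatural) are my own inventions for illustration:

```java
import java.util.Optional;

// Minimal sketch of the validation steps: split count and numeric fields.
// Names here are illustrative only, not part of the Reader.
public class LineValidator {

    /** Splits a line on tabs; returns the parts only if exactly the expected count is present. */
    static Optional<String[]> splitChecked(String line, int expectedParts) {
        String[] parts = line.split("\t");
        return parts.length == expectedParts ? Optional.of(parts) : Optional.empty();
    }

    /** Parses a field expected to hold a natural number; empty if it does not. */
    static Optional<Integer> parseNatural(String field) {
        try {
            int value = Integer.parseInt(field.trim());
            return value >= 0 ? Optional.of(value) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(splitChecked("a\tb\tc", 3).isPresent()); // three parts, as expected
        System.out.println(parseNatural("42").isPresent());         // a valid natural number
        System.out.println(parseNatural("4x").isPresent());         // letters mixed in: rejected
    }
}
```

Returning Optional instead of throwing keeps the decision of how to handle a bad line (skip it, log it, abort) with the caller.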

This is how an input line should look (<ID> <shortID> <first token> <last token> <sentiment> <tweet>):

100001199650123776 17063255 7 8 negative I'm stuck in London again... :( I don't wanna spend the night in McDonald's!

It should contain 6 different parts, separated by tab characters. This character cannot appear inside tweets, so it should work. Still, there might be problems with the provided data. Another problem is that a crawled tweet might have become unavailable, which looks like this in the data file:

101783554563903488 19326476 9 10 positive Not Available

The Reader would have to look for the 9th to 10th token in this tweet, but they are not available. This would result in an error…
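To make the bounds rule explicit: with zero-based token indices, tokens first to last can only be read when the tweet has more than `last` tokens. A small sketch of that guard, assuming my own helper name `extractPhrase` (not part of the Reader):

```java
// Illustrative helper: join tokens[first..last] or signal an invalid range.
public class PhraseExtractor {

    /**
     * Joins tokens[first..last] (zero-based, inclusive) with single spaces,
     * or returns null when the range exceeds the available tokens --
     * exactly what happens for the "Not Available" placeholder above.
     */
    static String extractPhrase(String tweet, int first, int last) {
        String[] tokens = tweet.split(" ");
        if (first < 0 || last >= tokens.length) {
            return null; // tokens[last] would throw ArrayIndexOutOfBoundsException
        }
        StringBuilder phrase = new StringBuilder();
        for (int i = first; i <= last; i++) {
            if (i > first) {
                phrase.append(' ');
            }
            phrase.append(tokens[i]);
        }
        return phrase.toString();
    }

    public static void main(String[] args) {
        // "Not Available" has only 2 tokens, so asking for tokens 9..10 fails:
        System.out.println(extractPhrase("Not Available", 9, 10));
    }
}
```

A null (or Optional) return lets the caller fall through to a fallback branch instead of crashing mid-collection.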

Here is the getNext( ) method for the training data set A provided by the SemEval coordinators. In my real code I have an alternative algorithm for task B (no phrases; full tweets are evaluated), but it works mostly the same way. I also have to admit: if the data file contains errors in the first 5 columns, I am screwed and will have to eat my words. I might edit this later. Since the data will not change, everything is just fine with me right now.

So here is what the algorithm does:

  1. Split the current line into substrings (split character is the tab symbol).
  2. Check if the result has 6 Strings.
  3. Split the tweet into tokens (split on the space symbol).
  4. Check if the tweet was available (otherwise it contains the String “Not Available”).
  5. Check that the token range is valid: with zero-based indices, the largest required token number must be smaller than the number of tweet tokens.
  6. Extract the phrase as defined by the 3rd and 4th substring.
  7. Mark the gold standard (right now I haven't defined a proper type for this and am using the GoldLanguage type; this will change soon enough!).
  8. Save the relevant data to the JCas body.

[code language=”java”]
@Override
public void getNext(JCas jcas) throws IOException, CollectionException {
    // the 6 columns are separated by tab characters
    String[] parts = lines.get(currentLine).split("\t");

    if (parts.length != 6) {
        throw new IOException("Wrong line format: " + lines.get(currentLine));
    }

    String[] tokens = parts[5].split(" ");

    int firstToken = Integer.parseInt(parts[2]);
    int lastToken = Integer.parseInt(parts[3]);

    /**
     * Resolves available tweets that still have the 'right' token size
     * (zero-based indices, so tokens[lastToken] must exist).
     */
    if (!parts[5].equals("Not Available") && tokens.length > lastToken) {

        String phrase = "";

        for (int i = firstToken; i <= lastToken; i++) {
            phrase += tokens[i];
            if (i != lastToken) {
                phrase += " ";
            }
        }

        GoldLanguage goldLanguage = new GoldLanguage(jcas);
        goldLanguage.setLanguage(parts[4]);
        goldLanguage.addToIndexes();

        jcas.setDocumentText(phrase);

    } else {
        GoldLanguage goldLanguage = new GoldLanguage(jcas);
        goldLanguage.setLanguage("unknown");
        goldLanguage.addToIndexes();

        jcas.setDocumentText("INVALID");
    }
    currentLine++;
}
[/code]

So, that's it for the week. I might get another hour of Elite Dangerous tonight; I guess I earned it. Got some bounties to hunt! If you have any questions or advice, feel free to leave a comment.

A Data Science consultant working at Materna. He occasionally blogs about data and related topics here and is the host of the Dortmund Data Science Meetup.