Morphlines – Hadoop ETL by Cloudera

Manuel Lamelas, Big Data Architecture

Today we are going to talk about Morphlines, an open source framework developed by Cloudera that provides a new way to do ETL on Hadoop.

What are these morphlines?

Morphlines are simple configuration files that define how to transform data on the fly. A morphline describes the chain of steps (commands) that a record passes through on its way from input to output. These steps define how to transform, create or drop any field in the incoming event. There are many predefined commands, and if we need something new we can write our own and plug them into the file easily.
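To make the shape of such a file concrete, here is a minimal sketch of a morphline with three chained commands. The id morphline1 and the source_type field are illustrative assumptions; the commands themselves (readLine, addValues, logInfo) come from the standard command set.

```
morphlines : [
  {
    # Name used to select this morphline when a file holds several
    id : morphline1

    # Which command implementations to load from the classpath
    importCommands : ["org.kitesdk.**"]

    # Commands run in order; each passes its output record to the next
    commands : [
      # Emit one record per input line, stored in the "message" field
      { readLine { charset : UTF-8 } }

      # Create a new field on every record (hypothetical field name)
      { addValues { source_type : ["text/log"] } }

      # Log the resulting record
      { logInfo { format : "record: {}", args : ["@{}"] } }
    ]
  }
]
```

If any command in the chain fails for a record, the commands after it do not see that record.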

Where to use it?

Morphlines are designed primarily for real-time use cases in combination with Flume, but they also integrate with MapReduce if we want to run them in batch mode (CrunchIndexerTool, MapReduceIndexerTool and HBaseMapReduceIndexerTool).

[Figure: Morphlines Flume Interceptor]

[Figure: Morphlines Flume Solr Sink]

In this image we can see where morphlines fit in the Hadoop architecture:

[Figure: Morphlines Architecture]

How to use it?

A sample morphline configuration file reads an event in JSON, then parses, cleans, logs and indexes it in Solr.
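A minimal sketch of such a file, modelled on the usual JSON-to-Solr example from the Morphlines documentation, might look like this. The collection name, the ZooKeeper address and the JSON paths are placeholder assumptions and would need to match your own cluster and schema.

```
# Where the Solr collection lives (placeholder values)
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      # Read: parse the event body as JSON
      { readJson {} }

      # Parse: copy selected JSON values into record fields
      {
        extractJsonPaths {
          flatten : false
          paths : {
            id : "/id"
            user_name : "/user/name"
            text : "/text"
          }
        }
      }

      # Clean: drop fields the Solr schema does not know about
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # Log: print the record that is about to be indexed
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # Index: send the record to Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```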

All the stages are chained, each doing its part, until the loadSolr command indexes the record in Solr.

This makes an excellent example of real-time ingestion with a Solr sink, as you can see in the series of posts we’ve published, Ingest & Search JSON events in Real Time. But we could also use it with MapReduceIndexerTool or CrunchIndexerTool.

Interesting Morphlines Commands

In the Morphlines Reference Guide you can see all the available commands, but we are going to highlight some that are particularly interesting.

  • readCSV or readJson: they read input in these formats and emit records.
  • addValues, addValuesIfAbsent, removeFields: you can add or remove fields from the records.
  • addCurrentTime, convertTimestamp: time functions, so we can add a timestamp to the records or convert an existing one.
  • contains, equals, if, not: conditionals we can use to decide what to do with a record.
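As a rough sketch of how the commands above combine, the following morphline reads CSV lines, filters on a condition, fixes up timestamps and adjusts fields. The column names, the DEBUG filter and the timestamp format are assumptions made up for illustration.

```
morphlines : [
  {
    id : csvExample
    importCommands : ["org.kitesdk.**"]

    commands : [
      # Emit one record per CSV line, naming the columns
      {
        readCSV {
          separator : ","
          columns : [id, level, created_at, text]
          charset : UTF-8
        }
      }

      # Drop records whose level is DEBUG, keep everything else
      {
        if {
          conditions : [
            { contains { level : [DEBUG] } }
          ]
          then : [
            { dropRecord {} }
          ]
        }
      }

      # Convert the event time into an ISO 8601 string in UTC
      {
        convertTimestamp {
          field : created_at
          inputFormats : ["yyyy-MM-dd HH:mm:ss"]
          inputTimezone : UTC
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Record when the event was processed (milliseconds since the epoch)
      { addCurrentTime { field : ingested_at } }

      # Tag the record, then remove a field we no longer need
      { addValues { source_type : ["text/csv"] } }
      { removeFields { blacklist : ["literal:level"] } }
    ]
  }
]
```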

Custom Commands

One of the most interesting properties of the Morphlines framework is the ability to write your own commands as a step, or to include a java block in your morphline. You can see in Cloudera's ReadLine code how easy it is to write a new step.

A java block is embedded directly in the morphline file through the java command, which compiles the given code and runs it for every record.
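A minimal sketch of such a block, following the example in the Morphlines Reference Guide, is shown here as one entry of the commands list. The tags field and the "hello" and "world" values are purely illustrative; record and child are variables the morphline runtime makes available to the code.

```
{
  java {
    imports : "import java.util.*;"
    code : """
      // 'record' is the current record, 'child' is the next command
      List tags = record.get("tags");
      if (!tags.contains("hello")) {
        return false;                 // fail: this record goes no further
      }
      tags.add("world");              // modify the record in place
      return child.process(record);   // hand over to the next command
    """
  }
}
```

Returning false marks the command as failed for that record, while returning child.process(record) passes the record on down the chain.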

Take away

So we can see from these examples how powerful morphlines, Cloudera's approach to ETL, are. Hortonworks has gone a completely different way with NiFi, but that is for another blog post.



