Data Logistics with Apache Nifi

Manuel Lamelas | BigData Architecture, Technology

As announced in a previous post, we’re now going to introduce you to Apache Nifi, the latest trend in ingestion tools. It is a project from the Apache Software Foundation that lets you manage data flows through a cool graphical interface. If that hasn’t caught your attention yet, wait until you hear this: it was created by the NSA!

Nifi – the UPS of data

Apache Nifi

We could describe Nifi as a straightforward event processing platform with the following features:

  • Awesome web interface: intuitive and fun. We can simply drag and drop elements, import and export configuration files…
  • High availability and scalability: Nifi can run as a cluster, coordinated through ZooKeeper.
  • Secure: multi-tenant authorization and SSL.
  • Configurable: the data flow can be fully customized (guaranteed delivery, prioritization…)
  • Data governance: you can trace any piece of data in the flow back to its very source.

Nifi Twitter Read Sample

The best way to learn to walk is by walking, so let’s go through a simple example: we’re going to install a demo with Docker and then run a flow that fetches tweets and stores them in files.

Let’s start by running the docker image:

Next, we’re going to store real-time tweets in the folder /demos/nifi/tweets on our computer. All of this is done through the web interface, so let’s begin. Go to localhost:8080/nifi, where you’ll find an empty dashboard like this:

Nifi Interface

 

Now we’ll add our Twitter processor and, through its properties section, configure the filter endpoint to retrieve all tweets that mention game of thrones. To do so, we add a processor, select GetTwitter, choose the filter endpoint, enter our account details and type ‘game of thrones’ in ‘Terms to Filter On’.

Twitter Processor Configuration
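
To make the effect of ‘Terms to Filter On’ a bit more concrete, here is a rough Python sketch of how a tweet could be checked against the terms. It assumes the matching rules documented for the Twitter streaming API’s track parameter (commas act as OR, spaces inside a phrase act as AND, case and word order are ignored). The function name `matches_terms` and the simple word splitting are our own simplifications, not part of Nifi or Twitter.

```python
import re


def matches_terms(tweet_text: str, terms_to_filter_on: str) -> bool:
    """Approximate the filter matching: commas mean OR, spaces mean AND."""
    # Lower-case the tweet and split it into plain words (a simplification).
    words = set(re.findall(r"\w+", tweet_text.lower()))
    for phrase in terms_to_filter_on.split(","):
        terms = phrase.lower().split()
        # A phrase matches when every one of its terms appears in the tweet.
        if terms and all(term in words for term in terms):
            return True
    return False


if __name__ == "__main__":
    print(matches_terms("Game of Thrones is back!", "game of thrones"))
    print(matches_terms("I love chess", "game of thrones"))
```

Under this model, ‘game of thrones’ matches any tweet that contains all three words, while a value such as ‘game of thrones, winter’ would also let through tweets that only mention winter.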

 

Once the filter is configured, we’ll create a file processor to store the filtered tweets in our folder. For that we’ll set up a PutFile processor like the one in the image below.

File Processor Configuration
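
As a mental model of what the file processor does with each piece of content it receives, here is a minimal Python sketch. It is not Nifi code: the function `put_file`, its parameters and the file name `batch-1.json` are hypothetical, and the `on_conflict` options only mirror, as an assumption, the kind of conflict handling (fail, ignore, replace) that such a processor offers.

```python
from pathlib import Path


def put_file(directory: str, filename: str, content: str,
             on_conflict: str = "fail") -> bool:
    """Write content to directory/filename; return True if a file was written."""
    if on_conflict not in ("fail", "ignore", "replace"):
        raise ValueError(f"unknown conflict strategy: {on_conflict}")
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = target_dir / filename
    if target.exists():
        if on_conflict == "fail":
            raise FileExistsError(f"{target} already exists")
        if on_conflict == "ignore":
            return False  # keep the existing file untouched
    target.write_text(content, encoding="utf-8")
    return True


if __name__ == "__main__":
    import tempfile

    # Use a temporary folder so the sketch does not touch /demos/nifi/tweets.
    with tempfile.TemporaryDirectory() as tmp:
        print(put_file(tmp, "batch-1.json", '{"text": "hello"}'))
```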

Now we’re going to merge content so that we don’t end up with a separate file for each individual tweet. To do this, add a MergeContent processor and configure it to merge once we reach 1000 tweets or a 5-minute threshold, whichever comes first.

Twitter Processor Merge Content
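
The merge rule can be pictured with a small Python sketch: a bin that collects tweets and releases them as a single payload once it holds 1000 entries or its oldest entry is 5 minutes old. This only illustrates the idea and is not how Nifi implements the processor. The class `TweetBin`, its method names and the newline separator are hypothetical.

```python
import time
from typing import Callable, List, Optional


class TweetBin:
    """Collects tweets and releases them as one merged payload."""

    def __init__(self, max_entries: int = 1000, max_age_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.max_age_seconds = max_age_seconds
        self.clock = clock  # injectable so the age rule can be tested
        self.entries: List[str] = []
        self.opened_at: Optional[float] = None

    def add(self, tweet_json: str) -> Optional[str]:
        """Add one tweet; return the merged content if the bin is now full."""
        if not self.entries:
            self.opened_at = self.clock()  # the bin's age starts with its first entry
        self.entries.append(tweet_json)
        if len(self.entries) >= self.max_entries:
            return self.flush()
        return None

    def poll(self) -> Optional[str]:
        """Return the merged content if the bin has reached its maximum age."""
        if self.entries and self.opened_at is not None:
            if self.clock() - self.opened_at >= self.max_age_seconds:
                return self.flush()
        return None

    def flush(self) -> str:
        """Merge the collected tweets, one per line, and empty the bin."""
        merged = "\n".join(self.entries)
        self.entries = []
        self.opened_at = None
        return merged
```

A caller would invoke `add` for every incoming tweet and `poll` periodically, writing out whatever either call returns. On a quiet day the 5-minute rule is what keeps a half-empty bin from waiting forever.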

 

Finally, we connect the Twitter processor to the merge processor through the success relationship, and the merge processor to the file processor through the merged relationship. We’ll leave every other relationship untouched. Ready to roll: just click play and enjoy the tweets flowing!

Nifi Twitter Complete Flow
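
If it helps to see the wiring written down, the flow above boils down to two connections, each identified by a source processor and one of its relationships. The Python sketch below is just a toy model of that idea. The `CONNECTIONS` table and the `route` function are hypothetical and are not part of any Nifi API.

```python
from typing import Dict, Optional, Tuple

# (source processor, relationship) -> destination processor
CONNECTIONS: Dict[Tuple[str, str], str] = {
    ("GetTwitter", "success"): "MergeContent",
    ("MergeContent", "merged"): "PutFile",
}


def route(processor: str, relationship: str) -> Optional[str]:
    """Return the processor that receives data sent to this relationship, if any."""
    return CONNECTIONS.get((processor, relationship))


if __name__ == "__main__":
    # Follow a tweet from the Twitter processor down to the file processor.
    step = route("GetTwitter", "success")
    print(step)
    print(route(step, "merged") if step else None)
```

The point of keying connections by relationship is that the same processor can send different outcomes to different places. Here only the success and merged outcomes lead anywhere.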

We can check our stats at any time by selecting the Twitter processor and opening Status History / Tasks. These graphical features certainly make Nifi a “nifty” tool for managing processes.

Twitter Processor Status History

Conclusion

In this post we haven’t gone into thorough detail about the potential behind Nifi. Instead, we’ve tried to show a bit of what it can do through a simple example. You can find much more information on this topic at the Nifi page.

Nifi is a great tool to have in our toolset: it has a lot of processors that connect with HDFS, Kafka… and plenty of scheduling flexibility, so it deserves serious consideration when designing a new Big Data architecture. Additionally, it allows you to create your very own processors! Stay tuned to our blog; we’ll explain this feature in a later post.

 
