Ingest & Search JSON events in Real Time (II): Searching

Manuel Lamelas · Big Data Architecture, Technology

To continue the series: after Part I, Flume is configured to sink events into HDFS, so we can already search the data in near real time using Hive. However, if we want to search it fast, Hive will probably not be enough for us, so we are also going to configure a Solr Sink that stores the data in a Solr collection, which we can then search via the Solr API.

We are working with this sample JSON to provide some examples:
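As a stand-in for the original sample, a minimal event with hypothetical fields (type, username, msg, ts) might look like this:

```json
{
  "type": "click",
  "username": "jdoe",
  "msg": "clicked the checkout button",
  "ts": 1433116800000
}
```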

Hive

To search with Hive we only have to create an external table over the HDFS files we ingested in Part I. We store all the events in HDFS under /events/%Y/%m/%d, so we are going to create a table called events and partition it by date. It is easy to set up an Oozie workflow that adds the necessary partitions for us every day.
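A sketch of the DDL, assuming the hypothetical fields from the sample above and the Hive JSON SerDe (the exact column list is an assumption):

```sql
-- External table over the Flume-ingested JSON files; the columns
-- mirror the hypothetical sample event.
CREATE EXTERNAL TABLE events (
  type     STRING,
  username STRING,
  msg      STRING,
  ts       BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/events';

-- The daily Oozie workflow only needs to register the new partition:
ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2015-06-01')
LOCATION '/events/2015/06/01';
```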

With this we can query the data and see it almost in real time. We will observe a small delay: the time the files ingested by Flume take to be renamed from temporary to permanent (remember to prefix the temporary files with a dot ‘.’ so Hive ignores them and queries do not fail). This delay can be controlled with the Flume parameters rollInterval and rollSize (see Part I).
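As a reminder of the relevant HDFS sink settings from Part I (the agent and sink names are assumptions):

```properties
# Sketch of the HDFS sink: roll every 5 minutes or 128 MB,
# and hide in-flight files from Hive with a dot prefix.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = /events/%Y/%m/%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.inUsePrefix = .
agent.sinks.hdfs-sink.hdfs.rollInterval = 300
agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
agent.sinks.hdfs-sink.hdfs.rollCount = 0
```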

Solr

In order to index data directly in Solr in real time we are going to use a Solr Sink (Flume's MorphlineSolrSink), and the Flume configuration is:
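A sketch of that configuration; the agent, sink, and channel names and the morphline file path are assumptions:

```properties
# MorphlineSolrSink: hands each event to the morphline defined below.
agent.sinks.solr-sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solr-sink.channel = solr-channel
agent.sinks.solr-sink.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sinks.solr-sink.batchSize = 1000
agent.sinks.solr-sink.batchDurationMillis = 1000
# Keep the pipeline alive on Solr exceptions, and do not retry
# badly formatted events forever.
agent.sinks.solr-sink.isProductionMode = true
agent.sinks.solr-sink.isIgnoringRecoverableExceptions = true
```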

The last two properties, isProductionMode and isIgnoringRecoverableExceptions, are necessary so that Flume does not stop on Solr exceptions and does not retry badly formatted events forever.

Morphlines

Morphlines let us define ETL processes that transform data in real time. In this case we use one to turn each JSON event into a Solr document, with a morphline.conf along these lines:
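A minimal sketch, assuming the hypothetical sample fields and a standard SOLR_LOCATOR block pointing at the collection:

```hocon
# Where to find the Solr collection (hostnames are assumptions).
SOLR_LOCATOR : {
  collection : events
  zkHost : "zk-host:2181/solr"
}

morphlines : [
  {
    id : events
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Parse the raw event body as JSON.
      { readJson {} }

      # Map JSON paths to Solr fields (paths mirror the hypothetical sample).
      {
        extractJsonPaths {
          flatten : false
          paths : {
            type     : /type
            username : /username
            msg      : /msg
            ts       : /ts
          }
        }
      }

      # Unique document id, matching the id field in the Solr schema.
      { generateUUID { field : id } }

      # Stamp each document with the current time, formatted as an ISO date.
      { addCurrentTime { field : dt } }
      {
        convertTimestamp {
          field : dt
          inputFormats : ["unixTimeInMillis"]
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Drop fields the Solr schema does not know about, then index.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```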

This will ingest the events directly into Solr in real time, as long as the Solr schema is well defined. Apart from reading the JSON fields with extractJsonPaths, we also create an id field with a unique UUID value and a dt field holding the current date (for quick comparison with the Hive partitioning we defined earlier).

With this, we have the JSON events both stored in HDFS and indexed in Solr, so we can use the Solr API to query them in milliseconds, and Hive for more expressive SQL queries that will take seconds.
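For instance, with a hypothetical host and collection name, the two query paths could look like this:

```bash
# Solr API: millisecond lookups over the indexed events.
curl 'http://solr-host:8983/solr/events/select?q=type:click&sort=dt+desc&rows=10&wt=json'

# Hive: more expressive SQL over the HDFS files, in seconds.
hive -e "SELECT username, COUNT(*) AS clicks
         FROM events
         WHERE dt = '2015-06-01' AND type = 'click'
         GROUP BY username;"
```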

Finally, to close this series of articles on Flume JSON ingestion and search, Part III will cover a High Availability architecture and benchmarking.
