Ingest & Search JSON events in Real Time (III): Flume Architecture & Benchmarking

Manuel Lamelas, Big Data Architecture

To close this series of articles on ingestion and search, we look at a Flume architecture for high availability and at some benchmark results.

Flume Architecture

To achieve high availability, Flume gives us two settings to work with:

1. File Channel vs Memory Channel

This is a choice between guaranteed delivery and fast ingestion. With the File Channel, events are persisted to the filesystem (you can spread the data across different disks to improve performance). With the Memory Channel, events are held in memory, so it is faster, but if the agent crashes, the events still in the channel are lost.

If you do not require 100% delivery, the Memory Channel performs better and supports very high throughput, and in our experience an agent can run for months without going down.
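The trade-off can be made concrete with a toy model. The sketch below is not Flume code: `MemoryChannel`, `FileChannel` and `simulate_crash_and_restart` are hypothetical names used only for illustration, and the "file channel" is just an append-only log. It shows why buffered events survive a restart in one case and not in the other.

```python
import json
import os
import tempfile
from collections import deque


class MemoryChannel:
    """Toy in-memory buffer: fast, but its contents vanish with the process."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def pending(self):
        return list(self._events)


class FileChannel:
    """Toy durable buffer: every event is appended to a log file on disk."""

    def __init__(self, path):
        self._path = path

    def put(self, event):
        with open(self._path, "a", encoding="utf-8") as log:
            log.write(json.dumps(event) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the write to disk before returning

    def pending(self):
        if not os.path.exists(self._path):
            return []
        with open(self._path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def simulate_crash_and_restart(log_path):
    """Buffer two events in each channel, then rebuild both as after a crash."""
    memory, durable = MemoryChannel(), FileChannel(log_path)
    for event in ({"id": 1}, {"id": 2}):
        memory.put(event)
        durable.put(event)
    # A crash discards the process state; only the file on disk remains.
    memory_after, durable_after = MemoryChannel(), FileChannel(log_path)
    return memory_after.pending(), durable_after.pending()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lost, recovered = simulate_crash_and_restart(os.path.join(tmp, "channel.log"))
        print("memory channel after restart:", lost)
        print("file channel after restart:", recovered)
```

The `fsync` call on every event is also where the durable variant pays its cost, which is the same reason a disk-backed channel is slower than an in-memory one.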

2. Failover vs Load Balance

So it is rare for an agent to go down, but you still have to apply updates, cope with network failures and similar events. To deal with this, Flume lets you choose between Failover and Load Balance.

Failover

You prioritize one sink over the other. Use this when all events must go to a single sink at a time rather than to several simultaneously (e.g. only one HDFS file per source).
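As a rough illustration of the failover idea, the sketch below always picks the live sink with the highest priority and falls back to the next one only when the preferred sink is down. The function name, sink names and priority values are hypothetical, and the rule "larger number wins" is an assumption of this sketch, not a description of Flume's internals.

```python
def pick_failover_sink(priorities, alive):
    """Return the live sink with the highest priority, or None if all are down.

    priorities: dict mapping sink name to priority (assumption: larger wins).
    alive: set of sink names that are currently reachable.
    """
    candidates = [name for name in priorities if name in alive]
    if not candidates:
        return None
    return max(candidates, key=lambda name: priorities[name])


if __name__ == "__main__":
    priorities = {"sink_a": 10, "sink_b": 5}  # hypothetical names and values
    # Both sinks up: every event goes to the preferred sink.
    print(pick_failover_sink(priorities, alive={"sink_a", "sink_b"}))
    # Preferred sink down: traffic moves to the backup.
    print(pick_failover_sink(priorities, alive={"sink_b"}))
```

At any moment only one sink receives events, which is what makes this mode suitable for the "one HDFS file per source" case.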

Load Balance

The selector gives you some room to tune how events are distributed, but Load Balance is the right choice when you can write to your sinks in parallel.
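To show what the selector changes, here is a minimal sketch of two distribution strategies, round-robin and random, which are the two built-in selector options Flume documents for load balancing. The code is a toy model rather than Flume's implementation, and all function and sink names are hypothetical.

```python
import itertools
import random


def round_robin_selector(sinks):
    """Yield sinks in a fixed rotating order."""
    return itertools.cycle(sinks)


def random_selector(sinks, rng):
    """Yield sinks chosen uniformly at random using the given generator."""
    while True:
        yield rng.choice(sinks)


def distribute(events, selector):
    """Assign each event to the next sink offered by the selector."""
    assignment = {}
    for event, sink in zip(events, selector):
        assignment.setdefault(sink, []).append(event)
    return assignment


if __name__ == "__main__":
    sinks = ["sink_a", "sink_b"]  # hypothetical sink names
    events = list(range(6))
    print(distribute(events, round_robin_selector(sinks)))
    print(distribute(events, random_selector(sinks, random.Random(0))))
```

Round-robin spreads events evenly and predictably; random selection evens out only over many events. In both cases all sinks receive traffic at the same time, so the destination must tolerate parallel writers.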

Benchmarking

With the Memory Channel and Load Balance selected, we get the architecture shown in the first post. With nodes of only 32 GB of RAM and 6 cores, it handles more than 25,000 events of 2 KB each per second.

[Figure: Flume architecture]
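To put the benchmark numbers in perspective, the event rate can be converted into data volume. The sketch below assumes that the event size means 2 KiB (2,048 bytes) and uses the reported lower bound of 25,000 events per second. The post does not say whether that rate is per node or for the whole setup, so the result is only a ballpark.

```python
EVENTS_PER_SECOND = 25_000   # lower bound reported in the benchmark
EVENT_SIZE_BYTES = 2 * 1024  # assumption: each event is 2 KiB


def throughput(events_per_second, event_size_bytes):
    """Convert an event rate and event size into common throughput units."""
    bytes_per_second = events_per_second * event_size_bytes
    return {
        "bytes_per_second": bytes_per_second,
        "mib_per_second": bytes_per_second / (1024 * 1024),
        "mbit_per_second": bytes_per_second * 8 / 1_000_000,
        "gib_per_day": bytes_per_second * 86_400 / 1024**3,
    }


if __name__ == "__main__":
    for unit, value in throughput(EVENTS_PER_SECOND, EVENT_SIZE_BYTES).items():
        print(f"{unit}: {value:,.1f}")
```

Under these assumptions the rate works out to roughly 49 MiB/s, or about 410 Mbit/s on the wire, which adds up to around 4 TiB per day if sustained.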

Conclusion

This concludes the series on how to ingest and search JSON events in real time with high availability. Get in touch if you have any questions or want to go deeper into this kind of architecture.
