The Wizard of Oz, Smart City version!

Miguel Izquierdo Business, Data Science, Technology

Once upon a time, a girl named Dorothy lived in the state of Kansas Oz, in a smart city named Emerald City. A huge tornado struck this city, but fortunately, Dorothy and all the other citizens had been evacuated well before the tornado reached them. In this city, people used to find their way by following a yellow brick road, but some time later, …

Codemotion 2017: A Showcase of Talent and Charisma

Miguel Izquierdo Big Data Architecture, Business, Technology

This year we attended Codemotion for the first time, and as invited guests at that. We knew it was a big event among developers, but we did not fully grasp its scale until we saw the huge number of participants, the large rooms packed with people, and the madness of having to arrive early if you wanted to find …

Hierarchical Clustering of Twitter Followers

Manuel Lamelas Data Science, Technology

Hi again!!! In this new post we are going to explore an impressive way of clustering Twitter followers, using the Datatons account as an example. We will try to segment our followers into different groups and see what they have in common. For this we are going to use Hierarchical Clustering in Python. If you want to see the complete code …

Data Logistics with Apache Nifi

Manuel Lamelas Big Data Architecture, Technology

As announced in a previous post, we’re now going to introduce you to Apache Nifi, the latest trend in ingestion tools. It is a new project from the Apache Software Foundation that allows you to manage data flows with a cool graphical interface. If we haven’t caught your attention yet, wait until you hear this: the NSA created it!!! Nifi – the UPS of data …

Kerberos & Hadoop: Securing Big Data (part I)

Celeste Duran Big Data Architecture, Technology

When I began to use Hadoop with Kerberos I felt as if I were in the middle of the ocean. I found a lot of information about Kerberos itself, but it was very difficult to find anything about how to use it on Hadoop, why to use it, and how to configure it to work with Hadoop. This trilogy of posts is going to …

Morphlines – Hadoop ETL by Cloudera

Manuel Lamelas Big Data Architecture

Today we are going to talk about Morphlines, an open source framework developed by Cloudera that provides a new way to do ETL on Hadoop. What are these morphlines? Morphlines are simple configuration files that define how to transform data on the fly. A morphline consists of a file that describes the steps a data flow has to pass through in order to …

NoSQL vs Relational: Which database to use

Iván Alejandro Marugán Big Data Architecture

Nowadays the way information is collected has changed a lot. Everybody wants to store more data and let their users consume that information easily and in real time. This means that performance, scalability and availability are three key factors for database implementations. For this reason NoSQL databases have made their appearance. What’s a NoSQL database? A NoSQL database (“non SQL”, …

Random Forest – Modeling The Titanic Voyage with R

David Carrasco Data Science

What’s a Random Forest? Random Forest is a machine learning algorithm normally used for classification and regression tasks in supervised learning. It consists of an ensemble, or group, of simple decision tree models that predict the value of a target variable based on a set of input variables. The main advantage over a single decision tree is that it reduces the …

Random Forest – The magic behind the algorithm

David Carrasco Data Science

Random Forest is an ensemble model based on decision trees, built using the bagging technique and used for classification and regression tasks in supervised learning (although it can be used in unsupervised learning too). Yes, it’s a bit more technical than the definition in the previous post, isn’t it? Well, let’s explain the main concept behind the Random Forest …

Ingest & Search JSON events in Real Time (III): Flume Architecture & BenchMarking

Manuel Lamelas Big Data Architecture

To end this series of articles on Ingestion & Searching, we are going to look at the Flume architecture for high availability and at some benchmark tests. Flume Architecture: To achieve high availability we have two Flume characteristics to play with. 1. File Channel vs Memory Channel: This is a decision between 100% delivery and fast ingestion. With file channel the …