Hi again! In this new post we are going to explore a way of clustering Twitter followers, using the Datatons account as an example. We will try to segment our followers into different groups and see what each group has in common. For this we are going to use Hierarchical Clustering in Python. If you want to see the complete code, you can find it in this notebook.
Clustering is an unsupervised learning technique. This means that the data has no labels, so the method tries to find structure in the data itself. These algorithms produce several clusters, and you then need to analyze them to find out what their members have in common.
One of the simplest clustering methods is KMeans, a model that looks for groupings in the data by minimizing the distance between each point and the center of its group. It is the best-known clustering technique and it works well in many cases, but the result is difficult to visualize unless you are working with 2 or 3 dimensions (or use a dimensionality reduction model).
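To make the "distance to the center" idea concrete, here is a bare-bones sketch of the usual KMeans iteration written in NumPy. The two synthetic blobs, the number of iterations and the choice of starting centers are all assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): two well-separated 2-D blobs of 20 points each.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])

# Start with one point from each blob as the initial centers (simplification).
centers = X[[0, 20]].copy()

for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points.
    centers = np.array([X[assign == k].mean(axis=0) for k in range(len(centers))])

# Final assignment and the quantity KMeans tries to minimize
# (sum of squared distances from each point to its center).
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
assign = dists.argmin(axis=1)
inertia = ((X - centers[assign]) ** 2).sum()
print(centers)
print(inertia)
```

In practice you would normally reach for a library implementation rather than this loop; the sketch is only meant to show the two alternating steps.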
Another approach to clustering is to grow the clusters from the bottom up. This is what Agglomerative Hierarchical Clustering does. Each observation starts in its own cluster, and pairs of clusters are merged as we move up the hierarchy. This hierarchy is represented as a tree (a dendrogram). The tree lets us see our clusters and also how similar the observations are to each other, because the most similar ones are merged first.
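The bottom-up merging can be seen step by step on a tiny example. This sketch uses scipy's `linkage` on five made-up one-dimensional points; the points and the single-linkage method are assumptions chosen only to keep the merges easy to follow by eye.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up observations on a line (assumption, for illustration only).
points = np.array([[0.0], [1.0], [5.0], [6.5], [20.0]])

# Each row of Z describes one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(points, method="single")

n = len(points)
for step, (a, b, dist, size) in enumerate(Z):
    # Original observations are numbered 0..n-1; the cluster created
    # at merge number `step` gets the id n + step.
    print(f"step {step}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.1f} -> cluster {n + step} ({int(size)} points)")
```

The closest points are merged first, and the isolated point at 20.0 only joins at the very last step, which is exactly what the height of the branches in a dendrogram encodes.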
The first step in analyzing our followers is to find out who they follow; that will be our data. We will then apply a hierarchical clustering model to it and try to gain some insight into our followers. We have limited the study to the 100 followers with the most friends (the accounts they follow) so that the clusters can be shown graphically.
Once we process and transform our data using one-hot encoding, we will have a dataframe like this:
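As a minimal sketch of that encoding step, the snippet below builds a small follower-by-account table with pandas. The follower and account names are made up, and the notebook may well build its table differently; the point is only the shape of the result, with one row per follower, one column per followed account, and a 1 where the follower follows that account.

```python
import pandas as pd

# Hypothetical input: each follower mapped to the accounts they follow.
friends = {
    "follower_a": ["acct_news", "acct_bigdata"],
    "follower_b": ["acct_bigdata", "acct_startups"],
    "follower_c": ["acct_news"],
}

# Long format: one row per (follower, followed account) pair.
pairs = pd.DataFrame(
    [(follower, account)
     for follower, accounts in friends.items()
     for account in accounts],
    columns=["follower", "account"],
)

# One-hot table: rows are followers, columns are accounts, 1 = follows.
# clip(upper=1) guards against duplicated pairs being counted twice.
encoded = pd.crosstab(pairs["follower"], pairs["account"]).clip(upper=1)
print(encoded)
```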
Clustering Our Followers
Then we are going to use the scipy library to cluster our data and plot the resulting dendrogram.
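A minimal sketch of this step with `scipy.cluster.hierarchy` could look like the following. The small one-hot matrix is made up, and Ward linkage is an assumption: the post does not say which linkage method or distance the notebook uses.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical one-hot matrix: 6 followers x 5 accounts (1 = follows).
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
])
labels = [f"follower_{i}" for i in range(len(X))]

# Build the merge hierarchy (Ward linkage is an assumption, see above).
Z = linkage(X, method="ward")

# Draw the tree; scipy colors the branches below its default color threshold.
fig, ax = plt.subplots(figsize=(8, 4))
tree = dendrogram(Z, labels=labels, ax=ax)
ax.set_ylabel("merge distance")
fig.tight_layout()
```

Calling `plt.show()` (with an interactive backend) or `fig.savefig(...)` afterwards displays or stores the figure. With the real data, `X` would be the one-hot dataframe of the 100 followers.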
The dendrogram shows the clusters of our followers in different colors. There is a green group very far away from the other ones, and if we look at its Twitter accounts, they all belong to business organizations and magazines focused on Big Data and tech in the Spanish market. There are no personal accounts in that cluster. If we analyze which accounts this cluster follows, we can see why they have been clustered together: they follow accounts similar to themselves.
If we pick the yellow cluster instead, we can see that the accounts it follows are very different. This cluster is formed around startups, innovation and women in tech. Finally, if we pick the blue one, we can see that its members tend to follow news accounts.
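One way to carry out this kind of inspection in code is to cut the tree into flat clusters and look at which accounts each cluster follows most. The sketch below does that with scipy's `fcluster`; the tiny table, the choice of two clusters and the Ward linkage are assumptions for illustration, not the notebook's actual settings.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical one-hot table: rows are followers, columns are followed accounts.
encoded = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 1, 1, 1]],
    index=["f_a", "f_b", "f_c", "f_d"],
    columns=["acct_bigdata", "acct_tech", "acct_startups", "acct_news"],
)

Z = linkage(encoded.to_numpy(), method="ward")

# Cut the tree into at most 2 flat clusters (the number is an assumption).
cluster_ids = fcluster(Z, t=2, criterion="maxclust")

# Share of each cluster's members that follow each account.
share = encoded.groupby(cluster_ids).mean()

for cluster, row in share.iterrows():
    top = row.sort_values(ascending=False).head(2)
    print(f"cluster {cluster}: most followed -> {list(top.index)}")
```

Accounts with a share close to 1 inside a cluster and close to 0 elsewhere are the ones that explain why those followers ended up together.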
To conclude this post, we can see that Hierarchical Clustering allows us to get meaningful insights from our data. It gives us a graphical representation of the clusters (the dendrogram) that helps us identify the different segments in our data. We can also see which accounts are similar to each other by analyzing how they are merged together. All these insights help a company get to know its followers better and can help it improve its network.