A very good article by Raúl Estrada. Main points:
1. Data acquisition: pipeline for performance
In this step, data enters the system from diverse sources. The key focus of this stage is performance, because this step determines how much data the whole system can receive at any given point in time.
- Technologies
For this stage you should consider streaming APIs and messaging solutions like:
- Apache Kafka - open-source stream processing platform
- Akka Streams - open-source stream processing based on Akka
- Amazon Kinesis - Amazon data stream processing solution
- ActiveMQ - open-source message broker with a JMS client in Java
- RabbitMQ - open-source AMQP message broker written in Erlang
- JBoss AMQ - lightweight MOM developed by JBoss
- Oracle Tuxedo - middleware message platform by Oracle
- SonicMQ - messaging system platform by Progress Software
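All of these brokers decouple fast producers from slower consumers through a buffered channel. A minimal sketch of that pattern using Python's standard library `queue` as a stand-in broker (a real deployment would use a client library for Kafka, RabbitMQ, etc.):

```python
import queue
import threading

def produce(q, events):
    """Producer: pushes events into the broker as fast as they arrive."""
    for event in events:
        q.put(event)
    q.put(None)  # sentinel marking the end of the stream

def consume(q):
    """Consumer: drains events at its own pace."""
    received = []
    while True:
        event = q.get()
        if event is None:
            break
        received.append(event)
    return received

# Bounded buffer: when full, q.put() blocks, applying back-pressure
# to the producer -- the performance concern this stage is about.
q = queue.Queue(maxsize=100)
events = [f"sensor-reading-{i}" for i in range(5)]

producer = threading.Thread(target=produce, args=(q, events))
producer.start()
out = consume(q)
producer.join()
```

The bounded queue is the key design point: real brokers provide the same back-pressure (or buffering to disk, in Kafka's case) so ingestion rate, not consumer speed, sets the system's intake capacity.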
2. Data storage: flexible experimentation leads to solutions
There are many ways to design this layer, but all should consider two perspectives: logical (i.e. the model) and physical data storage. The key focus of this stage is experimentation and flexibility.
- Technologies
For this stage consider distributed database storage solutions like:
- Apache Cassandra - distributed NoSQL DBMS
- Couchbase - NoSQL document-oriented database
- Amazon DynamoDB - fully managed proprietary NoSQL database
- Apache Hive - data warehouse built on Apache Hadoop
- Redis - distributed in-memory key-value store
- Riak - distributed NoSQL key-value data store
- Neo4J - graph database management system
- MariaDB - MySQL fork; combined with Galera it forms a replication cluster
- MongoDB - cross-platform document-oriented database
- MemSQL - distributed in-memory SQL RDBMS
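To make the logical-model perspective concrete, here is a minimal sketch of the key-value model with expiring keys that in-memory stores like Redis expose. The `KeyValueStore` class is hypothetical, for illustration only, not a real client API:

```python
import time

class KeyValueStore:
    """Toy in-memory key-value store with optional per-key TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if expiry is not None and time.time() >= expiry:
            del self._data[key]  # lazy expiration on access
            return None
        return value

store = KeyValueStore()
store.set("session:42", {"user": "ana"}, ttl=30)  # expires in 30 s
store.set("counter", 7)                           # never expires
```

Choosing between this model, documents (MongoDB, Couchbase), wide columns (Cassandra), or graphs (Neo4J) is exactly the kind of experimentation this stage calls for: the physical store should follow from how the data will be queried.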
3. Data processing: combining tools and approaches
Years ago, there was debate about whether big data systems should use (modern) stream processing or (traditional) batch processing. Today we know the correct answer for fast data is that most systems must be hybrid, both batch and stream at the same time. The type of processing is now defined by the process itself, not by the tool. The key focus of this stage is "combination."
- Technologies
For this stage, you should consider data processing solutions like:
- Apache Spark - engine for large-scale data processing
- Apache Flink - open-source stream processing framework
- Apache Storm - open-source distributed realtime computation system
- Apache Beam - open-source, unified model for batch and streaming data
- TensorFlow - open-source library for machine intelligence
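The hybrid idea can be sketched in plain Python, with no framework: the same counting logic runs as a batch job over bounded historical data and incrementally over an unbounded stream, in the spirit of the unified model that Apache Beam (and Spark's structured APIs) aim for:

```python
from collections import Counter

def batch_count(records):
    # Batch mode: the whole bounded dataset is processed at once.
    return Counter(records)

def stream_count(record_stream, state=None):
    # Stream mode: each event is folded into running state as it arrives.
    state = state if state is not None else Counter()
    for record in record_stream:
        state[record] += 1
    return state

historical = ["click", "view", "click", "purchase"]  # bounded data
live = iter(["view", "click"])                       # unbounded stream

batch_result = batch_count(historical)
# The stream job picks up where the batch job left off -- one logic,
# two execution modes, which is the "combination" this stage is about.
total = stream_count(live, state=Counter(batch_result))
```

The process (counting) stays the same; only the execution mode changes, which is what "the type of processing is defined by the process itself, not by the tool" means in practice.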
4. Data visualization
Visualization communicates data or information by encoding it as visual objects in graphs, to clearly and efficiently get information to users. This stage is not easy; it’s both an art and a science.
- Technologies
- Notebook reports: Apache Zeppelin and Jupyter notebooks
- Charts, maps, and graphics: Tableau
- Customized charts, maps, and graphics: D3.js and Gephi