The seven most common Hadoop and Spark projects

  • 2020-06-23 02:25:28
  • OfStack

There's an old adage that goes something like this: if you give someone your full support and financial backing to do something different and innovative, they'll end up doing what everyone else is doing. So it goes with Hadoop, Spark, and Storm. Everyone thinks they're doing something special with these new big data technologies, but it doesn't take long to encounter the same patterns over and over. Implementations vary, but in my experience these are the seven most common projects.

Project 1: Data consolidation

Call it an "enterprise data hub" or a "data lake." The idea is that you have disparate data sources and you want to run analysis across them. Such projects consist of pulling feeds from all the sources (in real time or in batches) and storing them in Hadoop. Sometimes this is the first step toward becoming a "data-driven company"; sometimes all you want is a good-looking report. A data lake typically takes the form of files on HDFS and tables in Hive or Impala. There is also a brave new world in which much of this data lands in HBase and, in the future, Phoenix.

Salespeople love to say "schema on read," but the truth is that to be successful you need a clear idea of what your use cases will be (the Hive schema won't look very different from what you would build in an enterprise data warehouse). The real reason for a data lake is horizontal scalability and a much lower cost than Teradata or Netezza. For analysis, many people put Tableau and Excel on the front end. More sophisticated companies with "data scientists" use Zeppelin or IPython notebooks as the front end.
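To make the "schema on read" idea concrete, here is a minimal sketch in plain Python rather than Hive. The raw lines are stored exactly as they arrived, and a schema chosen for one use case is applied only when the data is read. The sample data, `SALES_SCHEMA`, and `read_with_schema` are all hypothetical names invented for this illustration.

```python
import csv
import io

# Raw data lands in the lake exactly as it arrived; nothing is validated on write.
RAW_EVENTS = """\
2016-01-04,store-7,19.99
2016-01-04,store-7,not-a-number
2016-01-05,store-9,5.00
"""

# Hypothetical schema for one use case: (column name, converter) pairs.
SALES_SCHEMA = [("day", str), ("store", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; collect rows that do not fit it."""
    good, bad = [], []
    for fields in csv.reader(io.StringIO(raw_text)):
        try:
            if len(fields) != len(schema):
                raise ValueError("wrong column count")
            row = {name: convert(value)
                   for (name, convert), value in zip(schema, fields)}
            good.append(row)
        except ValueError:
            # The row stays in the raw data; it just does not match this schema.
            bad.append(fields)
    return good, bad


rows, rejected = read_with_schema(RAW_EVENTS, SALES_SCHEMA)

total_by_store = {}
for row in rows:
    total_by_store[row["store"]] = total_by_store.get(row["store"], 0.0) + row["amount"]

print(total_by_store)
print(rejected)
```

The sketch also shows why the use case still has to be known up front: the converter for `amount` is what decides which rows are usable, and that decision is the same one you would make when designing a warehouse table.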

Project 2: Specialized analysis

Many data consolidation projects actually start here, with a specific need and a single data set pulled into a system that does one kind of analysis. These tend to be incredibly domain-specific, such as liquidity risk or Monte Carlo simulations at a bank. In the past, such specialized analyses relied on outdated, proprietary software packages that could not scale with the data and often suffered from a limited feature set (partly because the software vendor couldn't possibly know as much about the domain as the institution immersed in it).

In the Hadoop and Spark world, these systems look roughly the same as data consolidation systems, but they tend to have more HBase, more custom non-SQL code, and fewer data sources (if not just one). Increasingly, they are Spark-based.
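As a rough illustration of the kind of Monte Carlo analysis mentioned above, the following plain-Python sketch estimates how often simulated cash outflows would exhaust a buffer. The model (normally distributed daily net outflows), the function name, and every number are assumptions made up for this example, not a real liquidity-risk method.

```python
import random


def simulate_shortfall_probability(buffer, days, mean_outflow, stdev_outflow,
                                   trials, seed=0):
    """Estimate how often cumulative net outflows exceed the cash buffer."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    shortfalls = 0
    for _ in range(trials):
        cumulative = 0.0
        for _ in range(days):
            # Hypothetical model: each day's net outflow is drawn from a normal
            # distribution; a negative draw means a net inflow.
            cumulative += rng.gauss(mean_outflow, stdev_outflow)
            if cumulative > buffer:
                shortfalls += 1
                break
    return shortfalls / trials


estimate = simulate_shortfall_probability(
    buffer=100.0, days=30, mean_outflow=2.0, stdev_outflow=10.0, trials=10_000
)
print(f"estimated shortfall probability: {estimate:.3f}")
```

Each trial is independent of the others, which is why this style of workload spreads naturally across a cluster.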

Project 3: Hadoop as a service

In any large organization with "specialized analysis" projects (and, ironically, one or two "data consolidation" projects), people inevitably start to feel the "joy" (that is, the pain) of managing several differently configured Hadoop clusters, sometimes from different vendors. Next they say, "Maybe we should consolidate these and pool resources," rather than leave most of the nodes idle most of the time. They could move to the cloud, but many companies can't or won't, often for security reasons (read: internal politics and job protection). This usually means a lot of Docker container packages.

I haven't used it yet, but BlueData seems to have a solution here, which would also appeal to smaller organizations that lack the means to deploy Hadoop as a service themselves.

Project 4: Streaming analytics

Many people would call this "streaming," but streaming analytics is different from streaming from devices. Typically, streaming analytics is a real-time version of what an organization used to do in batches. Take anti-money laundering or fraud detection: why not catch it on a per-transaction basis, as it happens, rather than at the end of a cycle? The same goes for inventory management or anything else.

In some cases, this is a new type of transactional system that analyzes data bit by bit as you shunt it into an analytical system in parallel. Such systems usually show up as Spark or Storm with HBase as the data store. Note that streaming analytics does not replace all forms of analysis; you will still want to surface historical trends or look through past data for something you had never considered.
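The difference between checking per transaction and checking at the end of a cycle can be sketched in a few lines of plain Python. This is only an illustration, not how Spark or Storm would express it: the rule, `WINDOW_SECONDS`, `LIMIT`, and the sample stream are hypothetical, and transactions are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # hypothetical look-back window
LIMIT = 10_000.0      # hypothetical per-account threshold


def flag_as_they_arrive(transactions):
    """Yield an alert as soon as an account's windowed total exceeds LIMIT."""
    recent = defaultdict(deque)   # account -> (timestamp, amount) pairs in the window
    totals = defaultdict(float)   # account -> running sum of amounts in the window
    for ts, account, amount in transactions:
        window = recent[account]
        window.append((ts, amount))
        totals[account] += amount
        # Drop entries that have fallen out of the look-back window.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            _, old_amount = window.popleft()
            totals[account] -= old_amount
        if totals[account] > LIMIT:
            yield (ts, account, totals[account])


stream = [
    (0, "acct-1", 6_000.0),
    (10, "acct-2", 500.0),
    (20, "acct-1", 5_000.0),    # acct-1 reaches 11,000 within 60 seconds
    (200, "acct-1", 7_000.0),   # the earlier entries have expired by now
]
alerts = list(flag_as_they_arrive(stream))
print(alerts)
```

Because the function is a generator, the alert for `acct-1` is produced when the third transaction is read, not after the whole stream has been consumed, which is the point of moving the check out of the batch cycle.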

Project 5: Complex event processing

Here we are talking about real-time event processing where sub-second latency matters. While still not fast enough for ultra-low-latency (picosecond or nanosecond) applications such as high-end trading systems, you can expect millisecond response times. Examples include real-time rating of call data records for telecom operators and processing of Internet of Things events. Sometimes you will see such systems built on Spark and HBase, but they generally fall on their faces and have to be converted to Storm, which is based on the Disruptor pattern developed by the LMAX exchange.

In the past, such systems were based on custom messaging software or on high-performance, off-the-shelf, client-server messaging products, but today's data volumes are too much for them. I haven't used it yet, but the Apex project looks promising and claims to be faster than Storm.

Project 6: Streaming as ETL

Sometimes you want to capture streaming data and store it somewhere. These projects usually coincide with No. 1 or No. 2 but add their own scope and characteristics. (Some people think they are doing No. 4 or No. 5, but they are actually dumping data to disk and analyzing it later.) These are almost always Kafka and Storm projects. Spark is also used, but without justification, because you don't really need in-memory analytics.
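The capture-and-store pattern described above can be sketched without any cluster at all. In this minimal Python example a generator stands in for the message queue and a temporary directory stands in for the storage layer; `land_stream`, the file naming, and the batch size are hypothetical choices made for illustration, not Kafka or Storm APIs.

```python
import json
import tempfile
from pathlib import Path


def land_stream(records, out_dir, batch_size):
    """Write incoming records to newline-delimited JSON files, one file per batch."""
    out_dir = Path(out_dir)
    written = []   # paths of the files written so far
    batch = []     # records waiting to be flushed

    def flush():
        if not batch:
            return
        path = out_dir / f"part-{len(written):05d}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for record in batch:
                handle.write(json.dumps(record) + "\n")
        written.append(path)
        batch.clear()

    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final partial batch, if any
    return written


# A generator stands in for the streaming source.
events = ({"id": i, "value": i * i} for i in range(5))

with tempfile.TemporaryDirectory() as tmp:
    files = land_stream(events, tmp, batch_size=2)
    file_names = [p.name for p in files]
    line_counts = [len(p.read_text(encoding="utf-8").splitlines()) for p in files]

print(file_names, line_counts)
```

Nothing here is analyzed while it is in flight; the records are only buffered and written out, which is why this kind of project is closer to ETL than to streaming analytics.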

Project 7: Replacing or augmenting SAS

SAS is fine; SAS is nice. But SAS is also expensive, and we aren't buying boxes for all of you data scientists and analysts just so you can "play" with the data. Besides, you wanted to do something different from what SAS can do, or to produce a prettier graph. Here is your nice data lake. Here are IPython notebooks (now) and Zeppelin (later). We'll feed the results into SAS and store the results from SAS here.

While I have seen other kinds of Hadoop, Spark, or Storm projects, these are the "normal," everyday types. If you use Hadoop, you probably recognize them. I implemented some of them years ago with other technologies.

If you are an old-timer too scared of the "big" in big data or the "do" in Hadoop, don't be. The more things change, the more they stay the same. You'll find plenty of parallels between the technologies you used to deploy and the fashionable ones swirling around the Hadooposphere.

Andrew C. Oliver is a professional cat herder who moonlights as a software consultant. He is president and founder of MammothData, a big data consulting firm based in Durham, North Carolina.

