Big Data – Data Processing
There are many different areas of the architecture to design when looking at a Big Data project. As data is added to your Big Data repository, do you need to transform it or match it against other, disparate data sources? Can your Big Data framework handle the volume of data streaming in, or can you mostly focus on processing the incoming data and picking the right data store or warehouse? Here are the major elements we look at in an architecture; this section focuses on Data Processing.
Data now comes from more places than ever and needs to be connected to other data sets. This processing step is the most critical, so choosing the right tool is imperative. Below are some thoughts on the pros and cons of each option. Advanced inSight has experience with many of the products below, including MapReduce, Hive on Tez, and Spark. Let us help you make the decision.
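To make "transform" and "match" concrete, here is a minimal sketch in plain Python. The two sources, their field names (`customer_id`, `cust`, `page`), and the sample records are all hypothetical; a real project would run the same kind of logic inside one of the processing tools discussed below rather than in a single script.

```python
# Hypothetical sample records from two disparate sources.
# The CRM export stores ids as zero-padded strings; the web log uses integers.
crm_rows = [
    {"customer_id": "001", "name": " Alice Smith "},
    {"customer_id": "002", "name": "bob jones"},
]
web_events = [
    {"cust": 1, "page": "/pricing"},
    {"cust": 2, "page": "/docs"},
    {"cust": 3, "page": "/home"},
]


def normalize_crm(row):
    """Transform step: cast the id to an integer and tidy up the name."""
    return {
        "customer_id": int(row["customer_id"]),
        "name": row["name"].strip().title(),
    }


def match_events(crm, events):
    """Match step: attach the customer name to each web event.

    Events with no matching customer are kept, with name set to None.
    """
    by_id = {r["customer_id"]: r for r in map(normalize_crm, crm)}
    matched = []
    for event in events:
        customer = by_id.get(event["cust"])
        matched.append({
            "customer_id": event["cust"],
            "page": event["page"],
            "name": customer["name"] if customer else None,
        })
    return matched


joined = match_events(crm_rows, web_events)
for row in joined:
    print(row)
```

Even in this toy case, the two sources disagree on id format and name formatting, which is why the transform step has to come before the match.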
Pros and Cons of Data Processing Tools
- Pros: handles any scale of data, reliable, lots of customization
- Cons: hard to program against, slow
- Pros: scalable, reliable, some customization possible
- Cons: still hard to program against, slow
Hive (on MapReduce)
- Pros: scalable, reliable, easy SQL interface
- Cons: slow (Hive on Tez is faster), little customization possible
- Pros: lots of customization, in-memory processing
- Cons: not reliable, hard to program against
Presto / Spark SQL
- Pros: easy SQL interface, fast in-memory processing
- Cons: not reliable (jobs can fail when they run out of memory), little customization possible, best suited to smaller data sets
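The trade-off between "hard to program against" and "easy SQL interface" can be illustrated with a small sketch. It computes the same per-page total twice: once as hand-written map, shuffle, and reduce steps, and once as a single SQL query. This is only an analogy under stated assumptions: plain Python functions stand in for a MapReduce job, Python's built-in `sqlite3` module stands in for a SQL engine such as Hive or Presto, and the `events` data and table are hypothetical. It says nothing about how those systems execute work or how they perform.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view records: (page, milliseconds spent).
events = [("/home", 120), ("/docs", 300), ("/home", 80), ("/pricing", 50)]


# --- Map/reduce style: every step is written by hand ---
def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for page, ms in records:
        yield page, ms


def shuffle(pairs):
    """Group all values that share a key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


mr_totals = reduce_phase(shuffle(map_phase(events)))

# --- SQL style: the same aggregation, stated declaratively ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_totals = dict(conn.execute("SELECT page, SUM(ms) FROM events GROUP BY page"))
conn.close()

print(mr_totals)
print(sql_totals)
```

The hand-written version leaves room to customize every step, while the SQL version is shorter and easier to write but exposes fewer hooks, which mirrors the customization trade-off in the lists above.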