
Big Data - Jobs, tools and how to ace it

Big Data: Overview of Structure and Jobs

The demand for big data resources has increased dramatically in the past few years. The requirements to build and get the most out of a "Big Data" environment can be classified into three tiers:

  • Base Layer - DevOps and Infrastructure
  • Mid Layer - Understanding and manipulating data
  • Front Layer - Analytics and data science
I feel the jobs surrounding "Big Data" will ultimately reflect this structure, and learning Big Data should also be organised around these tiers.

Software Suite/Tools

Base Layer - Summary

This layer forms the core infrastructure of a "big data" platform and should be horizontally scalable.
  • OS - Linux is the way forward for big data technologies: RedHat, SUSE, Ubuntu, CentOS
  • Distributed Computing tools/software - Hadoop, Splunk (a small Hadoop Streaming sketch follows this list)
  • Data Storage - Splunk, MongoDB, Apache Cassandra
  • Configuration management - Ansible, Puppet, Chef
  • Others - Networking knowledge, Version Control (Git)
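
To give a feel for the kind of processing this infrastructure supports, here is a minimal word-count sketch for Hadoop Streaming written in Python. The file names and sample flow are illustrative assumptions, not taken from any particular distribution; the two scripts would normally live in separate files and be handed to the hadoop-streaming jar via its -mapper and -reducer options.

    # mapper.py -- reads raw text on stdin and emits "word<TAB>1" pairs
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
    # so counts can be accumulated per word with a simple running total
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")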

Mid Layer - Summary

This layer covers the engineering side of the project, with a focus on data mining, normalisation and similar tasks.
  • Regex (regular expression) knowledge and machine learning
  • Knowledge of the various application systems, in order to understand their data
  • Data normalisation techniques, data enrichment, integration with external data stores (see the Spark sketch after this list)
  • Technologies - Java, Scala, Splunk Search Language, Python, NoSQL, JSON
  • Tools - Apache Spark, Splunk, Apache Mahout
  • Web Development - jQuery, AngularJS, D3.js
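
As a flavour of this layer, below is a minimal PySpark sketch that pulls fields out of raw log lines with regular expressions and applies some light normalisation. The sample log lines, field names and application name are made up for illustration, assuming PySpark is installed and a local session is acceptable.

    # A minimal PySpark sketch: regex field extraction plus simple normalisation
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, lower

    spark = SparkSession.builder.appName("normalisation-sketch").getOrCreate()

    # Hypothetical raw log lines; in practice these would come from HDFS, Kafka, etc.
    raw = spark.createDataFrame(
        [("2024-01-05 10:12:01 HOST01 login user=alice",),
         ("2024-01-05 10:12:07 host02 logout user=bob",)],
        ["line"],
    )

    # Extract fields with regexes, then normalise the host name to lower case
    events = (
        raw.withColumn("timestamp", regexp_extract("line", r"^(\S+ \S+)", 1))
           .withColumn("host", lower(regexp_extract("line", r"^\S+ \S+ (\S+)", 1)))
           .withColumn("user", regexp_extract("line", r"user=(\w+)", 1))
    )

    events.show(truncate=False)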

Front Layer - Summary

This layer forms the end-user and business-facing experience.
  • Business Analysis - requirements gathering, data availability, etc.
  • Web Development - analytics, visualisation techniques, D3.js, mathematical modelling, etc.
  • Data Science - data correlation, regular expressions, statistics and data modelling (see the correlation sketch after this list)
  • Tools - Apache Ambari, Splunk, D3.js, XML/JSON formats
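
To illustrate the data-science side, here is a minimal Python sketch of data correlation using pandas. The metric names and numbers are purely illustrative assumptions.

    # A minimal correlation sketch with pandas
    import pandas as pd

    # Hypothetical daily metrics handed over from the mid layer
    df = pd.DataFrame({
        "page_views": [120, 150, 90, 200, 170],
        "sign_ups":   [8, 11, 5, 16, 13],
    })

    # Pearson correlation between the two series;
    # values close to +1 or -1 suggest a strong linear relationship
    correlation = df["page_views"].corr(df["sign_ups"])
    print(f"page_views vs sign_ups correlation: {correlation:.2f}")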

Big Data Jobs

The key thing here is to concentrate on one of the above layers/tiers. Depending on the flavour of your previous experience, it should be quite easy to fit your profile into one of them.

Software Suite/Tools

  • Apache products (the Hadoop suite, Spark, Ambari) are varied and flexible, but the return on investment (ROI) is poor in the initial years, and it takes time and effort to build a fully fledged system. The integration of Apache products needs to be designed carefully, and an upgrade plan should be in place, as the products change quite often.
  • Splunk, on the other hand, encompasses all of this in a single suite. Splunk is not a free product; licensing is based on the volume of data you index. ROI is quick, and you can build proof-of-concept (POC) systems almost immediately.
