Big Data: Overview of Structure and Jobs
The demand for big data resources has increased dramatically in the past few years. The requirements to create, and get the most out of, a "Big Data" environment can be classified into 3 tiers:
- Base Layer - DevOps and Infrastructure
- Mid Layer - Understanding & manipulating data
- Front Layer - Analytics, data science
Base Layer - Summary
This layer forms the core infrastructure of a "big data" platform and should be horizontally scalable.
- OS - Linux is the way forward for big data technologies: Red Hat, SUSE, Ubuntu, CentOS
- Distributed Computing tools/software - Hadoop, Splunk
- Data Storage - Splunk, MongoDB, Apache Cassandra
- Configuration management - Ansible, Puppet, Chef
- Others - Networking knowledge, Version Control (Git)
Mid Layer - Summary
This layer forms the engineering role of the project, with a focus on data mining, normalisation etc.
- Regex (regular expression) knowledge and machine learning
- Knowledge of various application systems, to understand the data
- Data normalisation techniques, data enrichment, integration with external data stores
- Technologies - Java, Scala, Splunk Search language, Python, NoSQL, JSON
- Tools - Apache Spark, Splunk, Mahout
- Web Development - jQuery, AngularJS, D3.js
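As a small illustration of the regex and normalisation work listed for this layer, the sketch below turns a raw log line into a JSON record. The log format, the field names and the `normalise` helper are all assumptions made up for this example; real application logs will need their own pattern.

```python
import json
import re

# Hypothetical log format (an assumption for this sketch):
#   "<date> <time> <level> user=<name> duration=<n>ms"
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Za-z]+) user=(?P<user>\S+) duration=(?P<duration_ms>\d+)ms"
)


def normalise(line):
    """Return a normalised dict for one log line, or None if it does not match."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["level"] = record["level"].upper()            # consistent casing
    record["user"] = record["user"].lower()              # consistent casing
    record["duration_ms"] = int(record["duration_ms"])   # string -> number
    return record


raw = "2016-03-01 10:15:42 warn user=Alice duration=120ms"
print(json.dumps(normalise(raw), sort_keys=True))
```

Named groups keep the extraction readable, and converting types and casing at this point means the downstream analytics layer does not have to guess what each field contains.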
Front Layer - Summary
This layer forms the end-user experience and the business interaction experience.
- Business Analysis - Requirement gathering, Data availability etc.
- Web Development - Analytics, Visualisation techniques, D3.js, mathematical modelling etc.
- Data Science - Data correlation, regular expressions, statistics and data modelling
- Tools - Apache Ambari, Splunk, D3.js, XML/JSON formats
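To give a feel for the data correlation and statistics mentioned under Data Science, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The `pearson` function name and the sample figures (daily request counts against average response times) are hypothetical and only there for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample data: requests per day and average response time in ms.
requests_per_day = [1000, 1500, 2000, 2500, 3000]
avg_response_ms = [110, 135, 150, 190, 205]
print(round(pearson(requests_per_day, avg_response_ms), 3))
```

A value close to 1 or -1 suggests a strong linear relationship between the two series, while a value near 0 suggests little linear relationship; it says nothing on its own about cause.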
Big Data Jobs
The key thing here is to concentrate on one of the above layers/tiers. Depending on the flavour of your previous experience, it should be quite easy to fit your profile into one of the above tiers.
Software Suite/Tools
- Apache products (Hadoop suite, Spark, Ambari) are varied and flexible, but return on investment (ROI) is poor in the initial years, and it takes time and effort to create a fully fledged system. The integration of Apache products needs to be designed, and an upgrade plan should be in place, as the products change quite often.
- Splunk, on the other hand, brings everything together in a single suite. Splunk is not a free product; licensing is based on the volume of data you index. ROI is very quick, and Proof of Concept (POC) systems can be built immediately.