
Big Data - Jobs, tools and how to ace it

Big Data: Overview of Structure and Jobs

The demand for big data resources has increased dramatically in the past few years. The requirements to build and get the most out of a "Big Data" environment can be classified into three tiers:

  • Base Layer - DevOps and Infrastructure
  • Mid Layer - Understanding and manipulating data
  • Front Layer - Analytics and data science
I feel the jobs surrounding "Big Data" will ultimately reflect this structure, and learning Big Data should also be organised around these tiers.

Software Suite/Tools

Base Layer - Summary

This layer forms the core infrastructure of a "big data" platform and should be horizontally scalable.
  • OS - Linux is the way forward for big data technologies: RedHat, SUSE, Ubuntu, CentOS
  • Distributed Computing tools/software - Hadoop, Splunk (a small Hadoop Streaming sketch follows this list)
  • Data Storage - Splunk, MongoDB, Apache Cassandra
  • Configuration management - Ansible, Puppet, Chef
  • Others - Networking knowledge, Version Control (Git)
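
To give a feel for the kind of processing this infrastructure supports, here is a minimal word-count sketch for Hadoop Streaming written in Python. The file names and sample flow are illustrative assumptions, not taken from any particular distribution; the two scripts would normally live in separate files and be handed to the hadoop-streaming jar via its -mapper and -reducer options.

    # mapper.py -- reads raw text on stdin and emits "word<TAB>1" pairs
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
    # so counts can be accumulated per word with a simple running total
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")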

Mid Layer - Summary

This layer covers the engineering side of the project, with a focus on data mining, normalisation and similar tasks.
  • Regex (regular expression) knowledge and machine learning
  • Knowledge of the various application systems, in order to understand their data
  • Data normalisation techniques, data enrichment, integration with external data stores (see the Spark sketch after this list)
  • Technologies - Java, Scala, Splunk Search Language, Python, NoSQL, JSON
  • Tools - Apache Spark, Splunk, Apache Mahout
  • Web Development - jQuery, AngularJS, D3.js
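
As a flavour of this layer, below is a minimal PySpark sketch that pulls fields out of raw log lines with regular expressions and applies some light normalisation. The sample log lines, field names and application name are made up for illustration, assuming PySpark is installed and a local session is acceptable.

    # A minimal PySpark sketch: regex field extraction plus simple normalisation
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, lower

    spark = SparkSession.builder.appName("normalisation-sketch").getOrCreate()

    # Hypothetical raw log lines; in practice these would come from HDFS, Kafka, etc.
    raw = spark.createDataFrame(
        [("2024-01-05 10:12:01 HOST01 login user=alice",),
         ("2024-01-05 10:12:07 host02 logout user=bob",)],
        ["line"],
    )

    # Extract fields with regexes, then normalise the host name to lower case
    events = (
        raw.withColumn("timestamp", regexp_extract("line", r"^(\S+ \S+)", 1))
           .withColumn("host", lower(regexp_extract("line", r"^\S+ \S+ (\S+)", 1)))
           .withColumn("user", regexp_extract("line", r"user=(\w+)", 1))
    )

    events.show(truncate=False)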

Front Layer - Summary

This layer forms the end-user and business-facing experience.
  • Business Analysis - requirements gathering, data availability, etc.
  • Web Development - analytics, visualisation techniques, D3.js, mathematical modelling, etc.
  • Data Science - data correlation, regular expressions, statistics and data modelling (see the correlation sketch after this list)
  • Tools - Apache Ambari, Splunk, D3.js, XML/JSON formats
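
To illustrate the data-science side, here is a minimal Python sketch of data correlation using pandas. The metric names and numbers are purely illustrative assumptions.

    # A minimal correlation sketch with pandas
    import pandas as pd

    # Hypothetical daily metrics handed over from the mid layer
    df = pd.DataFrame({
        "page_views": [120, 150, 90, 200, 170],
        "sign_ups":   [8, 11, 5, 16, 13],
    })

    # Pearson correlation between the two series;
    # values close to +1 or -1 suggest a strong linear relationship
    correlation = df["page_views"].corr(df["sign_ups"])
    print(f"page_views vs sign_ups correlation: {correlation:.2f}")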

Big Data Jobs

The key thing here is to concentrate on one of the above layers/tiers. Depending on the flavour of your previous experience, it should be quite easy to fit your profile into one of them.

Software Suite/Tools

  • Apache products (the Hadoop suite, Spark, Ambari) are varied and flexible, but the return on investment (ROI) is poor in the initial years, and it takes time and effort to build a fully fledged system. The integration of Apache products needs to be designed carefully, and an upgrade plan should be in place, as the products change quite often.
  • Splunk, on the other hand, encompasses all of this in a single suite. Splunk is not a free product; licensing is based on the volume of data you index. ROI is quick, and you can build proof-of-concept (POC) systems almost immediately.
