Search This Blog


Wednesday, October 19, 2016

Hadoop- Bigdata Basics

Hadoop- Bigdata Basics

  • Hadoop consisting of 2 Core concepts. MAP and Reduce
  • Hadoop is not good fit for small Data sets and wherever is a need for Transactions (OLTP).
  • Hadoop is good fit for analyzing structured (Text, CSV,XML ) Semi-structured (email,blogs) and unstructured data (Audio, Video).
  • Hadoop supports JAVA and Scripting languages (PHP, Python, Perl, ruby bash etc..)
  • Hadoop splits the data and it stores the data as block (64mb or 128mb) in different DataNodes.
  • Default replication types is 3(three) in cluster. In Pseudo-Distributed mode Default replication type is 1
  • Streaming API is used for developers to write code for Map/Reduce Programs in any other Language.
  • Developer has the full control over the setting on Hadoop cluster. All configurations can be changed via JAVA API.
  • Hadoop eco system are HIVE,PIG,Oozee,Flume,Sqoop.

    Hadoop is comprised of 5(five) daemons/Services.
    • Holds the metadata for HDFS. FSIMAGE is metadata of the Blocks and EDIT is Log.
    • It Holds all its metadata in RAM for fast access.
    • If Metadata is crashed the data is inaccessible.

    Secondary NameNode:      
    • Performs housekeeping functions for the NameNode. It is not backup or hot standby for the NameNode.
    • SNN performing CPU intensive operation of combining EDIT logs and current file system snapshots.
    • If the Secondary Name Node is not running, the EDIT log will grow significantly and it will slow down the system. Then the system will go to safemode.

    • Stores actual HDFS data blocks.

    • Manages MapReduce jobs, distributes individual tasks to machines running TaskTracker.

    • Instantiates and monitors individual Map and Reduce tasks.
HDFS Concepts:-
  • HDFS is a filesystem written in Java – Bases on Google GFS(Google File System).
  • HDFS sits on top of a native filesystem(ext2,xfs).Blocks are stored as standard files in the Datanode.
Mapper Concepts:
  • Hadoop will start transfer the data from Mapper to Reducer as the Mappers finish work.
  • Map tasks works on relatively small portions of data typically a single block.
  • Mapper may use or completely ignore the Input Key.
  • Mapper Output must be in the form of key Value pairs.
  • Key Output of the mapper must be identically to the reducer key.
  • If Splitable = FALSE then a single mapper process the entire file.
  • Map-Side Join – that allows for splitting map between different data nodes. The data will be loaded into memory. This will provide very fast performance for the join.
  • Configure method in the Mapper we should use to implement code for reading the file and populating the associative array.
  • We can perform a map job alone in case if we do not want SORT and AGG. Reduce job is a CPU intensive task
  • Map side join uses memory for joining the data based on a key. If the data size is exceeded then the memory we will get Out of memory error.
  • If a Mapper running slowly than others, a new instance of Mapper will be started on another machine. (Speculative Execution).The result of the first Mapper to finish will be used.Hadoop will kill off the Mapper which is still running.

Reducer Concepts in HDFS:

  • Reducer can either have 0(Zero) Output or more final key-value pairs.
  • One key is processed by one reducer (Output from Mapper)
  • Output of Reducer it will be written in HDFS.
  • Reducer usually emits single key-value pair for each input key.
  • No concept of data locality for the Reducers.
  • All mappers in general have to communicate with all Reducers.
  • If Mapper and reducer runs on the same node then there is no need of transferring the data over the n/w. which will reduce the n/w overhead.
  • Reducer has 3 primary phases.
    1. Shuffle – Reducer copies the sorted output from each mapper using HTTP across the N/W.
    2. Sort – Then the framework merges the input from reducers by keys. Shuffle and sort occurs simultaneously.
    3. Reduce – Here the Reduce method is called for each key in the sorted inputs.
  • The reducer() method in the Reducers cannot start until all the Mapper have finished.
  • Each Reducer runs independently so communication between the reducer will not happen.
  • Developer can set the number of reducer in the programming. Can be Zero (0) also.
  • We can suppress the reducer output in case if we using image processing and web crawling.
Combiner concepts:-
  • We can specify the Combiner, Which is consider mini-reducer.
  • Output from the Combiner is sent to Reducers.
  • Input and Output data types for the Combiner and Reducer must be identical/Equal.
  • Combiner may run once, or more than once, on the output from any given Mapper.
  • Combiner can be applied only when operation performed is commutative and associative.
  • If the Combiner runs more than once which could influence/impact our results.
  • Using combiner we can reduce the amount of intermediate data that must transfer the data from mapper to reducer.
  • Combiner does the local aggregation of data, so it reduces the number of Key-Value pairs that need to be shuffle.

Java Concepts in Hadoop:-

  • Job object is used for actual MapReduce job.
  • JobControl object do the managing and monitoring the job.
        There are various methods to get the status of the job.Some of the methods.
    1. getFailedJobs ().
    2. getReadyJobs ().
    3. getRunningJobs ().
    4. getState ().
    5. getSucessfulJobs ().
    6. getWaitingjobs ().
  • JobConf object is configuration related to job like input and output path.
  • If we want to run multiple MapReduce job, we have to create a new JobConf object and needs to set input path to be the output path of previous job.
  • Using addDependingJob (job) method we can define the dependency of the MR job, to control the flow.
              For eg x. addDependingJob(y) --> X will not support until y has finished.
  • ChainMapper class allows to use multiple Mapper classes within a single Map task.
              Using this class the output of the first become the input of the second and so on until the last Mapper. Output of the last Mapper will be the task’s output
  • ChainReducer class allows to chain multiple Mapper classes after a Reducer with in the Reducer task.
  • FileSystem and FileStatus Java class which is used to represent a directory or file to get metadata information. (ie., Block size, replication factor, ownership and permission etc..).
  • Parent abstract class of HDFS Filesystem  --> org.apache.hadoop.fs.FileSystem.
 ----------------Common Things to Note/Not to Forget ------------------------------

  • We can configure the Hadoop in single machine to run in Pseudo-Distributed mode.
    1. This acts like Single machine cluster.
    2. All five daemons are running on the same machine.
    3. Usually for testing and development purpose
  • Each daemons/Services cannot share the JVM (Java Virtual Machine). Each daemons run on its own JVM.
  • It is possible to run all the daemons in a single node/machine but it is not advisable.
  • NameNode, Secondary NameNode, JobTracker daemons are called Master Nodes. Only one of each of these daemons runs on cluster.
  • DataNode and TaskTracker are called Slave Nodes. A slave node run both of these daemons.
  • Nodes talks each other as little as possible.
  • Computation happens where the data is stored.
  • If a node fails, the master will detect that failure and re-assign the work to different node on the system.
  • If a failed node restarts, it is automatically added back to the system and assigned new tasks.
  • MapReduce processing phase are Map-->Reduce. In between there is 2 stage called Shuffle-->Sort.
  • Default input format is TextInputFormat.
  • Developer can set the default input format in the JobConfiguration.(ie binary,sequential…)
  • When a files is an input to the MapReduce Job the Key is the byte offset into the file at which the line starts.
  • Writable is a Java interface that needs to be implemented for streaming data to remote servers.
  • Writable data types are used for network transmissions.
  • Developers can be implemented the custom data type as long as they implement writable interface.
  • It is possible to join large datasets in Hadoop Mapreduce. Eg Map-Side, Reduce – side and using distributed Cache.
  • Distributed cache is a component that allows the developers to deploy the jar files in the datanode . (ie map-reduce)
  • Number of Mapper for a Job is decided by Hadoop frame work.
  • Number of Reduced is decided by User. We can defined in JobConfiguration.
  • Job Submission :
    1. When a client submits a job, its configuration information is packaged into an XML file.
    2. This XML file and the .jar file (actual program code) is handed to the JobTracker.
    3. The JobTracker then parcels out individual tasks to TaskTracker nodes.
    4. Then the TaskTracker receives a request to run a task, it initialize a separate JVM for the task.
    5. Job Tracker is responsible for cleaning up the temporary data and monitoring the tasktrackers .
    6. TaskTracker nodes can run multiple tasks at the same time if the node has enough processing power and memory.
    7. If TaskTracker runs on multiple tasks each runs on separate JVM for the task.
  • Job While Running :
    1. While job is running, the intermediate data is held on the TaskTrackers local disk.
    2. When Reducer start up, the intermediate data is distributed across the network to process the Reducers.
    3. Reducers write their final output to HDFS.
    4. Once the Job has completed, the TaskTracker can delete the intermediate data from its local disk.
    5. The intermediate data is not deleted until the entire job completes.
    6. Intermediate files are called Sequence files/Map files. It will be generated by Map jobs
    7. Sequence files are binary format files that are compresses and are splitable. It is mainly used for Map-Reduce Jobs.
    8. Sequence Files is the only file format which the Hadoop framework that allow to be sorted.
    9. Sequence files Contains a binary encoding of an arbitrary number key-value pairs. Each key must be same type.Each Value must be same type.
  • HIVE :
    1. Hive is an abstraction on top of MapReduce.
    2. We can use SQL like Query to get output from HDFS.
    3. Uses HiveQL language.
    4. Hive Interpreter runs on a client machine.
    5. This interpreter turns HiveQL into MapReduce Jobs.
    6. Then Hive submits the jobs to cluster.
    7. Remember it is not RDBMS running on HDFS.
    8. To fetch already stored data in HDFS. Hive acts/provide JDBC and ODBC related interface.
  • Hive Metastore :
    1. It is a database which contains table definitions and other metadata.
    2. By default, stored locally on the client machine in a Derby database.
    3. If multiple people will be using Hive, the system administrator should create a shared Metastore.
    4. Metastore should and can be stored on any RDBMS (Mysql,MSSql)
  • PIG :
    1. Pig is an abstraction on top of MapReduce.
    2. Pig uses a dataflow scripting language called PigLatin.
    3. Pig Interpreter runs on a client machine.
    4. This interpreter turns the scripts into a series of MapReduce jobs.
    5. Then Pig submits the jobs to cluster.
    6. We can define a schema at runtime.
  • HBase :
    1. HBase is the Hadoop database.
    2. A NoSQL datastore.
    3. Can store massive amounts of data.
    4. GB, TB and even Petabytes of data can be stored in a table.
    5. Scales to provide very high write throughput.
    6. Hundreds of thousands of inserts per second.
    7. Tables can have many thousands of columns.
    8. Even if most columns are empty for any given row.
    9. Insert a row, retrieve a row, do a full or partial table scan.
    10. Only one column (row key) is indexed.
    11. Does not support multi row transaction.
    12. Rows in HBase tables are sorted by row key. The sort is byte-ordered.
    13. All table accesses are via the table row key : it is primary key.
    14. We can get data from HBase through MR job.
    15. HBase provides random, real time read or write access.
    16. If we have hundreds of millions or billions of rows then HBase is good option
    17. If we have few thousands or millions rows better go with RDBMS.

  • Flume :
    1. Flume is a distributed, reliable, available service for efficiently moving large amounts of data.
    2. Ideally suited to gathering logs from multiple systems and inserting them to HDFS.
    3. Flume is designed to continue delivering events in the face of system component failure.
    4. Flume scales horizontally to support scalability.
    5. As load increases more machines can be added to the configuration.
    6. Flume provides a central master controller for manageability.
    7. Administrators can monitor and reconfigure data flows on real time (while running).
    8. Flume can be extended by adding connecters to existing storage layers or data platforms.
    9. Sources may be data from files, syslogs and output from other process.
    10. Destination provides a files on the local system or HDFS.
    11. Other connectors can be added using Flume is API.
    12. Master holds configuration information for each node.
    13. Node communicate with master in every 5 sec.
    14. Nodes passes its version number to master.
  • Scoop :
    1. Sqoop is used to import data from tables in a relational database into HDFS.
    2. It transfers the RDBMS tables as files in HDFS.
  • Oozie :
    1. Ooize is a workflow engine.
    2. It runs on a server.
    3. Typically outside from the cluster.
    4. It runs workflows of Hadoop jobs.
    5. Including Pig, HIVE, and Sqoop jobs.
    6. It submits those jobs to the cluster based on a workflow definition.
    7. Workflow definitions are submitted via HTTP.
    8. Jobs can be run in specific times.
    9. We can schedule the job on one time or recurring job.
    10. Job can be run when data is present in a directory.
    11. XML Language is used to define Map reduce workflow.
    12. It can run jobs sequentially (one by one) and in parallel (Multiple at a time).
  • Diskcp (Distributed copy) :
    1. It is a tool used for large inter/intra-cluster copying.
    2. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting.
    3. It expands a list of files and directories into to map tasks. Each of which will copy a partition of files specified in the source list.