(DN = DataNode, NN = NameNode)
In the last five years, the data we generate through social media and data-producing websites has grown massively day after day, and that raises a real issue: how are we going to deal with this enormous amount of data? This is where a fresh concept called Big Data comes into the world. It refers to these large, complex volumes of information being analyzed faster than traditional methods could manage. The concept is usually characterized by three keys (there are more, but three principal ones are considered): volume, velocity and variety, and data is evaluated based on them.
I won't go in depth into that concept because it's not the topic I chose to talk about, but it's essential for the reader to know some of the keywords used 😂
After Big Data went viral, various new frameworks entered its world to make analyzing intricate data easier and faster; among them we can name Spark, Hive and the famous one, Hadoop.
Hadoop shines when dealing with large data, and that's why it is one of the most used frameworks. It has the ability to store and process any kind of data quickly, it protects against hardware failure, it is flexible in what you store and how you want to use it, and more importantly it is scalable and low cost.
The architecture of Hadoop v1.0 itself is interesting: it is composed of HDFS and MapReduce. Hadoop v2.0 made it more organized; instead of giving all the tasks to MapReduce, it added YARN to manage resources and left MapReduce to work only on processing data.
Hadoop v1.0's first component is HDFS, the storage layer that spreads data across multiple machines. Its architecture is mainly composed of a unique, single NameNode and multiple DataNodes: the NN keeps the list of active DataNodes, coordinates which DataNodes are used, and holds the metadata of every operation that occurs, while the DataNodes act as containers for the files and are responsible for replicating them to ensure fault tolerance. Moreover, the NN has a backup called the secondary NameNode that can replace the main NN the moment a failure occurs. Still, the secondary NameNode is quite limited, which is why there are two further ways to ensure data availability and fault tolerance: the first is a Standby NameNode, a copy that has all the features of the active NN and takes the lead whenever the NN fails, and the second is high availability through shared storage, where the NN writes its metadata to both the local file system and a remote file system to keep it available, as the figure below shows.
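To give a feel for how replication and HA show up in practice, here is a minimal, hedged sketch using Hadoop's Java Configuration API. The nameservice name mycluster and the host names are made-up placeholders; a real cluster would carry these settings in hdfs-site.xml, so treat this as an illustration of the knobs, not a recipe.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsSettingsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Each HDFS block is copied to this many DataNodes (3 is the common
        // default); this replication is what gives HDFS its fault tolerance.
        conf.set("dfs.replication", "3");

        // HA sketch: one logical nameservice ("mycluster" is a hypothetical
        // name) backed by an active NameNode and a Standby NameNode.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        System.out.println("Replication factor: " + conf.get("dfs.replication"));
    }
}
```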
Let's sum up the story of HDFS with an example. A client has to communicate with the NN to ask to load or save files; the NN answers by pointing the client to the available DNs from which to collect, or to which to save, the files. Last but not least, those files get replicated across the rest of the DNs, and every operation that occurs is recorded in the metadata and edit logs exchanged between the NameNode and the secondary NameNode.
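To make that flow concrete, here is a small sketch of an HDFS client in Java: it writes a file and reads it back through the FileSystem API, which is exactly the NN-then-DN conversation described above. The cluster URI and the path /user/demo/hello.txt are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client first talks to the NameNode at this (hypothetical) address.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode.example.com:8020"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the NN picks DataNodes, and the client streams the blocks to them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NN returns the block locations, and the client reads from the DNs.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```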
We have stored our files and now we want to process them, and this is where MapReduce takes the lead thanks to its efficiency in dealing with massive volumes. Briefly, MR is an open-source parallel model for computation that passes through several steps: we start by splitting the input, mapping each record to its occurrences, shuffling them, and then reducing them to produce the output. MR gives you an easy way to handle your data, which is what made companies like Google use it behind search and Facebook use it for data mining, but its lack of real-time processing is what makes it vulnerable to Spark.
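The classic illustration of those split/map/shuffle/reduce steps is word count. Below is a sketch following the standard Hadoop Java MapReduce tutorial shape; the input and output paths come from the command line and are not tied to anything in this post.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each input line is split into words, emitted as (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: after the shuffle groups pairs by word, sum up the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```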
(Cluster = a group of machines pooling their compute resources)
There are daemons responsible for task management on the NN and DNs: the JobTracker and the TaskTrackers are the essential servers that communicate with HDFS. The JobTracker runs on the NN side and is unique in the cluster; it is liable for planning and handing out MR jobs and for managing resources. On the other hand, there can be several TaskTrackers in the cluster, running on the DNs, and they perform and execute those jobs.
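For a feel of how a client handed work to the JobTracker in Hadoop v1, here is a hedged sketch with the old org.apache.hadoop.mapred API. The jobtracker.example.com:8021 address is a made-up placeholder, and the job uses the library's identity mapper and reducer so the focus stays on the submission path rather than on any particular computation.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class V1SubmitSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(V1SubmitSketch.class);
        conf.setJobName("v1 passthrough job");

        // In Hadoop v1 the client submits to the single JobTracker;
        // this address is a hypothetical placeholder.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        // Identity mapper/reducer: the job just copies its input through,
        // enough to watch the JobTracker plan it and the TaskTrackers run it.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // JobClient talks to the JobTracker, which schedules the map and
        // reduce tasks onto the TaskTrackers running next to the DataNodes.
        JobClient.runJob(conf);
    }
}
```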
To conclude, Hadoop is suitable for newbies and small teams who want to explore Big Data and how it works, from storing data in HDFS to assigning tasks and processing them, especially since the new version (v2.0) added YARN. Moreover, its simplicity is what makes Hadoop shine, along with its low-cost mechanism, its flexibility, its speed in dealing with massive files, and its fault tolerance. However, for larger companies there may be better options.
Thank you for taking the time to read my blog; I would really appreciate your feedback so I can improve. It might not be the best, but the highest places begin with small steps.