- Velocity
- Variety
- Volume
No complex acronyms or ideas. Simple and crisp content. What is the wealthiest asset of the 21st century? Not oil! Not diamonds! It is data.
Saturday, January 10, 2015
Big Data Sources and Origins
Wednesday, January 7, 2015
Storage: DAS, NAS and SAN
Data is the soul of any organisation. If the soul gets bruised or tarnished, the whole body is affected; likewise, if data gets compromised, the organisation is at risk. So it matters a lot how you store your data. Data is the most vulnerable asset of the 21st century, and a company's reputation and revenue now depend, like never before, on the way it stores its data.
DAS (Direct-Attached Storage): The name says it all: the storage is directly attached to the server. So if I have four servers, I have four sets of storage, each attached individually to one server. If one server goes down, the data residing in its attached storage cannot be reached. But since the initial setup cost is low, a business running in a localised environment can go for this form of storage.
NAS (Network-Attached Storage): Here the storage is attached directly to the network and is fully capable of serving files over the network, whereas in DAS the server has to play the dual role of sharing files and providing its own services.
SAN (Storage Area Network): It's a high-performance storage network that transfers data between servers and storage devices. Here the storage devices sit on a network separate from the local area network. As the degree of sophistication and the cost are higher, SANs are used for mission-critical applications. Here a bunch of networked storage is connected to the servers over Fibre Channel, whose inherent property is speed.
I will discuss Fibre Channel and SCSI in a future post.
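As a rough illustration of why Fibre Channel's speed matters for a SAN, here is a back-of-the-envelope sketch comparing transfer times at different nominal link rates. The 100 GB dataset size is an assumption for illustration; 1 Gb/s (Gigabit Ethernet) and 16 Gb/s (a common Fibre Channel generation) are nominal line rates, and real throughput is lower due to protocol overhead.

```python
# Back-of-the-envelope: time to move 100 GB over different links.
# Speeds are nominal line rates; real-world throughput is lower.

def transfer_seconds(data_gb, link_gbps):
    """Seconds to move data_gb gigabytes over a link_gbps link."""
    return data_gb * 8 / link_gbps  # 8 bits per byte

print(transfer_seconds(100, 1))   # 1 Gb/s Ethernet       -> 800.0 s
print(transfer_seconds(100, 16))  # 16 Gb/s Fibre Channel -> 50.0 s
```

The point is not the exact numbers but the order of magnitude: a dedicated high-speed storage fabric moves the same data more than an order of magnitude faster than a commodity LAN link.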
Tuesday, January 6, 2015
HDFS-enabled Storage, Data Lake, Data Hub
Data hub: Cloudera defines it as an engineered system designed to analyse and process big data in place, so the data needn't be moved to be used.
Data lake: an already existing huge repository where vast amounts of data are stored and managed, but from which, traditionally, data has to be moved out into a big data infrastructure to be analysed.
By layering this data hub (Cloudera Enterprise) over the data lake (Isilon), Cloudera and EMC believe they can remove the cycle of moving data to a separate big data infrastructure.
Understanding Data Locality
In 1990, a typical disk drive had a capacity of about 1.3 GB and a read speed of 4.4 MB/s, so reading the entire drive took roughly 5 minutes. Now a typical disk drive holds 1 TB, nearly 800 times the capacity, while the read speed is only about 100 MB/s. Reading the entire drive now takes 1 TB / (100 MB/s) = 10^4 seconds, which is approximately 2 hours 47 minutes.
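The arithmetic above can be checked in a few lines. The figures (1.3 GB at 4.4 MB/s for 1990; 1 TB at 100 MB/s today) are the ones quoted in the paragraph:

```python
# Full-drive read times, then and now, using the figures from the text.

def read_time_seconds(capacity_mb, speed_mb_per_s):
    """Seconds to read a whole drive sequentially."""
    return capacity_mb / speed_mb_per_s

# 1990: 1.3 GB drive at 4.4 MB/s
t_1990 = read_time_seconds(1.3 * 1000, 4.4)      # ~295 s, about 5 minutes

# Today: 1 TB drive at 100 MB/s
t_now = read_time_seconds(1 * 1000 * 1000, 100)  # 10,000 s, about 2 h 47 min

print(round(t_1990))   # 295
print(t_now / 3600)    # ~2.78 hours
```

So while capacity grew nearly 800-fold, read speed grew only about 23-fold, and that widening gap is exactly what makes reading a single large drive so slow.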
So it makes sense to use a distributed system. But in a traditional distributed system, data moves from storage to the computing node, so even though processor speed has improved a lot, the processor still has to wait for the data to arrive. Hence the Hadoop Distributed File System, where the data does not move to the node where the computation happens; instead, the code moves to the data. This is what is called data locality.
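A minimal sketch of why moving code beats moving data. The 1 Gb/s cluster network speed and the ~50 KB job package size are my illustrative assumptions, not figures from the post:

```python
# Data locality in numbers: shipping 1 TB of data to the compute node
# versus shipping a small code package to where the data lives.
# NETWORK_GBPS and CODE_BYTES are assumptions for illustration.

NETWORK_GBPS = 1  # assumed cluster network line rate

def ship_seconds(size_bytes, gbps=NETWORK_GBPS):
    """Seconds to move size_bytes over a gbps-speed network link."""
    return size_bytes * 8 / (gbps * 1e9)

DATA_BYTES = 1e12  # 1 TB dataset
CODE_BYTES = 50e3  # ~50 KB job script/jar (illustrative)

print(ship_seconds(DATA_BYTES))  # 8000 s: over 2 hours just moving data
print(ship_seconds(CODE_BYTES))  # 0.0004 s: effectively free
```

Moving the data costs hours; moving the code costs microseconds. That asymmetry is the whole argument for running the computation on the node that already holds the data.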