Monday, February 29, 2016

Apache Hive Data File Formats



The Apache Hive data file formats:
Below are the most commonly used Hive data file formats, which can be used for processing a variety of data within Hive and the Hadoop ecosystem.

  • TEXTFILE
  • SEQUENCEFILE
  • RCFILE
  • ORC
  • AVRO
  • PARQUET

TEXTFILE

  • The simplest data format to use, with whatever delimiters you prefer.
  • Also the default format, equivalent to creating a table with the STORED AS TEXTFILE clause (see the sketch after this list).
  • Data can be shared with other Hadoop-related tools, such as Pig.
  • Can also be processed with Unix text tools such as grep, sed, and awk.
  • Also convenient for viewing or editing files manually.
  • It is not space efficient compared to binary formats, which offer other advantages over TEXTFILE beyond simplicity.
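
A minimal sketch of creating and loading a TEXTFILE table with an explicit delimiter; the table name, columns, and file path are made up for illustration:

    -- Hypothetical tab-delimited text table (names and columns are illustrative).
    CREATE TABLE employees_text (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    -- Load a plain text file straight into the table.
    LOAD DATA LOCAL INPATH '/tmp/employees.tsv' INTO TABLE employees_text;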

SEQUENCEFILE

  • The first alternative to the Hive default file format.
  • Can be specified using the STORED AS SEQUENCEFILE clause during table creation (see the sketch after this list).
  • Files are flat files consisting of binary key-value pairs.
  • At runtime, Hive queries are processed as MapReduce jobs, during which records are converted into the appropriate key-value pairs.
  • It is a standard format supported by Hadoop itself, so files can be shared natively between Hive and other Hadoop-related tools.
  • It is less suitable for use with tools outside the Hadoop ecosystem.
  • When needed, sequence files can be compressed at the record or block level, which is very useful for optimizing disk space utilization and I/O, while still supporting the ability to split files on block boundaries for parallel processing.
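
A minimal sketch of writing block-compressed SequenceFile output; it assumes the standard Hive/Hadoop compression properties, and the table names and codec choice are illustrative:

    -- Hypothetical block-compressed SequenceFile table (settings and names are illustrative).
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    CREATE TABLE employees_seq (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS SEQUENCEFILE;

    -- Populate it from the text table; the MapReduce job emits compressed key-value records.
    INSERT OVERWRITE TABLE employees_seq
    SELECT id, name, salary FROM employees_text;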

RCFILE

  • RCFile = Record Columnar File
  • An efficient internal (binary) format natively supported by Hive.
  • Used when column-oriented organization is a good storage option for certain types of data and applications.
  • If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
  • Makes column compression very efficient, especially for low-cardinality columns.
  • Also, some column-oriented stores do not need to physically store null columns.
  • Stores the columns of a table in a record columnar way.
  • It first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way.
  • It stores the metadata of a row split as the key part of a record, and all the data of that row split as the value part.
  • RCFiles cannot be viewed with simple text editors; use the rcfilecat tool to display their contents (see the sketch after this list).
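
A minimal sketch of an RCFile table and of inspecting one of its binary data files with the rcfilecat service; the table names and warehouse path are placeholders:

    -- Hypothetical column-oriented RCFile table (names are illustrative).
    CREATE TABLE employees_rc (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS RCFILE;

    INSERT OVERWRITE TABLE employees_rc
    SELECT id, name, salary FROM employees_text;

    -- RCFiles are binary, so view them from the shell with rcfilecat rather than a text editor
    -- (the file path below is a placeholder):
    --   hive --service rcfilecat /user/hive/warehouse/employees_rc/000000_0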

ORC

  • ORC = Optimized Row Columnar 
  • Designed to overcome limitations of other Hive file formats and to provide a highly efficient way to store Hive data (see the sketch after this list).
  • Stores data as groups of row data called stripes, along with auxiliary information in a file footer.
  • A postscript section at the end of the file holds the compression parameters and the size of the compressed footer.
  • The default stripe size is 250 MB.
  • Large stripe sizes enable large, efficient reads from HDFS.
  • Thus, performance improves when reading, writing, and processing data.
  • It has many advantages over the RCFile format, such as:
    • A single file as the output of each task, which reduces the NameNode's load
    • Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union)
    • Light-weight indexes stored within the file
    • Skip row groups that don't pass predicate filtering
    • Seek to a given row
    • Block-mode compression based on data type
    • Run-length encoding for integer columns
    • Dictionary encoding for string columns
    • Concurrent reads of the same file using separate RecordReaders
    • Ability to split files without scanning for markers
    • Bound the amount of memory needed for reading or writing
    • Metadata stored using Protocol Buffers, which allows addition and removal of fields
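
A minimal sketch of an ORC table with explicit compression and stripe-size table properties; the table names and property values are illustrative, not recommendations:

    -- Hypothetical ORC table (names and property values are illustrative).
    CREATE TABLE employees_orc (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'    = 'ZLIB',       -- block-mode compression codec
      'orc.stripe.size' = '268435456'   -- stripe size in bytes
    );

    INSERT OVERWRITE TABLE employees_orc
    SELECT id, name, salary FROM employees_text;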

AVRO


  • One of the relatively newer Apache Hadoop-related projects.
  • A language-neutral data serialization system.
  • Handles multiple data formats that can be processed by multiple languages.
  • Relies on a schema.
  • Uses JSON to define data structures, types, and protocols.
  • Stores data structure definitions along with the data, in an easy-to-process form.
  • Includes support for integers and other numeric types, arrays, maps, enums, variable- and fixed-length binary data, and strings.
  • It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
  • Data structures can specify a sort order, so faster sorting is possible without deserialization.
  • Data created in one programming language can be sorted by another.
  • When data is read, the schema used when writing it is always present, which permits each record to be serialized quickly with minimal per-record overhead.
  • It serializes data in a compact binary format.
  • It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
  • An additional advantage of storing the full data structure definition with the data is that the data can be written faster and more compactly, without a need to process metadata separately.
  • As a file format, Avro stores data in a predefined structure and can be used with Hadoop tools such as Pig and Hive, and from programming languages such as Java and Python (see the sketch after this list).
  • It also lets one define Remote Procedure Call (RPC) protocols; although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.
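
A minimal sketch of an Avro-backed Hive table whose schema is defined inline as JSON; it assumes a Hive version whose AvroSerDe supports STORED AS AVRO and the avro.schema.literal property, and the record, field, and table names are made up:

    -- Hypothetical Avro-backed table; the JSON schema defines the record structure.
    CREATE TABLE employees_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "namespace": "com.example",
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "id",     "type": "int"},
        {"name": "name",   "type": "string"},
        {"name": "salary", "type": "double"}
      ]
    }');

    INSERT OVERWRITE TABLE employees_avro
    SELECT id, name, salary FROM employees_text;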
W.I.P

PARQUET 



    W.I.P



    References:

    https://cwiki.apache.org/
    https://avro.apache.org





