Monday, February 29, 2016

Apache Hive Data File Formats



The Apache Hive data file formats:
Below are the most commonly used Hive data file formats, which can be used for processing a variety of data within Hive and the Hadoop ecosystem.

  • TEXTFILE
  • SEQUENCEFILE
  • RCFILE
  • ORC
  • AVRO
  • PARQUET

TEXTFILE

  • The simplest data format to use, with whatever delimiters you prefer.
  • Also the default format, equivalent to creating a table with the STORED AS TEXTFILE clause (see the sketch after this list).
  • Data can be shared with other Hadoop-related tools, such as Pig.
  • Can also be processed with Unix text tools such as grep, sed, and awk.
  • Also convenient for viewing or editing files manually.
  • It is not space efficient compared to binary formats, which offer other advantages over TEXTFILE beyond simplicity.
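
A minimal sketch of creating and loading a TEXTFILE table with an explicit delimiter; the table name, columns, and file path are made up for illustration:

    -- Hypothetical tab-delimited text table (names and columns are illustrative).
    CREATE TABLE employees_text (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    -- Load a plain text file straight into the table.
    LOAD DATA LOCAL INPATH '/tmp/employees.tsv' INTO TABLE employees_text;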

SEQUENCEFILE

  • The first alternative to the Hive default file format.
  • Can be specified using the STORED AS SEQUENCEFILE clause during table creation (see the sketch after this list).
  • Files are flat files consisting of binary key-value pairs.
  • At runtime, Hive queries are processed as MapReduce jobs, during which records are converted into the appropriate key-value pairs.
  • It is a standard format supported by Hadoop itself, so files can be shared natively between Hive and other Hadoop-related tools.
  • It is less suitable for use with tools outside the Hadoop ecosystem.
  • When needed, sequence files can be compressed at the record or block level, which is very useful for optimizing disk space utilization and I/O, while still supporting the ability to split files on block boundaries for parallel processing.
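
A minimal sketch of writing block-compressed SequenceFile output; it assumes the standard Hive/Hadoop compression properties, and the table names and codec choice are illustrative:

    -- Hypothetical block-compressed SequenceFile table (settings and names are illustrative).
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    CREATE TABLE employees_seq (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS SEQUENCEFILE;

    -- Populate it from the text table; the MapReduce job emits compressed key-value records.
    INSERT OVERWRITE TABLE employees_seq
    SELECT id, name, salary FROM employees_text;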

RCFILE

  • RCFile = Record Columnar File
  • An efficient internal (binary) format natively supported by Hive.
  • Used when column-oriented organization is a good storage option for certain types of data and applications.
  • If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
  • Makes column compression very efficient, especially for low-cardinality columns.
  • Also, some column-oriented stores do not need to physically store null columns.
  • Stores the columns of a table in a record columnar way.
  • It first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way.
  • It stores the metadata of a row split as the key part of a record, and all the data of that row split as the value part.
  • RCFiles cannot be viewed with simple text editors; use the rcfilecat tool to display their contents (see the sketch after this list).
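
A minimal sketch of an RCFile table and of inspecting one of its binary data files with the rcfilecat service; the table names and warehouse path are placeholders:

    -- Hypothetical column-oriented RCFile table (names are illustrative).
    CREATE TABLE employees_rc (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS RCFILE;

    INSERT OVERWRITE TABLE employees_rc
    SELECT id, name, salary FROM employees_text;

    -- RCFiles are binary, so view them from the shell with rcfilecat rather than a text editor
    -- (the file path below is a placeholder):
    --   hive --service rcfilecat /user/hive/warehouse/employees_rc/000000_0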

ORC

  • ORC = Optimized Row Columnar 
  • Designed to overcome limitations of other Hive file formats and to provide a highly efficient way to store Hive data (see the sketch after this list).
  • Stores data as groups of row data called stripes, along with auxiliary information in a file footer.
  • A postscript section at the end of the file holds the compression parameters and the size of the compressed footer.
  • The default stripe size is 250 MB.
  • Large stripe sizes enable large, efficient reads from HDFS.
  • Thus, performance improves when reading, writing, and processing data.
  • It has many advantages over the RCFile format, such as:
    • A single file as the output of each task, which reduces the NameNode's load
    • Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union)
    • Light-weight indexes stored within the file
    • Skip row groups that don't pass predicate filtering
    • Seek to a given row
    • Block-mode compression based on data type
    • Run-length encoding for integer columns
    • Dictionary encoding for string columns
    • Concurrent reads of the same file using separate RecordReaders
    • Ability to split files without scanning for markers
    • Bound the amount of memory needed for reading or writing
    • Metadata stored using Protocol Buffers, which allows addition and removal of fields
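
A minimal sketch of an ORC table with explicit compression and stripe-size table properties; the table names and property values are illustrative, not recommendations:

    -- Hypothetical ORC table (names and property values are illustrative).
    CREATE TABLE employees_orc (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'    = 'ZLIB',       -- block-mode compression codec
      'orc.stripe.size' = '268435456'   -- stripe size in bytes
    );

    INSERT OVERWRITE TABLE employees_orc
    SELECT id, name, salary FROM employees_text;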

AVRO


  • One of the relatively newer Apache Hadoop-related projects.
  • A language-neutral data serialization system.
  • Handles multiple data formats that can be processed by multiple languages.
  • Relies on a schema.
  • Uses JSON to define data structures, types, and protocols.
  • Stores data structure definitions along with the data, in an easy-to-process form.
  • Includes support for integers and other numeric types, arrays, maps, enums, variable- and fixed-length binary data, and strings.
  • It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
  • Data structures can specify a sort order, so faster sorting is possible without deserialization.
  • Data created in one programming language can be sorted by another.
  • When data is read, the schema used when writing it is always present, which permits each record to be serialized quickly with minimal per-record overhead.
  • It serializes data in a compact binary format.
  • It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
  • An additional advantage of storing the full data structure definition with the data is that the data can be written faster and more compactly, without a need to process metadata separately.
  • As a file format, Avro stores data in a predefined structure and can be used with Hadoop tools such as Pig and Hive, and from programming languages such as Java and Python (see the sketch after this list).
  • It also lets one define Remote Procedure Call (RPC) protocols; although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.
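
A minimal sketch of an Avro-backed Hive table whose schema is defined inline as JSON; it assumes a Hive version whose AvroSerDe supports STORED AS AVRO and the avro.schema.literal property, and the record, field, and table names are made up:

    -- Hypothetical Avro-backed table; the JSON schema defines the record structure.
    CREATE TABLE employees_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "namespace": "com.example",
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "id",     "type": "int"},
        {"name": "name",   "type": "string"},
        {"name": "salary", "type": "double"}
      ]
    }');

    INSERT OVERWRITE TABLE employees_avro
    SELECT id, name, salary FROM employees_text;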
W.I.P

PARQUET 



    W.I.P



    References:

    https://cwiki.apache.org/
    https://avro.apache.org





