Apache Hive data file formats:
Below are the most common Hive data file formats one can use for processing a variety of data within the Hive and Hadoop ecosystems.
- TEXTFILE
- SEQUENCEFILE
- RCFILE
- ORC
- AVRO
- PARQUET
TEXTFILE
- The simplest data format to use, with whatever delimiters you prefer.
- Also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE.
- Data can be shared with other Hadoop-related tools, such as Pig.
- Can also be processed with Unix text tools like grep, sed, and awk.
- Also convenient for viewing or editing files manually.
- It is not space-efficient compared to binary formats, which offer other advantages over TEXTFILE beyond just simplicity.
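For illustration, a minimal sketch of a TEXTFILE table with custom delimiters (the employees table and its columns are hypothetical):

    -- TEXTFILE is the default; STORED AS TEXTFILE is shown here for clarity
    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

Because the resulting files are plain text, they remain directly usable with grep, sed, awk, and similar tools.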
SEQUENCEFILE
- The first alternative to Hive's default file format.
- Can be specified using the "STORED AS SEQUENCEFILE" clause during table creation.
- Files are flat files consisting of binary key-value pairs.
- At runtime, Hive queries are compiled into MapReduce jobs, during which records are generated with the appropriate key-value pairs.
- It is a standard format supported by Hadoop itself, so files can be shared natively between Hive and other Hadoop-related tools.
- It’s less suitable for use with tools outside the Hadoop ecosystem.
- When needed, sequence files can be compressed at the record or block level, which is very useful for optimizing disk space utilization and I/O, while still supporting the ability to split files on block boundaries for parallel processing.
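A sketch of a SEQUENCEFILE table with block-level compression enabled; the events_seq table is hypothetical, and the property names shown are the classic Hive/Hadoop ones, which can differ across versions:

    -- Compress query output and choose block-level compression
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;

    CREATE TABLE events_seq (
      event_id INT,
      payload  STRING
    )
    STORED AS SEQUENCEFILE;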
RCFILE
- RCFile = Record Columnar File
- An efficient internal (binary) format, natively supported by Hive.
- Used when column-oriented organization is a good storage option for the data and applications in question.
- If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
- Makes column compression very efficient, especially for low-cardinality columns.
- Also, some column-oriented stores do not physically need to store null columns.
- Stores the columns of a table in a record-columnar way.
- It first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way.
- It stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.
- Hive provides an rcfilecat tool to display the contents of RCFiles from the command line, since RCFiles cannot be viewed with simple text editors.
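For example, an RCFile-backed table can be declared as follows and its files inspected with the rcfilecat service rather than a text editor (the clicks_rc table and the file path are hypothetical):

    CREATE TABLE clicks_rc (
      user_id INT,
      url     STRING
    )
    STORED AS RCFILE;

    -- RCFiles are binary; view them with:
    --   hive --service rcfilecat /user/hive/warehouse/clicks_rc/000000_0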
ORC
- ORC = Optimized Row Columnar
- Designed to overcome limitations of the other Hive file formats, providing a highly efficient way to store Hive data.
- Stores data as groups of row data called stripes, along with auxiliary information in a file footer.
- Holds compression parameters and the size of the compressed footer in a postscript section at the end of the file.
- The default stripe size is 250 MB.
- Large stripe sizes enable large, efficient reads from HDFS.
- Thus, performance improves during reading, writing, and processing data.
- It has many advantages over the RCFile format, such as:
- A single file as the output of each task, which reduces the NameNode's load
- Hive Type support including datetime, decimal, and the complex types (struct, list, map, and union)
- Light-weight indexes stored within the file, which make it possible to:
  - skip row groups that don't pass predicate filtering
  - seek to a given row
- Block-mode compression based on data type:
  - run-length encoding for integer columns
  - dictionary encoding for string columns
- concurrent reads of the same file using separate RecordReaders
- Ability to split files without scanning for markers
- Bound the amount of memory needed for reading or writing
- Metadata stored using Protocol Buffers, which allows addition and removal of fields
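A minimal sketch of an ORC-backed table; orc.compress and orc.stripe.size are standard ORC table properties, while the orders_orc table itself is hypothetical:

    -- orc.compress selects the codec; orc.stripe.size is given in bytes
    CREATE TABLE orders_orc (
      order_id INT,
      amount   DECIMAL(10,2),
      created  TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB", "orc.stripe.size"="268435456");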
AVRO
- One of the more recent Apache Hadoop-related projects.
- A language-neutral data serialization system.
- Handles multiple data formats that can be processed by multiple languages.
- Relies on schemas.
- Uses JSON for defining data structure schemas, types, and protocols.
- Stores data structure definitions along with the data, in an easy-to-process form.
- It includes support for integers and other numeric types, arrays, maps, enums, variable- and fixed-length binary data, and strings.
- It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
- Data structures can specify sort order.
- Faster sorting is possible without deserialization.
- Data created in one programming language can be sorted by another.
- When data is read, the schema used for writing it is always present and available, permitting fast deserialization of records with minimal per-record overhead.
- It serializes data in a compact binary format.
- It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to Hadoop services.
- An additional advantage of storing the full data structure definition with the data is that it permits the data to be written faster and more compactly without a need to process metadata separately.
- As a file format, Avro stores data in a predefined structure and can be used from any of Hadoop's tools, like Pig and Hive, and from programming languages like Java, Python, and more.
- Lets one define Remote Procedure Call (RPC) protocols. Although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.
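As a sketch, a Hive table stored as Avro with its schema supplied inline as JSON via the avro.schema.literal property (STORED AS AVRO requires a reasonably recent Hive; the users_avro table and its schema are hypothetical):

    CREATE TABLE users_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal'='{
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id",    "type": "int"},
        {"name": "name",  "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }');

Because the schema travels with the data, the same files can later be read from Pig, Java, Python, or any other Avro-aware consumer.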
PARQUET
W.I.P
References:
https://cwiki.apache.org/
https://avro.apache.org