Monday, February 29, 2016
Apache Hive Data File Formats
The Apache Hive data file formats:
Below are the most common Hive data file formats, which one can use to process a variety of data within the Hive and Hadoop ecosystems.
- TEXTFILE
- SEQUENCEFILE
- RCFILE
- ORC
- AVRO
- PARQUET
TEXTFILE
- The simplest data format to use, with whatever delimiters you prefer.
- Also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE (see the sketch after this list).
- Data can be shared with other Hadoop-related tools, such as Pig.
- Can also be processed with Unix text tools like grep, sed, and awk.
- Also convenient for viewing or editing files manually.
- It is not space-efficient compared to binary formats, which offer other advantages over TEXTFILE beyond simplicity.
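A minimal sketch of a delimited TEXTFILE table (the table and column names are hypothetical):
-- Hypothetical example: a comma-delimited TEXTFILE table.
CREATE TABLE employees_txt (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;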
SEQUENCEFILE
- The first alternative to Hive's default file format.
- Can be specified using “STORED AS SEQUENCEFILE” clause during table creation.
- Files have a flat structure consisting of binary key-value pairs.
- At runtime, Hive queries are processed as MapReduce jobs, during which records are encoded as the appropriate key-value pairs.
- It is a standard format supported by Hadoop itself, so files can be shared natively between Hive and other Hadoop-related tools.
- It’s less suitable for use with tools outside the Hadoop ecosystem.
- When needed, sequence files can be compressed at the block and record level, which is very useful for optimizing disk space utilization and I/O, while still supporting the ability to split files on block boundaries for parallel processing (see the sketch after this list).
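A minimal sketch of a SequenceFile table with block-level compression (the table name, columns, and compression settings shown are illustrative assumptions):
-- Hypothetical example: a SequenceFile table with block-level output compression.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE logs_seq (
  ts      STRING,
  message STRING
)
STORED AS SEQUENCEFILE;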
RCFILE
- RCFile = Record Columnar File
- An efficient internal (binary) format natively supported by Hive.
- Useful when column-oriented organization is a good storage option for certain types of data and applications.
- If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
- Makes column compression very efficient, especially for low-cardinality columns.
- Also, some column-oriented stores do not physically need to store null columns.
- Stores the columns of a table in a record columnar manner.
- It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way.
- It stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.
- Use the rcfilecat tool to display the contents of RCFiles from the Hive command line, since RCFiles cannot be viewed with plain-text editors (see the sketch after this list).
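A minimal sketch of an RCFile-backed table, plus the rcfilecat invocation (the table name and file path are hypothetical):
-- Hypothetical example: an RCFile-backed table.
CREATE TABLE clicks_rc (
  user_id BIGINT,
  url     STRING
)
STORED AS RCFILE;

-- From the shell, inspect a stored RCFile (path is illustrative):
-- hive --service rcfilecat /tmp/warehouse/clicks_rc/000000_0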
ORC
- ORC = Optimized Row Columnar
- Designed to overcome limitations of the other Hive file formats and provide a highly efficient way to store Hive data.
- Stores data as groups of row data called stripes, along with auxiliary information in a file footer.
- A postscript section at the end of the file holds the compression parameters and the size of the compressed footer.
- The default stripe size is 250 MB.
- Large stripe sizes enable large, efficient reads from HDFS.
- Thus, performance improves during reading, writing, and processing data.
- It has many advantages over the RCFile format (see the sketch after this list), such as:
- A single file as the output of each task, which reduces the NameNode's load
- Hive Type support including datetime, decimal, and the complex types (struct, list, map, and union)
- light-weight indexes stored within the file
- skip row groups that don't pass predicate filtering
- seek to a given row
- block-mode compression based on data type
- run-length encoding for integer columns
- dictionary encoding for string columns
- concurrent reads of the same file using separate RecordReaders
- Ability to split files without scanning for markers
- Bound the amount of memory needed for reading or writing
- Metadata stored using Protocol Buffers, which allows addition and removal of fields
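A minimal sketch of an ORC table that overrides the compression codec and stripe size via table properties (the table and the property values are illustrative assumptions):
-- Hypothetical example: an ORC table with explicit table properties.
CREATE TABLE sales_orc (
  order_id   BIGINT,
  amount     DECIMAL(10,2),
  order_date DATE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB", "orc.stripe.size"="268435456");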
AVRO
- One of the more recent Apache Hadoop-related projects.
- A language-neutral data serialization system.
- Handles multiple data formats that can be processed by multiple languages.
- Relies on schemas.
- Uses JSON to define data structure schemas, types, and protocols.
- Stores data structure definitions along with the data, in an easy-to-process form.
- It includes support for integers, numeric types, arrays, maps, enums, variable- and fixed-length binary data, and strings.
- It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
- Data structures can specify sort order, so faster sorting is possible without deserialization.
- The data created in one programming language can be sorted by another.
- When data is read, the schema used for writing it is always present and available, permitting fast record deserialization with minimal per-record overhead.
- It serializes data in a compact binary format.
- It can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
- An additional advantage of storing the full data structure definition with the data is that it permits the data to be written faster and more compactly, without a need to process metadata separately.
- As a file format, Avro stores data in a predefined format and can be used with Hadoop tools like Pig and Hive, and from programming languages like Java, Python, and more (see the sketch after this list).
- Lets one define Remote Procedure Call (RPC) protocols. Although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.
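A minimal sketch of an Avro-backed Hive table; with Hive 0.14+ the Avro schema is derived from the column definitions (the table and columns are hypothetical):
-- Hypothetical example: an Avro-backed table (Hive 0.14+ syntax).
CREATE TABLE users_avro (
  user_id BIGINT,
  email   STRING
)
STORED AS AVRO;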
PARQUET
W.I.P
References:
https://cwiki.apache.org/
https://avro.apache.org
Apache Hive's pros and cons...
Pros/advantages:
- It is built on top of the Hadoop distributed framework.
- Helps query large datasets residing in distributed storage.
- It is a distributed data warehouse.
- Queries data using a SQL-like language called HiveQL (HQL).
- HiveQL is a declarative language like SQL.
- Table structure/s is/are similar to tables in a relational database.
- Multiple users can simultaneously query the data using Hive-QL.
- Allows writing custom MapReduce processes to perform more detailed data analysis.
- Data extract/transform/load (ETL) can be done easily.
- It projects structure onto a variety of data formats.
- Allows access to files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase (see the sketch after this list).
- Converting from one data format to another within Hive is straightforward.
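A minimal illustration of HiveQL's SQL-like syntax (the table and columns are hypothetical):
-- Hypothetical example: an aggregate query in HiveQL.
SELECT department,
       COUNT(*)    AS headcount,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department;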
Cons/limitations:
- It is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).
- Hive supports overwriting or appending data, but not updates and deletes (see the workaround sketch after this list).
- Sub-queries are supported only in limited contexts in Hive.
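Since row-level updates are unavailable, a common workaround is to rewrite the table in place; a minimal sketch, assuming a hypothetical employees table:
-- Hypothetical sketch: emulate an UPDATE by overwriting the whole table.
INSERT OVERWRITE TABLE employees
SELECT id,
       name,
       CASE WHEN id = 42 THEN salary * 1.10 ELSE salary END AS salary
FROM employees;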
Thanks!
Saturday, January 23, 2016
How to find All or Any Apache Hive function quickly?
To find all functions or any specific Apache Hive function, type the below on the Hive interactive shell/CLI:
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
Example:
hive> show functions;
OK
! != % & * + - / < <= <=> <> = == > >= ^ abs acos add_months and array array_contains ascii asin assert_true atan avg base64 between bin case cbrt ceil ceiling coalesce collect_list collect_set compute_stats concat concat_ws context_ngrams conv corr cos count covar_pop covar_samp create_union cume_dist current_database current_date current_timestamp current_user date_add date_format date_sub datediff day dayofmonth decode degrees dense_rank div e elt encode ewah_bitmap ewah_bitmap_and ewah_bitmap_empty ewah_bitmap_or exp explode factorial field find_in_set first_value floor format_number from_unixtime from_utc_timestamp get_json_object greatest hash hex histogram_numeric hour if in in_file index initcap inline instr isnotnull isnull java_method json_tuple lag last_day last_value lcase lead least length levenshtein like ln locate log log10 log2 lower lpad ltrim map map_keys map_values matchpath max min minute month months_between named_struct negative next_day ngrams noop noopstreaming noopwithmap noopwithmapstreaming not ntile nvl or parse_url parse_url_tuple percent_rank percentile percentile_approx pi pmod posexplode positive pow power printf radians rand rank reflect reflect2 regexp regexp_extract regexp_replace repeat reverse rlike round row_number rpad rtrim second sentences shiftleft shiftright shiftrightunsigned sign sin size sort_array soundex space split sqrt stack std stddev stddev_pop stddev_samp str_to_map struct substr substring sum tan to_date to_unix_timestamp to_utc_timestamp translate trim trunc ucase unbase64 unhex unix_timestamp upper var_pop var_samp variance weekofyear when windowingtablefunction xpath xpath_boolean xpath_double xpath_float xpath_int xpath_long xpath_number xpath_short xpath_string year | ~
(The shell prints one function per line; the list is wrapped above for brevity.)
Time taken: 0.099 seconds, Fetched: 216 row(s)
hive>
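For a single function, DESCRIBE FUNCTION prints a short usage summary. For example, describing upper (output reproduced from a typical Hive 1.2 shell; the exact text may vary by version):
hive> describe function upper;
OK
upper(str) - Returns str with all characters changed to uppercase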
How to resolve "Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError" in Apache Hive?
When you run Apache Hive from the command prompt:
/user/home: hive <enter>
You may see the below error or warning on the Hive interactive shell:
Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
This happens because Hadoop ships an older jline JAR that conflicts with the newer jline bundled with Hive 1.2; loading the user classpath first lets Hive's jline take precedence.
Initialization failed error resolution:
a) To resolve the error, add the below line into the .bash profile file so the Apache Hadoop user classpath is initialized first:
export HADOOP_USER_CLASSPATH_FIRST=true
b) Rerun Apache Hive again:
/user/home: hive <enter>
c) On success, we will get the Apache Hive interactive shell:
Logging initialized using configuration in jar:file:/usr/lib/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>
Thanks!
Friday, January 22, 2016
How to install and run Apache Hive in local mode (standalone)?
For installing and running Apache Hive in local mode (standalone):
Step 1: Download the following software:
a) Java software: Your system must have Java ( http://www.oracle.com/technetwork/java/javaee/downloads/index.html ) before the Hadoop and Hive installation.
b) Apache Hadoop framework software:
For Apache Hive in standalone mode, we need to point to the Hadoop installation directory. I am using the hadoop-2.5.2.tar.gz release from https://hadoop.apache.org/releases.html
c) Apache Hive framework software:
I am using the apache-hive-1.2.1-bin.tar.gz release from https://hive.apache.org/downloads.html
Step 2: Uncompress the downloaded gzipped tar files (.tgz or .tar.gz):
a) Uncompress / install the Java software:
tar xvzf jdk-8u71-linux-i586.tar.gz
Please follow the appropriate instructions if required.
b) Uncompress the Apache Hadoop framework at your Unix/Linux/OS X command prompt:
tar xvzf hadoop-2.5.2.tar.gz
The above command uncompresses the software into the hadoop-2.5.2 folder.
c) Move the hadoop-2.5.2 directory from the uncompressed location to the user's library directory (if required, rename appropriately):
/usr/lib/apache-hadoop-2.5.2
d) Uncompress the Apache Hive framework at your Unix/Linux/OS X command prompt:
tar xvzf apache-hive-1.2.1-bin.tar.gz
The above command uncompresses the software into the apache-hive-1.2.1-bin folder.
e) Move the apache-hive-1.2.1-bin directory from the uncompressed location to the current user's library directory (if required, rename appropriately):
/usr/lib/apache-hive-1.2.1
Step 3: Configure the user's .bash profile as below:
export JAVA_HOME="/usr/lib/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home"
#Java Path
export PATH="${JAVA_HOME}/bin:${PATH}"
#HADOOP_HOME for Hive Installation
export HADOOP_HOME="/usr/lib/apache-hadoop-2.5.2"
export PATH="${HADOOP_HOME}/bin:${PATH}"
#Apache Hive Home Path
export PATH="/usr/lib/apache-hive-1.2.1/bin:${PATH}"
Step 4: Run Apache Hive from the command prompt:
/user/home: hive <enter>
a) The above may return the below error or warning:
Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Step 5: Resolve initialization failed error:
a) To resolve the error, add the below line into the .bash profile file so the Apache Hadoop user classpath is initialized first:
export HADOOP_USER_CLASSPATH_FIRST=true
b) Rerun the Apache Hive again:
/user/home: hive <enter>
c) On success, we will get the Apache Hive interactive shell:
Logging initialized using configuration in jar:file:/usr/lib/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>
d) You can test/view the Apache Hive default database:
hive> show databases; <enter>
OK
default --- please note that this is Hive's default database
Time taken: 1.994 seconds, Fetched: 1 row(s)
d) Try to create a new database for your testing:
hive> create database test_db; <enter>
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Unable to create database path file:/user/hive/warehouse/test_db.db, failed to create database test_db)
Step 6: Resolve Unable to create database path file error:
a) To be able to create and store Apache Hive database files, add the following line into the .bash profile file and restart Apache Hive. It runs the job tracker locally, points the default filesystem and the metastore warehouse at local /tmp paths, and creates an embedded Derby metastore there:
export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
b) Quit your terminal window and reopen it so the above setting takes effect:
c) Rerun the Apache Hive again:
/user/home: hive <enter>
Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 1.652 seconds, Fetched: 1 row(s)
hive> create database test_db;
OK
Time taken: 0.323 seconds
hive> show databases;
OK
default
test_db
Time taken: 0.035 seconds, Fetched: 2 row(s)
hive>
d) Now you will be able to access Hive in local mode (standalone) and create databases/tables without any errors or warnings.
I will be creating more blog posts to share new things in Apache Hive. Please keep on reading. Thanks!
References:
- http://www.dummies.com/how-to/content/how-to-get-started-with-apache-hive.html
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual