Monday, February 29, 2016

Apache Hive Data File Formats



The Apache Hive data file formats: 
Below are the most common Hive data file formats, which can be used for processing a variety of data within the Hive and Hadoop ecosystems.

  • TEXTFILE
  • SEQUENCEFILE
  • RCFILE
  • ORC
  • AVRO
  • PARQUET

TEXTFILE

  • The simplest data format to use, with whatever delimiters you prefer.
  • Also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE (see the example after this list).
  • Data can be shared with other Hadoop-related tools, such as Pig.
  • Can also be processed with Unix text tools such as grep, sed, and awk.
  • Also convenient for viewing or editing files manually.
  • It is not space-efficient compared to binary formats, which offer other advantages beyond simplicity.
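
For example, a minimal sketch of a text-backed table with explicit delimiters; the table name, columns, and delimiter choices are hypothetical:

    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;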

SEQUENCEFILE

  • The first alternative to Hive's default file format.
  • Can be specified using the STORED AS SEQUENCEFILE clause during table creation (see the example after this list).
  • Files are flat files consisting of binary key-value pairs.
  • At runtime, Hive queries are compiled into MapReduce jobs, during which records are assigned the appropriate key-value pairs.
  • It is a standard format supported by Hadoop itself, and is therefore a natural choice when sharing files between Hive and other Hadoop-related tools.
  • It is less suitable for use with tools outside the Hadoop ecosystem.
  • When needed, sequence files can be compressed at the block or record level, which is very useful for optimizing disk space utilization and I/O, while still allowing files to be split on block boundaries for parallel processing.
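
As referenced above, a minimal sketch of a SequenceFile-backed table with block-level compression enabled; the table name and columns are hypothetical, while the SET properties are standard Hive/Hadoop settings:

    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;

    CREATE TABLE employees_seq (
      id   INT,
      name STRING
    )
    STORED AS SEQUENCEFILE;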

RCFILE

  • RCFile = Record Columnar File
  • An efficient internal (binary) format, natively supported by Hive.
  • Used when column-oriented organization is a good storage option for certain types of data and applications.
  • If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
  • Makes column compression very efficient, especially for low-cardinality columns.
  • Also, some column-oriented stores do not need to physically store null columns.
  • Stores the columns of a table in a record-columnar way:
  • It first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way.
  • It stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.
  • Since RCFiles cannot be read with plain text editors, the rcfilecat tool displays their contents (see the example below).
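
As referenced above, a minimal sketch; the table name, columns, and the data file path passed to rcfilecat are hypothetical:

    CREATE TABLE employees_rc (
      id   INT,
      name STRING
    )
    STORED AS RCFILE;

    -- From the OS shell (not the Hive CLI), view an RCFile's contents:
    -- hive --service rcfilecat /user/hive/warehouse/employees_rc/000000_0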

ORC

  • ORC = Optimized Row Columnar 
  • Designed to overcome limitations of other Hive file formats; provides a highly efficient way to store Hive data.
  • Stores data as groups of row data called stripes, along with auxiliary information in a file footer.
  • A postscript section at the end of the file holds compression parameters and the size of the compressed footer.
  • The default stripe size is 250 MB.
  • Large stripe sizes enable large, efficient reads from HDFS.
  • Thus, performance improves when reading, writing, and processing data.
  • It has many advantages over the RCFile format (see the example after this list), such as:
    • A single file as the output of each task, which reduces the NameNode's load
    • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
    • Light-weight indexes stored within the file
    • Skip row groups that don't pass predicate filtering
    • Seek to a given row
    • Block-mode compression based on data type
    • Run-length encoding for integer columns
    • Dictionary encoding for string columns
    • Concurrent reads of the same file using separate RecordReaders
    • Ability to split files without scanning for markers
    • Bound the amount of memory needed for reading or writing
    • Metadata stored using Protocol Buffers, which allows addition and removal of fields
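
As referenced above, a minimal sketch of an ORC-backed table; the table name and columns are hypothetical, and orc.compress is a standard ORC table property (ZLIB is the default; SNAPPY and NONE are alternatives):

    CREATE TABLE employees_orc (
      id   INT,
      name STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY');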

AVRO


  • One of the more recent Apache Hadoop-related projects.
  • A language-neutral data serialization system.
  • Handles multiple data formats that can be processed by multiple languages.
  • Relies on a schema.
  • Uses JSON for defining data structure schemas, types, and protocols (see the example after this list).
  • Stores data structure definitions along with the data, in an easy-to-process form.
  • It includes support for numeric types, arrays, maps, enums, variable- and fixed-length binary data, and strings.
  • It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
  • Data structures can specify a sort order, so faster sorting is possible without deserialization.
  • The data created in one programming language can be sorted by another.
  • When data is read, the schema used for writing it is always present and available, permitting fast serialization and deserialization with minimal per-record overhead.
  • It serializes data in a compact binary format.
  • It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
  • An additional advantage of storing the full data structure definition with the data is that it permits the data to be written faster and more compactly, without a need to process metadata separately.
  • As a file format, Avro stores data in a predefined format and can be used by any of the Hadoop tools, such as Pig and Hive, and by programming languages such as Java, Python, and more.
  • Lets one define Remote Procedure Call (RPC) protocols. Although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.
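
As referenced above, a minimal sketch of an Avro-backed Hive table with an inline JSON schema (assuming Hive 0.14 or later for the STORED AS AVRO clause; the table, record, and field names are hypothetical):

    CREATE TABLE users_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": ["null", "int"], "default": null}
      ]
    }');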
W.I.P

PARQUET 



    W.I.P



    References:

    https://cwiki.apache.org/
    https://avro.apache.org







    Apache Hive's pros and cons...

    Pros/advantages:
    • It is built on top of the Hadoop distributed framework.
    • Helps query large datasets residing in distributed storage.
    • It is a distributed data warehouse.
    • Queries data using a SQL-like language called HiveQL (HQL); see the examples after this list.
    • HiveQL is a declarative language, like SQL.
    • Table structures are similar to tables in a relational database.
    • Multiple users can simultaneously query the data using HiveQL.
    • Allows writing custom MapReduce processes to perform more detailed data analysis.
    • Data extract/transform/load (ETL) can be done easily.
    • It provides structure on top of a variety of data formats.
    • Allows access to files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
    • Converting between the various file formats within Hive is simple (see the examples after this list).

    Cons/limitations:
    • It is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).
    • Hive supports overwriting or appending data, but not updates and deletes.
    • Sub-queries are not fully supported in Hive.
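
    As referenced in the lists above, a minimal sketch of a declarative HiveQL query and of a file format conversion; all table and column names are hypothetical:

    -- A declarative, SQL-like HiveQL query:
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department;

    -- Converting a table from one file format (e.g., TEXTFILE) to another (e.g., ORC):
    CREATE TABLE employees_converted
    STORED AS ORC
    AS SELECT * FROM employees;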

    Thanks!

    Saturday, January 23, 2016

    How to find All or Any Apache Hive function quickly?



    To find all functions, or any specific Apache Hive function, type the following in the Hive interactive shell/CLI:

    SHOW FUNCTIONS;
    DESCRIBE FUNCTION <function_name>;
    DESCRIBE FUNCTION EXTENDED <function_name>;

    Example:
    hive> show functions;
    OK
    !
    !=
    %
    &
    *
    +
    -
    /
    <
    <=
    <=>
    <>
    =
    ==
    >
    >=
    ^
    abs
    acos
    add_months
    and
    array
    array_contains
    ascii
    asin
    assert_true
    atan
    avg
    base64
    between
    bin
    case
    cbrt
    ceil
    ceiling
    coalesce
    collect_list
    collect_set
    compute_stats
    concat
    concat_ws
    context_ngrams
    conv
    corr
    cos
    count
    covar_pop
    covar_samp
    create_union
    cume_dist
    current_database
    current_date
    current_timestamp
    current_user
    date_add
    date_format
    date_sub
    datediff
    day
    dayofmonth
    decode
    degrees
    dense_rank
    div
    e
    elt
    encode
    ewah_bitmap
    ewah_bitmap_and
    ewah_bitmap_empty
    ewah_bitmap_or
    exp
    explode
    factorial
    field
    find_in_set
    first_value
    floor
    format_number
    from_unixtime
    from_utc_timestamp
    get_json_object
    greatest
    hash
    hex
    histogram_numeric
    hour
    if
    in
    in_file
    index
    initcap
    inline
    instr
    isnotnull
    isnull
    java_method
    json_tuple
    lag
    last_day
    last_value
    lcase
    lead
    least
    length
    levenshtein
    like
    ln
    locate
    log
    log10
    log2
    lower
    lpad
    ltrim
    map
    map_keys
    map_values
    matchpath
    max
    min
    minute
    month
    months_between
    named_struct
    negative
    next_day
    ngrams
    noop
    noopstreaming
    noopwithmap
    noopwithmapstreaming
    not
    ntile
    nvl
    or
    parse_url
    parse_url_tuple
    percent_rank
    percentile
    percentile_approx
    pi
    pmod
    posexplode
    positive
    pow
    power
    printf
    radians
    rand
    rank
    reflect
    reflect2
    regexp
    regexp_extract
    regexp_replace
    repeat
    reverse
    rlike
    round
    row_number
    rpad
    rtrim
    second
    sentences
    shiftleft
    shiftright
    shiftrightunsigned
    sign
    sin
    size
    sort_array
    soundex
    space
    split
    sqrt
    stack
    std
    stddev
    stddev_pop
    stddev_samp
    str_to_map
    struct
    substr
    substring
    sum
    tan
    to_date
    to_unix_timestamp
    to_utc_timestamp
    translate
    trim
    trunc
    ucase
    unbase64
    unhex
    unix_timestamp
    upper
    var_pop
    var_samp
    variance
    weekofyear
    when
    windowingtablefunction
    xpath
    xpath_boolean
    xpath_double
    xpath_float
    xpath_int
    xpath_long
    xpath_number
    xpath_short
    xpath_string
    year
    |
    ~
    Time taken: 0.099 seconds, Fetched: 216 row(s)
    hive> 
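    To look up a single function, or to filter the list with a regular expression, try the below (a sketch; descriptions and timings are illustrative and may vary by Hive version):

    hive> DESCRIBE FUNCTION upper;
    OK
    upper(str) - Returns str with all characters changed to uppercase
    Time taken: 0.023 seconds, Fetched: 1 row(s)
    hive> SHOW FUNCTIONS "xpath.*";
    OK
    xpath
    xpath_boolean
    xpath_double
    xpath_float
    xpath_int
    xpath_long
    xpath_number
    xpath_short
    xpath_string
    Time taken: 0.012 seconds, Fetched: 9 row(s)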
    

    How to resolve "Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError" in Apache Hive?

    When you run Apache Hive from the command prompt:

    /user/home: hive <enter>

    You may see the below error on the Hive interactive shell:
    Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    [ERROR] Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

    Resolution for the initialization failed error:

    a) Add the below line to the .bash_profile file, so that the user classpath (including Hive's newer jline JAR) is loaded before the older jline version bundled with Hadoop:

    export HADOOP_USER_CLASSPATH_FIRST=true

    b) Rerun Apache Hive:
    /user/home: hive <enter>

    c) On success, we get the Apache Hive interactive shell:
    Logging initialized using configuration in jar:file:/usr/lib/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive>
    Thanks!

    Friday, January 22, 2016

    How to install and run Apache Hive in local mode (standalone)?



    To install and run Apache Hive in local (standalone) mode:

    Step 1: Download the following software:

    a) Java: your system must have Java ( http://www.oracle.com/technetwork/java/javaee/downloads/index.html ) installed before installing Hadoop and Hive.

    b) Apache Hadoop framework: for Apache Hive in standalone mode, we need to point to an installed Hadoop directory. I am using the hadoop-2.5.2.tar.gz release from https://hadoop.apache.org/releases.html

    c) Apache Hive framework: I am using the apache-hive-1.2.1-bin.tar.gz release from https://hive.apache.org/downloads.html

    Step 2: Uncompress the downloaded gzipped tar files (.tgz or .tar.gz):

    a) Uncompress/install the Java software:

    tar xvzf jdk-8u71-linux-i586.tar.gz
    Please follow the platform-appropriate instructions if required.

    b) Uncompress the Apache Hadoop framework at your Unix/Linux/OS X command prompt:

    tar xvzf hadoop-2.5.2.tar.gz

    The above command uncompresses the software into the hadoop-2.5.2 folder.

    c) Move the hadoop-2.5.2 directory from the uncompressed location to the user's library directory (renaming appropriately if required):

    /usr/lib/apache-hadoop-2.5.2

    d) Uncompress the Apache Hive framework at your Unix/Linux/OS X command prompt:

    tar xvzf apache-hive-1.2.1-bin.tar.gz

    The above command uncompresses the software files into the apache-hive-1.2.1-bin folder.

    e) Move the apache-hive-1.2.1-bin directory from the uncompressed location to the current user's library directory (renaming appropriately if required):

    /usr/lib/apache-hive-1.2.1

    Step 3: Configure the user's .bash_profile as below:

    export JAVA_HOME="/usr/lib/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home"

    export PATH="${JAVA_HOME}/bin:${PATH}"

    #HADOOP_HOME for Hive installation
    export HADOOP_HOME="/usr/lib/apache-hadoop-2.5.2"

    export PATH="${HADOOP_HOME}/bin:${PATH}"


    #Apache Hive home path
    export HIVE_HOME="/usr/lib/apache-hive-1.2.1"
    export PATH="${HIVE_HOME}/bin:${PATH}"

    Step 4: Run Apache Hive from command prompt:

    /user/home: hive <enter>

    a) The above may return the below error:
    Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    [ERROR] Terminal initialization failed; falling back to unsupported

    java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

    Step 5: Resolve the initialization failed error:

    a) Add the below line to the .bash_profile file, so that the user classpath (including Hive's newer jline JAR) is loaded before the older jline version bundled with Hadoop:

    export HADOOP_USER_CLASSPATH_FIRST=true

    b) Rerun Apache Hive:

    /user/home: hive <enter>

    c) On success, we get the Apache Hive interactive shell:

    Logging initialized using configuration in jar:file:/usr/lib/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive>

    d) You can view the Apache Hive default database:

    hive> show databases; <enter>
    OK
    default      -- note: this is Hive's default database
    Time taken: 1.994 seconds, Fetched: 1 row(s)

    e) Try to create a new database for testing:


    hive> create database test_db; <enter>

    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Unable to create database path file:/user/hive/warehouse/test_db.db, failed to create database test_db)

    Step 6: Resolve the "Unable to create database path" error:

    a) To be able to create and store Apache Hive database files locally, add the following line to the .bash_profile file and restart Apache Hive:

    export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
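
    For reference, a sketch of what each -hiveconf setting above is intended to do:

    # mapred.job.tracker=local           -> run MapReduce jobs in local mode
    # fs.default.name=file:///tmp        -> use the local filesystem in place of HDFS
    # hive.metastore.warehouse.dir=...   -> store table data under file:///tmp/warehouse
    # javax.jdo.option.ConnectionURL=... -> create an embedded Derby metastore database
    #                                       in /tmp/metastore_db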

    b) Quit and restart your terminal window (or source your .bash_profile) to pick up the new setting.

    c) Rerun Apache Hive:

    /user/home: hive <enter>

    Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive> show databases;
    OK
    default
    Time taken: 1.652 seconds, Fetched: 1 row(s)
    hive> create database test_db;
    OK
    Time taken: 0.323 seconds
    hive> show databases;
    OK
    default
    test_db
    Time taken: 0.035 seconds, Fetched: 2 row(s)

    hive> 

    d) Now you will be able to access Hive in local (standalone) mode and create databases and tables without any errors or warnings.

    I will be writing more blog posts to share new things in Apache Hive. Please keep reading. Thanks!

    References:

    • http://www.dummies.com/how-to/content/how-to-get-started-with-apache-hive.html
    • https://cwiki.apache.org/confluence/display/Hive/LanguageManual