Monday, February 29, 2016

Apache Hive Data File Formats



The Apache Hive data file formats: 
Below are the most common Hive data file formats, which can be used for processing a variety of data within the Hive and Hadoop ecosystems.

  • TEXTFILE
  • SEQUENCEFILE
  • RCFILE
  • ORC
  • AVRO
  • PARQUET

TEXTFILE

  • The simplest data format to use, with whatever delimiters you prefer.
  • Also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE (see the example after this list).
  • Data can be shared with other Hadoop-related tools, such as Pig.
  • Can also be processed with Unix text tools such as grep, sed, and awk.
  • Also convenient for viewing or editing files manually.
  • It is not space-efficient compared to binary formats, which offer other advantages beyond simplicity.
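
For example, a minimal sketch of a text-backed table with explicit delimiters; the table name, columns, and delimiter choices are hypothetical:

    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;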

SEQUENCEFILE

  • The first alternative to Hive's default file format.
  • Can be specified using the STORED AS SEQUENCEFILE clause during table creation (see the example after this list).
  • Files are flat files consisting of binary key-value pairs.
  • At runtime, Hive queries are compiled into MapReduce jobs, during which records are assigned the appropriate key-value pairs.
  • It is a standard format supported by Hadoop itself, and is therefore a natural choice when sharing files between Hive and other Hadoop-related tools.
  • It is less suitable for use with tools outside the Hadoop ecosystem.
  • When needed, sequence files can be compressed at the block or record level, which is very useful for optimizing disk space utilization and I/O, while still allowing files to be split on block boundaries for parallel processing.
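
As referenced above, a minimal sketch of a SequenceFile-backed table with block-level compression enabled; the table name and columns are hypothetical, while the SET properties are standard Hive/Hadoop settings:

    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;

    CREATE TABLE employees_seq (
      id   INT,
      name STRING
    )
    STORED AS SEQUENCEFILE;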

RCFILE

  • RCFile = Record Columnar File
  • An efficient internal (binary) format, natively supported by Hive.
  • Used when column-oriented organization is a good storage option for certain types of data and applications.
  • If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
  • Makes column compression very efficient, especially for low-cardinality columns.
  • Also, some column-oriented stores do not need to physically store null columns.
  • Stores the columns of a table in a record-columnar way:
  • It first partitions rows horizontally into row splits, and then vertically partitions each row split in a columnar way.
  • It stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.
  • Since RCFiles cannot be read with plain text editors, the rcfilecat tool displays their contents (see the example below).
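
As referenced above, a minimal sketch; the table name, columns, and the data file path passed to rcfilecat are hypothetical:

    CREATE TABLE employees_rc (
      id   INT,
      name STRING
    )
    STORED AS RCFILE;

    -- From the OS shell (not the Hive CLI), view an RCFile's contents:
    -- hive --service rcfilecat /user/hive/warehouse/employees_rc/000000_0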

ORC

  • ORC = Optimized Row Columnar 
  • Designed to overcome limitations of other Hive file formats; provides a highly efficient way to store Hive data.
  • Stores data as groups of row data called stripes, along with auxiliary information in a file footer.
  • A postscript section at the end of the file holds compression parameters and the size of the compressed footer.
  • The default stripe size is 250 MB.
  • Large stripe sizes enable large, efficient reads from HDFS.
  • Thus, performance improves when reading, writing, and processing data.
  • It has many advantages over the RCFile format (see the example after this list), such as:
    • A single file as the output of each task, which reduces the NameNode's load
    • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
    • Light-weight indexes stored within the file
    • Skip row groups that don't pass predicate filtering
    • Seek to a given row
    • Block-mode compression based on data type
    • Run-length encoding for integer columns
    • Dictionary encoding for string columns
    • Concurrent reads of the same file using separate RecordReaders
    • Ability to split files without scanning for markers
    • Bound the amount of memory needed for reading or writing
    • Metadata stored using Protocol Buffers, which allows addition and removal of fields
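
As referenced above, a minimal sketch of an ORC-backed table; the table name and columns are hypothetical, and orc.compress is a standard ORC table property (ZLIB is the default; SNAPPY and NONE are alternatives):

    CREATE TABLE employees_orc (
      id   INT,
      name STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY');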

AVRO


  • One of the more recent Apache Hadoop-related projects.
  • A language-neutral data serialization system.
  • Handles multiple data formats that can be processed by multiple languages.
  • Relies on a schema.
  • Uses JSON for defining data structure schemas, types, and protocols (see the example after this list).
  • Stores data structure definitions along with the data, in an easy-to-process form.
  • It includes support for numeric types, arrays, maps, enums, variable- and fixed-length binary data, and strings.
  • It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
  • Data structures can specify a sort order, so faster sorting is possible without deserialization.
  • The data created in one programming language can be sorted by another.
  • When data is read, the schema used for writing it is always present and available, permitting fast serialization and deserialization with minimal per-record overhead.
  • It serializes data in a compact binary format.
  • It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
  • An additional advantage of storing the full data structure definition with the data is that it permits the data to be written faster and more compactly, without a need to process metadata separately.
  • As a file format, Avro stores data in a predefined format and can be used by any of the Hadoop tools, such as Pig and Hive, and by programming languages such as Java, Python, and more.
  • Lets one define Remote Procedure Call (RPC) protocols. Although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.
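
As referenced above, a minimal sketch of an Avro-backed Hive table with an inline JSON schema (assuming Hive 0.14 or later for the STORED AS AVRO clause; the table, record, and field names are hypothetical):

    CREATE TABLE users_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": ["null", "int"], "default": null}
      ]
    }');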
W.I.P

PARQUET 



    W.I.P



    References:

    https://cwiki.apache.org/
    https://avro.apache.org







    Apache Hive's pros and cons...

    Pros/advantages:
    • It is built on top of the Hadoop distributed framework.
    • Helps query large datasets residing in distributed storage.
    • It is a distributed data warehouse.
    • Queries data using a SQL-like language called HiveQL (HQL); see the examples after this list.
    • HiveQL is a declarative language, like SQL.
    • Table structures are similar to tables in a relational database.
    • Multiple users can simultaneously query the data using HiveQL.
    • Allows writing custom MapReduce processes to perform more detailed data analysis.
    • Data extract/transform/load (ETL) can be done easily.
    • It provides structure on top of a variety of data formats.
    • Allows access to files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
    • Converting between the various file formats within Hive is simple (see the examples after this list).

    Cons/limitations:
    • It is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).
    • Hive supports overwriting or appending data, but not updates and deletes.
    • Sub-queries are not fully supported in Hive.
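
    As referenced in the lists above, a minimal sketch of a declarative HiveQL query and of a file format conversion; all table and column names are hypothetical:

    -- A declarative, SQL-like HiveQL query:
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department;

    -- Converting a table from one file format (e.g., TEXTFILE) to another (e.g., ORC):
    CREATE TABLE employees_converted
    STORED AS ORC
    AS SELECT * FROM employees;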

    Thanks!

    Saturday, January 23, 2016

    How to find All or Any Apache Hive function quickly?



    To find all functions, or any specific Apache Hive function, type the following in the Hive interactive shell/CLI:

    SHOW FUNCTIONS;
    DESCRIBE FUNCTION <function_name>;
    DESCRIBE FUNCTION EXTENDED <function_name>;

    Example:
    hive> show functions;
    OK
    !
    !=
    %
    &
    *
    +
    -
    /
    <
    <=
    <=>
    <>
    =
    ==
    >
    >=
    ^
    abs
    acos
    add_months
    and
    array
    array_contains
    ascii
    asin
    assert_true
    atan
    avg
    base64
    between
    bin
    case
    cbrt
    ceil
    ceiling
    coalesce
    collect_list
    collect_set
    compute_stats
    concat
    concat_ws
    context_ngrams
    conv
    corr
    cos
    count
    covar_pop
    covar_samp
    create_union
    cume_dist
    current_database
    current_date
    current_timestamp
    current_user
    date_add
    date_format
    date_sub
    datediff
    day
    dayofmonth
    decode
    degrees
    dense_rank
    div
    e
    elt
    encode
    ewah_bitmap
    ewah_bitmap_and
    ewah_bitmap_empty
    ewah_bitmap_or
    exp
    explode
    factorial
    field
    find_in_set
    first_value
    floor
    format_number
    from_unixtime
    from_utc_timestamp
    get_json_object
    greatest
    hash
    hex
    histogram_numeric
    hour
    if
    in
    in_file
    index
    initcap
    inline
    instr
    isnotnull
    isnull
    java_method
    json_tuple
    lag
    last_day
    last_value
    lcase
    lead
    least
    length
    levenshtein
    like
    ln
    locate
    log
    log10
    log2
    lower
    lpad
    ltrim
    map
    map_keys
    map_values
    matchpath
    max
    min
    minute
    month
    months_between
    named_struct
    negative
    next_day
    ngrams
    noop
    noopstreaming
    noopwithmap
    noopwithmapstreaming
    not
    ntile
    nvl
    or
    parse_url
    parse_url_tuple
    percent_rank
    percentile
    percentile_approx
    pi
    pmod
    posexplode
    positive
    pow
    power
    printf
    radians
    rand
    rank
    reflect
    reflect2
    regexp
    regexp_extract
    regexp_replace
    repeat
    reverse
    rlike
    round
    row_number
    rpad
    rtrim
    second
    sentences
    shiftleft
    shiftright
    shiftrightunsigned
    sign
    sin
    size
    sort_array
    soundex
    space
    split
    sqrt
    stack
    std
    stddev
    stddev_pop
    stddev_samp
    str_to_map
    struct
    substr
    substring
    sum
    tan
    to_date
    to_unix_timestamp
    to_utc_timestamp
    translate
    trim
    trunc
    ucase
    unbase64
    unhex
    unix_timestamp
    upper
    var_pop
    var_samp
    variance
    weekofyear
    when
    windowingtablefunction
    xpath
    xpath_boolean
    xpath_double
    xpath_float
    xpath_int
    xpath_long
    xpath_number
    xpath_short
    xpath_string
    year
    |
    ~
    Time taken: 0.099 seconds, Fetched: 216 row(s)
    hive> 
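    To look up a single function, or to filter the list with a regular expression, try the below (a sketch; descriptions and timings are illustrative and may vary by Hive version):

    hive> DESCRIBE FUNCTION upper;
    OK
    upper(str) - Returns str with all characters changed to uppercase
    Time taken: 0.023 seconds, Fetched: 1 row(s)
    hive> SHOW FUNCTIONS "xpath.*";
    OK
    xpath
    xpath_boolean
    xpath_double
    xpath_float
    xpath_int
    xpath_long
    xpath_number
    xpath_short
    xpath_string
    Time taken: 0.012 seconds, Fetched: 9 row(s)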
    

    How to resolve "Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError" in Apache Hive?

    When you run Apache Hive from the command prompt:

    /user/home: hive <enter>

    You may see the below error on the Hive interactive shell:
    Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    [ERROR] Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

    Resolution for the initialization failed error:

    a) Add the below line to the .bash_profile file, so that the user classpath (including Hive's newer jline JAR) is loaded before the older jline version bundled with Hadoop:

    export HADOOP_USER_CLASSPATH_FIRST=true

    b) Rerun Apache Hive:
    /user/home: hive <enter>

    c) On success, we get the Apache Hive interactive shell:
    Logging initialized using configuration in jar:file:/usr/lib/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive>
    Thanks!

    Friday, January 22, 2016

    How to install and run Apache Hive in local mode (standalone)?



    To install and run Apache Hive in local (standalone) mode:

    Step 1: Download the following software:

    a) Java: your system must have Java ( http://www.oracle.com/technetwork/java/javaee/downloads/index.html ) installed before installing Hadoop and Hive.

    b) Apache Hadoop framework: for Apache Hive in standalone mode, we need to point to an installed Hadoop directory. I am using the hadoop-2.5.2.tar.gz release from https://hadoop.apache.org/releases.html

    c) Apache Hive framework: I am using the apache-hive-1.2.1-bin.tar.gz release from https://hive.apache.org/downloads.html

    Step 2: Uncompress the downloaded gzipped tar files (.tgz or .tar.gz):

    a) Uncompress/install the Java software:

    tar xvzf jdk-8u71-linux-i586.tar.gz
    Please follow the platform-appropriate instructions if required.

    b) Uncompress the Apache Hadoop framework at your Unix/Linux/OS X command prompt:

    tar xvzf hadoop-2.5.2.tar.gz

    The above command uncompresses the software into the hadoop-2.5.2 folder.

    c) Move the hadoop-2.5.2 directory from the uncompressed location to the user's library directory (renaming appropriately if required):

    /usr/lib/apache-hadoop-2.5.2

    d) Uncompress the Apache Hive framework at your Unix/Linux/OS X command prompt:

    tar xvzf apache-hive-1.2.1-bin.tar.gz

    The above command uncompresses the software files into the apache-hive-1.2.1-bin folder.

    e) Move the apache-hive-1.2.1-bin directory from the uncompressed location to the current user's library directory (renaming appropriately if required):

    /usr/lib/apache-hive-1.2.1

    Step 3: Configure the user's .bash_profile as below:

    export JAVA_HOME="/usr/lib/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home"

    export PATH="${JAVA_HOME}/bin:${PATH}"

    #HADOOP_HOME for Hive installation
    export HADOOP_HOME="/usr/lib/apache-hadoop-2.5.2"

    export PATH="${HADOOP_HOME}/bin:${PATH}"


    #Apache Hive home path
    export HIVE_HOME="/usr/lib/apache-hive-1.2.1"
    export PATH="${HIVE_HOME}/bin:${PATH}"

    Step 4: Run Apache Hive from command prompt:

    /user/home: hive <enter>

    a) The above may return the below error:
    Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    [ERROR] Terminal initialization failed; falling back to unsupported

    java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

    Step 5: Resolve the initialization failed error:

    a) Add the below line to the .bash_profile file, so that the user classpath (including Hive's newer jline JAR) is loaded before the older jline version bundled with Hadoop:

    export HADOOP_USER_CLASSPATH_FIRST=true

    b) Rerun Apache Hive:

    /user/home: hive <enter>

    c) On success, we get the Apache Hive interactive shell:

    Logging initialized using configuration in jar:file:/usr/lib/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive>

    d) You can view the Apache Hive default database:

    hive> show databases; <enter>
    OK
    default      -- note: this is Hive's default database
    Time taken: 1.994 seconds, Fetched: 1 row(s)

    e) Try to create a new database for testing:


    hive> create database test_db; <enter>

    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Unable to create database path file:/user/hive/warehouse/test_db.db, failed to create database test_db)

    Step 6: Resolve the "Unable to create database path" error:

    a) To be able to create and store Apache Hive database files locally, add the following line to the .bash_profile file and restart Apache Hive:

    export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
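
    For reference, a sketch of what each -hiveconf setting above is intended to do:

    # mapred.job.tracker=local           -> run MapReduce jobs in local mode
    # fs.default.name=file:///tmp        -> use the local filesystem in place of HDFS
    # hive.metastore.warehouse.dir=...   -> store table data under file:///tmp/warehouse
    # javax.jdo.option.ConnectionURL=... -> create an embedded Derby metastore database
    #                                       in /tmp/metastore_db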

    b) Quit and restart your terminal window (or source your .bash_profile) to pick up the new setting.

    c) Rerun Apache Hive:

    /user/home: hive <enter>

    Logging initialized using configuration in jar:file:/Library/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive> show databases;
    OK
    default
    Time taken: 1.652 seconds, Fetched: 1 row(s)
    hive> create database test_db;
    OK
    Time taken: 0.323 seconds
    hive> show databases;
    OK
    default
    test_db
    Time taken: 0.035 seconds, Fetched: 2 row(s)

    hive> 

    d) Now you will be able to access Hive in local (standalone) mode and create databases and tables without any errors or warnings.

    I will be writing more blog posts to share new things in Apache Hive. Please keep reading. Thanks!

    References:

    • http://www.dummies.com/how-to/content/how-to-get-started-with-apache-hive.html
    • https://cwiki.apache.org/confluence/display/Hive/LanguageManual