My BI World: Hive / Impala - create external tables with data from subfolders

At my workplace, we already store a lot of files in our HDFS..and I wanted to create impala tables against them. But our files are stored as lzo compressed files, and as of Impala 1.1, you cannot create the tables that are using lzo files through Impala, but you can create them in Hive, then query them in Impala.

Our typical folder structure look like this:

/data/mylogfiles/20130101
/data/mylogfiles/20130102
/data/mylogfiles/20130103

So basically we create a folder for each day of data.

At first I tried this and this doesn't work:
create external table mylogfile
    (
    keyword string,
    campaign_id string,
    mydate string,
    mytime string,
    display_site string,
    was_clicked int,
    cpc int,
    country string,
    placement string)
    PARTITIONED BY (log_date STRING)
    ROW FORMAT DELIMITED
    fields terminated by '\t'
    stored as textfile
    location '/data/mylogfiles/*/'; doesn't work!!!

Then I read about partitions and this works..although I do think it's a hassle to have to execute a statement to add a new partition for each folder of data in our hdfs. If you know of any other better way, please feel free to leave it in the comments section.

create external table mylogtable
    (
    keyword string,
    campaign_id string,
    mydate string,
    mytime string,
    display_site string,
    was_clicked int,
    cpc int,
    country string,
    placement string)
    PARTITIONED BY (folderdate BIGINT)
    ROW FORMAT DELIMITED
    fields terminated by '\t'; leave the location empty

Note: when you first go to impala after creating the table in Hive, you will need to issue these 2 commands or the table will not be visible in Impala:

invalidate metadata;
refresh mylogtable;

You only need to create the table in Hive, then you add the partitions in impala shell.

ALTER TABLE mylogtable ADD PARTITION (folderdate=20130101) LOCATION '/data/mylogfiles/20130101';

ALTER TABLE mylogtable ADD PARTITION (folderdate=20130102) LOCATION '/data/mylogfiles/20130102';

You can do a quick sanity check after addition of each partition...the rowcount should go up.

select count(*)
from mylogfile;

Notice that the partition key is not an existing column of the table. You just give the partition your own name, and assign the appropriate file to that partition. In my example above, I am effectively naming the partition "20130101" for the file located in hdfs folder: "/data/mylogfiles/20130101". I could have named the partition "mypartition1" if I wanted to.

The real use of partitions though, it to speed up query by reading only the data necessary if you know which partitions the data resides in.

For example, if you are looking for data with mydate = 2013-11-02, and you know that the data resides in the 20131102 and 20131103 partitions, you can issue a statement like this which will speed up the query:

select count(*)
from mylogtable
where to_date(mydate) >= '2013-11-02' and to_date(mydate) < '2013-11-03'
and folderdate in (20131102,20131103); --> folderdate is the partition key, not an actual columnname

4 comments:

UnknownOctober 11, 2014 at 9:54 AM
Thank you for another great article. Where else could anyone get that kind of information in such a perfect way of writing? I have a presentation next week, and I am on the look for such information.

Local Seo Singapore
UnknownDecember 15, 2015 at 10:28 AM
This comment has been removed by the author.
UnknownDecember 15, 2015 at 10:42 AM
In beeline execute below command
MSCK REPAIR TABLE ;

In impala execute below command
invalidate metadata ;

Now do a select * from ;

We can see all partitioned data without using ALTER TABLE command.

Unfortunately MSCK REPAIR is not available in impala..

Cheers !!!
CryptoWizard101October 8, 2023 at 3:31 PM
goruntulu show
ücretli
AQWBİ

My BI World

Thursday, November 14, 2013

Hive / Impala - create external tables with data from subfolders

4 comments:

Popular Posts