At my workplace, we already store a lot of files in our HDFS, and I wanted to create Impala tables against them. But our files are stored as LZO-compressed files, and as of Impala 1.1 you cannot create tables over LZO files through Impala. You can, however, create them in Hive and then query them in Impala.
Our typical folder structure looks like this:
So basically we create a folder for each day of data.
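For example, listing the data directory in HDFS would show something like this (the dates are illustrative, based on the partition paths used further down):

```shell
$ hdfs dfs -ls /data/mylogfiles
drwxr-xr-x   - etl hadoop          0 2013-01-01 23:59 /data/mylogfiles/20130101
drwxr-xr-x   - etl hadoop          0 2013-01-02 23:59 /data/mylogfiles/20130102
drwxr-xr-x   - etl hadoop          0 2013-01-03 23:59 /data/mylogfiles/20130103
```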
At first I tried this and this doesn't work:
create external table mylogfile (
  mydate timestamp,   -- illustrative columns; your schema will differ
  message string
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED
fields terminated by '\t'
stored as textfile
location '/data/mylogfiles/*/';  -- doesn't work!!! wildcards are not allowed in LOCATION
Then I read about partitions, and this works, although I do think it's a hassle to have to execute a statement to add a new partition for each folder of data in our HDFS. If you know of a better way, please feel free to leave it in the comments section.
create external table mylogtable (
  mydate timestamp,   -- illustrative columns; your schema will differ
  message string
)
PARTITIONED BY (folderdate BIGINT)
ROW FORMAT DELIMITED
fields terminated by '\t';  -- leave the location empty
Note: when you first go to Impala after creating the table in Hive, you will need to issue these 2 commands, or the table will not be visible in Impala.
You only need to create the table in Hive; then you add the partitions in impala-shell.
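The two commands are most likely the following, run in impala-shell; this pairing is an assumption based on how Impala's catalog works, not something spelled out above:

```sql
-- Assumed commands, run in impala-shell:
INVALIDATE METADATA;   -- make tables created outside Impala (e.g. in Hive) visible
REFRESH mylogtable;    -- pick up new data files for an already-visible table
```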
ALTER TABLE mylogtable ADD PARTITION (folderdate=20130101) LOCATION '/data/mylogfiles/20130101';
ALTER TABLE mylogtable ADD PARTITION (folderdate=20130102) LOCATION '/data/mylogfiles/20130102';
You can do a quick sanity check after addition of each partition...the rowcount should go up.
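The sanity check can be a simple count, using the table and partition names from the example above:

```sql
SELECT COUNT(*) FROM mylogtable;   -- total should go up after each ADD PARTITION

-- or count a single partition to confirm its files were picked up:
SELECT COUNT(*) FROM mylogtable WHERE folderdate = 20130101;
```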
Notice that the partition key is not an existing column of the table. You just pick a value for the partition and assign the appropriate folder to it. In my example above, I am effectively naming the partition "20130101" for the files located in the HDFS folder "/data/mylogfiles/20130101". The value is arbitrary as long as it matches the partition column's type; since folderdate is a BIGINT, I could have used folderdate=1 for that folder if I wanted to.
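To illustrate that the partition value is just a label of your choosing (this exact statement is hypothetical, not from my cluster):

```sql
-- Any value of the right type works; the LOCATION is what actually maps the data.
ALTER TABLE mylogtable ADD PARTITION (folderdate=1) LOCATION '/data/mylogfiles/20130101';
```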
The real use of partitions, though, is to speed up queries by reading only the necessary data, if you know which partitions the data resides in.
For example, if you are looking for data with mydate = 2013-11-02, and you know that the data resides in the 20131102 and 20131103 partitions, you can issue a statement like this, which will speed up the query:
select * from mylogtable
where to_date(mydate) >= '2013-11-02' and to_date(mydate) < '2013-11-03'
and folderdate in (20131102, 20131103);  -- folderdate is the partition key, not an actual column name