Here are some code snippets for doing data manipulation in your *nix environment.
Say you have a text file test.txt containing:
a
b
c
and you want to convert it to a,b,c (in other words, convert rows to columns). You can run this:
tr -s '\n' ',' < test.txt
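Note that tr leaves a trailing comma on the output (a,b,c,). If you don't want it, paste from coreutils joins the lines without one:
paste -sd, test.txt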
Say your text file contains a,b,c and you want to single-quote each value, like 'a','b','c'. You can run this:
awk '{n=split($0,a,","); for (i = 1; i <= n; i++) print "'\''" a[i] "'\''," }' /tmp/your_csv_row.txt | tr -s '\n' ',' | sed 's/.$//'
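This works because tr -s squeezes the doubled commas down to a single comma and sed trims the trailing one. A shorter sed-only alternative should also work, assuming no empty fields:
sed "s/[^,]*/'&'/g" /tmp/your_csv_row.txt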
Wednesday, October 8, 2014
Impala create table, add partitions etc cheatsheet
I used Pig to convert the incoming CSV file to Parquet format, then in Hive created the external table:
create external table salestransactions
(
PartnerCode string,
TransactionNumber string,
TransactionEndDateTime string,
TransactionType string,
TransactionSubType string,
StoreCode string
)
PARTITIONED BY (batchdate BIGINT)
ROW FORMAT DELIMITED
fields terminated by '\t'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
Store the Parquet data files in this HDFS folder: /user/data/salestransactions_parquet/20141002
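If you need to copy the files into HDFS yourself (Pig usually writes there directly; the local file name below is illustrative), something like this works:
hdfs dfs -mkdir -p /user/data/salestransactions_parquet/20141002
hdfs dfs -put part-m-00000.parquet /user/data/salestransactions_parquet/20141002/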
Then, again in Hive, issue this command:
ALTER TABLE salestransactions ADD PARTITION (batchdate=20141002) LOCATION '/user/data/salestransactions_parquet/20141002';
Then in your impala-shell, if this is your first time accessing the table you just created in Hive, type "invalidate metadata", then type "refresh salestransactions". If you then type "show partitions salestransactions", you should see the partition you just added.
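In impala-shell, that sequence looks like this:
invalidate metadata;                   -- reload the catalog so Impala sees the new table
refresh salestransactions;             -- pick up the newly added partition and files
show partitions salestransactions;     -- batchdate=20141002 should now be listed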
To drop a partition:
alter table salestransactions drop partition (batchdate=20141002);
Thursday, February 13, 2014
Configuring Microstrategy 9.3.1 to work with Impala 1.2.3
We have Microstrategy 9.3.1 installed on CentOS 6.
Installing the driver and setting it up was relatively easy, as long as you follow the instructions exactly.
I downloaded the driver from Cloudera's page here:
http://www.cloudera.com/content/support/en/downloads/download-components/download-products/downloads-listing/connectors/microstrategy.html
I followed Cloudera's instructions here to install the ODBC connector: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/MicroStrategy/Cloudera-Connector-for-MicroStrategy/ccms_linux.html
Then I followed the instructions here to set things up on our Microstrategy linux box:
https://resource.microstrategy.com/support/Authoring/DisplayTN.aspx?formatted=1&tnkey=43595
I did have to open a ticket to get a PDS file from Microstrategy and install it so that I could see 'Impala 1.x' in the drop-down list when setting up a new database instance in Microstrategy admin.
odbc.ini file:
[ODBC Data Sources]
IMPALA-SERVER=Hive ODBC Driver
[IMPALA-SERVER]
Driver=/usr/lib/libhiveodbc.so.1
Description=Hive ODBC Driver
DATABASE=default
HOST=impalaservername
PORT=21050
FRAMED=0
Trace=Yes
TraceFile=/tmp/odbc.log
Type=HS2NoSasl
ODBC.sh in env folder:
#
# ODBC Driver for Hive
#
HIVE_CONFIG='/usr'
if [ "${HIVE_CONFIG}" != '<HIVE_CONFIG>' ]; then
export HIVE_CONFIG
mstr_append_path LD_LIBRARY_PATH "${HIVE_CONFIG:?}"/lib
export LD_LIBRARY_PATH
fi
It works and it's very exciting! I then created a quick free-form SQL report against our Impala table in Hadoop. We can finally have our users run reports directly against our Hadoop cluster. We'll have to do some more testing, but this is a very good start.
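As a quick sanity check outside Microstrategy, if unixODBC's isql utility is installed (and the DSN is visible to it, which depends on your ODBC manager setup), you can test the connection directly:
isql -v IMPALA-SERVER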
Tuesday, January 28, 2014
Deduping rows in Netezza / Deleting rows in Netezza.
Netezza has a hidden rowid for every row in all tables. To dedup, you can do this:
delete from omszip_stage
where rowid in (
select max(rowid) as rowid
from omszip_stage
group by couponid, zip, pid
having count(*) > 1);
Note that each run deletes only one duplicate per group (the one with the max rowid), so if a group has more than two copies, rerun the delete until no rows are affected.
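To verify, this check on the same grouping keys should eventually return no rows:
select couponid, zip, pid, count(*)
from omszip_stage
group by couponid, zip, pid
having count(*) > 1;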
Tuesday, December 3, 2013
bash script to extract dates from sql statements from history table
We had to figure out the usage against one of our biggest tables to decide how to archive it. We keep a history of all queries issued against the table, and I wanted to extract the date filter used in the SQL.
There are many ways to do it, but I wanted to do it in a bash script just for fun. It's probably not the most efficient approach, but I'm posting it anyway, mostly for my own benefit in case I need to remember how to do something in bash: how to find the position of a word in a string, how to extract words from a line, and so on.
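The two core tricks used below, in isolation (the sample string is made up):
s="select * from t where adddate >= '2013-11-01'"
pos=$(echo "$s" | awk '{print match(tolower($0), "where")}')   # 1-based position of "where", 0 if absent
echo "${s:pos-1}"                                              # bash substring from that (0-based) offset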
#! /bin/bash
#goal: to extract the dates used in where condition in sql ie adddate >= '2013-11-01' or daystring=20131105
nzsql -t -host serverhere -d databasehere -u usernamehere -W passwordhere -c "select QH_SESSIONID, QH_TSUBMIT, QH_USER, QH_SQL from QUERYHISTORY" > queryhistory.txt 2>&1
rm -f lhqueryhistory1.txt
while read line
do
if [ -z "$line" ];
then
echo "line is empty"
break
fi
sessionid=`echo $line | cut -d"|" -f1`
starttime=`echo $line | cut -d"|" -f2`
username=`echo $line | cut -d"|" -f3`
sql=`echo $line | cut -d"|" -f4`
#cut out sql starting from the where word
wherepos=$(echo $sql | awk '{print match(tolower($0), "where")}')
#if no where condition skip that sql
if [ ${wherepos:-0} -eq 0 ];
then
continue
fi
wheresql=${sql:wherepos-1}
#cut out sql starting from the adddate or daystring word
wherepos2=$(echo $wheresql | awk '{print match(tolower($0), "adddate|daystring")}')
if [ ${wherepos2:-0} -eq 0 ];
then
wheresql2="none and"
else
wherepos2=$(($wherepos2 - 1))
wheresql2=$(echo ${wheresql:${wherepos2}})
fi
#cut out sql up until the and, group or limit word
wherepos3=$(echo $wheresql2 | awk '{print match(tolower($0), "and|group|limit")}')
if [ ${wherepos3:-0} -eq 0 ];
then
wheresql3=$wheresql2
else
wherepos3=$(($wherepos3 - 1))
wheresql3=$(echo ${wheresql2:0:${wherepos3:-0}})
fi
#extract the value starting at the first single quote, if any
wherepos4=$(expr index "$wheresql3" "'")
if [ ${wherepos4:-0} -eq 0 ];
then
wheresql4=$wheresql3
else
wheresql4=${wheresql3:wherepos4-1}
fi
finalrow="$sessionid|$starttime|$username|$wheresql4"
echo $finalrow >> lhqueryhistory1.txt
done < queryhistory.txt
rm -f queryhistory.txt
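Assuming the script is saved as extract_dates.sh (the name is mine), run it and inspect the result:
chmod +x extract_dates.sh
./extract_dates.sh
head lhqueryhistory1.txt    # sessionid|starttime|username|extracted date filter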
Friday, November 22, 2013
Netezza ERROR: 65798 : no more space for TXMgr transactions array
Every now and then in Netezza, we'll get this error message: ERROR: 65798 : no more space for TXMgr transactions array.
When this happens, nobody can issue any query against the database, and the database is unusable.
There is a limit of approximately 65,000 transaction objects in Netezza. Once that limit is reached, the system will not accept any new queries. This usually isn't a problem, because as transactions complete, the slots taken up by those queries are freed and given to new queries. However, sometimes queries get stuck and hang onto their transaction objects until we go into the system and manually abort those sessions.
This script looks for queries that have been running for more than 24 hours (the threshold is an argument, so it's configurable), sends an email showing how long each has been running, the username, and the SQL command, and automatically kills the offending sessions.
#! /bin/bash
MaxMinutes="${1}"
if [ -z "$1" ]
then MaxMinutes=1440
fi
rowcount=0
nzsql -t -c "select id, conntime, username, dbname, command from _v_session where status='active' and command is not null;" > sessionstokill.txt 2>&1
CurrentTimeEpoch=$(date +%s)
message="Long Running Queries:\n----------------\n"
while read line
do
id=`echo $line | cut -d"|" -f1`
starttime=`echo $line | cut -d"|" -f2`
username=`echo $line | cut -d"|" -f3`
dbname=`echo $line | cut -d"|" -f4`
command=`echo $line | cut -d"|" -f5`
StartTimeEpoch=$(date -d "$starttime" +%s)
DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))
#if it has been running for more than X minutes (argument 1)
if [ $DifferenceInMinutes -ge $MaxMinutes ] && [ $id -gt 0 ];
then
rowcount=$((rowcount+1));
message+="SESSIONID: ${id} has been running for ${DifferenceInMinutes} minutes.\nUSERNAME:${username}\nDATABASE:${dbname}\nSQL:${command}\n"
#we only want to kill it if the queries came from these users
if [[ "$username" == *INTERNAL*_W* ]] || [[ "$username" == *ADMIN* ]];
then
message+="\n!!This session will be killed!!\n";
nzsession abortTxn -force -id $id
fi
message+="-----------------\n\n"
fi
done < sessionstokill.txt
if [ $rowcount -gt 0 ]
then
sendMail -dst "youremail@yourcompany.com" -msg "Long Running Queries / Hung Transactions NZ" -bodyText "$message"
fi
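If you want to run it on a schedule, a hypothetical cron entry (the script path and name are mine) that checks hourly with the 24-hour threshold:
# m h dom mon dow command
0 * * * * /opt/scripts/kill_long_queries.sh 1440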
Wednesday, November 20, 2013
Exporting resultset using Impala and chaining multiple commands
Using Hue's Impala UI, you can download a resultset as CSV or Excel, but the Impala shell offers a much more flexible way of exporting resultsets, since you can specify any delimiter you want. Another benefit over the Hue Impala UI (as of Impala 1.1) is that the UI cannot execute multiple commands; only the first statement is executed, even though no failure is reported.
If you want to chain a few commands, create a script file (it can have any extension, although the convention is .hql).
Put all your SQL commands in the file. The last command should select the output you want to save. To execute the file, type:
impala-shell -f myscript.hql -B -o results.txt --output_delimiter='\t';
The above command invokes the Impala shell, runs the commands stored in myscript.hql, writes the results to a file called "results.txt", and uses tab as the field delimiter.
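For a quick one-off query, you can also skip the script file and pass a single statement inline with -q, using the same output flags:
impala-shell -q "select to_date(adddate), count(*) from transactionmaster group by to_date(adddate)" -B -o results.txt --output_delimiter='\t'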
Contents of myscript.hql:
drop table if exists mytable;
create table mytable (adddate string, rowcount bigint);
insert into mytable
select to_date(adddate), count(*)
from transactionmaster
group by to_date(adddate);
select * from mytable;
You can then view the output in the file you specified, in my case:
cat results.txt