The new Aggregation Framework from mongodb (available in version 2.1 and above) is pretty useful...rather than write mapreduce functions, you can use their aggregation framework to do simple group by, count(*), max and sum type of queries.
--This is how you insert into a collection based on query on another collection
db.basecollection.aggregate( [
{ $group: { _id: "$c1",
rowcount: { $sum: 1 },
lastupdated: { $max: "$c2" } } }
], {$sort:{'lastupdated':1}} ).result.forEach(function(doc){
db.aggcollection.insert(doc)});
The equivalent sql would be something like:
insert into your_aggregate_table (category, rowcount, last_updated)
select category, count(*), max(update_date)
from your_base_table
group by category
--update (insert if don't exist) a collection based on query on another collection
db.basecollection.aggregate( [
{ $group: { _id: "$c1",
rowcount: { $sum: 1 },
lastupdated: { $max: "$c2" } } }
]).result.forEach(function(doc){
db.aggcollection.update({_id: doc._id}, { $set: {rowcount: doc.rowcount, lastupdated: doc.lastupdated}}, {upsert:true})
});
IBM Infosphere Datastage, Netezza, Microstrategy, SSIS, SSRS,Hadoop, BI, ETL, Data warehouse, SQL Server, Tableau, mongoDB
Friday, March 22, 2013
Thursday, March 21, 2013
useful unix commands for data processing
I use IBM Datastage for ETL at work. Datastage has an "Execute Command Activity" which allows us to issue an command on the operating system, in our case it's a linux.
Below are the commands that have come in very handy and efficient in helping me process the data or augment my data workflow.
To convert a single column into a comma separated row of values:
paste -s -d, your_input_file
To get last row:
tail -1 your_input_file
Below are the commands that have come in very handy and efficient in helping me process the data or augment my data workflow.
To convert a single column into a comma separated row of values:
paste -s -d, your_input_file
To get last row:
tail -1 your_input_file
Subscribe to:
Posts (Atom)