Apache Pig is designed to handle analysis of large data sets using a high-level language (Pig Latin) that allows for parallelisation. Pig Latin compiles to sequences of Map-Reduce programs that can be executed on Hadoop.
This post pulls together an archive of some Apache Pig tips tweeted as “Apache #Pig tip of the day”.
- 18th June 2013:
use the org.apache.pig.builtin.MonitoredUDF annotation to terminate your regular/Algebraic UDFs if they run for too long
- 17th June 2013:
HadoopJobHistoryLoader in the PiggyBank can be used to check for failed jobs, amongst other things.
- 14th June 2013:
visualise the execution plan as a directed acyclic graph using EXPLAIN, then pass the output through Graphviz dot
- 13th June 2013:
just like your favourite RDBMS, Pig has an EXPLAIN command – get logical, physical and MapReduce execution plans.
- 12th June 2013:
Know your path – the pig shell script tries to locate hadoop using ‘
- 11th June 2013:
Penny (a debug/tracing tool) users should be aware that it has been removed from trunk in 0.11 #tidyout
- 10th June 2013:
Pig can work with data serialized using Avro – the necessary AvroStorage and related classes are in the PiggyBank
- 7th June 2013:
PigUnit for unit testing Pig scripts. It defaults to local mode; set the pigunit.exectype.cluster property to run on MapReduce.
- 6th June 2013:
Amazon Elastic MapReduce supports Pig (0.9.x). You can run newer unsupported versions: https://forums.aws.amazon.com/thread.jspa?messageID=455015
- 5th June 2013:
the PiggyBank contains contributed Java UDFs. Very useful stuff in contrib/piggybank (caveat: they are ‘as-is’)
- 4th June 2013:
Pig supports user defined functions (UDFs). Write them in Java or (with less support) Python, JS, Ruby & #Groovy
- 3rd June 2013:
use the ILLUSTRATE command to exemplify a Pig Latin script with concise, complete and realistic data #iterate
- 2nd June 2013:
Pig provides a high-level language (Pig Latin) for data analysis that compiles to Hadoop Map-Reduce data flows.
- 31st May 2013:
‘pig -x local’ runs on a single machine; the default ‘-x mapreduce’ mode runs on a Hadoop cluster. #bigdata
- 30th May 2013:
Pig supports multiple Hadoop versions (0.20 by default) – make sure you set the
- 29th May 2013:
the -secretDebugCmd parameter shows the environment pig/hadoop will use (useful for Error 2998) #bigdata
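As a small illustration of the 4th June tip on UDFs, here is a rough sketch of what a Python (Jython) UDF might look like. The function name normalise_word, the file name udfs.py and the schema string are made up for this example. The sketch also assumes that Pig makes an outputSchema decorator available when it loads the script; the fallback below exists only so the file can be imported and tried out under plain Python.

```python
# Hypothetical UDF file, e.g. saved as udfs.py and registered from Pig Latin with:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   cleaned = FOREACH words GENERATE myfuncs.normalise_word(word);

try:
    outputSchema  # assumed to be defined already when Pig loads this file
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the module also imports under plain Python."""
        def decorate(func):
            func.outputSchema = schema  # keep the schema string for inspection
            return func
        return decorate


@outputSchema("word:chararray")
def normalise_word(word):
    """Trim and lower-case a word; nulls (None) are passed through untouched."""
    if word is None:
        return None
    return word.strip().lower()
```

Keeping the function free of Pig-specific imports means it can be exercised in an ordinary Python shell before it is registered in a script.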
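The 14th June tip (EXPLAIN output rendered with Graphviz) could be scripted roughly as follows. This is only a sketch: the helper names are invented, it assumes both pig and dot are on the PATH, and it relies on EXPLAIN's -script, -dot and -out options, so check them against the documentation for your Pig version. The names of the .dot files Pig writes under the output directory are deliberately not guessed here.

```python
import subprocess


def explain_command(script, out_dir, mode="local"):
    """Build the argv that asks Pig to write its plans in dot format.

    Paths are embedded unquoted in the EXPLAIN statement, so this sketch
    only suits paths without spaces.
    """
    statement = "explain -script {0} -dot -out {1}".format(script, out_dir)
    return ["pig", "-x", mode, "-e", statement]


def dot_command(dot_file, png_file):
    """Build the argv that renders one dot file to a PNG with Graphviz."""
    return ["dot", "-Tpng", dot_file, "-o", png_file]


def run(argv):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(argv)
```

A typical session would call run(explain_command("wordcount.pig", "plans")), look at which .dot files appeared under plans, and then call run(dot_command(...)) on the one of interest.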