Apache Pig tips (swill?)

Apache Pig is designed to handle analysis of large data sets using a high-level language (Pig Latin) that allows for parallelisation. Pig Latin compiles to sequences of Map-Reduce programs that can be executed on Hadoop.

This post pulls together an archive of some Apache Pig tips tweeted as “Apache #Pig tip of the day”.

  • 18th June 2013:
    Use the org.apache.pig.builtin.MonitoredUDF annotation to terminate your regular or Algebraic UDFs if they run for too long
  • 17th June 2013:
    The HadoopJobHistoryLoader in the PiggyBank can be used to check for failed jobs, amongst other things.
  • 14th June 2013:
    Visualise the execution plan as a directed acyclic graph using the -dot argument for EXPLAIN, then pass the output through Graphviz dot
  • 13th June 2013:
    Just like your favourite RDBMS, Pig has an EXPLAIN command – get logical, physical and MapReduce execution plans.
  • 12th June 2013:
    Know your path – the pig shell script tries to locate hadoop using ‘which’.
  • 11th June 2013:
    Penny (a debug/tracing tool) users should be aware that it has been removed from trunk in 0.11 #tidyout
  • 10th June 2013:
    Pig can work with data serialized using Avro – the necessary AvroStorage & related classes are in the PiggyBank
  • 7th June 2013:
    Use PigUnit for unit testing Pig scripts. It defaults to local mode; set the pigunit.exectype.cluster property to run on MapReduce.
  • 6th June 2013:
    Amazon Elastic MapReduce supports Pig (0.9.x). You can run newer unsupported versions: https://forums.aws.amazon.com/thread.jspa?messageID=455015
  • 5th June 2013:
    the PiggyBank contains contributed Java UDFs. Very useful stuff in contrib/piggybank (caveat: they are ‘as-is’)
  • 4th June 2013:
    Pig supports user defined functions (UDFs). Write them in Java or (with less support) Python, JS, Ruby & #Groovy
  • 3rd June 2013:
    use the ILLUSTRATE command to exemplify a Pig Latin script with concise, complete and realistic data #iterate
  • 2nd June 2013:
    Pig provides a high level language (Pig Latin) for data analysis that compiles to Hadoop Map-Reduce data flows.
  • 31st May 2013:
    ‘pig -x local’ runs on a single machine; the default ‘-x mapreduce’ mode runs on a Hadoop cluster. #bigdata
  • 30th May 2013:
    Pig supports multiple Hadoop versions (0.20 by default) – make sure you set the hadoopversion build parameter
  • 29th May 2013:
    the -secretDebugCmd parameter shows the environment pig/hadoop will use (useful for Error 2998) #bigdata
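
Since a couple of the tips above come down to "what will pig actually pick up from my environment?", here is a minimal shell sketch of that kind of PATH check. It is not taken from the pig launcher script: locate_tool is a hypothetical helper name, and it uses the POSIX `command -v` as a stand-in for `which`.

```shell
#!/bin/sh
# Sketch only: report where a tool resolves on PATH, in the spirit of the
# `which` lookup the pig shell script does for hadoop.
# locate_tool is a hypothetical helper, not part of Pig or Hadoop.
locate_tool() {
    # command -v is used here as a portable stand-in for `which`;
    # it fails (non-zero status) when the name cannot be resolved.
    tool_path=$(command -v "$1" 2>/dev/null) || return 1
    printf '%s\n' "$tool_path"
}

# Check the two executables the tips above care about.
for tool in hadoop pig; do
    if found=$(locate_tool "$tool"); then
        echo "$tool resolves to: $found"
    else
        echo "$tool is not on PATH"
    fi
done
```

If the hadoop that resolves here is not the one you expected, fix your PATH before digging any deeper; running pig with -secretDebugCmd (see the last tip) then shows the rest of the environment it will use.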