Hadoop MapReduce multi output details

  • 2020-06-01 11:18:01
  • OfStack

Hadoop MapReduce multioutput

The files generated by FileOutputFormat and its subclasses are in the output directory. Each reducer1 file is named by partition number: part-r-00000, part-r-00001, and so on. Sometimes it is possible to control the filename of the output or to have multiple files output per reducer. MapReduce provides the MultipleOutputFormat class for this.

The MultipleOutputFormat class can write data to multiple files whose names are derived from the output keys and values or arbitrary strings. This allows multiple files to be created per reducer (or mapper for map jobs only). name-r-nnnnn file name is used for map output, name-r-nnnnn file name is used for reduce output, name is an arbitrary name set by the program, nnnnn is an integer named block number (starting from 0). Block Numbers ensure that output written from different blocks (mapper or reducer) will not conflict with the same name.

1. Redefine the output file name

We can control the filename of the output. Consider the need to separate holiday order data by gender. This requires running a job, and the output of the job is a file for each male and female, which contains all the data records for male and female.

This requirement can be implemented using MultipleOutputs:


package com.sjf.open.test;
import java.io.IOException;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobPriority;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.sjf.open.utils.ConfigUtil;
/**
 * Created by xiaosi on 16-11-7.
 */
public class VacationOrderBySex extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    int status = ToolRunner.run(new VacationOrderBySex(), args);
    System.exit(status);
  }
  public static class VacationOrderBySexMapper extends Mapper<LongWritable, Text, Text, Text> {
    public String fInputPath = "";
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      super.setup(context);
      fInputPath = ((FileSplit) context.getInputSplit()).getPath().toString();
    }
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      if(fInputPath.contains("vacation_hot_country_order")){
        String[] params = line.split("\t");
        String sex = params[2];
        if(StringUtils.isBlank(sex)){
          return;
        }
        context.write(new Text(sex.toLowerCase()), value);
      }
    }
  }
  public static class VacationOrderBySexReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        multipleOutputs.write(NullWritable.get(), value, key.toString());
      }
    }
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      multipleOutputs.close();
    }
  }
  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("./run <input> <output>");
      System.exit(1);
    }
    String inputPath = args[0];
    String outputPath = args[1];
    int numReduceTasks = 16;
    Configuration conf = this.getConf();
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf);
    job.setJobName("vacation_order_by_jifeng.si");
    job.setJarByClass(VacationOrderBySex.class);
    job.setMapperClass(VacationOrderBySexMapper.class);
    job.setReducerClass(VacationOrderBySexReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    job.setNumReduceTasks(numReduceTasks);
    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}

In the reduce that generates the output, construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable. Write output using an MultipleOutputs instance in the reduce() method instead of context. The write() method applies to keys, values, and names. The gender is used as the name, so the resulting output name is in the form sex-r-nnnnn:


-rw-r--r--  3 wirelessdev wirelessdev     0 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/_SUCCESS
-rw-r--r--  3 wirelessdev wirelessdev   88574 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/f-r-00005.gz
-rw-r--r--  3 wirelessdev wirelessdev   60965 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/m-r-00012.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00000.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00001.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00002.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00003.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00004.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00005.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00006.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00007.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 10:41 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00008.gz

We can see that in the output file there is not only the output file type we want, but also the part-r-nnnnn file, but there is no information in the file, which is the default output file of the program. So when we specify the name of the output file (name-r-nnnnn), we do not specify name as part because it is already used as the default.

2. Multi-directory output

The base path specified in MultipleOutputs's write() method is interpreted relative to the output path, because it can contain the file path separator (/) to create subdirectories of any depth. For example, we changed the requirement above to separate holiday order data by gender, with the different gender data in different subdirectories (for example: sex=f/ part-r-00000).


 public static class VacationOrderBySexReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        String basePath = String.format("sex=%s/part", key.toString());
        multipleOutputs.write(NullWritable.get(), value, basePath);
      }
    }
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      multipleOutputs.close();
    }
  }

The resulting output name is in the form sex=f/ part-r-nnnnn or sex=m/ part-r-nnnnn:


-rw-r--r--  3 wirelessdev wirelessdev     0 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/_SUCCESS
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00000.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00001.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00002.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00003.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00004.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00005.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00006.gz
-rw-r--r--  3 wirelessdev wirelessdev     20 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/part-r-00007.gz
drwxr-xr-x  - wirelessdev wirelessdev     0 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/sex=f
drwxr-xr-x  - wirelessdev wirelessdev     0 2016-12-06 12:26 tmp/data_group/order/vacation_hot_country_order_by_sex/sex=m

The & # 65279; 3. Delay output

A subclass of FileOutputFormat produces an output file (part-r-nnnnn), even if the file is empty. Sometimes we don't want these empty files, we can use LazyOutputFormat to process them. It is a wraparound output format, and you can specify that the partition records the output of section 1 only when the file is actually created. To use it, call the setOutputFormatClass() method with JobConf and the associated output format as parameters:


Configuration conf = this.getConf();
Job job = Job.getInstance(conf);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Check 1 again for our output file (first example) :


sudo -uwirelessdev hadoop fs -ls tmp/data_group/order/vacation_hot_country_order_by_sex/
Found 3 items
-rw-r--r--  3 wirelessdev wirelessdev     0 2016-12-06 13:36 tmp/data_group/order/vacation_hot_country_order_by_sex/_SUCCESS
-rw-r--r--  3 wirelessdev wirelessdev   88574 2016-12-06 13:36 tmp/data_group/order/vacation_hot_country_order_by_sex/f-r-00005.gz
-rw-r--r--  3 wirelessdev wirelessdev   60965 2016-12-06 13:36 tmp/data_group/order/vacation_hot_country_order_by_sex/m-r-00012.gz

The & # 65279; Thank you for reading, I hope to help you, thank you for your support of this site!


Related articles: