
MN3441 Technology for Managerial Data Analysis

Merging Data (video time: 26 minutes)


Merging data is the process of taking two or more datasets and combining them into a single, joint dataset.

Motivation

Often, when working with data, not all the information you need is nicely organized in one place and ready to process. More often, you need to expend some effort gathering it. If you can’t get it all from one source, you may need to connect pieces from different sources. In this lesson, we’ll work through a non-trivial example in which two datasets hold complementary data, but the data they have in common is not always represented in the same way, so merging them is not straightforward. Spoiler alert: we’ll use pattern matching at some point in the process.

Software Tools

Computer programming languages provide the capability to merge datasets, and packages are available to make that process easier. There are also software tools that provide some of the same capabilities, but they may be limited in how far their functionality can be customized.
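To make the idea concrete, here is a minimal sketch of a merge written in Python using only the standard library. It is not the method used later in this lesson; the sample data, the column names, and the helper functions `normalize` and `inner_merge` are all hypothetical and chosen only for illustration. The sketch joins two small tables whose shared key (a company name) is written differently in each, using a simple pattern-matching cleanup before matching rows.

```python
import re

# Hypothetical sample data: both tables describe the same companies,
# but the shared key (company name) is formatted differently in each.
revenue = [
    {"company": "Acme Corp.", "revenue_musd": 120},
    {"company": "Globex, Inc.", "revenue_musd": 340},
]
employees = [
    {"company": "ACME CORP", "employees": 800},
    {"company": "globex inc", "employees": 2100},
    {"company": "Initech", "employees": 50},
]


def normalize(name):
    """Lowercase the name, drop punctuation, and collapse repeated spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def inner_merge(left, right, key):
    """Inner join: keep only rows whose normalized key appears in both tables."""
    right_by_key = {normalize(row[key]): row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(normalize(row[key]))
        if match is not None:
            # Combine the columns; the left table's spelling of the key is kept.
            merged.append({**match, **row})
    return merged


merged = inner_merge(revenue, employees, "company")
for row in merged:
    print(row)
```

Because this is an inner join, the hypothetical "Initech" row is dropped: it has no counterpart in the revenue table. In practice, a package such as pandas offers the same kind of join with far less code, but the key cleanup step is still up to you.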

Data wrangling or data preparation software provides graphical user interface solutions for transforming, mapping, and joining data. A few notable ones include:

We’ll use Trifacta, which is a web-based tool that utilizes cloud storage and computing. It is designed with a workflow architecture, in which a series of activities or “recipes” are run to read, process and output data.

...