The Utility of Data-Integration Techniques

MIKE HESS
Nielsen
Michael.Hess@nielsen.com

PETE DOE
Nielsen
Pete.Doe@nielsen.com

Data-integration techniques can be useful tools as marketers continue to improve overall efficiency and return on investment. This is true because of the value of the techniques themselves and also because the current advertising market, based on demographic buying, has major opportunities for arbitrage in the range of 10 percent to 25 percent (where in that range depends on the nature of the vertical). The current study reviews different methods of data integration in pursuing such negotiations.

INTRODUCTION

Advertisers, agencies, and content providers all are looking for improvement in the placement of advertisements in content. If an advertiser can reach more of its customers and potential customers by spending less money, or an agency can help an advertiser to do the same, this yields a positive effect on the advertiser's bottom line. Conversely, a content supplier can enhance its value if it can demonstrate that its content is attractive to particular types of people (e.g., those disposed to a particular brand or category, or even a particular psychographic target).

In this quest for improved advertising efficiency and return on investment (ROI), a number of different methods have evolved. Most marketers and their agencies use targeting rather than mass-marketing strategies (Sharp, 2010). Beyond this, many agencies have their own "secret-sauce" formulas whereby they adjust the value of an advertising buy as a function of how much "engagement" can be attributed to that vehicle, whether it be a specific television program or a magazine title. A more recent in-market approach, exemplified by TRA (Harvey, 2012) and Nielsen Catalina Services, has also shown that buying can be improved through the identification of programs that have more brand and category heavy users.
The authors' own work since 2007 with data-integration techniques has shown that fused data sets also can improve targeting efficiency by a range from about 10 percent to 25 percent, depending on the category vertical. A number of firms employ data fusion and integration techniques on the provider side (e.g., Nielsen, Telmar, Kantar, and Simmons) and in the agency business (Hess and Fadeyeva, 2008). In this study, the authors share some of the definitions and empirical generalizations that have accumulated in the past five years of working with these techniques.

The practical application of data integration already has begun to appear in the marketplace. A large snack-manufacturing company presented some of its findings at a recent Advertising Research Foundation (ARF) conference (Lion, 2009); a global software supplier took the stage at a Consumer-360 event (Nielsen C-360, 2011); and a media-planning and buying agency has indicated that it is using its custom fusion data set to verify and fine-tune commitments made in the 2012 Upfront and in all of its competitive pitches for new business (personal communication to M. Hess, 2012).

In the next section, the various data-integration techniques are defined, and some of the advantages and disadvantages of each are discussed.

TYPES OF DATA INTEGRATION

There are three broad types of data integration used in media and consumer research for advertising planning.

DOI: 10.2501/JAR-53-2-231-236
June 2013, Journal of Advertising Research, p. 231

WHAT WE KNOW ABOUT ADVERTISING II

EMPIRICAL GENERALIZATION

Analysis with integrated data sets and the national people meter panel has shown us that if an advertising buy is made based on a marketing target and the programs that its members view, rather than against a demographic target, there is empirically a range of between 10 percent and 25 percent improvement in the efficiency of that buy.
This marketing target can be based either on consumption-pattern segmentation (e.g., heavy/light category users) or on psychographic/lifestyle segmentation (e.g., prudent savers versus financial risk takers).

Directly Matched Data

Data sets are matched using a common key (e.g., name and address, or cookies). Very often, this requires the use of personally identifiable information, and appropriate privacy measures must be in place. Some of the key technical aspects that must be evaluated are completeness and accuracy of matching. For marketing purposes, databases that are integrated via direct matching of address are often referred to as single-source data, but there is a distinction between true single source and this form of integrated data, as the completeness and accuracy of the match are usually not perfect. It can, however, be considered the next best thing to single source, assuming the data sets being integrated are of good quality and relevance. An example of this sort of database is the Nielsen Catalina Services integration of Catalina frequent-shopper data with television data obtained from Nielsen National People Meter data and Return Path set-top-box data.

Unit-Level (e.g., Respondent-Level) Ascription

In many cases, direct matching of data is infeasible, perhaps because of privacy concerns or because the intersection between the data sets is minimal (this is usually the case with samples, where population sampling fractions are very small); assuming no exclusion criteria for research eligibility, the chance of a respondent being in two samples with sampling fractions of 1/10,000 is 1 in 100 million. In these cases, statistical ascription techniques can be used to impute data. For example, product-purchase data can be ascribed onto the members of a research panel that measures television audiences, using variables common to the television panel and a product-purchase database to guide the ascription.
This enables the viewing habits of product users to be estimated. Data fusion is one example of a unit-level ascription technique that is increasingly being used to create integrated databases. (The topic is discussed in more detail later in this article.)

Some of the advantages of this approach:

• There is no additional burden on the respondent. Because the ascription is statistical, it can be applied to anonymized data.
• Additional data are obtained without affecting existing response rates or worsening respondent fatigue.
• There are no privacy concerns. Along with the previous point, this makes it a particularly valuable approach to adding data fields to media currency measurements, which typically have tight constraints on respondent access and measurement specifications.
• Because the ascription is applied at the unit/respondent level, the database created delivers complete analytic flexibility. A particularly relevant and valuable consequence of this for media databases is that advertising reach and frequency analyses can be created.
• The cost of ascription is low in comparison with the cost of additional primary research.

Caveats associated with this approach:

• Ascription techniques contain the possibility of model bias. This needs to be carefully assessed; model validation is essential.
• In the majority of cases, ascription models have aggregate- rather than respondent-level validity. For example, a model that overlays brand purchasing onto a television measurement panel may not be able to predict the actual brand purchases of an individual household on the panel, but it will be able to reliably predict the viewing of brand purchasers as a group. This means that the approach is relevant to advertising planning but less applicable to test-control ROI analyses, where direct assessment of purchase versus exposure is required.
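As an illustration of unit-level ascription, the following sketch (with entirely hypothetical data and field names) imputes a heavy-buyer flag from a consumer panel onto a television panel by hot-deck matching on common variables, and then checks aggregate-level validity, in line with the caveat above:

```python
import random

# Hypothetical sketch of unit-level ascription: purchase behavior from a
# consumer panel is imputed onto a TV panel by matching each TV respondent
# to a similar consumer-panel donor on common (linking) variables.
random.seed(7)

# Common variables: (age_group, region); behavior to ascribe: heavy_buyer
consumer_panel = [
    {"age_group": a, "region": r, "heavy_buyer": random.random() < p}
    for a, r, p in [(1, "N", 0.6), (1, "S", 0.5), (2, "N", 0.3), (2, "S", 0.2)]
    for _ in range(250)
]

tv_panel = [
    {"age_group": random.choice([1, 2]), "region": random.choice(["N", "S"])}
    for _ in range(100)
]

def ascribe(recipient, donors):
    """Copy the behavior of a donor matched exactly on the linking variables."""
    matches = [d for d in donors
               if d["age_group"] == recipient["age_group"]
               and d["region"] == recipient["region"]]
    donor = random.choice(matches)          # hot-deck: random donor in cell
    return {**recipient, "heavy_buyer": donor["heavy_buyer"]}

fused = [ascribe(r, consumer_panel) for r in tv_panel]

# Aggregate-level validity: the fused heavy-buyer incidence should track the
# consumer panel's, even though no individual-level prediction is claimed.
rate = sum(r["heavy_buyer"] for r in fused) / len(fused)
print(f"fused heavy-buyer incidence: {rate:.2f}")
```

The point of the final check is exactly the caveat in the text: the model is trusted at group level, not for any single household.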
THE MARKETER'S DILEMMA: FOCUSING ON A TARGET OR A DEMOGRAPHIC?

Aggregate-Level Integration

Aggregate-level integration uses segmentation to group and then link types of respondent across data sets. The segmentation typically uses combinations of demographics and geography, though any information common to the data sets can be employed. An example of a commonly used segmentation is Prizm, which segments the population into 60 geo-demographic groups. An assessment of the viewing habits of brand users can be obtained by identifying Prizm codes strongly associated with particular brands (using a consumer panel) and looking at viewing traits associated with these groups (using a television panel with Prizm classification). Alternatively, purchase-propensity scores across all segments can be calculated on the consumer panels and used as media weights on television audiences.

Advantages of this approach:

• Segmentations can cover a wide scope: linking data sets through geo-demographic segmentation, for example, allows consumer and media research databases to be connected and subsequently linked with geographical data such as retail areas.
• Understanding a brand through the lens of a suitably constructed segmentation delivers insights beyond basic purchase facts, perhaps guiding advertising creativity as well as media touch-points.

Limitations of this approach:

• Segmentations, by nature, assume homogeneity within segments, and this delivers less precision and less sensitivity than other approaches.
• Because the integration of data sources is not at the unit/respondent level, there are restrictions on analysis: in particular, campaign reach and frequency.

The Pros and Cons of Each Approach

Direct match, unit-level ascription, and aggregate-level ascription can be considered as tools for users of research, to be used in the appropriate way (see Table 1).
For example, respondent-level ascription of brand-user attributes on a television panel may be used to plan advertising for a specific brand target; a direct-match database may then be used to estimate the advertising effectiveness of the campaign; and product-distribution tactics may be informed by the use of geo-demographic segmentation.

TABLE 1: Overview of Integration Approaches

Direct Match (e.g., Address Matching)
  Applications: Advertising ROI; media reach and frequency; media planning; ad sales
  Accuracy/Precision: High; near single source
  Caveats: Privacy; completeness and accuracy of matching

Unit-Level Ascription (e.g., Data Fusion)
  Applications: Media reach and frequency; media planning; ad sales
  Accuracy/Precision: Dependent on model; can be near single source
  Caveats: Model bias; aggregate-level validity (not suited to direct ROI estimation)

Aggregate Level (e.g., Segment Matching)
  Applications: Media planning; ad sales; relating media and sales activity to geographical locations (e.g., stores, catchment areas)
  Accuracy/Precision: Dependent on segmentation, but typically lower than unit-level ascription
  Caveats: Aggregate-level validity (not suited to direct ROI estimation); reach and frequency not available; assumption of homogeneity within segments reduces sensitivity

DATA FUSION

The term data fusion is used to describe many different data-integration methods. The most common definition, and the one we shall use in this study, is as follows: "Data fusion is a respondent-level integration of two or more survey databases to create a simulated single-source data set." Essentially, two surveys (or panels) are merged at the respondent level to create a single database (e.g., the U.S. Nielsen Television/Internet Data Fusion overlays data from the Nielsen Online Audience Measurement Panel onto the National People Meter Television Audience Measurement Panel, creating a database of respondents with television-viewing measures and online-usage measures).
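A minimal sketch of such a fusion, under toy assumptions (illustrative panels, a single weighted linking variable), matches each television-panel respondent to the nearest Internet-panel donor by statistical distance and overlays the donor's online usage:

```python
import math

# Hypothetical sketch of respondent-level data fusion: TV-panel respondents
# (recipients) are matched to Internet-panel respondents (donors) by weighted
# statistical distance on linking variables; the donor's online usage is
# overlaid onto the recipient's record. All data and names are illustrative.

tv_panel = [
    {"id": i, "age": 20 + 3 * i, "tv_hours": 10 + i} for i in range(10)
]
net_panel = [
    {"id": 100 + i, "age": 21 + 3 * i, "online_hours": 5 + 2 * i}
    for i in range(10)
]

# Weights would reflect each linking variable's importance in predicting
# behavior; a real fusion would use many more hooks than age alone.
WEIGHTS = {"age": 1.0}

def distance(a, b):
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in WEIGHTS.items()))

fused = []
used = set()              # use each donor at most once (equitable donor use)
for rec in tv_panel:
    donor = min((d for d in net_panel if d["id"] not in used),
                key=lambda d: distance(rec, d))
    used.add(donor["id"])
    fused.append({**rec, "online_hours": donor["online_hours"]})
```

The donor-reuse check mirrors the requirement, discussed below, that a fusion algorithm use the respondents in both samples as equitably as possible.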
FIGURE: The Data Fusion Process (TV/Internet Fusion). A TV panel (common characteristics; TV viewing) and an Internet panel (common characteristics; online use) are matched via their common characteristics, yielding an integrated data set (common characteristics; TV viewing and online use).

Linking Variables

The creation of this single database matches respondents on common variables to link the data sets. Common variables (also known as "linking variables" or "fusion hooks") typically are demographic, geographic, and media related. For example, men aged 18 to 24 years, in full-time employment, within a certain geographical region, who have a particular defined set of media habits (defined across the two panels), may be matched across the two databases.

The importance of linking variables in the data fusion cannot be overstressed. In the case of media-based data fusion, Nielsen data fusions adhere to the generally accepted idea that linking variables must encompass more than standard demographic measures to ensure reliability of results. The importance of employing measures directly related to the phenomena being fused (in this case, television viewing) was emphasized by Susanne Rassler (2002) in Statistical Matching:

Within media and consuming data the typical demographic and socioeconomic variables will surely not completely explain media exposure and consuming behavior. Variables already concerning media exposure and consuming behavior have to be asked as well. Thus, the common variables also have to contain variables concerning television and consuming behaviors....
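One simple way to act on this guidance, sketched here with purely illustrative data, is to rank candidate linking variables by how strongly they predict the behavior being fused (here, weekly television hours), so that behavior-related hooks are not overlooked in favor of demographics alone:

```python
# Hypothetical sketch: rank candidate fusion hooks by the strength of their
# correlation with the behavior to be fused. Data and variable names are
# invented for illustration only.

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Candidate hooks measured on both panels (values are illustrative)
panel = [
    {"age": 25, "hh_size": 3, "weekly_tv_hours": 12},
    {"age": 34, "hh_size": 2, "weekly_tv_hours": 18},
    {"age": 47, "hh_size": 4, "weekly_tv_hours": 25},
    {"age": 52, "hh_size": 1, "weekly_tv_hours": 30},
    {"age": 61, "hh_size": 2, "weekly_tv_hours": 33},
]
target = [r["weekly_tv_hours"] for r in panel]
ranking = sorted(
    ["age", "hh_size"],
    key=lambda v: abs(pearson([r[v] for r in panel], target)),
    reverse=True,
)
print(ranking)   # strongest predictors of the fused behavior come first
```

In practice a fusion practitioner would use richer importance measures and, as the quotation stresses, include direct media-behavior variables among the hooks rather than relying on demographics alone.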
Linking variables are the key to the statistical validity of the fusion, which operates on the assumption of conditional independence; in the case of the television/Internet fusion, this would mean that variations in the way that television viewing and online use interact are random within each group of respondents defined by the interlaced common variables. Where this condition does not hold, regression to the mean occurs in the model, and there will be some bias in the fused results. This bias can be estimated using fold-over tests or comparison to single-source data (if available) and is an important part of assessing a data fusion's validity and utility. In addition, a smart fusion practitioner also will test the congruence of the linking variables across the two databases, checking that the two sample structures are matched well enough to enable the fusion to work well, and assessing the closeness of matching of the two samples post fusion.

Matching the Samples

In practice, it is rarely possible to find a match for every respondent across every characteristic in the linking-variable set. In the absence of a perfect match, the objective, therefore, becomes finding the best match. And although fusion algorithms vary, this requirement typically is achieved by using statistical distance measurements (including assessment of the relative importance of the linking variables in predicting behavior) and identifying the respondents with the smallest distance. At the same time, checks should occur in the fusion algorithm to ensure that the fusion uses all the respondents in both samples as equitably as possible. In some cases, the two samples to be fused may have very different sample sizes, and consideration needs to be given to how best to use the samples: whether all respondents will contribute to the fused database or just the closest matches, to create a database with a respondent base equal in size to the smaller of the two samples.
This decision often is driven by logistical factors, such as the analysis system's capabilities, rather than being a purely statistical consideration.

Validation

Data fusion has been used in media research for planning purposes for more than 20 years, and a body of knowledge has been built up over that time. Valuable guidance as to the validity levels that may hold given various data-integration approaches also can be found in industry guidelines developed by the Advertising Research Foundation (2003). Validation studies have demonstrated that data fusion provides valid results with acceptably low levels of model bias, assuming the following hold:

• the samples are well defined and structurally similar;
• there is a sufficient set of relevant linking variables; and
• the fusion matches the samples closely across the linking variables.

The authors of the current article believe that it is important to validate every data fusion across these three criteria and to create formal fold-over validation tests and/or single-source comparisons where possible. In addition, offering methodological transparency and welcoming external validation of data fusion processes have contributed to greater acceptance of data fusion by the industry. As such, the method is viewed by many as a useful tool in the researcher's tool box.

ANALYSIS OF LEARNINGS AND EMPIRICAL GENERALIZATIONS

Although the authors have been working in this space since 2007, it is not easy to obtain specific learning from every data integration, due to the proprietary nature of the service. The generalizations below are offered in the spirit of industry advancement while, at the same time, remaining protective of the proprietary aspects of the outcomes.
Analysis with integrated data sets and the national people meter panel has shown us that if an advertising buy is made based on a marketing target and the programs that its members view, rather than on a demographic target, there is empirically a range of 10 percent to 25 percent improvement in the efficiency of that buy. This marketing target can be based either on consumption-pattern segmentation (e.g., heavy/light category users) or on psychographic/lifestyle segmentation (e.g., prudent savers versus financial risk takers).

An increase in efficiency is explained as follows: A campaign planned to deliver X demographic GRPs will deliver Y brand-target GRPs. An alternate plan can be developed that delivers X demographic GRPs and Z brand-target GRPs, where Z > Y. Equivalently, an alternate plan can be developed to deliver X2 demographic GRPs and Y brand-target GRPs, where X2 < X (Collins and Doe, 2011).

The general patterns observed are

• technology companies are closer to the high end of the 10-percent to 25-percent range of improvement;
• services, such as financial, are in the middle; and
• Consumer Packaged Goods (CPG) are at the lower end.

The authors attribute this outcome to the fact that demographic buying is itself more aligned with CPG items that have broader penetration, whereas the technology side is less aligned. Larger improvements can, therefore, come from this area.

Expectations

The only empirical exceptions occur when the demographic and marketing-target indexes for two programs happen to overlap, or at least not differ significantly. These occasional exceptions, however, are offset by the findings that come from a list of demographically similar programs. In fact, one almost always can find a subset that will have higher category consumption or penetration of a key psychographic target segment.
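The GRP arithmetic above can be made concrete with a small worked example (all ratings are hypothetical): two plans deliver the same demographic GRPs, but the plan built on target-concentrated programs delivers more brand-target GRPs.

```python
# Worked toy example of the X/Y/Z GRP trade-off described above.
# Each spot is (demo_rating, brand_target_rating); all numbers are invented.
demo_plan  = [(2.0, 0.80), (1.5, 0.50), (2.5, 0.90)]   # bought on demo CPMs
brand_plan = [(2.0, 0.95), (1.5, 0.60), (2.5, 1.00)]   # same demo GRPs

x  = sum(d for d, _ in demo_plan)     # X demographic GRPs (6.0)
y  = sum(b for _, b in demo_plan)     # Y brand-target GRPs
x2 = sum(d for d, _ in brand_plan)    # still 6.0 demographic GRPs
z  = sum(b for _, b in brand_plan)    # Z brand-target GRPs, Z > Y

assert x == x2 and z > y
improvement = (z - y) / y             # within the reported 10-25% range
print(f"brand-target GRP lift: {improvement:.0%}")
```

Because both plans deliver the same demographic GRPs, a seller pricing purely on demographics would charge them identically, which is the arbitrage discussed next.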
This 10-percent to 25-percent range, in turn, translates into a form of media arbitrage, because sellers do not take into account the amount of category consumption or segment penetration when they price their program Cost per Thousand (CPMs) based on demographics. As noted earlier, established CPG categories tend to fall in the lower part of this range, whereas newer spaces, such as software and technology, lie in the higher end.

Brands in all the categories we have examined to date have fallen into that range, signaling that there is virtually always an efficiency to be gained by being able to direct media toward the marketing target from an initial condition of having begun with a demographic target. Importantly, that marketing target can be based either on psychographic/lifestyle attributes or on brand/category consumption. These targets are sourced directly from the fused databases. Although it is true that if the target is very large (such as all American television viewers) no efficiencies will be gained, the majority of the targets worked with represent less than 20 percent of the viewing population. At that level of targeting, the 10-percent to 25-percent range of improvement holds. As noted previously, the brand target need not be either demographic or purchase based: it could be based on a psychographic segmentation or a set of attitudes. The implication is that planning on a standard demographic target (e.g., women ages 25 to 54 years) is less efficient than planning on a more precisely defined target.

STRATEGIC IMPLICATIONS

Using more precise brand targets than traditional demographics creates opportunities for both buyers and sellers and improves overall media efficiency by delivering less waste: better advertising placement leads to more advertisements being seen by the right people at the right time and fewer irrelevant advertisements being served up to bemused consumers.
Improving the media environment in this way is clearly good for everyone. Whether the use of brand targets will become an explicit component of an advertising buy or will remain hidden in the planning and negotiation process is unclear. At present, the latter is the case in television, in part because the executional tools for buying are constrained to demographics. Online advertising-serving models, however, are capable of defining more precise targets through cookie-based ascription models.

This empirical generalization also suggests a strategy: to take advantage of the available demographic-versus-marketing-target arbitrage, it is important to have the right data that link the consumption segment, or psychographic segment, to program viewing. These data sets can be based on single-source, direct-matched, or fused data. In each case, the television currency measurement (e.g., the National People Meter service for national television advertising in the United States) is used as the basis for the program-viewing behavior. Getting these efficiencies in the television buy also is important for cross-platform campaigns. If the reach against the marketing target, for example, is already enhanced via this approach as part of the television buy, the Key Performance Indicator (KPI) of the cross-platform campaign might be based more on frequency and recency than on an effort to attain additional unduplicated reach.

CONCLUSION

In sum, the authors believe that data-integration techniques are acting as the latest wave of services that are bringing greater overall efficiency and, in turn, ROI to the industry. They follow in the footsteps of predictive new-product models in the 1970s and 1980s, and marketing-mix modeling in the 1990s and 2000s.

MIKE HESS is EVP in Nielsen's Media Analytics group.
He also serves as the Nielsen spokesperson for Social Television and is currently directing a comprehensive analysis of the relationship between social buzz and television ratings. Before joining Nielsen in 2011, Hess was research director for the media agencies Carat and OMD. Hess's publications include an American Association of Advertising Agencies-sponsored monograph on "Short and Long Term Effects of Advertising and Promotion" (2002) and a review of quantitative methods in advertising research for the Fiftieth Anniversary issue of the Journal of Advertising Research (2011). He currently acts as project co-lead for the quantification of brand equity for the MASB and this year became a trustee of the Marketing Sciences Institute.

PETE DOE is SVP/data integration at Nielsen. In that role, he has global responsibility for Nielsen's data-fusion methodologies and is involved with such data-integration methods as STB modeled ratings and online hybrid audiences. Prior to moving to the United States in 2003, Doe was a board director at RSMB television research in the United Kingdom, where he worked on the BARB television audience measurement currency and numerous data-fusion projects.

REFERENCES

ADVERTISING RESEARCH FOUNDATION. ARF Guidelines for Data Integration. New York: Advertising Research Foundation, 2003.

COLLINS, J., and P. DOE. "Making Best Use of Brand Target Audiences." Print and Digital Research Forum, San Francisco, CA, 2011.

HARVEY, B. Panelist at the Wharton Empirical Generalizations Conference II, Philadelphia, PA, May 31, 2012.

HESS, M., and I. FADEYEVA. ARF Forum on Data Fusion and Integration. New York: Advertising Research Foundation, 2008.

LION, S. "Marketing Laws in Action." AM 4.0. New York: Advertising Research Foundation, 2009.

NIELSEN ANNUAL CUSTOMER C-360 CONFERENCE. Orlando, FL, June 2011.

RASSLER, S. Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. New York: Springer-Verlag, 2002.

SHARP, B.
How Brands Grow. Australia and New Zealand: Oxford University Press, 2010.