Combining data sources
In this chapter we take a broad perspective on the federal statistical system and the needs of the research community involved in program evaluation studies. Together, they inform the citizenry about the current status of the economy and the well-being of the population and evaluate whether various government actions improve that status. Building on the findings and themes from the previous chapters, we discuss what is needed to facilitate the use of administrative data and other data sources for federal statistics and for research evaluating the efficacy of federal programs.
We believe it is urgent that changes be initiated now because the needed changes will take considerable time and effort, including extensive research, upgrades in information technology (IT) infrastructure, and new skill sets for current and new federal statistical agency staff. Producing legislatively mandated and policy-relevant statistics is costly and requires a considerable time investment, and changes to how those statistics are produced will require new investments.
Furthermore, building a new paradigm while continuing to produce critical information for the nation will be difficult, but we believe the alternative of not making fundamental changes now would result in the inability of many statistical programs to meet their core missions and legislative mandates.
As we note in Chapter 2, sample surveys have played a vital role in providing reliable and trustworthy information to inform the public and policy makers. Sample surveys have many virtues, including the ability to measure the precision of the results, design questions tailored to specific data needs, use a variety of data collection modes to best meet the needs and preferences of respondents, and target specific groups of interest.
We expect that sample surveys will continue to play an important but not exclusive role in federal statistics and, more broadly, in social science research. Federal statistical agencies will need to examine what information is needed to address key public policy issues and then to consider the best way to produce that information.
That examination needs to look at what sources of data—surveys, administrative data, other sources, or a combination of them—can best meet the information needs. Federal statistical agencies are in the best position to undertake such evaluations and to combine the most useful sources to produce the best statistical estimates possible in a transparent and objective manner. In the rest of this chapter we first review the current efforts to examine and use administrative records and other new sources of data for federal statistics.
We focus particularly on issues of data access and data sharing, including the legal and physical environment and infrastructure that will be needed. Closely tied to these efforts are the IT infrastructure and staff technical skills needed to work with some of these new data sources, including processing, cleaning, and editing large volumes of data. We conclude with a discussion of the quality and usability of different data sources for federal statistics and the research and evaluation needed both of the data and of the techniques to protect the privacy of the data.
Chapters 3 and 4 discuss using government administrative and private-sector data sources to enhance federal statistics. Although it is clear that other data sources are becoming increasingly available, government administrative data have most clearly demonstrated direct and immediate benefits. Both inside and outside the United States, administrative data on their own or in combination with sample survey data are being used for the production of high-quality statistics by a wide range of statistical agencies.
The potential for using private-sector data sources to enhance federal statistics is only beginning to be explored, and evaluations of these new sources are not evenly spread across agencies. Much more work is needed and could be done. A recent report of the National Research Council recommended:
Under the leadership of the U.S. Office of Management and Budget, the federal statistical system should accelerate: (1) research designed to understand the quality of statistics derived from alternative data—including those from social media, other Web-based and digital sources, and administrative records; (2) monitoring of data from a range of private and public sources that have potential to complement or supplement existing measures and surveys; and (3) investigation of methods to integrate public and private data into official statistical products.
Evaluating alternative data sources for federal statistics can best be achieved by the statistical programs with access to other relevant sources of information. However, there is also a need across the decentralized federal statistical system for greater leveraging of limited resources for research and development of new methods, as reflected in the recommendation. Individual agency programs have explored various data sources, but there has been little systematic accumulation of knowledge across agencies.
As a result, there is no systemwide plan or strategy for a broad examination of private-sector and other alternative data sources to supplement or replace sample surveys.
Furthermore, widespread adoption of new IT requirements, quality assessments, and other needed developments has not occurred. The National Research Council report anticipated the difficulties in accomplishing this research, given the highly decentralized nature of the federal statistical system:
One of the drawbacks of such a system is the lack of a critical mass for the purpose of major research undertakings. The Census Bureau and perhaps the Bureau of Labor Statistics are the only agencies with significant numbers of in-house research staff, although there is exceptional research capability throughout the statistical system.
However, many research topics exceed the capacity of any single agency. And as described in Chapter 4, we also found some promising pilots exploring and using various private-sector data sources.
However, so far these efforts have been fragmented, and fragmented efforts will not be sufficient for the needs of the overall statistical system. There has been a need for systemwide research and development capabilities even as the survey paradigm was evolving; now, with the exploration of new technologies and data sources, that need is even greater (Habermann). In addition to endorsing Recommendation 5 above from the previous report, we note and repeat the recommendations in Chapters 3 and 4 on the need for a systematic approach to the use of new data sources.
To this end, federal statistical agencies should create collaborative research programs to address the many challenges in using administrative data for federal statistics. Federal statistical agencies should provide annual public reports of these activities.
While the panel believes that the above recommendations are needed and will benefit the federal statistical system, it also acknowledges the organizational, policy, and legal barriers that prevent collaborative relationships among statistical agencies.
It is not clear that sufficient resources currently exist to pursue the kinds of research needed while continuing to produce the statistics that policy makers and the public expect. However, it is equally clear that the status quo is not meeting the research and development needs of the federal statistical system in evaluating new data sources for federal statistics. As detailed in Chapter 3 , federal statistical agencies face obstacles obtaining access to federal administrative data.
When the data are held outside the federal government by states, local governments, or private entities, the obstacles are even more daunting.
Despite various efforts to facilitate data sharing (Office of Management and Budget), the results have been discrete efforts that have not been cumulative and have not resulted in a standardized process for accessing data across projects or agencies.
For the most part, each project involving two or more agencies requires specific memoranda of understanding that are tailored to the project and dataset being used, often specifying exactly which variables from the dataset may be accessed and by whom. Even when there are no regulatory impediments and both agencies are eager to share data for statistical purposes, those memoranda of understanding often take months of negotiations.
In fact, Prell and colleagues noted that in the life cycle of an administrative data project, the signing of a memorandum of understanding should be considered a midpoint milestone for a project rather than the beginning of the project, because of the extensive time, planning, resources, and effort needed to reach that agreement.
The authors also noted that many projects are abandoned before ever attaining this milestone. As we note in Chapter 3 , one possible cause of these difficulties is that there is no agency that is directly charged to ensure timely and effective access of program data for statistical purposes. In an effort to achieve greater objectivity, the evaluation of federal government programs is often conducted by researchers outside the program.
However, external, nonfederal researchers face particular hurdles in gaining access to the data that are crucial to an objective evaluation of program efficacy. There is currently no standard procedure for external researchers to access datasets from different agencies for statistical or evaluation research studies.
Although statistical agencies provide a variety of secure means to allow researchers access to their data for statistical purposes (see Chapter 5), access to survey microdata or survey data linked to administrative records typically requires submitting a proposal to each agency whose data will be involved in the project.
Each agency has its own application and review process for accessing its data. Acquisition of datasets from states can require considerably more time, sometimes taking more than 2 years to obtain vital records or other state administrative datasets. The result is that some social science researchers have shifted away from evaluative and empirical research in the United States to studies in other countries that are able to provide such access. Although the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) provides a common level of legal protection across statistical agencies and sustains the culture of confidentiality protection within the statistical agencies (see Chapter 5), it would need substantial expansion to serve as a sufficient foundation for effective data sharing and access.
However, even if this specific lack were remedied, the situation would still fail to provide what is needed more broadly for the statistical system to function effectively as a system. Although greater access to tax data would be a key resource that would greatly benefit the quality of data products for other statistical agencies and programs, other sources would also be of benefit (see Chapters 3 and 4).
A new paradigm for the system needs to include changes to several laws that prohibit access for statistical purposes or require legal or regulatory changes to permit access for research and statistical purposes.
It is clear that fundamental changes in data access and sharing need to be made for the future of federal statistics and evidence-based policy research. The panel believes that the country can no longer afford the redundancy of individual federal statistical agencies each negotiating on their own with 50 states and the District of Columbia and, in some cases, other jurisdictions to access the same dataset for statistical purposes.
It is a burden on the states and the agencies that provides no benefits, and it limits the production of useful statistics and research. The panel believes that the nation needs a secure environment where administrative data can be statistically analyzed, evaluated for quality, and linked to surveys, other administrative datasets, and other data sources.
Such an environment would need to have the authority to control access for statistical and research purposes. It would also have to use and continually evaluate and enhance privacy measures. Integration of these efforts into a single entity could achieve many benefits if all statistical agencies could use a secure data-sharing environment.
Without a new entity, no scaling of expertise can occur in privacy protection measures, statistical modeling on multiple data sets, and IT architectures for data sharing. The panel does not recommend a new entity lightly.
As we describe throughout this report, however, there are numerous drawbacks to the status quo, so much so that we believe the statistical system is currently hampered in carrying out its mission. There is also tremendous inertia in many parts of the system that will make any changes difficult. We recognize that creation of a new entity will not by itself solve all the problems detailed in this report. In fact, we expect that, like the statistical agencies themselves, the authority and mission of the new entity will need to be clearly delineated, as organizational issues will arise between it and the existing agencies.
How this entity is created and its functions will determine its ability to be an effective resource of and for the federal statistical system. Thus, in the remainder of this chapter, we delineate some foundational principles and raise fundamental issues that will need to be addressed in order to create an effective new entity. In our second report, we will explore these issues more deeply. As many people in federal statistical and evaluation research communities know, these opportunities and challenges are not new.
As Kraus observed, computer technology had improved the efficiency and affordability of research with large data sets, and the expansion of government social programs called for more data and research to inform public policy.
As a result, social scientists recommended that the federal government develop a national data center that would store and make available to researchers the data collected by various statistical agencies. Because of its massive data holdings and its pioneering work in the use of computers for the storage and analysis of data, the Census Bureau became involved in the national debate, though reluctantly. However, the proposal for a national data center led to widespread concerns about government profiling and monitoring.
These concerns prompted a government review (Department of Health, Education, and Welfare) and comprehensive legislation that essentially prevented the establishment of a centralized database in the United States. New limitations were adopted for the use of Social Security numbers, understood at the time as the key technique to link discrete record sets containing personally identifiable information.
The panel does not envision this new entity as a major new data warehouse or national data center. We will discuss potential IT approaches and requirements in our second report, but emphasize here that there are mechanisms and protocols, such as secure multiparty computing, for combining and analyzing data virtually that do not require all the data being combined to be in the same place. More recently, data lake approaches have risen to the level of data hubs.
Consider a web application where a user can query a variety of information about cities, such as crime statistics, weather, hotels, demographics, etc. Traditionally, the information must be stored in a single database with a single schema.
But any single enterprise would find information of this breadth somewhat difficult and expensive to collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases, weather websites, and census data. A data-integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in "virtual data integration". This means application developers construct a virtual schema—the mediated schema—to best model the kinds of answers their users want.
Next, they design "wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data-integration solution (see figure 2).
When an application user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user's query. This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for them.
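As a minimal sketch of this pipeline (all source names, schemas, and data below are invented for illustration), a virtual-integration layer might look like:

```python
# Minimal sketch of virtual data integration: two hypothetical source
# adapters ("wrappers") expose heterogeneous sources under one mediated schema.

def crime_db_adapter(city):
    """Wrapper over an imagined crime database; emits mediated-schema fields."""
    local = {"Springfield": {"incidents_per_1000": 42.0}}
    return local.get(city, {})

def weather_site_adapter(city):
    """Wrapper over an imagined weather website; renames its local fields."""
    local = {"Springfield": {"avg_temp_c": 11.3}}
    return local.get(city, {})

ADAPTERS = [crime_db_adapter, weather_site_adapter]

def query_mediated_schema(city):
    """The 'virtual database': fan the query out to every source and merge."""
    answer = {"city": city}
    for adapter in ADAPTERS:
        answer.update(adapter(city))  # each wrapper returns a mergeable dict
    return answer

print(query_mediated_schema("Springfield"))
```

Adding a new source then means writing one more adapter and appending it to `ADAPTERS`; nothing about the mediated schema or the existing adapters changes.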
It contrasts with ETL systems or with a single database solution, which require manual integration of an entire new dataset into the system. Virtual ETL solutions leverage the virtual mediated schema to implement data harmonization, whereby the data are copied from the designated "master" source to the defined targets, field by field.
Advanced data virtualization is also built on the concept of object-oriented modeling in order to construct virtual mediated schema or virtual metadata repository, using hub and spoke architecture. Each data source is disparate and as such is not designed to support reliable joins between data sources.
Therefore, data virtualization as well as data federation depends upon accidental data commonality to support combining data and information from disparate data sets. Because of this lack of data value commonality across data sources, the return set may be inaccurate, incomplete, and impossible to validate.
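To illustrate this failure mode with made-up data: when the only shared field between two sources is a free-text city name, a federated join silently drops every row whose spellings differ.

```python
# Two sources never designed to join: the only "common" field is a free-text
# city name, so rows with differing spellings are silently lost (data invented).

crime_rate = {"New York": 5.1, "St. Louis": 8.2, "Chicago": 6.0}
avg_temp = {"New York City": 12.4, "Saint Louis": 13.1, "Chicago": 9.8}

joined = {city: (rate, avg_temp[city])
          for city, rate in crime_rate.items() if city in avg_temp}

print(joined)  # only Chicago survives; the return set is incomplete
```

Nothing in the result signals that two of the three cities were dropped, which is exactly why such return sets can be impossible to validate.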
One solution is to recast disparate databases to integrate these databases without the need for ETL. The recast databases support commonality constraints where referential integrity may be enforced between databases. The recast databases provide designed data access paths with data value commonality across databases. The theory of data integration forms a subset of database theory and formalizes the underlying concepts of the problem in first-order logic.
Applying the theories gives indications as to the feasibility and difficulty of data integration. A database over a schema is defined as a set of sets, one for each relation in a relational database.
Note that this single source database may actually represent a collection of disconnected databases. Two popular ways to model this correspondence exist: global-as-view (GAV) and local-as-view (LAV). In the GAV approach, the burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator; thus the GAV approach appears preferable when the sources seem unlikely to change.
In a GAV approach to the example data integration system above, the system designer would first develop mediators for each of the city information sources and then design the global schema around these mediators. For example, consider if one of the sources served a weather website.
The designer would likely then add a corresponding element for weather to the global schema. Then the bulk of effort concentrates on writing the proper mediator code that will transform predicates on weather into a query over the weather website. This effort can become complex if some other source also relates to weather, because the designer may need to write code to properly combine the results from the two sources. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor.
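The GAV weather example above can be sketched as follows (sources, data, and the precedence policy are all invented): the global weather element is defined directly as mediator code over the sources, so deciding how to combine overlapping sources is the designer's job, done up front.

```python
# GAV sketch: each mediated-schema element IS a query over the sources,
# written by the designer as mediator code (all names and data invented).

weather_site = {"Springfield": 11.3}   # temperatures from one source
airport_feed = {"Shelbyville": 9.9}    # a second, overlapping source

def global_weather(city):
    """Mediator defining the global 'weather' element over both sources."""
    if city in weather_site:           # the designer chose this precedence
        return weather_site[city]
    return airport_feed.get(city)      # fall back to the second source

print(global_weather("Shelbyville"))
```

If a third weather source joined the system, `global_weather` itself would have to be rewritten, which is exactly the maintenance burden GAV places on the mediator.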
The benefit of an LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is less stable or likely to change. In an LAV approach to the example data integration system above, the system designer designs the global schema first and then simply inputs the schemas of the respective city information sources.
Consider again if one of the sources serves a weather website. The designer would add corresponding elements for weather to the global schema only if none existed already. Then programmers write an adapter or wrapper for the website and add a schema description of the website's results to the source schemas. The complexity of adding the new source moves from the designer to the query processor.
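Under LAV, by contrast, each source is only *described* as a view over the global schema, and a query processor picks relevant sources at query time. The deliberately naive sketch below (invented schemas and data) shows that shift of responsibility:

```python
# LAV sketch: sources are described as views over the global schema; the
# query processor, not the designer, decides which sources answer a query.

SOURCES = {
    "weather_site": {"covers": {"city", "temp"},
                     "rows": [{"city": "Springfield", "temp": 11.3}]},
    "crime_db":     {"covers": {"city", "crime_rate"},
                     "rows": [{"city": "Springfield", "crime_rate": 42.0}]},
}

def answer(wanted):
    """Naive rewriting: use every source whose view overlaps the query."""
    result = {}
    for src in SOURCES.values():
        if wanted & src["covers"]:            # source is relevant to the query
            for row in src["rows"]:
                result.setdefault(row["city"], {}).update(row)
    return result

print(answer({"temp", "crime_rate"}))
```

Adding a source is just one more `SOURCES` entry carrying its view description; no mediator code is rewritten, and the work of combining sources has moved into `answer`, i.e., the query processor.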
The theory of query processing in data integration systems is commonly expressed using conjunctive queries and Datalog, a purely declarative logic programming language. If a tuple or set of tuples is substituted into the rule and satisfies it (i.e., makes it true), then we consider that tuple as part of the set of answers to the query.
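This substitution semantics can be made concrete with a toy Datalog-style rule (the relation and facts are invented): ans(X, Z) :- parent(X, Y), parent(Y, Z), the classic "grandparent" query.

```python
# Evaluating a conjunctive query by substitution, Datalog style:
#   ans(X, Z) :- parent(X, Y), parent(Y, Z)     (grandparent; toy facts)

parent = {("ann", "bob"), ("bob", "cal")}

answers = {(x, z)
           for (x, y1) in parent
           for (y2, z) in parent
           if y1 == y2}   # the shared variable Y forces the join

print(answers)
```

Each candidate substitution that makes both body atoms true contributes its head tuple to the answer set, matching the definition above.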
While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well. In terms of data integration, "query containment" represents an important property of conjunctive queries. A query Q1 is contained in a query Q2 if the results of Q1 are a subset of the results of Q2 for any database; the two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries.
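Containment of conjunctive queries is decidable via the classic canonical-database ("freezing") test; the two toy queries below are invented. Q2's body is frozen into facts, and Q2 is contained in Q1 exactly when evaluating Q1 over those facts returns Q2's frozen head:

```python
# Canonical-database test for containment (toy example over one relation R):
#   Q1(x) :- R(x, y)            -- anything with an outgoing edge
#   Q2(x) :- R(x, y), R(y, z)   -- start of a length-2 path (stricter)

canonical_db = {("x", "y"), ("y", "z")}   # Q2's body with variables frozen

def eval_q1(db):
    """Evaluate Q1 over an arbitrary database for relation R."""
    return {a for (a, _) in db}

q2_contained_in_q1 = "x" in eval_q1(canonical_db)
print(q2_contained_in_q1)
```

Intuitively, Q2 demands more of its matches than Q1 does, so every Q2 answer is also a Q1 answer; the frozen-head test confirms this mechanically.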
Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by our user's query. This corresponds to the problem of answering queries using views (AQUV). In GAV systems, a system designer writes mediator code to define the query-rewriting.
Each element in the user's query corresponds to a substitution rule just as each element in the global schema corresponds to a query over the source. Query processing simply expands the subgoals of the user's query according to the rule specified in the mediator and thus the resulting query is likely to be equivalent. While the designer does the majority of the work beforehand, some GAV systems such as Tsimmis involve simplifying the mediator description process.