I see four common workflows/processes in big data systems, and curation, with options for qualified, vetted humans to fix data, could occur within all four. The identified concerns are numerous: garbage in, missing data, format mismatches, poisoned data, data spills, sensor miscalibration, timeliness, buffer overflows, downed network links, sites down, system crashes, data corruption, RAID failures, even government shutdowns. Curation operates like firefighting: long periods of boredom that breed complacency before the next crisis.
The four are data ingestion, data modeling, data retrieval, and data presentation. Data ingestion typically involves data discovery, data collection, data cleansing, error logging and notification with thresholds, and secondary remediation processing. Data modeling typically involves ETL (extract, transform, load) prior to model runs; filtering (e.g., by time, or partitioning into region or service); and statistical analysis for bad data points, outliers, and significant anomalies. The same pattern holds for the last two. The first rule of big data is to carefully vet all externally sourced data. The second rule of big data is to carefully vet all internally sourced data. The third rule is to keep vetting data at each stage.
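To make the ingestion-stage vetting concrete, here is a minimal sketch of cleansing with error logging and a rejection threshold. All field names, the helper names, and the 5% threshold are illustrative assumptions, not from any NIST specification:

```python
# Minimal sketch of ingestion-stage vetting: cleanse records, log errors,
# and hold the batch for secondary remediation if too many records fail.
# Field names and the 5% threshold are illustrative assumptions.

ERROR_THRESHOLD = 0.05  # hold the batch if more than 5% of records fail

def vet_record(record):
    """Return a cleansed record, or None if the record is unusable."""
    if "timestamp" not in record or "value" not in record:
        return None  # missing data: drop and log
    try:
        record["value"] = float(record["value"])  # format-mismatch check
    except (TypeError, ValueError):
        return None
    return record

def ingest_batch(raw_records):
    """Cleanse a batch, returning (clean_records, error_log)."""
    clean, errors = [], []
    for i, rec in enumerate(raw_records):
        vetted = vet_record(dict(rec))  # copy so the raw input is preserved
        if vetted is None:
            errors.append({"index": i, "record": rec})
        else:
            clean.append(vetted)
    error_rate = len(errors) / max(len(raw_records), 1)
    if error_rate > ERROR_THRESHOLD:
        # threshold exceeded: escalate for secondary remediation processing
        raise ValueError(f"error rate {error_rate:.1%} exceeds threshold")
    return clean, errors
```

In practice the raised exception would trigger notification and route the batch to a human curator rather than silently dropping records.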
I suggest we in the NIST BDWGs explore curation within all functional processing stages (e.g., mine and everyone's). Curation typically completes just before each batch of staged data is released to the next stage. Processing logs typically include counts, times, errors, corrections, actions, and outcomes.
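A per-batch processing-log entry capturing those counts, times, errors, corrections, actions, and outcomes might look like the following sketch. The field names and values are invented for illustration, not drawn from any NIST schema:

```python
# Sketch of one processing-log entry for a completed curation pass over a
# staged batch. Field names and sample values are illustrative only.
import json
import time

def make_batch_log(batch_id, received, accepted, corrected, errors, actions):
    """Assemble a log record summarizing one batch's curation outcome."""
    return {
        "batch_id": batch_id,
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "counts": {"received": received, "accepted": accepted,
                   "corrected": corrected, "rejected": received - accepted},
        "errors": errors,    # e.g. ["format mismatch", "missing field"]
        "actions": actions,  # remediation steps taken by the curator
        "outcome": "released" if accepted == received else "partial-release",
    }

entry = make_batch_log("2013-09-14-001", received=1000, accepted=990,
                       corrected=25, errors=["format mismatch"],
                       actions=["re-parsed timestamps"])
print(json.dumps(entry, indent=2))
```

Logs in this shape make it easy to audit each stage's release decision and to trend error rates over time.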
Of course, someone could merely wordsmith over your concerns. Some think curation smacks of data taxidermy. Others think of library science, or of content management systems. HPC usually addresses simulation verification.
1) I added the note about “sheer curation” within the Wikipedia article in anticipation of your comment. Poor curation looks like simple taxidermy/preservation to end users.
2) IEEE performs extensive, high-quality curation across journals, transactions, conferences, standards, education curricula, Web sites, e-learning, digital libraries, IEEE search engines, IEEE vertical expert courseware, and contracts with external businesses and organizations. IEEE members pay high dues in part to fund this massive curation effort. Please talk to your IEEE coworkers.
3) It is possible that many people have never seen big data systems from the inside, so they assume data self-organizes and serves their end uses. In reality, raw, ingested big data in the wild is generally neither available to nor usable by naive end users.
4) Most big data systems live or die based on the quality of their curation (e.g., Wikipedia, Google, Amazon, Facebook, Netflix). Provenance, governance, pedigree, authority, digital formats, indexing, translation, sensor calibration, annotations, metadata, links, identity databases, simulation validation, etc. are handled before end users access big data and continue while they access it.
5) How could any big data system be implemented without curation? Curation enables end users to access information efficiently: a single curator (or curation SW) assists an unlimited number of end users. Curation occurs as virtuous cycles with feedback.
Explain: what is the foundation for your concern or complaint? What do you know about curation with respect to big data?
The opening paragraph of the linked Wikipedia article actually reads as follows:
Digital curation is the selection, preservation, maintenance, collection and archiving of digital assets. Digital curation establishes, maintains and adds value to repositories of digital data for present and future use. This is often accomplished by archivists, librarians, scientists, historians, and scholars. Enterprises are starting to utilize digital curation to improve the quality of information and data within their operational and strategic processes. Successful digital curation will mitigate digital obsolescence, keeping the information accessible to users indefinitely.
We have a problem here. For the NBDWG, repeating the link: http://en.wikipedia.org/wiki/Digital_curation Note all the references to authoritative sources.
Compromise: The NBDWG should define “digital data curation” and use curation in its taxonomy, RA, SnP, and Roadmap. “Data preparation” should also be defined and used, but data preparation should implement only ingestion curation. A curation role should be included in the System Orchestrator.
Here is an example of curation efforts http://www.whitehouse.gov/
After the futility of just tossing government data into Data.gov became clear, Next.Data.gov was started to incorporate virtuous cycles with feedback, giving end users both what they want and what they actually use. This is an example of the necessity of data curation. CIOs and CTOs get curation; many have been burned by poor curation. Big data without feedback fails sooner or later.
Here is an example of a global organization standardizing, educating, and curating metadata for curation: http://dublincore.org/ Their mission statement, from their Web site, follows:
The Dublin Core Metadata Initiative (DCMI) supports shared innovation in metadata design and best practices across a broad range of purposes and business models.
DCMI does this by:
- “Managing long term curation and development of DCMI specifications and metadata terms namespaces;
- Managing ongoing discussion of current DCMI-wide work themes;
- Setting up and managing international and regional events;
- Curation and open availability of meeting assets including proceedings, project reports and meeting minutes;
- Creation and delivery of training resources in metadata best practices including tutorials, webinars and workshops;
- Coordinating the global community of DCMI volunteers.”
Note the recursion: DCMI curates its own digital data, which is intended for curating your digital data. DCMI practices what it teaches. Here is the Wikipedia article on Dublin Core: http://en.wikipedia.org/wiki/
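To show what Dublin Core metadata looks like in practice, here is a sketch that builds a minimal record using only the Python standard library. The element names (title, creator, date, format, identifier) are genuine DCMI terms in the dc: namespace; the sample values describing a dataset are invented for illustration:

```python
# Build a minimal Dublin Core metadata record with the Python standard
# library. The dc: element names are real DCMI terms; the sample values
# are invented for illustration.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
for term, value in [
    ("title", "Regional Sensor Readings, 2013"),
    ("creator", "Example Curation Team"),
    ("date", "2013-09-14"),
    ("format", "text/csv"),
    ("identifier", "urn:example:dataset:42"),
]:
    elem = ET.SubElement(record, f"{{{DC_NS}}}{term}")
    elem.text = value

print(ET.tostring(record, encoding="unicode"))
```

Even this tiny record illustrates the point below: someone (or some software) has to author, validate, and maintain these terms; metadata does not appear on its own.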
The IEEE member Web portal has 1,610,000 article hits for “curation” versus 9,050,000 hits for “big data”. The IEEE Technical Committee on Digital Libraries uses curation often. There is a Digital Curation Centre at the University of Edinburgh, UK. There are digital curation lifecycle models that implement information lifecycle management. There are suggestions for organizing datasets for concept and faceted discovery. Alternatively, there are suggestions for applying curation for ad hoc data mining. All this is part of the IEEE literature backed with citations.
The ACM Web portal has 575 article hits for “curation” versus 1,940 for “big data”, and the ACM member digital library has 1,640 hits for “curation” versus 2,349 for “big data”. My favorite quote emphasizes virtuous cycles with feedback:
“Digital media is evolving. After years of rapid growth, we have entered an age of digital curation. Curation means selectivity in the way we use technology—it means drilling down, finding what tools work for which jobs and honing those tools. This new era presents new opportunities for social change and economic growth.”
Relevant to NIST, the Proceedings of the 6th International Conference on Theory and Practice of Electronic Governance has a tutorial on “Digital curation for public sector professionals” that emphasizes curation as key to governance activities. Other ACM articles tie curation to cyber infrastructure.
The U.S. Congress tasked the Library of Congress with the National Digital Information Infrastructure and Preservation Program, and the U.S. National Archives and Records Administration with the Electronic Records Archives. Research USGS, NOAA, NASA, and DOE and you will find advanced data curation.
Surprisingly, “data preparation” appears together with “data mining” in over half of the search hits for ACM and ACM library searches. The IEEE member Web portal has 39,400 hits for the two terms together, versus 264,000 hits for “data preparation” alone and 1,610,000 hits for “curation”.
Wikipedia has a useful definition of data curation that includes sheer curation during ingestion and remediation. Good curation adds marketable value to a dataset. Stakeholders rely on reputable curators to assess collections and suggest corrective measures and best practices: for example, document formats (Word ’97, OOXML, ODF, PDF/A, Rich Text, ASCII), linking via URLs, URIs, URNs, and PURLs, and cataloging and indexing.
Poor curation appears to be data taxidermy. Research libraries are heavily involved in curation as part of library science (e.g., cataloging, indexing, translations, establishing pedigree/provenance, establishing authoritative collections, discovering sources & materials, sharing).
Digital data curation includes more than traditional library science: annotating, tagging, indexing, referencing, citing, linking, and graphing. Just google it. Metadata does not spontaneously appear. A pile of data does not magically become a useful dataset.