The Definition of “Big Data”
First, “digital data” is digitally encoded data that is automatically interpretable and recursively reducible to a unique structure of primitive data types. “Digital data” exists “at rest” on storage media and “in flight” through communications channels. Some “digital data” is easy for humans to read. “Binary digital data” is intended to be only machine readable.
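To make “recursively reducible to a unique structure of primitive data types” concrete, here is a minimal Python sketch (illustrative only, not part of the definition) that reduces a record to the primitive types composing it:

    def reduce_to_primitives(value):
        """Recursively report the primitive types that make up a value."""
        if isinstance(value, (int, float, bool, str, bytes)):
            return type(value).__name__                      # primitive leaf
        if isinstance(value, dict):
            return {k: reduce_to_primitives(v) for k, v in value.items()}
        if isinstance(value, (list, tuple)):
            return [reduce_to_primitives(v) for v in value]
        raise TypeError(f"not reducible to primitives: {type(value)!r}")

    record = {"id": 42, "name": "sensor-7", "readings": [1.5, 2.25], "ok": True}
    print(reduce_to_primitives(record))
    # {'id': 'int', 'name': 'str', 'readings': ['float', 'float'], 'ok': 'bool'}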
Second, “big digital data” is usually shortened to just “big data” in common discourse. “Big data” serves as a catch-all term in discussions about data spread over networks, including clusters. Do not overuse the term.
Third, “big digital data system” is usually shortened to either “big data system” or just “big data”. When “big data” is mentioned, context must make clear whether “big digital data”, a “big digital data system”, or both are meant. Because “big data” means different things to different groups in different discussions, use more specific terminology.
Fourth, both “big digital data” and “big digital data systems” should satisfy extensive engineering, business, and legal criteria, formally represented elsewhere in taxonomies and architectures and requirements, and characterized by dynamically controlled resources, information life cycles, intellectual property, standard structure & function, and customizable process & content. An introductory definition should set aside these “devil in the details” complexities, which are presented elsewhere.
Historically, “big data systems” started decades ago as potent mashups of centralized data repositories with available networks, forming overlay networks (e.g., FTP, Gopher, HTTP, Web crawlers, P2P, RSS). Today, “big data” is transforming government, business, education, and society for the 21st Century. “Big data” emerges from labs and dorm rooms as a market driven phenomenon on internetworks. Current examples include Google’s indexes and graphs, Facebook’s social graph, Amazon’s retail catalog, Netflix’s video library, and Wikipedia’s knowledge base. Big data = data + network + services + markets. Successful big data systems balance all four facets.
At network scale, “big digital data” lives on the network, stays on the network, scales to the size of the whole network, is processed in place, and cannot thrive off the network. “Big data systems” spawn “network effects” that enable futures beyond those of isolated system silos. For example, accessible “big data” enables “information markets” where participants share. Successful “big data systems” typically utilize “instant on” expansion of shared hardware and software and bandwidth. “Big data” can virally grow more big data in virtuous cycles that could spiral out of control, so market forces should modulate growth and evolution. “Big data” travels across “overlay networks” organized around shared services for their ecosystems. “Big data” initiatives drive much of the evolution of the Internet today.
“Big digital data systems” consist of a constellation of standard and custom services. Because “big data” is valuable enough to support long term operations, reliable information life cycles are often provided. “Big data” has business and legal consequences, so security, auditing, provenance, governance, and more are also provided. Successful “big data” dynamically scales to the network workload, so dynamic controls add and remove resources on demand. “Big data” flows from ingestion sources to data repositories to network caches to end users and back when end users are ingestion sources. “Big data” filtering occurs at each processing step, such as data cleansing during ingestion and relational model filtering for extracts. Some big data systems “federate” external data repositories to provide a unified view as a user service. Current big data systems are a study in contrasts.
This definition covers much. First, this definition is technology agnostic. Second, this definition is as free of technical theory as practical, based on “digital data” and “system” and “network”. Third, this definition embeds any big data system within its networked environment, where the environment’s resources are dynamically shared. Fourth, any big data system degrades when isolated from its networked environment. Fifth, network scale data and resource “aggregation” produces “network effects” creating new and enhanced opportunities. Sixth, “markets” are emergent behaviors found throughout history wherever products or services are available in volume over time. Seventh, “intellectual property rights” restrict “information markets”. Eighth, overlay networks “cache” even the smallest node’s data to provide network scale data access, including intermittently connected sensor nodes.
This definition prepares the reader to detect and characterize any “X” system as involving big data. Further, the definition is elastic enough to encompass technology, internetwork, and market evolution across the 21st Century. The definition avoids relying on data size, under which a system that meets a size criterion today would fail some future size criterion raised by exponentially improved hardware (e.g., drive capacity, network bandwidth, CPU count). Big data = data + network + services + markets. That is easy to remember. Explaining “big data” is akin to unfolding origami.
Fifth, consider a self referential case: if this definition is hosted on Wikipedia, then it is “big data” because Wikipedia is a big data system. If this definition sits on some permission controlled file system on a single computer, then it is not.
Sixth, the instant any cooperating wide area network (WAN), metropolitan area network (MAN), local area network (LAN), personal area network (PAN), or mobile ad hoc network (MANET) comes online, “big data systems” respond to “big data” flows. This scenario repeats continually, day and night, indefinitely. The scenario was promised back in the 1960s and has been evolving among practitioners ever since. In the future, big data will be normal data.
How To Transform Digital Data Into “Big Data”
Big data systems are emergent phenomena growing whenever conditions are conducive. A key behavior is that big data systems scale dynamically to realtime traffic. Rapid optimization by adaptive control promotes rapid growth. The overlay networks are as varied as the big data systems themselves; however, big data systems usually run over standard internetworks. The lower level Internet protocols typically serve as the starting foundation for big data systems (e.g., physical link control, data link control, network control, and transport control); meanwhile, the higher level Internet protocols have been largely ignored in favor of custom solutions (e.g., FTP, P2P, software agents, botnets). With big data, things are coming together faster than bureaucracies can evolve. Your spidey sense should be tingling now. All successful systems are fortunate accidents, so avoid any tendency to over plan and micromanage big data networks.
“Abstraction” involves generalizing over a well defined set of specific cases, supporting translation and proof and protocols. For example, general purpose programming languages abstract the behaviors and capabilities of a population of computers, abstracting “memory objects”, “instruction set”, “synchronous software traps”, and “asynchronous hardware interrupts”. As long as the abstractions are valid, compilers reliably translate source abstractions to machine code. Abstraction supports formal proofs of system behavior. Big data depends heavily on standard and custom abstractions to achieve interoperability and portability and guarantees of behavior at network scale. Big data systems recursively abstract layers of metadata. Modular abstractions promote reuse.
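As a rough illustration of abstraction supporting interoperability and modular reuse, the following Python sketch defines a hypothetical DataStore interface with two interchangeable implementations; the class and method names are assumptions for this example only, not a standard API:

    from abc import ABC, abstractmethod

    class DataStore(ABC):
        @abstractmethod
        def put(self, key: str, value: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class InMemoryStore(DataStore):
        def __init__(self):
            self._data = {}
        def put(self, key, value):
            self._data[key] = value
        def get(self, key):
            return self._data[key]

    class LoggingStore(DataStore):
        """Wraps any DataStore, layering on a metadata record (an access log)."""
        def __init__(self, inner: DataStore):
            self.inner, self.log = inner, []
        def put(self, key, value):
            self.log.append(("put", key))
            self.inner.put(key, value)
        def get(self, key):
            self.log.append(("get", key))
            return self.inner.get(key)

    store = LoggingStore(InMemoryStore())   # layered abstractions, reusable modules
    store.put("doc:1", b"hello")
    print(store.get("doc:1"), store.log)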
“Virtualization” involves abstractions of physical hardware, software, and data. For example, time sharing systems virtualize one computer’s hardware resources so that many interactive jobs run independently on shared resources. RAID storage systems virtualize a set of disks as disk partitions to improve speed, scale dynamically, and guarantee reliability. A Java virtual machine emulates an abstract machine on a host computer. Software defined networking virtualizes a data center’s switch ports as a set of IP networks. Big data systems virtualize system resources and data locations so optimization and control coherently distribute down to each component, using standard protocols and application programming interfaces (APIs) and application binary interfaces (ABIs).
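A minimal sketch of virtualizing data locations, where callers address data by a virtual key and a mapping layer chooses the physical node; the node names and the hash based placement are illustrative assumptions:

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]          # physical resources

    def locate(virtual_key: str) -> str:
        """Map a virtual key to a physical node; callers never see the node list."""
        digest = hashlib.sha256(virtual_key.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    for key in ["user:42", "video:7", "index:shard:3"]:
        print(key, "->", locate(key))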
“Paravirtualization” is specialized, hard time constrained virtualization that emulates foreign IO hardware events in realtime. For example, hypervisors paravirtualize IO on x86 architecture CPUs. Time sharing virtualization cannot handle synchronous events or hard realtime constraints in practical applications. Paravirtualization supports network communications, including data replication, server failover and failback, and peer to peer overlay networks.
“Community Software Projects” support current big data services. All big data systems use multiple externally created software libraries, and they often feed updates back into the parent projects. Big data has popularized some high performance computing (HPC), map-reduce, machine learning, statistical network, and domain specific software libraries. Plus, more projects are anticipated. One indicator of big data’s adoption rate is the creation and sharing of software in support of vertical communities (e.g., geological, climate, weather, health, science, engineering, marketing). Responsible big data systems utilize and support community projects.
“24×7 operation” implies live updating for fixes (e.g., eliminating bugs and vulnerabilities) and upgrades (e.g., modified, added, or removed hardware and software). Live updates require administrators to let ongoing activities complete on the current version of the software while new transactions launch in parallel on the new version.
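A minimal sketch, under simplifying assumptions, of that live update pattern: in-flight transactions drain on the current handler version while newly arriving transactions dispatch to the new version. The handler names are hypothetical:

    def handler_v1(tx):
        return f"v1 handled {tx}"

    def handler_v2(tx):
        return f"v2 handled {tx}"

    in_flight = ["tx-101", "tx-102"]                       # started before the update
    results = [handler_v1(tx) for tx in in_flight]         # drain on the current version

    active_handler = handler_v2                            # switch the dispatch point
    results += [active_handler(tx) for tx in ["tx-103", "tx-104"]]  # new transactions
    print(results)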
“Lifecycle Management” optimizes big data at rest and whenever modified. A common architecture migrates chunks of data over time from high performance online storage, to lower performance nearline storage, to archival offline storage. Lifecycle management includes practical “findability” in the future and also efficient “data access” at high volume today. Optimization of data lifecycles utilizes records of sources and changes and timestamps, for versioning and provenance. For example, the Library of Congress and National Archives are involved with big data life cycles.
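A minimal sketch of age based tier migration, assuming illustrative tier names and thresholds (a real policy would also weigh access frequency, cost, and findability requirements):

    from datetime import datetime, timedelta

    def choose_tier(last_accessed: datetime, now: datetime) -> str:
        age = now - last_accessed
        if age < timedelta(days=30):
            return "online"        # high performance storage
        if age < timedelta(days=365):
            return "nearline"      # lower performance, still reachable
        return "offline"           # archival storage

    now = datetime(2024, 1, 1)
    chunks = {
        "chunk-1": datetime(2023, 12, 20),
        "chunk-2": datetime(2023, 6, 1),
        "chunk-3": datetime(2020, 1, 1),
    }
    for name, accessed in chunks.items():
        print(name, "->", choose_tier(accessed, now))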
“Data Cleansing” is any filtering process for checking and correcting data prior to updating a database or repository. Retaining the original, messy data is a best practice. Each ingested data set could generate multiple cleansed data sets ready for diverse uses. Common data cleansing filters deduplicate, verify, validate, recalibrate, annotate, relink, and index input data.
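A minimal sketch of a cleansing filter chain in Python, keeping the original messy data untouched while producing a cleansed copy; the field names and validity rules are assumptions for the example:

    raw = [
        {"id": 1, "temp_c": 21.5},
        {"id": 1, "temp_c": 21.5},      # duplicate
        {"id": 2, "temp_c": -999.0},    # sensor error sentinel
        {"id": 3, "temp_c": 19.0},
    ]

    def deduplicate(records):
        seen, out = set(), []
        for r in records:
            key = (r["id"], r["temp_c"])
            if key not in seen:
                seen.add(key)
                out.append(dict(r))
        return out

    def validate(records, low=-80.0, high=60.0):
        return [r for r in records if low <= r["temp_c"] <= high]

    def annotate(records, source):
        return [{**r, "source": source} for r in records]

    cleansed = annotate(validate(deduplicate(raw)), source="station-12")
    print(len(raw), "raw records ->", len(cleansed), "cleansed records")
    print(cleansed)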
“Curation” is a virtuous cycle involving the creation, management, preservation, restoration, and arbitrage of valuable items. Big data discovered in the wild usually requires curation. Research librarians, archivists, database architects, legal advisors, and stakeholders are potential curators. Governance, provenance, cleansing, findability, data access control, spill mitigation, service arbitrage, personally identifying information (PII) handling, data integrity, vetting against spoofing and masquerading, logging, auditing, and more are in scope.
“Provenance and Governance” rely on curation while safeguarding the organization. Governance involves virtuous cycles providing assurances of conformance to ethical, legal, and organizational procedures. Provenance involves virtuous cycles maintaining data lifecycles as chains of evidence, providing assurances of data integrity by documenting its accuracy, precision, completeness, veracity, and more. Effective governance requires the assurances of effective provenance. Metadata as linked annotations serves to preserve a temporal record of artifacts, processes, and agents. Publish/subscribe messaging provides realtime alerting and notifications. NIST standards can align the transformation from law to regulations to codes.
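A minimal sketch of provenance as a hash linked chain of evidence, where each lifecycle event records the agent, action, timestamp, and a link to the previous record so tampering anywhere breaks the chain; the event fields are illustrative assumptions, not a standard format:

    import hashlib, json
    from datetime import datetime, timezone

    def append_event(chain, agent, action, payload):
        prev_hash = chain[-1]["hash"] if chain else "0" * 64
        event = {
            "agent": agent,
            "action": action,
            "payload": payload,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
        }
        event["hash"] = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode()
        ).hexdigest()
        chain.append(event)
        return chain

    def verify(chain):
        for i, event in enumerate(chain):
            body = {k: v for k, v in event.items() if k != "hash"}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if digest != event["hash"]:
                return False
            if i and event["prev_hash"] != chain[i - 1]["hash"]:
                return False
        return True

    chain = []
    append_event(chain, "ingest-service", "ingested", {"file": "survey.csv"})
    append_event(chain, "cleansing-job", "deduplicated", {"removed": 3})
    print("chain valid:", verify(chain))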
A “Write Only Database” is a bridge to nowhere due to very low database utilization. Some large databases have 80 percent or more of their content never read by end users. The existence of write-only databases illustrates why big questions should be asked before starting big data projects: big answers supply domain specific organizing principles, use cases, and monetization strategies.
A “Read Only Database” limits users to only reading the contents. Read-only databases force all users into the predicament of having to repeat the same data processing steps every single time starting with the original data. Every job of each user starts with the original data, including correctable problem data. Every user discovers the same data problems independently of previous users. No user benefits from the knowledge and experience of subject matter experts who already know about the problem data and the optimal techniques. It is no wonder that read-only databases are the overwhelming majority of write-only databases.
Business and legal professionals should be involved from the start. Big data crosses borders and impacts people’s lives. The complete spectrum of business and legal consequences should be evaluated from the start. Business and legal professionals should support big data systems as long term assignments. Expect to be legally threatened as soon as you are successful and noticed. Expect cease and desist orders. Expect class action lawsuits. Expect to be blindsided. The US patent system effectively makes everyone a target. Foreign courts blame US executives for a range of their local problems. Comply with all ethical and legal requirements, log everything, audit and review, draw up contingency plans, drill for emergencies, periodically have outsiders review, rapidly assess and react to complaints with follow-ups, monitor user sentiment, anticipate insider threats, and restrict executive access to big data.
Separable “Big Data” Concerns
Factoring big digital data and big digital data systems leads to a set of separate concerns: discovery, findability, presentation, data links, data matching, data first versus system first, data poisoning, data leaks.
Data ingestion is any process of inputting externally sourced data to a big digital data system. Successful data ingestion requires the cooperation of the external data sources. Successful data ingestion also requires throttling network bandwidth within acceptable high and low levels. Current big data systems use a wide variety of one off data ingestion optimizations, including cleansing (such as deduplication and versioning). The single common concern of data ingestion is “timely discovery and vetting of all relevant externally sourced data”. Discovery concerns drive dynamic data ingestion optimization: the dynamics of external sources, network resources, and ingestion rates steer big data to increasingly smarter ingestion processes. Discovery optimization is limited by legal, business, and engineering constraints. Discovery success enhances the market value of the big data system.
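A minimal sketch of throttled ingestion using a token bucket to keep the ingest rate within a bandwidth budget; the rates and the stand-in data chunks are illustrative assumptions:

    import time

    class TokenBucket:
        def __init__(self, rate_bytes_per_s, burst_bytes):
            self.rate, self.capacity = rate_bytes_per_s, burst_bytes
            self.tokens, self.last = burst_bytes, time.monotonic()

        def consume(self, nbytes):
            """Block until nbytes of bandwidth budget is available."""
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                time.sleep((nbytes - self.tokens) / self.rate)

    bucket = TokenBucket(rate_bytes_per_s=1_000_000, burst_bytes=250_000)
    for chunk in [b"x" * 100_000] * 5:          # stand-in for externally sourced data
        bucket.consume(len(chunk))
        # ...hand the chunk to cleansing and the repository here...
    print("ingested 5 chunks within the bandwidth budget")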
Data extraction is any process of outputting internally sourced data from a big digital data system. Successful data extraction requires user authentication and authorization and auditing. Successful data extraction also requires throttling network bandwidth. Current big data systems use a wide variety of standard query languages and custom search filters, with the two big camps being “SQL” and “NoSQL” databases, and with a growing semantic Web camp based on XML, RDF, and OWL. The single common concern of data extraction is “100 percent data findability with 0 percent false detection or extraction”. Findability concerns drive data repository optimization, exploiting additional types of processing: syntactic and semantic information, contextual metadata, graphs of links, machine learning, Boolean networks, etc. Findability is a core optimization concern for engineers as well as a performance metric visible to end users of big data.
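A minimal sketch contrasting the two camps on the same records: a declarative SQL extraction and an equivalent NoSQL style filter over documents; the table, fields, and query are illustrative assumptions:

    import sqlite3

    rows = [(1, "alice", 34), (2, "bob", 29), (3, "carol", 41)]

    # SQL camp: declarative query against a relational table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
    db.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    sql_result = db.execute("SELECT name FROM users WHERE age > 30").fetchall()

    # NoSQL style camp: the same filter expressed directly over documents.
    docs = [{"id": i, "name": n, "age": a} for i, n, a in rows]
    nosql_result = [d["name"] for d in docs if d["age"] > 30]

    print(sorted(r[0] for r in sql_result) == sorted(nosql_result))  # True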
Data presentation is any process of representing extracted data to end users. Unlike ingestion and extraction processes, presentation utilizes advanced human to computer interfaces (HCI). Presentation activities tend to divide into standard versus ad hoc presentation formats and tools. Standard presentation tends to be periodic reports. Ad hoc presentation tends to be highly interactive. Visualization tools are emphasized in some data presentation systems. The single common concern of data presentation is “effective, accurate, precise communication between human and computer”. Data presentations could be human orchestrated or event triggered. Data presentation processes are the public face of big data to end users. Optimizing presentation encourages virtuous cycles with users and stakeholders.
Data links connect data nodes to create data graphs and annotate data nodes. Historically, data links have been implemented as memory addresses in a computer’s main memory, file names in a computer’s file system, index values in indexed lists, Web addresses, abstract URLs (like PURLs), URNs, etc. The two common concerns of data links are “referential integrity” and “realtime performance” in the face of data and system changes over time, as running programs shut down, files are deleted or moved, lists are deleted or rebuilt, IP addresses change, link services fail or change, standards change. When a link breaks, the error is easily detected. However, changes or instabilities in the link’s ecosystem can be difficult to detect and recover from. Big data systems should utilize redundant linkage mechanisms and recover from broken links. Bit rot eats away at historical data links. Wayback machines have been created to provide periodic snapshots.
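A minimal sketch of redundant linkage: a logical link carries several concrete locators and falls back when one breaks, so referential integrity survives a failed link service; the resolver registry here is an illustrative stand-in for real resolution services:

    RESOLVERS = {                      # stand-in for live locations of one object
        "doi:10.1234/abcd": None,      # broken: service failed
        "https://mirror.example.org/abcd": "object bytes",
    }

    def resolve(locators):
        """Try each locator in order; report which ones are broken."""
        broken = []
        for loc in locators:
            value = RESOLVERS.get(loc)
            if value is not None:
                return value, broken
            broken.append(loc)          # easily detected; log for repair
        raise LookupError(f"all locators broken: {broken}")

    data, broken = resolve(["doi:10.1234/abcd", "https://mirror.example.org/abcd"])
    print("recovered:", data, "| broken links:", broken)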
Data matching is any process of ad hoc dynamic linking of disparate data nodes in a data repository. Successful data matching is advanced science and art based on exploiting available information, in spite of flaws in the information. Well designed database keys boost data matching success rates for planned matches (e.g., joins). The single common concern of data matching is “properly connecting all the nodes into the statistically most accurate and complete graph based on limited and flawed and changing information”. Historically motivated artificial intelligence camps are working on data matching, ranging from classical logic rules as action triggers to newer supervised and unsupervised machine learning decision models. Progress is uneven and is holding back some big data applications.
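A minimal sketch of ad hoc data matching by string similarity when no shared key exists; the similarity measure and threshold are illustrative assumptions that a real system would tune against labeled examples:

    from difflib import SequenceMatcher

    source_a = ["Acme Corporation", "Globex Inc", "Initech LLC"]
    source_b = ["ACME Corp.", "Initech", "Umbrella Co"]

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    matches = []
    for a in source_a:
        best = max(source_b, key=lambda b: similarity(a, b))
        score = similarity(a, best)
        if score >= 0.6:                      # threshold chosen for the example
            matches.append((a, best, round(score, 2)))

    print(matches)   # limited, flawed information still yields a usable graph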
Data and systems interact in a dynamic dance. Some decide that the available data drives the system, while others decide the available system drives the data. The single common concern of the debate is “future growth and opportunities”. Data people want what is best for their valuable data ecosystems. System people want the best system that technology and money can support. The debates that inevitably happen usually end up at the dichotomy between the “cathedral and the bazaar”. Both staging these debates and eventually turning off the debates are healthy. In this oft played debate, the cathedral group wants to first build a large cathedral like system to house big data for the future; meanwhile, the bazaar group wants to market big data as soon as possible as widely as possible for a “first to market advantage”. More than once, this schism has been a game ender.
Data poisoning is any process of sourcing bad data. Failed sensors, bad data calibration, format errors, processing mistakes, data loss, and more cause data poisoning. Provenance, governance, cleansing, curation, recurrent testing, and more provide assurances against widespread data poisoning. Malicious data poisoning is more difficult to manage, but logging and recurrent auditing can detect malicious activities. The two common concerns of data poisoning are “whether anyone will respond to data poisoning” and “what remedies are available”. Diligence and compliance responsibilities dictate that management should follow written plans including methods and responses and contingencies and cleanup activities. Data redundancy can provide critical data protection. Continual data monitoring can minimize the duration of data poisoning.
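A minimal sketch of continual data monitoring against poisoning, checking each batch against an expected range and a simple drift test; the thresholds are illustrative assumptions for the example only:

    from statistics import mean

    EXPECTED_RANGE = (-40.0, 55.0)          # physically plausible sensor values
    BASELINE_MEAN, MAX_DRIFT = 18.0, 5.0    # from historical, vetted data

    def check_batch(values):
        alerts = []
        out_of_range = [v for v in values if not EXPECTED_RANGE[0] <= v <= EXPECTED_RANGE[1]]
        if out_of_range:
            alerts.append(f"out-of-range values: {out_of_range}")
        if abs(mean(values) - BASELINE_MEAN) > MAX_DRIFT:
            alerts.append(f"mean drifted to {mean(values):.1f}")
        return alerts

    print(check_batch([17.9, 18.4, 19.1]))          # []  -> clean batch
    print(check_batch([17.9, 240.0, 18.4]))         # out of range and drifted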
A data leak is any unauthorized disclosure of data. Historically, insider threats come in two flavors: incompetence and malicious intent. However, the damage and recovery are usually the same after mistakes or attacks. Fortunately, training and teaming can significantly reduce both the frequency and impact of mistakes. The single common concern is “the consequences of data leaks”. Diligence and compliance responsibilities point to management as ultimately responsible for data leaks. Continual monitoring using probes, firewalls, traffic analytics, workload metrics, individual activity logs, audits, and more is part of diligence. Security should always be redundant and changing. Plan data leak responses and be prepared for “outside the box” surprises.
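A minimal sketch of leak oriented monitoring over an activity log, flagging accounts whose extraction volume exceeds an agreed workload metric; the log format and threshold are illustrative assumptions:

    from collections import Counter

    activity_log = [
        ("alice", "extract", 10), ("bob", "extract", 12),
        ("mallory", "extract", 900), ("mallory", "extract", 1200),
    ]

    MAX_DAILY_RECORDS = 500                 # agreed workload metric per user

    totals = Counter()
    for user, action, record_count in activity_log:
        if action == "extract":
            totals[user] += record_count

    flagged = {u: n for u, n in totals.items() if n > MAX_DAILY_RECORDS}
    print("review these accounts:", flagged)   # {'mallory': 2100}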