Documents for the Security and Privacy subgroup meeting.
All documents are Creative Commons ShareAlike 4.0.
The contents demonstrate the state of the art in big data enabled data centers. The book is very well written and edited for generic clouds, so specialized technologies like OpenGL/OpenCL GPUs are not covered. Common performance and scaling concerns are addressed. Vertical business concerns are not mentioned.
My assessment is that this is a marketing tool for CIO/CTO/Program Manager personnel, a self study tool for architects and engineers, and a gentle introduction for project staff who are new to big data.
Briefing on candidate reference architectures for CPS scenarios by NIST Big Data PWG Co-Chair Bob Marcus.
NIST has launched a Smart City Program. Smart Cities are an example of CPS System of Systems (SoS) applications. Modeling CPS SoS requires an extension of current Big Data, CPS, and Cloud Reference Architectures. Some possibilities include
I have described some of these issues in a presentation on Reference Architectures for Layered CPS SoS. The growing interest in Smart X applications will make the extension of NIST’s current Reference Architectures useful to many stakeholders.
From a January 2016 FTC announcement:
A new report from the Federal Trade Commission outlines a number of questions for businesses to consider to help ensure that their use of big data analytics, while producing many benefits for consumers, avoids outcomes that may be exclusionary or discriminatory.
“Big data’s role is growing in nearly every area of business, affecting millions of consumers in concrete ways,” said FTC Chairwoman Edith Ramirez. “The potential benefits to consumers are significant, but businesses must ensure that their big data use does not lead to harmful exclusion or discrimination.”
The report, Big Data: A Tool for Inclusion or Exclusion? Understanding the Issues, looks specifically at big data at the end of its lifecycle – how it is used after being collected and analyzed, and draws on information from the FTC’s 2014 workshop, “Big Data: A Tool for Inclusion or Exclusion?,” as well as the Commission’s seminar on Alternative Scoring Products. The Commission also considered extensive public comments and additional public research in compiling the report.
The FTC report is available for download as of this writing.
Java? Visual Age? Workspaces? Eclipse? JIT? Scala? Solr/Lucene? Nudge? Hadoop? UIMA? Spark? UML2? M2T & T2M? Open Source? DIY’ers? Git? Github? Cloud? Yum? Containers? SNMP? PDF? XML? Everything is a Service? Distributed protocols? Distributed algorithms? IoT? IoE? Internet2?
Here is the link to this week’s keynote at NA EclipseCon in Reston, VA:
Those stuck in the past obsess over their favorite V words and ontologies. They clearly do not understand the linked keynote speakers. They operate like cargo cults, focused on mystical pretend words fought over by their high priests while spilling information.
Those working in the future will apply federated workspaces containing code and datasets, often residing in clouds, as demonstrated in the linked keynote. Those demos utilize containers first constructed and later run in a federated environment, both “within” shared workspaces (e.g., devops or users) and “across” shared workspaces (e.g., devops and users). Users can select just the relevant workspaces for their own tasks, supporting everything from fine-grained to commons-wide usage. Workspaces could hold testing sets, cross-validation sets, and training sets with supporting code. Workspaces can be traded like baseball cards, or shared like a commons. JIT compilation provides portability and interoperability across all relevant hardware architectures. A9.com estimates JIT code runs 1.6 times slower than highly optimized C/C++ code (which is good enough). Eclipse Che is in beta now. Notice how servers disappear into the background.
There are future software barriers yet to solve, including many cores / many lanes, GPGPUs, FPGAs, cache management, memory management, network management, and power management (yes, this is critical at the fine scale of HW functional units: how long can you boost the speed of a functional unit before cooling it back down?). There will be more PCIe lanes on a chip than between chips, and more PCIe lanes on a circuit board than between boards; hence, many lanes has replaced many cores in architecture discussions.
Here is the link to Werner Vogels’ blog post on AWS lessons learned:
Those left behind will obsess over their favorite V words and ontologies. That is their only agenda. They enjoy pretending to make up never-before-seen definitions of old, commonly used words to confuse everyone, corrupt discourse, and block progress.
Werner and the AWS teams have done the heavy lifting and set today’s high bar for big data success for enterprises. They have defined the metrics and measurement processes for global clouds & interclouds and HTC. They have measured and compared programming languages (e.g., Java is good; Python is 6 to 8 times slower than C/C++). They have implemented the research of Leslie Valiant and Leslie Lamport (e.g., TLA+) to create AWS. They deeply understand distributed algorithms. They are do-it-yourselfers for hardware and networking and software. They have established very advanced software methodologies and are continually fine-tuning their devops process. They listen to the world’s software experts, like Stroustrup, Stepanov, and Odersky.
Heed Occam’s razor or fail repeatedly,
P.S.: These are two examples of the API ecosystem that Wo has been requesting since 2014. Eclipse has legs and the support of all the big vendors, including Microsoft now. AWS is two to five years ahead of the other cloud vendors.
Daniel and Rick:
Please see attached graphic of the Data Curation Process. Is this getting closer to what you are saying? I can revise this as needed.
I have been creating an OWL file for all of the Big Data related processes, artifacts, characteristics, and standardized relations. It makes creating the visualizations rather quick, but I need SME input.
Let me know if you have other graphics/visualizations that would be helpful. Just identify the particular process or artifact (e.g. Data Store) and its relation to other things.
The OWL file I have so far has a lot of good content in it, but it needs SME oversight. I know how to categorize and relate things, but I need experts in Big Data to guide the process.
The upper level taxonomy I extend from is the same that any scientific effort would use (e.g. genetics, medicine, bioinformatics, etc.) with categories for Physical Objects, Properties, Characteristics, Roles, and Processes or Events.
Some of the relations I have so far are attached.
As needed, I can quickly create any inference from the content of the Big Data Taxonomy OWL file.
Furthermore, we can even add actual instances to the OWL file. For example, for the class Research Lab we can add the names of actual research labs, and likewise for Genomic Data Set, Measurement Result, etc. Then we can start to relate actual organizations with actual processes (e.g., production or consumption) and outputs (like the creation of some Data Set).
If anyone wants to describe some use case, I will be happy to create the graphic depiction of it in OWL format. Just include the Organization Type, the Organizational Role(s) (e.g., Data Consumer), the Process involved, and the Product of that process. Maybe that product is an Input to another Process?
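As a rough sketch of how such instance-level content could be related and queried, here is a minimal plain-Python triple store; all of the names (AcmeLab, SequencingRun7, the bd: terms) are hypothetical placeholders, not entries from the actual OWL file:

```python
# Minimal sketch of instance-level assertions in triple form; the
# organizations, processes, and relations are illustrative only.
triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

# Class memberships for actual instances
add("AcmeLab", "rdf:type", "bd:ResearchLab")
add("GenomicDataSet1", "rdf:type", "bd:DataSet")

# Roles, processes, and products, as in the use-case template above
add("AcmeLab", "bd:hasRole", "bd:DataConsumer")
add("AcmeLab", "bd:performs", "bd:Consumption")
add("SequencingRun7", "bd:produces", "GenomicDataSet1")
add("GenomicDataSet1", "bd:isInputTo", "bd:Consumption")

def objects_of(subject, predicate):
    """Simple retrieval: all objects related to subject by predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

A real OWL file would of course use an ontology toolchain rather than raw tuples, but the pattern of relating an Organization, its Role, a Process, and its Product is the same.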
-Bill Mandrick – Data Tactics
We clearly do not have consensus on the use of the term curation. There have been good references for its use in an end-to-end data management context, as well as for its common use in the archival context.
While I personally like the term curation, I think the consensus would be better served to heed Rick’s initial suggestion (and we appreciate his persistence to make his point), and change to two terms:
Let’s use Preparation for the process that turns raw data into information (cleansing, outlier removal, imputation, regularization, transformation, etc.) for the follow-on analytics or visualization processes. (So we have collect-prepare-analyze-act.)
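To make the Preparation stage concrete, here is a minimal sketch assuming a single numeric field with missing values; the z-score cutoff and mean imputation are illustrative choices, not NBDWG-defined rules:

```python
import statistics

def prepare(raw_values, z_cutoff=3.0):
    """Cleansing, outlier removal, and imputation for one numeric field."""
    # Cleansing: drop records that are neither numeric nor missing
    cleaned = [v for v in raw_values
               if isinstance(v, (int, float)) or v is None]
    present = [v for v in cleaned if v is not None]
    mean = statistics.fmean(present)
    stdev = statistics.stdev(present)
    # Outlier removal: simple z-score rule on the non-missing values
    kept = [v for v in cleaned
            if v is None or stdev == 0 or abs(v - mean) / stdev <= z_cutoff]
    # Imputation: replace missing values with the mean of the kept data
    fill = statistics.fmean([v for v in kept if v is not None])
    return [fill if v is None else v for v in kept]
```

Real preparation pipelines would add regularization and transformation steps; this only shows where each activity sits relative to collect and analyze.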
Let’s use Curation to refer to the activities not on the critical path to achieving the analytics objective, but on the path to ensuring that all policy requirements are met and the data is always available, both now (with fault tolerance) and in the future (backup, archive, distribution across cloud regions, …).
I view four common workflows/processes for big data systems, and curation, with options for qualified and vetted humans to fix data, could occur within all four. There are numerous identified concerns: garbage in, missing data, format mismatches, poisoned data, data spills, sensor calibration, timeliness, buffer overflows, down network links, sites down, system crashes, data corruption, RAID failures, and government shutdowns. Curation operates like firefighting: long periods of boredom leading to complacency before the next crisis.
The four are data ingestion, data modeling, data retrieval, and data presentation. Data ingestion typically involves data discovery, data collection, data cleansing, error logging & notification & thresholds, and secondary remediation processing. Data modeling typically involves ETL (extract, transform, load) prior to model runs, filtering (e.g., by time, or partitioning into region or service), and statistical analysis for bad data points, outliers, and significant anomalies. The same applies to the last two. The first rule of big data is to carefully vet all externally sourced data. The second rule of big data is to carefully vet all internally sourced data. The third rule is to keep vetting data at each stage.
I suggest we in the NIST BDWGs explore curation within all functional processing stages (e.g., mine and everyone's). Curation typically completes just before each batch of staged data is released to the next stage. Processing logs typically include counts, times, errors, corrections, actions, and outcomes.
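A minimal sketch of such a stage-gate pass, producing the release log alongside the released batch; the predicate/fixer checks and log fields are illustrative assumptions, not a standard:

```python
# Stage-gate curation pass run just before a staged batch is released
# to the next stage.
def curate_batch(records, checks):
    """checks: list of (predicate, fixer); a fixer may return None to drop."""
    log = {"count_in": len(records), "errors": 0,
           "corrections": 0, "dropped": 0}
    released = []
    for rec in records:
        for predicate, fixer in checks:
            if rec is not None and not predicate(rec):
                log["errors"] += 1
                rec = fixer(rec)            # secondary remediation
                if rec is None:
                    log["dropped"] += 1
                else:
                    log["corrections"] += 1
        if rec is not None:
            released.append(rec)
    log["count_out"] = len(released)        # release log: counts, outcomes
    return released, log
```

For example, a missing-value check whose fixer drops the record, followed by a range check whose fixer clamps to zero, yields a per-batch log of errors, corrections, and drops.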
Of course, someone could merely wordsmith over your concerns. Some think curation smacks of data taxidermy. Others think of library science, and still others of content management systems. HPC usually addresses simulation verification.
1) I added the note about “sheer curation” within the Wikipedia article in anticipation of your comment. Poor curation looks like simple taxidermy/preservation to end users.
2) IEEE has massive good curation addressing journals, transactions, conferences, standards, education curricula, Web sites, e-learning, digital libraries, IEEE search engines, IEEE vertical expert courseware, contracts with external businesses/organizations. IEEE members pay high dues in part to pay for the massive IEEE curation. Please talk to your IEEE coworkers.
3) It is possible that many people have never seen big data systems from the inside, so they assume data self organizes and serves their end uses. In reality, big data is not generally available to or usable by naive end users in its raw, ingested state in the wild.
4) Most big data systems live or die based on the quality of their curation (e.g., Wikipedia, Google, Amazon, Facebook, Netflix). Provenance, governance, pedigree, authority, digital format, indexing, translation, sensor calibration, annotations, metadata, links, identity databases, simulation validation, etc. occur before end users access big data and later while end users access big data.
5) How could any big data system be implemented without curation? Curation enables end users to access information efficiently: a single curator (or curation SW) assists an unlimited number of end users. Curation occurs as virtuous cycles with feedback.
Explain: what is the foundation for your concern or complaint? What do you know about curation wrt big data?
The very first sentence of the linked Wikipedia article actually reads as follows:
Digital curation is the selection, preservation, maintenance, collection and archiving of digital assets. Digital curation establishes, maintains and adds value to repositories of digital data for present and future use. This is often accomplished by archivists, librarians, scientists, historians, and scholars. Enterprises are starting to utilize digital curation to improve the quality of information and data within their operational and strategic processes. Successful digital curation will mitigate digital obsolescence, keeping the information accessible to users indefinitely.
We have a problem here. For the NBDWG, here is the link again: http://en.wikipedia.org/wiki/Digital_curation. Note all the references to authoritative sources.
Compromise: The NBDWG should define “digital data curation” and use curation in its taxonomy, RA, SnP, and Roadmap. “Data preparation” should also be defined and used, but data preparation should only implement ingestion curation. A curation role should be included in the System Orchestrator.
Here is an example of curation efforts http://www.whitehouse.gov/
Having discovered the futility of just tossing government data into Data.gov, Next.Data.gov was started to incorporate virtuous cycles with feedback, giving end users both what they tend to want and what they actually use. This is an example of the necessity of data curation. CIOs and CTOs get curation and have been burned by poor curation. Big data without feedback fails sooner or later.
Here is an example of a global organization standardizing, educating about, and curating metadata for curation: http://dublincore.org/ Their mission statement from their Web site follows:
The Dublin Core Metadata Initiative (DCMI) supports shared innovation in metadata design and best practices across a broad range of purposes and business models.
DCMI does this by:
“Managing long term curation and development of DCMI specifications and metadata terms namespaces;
Note the recursion as DCMI curates its digital data intended for curating your digital data. DCMI practices what it teaches. Here is the Wikipedia article on Dublin Core: http://en.wikipedia.org/wiki/
The IEEE member Web portal has 1,610,000 article hits for “curation” versus 9,050,000 hits for “big data”. The IEEE Technical Committee on Digital Libraries uses curation often. There is a Digital Curation Centre at the University of Edinburgh, UK. There are digital curation lifecycle models that implement information lifecycle management. There are suggestions for organizing datasets for concept and faceted discovery. Alternatively, there are suggestions for applying curation for ad hoc data mining. All this is part of the IEEE literature backed with citations.
The ACM Web portal has 575 article hits for “curation” versus 1,940 for “big data”, and the ACM member digital library has 1,640 hits for “curation” versus 2,349 for “big data”. My favorite quote emphasizes virtuous cycles with feedback:
“Digital media is evolving. After years of rapid growth, we have entered an age of digital curation. Curation means selectivity in the way we use technology—it means drilling down, finding what tools work for which jobs and honing those tools. This new era presents new opportunities for social change and economic growth.”
Relevant to NIST, the Proceedings of the 6th International Conference on Theory and Practice of Electronic Governance has a tutorial on “Digital curation for public sector professionals” that emphasizes curation as key to governance activities. Other ACM articles tie curation to cyber infrastructure.
The U.S. Congress tasked the Library of Congress with the National Digital Information Infrastructure and Preservation Program and the U.S. National Archives and Records Administration with the Electronics Records Archives. Research USGS, NOAA, NASA, DOE and you will find advanced data curation.
Surprisingly, “data preparation” is found together with “data mining” in search hits over 1/2 of the time for ACM and ACM library searches. The IEEE member Web portal has 39,400 hits for both terms versus 264,000 hits for only “data preparation” versus 1,610,000 hits for “curation”.
Wikipedia has a useful definition of data curation that includes sheer curation during ingestion and remediation. Good curation adds marketable value to a dataset. Stakeholders rely on reputable curators to assess collections and suggest corrective measures and best practices. Examples include document formats (Word ’97, OOXML, ODF, PDF/A, Rich Text, ASCII); linking via URLs, URIs, URNs, and PURLs; and cataloging and indexing.
Poor curation appears to be data taxidermy. Research libraries are heavily involved in curation as part of library science (e.g., cataloging, indexing, translations, establishing pedigree/provenance, establishing authoritative collections, discovering sources & materials, sharing).
Digital data curation includes more than traditional library science, such as annotating and tagging and indexing and referencing and citing and linking and graphing. Just google it. Metadata does not just spontaneously appear. A pile of data does not magically become a useful dataset.
There are big data systems for all three cases. Structured data is represented as records in a database (e.g., inventory). Semi-structured data only partially fits in a database and uses links or blobs for the substantial remainder (e.g., annotated pictures or video). Unstructured data does not fit in a database, but is indexed in a database for retrieval and annotations (e.g., a document repository with a search index). So there are three cases: all in, partially in, and all outside. The level of database support always distinguishes them. Secondarily, there are efficiency and normalization consequences.
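The three cases (all in, partially in, all outside) can be sketched with an embedded database; the tables and fields below are illustrative, not a reference design:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Structured: the whole record fits the schema (e.g., inventory)
db.execute("CREATE TABLE inventory (sku TEXT, qty INTEGER)")
db.execute("INSERT INTO inventory VALUES ('A1', 7)")

# Semi-structured: fielded part in columns, remainder as a blob or link
# (e.g., an annotated video whose payload the schema does not describe)
db.execute("CREATE TABLE media (id TEXT, caption TEXT, payload BLOB)")
db.execute("INSERT INTO media VALUES ('v1', 'annotated video', ?)",
           (b"\x00\x01",))

# Unstructured: content lives outside; only a retrieval index is in the DB
db.execute("CREATE TABLE doc_index (term TEXT, doc_path TEXT)")
db.execute("INSERT INTO doc_index VALUES ('curation', '/repo/doc42.pdf')")
```

The level of database support is visible directly: the first table fully describes its data, the second describes only part of it, and the third describes none of it, serving purely for retrieval.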
Their HW and SW architectures at the scale of network workloads are significantly different. Architects usually nail down the requirements before sketching out the architectural components including appliances and special purpose HW. This is a top level difference in the system design. Using links has problems. The lack of unique names and a “normal form” causes collisions (e.g., data integrity). Accessing data is inefficient for some architectures.
Big data systems usually rely on internal filters to translate and extract source information. There often are one or more distinct filters for each data structure (e.g., OOXML, ODF, PDF, CSV, zip’d files), and these filters might be run as pipes. Please DO NOT give short shrift to this focal area in the def/tax, arch, or roadmap.
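A minimal sketch of per-format filter dispatch; the formats and the filters' behavior are made-up examples of the pattern, not real ingest code:

```python
# One distinct filter per source format, selected by file extension.
def filter_csv(raw):
    # extract the first column from each CSV row
    return [line.split(",")[0] for line in raw.splitlines() if line]

def filter_text(raw):
    # tokenize plain text on whitespace
    return raw.split()

FILTERS = {".csv": filter_csv, ".txt": filter_text}

def extract(name, raw):
    """Dispatch a source file to its format filter."""
    for ext, flt in FILTERS.items():
        if name.endswith(ext):
            return flt(raw)
    raise ValueError("no filter registered for " + name)
```

In a real system each filter would be a substantial translator (OOXML, PDF, etc.) and the stages could be chained as OS pipes or streaming operators; the dispatch-by-format structure is the point here.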
Gary Mazzaferro via nist.gov
To add a bit more specificity to semantic classification,
Exocentric terms of data (referring to the entire data object as a whole):
1) Monomorphic Schema (only one form/version)
2) Heteromorphic Schema (different versions of same scheme)
3) Polymorphic Schema (changes scheme)
Schema Lifecycle Generalizations
1) Formal Structured Schema
2) Partially Structured Schema (mix of formal and quasi-structured)
3) Quasi-Structured Schema (e.g., natural language)
Schema Product Classifications
1) Expressible (e.g., XML, SQL)
2) Embedded (e.g., JSON, NoSQL)
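One way to illustrate the expressible/embedded distinction in code; the table, fields, and documents below are invented examples:

```python
import json
import sqlite3

# Expressible schema: a standalone artifact, declared before any data
# exists (here, SQL DDL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reading (sensor TEXT, value REAL)")
db.execute("INSERT INTO reading VALUES (?, ?)", ("s1", 3.2))

# Embedded schema: each document carries its own, possibly varying,
# structure (here, JSON); there is no separate schema object.
doc_a = json.loads('{"sensor": "s1", "value": 3.2}')
doc_b = json.loads('{"sensor": "s2", "value": 4.1, "unit": "kPa"}')
```

The two JSON documents show why embedded schemas drift toward the heteromorphic case above: nothing prevents doc_b from carrying a field doc_a lacks.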
On 9/26/2013 8:17 AM, Chaitanya Baru wrote:
> OK, fair enough. –cb
> On 9/26/13 6:21 AM, “John Klein” <email@example.com> wrote:
>> Chaitanya –
>> I agree with your observation that “unstructured” combines structure and
>> semantics. A couple of comments:
>> 1. In practice, every datum is structured, otherwise we could not
>> identify it as a datum and represent it. Some structural descriptions are
>> more useful than others, e.g. identifying a string of bits as an IPv4
>> header may be more useful than saying it is 20 octets.
>> 2. A datum can have well-defined semantics, or undefined semantics.
>> Semantics may come from context (e.g. meaning of words in natural
>> language text).
>> 3. A record is a collection of data. A record can be structured
>> (conforming to some schema or rules of organization) or unstructured. You
>> mention that structure implies relational schema. I disagree – for
>> example, most people would consider XML documents to be structured.
>> 4. Last, and most important right now, I think it is too late to try to
>> redefine “unstructured data”. The term is widely used in practice, and we
>> have to accept that its meaning includes both schema and/or semantic
>> considerations. We can note this in our definition, but we can’t change
>> this fact.
>> On 26 Sep 13, at 3:54 AM, Chaitanya Baru wrote:
>>> Am not trying to “throw a grenade” here, but wanted to mention some
>>> thinking/developments which would be relevant to the deftax group.
>>> Hopefully, we will get to discuss this in the f2f meeting as well.
>>> There was a lively discussion at the recent XLDB conference on structured
>>> vs semistructured vs unstructured data. At least some of us concluded that
>>> “unstructured” was actually a meaningless term…since all data does have
>>> some structure.
>>> However, I don’t have good suggestions yet re what the new terminology
>>> should be. “Structured” was mostly used to imply “relational schema”, I
>>> think, but there are many other structured formats as well. We might say
>>> “explicit” vs “implicit” structure. But text, which is used as a classic
>>> example of unstructured data, has explicit structure as defined by the
>>> rules of grammar. Another idea is “design-time structure” (as in schema
>>> design) vs “application-time structure” (as in late binding of data to
>>> app). Not sure if that works well.
>>> I wonder if we were partly confusing and mixing up “structure” with
>>> “semantics”. Relational schemas capture some semantics (not all), whereas
>>> the semantics of a piece of text needs to be extracted.
>>> I am at the ISC Big Data conference here in Germany. Several vendors are
>>> using terms like “loosely structured” data and “multi-structured” data,
>>> rather than unstructured data.
>>> Just thinking aloud here. Is this worth a discussion at the meeting?
>>> On 9/25/13 2:51 PM, “Nancy Grady” <> wrote:
>>>> This does have a greater level of detail, but I think the top data
>>>> doesn’t seem to relate to the bottom one (unless it’s a separate
>>>> diagram), and I’m not clear why the big data source/origin overlaps
>>>> big data target/customer.
>>>> The Big Data Analytic/Tools is a hybrid between our application and the
>>>> resources it uses. It names some analytics techniques, but not others
>>>> (predictive modeling, outlier detection, image analysis, text analysis,
>>>> etc). Likewise some curation techniques (refinery, linking, fusion) are
>>>> mentioned but there are of course others. The temporal characteristic
>>>> process (realtime, interactive, batch, streaming) is a dimension in our
>>>> definitions that we have not yet folded into our RA, and is an
>>>> The real advantage is showing more of the complexity of the interactions
>>>> between the application and the resources (our lingo), but it would take
>>>> more thought to see if storage (general purpose), compute (general purpose,
>>>> HPCC), and storage (specialized DB archives) are the right set. It probably
>>>> doesn’t go far enough to give you placeholders for the interactions across the
>>>> clusters (e.g. messaging for eventual consistency) or e.g. mapreduce
>>>> (splitting analytics across a cluster).
>>>> Security of course needs to be present in more than just
>>>> I agree we need to be a layer (or two) down from where we are, and I look
>>>> forward to the call tomorrow to see what everyone thinks.
>>>> On 9/25/13 5:07 PM, “Orit Levin” <> wrote:
>>>>> Dear all,
>>>>> I would like to see whether we can make some progress on the
>>>>> Provider” (prev. “Transformation Provider”) block decomposition before the
>>>>> call tomorrow.
>>>>> The main concern expressed earlier relates to the fact that the BD lifecycle
>>>>> can be very different from the traditional data lifecycle: it can be executed
>>>>> in a different order, include new “enrichment” techniques, etc.
>>>>> I find the depiction of the BD lifecycle from Yuri’s diagram (see
>>>>> addressing this concern to a better degree than our current picture.
>>>>> Would people agree to use the “University of Amsterdam” data pipeline?
>>>>> If yes: (1) We need the text describing the pipeline and (2) Alignment
>>>>> our definitions and taxonomies.
>>>>> If not: What would the reasons/reservations be?
>>>>> Or: Other alternatives are welcome.
>>>>> Please share your thoughts and ideas on the list.