
link to published version: Communications of the ACM, January, 1999


"Value-Added Publishing"

Hal Berghel

ABSTRACT

Without question, electronic publishing is one of the hottest topics in computing. Groups worldwide want to know how to do it well, how to advertise it effectively, how to enhance the capabilities of electronic publishing to include emerging multimedia technologies, and, most of all, how to make money at it. There is also a slightly different spin. In our view, in the future it will be increasingly important for successful publishers to add value to publications over and above the original content. In this column we'll outline what seem to us to be some of the fundamental issues connected with the addition of value to electronic publications. Some of these issues have already been translated into products and services, while others have not. Our purpose here is to attempt to provide a conceptual overview of the value-added publishing landscape, around which further discussion might be organized.

INFORMATION DELIVERY IN THE GILDED AGE OF COMPUTING

While electronic publishing takes on a variety of different meanings in different settings, one core principle holds true across all domains: electronic publication involves the distribution of digital documents. In its simplest form, electronic publishing may amount to little more than a "porting" of printed information over to the digital networks via scanning, OCR technology, etc. Augmented with some very basic accounting software, many publishing sites are in the business of serving up static HTML versions of their publications via the Web. In its more complex forms, however, electronic publishing will re-define itself in the light of available computer and network technologies. Our present goal is to outline the ways in which this re-definition may be achieved.

Although most of the critical technologies needed for electronic publishing have existed for decades, it has only been in the last few years that traditional, in-print publishers have taken it seriously. There were two basic reasons for this delay - one technological and one pragmatic. On the technology side, the primary intended venue for electronic publishing, the World Wide Web, lacked two essential capabilities. First, it lacked secure HTTP transactions until 1995 or so. Without secure transactions, selling via credit cards would entail excessive risk due to digital eavesdroppers, packet sniffers, and other network nematodes of that ilk. At the same time, there were no widespread standards for, and implementations of, electronic billing systems. Means had to be developed to charge in small amounts (e.g., a millicent) and accumulate charges until they reached cost-effective invoicing limits. These two pieces of technology were in place (in a variety of different forms, in fact) around mid-decade, thereby making the simpler forms of electronic publishing possible.
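The accumulate-then-invoice idea can be sketched in a few lines. This is only an illustration, not a description of any deployed billing system: the class name `ChargeLedger`, the one-dollar default threshold, and the choice of integer millicents as the unit of account are all assumptions made for the example.

```python
from collections import defaultdict

MILLICENTS_PER_DOLLAR = 100_000  # 1 millicent = 1/1000 of a cent


class ChargeLedger:
    """Hypothetical ledger that accumulates tiny charges per reader."""

    def __init__(self, invoice_threshold=MILLICENTS_PER_DOLLAR):
        # Balance (in millicents) at which invoicing is assumed to become
        # cost-effective; the default of one dollar is an arbitrary choice.
        self.invoice_threshold = invoice_threshold
        self.balances = defaultdict(int)

    def charge(self, reader, millicents):
        """Record a small charge.

        Returns the amount to invoice once the reader's balance reaches
        the threshold (and resets the balance), otherwise returns None.
        """
        self.balances[reader] += millicents
        if self.balances[reader] >= self.invoice_threshold:
            amount = self.balances[reader]
            self.balances[reader] = 0
            return amount
        return None
```

Keeping balances as integers avoids floating-point rounding on very small amounts; a real system would also need persistence, authentication, and dispute handling, none of which is shown here.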

On the pragmatic side, no one knew (in fact, it could be argued that it's still unknown) how to develop a sound business plan for electronic publishing. While it was widely assumed that adding electronic publishing products would irrevocably change the economics of publishing, few felt comfortable in speculating whether this would ultimately be good or bad for the industry. Many publishers jumped on the electronic publishing bandwagon for the worst of reasons: they were afraid of being left out of the future markets. In so doing, they packaged intellectual property in basically the same way as Gutenberg except for the addition of digital delivery mechanisms.

That is the primary cause of the popularity of digitizing anything and everything in print: herd mentality dictates that if you don't have a good plan, do what everyone else does or thinks they should do. Incidentally, this practice contradicts my fourteenth law of cyberspace, which holds that "the average half-life of Web resources is 18 months." The fourteenth law entails that we should take a minimalist position regarding digitizing analog and hardcopy resources.
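To make the half-life figure concrete, one can read it as simple exponential decay. That reading is an assumption introduced here for illustration (the law itself states only the 18-month figure), and the symbol S(t), the fraction of resources still available after t months, is likewise introduced for this sketch:

```latex
S(t) = \left(\tfrac{1}{2}\right)^{t/18}, \qquad
S(18) = \tfrac{1}{2}, \quad
S(36) = \tfrac{1}{4}, \quad
S(60) = 2^{-10/3} \approx 0.10
```

Under that assumption, only about one resource in ten would still be reachable five years after it was put online.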

In any case, all of the essential hooks for electronic publications are now in place. Advanced publishers can solicit, edit, produce and distribute electronic publications without so much as a single piece of paper changing hands (not including signed copyright transfer forms and contracts, of course). The World Wide Web and the Internet have forever changed the face of publishing. But is this for the good, or bad? I'll suggest below that the outcome is unclear because we're under-utilizing network technology.

WHERE IS THE VALUE IN ELECTRONIC PUBLISHING?

The biggest misconception about electronic publishing is that its value lies in the ability to disseminate digital information over computer networks in a manner analogous to physical distribution of hardcopy. There seems to be a tacit faith in a twisted variation of Metcalfe's law (i.e., the value of the Internet increases with the square of the number of nodes) to the effect that the value of electronic publishing increases with the square of the number of documents on the Internet. While this sounds good, it's likely to be false. At this writing it is more likely that the value of electronic publishing varies inversely with the square of the number of documents.

This misconception has driven virtually every publisher into some form of electronic commerce. Nowhere is this more obvious than with academic and scholarly publications. Seen as a way of mitigating the problem of slumping sales and an annual 5-10% downturn in subscriptions, electronic offerings are thought to hold out the greatest promise of revenue growth - even a modest 5-10% annual growth in electronic publications offsets hardcopy losses and produces a steady state. But this reasoning ignores the fact that the decline of the academic publishing industry is inextricably linked to the overall economy, the widespread perception that there is already too much information available for most personal bandwidths, and the perception that only a small percentage of the information in the typical publication is relevant. Readers are, therefore, "voting" with their pocketbooks by canceling subscriptions. Publishers worldwide are assuming that electronic publishing is the silver bullet which will save the day. Some point to the capabilities of the networks to lower overhead and production costs, to support a wider variety of advertising and marketing venues (e.g., broadcasting, narrow-casting, and "personal casting"), and to increase margins by dealing directly with the reader rather than with distributors and middle-men, as signs that electronic publication will provide new opportunities for publishers seeking to turn their fortunes around. In other words, some publishers are working under the assumption that the decline in interest in scholarly and technical publications can be reversed if only those publications could be produced and marketed more cheaply in electronic form. It just won't work that way - the publications that are being avoided in hardcopy will be avoided in electronic form as well. To paraphrase Sam Goldwyn, "people will stay away in droves."

Well, if the digitization of things publishable won't get us far, what will? In our view the payoff in electronic publishing in the future will be the deployment of new technologies for the integration of digital documents into the network fabric of associated ideas, texts, times, and people. Publishers will need to be more than just the providers of digital documents from their digital warehouses. They will also need to connect a document with its contexts. Thus, a digital document could be tightly integrated into the cybersphere of all related documents in a way that traditional publishing cannot permit. Such publishing could provide not just the documents, but their connections to other data sources, as well as other valuable information. This is the essence of "value-added" publishing.

ADDING VALUE TO PUBLICATIONS

Value-Added Publishing (hereafter, VAP) is a natural extension of traditional publishing with the additional feature that the publication vehicles and venues accept input from, and react to, additional, previously integrated and assimilated networked media. The challenges of VAP are likely to lie in such areas as:

content enhancement
the encouragement of synergy between and among information providers, information consumers and the resources they share
the addition of interactivity and feedback loops to traditional document delivery systems
a re-orientation of both the information provider and information consumer toward the "process" of publishing, rather than a focus on the individual products and services
meta-level analyses and intelligent re-structuring of document collections
ad hoc document quality ranking and recommending systems

to name but a few. As one can see from this partial list of services, VAP must use a more advanced set of computational and network tools than that of its early electronic publishing ancestors. We'll illustrate these points with a selected enlargement of some of the categories above.

ADDING VALUE VIA CONTENT ENHANCEMENT

One convenient way of viewing electronic publishing is as the exchange of information between an information provider and an information consumer via an intervening computing network infrastructure. While the content of a document is central to this exchange, it is not necessarily paramount since its value is utilitarian rather than intrinsic. That is, the value of the content is not independent of the ability of people to read it, view it, use it, reference it, and so forth. From the standpoint of information retrieval, information which cannot be found or used is worthless.

Content enhancement involves the enrichment of the semantic and syntactic content of a document. The enhancement of semantic (alt., conceptual, deep) content can be thought of as an attempt to extract more meaning from the documents. A report, summary, extract, abstract, translation, or "gist" by an intelligent agent would be considered a semantic enhancement in this sense, as would results reported by natural language understanding and translation systems, and the automated inclusion of new hyperlinks.

The enhancement of syntactic (alt., grammatical, tag-based) content, on the other hand, would affect the way documents are structured, indexed, taxonomized and linked within the intervening network and computer resources. An example of enhancing syntactic content would be adding structure to documents for the benefit of helper agents, search engines, indexing tools, and data mining and warehousing applications.

ADDING VALUE WITH META-DATA

While content enrichment of electronic publications is the holy grail of VAP, it is at the same time the most difficult to implement. Some problems, complete natural language understanding for one, are intractable given the current state of the computationalists' art. Adding value through meta-data, while less ambitious, holds out much greater promise in the short term.

Meta-data is information "about" an electronic document, resource, or the operation of a computer system. For example, "confidence indicators" might provide useful information about a document or resource. We would expect that knowing that an electronic publication won a Pulitzer Prize would increase the credibility of the author and the value of the document (at least as an object of study), as would favorable reviews by the leading authorities on the subject, etc. The imprimatur of a publication might also be relevant, as some electronic publishers might be known to have higher standards than others. (Incidentally, the ACM certifies that its electronic publications pass through the same peer review process as their in-print siblings.)

Similarly, recommender systems assign assessments or recommendations to documents and resources that are as reliable as the confidence one has in the recommender system. Helper agents, brokerage systems, flash lists, etc. also provide meta-level value in their evaluation and recommendation of documents.

Revision control systems which collect meta-level information about various versions of a document add value by helping create stability and continuity in network documents. On this account, versions of documents are indexed in such a way that any particular version may be retrieved, with or without predecessors or ancestors.
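As a rough illustration of version indexing, the sketch below keeps every committed version of a document and returns any one of them, alone or together with the versions that preceded it. The class `VersionStore` and its methods are hypothetical names invented for this example; they do not describe any particular revision control product, and a real system would store differences and metadata rather than whole texts in memory.

```python
class VersionStore:
    """Hypothetical in-memory index of document versions."""

    def __init__(self):
        # Maps a document id to its list of version texts, oldest first.
        self._versions = {}

    def commit(self, doc_id, text):
        """Store a new version and return its 1-based version number."""
        history = self._versions.setdefault(doc_id, [])
        history.append(text)
        return len(history)

    def get(self, doc_id, version, with_predecessors=False):
        """Return one version, or that version preceded by all earlier ones."""
        history = self._versions[doc_id]
        if not 1 <= version <= len(history):
            raise KeyError(f"{doc_id} has no version {version}")
        if with_predecessors:
            return history[:version]
        return history[version - 1]
```

Because versions are only ever appended, a citation to "version 2" keeps pointing at the same text no matter how many later revisions are committed.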

The sidebar illustrates the types of enhancements which might result from the judicious collection and use of meta-level information about electronic offerings.

SIDEBAR: Potential Meta-Level VAP Enhancements:

A. "confidence indicators", e.g.,

1. listing as citation classic by authoritative source

2. document status indicator (i.e., preprint, archived, old, not recently viewed)

3. awards received (weighted by importance, source)

4. reviews of document in the literature

5. referees' reports from peer reviewers

6. the perceived quality of the imprimatur

7. vetting by some community or constituency (praised by newsgroup, professional association, anthologized by reputable editors, etc.)

B. recommending systems

1. community review systems (e.g., Firefly [www.firefly.com])

2. helper agents

3. information "brokerage" to facilitate connection (by vendor/brokers, fulfillment agents, aggregators, and the like)

4. hyperlinked review chains (i.e., which inter-connect all reviews of a document irrespective of source)

5. Amalgamated or virtual reviews (e.g., which merge elements of individual reviews over related documents)

6. Virtual editors (i.e., "personalized" variant of an electronic publication created by someone other than the author)

C. searching, indexing and database technologies

1. more interactive and participatory than current systems

2. provide dynamic, real-time document clustering with innovative clustering topologies for display of results

3. preprint servers for preserving the ancestry of documents

4. postprint (archive) servers for maintaining definitive versions

5. data mining, including techniques based on association, sequence-based analysis, clustering, classification, estimation, fuzzy logic, genetic algorithms, and neural networks

6. Data warehousing and data repositories (e.g., the ACM Computing Research Repository [www.acm.org/corr/] and ACM Digital Library [www.acm.org/dl])

D. document persistence technology (cf., [www.sciam.com/0397issue/0397kahle.html])

1. formal methods for post-hoc data utilization (e.g., which structure data differently or anticipate new data demands)

2. cyberspace snapshots which provide backups of documents whose links are fractured

3. version archiving strategies for citation permanence

E. Variable-link-strength technology based upon frequency of use statistics or user-centered evaluations

1. frequency of access and average visitor ratings of a site

2. detection of the number of inbound links to a particular site

F. document persistence systems which help ensure the longevity of linked resources especially with respect to mission critical environments (e.g., medical information systems, patents, copyrights, commerce)

1. revision control systems/version retention systems

2. Web "snapshot archives"

3. version validation systems

G. Virtual authoring

1. virtual documents (i.e., process-oriented document creation systems where documents have no reality apart from current presentation)

2. dynamic contextual annotation added by authors and readers (e.g., like "pop-up" videos on MTV)

3. trans-publishing in Ted Nelson's sense [http://www.sfc.keio.ac.jp/~ted/] - where documents take on "hyperstructure" as they evolve in a structured way by inclusion of different authors and participants (cf. projects Xanadu [www.xanadu.net] and ZigZag [http://www.xanadu.net/zigzag/]).

4. Group authoring technologies (perhaps an outgrowth of computer-assisted, cooperative work, groupware)

H. Dynamic document creation - e.g., where the documents are revised continuously

1. author-revision systems (e.g. Stanford's Encyclopedia of Philosophy [plato.stanford.edu])

2. author and reader revision systems (e.g., Email: the good, the bad and the ugly site [berghel.net/email_gbu/])

3. thought swarms and "idea structuring"

4. Online ACM Computing Reviews (in development)

I. Information customization [e.g., berghel.net/publications/cb5/cb5.html]

1. client-side document extraction

2. non-prescriptive, non-linear document traversal (i.e., not prescribed by document provider)

3. multi-document "collage" interface for multiway lookahead

J. Related emerging technologies which will support value-added publishing

1. safe, open distributed archiving (e.g., Alexa [www.alexa.com])

2. Ted Nelson's transcopyright system [http://www.sfc.keio.ac.jp/~ted/]

3. Security enhancements

4. Watermarking and digital steganography [cf. berghel.net/publications/dw_n/dw_n.html]

6. Push technology [http://berghel.net/publications/push/push.html]

7. citation tree construction

8. agent-based citation locators (cf. [www.uark.edu/~iarg/])
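Two of the sidebar's entries under item E lend themselves to a small sketch: counting inbound links to a site, and weighting a link by usage and visitor ratings. The function names, the (source, target) pair representation, and the scoring formula below are all assumptions made for illustration; the sidebar does not prescribe any particular measure.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count inbound links per target from (source, target) pairs.

    Duplicate pairs and self-links are ignored so that a site cannot
    inflate its own count.
    """
    unique = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique)


def link_strength(access_count, ratings, max_rating=5):
    """Toy strength score (an assumed formula, not a standard one):
    the access count scaled by the mean rating as a fraction of max_rating.
    Returns 0.0 when there are no ratings.
    """
    if not ratings:
        return 0.0
    mean_rating = sum(ratings) / len(ratings)
    return access_count * (mean_rating / max_rating)
```

A display layer could then render strongly weighted links more prominently than weak ones, which is one plausible reading of "variable link strength."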

FEEDBACK, INTERACTIVITY AND SUPPORT

Content-based and meta-data based value adding are two of the four strategies for building value in electronic publications. We add to the list two more components: (1) feedback-based and (2) interactive value adding. Services of this type collect data from users which reflect their perceptions of their experience. Out of that collective experience might come useful comments, identifications of "hot" documents by some measure of use, average rankings of sites, group interactions, and so forth, which will speak to the issue of the perceived value of content.

To this we must also add support-based value-adding - technologies that may not directly add value to a document, but which support the addition of value by other means. In other words, they are necessary conditions for the deployment of a VAP system. This might include database technologies, statistical and clustering tools, revision control system software, editing tools, information customization clients, and so forth.
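As an illustration of feedback-based value adding, the sketch below aggregates a stream of reader events into per-document access counts and average ratings, then picks out the "hot" documents by use. The event format, the function names, and the choice of access count as the measure of "hotness" are assumptions made for this example only.

```python
from collections import defaultdict


def summarize_feedback(events):
    """Aggregate (doc_id, rating) events into access counts and mean ratings.

    A rating of None records an access without a rating.
    """
    counts = defaultdict(int)
    ratings = defaultdict(list)
    for doc_id, rating in events:
        counts[doc_id] += 1
        if rating is not None:
            ratings[doc_id].append(rating)
    return {
        doc_id: {
            "accesses": counts[doc_id],
            "mean_rating": (
                sum(ratings[doc_id]) / len(ratings[doc_id])
                if ratings[doc_id]
                else None
            ),
        }
        for doc_id in counts
    }


def hot_documents(summary, top_n=3):
    """Return the top_n document ids by access count (ties broken by id)."""
    return sorted(summary, key=lambda d: (-summary[d]["accesses"], d))[:top_n]
```

The same summary could feed the recommender and ranking services listed in the sidebar; how much weight to give raw use versus explicit ratings is a design decision this sketch leaves open.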

CONCLUSION

Our view is that electronic publishing in the next century will be fundamentally different from what it is at the moment. We would predict that the most successful early applications of VAP will be such things as:

publications with limited commercial appeal
publications with narrow audience appeal
digital digests (i.e., personalized magazines assembled from many sources)
focused retrieval publications (e.g., personalized encyclopedias)
home-grown, personal publications
interactive publications (aka, interactivities in the edutainment business)
"public interest"/"public awareness" publications
reference materials

and that we will build upon our successes as developers and researchers are inspired to take more extensive advantage of computing and network technology, and slowly but inexorably move away from the notion that the paramount value of a document is its content. Additional enhancements such as those outlined above will establish the importance of the role of the digital or cyberspace context of information. We have included a few URLs where the technology described is already implemented.

Many of these thoughts have evolved as a result of my nearly six years on the ACM Publications Board. By continually revisiting the questions of what we were doing, and why we were doing it, this conceptual overview of the future of electronic publications began to take shape. The launch point was my belief (controversial, as it turned out) that ACM should move away from the policy of holding copyrights for its publications (cf. [www.acm.org/pubs/copyright_policy/]).

I remain convinced that trying to fix one version of an electronic publication as definitive and copyrightable will prove as difficult as trying to paint falling leaves. In my view, electronic publications of the future will resemble filmstrips - each "frame" will incorporate some improvement, alteration, reference, etc. which (in the ideal case) will have more value than its predecessor. In this sense, Ted Nelson's notion of transpublishing is much like many layers of intersecting filmstrips, each one of which has one cell which aligns with the cells of others.

Acknowledgment: In writing this column I have benefitted from lengthy discussions with two of my colleagues, Dan Berleant and Doug Blank, as well as Peter Denning, Bill Arms, and the other distinguished members of the ACM Publications Board.