html-based tag cloud API: json output

Here's the output from the tag cloud project I demoed at DevHouseDC tonight. It's a purely html-based tag cloud generator: it returns JSON containing a <style> block and a <div> block for the body of the page, and it has many configurable options (see below). Because it's simply html, the results are arbitrarily customizable by the caller (you). You can discard the style block entirely and write your own, for example.


the basic API call with minimum required parameter "body" (OR url OR file upload) looks like so:

http://example.com/api/1.0/tagcloud/body.json?body=Down, down, down. Would the fall never come to an end! 'I wonder how many miles I've fallen by this time?' she said aloud. 'I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think--' (for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a very good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) '--yes, that's about the right distance--but then I wonder what Latitude or Longitude I've got to?' (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.)

(the text here is from alice in wonderland at project gutenberg-- which incidentally is an awesome source when you just need random text to test on. check out their "most popular" section; lots of great stuff!)

there are a number of other parameters-- you can customize the size of the resulting tag cloud, or the font sizes to extrapolate between; the colour scheme; stop words; use custom tokenizers; sort order; max words, etc. 
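one practical note on the call above: the body text is full of spaces, quotes and punctuation, so it has to be percent-encoded before it goes into the query string. here's a minimal sketch of building that url (written in python 3, standard library only). the helper name tagcloud_url is made up for illustration, and only the "body" parameter is used, since that's the only parameter name given here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# endpoint copied from the example call above
BASE_URL = "http://example.com/api/1.0/tagcloud/body.json"

def tagcloud_url(body, base_url=BASE_URL):
    """Return the GET url for a tag cloud of `body` (hypothetical helper).

    urlencode percent-encodes the text so spaces, quotes and
    punctuation survive the trip through the query string.
    """
    return base_url + "?" + urlencode({"body": body})

text = "Down, down, down. Would the fall never come to an end!"
url = tagcloud_url(text)
print(url)
```

the same dict passed to urlencode is where any of the other options would go, once you know their parameter names.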

This is all written with django, django piston, and a little bit of nltk for the tokenizing and frequency distributions. The GUI is in the works (i'd include a screenshot but it's busted right now :p), as is a version that will take an RSS feed as input so a current, up to date tag cloud can be generated on-the-fly/programmatically.
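To give a feel for the core idea, here is a rough sketch of turning word frequencies into a <div> of sized words. It is not the project's actual code: it uses a regex and collections.Counter in place of nltk so that it runs with the standard library alone, it scales font sizes linearly between a minimum and a maximum, and the function name, parameter names and "tagcloud" class name are all assumptions.

```python
import re
from collections import Counter
from html import escape

def tag_cloud_div(text, min_px=10, max_px=40, max_words=20):
    """Build a <div> of <span>s whose font size grows with word frequency."""
    # crude tokenizer: lowercase words, allowing one inner apostrophe
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words).most_common(max_words)
    if not counts:
        return '<div class="tagcloud"></div>'
    hi = counts[0][1]          # count of the most frequent word
    lo = counts[-1][1]         # count of the least frequent word kept
    spread = (hi - lo) or 1    # avoid dividing by zero when all counts match
    spans = []
    for word, n in sorted(counts):  # alphabetical order
        size = min_px + (n - lo) * (max_px - min_px) / spread
        spans.append('<span style="font-size:%dpx">%s</span>'
                     % (round(size), escape(word)))
    return '<div class="tagcloud">%s</div>' % " ".join(spans)

html = tag_cloud_div("down down down fall fall end")
print(html)
```

Stop words, custom tokenizers and other sort orders would slot in around the Counter step.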

From MLOSS.org: Free your code

"Not sharing your code basically adds an additional burden to others who may try to review and validate your work", as John Locke was quoted in a recent article in the Communications of the ACM. Of course there is the flip side to this in our competitive academic environment. As Scott A. Hissam puts it "... The academic community earns needed credentialing by producing original publications. Do you give up the software code immediately? Or do you wait until you've had a sufficient number of publications? If so, who determines what a sufficient number is?"

In a data driven computational field like machine learning, many of our results are dependent on some sort of calculation. Yes, in principle, many methods could be implemented from scratch based on a set of equations, but in practice, most people do not have the time (or the capability) to code up all prior art from scratch. In some sense good code (like a good waiter/waitress) remains in the background. My favourite example is all the linear algebra software that is common in many programming environments. Most people don't even think about the numerical complexities of finding eigenvalues since there is a "built in" function for it. This would not have been possible without the BLAS and LAPACK open source projects. So, write code, and make it open source.

"But I don't write good code..."

Nick Barnes from the Climate Code Foundation argues that you should release it anyway. A recent opinion piece by Nick, and a Nature News article quoting other famous people, give many reasons why code should be open. In his blog piece, he gives more points. Among them:

  • publication on its own is not enough
  • software skills are important and must be funded
  • open development is important
  • the longest program starts with a single line of code

Additional points:
* Not every theory is perfect on first release either, but it gets built on iteratively. It wouldn't get improved if it wasn't put out there. Same for code.
* It's not science if it's not repeatable
* code should be cited too. That's good for progress and good for your citations.

An Assert-Challenge-Confirm model for Assessment in Learning

some random thoughts about assessment in education while reading philipp's post about the future of assessment:

observations
* there is an approximately finite set of learning inputs for a given "course"
* there is an infinite, unbounded set of possible learning outputs
* amount and nature of learning is a function of the individual and external context
* how much and for how long people remember is also extremely variable. 
* trying to control the output is generally a great way to kill the learning spirit. 
* quantity and quality are different types of learning outputs. 
* do we want to use assessment to reward a level of knowledge, or the completion of a process?
* if a tree falls in the forest... if someone learns something and has no way to communicate or demonstrate it, is it learning?
* in assessment, there is an expected homogeneity of output (what was learned) from input (course content). 

thoughts:
* perhaps we should think about using a statistical system instead of a rule based one. not "i AM or am NOT certified in X" or "I did or did not pass this course", but something more continuous. but what would that mean? 
* do we want to measure process or output? eg. time spent? or skills learned? the latter is necessarily subjective, dependent on the context of the learner (among other things), and certainly not homogeneous.
* "learning to the test" is bad, but so is teaching to the test-- it's worse in fact, because students in general look to the "teacher" or even facilitator to define the expectations of the learning experience. so no wonder they learn to the test. 

what do we use assessment for? really, the role of assessment isn't, for example, "who should i hire," although it's often marketed that way. practically, it plays more of a culling or curation role, along with other factors, in choosing who to consider for some opportunity. it's an interior node on a decision tree. but ultimately, people make decisions based on many other factors we haven't learned how to measure yet, and maybe never will or maybe never want to. but the curation axis is relative and subjective-- that's why there's so much room for curation.

is there any objective, open ended definition of learning? what if we could measure the number of neural pathways formed? even then you'd have to identify which of those pathways were a function of the learning experience and which were a function of the rest of life :). 

properties of some improved system:
* continuous instead of discrete
* heterogeneous instead of homogeneous
 
a proposal
actually one interesting idea would be to have people assert for themselves what they learned in a course. of course anyone can assert something, but imagine using something akin to TCP's 3-way handshake. Let's call it the assert-challenge-confirm workflow. 

student --> facilitator/course participants/community:    "i learned x from this course"
facilitator/course participants/community --> student:    issues challenge: "well then probably you should be able to do Y"
student --> facilitator/course participants/community:    why yes, indeed i can, and here's an example. 

this keeps the possible outputs unbounded, scales to the individual, and is declarative rather than passive. it leaves room for students who do not wish to assert any specific learning outcomes, without preventing or punishing their participation, or even needing an alternative model for it. in fact i think it's really natural that the "assessment" portion is an added burden on both the student and the assessors (whoever they are). it seems like it would focus assessment on those times or areas where it's particularly valuable. one could imagine gaming this system by asserting a bazillion things until you hit some challenges you're capable of completing. aside from the human reaction which would likely limit this, you could also use a measure of precision which considers correct claims and incorrect claims, instead of correct claims alone.
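to make the precision idea concrete, here's a small sketch in python. everything in it (the Claim record, its field names, the precision function) is invented for illustration; it just shows that counting failed challenges alongside confirmed ones makes the "assert a bazillion things" strategy look bad.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    assertion: str                    # step 1: "i learned x from this course"
    challenge: Optional[str] = None   # step 2: "then you should be able to do Y"
    confirmed: Optional[bool] = None  # step 3: did the student's example hold up?

def precision(claims):
    """Confirmed claims divided by all claims that were challenged and resolved.

    Claims still waiting on a challenge or a response are ignored.
    Returns None if nothing has been resolved yet.
    """
    resolved = [c for c in claims if c.confirmed is not None]
    if not resolved:
        return None
    return sum(c.confirmed for c in resolved) / len(resolved)

# a student who asserts two things and backs up both
honest = [Claim("x", "do Y", True), Claim("z", "do W", True)]

# a student who asserts ten things and can only back up two
gamer = [Claim("claim %d" % i, "do it", i < 2) for i in range(10)]

print(precision(honest), precision(gamer))
```

both students end up with two confirmed claims, but the second one's precision is much lower, which is the signal you'd want.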

the alternative when assessment is overkill would be to issue simple participation badges that make little claim about assessment, but could still be used as an indicator of the individual's interests and activities-- as well as a starting point for more custom/personal assessments and investigation of outcomes gained.

Programmatic Retrieval of Web Page Source with Unicode Characters

So you want to retrieve the source from a site (this is python 2, using urllib2):

>>> import urllib2
>>> fp = urllib2.urlopen(site)
>>> raw = fp.read()
>>> fp.headers['content-type']
'text/html; charset=utf-8'

now, what encoding is raw in? you might think python has already taken care of the charset named in the content-type header-- utf-8. but that header only tells you which character encoding the server expects you to use when interpreting the content. what you actually get back is a plain byte string:

>>> type(raw)
<type 'str'>

'raw' itself is actually a byte-stream. since urllib/urllib2 doesn't automatically detect and decode the returned content, we need to decode it from the proper (read: content-type specified) encoding ourselves. now:

>>> unicode(raw)

will yield:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 55751: ordinal not in range(128)

because the unicode function is assuming the input string is in ascii-- you have to tell it that it's in utf-8:

>>> ustring = unicode(raw, 'utf-8')
or, more generally for arbitrary content types (nicely shown on this post):

>>> encoding = fp.headers['content-type'].split('charset=')[-1]
>>> ustring = unicode(raw, encoding)

you can also accomplish the same thing with:
>>> raw.decode(encoding)

to clarify: raw is a sequence of bytes that happens to be utf-8 encoded, not a unicode string. essentially, raw contains byte sequences like '\xc2\xa0'. when you decode it, ustring holds the actual unicode characters that those byte sequences represent (here u'\xa0', a non-breaking space).
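for comparison, here's roughly the same idea sketched in python 3, where the bytes/text split is explicit (bytes vs str). it skips the network call and works on a byte string and a header value directly. the helper names charset_of and decode_body are made up, and falling back to utf-8 when the header names no charset is my assumption, not a rule.

```python
from email.message import Message

def charset_of(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value.

    Falls back to `default` when the header carries no charset (an
    assumption for this sketch), which the split('charset=') trick
    does not handle.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

def decode_body(raw, content_type):
    """Decode the bytes returned by read() into a text string."""
    return raw.decode(charset_of(content_type))

raw = b"caf\xc3\xa9\xc2\xa0au lait"   # utf-8 bytes, as they come off the wire
text = decode_body(raw, "text/html; charset=utf-8")
print(type(raw).__name__, type(text).__name__)
```

here raw is 14 bytes long while text is 12 characters long, because the two 2-byte sequences each decode to a single character.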

Pingbacks and "Conversation Overlays" on the Web

Why have pingbacks not revolutionized the web already? Because they are not easily ubiquitous (only certain sites support them), and because there's still no way to pivot your view of the web into conversation view. So as a blog author you might see someone (or many people) have referenced your post, but it doesn't help you understand the 'graph' of the discussion related to that post, nor synthesize it.

Instead of a flat peer to peer pingback, maybe have a distributed system of intermediary servers, software that anyone can install, via which each user routes their pingbacks, and possibly registers additional callbacks to. These servers can act like a routing or discovery mechanism, but for conversations. They would constantly map the conversation graph, building a conversation web-- an overlay to existing content.

With a link to any content that is part of a conversation, you should be able to submit it to a "conversation server," and have it render back to you a graph of the whole conversation. You should also be able to elect to receive notifications when any node or subgraph is updated-- you could define notification or pingback thresholds based on in/out-degree or graph distance from the original post.
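As a rough sketch of what one of these conversation servers might keep in memory: treat every pingback as an edge between two urls, and answer "show me the whole conversation" with a breadth-first walk from whatever url was submitted. The class and method names are invented, links are treated as undirected for simplicity, and the distance-based notification check is just one of the thresholds mentioned above.

```python
from collections import defaultdict, deque

class ConversationServer:
    """Toy in-memory conversation graph built from pingbacks."""

    def __init__(self):
        # undirected adjacency: url -> set of urls it links to or is linked from
        self.links = defaultdict(set)

    def add_pingback(self, source, target):
        """Record that `source` references `target`."""
        self.links[source].add(target)
        self.links[target].add(source)

    def distances(self, url):
        """Breadth-first search: graph distance from `url` to every
        post in the same conversation (including `url` itself at 0)."""
        dist = {url: 0}
        queue = deque([url])
        while queue:
            node = queue.popleft()
            for nxt in self.links.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    def should_notify(self, subscribed_url, updated_url, max_distance):
        """Notify only if the update is within `max_distance` hops."""
        d = self.distances(subscribed_url).get(updated_url)
        return d is not None and d <= max_distance

server = ConversationServer()
server.add_pingback("blog-b/reply", "blog-a/post")
server.add_pingback("blog-c/reply-to-reply", "blog-b/reply")
server.add_pingback("blog-e/other", "blog-d/unrelated")
print(sorted(server.distances("blog-a/post").items()))
```

A distributed version would need servers to exchange edges with each other, which this sketch doesn't attempt.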

Open Science Microformats - Initial Thoughts

Why Microformats for Science

Microformats allow for automated detection on a page or exposure in a feed or other activity stream. This capability and the simple local markup of the components means these atomic bits of science can be extracted, linked to as references, and commented on, composed or extended. Without microformats, openly shared science and research lack the formality needed to enable a true ecosystem of ideas and repeatable, verifiable science. Specifically, microformats enable: 

  • Formality
  • Local link-ability, reference-ability
  • Attribution
  • Composition, aggregation, extension, re-use
  • (Others?)

Possible Microformats for use in Open Lab Book Science

These microformats are components that we want to be able to stand alone, to be re-used, composed, or aggregated, while maintaining some structure and reference back to their origins. Attribution is important in science-- all the more so when practicing openly-- so of course that plays a role in each format discussed here. Developing theories or hypotheses, reading others' work, and identifying open questions are common aspects of all fields of research. In technical fields, formal specification of experimental process is also critical.

(Note: several of the mentioned attributes already exist)

Research Topic
which contains:
Researchers (SHOULD)
Description (MUST)
Tags/Fields (SHOULD)
References (MIGHT)

Research Question
(open question/research question)
which contains:
Research Topic (SHOULD)
Question Text (MUST)
Author (SHOULD)
Datetime (SHOULD)
Tags (MIGHT)

Theory
which contains:
Hypothesis (MUST)
Evidence (MIGHT)
Comment (MIGHT)
Conclusion (MIGHT)
Author (SHOULD)
Citations/References (MIGHT)
Datetime (SHOULD)
Tags (MIGHT)

Reference Comment 
(eg. comments on a paper being read, but could also reference other hypotheses, experiments, etc.)
which contains:
Reference (eg. "currently reading...") (MUST)
Comment (MUST)
Author (SHOULD)
Datetime (SHOULD)

Formal Process
which contains:
one of: Algorithm/Equation/Chemical Solution (MUST)
Citation/reference (MIGHT)
Datetime (SHOULD)
Author (MIGHT)
Units/hMeasure (MIGHT)

Experiment 
(a research question becomes an experiment when it is formalized; certain fields would only be added after the experiment is run.)
which contains:
Objective (MIGHT)
Description (SHOULD)
Steps (MUST)
Formal Process(es) (SHOULD)
Materials (MIGHT)
Datetime (MIGHT)
Designer (Author) (SHOULD)
Experimenters (MIGHT)
Conclusion (MIGHT)
Log (MIGHT)
Results (MIGHT)
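
One way to read the MUST/SHOULD/MIGHT levels above is as a checklist that a tool could run over any extracted record. Here is a minimal sketch of that reading, with two of the formats copied from the lists above into plain dicts. The dict layout and the check function are assumptions for illustration, not a proposed markup.

```python
# requirement levels copied from the "Research Question" and "Theory" lists above
SCHEMAS = {
    "Research Question": {
        "Research Topic": "SHOULD",
        "Question Text": "MUST",
        "Author": "SHOULD",
        "Datetime": "SHOULD",
        "Tags": "MIGHT",
    },
    "Theory": {
        "Hypothesis": "MUST",
        "Evidence": "MIGHT",
        "Comment": "MIGHT",
        "Conclusion": "MIGHT",
        "Author": "SHOULD",
        "Citations/References": "MIGHT",
        "Datetime": "SHOULD",
        "Tags": "MIGHT",
    },
}

def check(kind, record):
    """Return (errors, warnings) for a record of the given kind.

    errors   -- MUST fields that are missing
    warnings -- SHOULD fields that are missing
    MIGHT fields are optional and never reported.
    """
    schema = SCHEMAS[kind]
    errors = [f for f, level in schema.items()
              if level == "MUST" and f not in record]
    warnings = [f for f, level in schema.items()
                if level == "SHOULD" and f not in record]
    return errors, warnings

theory = {"Hypothesis": "open code gets cited more", "Author": "me"}
print(check("Theory", theory))
```

The remaining formats would be added to SCHEMAS the same way.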

Examples in the wild
There are many people who blog about the idea of open notebook science but not nearly as many practicing it-- partially because the tools don't exist (IMHO). 

Lots of good examples on Open Wet Ware
Jean Claude Bradley's Open Notebook Science Challenge

More to come on drafting markup format and re-use of existing microformats and microformat design patterns.  Also of course there is the discussion of why microformats over, say, RDFa. Another time. 

Why is HP studying Social Media?

Why is HP conducting research on social media?

HP believes that information is becoming the greatest resource we have for addressing problems in business and society.  Social media is increasingly becoming many people's interface to IT, and these media interactions produce an enormous amount of data.  However, data isn't necessarily information.  Creating software, hardware, and services that can automatically analyze enormous data sets and help people make informed decisions is an extremely challenging technical task and an area of focus at HP Labs

HP really has a pretty good statement about why it is studying social media. (the research itself is neat, too :)).

Activity streams for open science

Inspired by the open science summit and the cool new CoLab science collaboration site, i've been thinking about creating activity streams for open science lab books. the activity streams spec still seems pretty drafty/in flux, but going off the examples i think we could create something like this:

other verbs (some not even science specific) (extensions of create/post/update): 
  • create/post/update a hypothesis
  • create/post/update an observation
  • create/post/update evidence
  • create/post/update a paper
  • create/post/update a conclusion
  • create/post/update a derivation or a proof
  • extend a derivation
  • comment on a paper
  • comment on *
  • fork a hypothesis
  • create/post/update a theorem
  • create/post/update a lab book
  • create/post/update a research topic
  • create/post/update code or data associated with many of the items above
others?
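
as a sketch of what one entry in such a stream could look like, here's a small python function that builds an activity as a plain dict and dumps it to json. the actor/verb/object/target layout follows my reading of the draft activity streams json examples, so treat the field names as an assumption. the "fork" verb and the "hypothesis" object type are the hypothetical science extensions from the list above, not part of any spec.

```python
import json
from datetime import datetime, timezone

def make_activity(actor, verb, object_type, title, target=None):
    """Build one activity entry as a json-serializable dict."""
    entry = {
        "published": datetime.now(timezone.utc).isoformat(),
        "actor": {"objectType": "person", "displayName": actor},
        "verb": verb,  # e.g. "post", "update", or a hypothetical "fork"
        "object": {"objectType": object_type, "displayName": title},
    }
    if target is not None:
        # the thing the activity was directed at, e.g. a lab book
        entry["target"] = {"objectType": target[0], "displayName": target[1]}
    return entry

entry = make_activity(
    actor="a researcher",
    verb="fork",
    object_type="hypothesis",
    title="open code gets cited more",
    target=("lab book", "my open notebook"),
)
print(json.dumps(entry, indent=2))
```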

You Weren't Meant to Have a Boss

the only way I can imagine for larger groups to avoid tree structure would be to have no structure: to have each group actually be independent, and to work together the way components of a market economy do.

and also...

Founders arriving at Y Combinator often have the downtrodden air of refugees. Three months later they're transformed: they have so much more confidence that they seem as if they've grown several inches taller. [4] Strange as this sounds, they seem both more worried and happier at the same time. Which is exactly how I'd describe the way lions seem in the wild.

Fantastic paul graham article that ties into jake's and my sxsw proposal (no link yet) about the fall of organizations and the rise of the free agent citizen.