The following is an overview of the big ideas I’m aiming to cover both in this blog and with Design By Robots in general. I’m currently drafting a bunch of new blog posts together, and will try to push them out one by one as they get cleaned up.

Organizational Issues in Data Analysis

  • The software industry is offering Summarize-Visualize-Share as the next big idea in data analysis (as I write this in early 2011, at least). This style of software is already suffering from diminishing returns. Every data set you add produces one more summary that a human must view and understand, and at some point you run out of hours in the day to look at more summaries, no matter how nice the charts and graphs are. Speeding up the process with “Facebook for Data” or “Facebook for Infographics” won’t help all that much.
  • Pattern detection technology has trouble gaining a foothold in project-centric businesses. A project-centric business organizes work around fixed-budget projects, and requires success to get follow-on funding. This is the way pretty much the whole world works, but it causes problems with projects involving statistics in particular because:
    • Analyzing a data set for the first time is high risk, and often requires a large but unknown number of iterations to get anything useful.
    • Current tools require significant expertise, so these iterations are very time-consuming and expensive.
    • It’s too easy to lie with statistics. Every project gets reported as a “success,” even when it really isn’t.
    • Even if a successful prototype is built, it rarely integrates well with the data analysts’ workflow centered around creating spreadsheets and writing reports.
  • The ultimate goal of understanding data is to make better-informed design decisions: product designs, marketing strategies, distribution strategies, labor plans, and so on.
  • If design decisions must be made by a single person or tight-knit group, then as the amount of data they try to grasp to guide their decisions grows, their understanding of that data must necessarily get shallower and shallower.
  • Good design decisions account for the specifics of the problem at hand; tools that integrate specialized knowledge and specific data points, not just summaries, are needed.
  • In other words, organizations that don’t let go of document-sharing as their primary means of integrating knowledge are toast.

Organizational Issues in Automating Design

  • True artificial intelligence is a lame goal. What we really need are software systems that can monitor all the world’s data and can build us whatever we want given that context. I call this automated design. Who cares if the program is self-aware when it does it? Let’s just build to the real requirement.
  • You’re going to have to give up on humans understanding all available data. Stop trying to read every report. Stop trying to make better charts and summaries. Some small number of specialists will need to understand each type of data, but no one person can understand everything, so find a better way.
  • To bypass the need for humans to understand all the data, we must connect data processing and pattern detection directly to the design process, thereby automating the design. As new data comes in, new design decisions must flow out.
  • When new designs flow out based on new data, it’s important to get a number of good designs, each optimal in its own way, to account for the unexpected. Imagine you use trip planner software to get driving directions for a trip you’re leaving on today. It suggests either a fast route that keeps you on the interstate or a slower route with many scenic overlooks. You like scenic overlooks, but you look outside and it’s raining, so you take the interstate.
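To make “a number of good designs, each optimal in its own way” concrete, here is a minimal Python sketch of the trip-planner example. All the names and numbers are hypothetical; the point is that the software returns every Pareto-optimal route (no route in the set is beaten on both time and scenery) and leaves the final, context-dependent pick to the human:

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    hours: float        # travel time: lower is better
    scenic_stops: int   # scenic overlooks: higher is better

def pareto_front(routes):
    """Keep every route that no other route beats on both criteria."""
    front = []
    for r in routes:
        dominated = any(
            o.hours <= r.hours and o.scenic_stops >= r.scenic_stops
            and (o.hours < r.hours or o.scenic_stops > r.scenic_stops)
            for o in routes
        )
        if not dominated:
            front.append(r)
    return front

routes = [
    Route("interstate", hours=4.0, scenic_stops=0),
    Route("coastal", hours=6.5, scenic_stops=8),
    Route("backroads", hours=7.0, scenic_stops=5),  # slower AND less scenic than coastal
]

options = pareto_front(routes)  # interstate and coastal survive; backroads is dominated

# The human applies context the software can't see: it's raining,
# so scenery is worthless today and the fastest option wins.
raining = True
choice = (min(options, key=lambda r: r.hours) if raining
          else max(options, key=lambda r: r.scenic_stops))
```

Note that the planner never has to know it’s raining: by producing the whole set of optimal trade-offs instead of a single “best” answer, it leaves room for the unexpected.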

Technical Issues in Data Analysis

  • Data analysis systems are typically organized as a series of methods applied to tables of data. In software lingo, this is a procedural style and doesn’t work very well.
    • This coding style is a problem because what’s really happening is a side effect of procedural steps: the true intent of what is being done to the data and why is not declared in code, but rather implied by the series of mathematical operations being applied to the data.
    • The data’s real-world meaning and the details of the machine learning implementation can’t be hidden from each other. It takes an engineer with deep knowledge of both the data set and the machine learning implementation being used to get decent results.
    • In other words, there is not enough encapsulation at work to develop well-engineered applications, or to effectively have more than one person work on developing a single application.
    • Software engineering took a large step forward when it moved from a procedural style (using languages like C and Fortran) to object-oriented programming (Java, C#). A similar advancement is needed for analytics applications.
  • Focusing on distributed algorithm development (Hadoop, Google’s Go programming language) only makes the problem worse by reinforcing the procedural nature of machine learning applications. The result is that the software industry is trying to develop the next generation of practical artificial intelligence applications with tools that look like primitive procedural languages, albeit procedural languages that are very complex so that they can run a program across multiple machines.
  • The better solution is to focus on the data structures used in machine learning applications. I call this a “patternable” form, which satisfies the following:
    • Above all else, a machine learning algorithm that accepts data in this patternable form can perform pattern recognition with no further modifications or meta-data.
    • Can represent any kind of data, just like database tables or XML.
    • Because of its strong emphasis on declaring what’s happening in the data format, distributed computing can be handled by transmitting data and program parameters around a network using simple web technologies like HTTP.
  • This new data representation is rebuilt from first principles of statistical theory, treating data types as sets of possible values defined by combinatorics expressions.
    • Groups of data points observed in the real world are then subsets of the set of possible values. (Strictly speaking, observed data is a multiset rather than a subset, since it can contain duplicates; don’t worry about that right now.)
    • “Treating data types as sets of possible values defined by combinatorics expressions” is just a phrase I use to scare away the squares. It’s actually a pretty simple model to understand; it just doesn’t look quite like other data models, and it’s not obvious why you would view your data this way until you see how much easier it is to build sophisticated machine learning apps with it.
  • For the technically inclined: The above can be summarized as a hierarchical data model whose primary feature is its ability to encapsulate data to feed into machine learning algorithms.
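As a rough illustration of that hierarchical, “patternable” idea, here is a toy Python sketch. The class and function names are my own invention for this post, not an actual implementation; the point is that every field declares its full set of possible values up front, so a completely generic pattern-detection routine can run on any conforming data with no further modifications or metadata:

```python
from collections import Counter

class CategoricalType:
    """A data type defined as an explicit set of possible values."""
    def __init__(self, name, possible_values):
        self.name = name
        self.possible_values = set(possible_values)

    def validate(self, value):
        if value not in self.possible_values:
            raise ValueError(f"{value!r} is not a possible {self.name}")
        return value

class Record:
    """A record whose fields are typed by declared value sets."""
    def __init__(self, schema, values):
        self.values = {f: t.validate(values[f]) for f, t in schema.items()}

def detect_patterns(schema, records):
    """Generic: works on ANY schema, because each type declares its
    possible values. Here a 'pattern' is just the observed frequency
    of each value, the simplest possible stand-in."""
    return {field: Counter(r.values[field] for r in records)
            for field in schema}

schema = {
    "color": CategoricalType("color", {"red", "green", "blue"}),
    "size": CategoricalType("size", {"S", "M", "L"}),
}
records = [Record(schema, {"color": "red", "size": "M"}),
           Record(schema, {"color": "red", "size": "L"}),
           Record(schema, {"color": "blue", "size": "M"})]

patterns = detect_patterns(schema, records)
```

Contrast this with the procedural style: `detect_patterns` never needed a human to tell it what the columns mean, because the schema itself carries that declaration.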

Technical Issues in Automating Design

  • Designs are actually a type of data. Designs are things people can legitimately alter (a chair is 40 inches tall and white, but you could paint it black), while what is usually called ‘data’ consists of factual observations (it was 23 degrees Fahrenheit at 12:33 pm in Des Moines, Iowa). By these definitions, any design can be represented in a format that can represent any kind of data.
  • If we represent our designs in our patternable data form, we can apply pattern detection techniques to sets of designs (just like any other data set). It’s then possible to use any patterns discovered in existing designs to guide future design decisions. Imagine designing a fuel efficient car by starting with the designs of 50 existing cars and their fuel efficiency ratings, and then looking for patterns in the design elements of the most efficient cars. This is a form of automated design I call pattern exploitation.
  • Not only are designs a form of data, but computer programs are a type of design. It turns out that the patternable data form is expressive enough to define little programming languages, thanks to the strict requirements it imposes on data types for the benefit of pattern detection. Like other examples of automated design, you can then use pattern exploitation to generate new, better computer programs.
  • For the technically inclined: Using pattern exploitation for automated design as described above is achieved by using genetic algorithms to optimize programs written in a constraint-based form of domain-specific languages (DSLs).
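To show the pattern-exploitation loop itself, here is a minimal genetic algorithm in Python. This is a toy, not the constraint-based DSL machinery described above: the design space is a few car parameters, and the `efficiency` function is a made-up stand-in for real measured fuel-efficiency ratings. It starts from a population of random “car designs” and lets the patterns in the fittest designs guide each new generation:

```python
import random

random.seed(0)  # deterministic for illustration

# Toy "design" space: a car as a few bounded parameters.
PARAMS = {"weight_kg": (800, 2000), "drag_coeff": (0.20, 0.40),
          "engine_l": (1.0, 4.0)}

def random_design():
    return {p: random.uniform(lo, hi) for p, (lo, hi) in PARAMS.items()}

def efficiency(d):
    # Stand-in for real efficiency ratings: lighter, sleeker,
    # smaller-engined designs score higher. Purely illustrative.
    return 100.0 - d["weight_kg"] / 50 - d["drag_coeff"] * 80 - d["engine_l"] * 5

def crossover(a, b):
    # Child inherits each parameter from one of its two parents.
    return {p: random.choice((a[p], b[p])) for p in PARAMS}

def mutate(d, rate=0.1):
    # Occasionally resample a parameter to keep exploring.
    out = dict(d)
    for p, (lo, hi) in PARAMS.items():
        if random.random() < rate:
            out[p] = random.uniform(lo, hi)
    return out

# Start from 50 "existing car designs" and let the patterns in the
# fittest designs guide the next generation.
population = [random_design() for _ in range(50)]
initial_best = max(population, key=efficiency)
for _ in range(40):
    population.sort(key=efficiency, reverse=True)
    parents = population[:10]  # exploit patterns in the best designs
    population = parents + [mutate(crossover(*random.sample(parents, 2)))
                            for _ in range(40)]

best = max(population, key=efficiency)
```

The same loop applies unchanged whether the “design” is a set of car parameters or, with a suitable encoding, a small program written in a DSL.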