Thursday, October 21, 2010

How to Analyze Your Data and Take Advantage of Machine Learning in Your Application #s2gx

Christian Schalk - Google
Google's New Cloud Technologies
  • google storage for developers 
    • api compatible with amazon s3
  • prediction api (machine learning)
  • bigquery
Google Storage
  • store your data in google's cloud 
    • any format, any amount, any time
  • you control access to your data 
    • private, shared, public
  • access via google apis or third party tools/libraries
  • sample use cases 
    • static content hosting, e.g. static html, images, music, video
    • backup and recovery
    • sharing
    • data storage for applications 
      • e.g. used as storage backend for android, appengine, cloud based apps
    • storage for computation 
      • bigquery, prediction api
Google Storage Benefits
  • high performance and scalability 
    • backed by google infrastructure
  • strong security and privacy 
    • control access to your data
  • easy to use 
    • get started fast with google and third party tools
Google Storage Technical Details
  • restful api 
    • get, put, post, head, delete
    • resources identified by uri
    • compatible with s3
  • buckets -- flat containers
  • objects 
    • any type
    • size: up to 100 gb / object
  • access control for google accounts 
    • for individuals and groups
  • two ways to authenticate requests 
    • sign request using access keys
    • ???
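A minimal sketch of how the bucket/object model above could map onto REST requests. The host name is a placeholder and the helper names are hypothetical; neither comes from the talk, and no request is actually sent.

```python
from urllib.parse import quote

# Assumption: placeholder host, not the real service endpoint.
HOST = "storage.example.com"

# The HTTP verbs listed in the notes.
ALLOWED_METHODS = {"GET", "PUT", "POST", "HEAD", "DELETE"}


def object_uri(bucket: str, object_name: str) -> str:
    """Build the URI that identifies one object inside a bucket."""
    # Buckets are flat containers, so a "/" in an object name is just
    # part of the name; it is kept as-is while other characters are escaped.
    return f"https://{HOST}/{quote(bucket)}/{quote(object_name, safe='/')}"


def describe_request(method: str, bucket: str, object_name: str) -> str:
    """Return a one-line description of a request, e.g. 'HEAD https://...'."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    return f"{method} {object_uri(bucket, object_name)}"


if __name__ == "__main__":
    print(describe_request("PUT", "mybucket", "photos/cat 1.png"))
```

Every resource gets its own URI, and the verb says what to do with it, which is the point of the "restful api" bullet.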
Performance and Scalability
  • objects of any type, up to 100GB/object
  • unlimited numbers of objects, 1000s of buckets
  • all data replicated to multiple US data centers
  • leveraging google's worldwide network for data delivery
  • only you can use bucket names with your domain names
  • read-your-writes data consistency
  • range get
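"Range get" presumably refers to the standard HTTP Range header, which lets a client fetch part of a large object. A rough sketch of building those headers follows; the function names are hypothetical and nothing here is specific to Google Storage.

```python
def range_header(start: int, end: int) -> dict:
    """Return an HTTP Range header asking for bytes start..end (inclusive)."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return {"Range": f"bytes={start}-{end}"}


def chunk_ranges(total_size: int, chunk_size: int):
    """Yield inclusive (start, end) byte pairs that cover total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_size) - 1


if __name__ == "__main__":
    for start, end in chunk_ranges(250, 100):
        print(range_header(start, end))
```

Splitting a download into ranges like this is one way to resume or parallelize transfers of very large objects.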
Security and Privacy Features
  • key-based authentication
  • authenticated downloads from a browser

Getting Started with Google Storage
  • go to http://code.google.com for basic info
  • http://code.google.com/apis/storage (currently in preview mode) 
    • getting started guide, docs, etc.
    • can sign up for an account
  • command line tool available -- gsutil -- low-level access from the command line, scripting
  • google storage manager -- web-based tool for managing google storage

Google Storage Usage Within Google & Early Adopters
  • google bigquery
  • google prediction api
  • google.org -- imagery
  • google patents
  • panoramio
  • picnik
  • vmware
  • US Navy
  • theguardian
  • socialwok
  • xylabs
  • etc.
Pricing
  • storage: $0.17/gb/month
  • also costs for up/downloads
  • similar pricing to amazon s3
  • preview in US 
  • non-US preview available on case-by-case basis

Google Prediction API
  • google's sophisticated machine learning technology
  • available as an on-demand restful http web service
  • provide a bit of text and "train" the algorithm in the service to predict outcomes based on patterns 
  • simple example: language detection 
    • provide series of examples of english, spanish, french, etc. and train the prediction api to recognize the language
  • endless number of applications 
    • customer sentiment
    • transaction risk
    • etc
Prediction API Examples
  • predict and respond to emails in an automated way
Using the Prediction API
  • three step process 
    • upload training data to google storage
    • build a model from your data
    • make new predictions
Training
  • POST prediction/v1.1/training?data=mybucket...
  • reports when the prediction engine is ready and gives an estimate of accuracy

Predict
  • apply the trained model to make predictions on new data
  • returns json data
  • includes scores indicating confidence of prediction
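A sketch of how a client might use the confidence scores in the JSON result. The response shape and field names below are made up for illustration; the talk did not show the actual format.

```python
import json

# Hypothetical response body: one score per candidate label.
SAMPLE_RESPONSE = '{"scores": {"english": 0.92, "spanish": 0.05, "french": 0.03}}'


def best_label(body: str, min_score: float = 0.5):
    """Return (label, score) for the top score, or (None, score) if too low."""
    scores = json.loads(body)["scores"]
    label, score = max(scores.items(), key=lambda item: item[1])
    if score < min_score:
        # Not confident enough: let the caller fall back to something else.
        return None, score
    return label, score


if __name__ == "__main__":
    print(best_label(SAMPLE_RESPONSE))
```

Applying a threshold like `min_score` is a design choice for the caller; an automated email responder, for example, might only act on high-confidence predictions.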

Prediction API Capabilities
  • data 
    • input features: numeric or unstructured text
    • output: up to hundreds of discrete categories
  • Training 
    • many machine learning techniques
Prediction Demo
  • cuisine predictor
  • spreadsheet of type of food (e.g. mexican, italian, french) and food description as training data
  • upload spreadsheet to google storage
  • kick off training process, then can check to see if it's done
  • pretty accurate predictions even on a limited training dataset
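The training data for a demo like this could be prepared roughly as follows. This sketch assumes a two-column CSV with the category first and the description second; that layout and the sample rows are assumptions, not details from the demo.

```python
import csv
import io


def build_training_csv(examples) -> str:
    """Serialize (cuisine, description) pairs as CSV text, one example per row."""
    buffer = io.StringIO()
    # Quote every field so commas inside descriptions stay in one column.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for cuisine, description in examples:
        writer.writerow([cuisine, description])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up rows standing in for the spreadsheet used in the demo.
    rows = [
        ("mexican", "corn tortillas, black beans, salsa"),
        ("italian", "fresh pasta with tomato and basil"),
        ("french", "duck confit with potatoes"),
    ]
    print(build_training_csv(rows), end="")
```

The resulting file is what would be uploaded to Google Storage before kicking off training.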
Google BigQuery
  • also resides on top of google storage
  • can have large amounts of data that you can quickly analyze using sql-like language
  • fast, simple to use
Use Cases
  • interactive tools
  • spam
  • trends detection
  • web dashboards
  • network optimization
Key Capabilities
  • scalable to billions of rows
  • fast--response in seconds
  • simple--queries in sql
  • webservice based--rest, json
Using BigQuery
  • upload to google storage
  • call bigquery service to import raw data into bigquery table
  • perform sql queries on table
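The import-then-query flow can be tried locally. The sketch below uses Python's built-in sqlite3 as a stand-in for BigQuery, so it only illustrates the shape of the workflow (raw data in, table created, SQL out); the table, columns, and data are invented.

```python
import csv
import io
import sqlite3

# Made-up raw data standing in for a file uploaded to storage.
RAW_DATA = "word,hits\nhello,3\nworld,5\nhello,4\n"


def import_and_query(raw: str):
    """Load CSV text into an in-memory table and run an aggregate SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Import step: raw rows become a queryable table.
        conn.execute("CREATE TABLE words (word TEXT, hits INTEGER)")
        rows = [(r["word"], int(r["hits"])) for r in csv.DictReader(io.StringIO(raw))]
        conn.executemany("INSERT INTO words VALUES (?, ?)", rows)
        # Query step: plain SQL over the imported table.
        cursor = conn.execute(
            "SELECT word, SUM(hits) AS total FROM words "
            "GROUP BY word ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(import_and_query(RAW_DATA))
```

With BigQuery the same kind of aggregate query would run as a web service call over far larger tables.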
Security and Privacy
  • google accounts
  • oauth
  • https

Tools
  • bigquery shell utility available -- just type sql commands and get responses back
  • can tie in a google spreadsheet and point it to a bigquery table
