Nikhil KarnikJekyll2015-02-23T01:50:42+00:00http://nkarnik.github.io/Nikhil Karnikhttp://nkarnik.github.io/nikhil.pankaj.karnik@gmail.comhttp://nkarnik.github.io/blog/first-post2015-02-16T19:39:55+00:002015-02-16T19:39:55+00:00Nikhil Karnikhttp://nkarnik.github.ionikhil.pankaj.karnik@gmail.com
<p>This was my first time using a static site generator to build a personal website. I’ve had a lot of fun using <a href="http://jekyllrb">Jekyll</a> and the beautiful So Simple Theme. I’ll flesh out this post to explain exactly how easy it is to host a personal site with GitHub Pages and Jekyll (and set up DNS with Namecheap/GoDaddy).</p>
<p><a href="http://nkarnik.github.io/blog/first-post/">Built with Jekyll</a> was originally published by Nikhil Karnik at <a href="http://nkarnik.github.io">Nikhil Karnik</a> on February 16, 2015.</p>
http://nkarnik.github.io/articles/facebook-comments-12015-01-19T00:00:00-00:002015-01-19T00:00:00+00:00Nikhil Karnikhttp://nkarnik.github.ionikhil.pankaj.karnik@gmail.com
<p>If your website or blog uses Facebook comments as its discussion platform, you may be unknowingly sitting on a goldmine of marketing data. In one interpretation, people who take the time to comment on your website (not spam) are some of your most engaged readers. Or enraged readers, depending on how objectionable they find your content :). Given Facebook’s aggressive “real name” requirements, having your most engaged readers’ valid identities and information can be useful for marketing strategy.</p>
<p>What is one digestible use of this data? A simple exercise is plotting the known locations of your commenters. Not every commenter publicly broadcasts their location, so the input dataset is not perfect, but incomplete is much better than nonexistent. <a href="http://www.fivethirtyeight.com" title="FiveThirtyEight" target="_blank">FiveThirtyEight.com</a> is a leading data journalism website, founded by Nate Silver. In addition to Nate, one of the most prolific authors on the site is the lead writer for FiveThirtyEight’s Datalab, Mona Chalabi. Here is a quick comparison of the known locations of commenters from each of their last 50 articles, with comment density by location. </p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/e538d1a4-a026-11e4-a8b5-0e0c41326911/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/831e466c-a02a-11e4-ad2a-0e9d821ea90d/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<p>All credit for the beautiful and user-friendly mapping software goes to the great folks at <a href="http://www.cartodb.com" title="CartoDB" target="_blank">CartoDB</a>.</p>
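<p>For readers who want to reproduce the "comment density by location" figure behind maps like these, here is a minimal sketch. The sample rows and the two-column layout (<code>full_name</code>, <code>location</code>) are assumptions for illustration, not the actual export; rows with no public location are simply skipped.</p>

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape full_name,location; a blank location means
# the commenter does not share one publicly.
SAMPLE = """full_name,location
Commenter A,"Silver Spring, Maryland"
Commenter B,"New York, New York"
Commenter C,
Commenter D,"Silver Spring, Maryland"
"""


def comment_density(csv_text):
    """Count comments per known location, skipping rows with no location."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        location = (row.get("location") or "").strip()
        if location:
            counts[location] += 1
    return counts


density = comment_density(SAMPLE)
for location, count in density.most_common():
    print(location, count)
```

<p>The resulting counts can then be geocoded and uploaded to a mapping tool as one row per location.</p>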
<p>These maps are visually interesting and possibly fun for assigning bragging rights, but honestly, the full list of commenters (with their corresponding Facebook UIDs) and comments is much more useful. Facebook’s <a href="https://www.facebook.com/ads/audience_manager/" title="Audience Manager" target="_blank">custom audience advertising</a> is very much a walled garden, but it theoretically allows you to do interesting things with highly targeted ads.</p>
<p>If you are a high-quality but niche content producer like <a href="http://www.fivethirtyeight.com" title="FiveThirtyEight" target="_blank">FiveThirtyEight</a>, trying to mass promote your content a la Buzzfeed is probably the equivalent of dumping your advertising budget into the mysterious black holes featured in Interstellar. I haven’t tried this personally, so I admit that I’m just speculating.</p>
<p>A much more fruitful strategy of organically growing readership would be to market popular articles directly (using custom audiences) to your existing commenters, who may or may not be currently sharing your content on social channels. There is possibly no group of people more likely to share your content than the very people who are actively contributing to the discussion on your blog (and potentially interacting with the authors themselves).</p>
<p>Crawling Facebook comments is, surprisingly, not straightforward. Facebook comments are often not fully accessible until JavaScript interactions (such as clicks) are triggered after page load, so I built a Zillabyte component that automates most of this process. Once a browser automation or headless browser tool like Selenium or CasperJS is attached to the Facebook Comment virtual frame, the process is much the same from site to site.</p>
<p>Here is how to use that component in a Zillabyte app:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">zillabyte</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">zillabyte</span><span class="o">.</span><span class="n">app</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"facebook_comments"</span><span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">app</span><span class="o">.</span><span class="n">source_from_csv</span><span class="p">(</span><span class="s">"urls.csv"</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">[</span><span class="s">"url"</span><span class="p">])</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">call_component</span><span class="p">(</span><span class="s">"facebook_comment_extractor"</span><span class="p">)</span>
<span class="n">sink</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">sink</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"facebook_comments"</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[{</span><span class="s">"full_name"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},{</span><span class="s">"facebook_id"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span>\
<span class="p">{</span><span class="s">"page_url"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},{</span><span class="s">"location"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"comment"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span>\
<span class="p">{</span><span class="s">"author"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"page_title"</span><span class="p">:</span><span class="s">"string"</span><span class="p">}])</span></code></pre></div>
<p>Example row of the resulting CSV:
Eric Prange,male,"Silver Spring, Maryland",erprange,http://fivethirtyeight.com/datalab/which-state-has-the-worst-drivers/,MONA CHALABI,"Dear Mona, Which State Has The Worst Drivers?",Why are loses per driver only 10-20% of cost of premiums? Is there really that much overhead/profit or am I missing something here?</p>
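<p>Because the location and page title both contain commas, a row like this has to be read with a real CSV parser rather than a plain string split. A small sketch with Python's standard <code>csv</code> module, assuming the quoted fields use plain double quotes as CSV expects:</p>

```python
import csv
import io

# The example row from above, with quoted fields in plain double quotes.
ROW = (
    'Eric Prange,male,"Silver Spring, Maryland",erprange,'
    'http://fivethirtyeight.com/datalab/which-state-has-the-worst-drivers/,'
    'MONA CHALABI,"Dear Mona, Which State Has The Worst Drivers?",'
    'Why are loses per driver only 10-20% of cost of premiums? '
    'Is there really that much overhead/profit or am I missing something here?'
)

# csv.reader honours the quoting, so embedded commas stay inside their field.
fields = next(csv.reader(io.StringIO(ROW)))

# A naive split breaks the location and the page title apart.
naive = ROW.split(",")

print(len(fields), "fields with csv.reader;", len(naive), "pieces with str.split")
print("location:", fields[2])
```

<p>The same reader works unchanged on the full output file, one row per comment.</p>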
<p><a href="http://nkarnik.github.io/articles/facebook-comments-1/">Where Are My Facebook Commenters? Part 1</a> was originally published by Nikhil Karnik at <a href="http://nkarnik.github.io">Nikhil Karnik</a> on January 19, 2015.</p>
http://nkarnik.github.io/articles/guerrilla-marketing-local2015-01-19T00:00:00-00:002014-12-12T00:00:00+00:00Nikhil Karnikhttp://nkarnik.github.ionikhil.pankaj.karnik@gmail.com
<p>For e-commerce companies, cutthroat competition is an unavoidable fact of life. What if there were a way to leverage insights from competitors’ behavior? E-commerce companies’ main competitors are brick and mortar retail stores which sell an identical or substitute product. A company like Warby Parker may want to know where all offline eyewear retailers’ storefronts are located.
<em>(CartoDB map of most of the Visionworks and Lenscrafters locations in CA):</em></p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/b5363fc8-825d-11e4-9e60-0e9d821ea90d/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<p>For brick and mortar stores, nothing is more telling than where they choose to put up their storefronts. In the same way that a surf shop is less likely to open up in Nebraska, large retail brands spend a lot of time and money optimizing their store locations (using factors including population density and target demographics, limited by the cost/availability of real estate). At the least, targeting local ads near the locations of offline competitors will reach some significant percentage of your target demographic.</p>
<p>For your e-commerce business, you probably have a list of brands that compete in your vertical. Using store locators to manually scrape the locations of all of your competitors’ stores is a time-consuming and often incomplete process.</p>
<h1>How Zillabyte Solved this Problem</h1>
<p>Zillabyte Components are reusable pieces of data science. For common but difficult data related tasks, we have built an assortment of highly targeted and useful components. Our most popular component to date is the “domain_crawl” component that performs a deep crawl on a targeted domain.</p>
<p>For the common problem of retrieving business locations, I built a component called “business_locator” which takes an input list of businesses (e.g. Target, H&M, Lenscrafter, etc.) and outputs a CSV of all of the global physical locations of each input business. The <a href="https://gist.github.com/nkarnik/a2cd09e2f063d3bbd817" title="Input CSV" target="_blank">input CSV</a> is simply of the form:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">J</span><span class="o">.</span> <span class="n">Crew</span>
<span class="n">Men</span><span class="s">'s Warehouse </span>
<span class="n">Zara</span>
<span class="n">Macy</span><span class="s">'s </span>
<span class="n">Brooks</span> <span class="n">Brothers</span></code></pre></div>
<p>Under the hood, the component processes and cross-references a variety of distributed API calls to 3rd party services such as Google Places and Yelp. </p>
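<p>The component's internals are not shown here, so the following is only a hypothetical illustration of what "cross-referencing" results from two services could look like: merge the lists and keep one record per business and rounded coordinate. The function name, the record fields, and the sample results are all assumptions.</p>

```python
def merge_locations(*sources, precision=3):
    """Merge location records from several sources.

    Records for the same business whose coordinates round to the same value
    (about 100 m at precision=3) are treated as one physical store; the first
    record seen wins.
    """
    merged = {}
    for source in sources:
        for loc in source:
            key = (
                loc["business"].lower(),
                round(loc["latitude"], precision),
                round(loc["longitude"], precision),
            )
            merged.setdefault(key, loc)
    return list(merged.values())


# Hypothetical results from two different services for the same store.
service_a = [{"business": "Lenscrafter", "latitude": 37.787946, "longitude": -122.403076}]
service_b = [
    {"business": "lenscrafter", "latitude": 37.7879, "longitude": -122.4031},
    {"business": "lenscrafter", "latitude": 34.076249, "longitude": -118.375763},
]

unique = merge_locations(service_a, service_b)
print(len(unique), "unique locations")
```

<p>Rounding is a blunt tool: two points that straddle a rounding boundary are kept as separate stores, so a distance-based comparison would be more robust for production use.</p>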
<h1>How to Use the Component: </h1>
<p>To demonstrate the power of this component, I will be making a call to the “business_locator” component within a complete Zillabyte app. </p>
<p>This app sources from the input CSV, which is a list of businesses to find. This is the entire Python app:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">zillabyte</span>
<span class="c"># Register the app, source from the input CSV, and call the business_locator component. </span>
<span class="n">app</span> <span class="o">=</span> <span class="n">zillabyte</span><span class="o">.</span><span class="n">app</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"zilla_commerce"</span><span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">app</span><span class="o">.</span><span class="n">source_from_csv</span><span class="p">(</span><span class="s">"businesses.csv"</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">[</span><span class="s">"business"</span><span class="p">])</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">call_component</span><span class="p">(</span><span class="s">"business_locator"</span><span class="p">)</span>
<span class="c">#Finally, sink the data to your output CSV</span>
<span class="n">sink</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">sink</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"business_locations"</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[{</span><span class="s">"business"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},{</span><span class="s">"latitude"</span><span class="p">:</span><span class="s">"float"</span><span class="p">},</span>\
<span class="p">{</span><span class="s">"longitude"</span><span class="p">:</span><span class="s">"float"</span><span class="p">},{</span><span class="s">"full_address"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span>\
<span class="p">{</span><span class="s">"zip_code"</span><span class="p">:</span><span class="s">"integer"</span><span class="p">}])</span></code></pre></div>
<p>It’s as simple as it looks. Populate your businesses.csv file with the businesses you’re interested in, and call the “business_locator” component. Get your data by running “zillabyte pull:data business_locations output” from the command line. Sample of the output CSV from the map of eyewear stores near the top of the post:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">lenscrafter</span><span class="p">,</span><span class="mf">37.787946</span><span class="p">,</span><span class="o">-</span><span class="mf">122.403076</span><span class="p">,</span><span class="mi">685</span> <span class="n">Market</span> <span class="n">St</span><span class="p">,</span><span class="n">CA</span><span class="p">,</span><span class="mi">94105</span>
<span class="n">lenscrafter</span><span class="p">,</span><span class="mf">33.721162</span><span class="p">,</span><span class="o">-</span><span class="mf">117.792653</span><span class="p">,</span><span class="mi">13662</span> <span class="n">Jamboree</span> <span class="n">Rd</span><span class="p">,</span><span class="n">CA</span><span class="p">,</span><span class="mi">92602</span>
<span class="n">lenscrafter</span><span class="p">,</span><span class="mf">37.292652</span><span class="p">,</span><span class="o">-</span><span class="mf">121.989343</span><span class="p">,</span><span class="mi">1620</span> <span class="n">Saratoga</span> <span class="n">Ave</span><span class="p">,</span><span class="n">CA</span><span class="p">,</span><span class="mi">95129</span>
<span class="n">lenscrafter</span><span class="p">,</span><span class="mf">37.836586</span><span class="p">,</span><span class="o">-</span><span class="mf">122.293834</span><span class="p">,</span><span class="mi">5707</span> <span class="n">Christie</span> <span class="n">Ave</span><span class="p">,</span><span class="n">CA</span><span class="p">,</span><span class="mi">94608</span>
<span class="n">lenscrafter</span><span class="p">,</span><span class="mf">34.076249</span><span class="p">,</span><span class="o">-</span><span class="mf">118.375763</span><span class="p">,</span><span class="mi">8471</span> <span class="n">Beverly</span> <span class="n">Blvd</span><span class="p">,</span><span class="n">CA</span><span class="p">,</span><span class="mi">90048</span></code></pre></div>
<h1>Why Bother? Is this actually useful?</h1>
<p>Yes, absolutely! Even though e-commerce happens in nebulous cyberspace, your customers are always physically somewhere in “meatspace.” Brick and mortar stores are limited by physical location, so they use immense resources to ensure their target demographics are within an accessible range of the store. So what can you do today with competitor locations?</p>
<h1>Targeted Local Advertising</h1>
<p>The most intuitive use of competitor location data is for serving your ideal demographics ads using a local advertising service. Facebook ads encourage using local business advertising <a href="https://www.facebook.com/business/a/local-awareness?campaign_id=1449765931962001&placement=emadv" title="Facebook Local Business" target="_blank">near the physical store</a>, allowing an adjustable radius for serving ads. Flip the script, so that customers in your ideal demographics see your ads while in range of your competitors’ stores. Try making the target radius very small, and potentially reach customers while they are at their local department store. This can be especially useful because this subset of the population already has some level of purchasing intent.</p>
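<p>Ad platforms handle the radius targeting themselves, but it can be useful to sanity-check a radius against your competitor data first. The sketch below uses the haversine great-circle formula to test whether a point falls within a given distance of any store; the two store coordinates come from the sample output above, while the function names and test points are assumptions for illustration.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_any_store(lat, lon, stores, radius_km):
    """True if (lat, lon) lies within radius_km of at least one store."""
    return any(
        haversine_km(lat, lon, s_lat, s_lon) <= radius_km
        for s_lat, s_lon in stores
    )


# Two store coordinates taken from the sample output CSV.
STORES = [(37.787946, -122.403076), (37.836586, -122.293834)]

print(near_any_store(37.7880, -122.4030, STORES, radius_km=0.5))  # a point on the same block
print(near_any_store(34.05, -118.25, STORES, radius_km=5))        # a point in Los Angeles
```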
<h1>Offline Community Building</h1>
<p>In the last few years, e-commerce companies have become more proactive about engaging with their customer community offline. For instance, <a href="https://www.combatgent.com" title="Combatant Gentleman" target="_blank">Combatant Gentleman</a> is doing a <a href="https://combatgent.com/nyc" title="Combatant Popup" target="_blank">7 day popup</a> in NYC to help foster the community and culture around its offerings. </p>
<h1>Putting Up a Storefront</h1>
<p>For e-commerce companies exploring the possibility of establishing a brick and mortar presence, using competitor data to determine the most fruitful areas of expansion can be an important consideration. Warby Parker and Nasty Gal have recently established their first physical locations (in NYC and LA, respectively). While doing demographic analysis on over-served locales like NYC, SF, and LA may be more straightforward, competitors’ location data may be much more valuable in Cincinnati or Portland.</p>
<h1>What Are You Waiting For?</h1>
<p>Get started using the “business_locator” component by signing up for <a href="http://www.zillabyte.com" title="Zillabyte" target="_blank">Zillabyte</a> right now!</p>
<p><a href="http://nkarnik.github.io/articles/guerrilla-marketing-local/">Guerrilla Marketing - Target Local Ads Near Your Competitors</a> was originally published by Nikhil Karnik at <a href="http://nkarnik.github.io">Nikhil Karnik</a> on December 12, 2014.</p>
http://nkarnik.github.io/articles/twilio-weather2014-12-09T00:00:00-00:002014-12-09T00:00:00+00:00Nikhil Karnikhttp://nkarnik.github.ionikhil.pankaj.karnik@gmail.com
<p>California is in the middle of a historically extreme drought. While this isn’t news to anyone, San Francisco has seen an unexpectedly large amount of rain in the last couple weeks, causing problems for many residents.</p>
<p>Personally, my building has had major leaking and flooding issues each of the last 3 weeks. After getting fed up over the overall lack of preparation and response, I decided to have some fun with weather data and <a href="http://www.twilio.com" title="twilio" target="_blank">Twilio</a> to “gently inform” my landlord when the rains are coming.</p>
<p><img src="https://s3.amazonaws.com/uploads.hipchat.com/50986/345019/VMKpwNzdwrUCfJO/upload.png" alt="Zillaweather Workflow" /></p>
<h1>Sneak Preview: the Crux of the App</h1>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">zillabyte</span>
<span class="kn">from</span> <span class="nn">zillaweather</span> <span class="kn">import</span> <span class="n">TwilioEach</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">zillabyte</span><span class="o">.</span><span class="n">app</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"zillaweather"</span><span class="p">)</span>
<span class="c">#source from list of world cities</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">app</span><span class="o">.</span><span class="n">source</span><span class="p">(</span><span class="s">"all_cities"</span><span class="p">)</span>
<span class="c">#call component to find current weather condition at each city</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">call_component</span><span class="p">(</span><span class="s">"tomorrow_weather"</span><span class="p">)</span>
<span class="c">#Use current weather and dictionary of cities : numbers to send a SMS with Twilio</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="n">TwilioEach</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"Prepare_Twilio"</span><span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">call_component</span><span class="p">(</span><span class="s">"twilio_text"</span><span class="p">)</span>
<span class="n">stream</span><span class="o">.</span><span class="n">sink</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"weather"</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[{</span><span class="s">"to"</span><span class="p">:</span><span class="s">"string"</span><span class="p">}])</span></code></pre></div>
<p>That’s seriously it. In full disclosure, this is what the <a href="https://gist.github.com/nkarnik/1d40b230370c72830ec0" title="Zillaweather" target="_blank">entire app</a> looks like without modularizing the TwilioEach class. All things considered, maybe 30 lines of code total. How? All because of the power of components.</p>
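<p>The TwilioEach step itself is not shown in the preview. As a rough idea of what such a step might do, here is a plain-Python sketch (no Zillabyte dependency) that expands one weather tuple into one outgoing-message tuple per subscribed phone number. The function name, the subscriber dictionary, the placeholder credentials, and the "only notify on rain" filter are all assumptions; the output field names mirror the inputs of the Twilio component described below.</p>

```python
# Hypothetical mapping of city -> subscribed phone numbers (placeholder numbers).
SUBSCRIBERS = {
    "San Francisco": ["+15550100", "+15550101"],
    "Seattle": ["+15550102"],
}


def prepare_twilio(tup, subscribers, sid, auth, from_number):
    """Expand one weather tuple into one message tuple per subscriber.

    Assumes the incoming tuple carries "city" and "condition" fields and that
    only rainy forecasts should trigger a text.
    """
    if "rain" not in tup["condition"].lower():
        return []
    return [
        {
            "twilio_sid": sid,
            "twilio_auth": auth,
            "from": from_number,
            "to": number,
            "city": tup["city"],
            "condition": tup["condition"],
        }
        for number in subscribers.get(tup["city"], [])
    ]


messages = prepare_twilio(
    {"city": "San Francisco", "condition": "Heavy Rain"},
    SUBSCRIBERS,
    sid="ACxxxxxxxx",      # placeholder credentials
    auth="xxxxxxxx",
    from_number="+15550199",
)
for message in messages:
    print(message["to"], message["condition"])
```

<p>Inside a Zillabyte "each", every returned tuple would be emitted back to the stream instead of collected in a list.</p>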
<h1>Sending Twilio messages with Zillabyte</h1>
<p>If the 3 day weather forecast calls for a large amount of rain, I can probably just check an app and let my landlord know in advance. But, hold on… Zillabyte is a service for distributed computing… so what’s the fun of being limited to weather data from one city? I want to help tenants notify their landlord all across America (even the whole world!) about upcoming local weather conditions. I am better served using Zillabyte to query and process data from almost every weather station in America and an automated service like Twilio to serve the notifications. </p>
<p>First, I wrapped all of my Twilio REST API calls inside a Zillabyte component. After registering this component, I can call it at will. The component simply sends a message from a Twilio number about tomorrow’s expected weather conditions in a city to a phone number subscribed to that city.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">zillabyte</span>
<span class="kn">from</span> <span class="nn">twilio.rest</span> <span class="kn">import</span> <span class="n">TwilioRestClient</span>
<span class="k">def</span> <span class="nf">send_text</span><span class="p">(</span><span class="n">controller</span><span class="p">,</span> <span class="n">tup</span><span class="p">):</span>
    <span class="c"># Register twilio client</span>
    <span class="n">twilio_client</span> <span class="o">=</span> <span class="n">TwilioRestClient</span><span class="p">(</span><span class="n">tup</span><span class="p">[</span><span class="s">"twilio_sid"</span><span class="p">],</span> <span class="n">tup</span><span class="p">[</span><span class="s">"twilio_auth"</span><span class="p">])</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c"># Send message to recipient and wait 1 second (twilio SMS rate limit)</span>
        <span class="n">text_body</span> <span class="o">=</span> <span class="s">"It is going to be "</span> <span class="o">+</span> <span class="n">tup</span><span class="p">[</span><span class="s">'condition'</span><span class="p">]</span> <span class="o">+</span> <span class="s">" in "</span> <span class="o">+</span> <span class="n">tup</span><span class="p">[</span><span class="s">'city'</span><span class="p">]</span> <span class="o">+</span> <span class="s">" tomorrow!"</span>
        <span class="c"># "from" is a reserved word in Python, so the keyword argument is from_</span>
        <span class="n">twilio_client</span><span class="o">.</span><span class="n">messages</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">to</span><span class="o">=</span><span class="n">tup</span><span class="p">[</span><span class="s">"to"</span><span class="p">],</span> <span class="n">from_</span><span class="o">=</span><span class="n">tup</span><span class="p">[</span><span class="s">"from"</span><span class="p">],</span> <span class="n">body</span><span class="o">=</span><span class="n">text_body</span><span class="p">)</span>
        <span class="c"># Emit the recipient back to the stream...</span>
        <span class="n">controller</span><span class="o">.</span><span class="n">emit</span><span class="p">({</span><span class="s">"to"</span> <span class="p">:</span> <span class="n">tup</span><span class="p">[</span><span class="s">"to"</span><span class="p">]})</span>
        <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
        <span class="c"># Skip tuples whose message could not be sent</span>
        <span class="k">pass</span>
<span class="n">component</span> <span class="o">=</span> <span class="n">zillabyte</span><span class="o">.</span><span class="n">component</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"twilio_weather_text"</span><span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">component</span><span class="o">.</span><span class="n">inputs</span><span class="p">(</span>
    <span class="n">name</span> <span class="o">=</span> <span class="s">"input_stream"</span><span class="p">,</span>\
    <span class="n">fields</span> <span class="o">=</span> <span class="p">[{</span><span class="s">"twilio_sid"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"twilio_auth"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">},</span>\
    <span class="p">{</span><span class="s">"from"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"to"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"city"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">},</span>\
    <span class="p">{</span><span class="s">"condition"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">}]</span>\
<span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="n">send_text</span><span class="p">)</span>
<span class="n">stream</span><span class="o">.</span><span class="n">outputs</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"output_stream"</span><span class="p">,</span> <span class="n">fields</span><span class="o">=</span><span class="p">[{</span><span class="s">"to"</span> <span class="p">:</span> <span class="s">"string"</span><span class="p">}])</span></code></pre></div>
<p>Fork <a href="https://gist.github.com/nkarnik/311d3a9bd851186ee4b3" title="Twilio Component" target="_blank">this component</a> on github.</p>
<h1>The Entire App</h1>
<p>After registering the component, the <a href="https://gist.github.com/nkarnik/1d40b230370c72830ec0" title="Zillaweather" target="_blank">remainder of the Zillabyte app</a> is straightforward to write. The app uses a pre-registered component called "tomorrow_weather", which will determine the expected forecast for tomorrow for a given city. It then sends rain notifications to phone numbers registered by city via the Twilio "send_text" component.</p>
<h1>Why Mix Data and Notifications?</h1>
<p>The primary utility and accessibility of "big data" stems from the ability of data systems to provide meaningful insights to humans. Certainly, many times large-scale data systems are built to feed some dynamic model on the back-end (such as an Ad model on a social network service), but under certain conditions, notifying humans is necessary. For example, if your Ad model is performing particularly poorly (users aren't clicking because they aren't being served relevant ads), your data science team might want to receive a text notification in case they're out to lunch.</p>
<p><a href="http://nkarnik.github.io/articles/twilio-weather/">How to Build a Global Weather Alert System in Under an Hour</a> was originally published by Nikhil Karnik at <a href="http://nkarnik.github.io">Nikhil Karnik</a> on December 09, 2014.</p>
http://nkarnik.github.io/articles/mapping-tech-bubbles2014-11-23T00:00:00-00:002014-11-23T00:00:00+00:00Nikhil Karnikhttp://nkarnik.github.ionikhil.pankaj.karnik@gmail.com
<p><em>(Note: I originally published this at the <a href="http://blog.zillabyte.com/2014/11/23/mapping-tech-bubbles/">Zillabyte blog</a>)</em></p>
<p>A time lapse of venture funding rounds (by date announced) for a sample of 1000 companies in SF:</p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/f4353cee-73a6-11e4-aabd-0e9d821ea90d/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<p>…I wonder why rent in SOMA is so expensive.</p>
<h1>Mapping Lots of Bubbles</h1>
<p>I have very few opinions about whether tech companies have been chronically overvalued by institutional venture capitalists. However, it is hard to ignore that quite a bit of venture capital has been poured into a wide (though eventually concentrated) array of companies that did not exist 1, 5, or 10 years ago. Many of these companies have sprouted up in the densest parts of America’s urban centers. To that end, I am analyzing how the most recent tech boom has impacted America’s cities. </p>
<p>I collected data about venture-backed companies using the <a href="http://www.crunchbase.com" title="Crunchbase" target="_blank">Crunchbase</a> API, which allows for a limited number of requests. For my purposes, the relevant information for each company included: company name, total funding raised, street address, city/state, and (for later use) API ids for all funding rounds. My goals were three-fold: 1) Map each company within a city with bubble size corresponding to the total funding raised. 2) Map the “acceleration” of venture funding with a time-lapse map. 3) Maybe compare trends between cities.</p>
<p>In these maps, bubble sizes correspond to the total amount of funding raised. Again, this is a semi-random sample of about 1000 companies, so please don’t feel bad if your/your favorite startup isn’t on this map.</p>
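<p>The maps themselves were styled in CartoDB, so the snippet below is not how they were produced. It is only a sketch of the underlying idea: total up each company's funding rounds, then scale the bubble radius with the square root of the total so that bubble <em>area</em> (rather than radius) is proportional to funding. The area-proportional choice, the company names, and the amounts are assumptions for illustration.</p>

```python
import math
from collections import defaultdict

# Hypothetical funding rounds as (company, amount in USD).
ROUNDS = [
    ("Alpha", 1000000),
    ("Alpha", 3000000),
    ("Beta", 16000000),
    ("Gamma", 1000000),
]


def total_funding(rounds):
    """Sum funding rounds per company."""
    totals = defaultdict(int)
    for company, amount in rounds:
        totals[company] += amount
    return dict(totals)


def bubble_radius(funding, max_funding, max_radius=40.0):
    """Radius in pixels such that bubble area is proportional to funding."""
    if funding <= 0 or max_funding <= 0:
        return 0.0
    return max_radius * math.sqrt(funding / max_funding)


totals = total_funding(ROUNDS)
largest = max(totals.values())
for company, funding in sorted(totals.items()):
    print(company, funding, round(bubble_radius(funding, largest), 1))
```

<p>Scaling the radius linearly instead would make well-funded companies look far larger than their share of the capital, since area grows with the square of the radius.</p>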
<p>All of the maps in this post were rendered with the use of <a href="http://www.cartodb.com" title="CartoDB" target="_blank">CartoDB</a>, an excellent piece of map visualization software. I highly recommend it for your next project.</p>
<p>A static (stylized) map of startup locations by city block in San Francisco:</p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/82f0f08a-73a8-11e4-ae1c-0e4fddd5de28/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/d29ab428-73a7-11e4-8c2e-0e018d66dc29/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<h2>Building the App</h2>
<p>Using <a href="http://docs.zillabyte.com" title="Zillabyte Documentation" target="_blank">Zillabyte</a>, I was able to streamline my data processing. I started with a stream of company names from each city, and tried to make valid Crunchbase API requests based on how company names are represented in the Crunchbase database.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">zillabyte</span>
<span class="kn">import</span> <span class="nn">crunch</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="k">def</span> <span class="nf">prep</span><span class="p">(</span><span class="n">controller</span><span class="p">):</span>
    <span class="k">return</span>
<span class="k">def</span> <span class="nf">nt</span><span class="p">(</span><span class="n">controller</span><span class="p">):</span>
    <span class="n">infile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"sfvalid.csv"</span><span class="p">,</span> <span class="s">"rt"</span><span class="p">)</span>
    <span class="n">company_list</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">infile</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
            <span class="n">controller</span><span class="o">.</span><span class="n">emit</span><span class="p">({</span><span class="s">"company"</span> <span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]})</span>
            <span class="n">c</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">if</span> <span class="n">c</span> <span class="o">></span> <span class="mi">5</span><span class="p">:</span>
                <span class="k">print</span> <span class="n">c</span>
                <span class="n">controller</span><span class="o">.</span><span class="n">end_cycle</span><span class="p">()</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">print</span> <span class="s">"Error"</span>
    <span class="k">if</span> <span class="n">c</span> <span class="o">></span> <span class="mi">20</span><span class="p">:</span>
        <span class="k">print</span> <span class="n">c</span>
        <span class="n">controller</span><span class="o">.</span><span class="n">end_cycle</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="n">controller</span><span class="p">,</span> <span class="n">tup</span><span class="p">):</span>
    <span class="n">company</span> <span class="o">=</span> <span class="n">tup</span><span class="p">[</span><span class="s">"company"</span><span class="p">]</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">crunch</span><span class="o">.</span><span class="n">getOrgRow</span><span class="p">(</span><span class="n">company</span><span class="p">)</span>
    <span class="c">#tup = (org, funding, address, lat, lon, funds)</span>
    <span class="k">print</span> <span class="n">data</span>
    <span class="n">controller</span><span class="o">.</span><span class="n">emit</span><span class="p">({</span><span class="s">"company"</span><span class="p">:</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="s">"funding"</span><span class="p">:</span><span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="s">"address"</span><span class="p">:</span><span class="n">data</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span><span class="s">"lat"</span><span class="p">:</span><span class="n">data</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span><span class="s">"lon"</span><span class="p">:</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span><span class="s">"fund_paths"</span><span class="p">:</span> <span class="n">data</span><span class="p">[</span><span class="mi">5</span><span class="p">]})</span>
    <span class="k">return</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">zillabyte</span><span class="o">.</span><span class="n">app</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"startups"</span><span class="p">)</span>
<span class="n">app</span><span class="o">.</span><span class="n">source</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"startuplist"</span><span class="p">,</span> <span class="n">next_tuple</span> <span class="o">=</span> <span class="n">nt</span><span class="p">,</span> <span class="n">end_cycle_policy</span><span class="o">=</span><span class="s">"explicit"</span><span class="p">)</span>\
<span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="n">execute</span> <span class="o">=</span> <span class="n">execute</span><span class="p">)</span>\
<span class="o">.</span><span class="n">sink</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"startups"</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[{</span><span class="s">"company"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},{</span><span class="s">"funding"</span><span class="p">:</span><span class="s">"integer"</span><span class="p">},{</span><span class="s">"address"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},{</span><span class="s">"lat"</span><span class="p">:</span><span class="s">"float"</span><span class="p">},{</span><span class="s">"lon"</span><span class="p">:</span><span class="s">"float"</span><span class="p">},{</span><span class="s">"fund_paths"</span><span class="p">:</span><span class="s">"array"</span><span class="p">}])</span></code></pre></div>
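<p>As a rough illustration of turning a company name into something the API might accept, here is a minimal Python 3 sketch. It assumes that Crunchbase permalinks are lowercase and hyphen-separated, which is a guess on my part rather than a documented rule; <code>to_permalink</code> is a hypothetical helper and is not part of the app above.</p>

```python
import re


def to_permalink(company_name):
    """Guess a permalink: lowercase, hyphen-separated (assumed format)."""
    slug = company_name.strip().lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Drop hyphens left over at either end (e.g. from a trailing period).
    return slug.strip("-")


for name in ["Zillabyte", "Rap Genius", "Stripe, Inc."]:
    print(name, "->", to_permalink(name))
```

<p>Names that do not follow the assumed pattern would still have to be looked up by hand or skipped.</p>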
<p>You may have noticed that I have a custom crunch module that makes my API requests. The relevant contents follow:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="n">key</span> <span class="o">=</span> <span class="s">"YOURKEYHERE"</span>
<span class="k">def</span> <span class="nf">getTotalFunding</span><span class="p">(</span><span class="n">orgjson</span><span class="p">):</span>
    <span class="n">funding</span> <span class="o">=</span> <span class="n">orgjson</span><span class="p">[</span><span class="s">"data"</span><span class="p">][</span><span class="s">"properties"</span><span class="p">][</span><span class="s">"total_funding_usd"</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">funding</span>
<span class="k">def</span> <span class="nf">getLatLong</span><span class="p">(</span><span class="n">street</span><span class="p">):</span>
    <span class="n">street</span> <span class="o">=</span> <span class="n">street</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">" "</span><span class="p">,</span> <span class="s">"+"</span><span class="p">)</span>
    <span class="n">request_url</span> <span class="o">=</span> <span class="s">"http://www.datasciencetoolkit.org/street2coordinates/"</span> <span class="o">+</span> <span class="n">street</span>
    <span class="c"># ALSO Need city, state at the end</span>
    <span class="n">jreq</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">request_url</span><span class="p">)</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
    <span class="n">lat</span> <span class="o">=</span> <span class="n">jreq</span><span class="p">[</span><span class="n">jreq</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="s">"latitude"</span><span class="p">]</span>
    <span class="n">lon</span> <span class="o">=</span> <span class="n">jreq</span><span class="p">[</span><span class="n">jreq</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="s">"longitude"</span><span class="p">]</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">lat</span><span class="p">,</span> <span class="n">lon</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">getOrganization</span><span class="p">(</span><span class="n">org</span><span class="p">):</span>
    <span class="n">request_url</span> <span class="o">=</span> <span class="s">"http://api.crunchbase.com/v/2/organization/"</span> <span class="o">+</span> <span class="n">org</span> <span class="o">+</span> <span class="s">"?user_key="</span> <span class="o">+</span> <span class="n">key</span>
    <span class="n">jreq</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">request_url</span><span class="p">)</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">jreq</span>
<span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="n">orgjson</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">orgjson</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span> <span class="o"><</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">print</span> <span class="s">"invalid"</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">return</span> <span class="bp">True</span>
<span class="k">def</span> <span class="nf">getLocation</span><span class="p">(</span><span class="n">orgjson</span><span class="p">):</span>
    <span class="n">location</span> <span class="o">=</span> <span class="n">orgjson</span><span class="p">[</span><span class="s">"data"</span><span class="p">][</span><span class="s">"relationships"</span><span class="p">][</span><span class="s">"headquarters"</span><span class="p">][</span><span class="s">"items"</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">latitude</span> <span class="o">=</span> <span class="n">location</span><span class="p">[</span><span class="s">"latitude"</span><span class="p">]</span>
    <span class="n">longitude</span> <span class="o">=</span> <span class="n">location</span><span class="p">[</span><span class="s">"longitude"</span><span class="p">]</span>
    <span class="n">city</span> <span class="o">=</span> <span class="n">location</span><span class="p">[</span><span class="s">"city"</span><span class="p">]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">location</span><span class="p">[</span><span class="s">"street_1"</span><span class="p">]</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="n">location</span><span class="p">[</span><span class="s">"street_2"</span><span class="p">]</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">elif</span> <span class="p">(</span><span class="n">location</span><span class="p">[</span><span class="s">"street_1"</span><span class="p">]</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">):</span>
        <span class="n">street</span> <span class="o">=</span> <span class="n">location</span><span class="p">[</span><span class="s">"street_2"</span><span class="p">]</span>
    <span class="k">elif</span> <span class="p">(</span><span class="n">location</span><span class="p">[</span><span class="s">"street_2"</span><span class="p">]</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">):</span>
        <span class="n">street</span> <span class="o">=</span> <span class="n">location</span><span class="p">[</span><span class="s">"street_1"</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">street</span> <span class="o">=</span> <span class="n">location</span><span class="p">[</span><span class="s">"street_1"</span><span class="p">]</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">location</span><span class="p">[</span><span class="s">"street_2"</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">street</span>
<span class="k">def</span> <span class="nf">getFundingRounds</span><span class="p">(</span><span class="n">orgjson</span><span class="p">):</span>
    <span class="n">fundingRounds</span> <span class="o">=</span> <span class="n">orgjson</span><span class="p">[</span><span class="s">"data"</span><span class="p">][</span><span class="s">"relationships"</span><span class="p">][</span><span class="s">"funding_rounds"</span><span class="p">][</span><span class="s">"items"</span><span class="p">]</span>
    <span class="n">rounds</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">frs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">fundingRound</span> <span class="ow">in</span> <span class="n">fundingRounds</span><span class="p">:</span>
        <span class="n">rounds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">fundingRound</span><span class="p">[</span><span class="s">"path"</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">rounds</span>
<span class="k">def</span> <span class="nf">getOrgRow</span><span class="p">(</span><span class="n">org</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">jorg</span> <span class="o">=</span> <span class="n">getOrganization</span><span class="p">(</span><span class="n">org</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">validate</span><span class="p">(</span><span class="n">jorg</span><span class="p">):</span>
            <span class="n">funding</span> <span class="o">=</span> <span class="n">getTotalFunding</span><span class="p">(</span><span class="n">jorg</span><span class="p">)</span>
            <span class="n">address</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">getLocation</span><span class="p">(</span><span class="n">jorg</span><span class="p">))</span>
            <span class="n">lat</span><span class="p">,</span> <span class="n">lon</span> <span class="o">=</span> <span class="n">getLatLong</span><span class="p">(</span><span class="n">address</span><span class="p">)</span>
            <span class="n">rounds</span> <span class="o">=</span> <span class="n">getFundingRounds</span><span class="p">(</span><span class="n">jorg</span><span class="p">)</span>
            <span class="n">funds</span> <span class="o">=</span> <span class="p">{</span><span class="s">"rounds"</span><span class="p">:</span> <span class="p">[]}</span>
            <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rounds</span><span class="p">:</span>
                <span class="n">funds</span><span class="p">[</span><span class="s">"rounds"</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">r</span><span class="p">))</span>
            <span class="n">tup</span> <span class="o">=</span> <span class="p">(</span><span class="n">org</span><span class="p">,</span> <span class="n">funding</span><span class="p">,</span> <span class="n">address</span><span class="p">,</span> <span class="n">lat</span><span class="p">,</span> <span class="n">lon</span><span class="p">,</span> <span class="n">funds</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">tup</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">print</span> <span class="s">"error"</span></code></pre></div>
<p>To geotag each startup with a latitude and longitude from its street address (plus the city and state), I made an external API request to datasciencetoolkit’s Georeferencer, which can be called as follows: </p>
<p><code>request_url = "http://www.datasciencetoolkit.org/street2coordinates/" + street + "%2c+San+Francisco%2c+CA"</code></p>
<p>Making an HTTP request to this URL (with your street embedded) should return a fairly accurate latitude/longitude pair for each company. This isn’t a perfect georeferencing tool; in fact, some of the coordinates are mislabeled (e.g. no startup is actually located on Treasure Island).</p>
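<p>Here is a small Python 3 sketch of that request-building step, kept offline so it can be run without hitting the service. It assumes the response is a JSON object keyed by the input address, which matches how the <code>crunch</code> module reads it, and that a failed lookup comes back as null, which is my assumption. The helper names, the sample address, and the sample coordinates are placeholders, not real results.</p>

```python
from urllib.parse import quote_plus

DSTK_URL = "http://www.datasciencetoolkit.org/street2coordinates/"


def build_geocode_url(street, city="San Francisco", state="CA"):
    """Append city and state to the street, then URL-encode the address."""
    address = "{}, {}, {}".format(street, city, state)
    # quote_plus turns spaces into '+' and commas into '%2C'.
    return DSTK_URL + quote_plus(address)


def extract_lat_lon(response):
    """Return (lat, lon) from a response keyed by address, else None."""
    for match in response.values():
        if match:  # assumption: a failed lookup is returned as null
            return match["latitude"], match["longitude"]
    return None


url = build_geocode_url("2 Marina Blvd")
# Placeholder response with made-up coordinates, shaped like the real one.
sample = {"2 Marina Blvd, San Francisco, CA": {"latitude": 37.80, "longitude": -122.43}}
print(url)
print(extract_lat_lon(sample))
```

<p>Returning <code>None</code> for a failed lookup lets the caller drop the company instead of plotting it at a bogus location.</p>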
<h2>Other Cities</h2>
<p>So my stretch goal for this exercise was to analyze other cities’ startup trajectories. The world doesn’t revolve around SF, of course. Palo Alto is an interesting case, because there are a ton of startups there, but the area is significantly more suburban:</p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/d0033926-7396-11e4-bcac-0e4fddd5de28/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<p>In Palo Alto, most of the startup activity is concentrated around the two Caltrain stops: the downtown Palo Alto station and the station at California Avenue. Access to transportation is a critical factor in where people and businesses concentrate, so this should not be a huge surprise. </p>
<p><b>NYC</b>:</p>
<p>Time lapse of funding rounds in NYC (limited to Manhattan). Interesting to note that while funding definitely picks up in pace after 2009, the overall growth in startups appears more steady than the relative explosion in San Francisco. Of course, this is all very hand-wavy, and as an engineer/analyst I should really quantify this better:</p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/cce44e00-73ab-11e4-b546-0e018d66dc29/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<p><b>Austin</b>:</p>
<p>Static sample map of Austin’s startup scene. I have never set foot in Texas and know nothing about Austin, but it is interesting to see businesses spaced farther apart in a city.</p>
<iframe width="100%" height="520" frameborder="0" src="http://zillabyte.cartodb.com/viz/ed5f571c-73ae-11e4-8c2e-0e018d66dc29/embed_map" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe>
<h2>Takeaways</h2>
<p>After spending a few hours reading about the history of American cities, I can confirm that I am woefully unqualified to make most sweeping claims about how different cities’ startup communities have evolved over the last 12 years. Cities across America (and the world) have very different histories, and the next step should be to quantify urban movement (population growth) and development relative to the rate of venture funding/startup growth in each city. This way, we can build off this work to analyze the following: How does the acceleration of startup growth compare across cities? How much are changing demographics in large urban areas correlated with the latest tech boom? Which cities are most primed to be the “up and coming” startup hubs?</p>
<p><a href="http://nkarnik.github.io/articles/mapping-tech-bubbles/">Mapping Tech (with) Bubbles</a> was originally published by Nikhil Karnik at <a href="http://nkarnik.github.io">Nikhil Karnik</a> on November 23, 2014.</p>
http://nkarnik.github.io/articles/lil-wayne-pagerank-12013-05-31T00:00:00-00:002014-10-28T00:00:00+00:00Nikhil Karnikhttp://nkarnik.github.ionikhil.pankaj.karnik@gmail.com
<p><em>Note: I originally published this on the <a href="http://blog.zillabyte.com/2014/10/28/lil-waynes-pagerank-part-1/">zillablog</a>.</em></p>
<p class="pull-left"><img src="http://nkarnik.github.io/images/raekwon_only.jpg" alt="Raekwon Image" /></p>
<p>Kanye West and Jay-Z, Dr. Dre and Snoop Dogg, Raekwon and Ghostface Killah, Lil Wayne and Drake: some rappers are best known for their featured appearances and partnerships. After countless debates over the greatest rap duo of all time (still Raekwon and Ghostface, for the record), I was curious about how easily I could quantify the strength of rap affiliations. I also wanted to find out how interconnected the network of rap collaborations really was.</p>
<p>Who collaborates the most with other artists? Which pairs of artists have the most collaborations? And, most importantly, if we modeled the entire network of hip hop artists as an undirected graph, how would Lil Wayne’s PageRank stack up against Kanye West’s or RZA’s? These are fascinating questions. Somebody, somewhere woke up today wondering all of these things.</p>
<h2 id="modeling-the-data">Modeling the Data</h2>
<p>In the “Hip Hop Graph,” artists (producers are considered artists) are represented as nodes, and edges between nodes represent collaborations between artists. It is straightforward to weight the edge between two artists by their number of collaborations in the corpus of songs. For distance-based calculations (e.g. shortest paths and minimum spanning trees), a common technique is to use the inverse of that count (1 / # collaborations), so that frequent collaborators end up close together. The graphs displayed consist of the connections among the 130 most prolific artists.</p>
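<p>To make the inverse weighting concrete, here is a minimal Python 3 sketch that converts collaboration counts to 1 / count weights and then builds a minimum spanning tree with Kruskal’s algorithm. It is a pure-Python stand-in for illustration, not the networkx call used for the images in this post, and the three collaboration counts are made up.</p>

```python
def inverse_weights(collab_counts):
    """Map each (artist_a, artist_b) pair to 1 / number of collaborations."""
    return {pair: 1.0 / count for pair, count in collab_counts.items()}


def minimum_spanning_tree(weights):
    """Kruskal's algorithm with a small union-find; returns chosen edges."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    # Visit edges from lightest (most collaborations) to heaviest.
    for (a, b), weight in sorted(weights.items(), key=lambda item: item[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Made-up counts, for illustration only.
counts = {
    ("Raekwon", "Ghostface Killah"): 10,
    ("Raekwon", "RZA"): 5,
    ("Ghostface Killah", "RZA"): 4,
}
mst = minimum_spanning_tree(inverse_weights(counts))
print(mst)
```

<p>Because the tree keeps the lightest edges, the pairs that survive are the ones with the most collaborations.</p>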
<p>The final result (one layout of the minimum spanning tree) using networkx (the python package) and Gephi:</p>
<p><img src="http://nkarnik.github.io/images/rapper_network3-2.png" alt="Network 1" /></p>
<p>Another image of the minimum spanning tree, zoomed in. The size and color of the nodes are correlated with the degree of the node in the MST.</p>
<p><img src="http://nkarnik.github.io/images/mstgraphcenter2.png" alt="Network 2" /></p>
<p>I chose to model the Hip Hop Graph as an undirected graph, in which one collaboration consists of any song in which both artists are featured (either as primary artist, featured artist, or producer). For example, the song
“Winter Warz” by Ghostface Killah, featuring Cappadonna, Masta Killa, Raekwon, and U-God, produced by RZA has {6 C 2} = 15 pairwise collaborations.</p>
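<p>The pair counting for a single song can be sketched in a few lines of Python 3. The helper name <code>collaboration_pairs</code> is hypothetical; the credits are the ones listed for “Winter Warz” above, and six credited artists give C(6, 2) = 15 unordered pairs.</p>

```python
from itertools import combinations


def collaboration_pairs(primary, featured, producers):
    """Return every unordered pair of distinct artists credited on a song."""
    # A set removes duplicates, e.g. an artist who also produced the track.
    artists = sorted(set([primary] + featured + producers))
    return list(combinations(artists, 2))


pairs = collaboration_pairs(
    "Ghostface Killah",
    ["Cappadonna", "Masta Killa", "Raekwon", "U-God"],
    ["RZA"],
)
print(len(pairs))  # 15
```

<p>Each pair would then add one to the collaboration count on the edge between those two artists.</p>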
<p>For any song, there are three categories of artists. Each song has a “primary artist” to whom the track is attributed. Additionally, every song may have multiple featured artists and one or more producers. To generate the graph, I needed to produce a dataset in the following format: </p>
<p>{<strong>String</strong> song, <strong>String</strong> artist, <strong>Array</strong> featuredArtists, <strong>Array</strong> producers }</p>
<p>Finding easily accessible music metadata that conforms to this format of attribution is not an easy task. However, certain websites such as <a href="http://rap.genius.com" target="_blank">Rap Genius</a> maintain metadata on each song in this format. So the song corpus being analyzed is the set of songs registered for artists on Rap Genius.</p>
<p>I wrote a quick and dirty scraper in Python for Rap Genius’ HTML using BeautifulSoup that served as my API. The first function returns the URLs for all the songs attributed to a particular primary artist:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">urlparse</span> <span class="kn">import</span> <span class="n">parse_qs</span><span class="p">,</span> <span class="n">urlsplit</span>
<span class="n">RAPGENIUS_URL</span> <span class="o">=</span> <span class="s">'http://rap.genius.com'</span>
<span class="n">RAPGENIUS_SEARCH_URL</span> <span class="o">=</span> <span class="s">'http://genius.com/search'</span>
<span class="n">RAPGENIUS_ARTIST_URL</span> <span class="o">=</span> <span class="s">'http://genius.com/artists'</span>
<span class="k">def</span> <span class="nf">getArtistSongs</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
    <span class="n">songs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s">'ul'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'song_list primary_list '</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">songs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">))</span>
            <span class="k">continue</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">print</span> <span class="s">"couldnt find songs"</span>
    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">soup</span><span class="p">(</span><span class="s">'div'</span><span class="p">,</span> <span class="p">{</span><span class="s">'class'</span><span class="p">:</span><span class="s">'pagination'</span><span class="p">}):</span>
        <span class="n">links</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">last</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">links</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">last</span><span class="p">):</span>
                <span class="n">i</span> <span class="o">+=</span> <span class="mi">4</span>
                <span class="n">upart</span> <span class="o">=</span> <span class="n">links</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">)</span>
                <span class="n">nurl</span> <span class="o">=</span> <span class="n">RAPGENIUS_URL</span> <span class="o">+</span> <span class="n">upart</span>
                <span class="k">print</span> <span class="n">nurl</span>
                <span class="n">scheme</span><span class="p">,</span> <span class="n">netloc</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">query_string</span><span class="p">,</span> <span class="n">fragment</span> <span class="o">=</span> <span class="n">urlsplit</span><span class="p">(</span><span class="n">nurl</span><span class="p">)</span>
                <span class="n">query_params</span> <span class="o">=</span> <span class="n">parse_qs</span><span class="p">(</span><span class="n">query_string</span><span class="p">)</span>
                <span class="n">query_params</span><span class="p">[</span><span class="s">'page'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)]</span>
                <span class="n">r</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">RAPGENIUS_URL</span> <span class="o">+</span> <span class="n">path</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">query_params</span><span class="p">)</span>
                <span class="n">r</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">text</span>
                <span class="n">page</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
                <span class="k">for</span> <span class="n">pageRow</span> <span class="ow">in</span> <span class="n">page</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'ul'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'song_list primary_list '</span><span class="p">):</span>
                    <span class="k">if</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">pageRow</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'span'</span><span class="p">))</span><span class="o">!=</span><span class="nb">int</span><span class="p">):</span>
                        <span class="n">songs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">pageRow</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">))</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">print</span> <span class="s">"error somewhere"</span>
    <span class="k">return</span> <span class="n">songs</span></code></pre></div>
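<p>The page-by-page URL rewriting inside the scraper can be isolated into a small helper. This is a Python 3 sketch (where <code>urlparse</code> has become <code>urllib.parse</code>), not the scraper’s actual code; <code>page_url</code> is a hypothetical name and the example URL is a placeholder rather than a real Rap Genius address.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_url(url, page):
    """Return url with its 'page' query parameter set to the given page."""
    scheme, netloc, path, query_string, fragment = urlsplit(url)
    query_params = parse_qs(query_string)
    query_params["page"] = [str(page)]  # overwrite or add the page number
    # doseq=True expands the list values that parse_qs produces.
    new_query = urlencode(query_params, doseq=True)
    return urlunsplit((scheme, netloc, path, new_query, fragment))


# Placeholder URL, used only to show the rewriting.
base = "http://example.com/songs?sort=title&page=2"
print([page_url(base, n) for n in range(1, 4)])
```

<p>Keeping this step separate makes it easy to check the generated URLs before sending any requests.</p>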
<p>The second function extracts the primary artist, featured artists, and producers for any given song URL:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">setSong</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
    <span class="n">featured</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">producers</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">aName</span> <span class="o">=</span> <span class="s">''</span>
    <span class="c">#Extract primary artist name</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">art</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span> <span class="s">'text_artist'</span><span class="p">)</span>
        <span class="n">aName</span> <span class="o">=</span> <span class="n">art</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">text</span>
        <span class="c">#print aName</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>
    <span class="c">#Extract featured artist names</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">features</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'featured_artists'</span><span class="p">)</span><span class="o">.</span><span class="n">contents</span>
        <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">featured</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
            <span class="k">except</span><span class="p">:</span>
                <span class="k">pass</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>
    <span class="c">#Extract producers' names</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">features</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">'producer_artists'</span><span class="p">)</span><span class="o">.</span><span class="n">contents</span>
        <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">producers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
            <span class="k">except</span><span class="p">:</span>
                <span class="k">pass</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">aName</span><span class="p">,</span> <span class="n">featured</span><span class="p">,</span> <span class="n">producers</span><span class="p">)</span></code></pre></div>
<h2 id="using-zillabyte-to-gather-data">Using Zillabyte to Gather Data</h2>
<p>To collect metadata for my entire list of songs, I had to pipe these two functions together: the first produces the song URLs, and the second extracts the metadata from each one. This required a large number of HTTP requests to Rap Genius. <a href="http://docs.zillabyte.com" target="_blank">Zillabyte</a> allows easy distributed crawling, using pipe programming to modify streams of tuples. <a href="http://docs.zillabyte.com/index_tech/?ruby#hello-goodbye-world-take-one" target="_blank">Here</a> are some basic Zillabyte apps that should get you familiar with pipe programming and stream processing. Using Zillabyte, I could fetch metadata for about 30,000 songs in 45 minutes with a parallelism of 8.</p>
<p>I seeded my app with a CSV of hip hop artists I was interested in; these are the first 10 rows:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">Eminem</span>
<span class="n">Outkast</span>
<span class="n">JayZ</span>
<span class="n">WuTangClan</span>
<span class="n">UGK</span>
<span class="n">Drake</span>
<span class="n">Raekwon</span>
<span class="n">MethodMan</span>
<span class="mi">2</span><span class="n">Pac</span>
<span class="n">KanyeWest</span></code></pre></div>
<p>Here is the code for the Zillabyte app:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">zillabyte</span>
<span class="c">#rapgenius.py is the module that houses the two functions above</span>
<span class="kn">from</span> <span class="nn">rapgenius</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="k">def</span> <span class="nf">prep</span><span class="p">(</span><span class="n">controller</span><span class="p">):</span>
    <span class="k">return</span>
<span class="k">def</span> <span class="nf">getsongs</span><span class="p">(</span><span class="n">controller</span><span class="p">,</span> <span class="n">tup</span><span class="p">):</span>
    <span class="n">artist</span> <span class="o">=</span> <span class="n">tup</span><span class="p">[</span><span class="s">"artist"</span><span class="p">]</span>
    <span class="n">artistUrl</span> <span class="o">=</span> <span class="s">"http://genius.com/artists/"</span> <span class="o">+</span> <span class="n">artist</span>
    <span class="n">songs</span> <span class="o">=</span> <span class="n">getMostArtistSongs</span><span class="p">(</span><span class="n">artistUrl</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">song</span> <span class="ow">in</span> <span class="n">songs</span><span class="p">:</span>
        <span class="k">print</span> <span class="n">song</span>
        <span class="n">controller</span><span class="o">.</span><span class="n">emit</span><span class="p">({</span><span class="s">"song"</span> <span class="p">:</span> <span class="n">song</span><span class="p">,</span> <span class="s">"artist"</span> <span class="p">:</span> <span class="n">artist</span><span class="p">})</span>
    <span class="k">return</span>
<span class="k">def</span> <span class="nf">buildGraph</span><span class="p">(</span><span class="n">controller</span><span class="p">,</span> <span class="n">tup</span><span class="p">):</span>
    <span class="n">song</span> <span class="o">=</span> <span class="n">tup</span><span class="p">[</span><span class="s">"song"</span><span class="p">]</span>
    <span class="n">artist</span> <span class="o">=</span> <span class="n">tup</span><span class="p">[</span><span class="s">"artist"</span><span class="p">]</span>
    <span class="n">songData</span> <span class="o">=</span> <span class="n">setSong</span><span class="p">(</span><span class="n">song</span><span class="p">)</span>
    <span class="k">print</span> <span class="n">songData</span>
    <span class="n">controller</span><span class="o">.</span><span class="n">emit</span><span class="p">({</span><span class="s">"song"</span> <span class="p">:</span> <span class="n">song</span><span class="p">,</span> <span class="s">"artist"</span><span class="p">:</span> <span class="n">songData</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">"featuredArtists"</span><span class="p">:</span> <span class="n">songData</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="s">"producers"</span><span class="p">:</span> <span class="n">songData</span><span class="p">[</span><span class="mi">2</span><span class="p">]})</span>
<span class="k">def</span> <span class="nf">nt</span><span class="p">(</span><span class="n">controller</span><span class="p">):</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"rapperlist.csv"</span><span class="p">)</span> <span class="k">as</span> <span class="n">rl</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">rl</span><span class="p">:</span>
            <span class="c">#Strip the trailing newline so it does not end up in the artist URL</span>
            <span class="n">controller</span><span class="o">.</span><span class="n">emit</span><span class="p">({</span><span class="s">"artist"</span> <span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()})</span>
    <span class="n">controller</span><span class="o">.</span><span class="n">end_cycle</span><span class="p">()</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">zillabyte</span><span class="o">.</span><span class="n">app</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"pygenius"</span><span class="p">)</span>
<span class="n">app</span><span class="o">.</span><span class="n">source</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"raplist"</span><span class="p">,</span> <span class="n">next_tuple</span> <span class="o">=</span> <span class="n">nt</span><span class="p">,</span> <span class="n">end_cycle_policy</span><span class="o">=</span><span class="s">"explicit"</span><span class="p">)</span>\
<span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="n">execute</span> <span class="o">=</span> <span class="n">getsongs</span><span class="p">)</span>\
<span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="n">execute</span> <span class="o">=</span> <span class="n">buildGraph</span><span class="p">)</span>\
<span class="o">.</span><span class="n">sink</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"zbrap"</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[{</span><span class="s">"song"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"artist"</span><span class="p">:</span><span class="s">"string"</span><span class="p">},</span> <span class="p">{</span><span class="s">"featuredArtists"</span><span class="p">:</span><span class="s">"array"</span><span class="p">},</span> <span class="p">{</span><span class="s">"producers"</span><span class="p">:</span><span class="s">"array"</span><span class="p">}])</span></code></pre></div>
<p>This app uses Zillabyte’s Python syntax, with my seeded list of artists as the custom source. Zillabyte’s pipe programming paradigm applies a series of transformations to a stream of tuples. This app uses the following workflow:</p>
<p>source: list of artists => each: URLs of all available songs for each artist => each: artist, featured artists, and producers extracted from each song’s URL => sink: write to CSV</p>
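<p>For readers who have not used pipe programming before, the workflow above can be approximated with plain Python generators. This is only a local sketch of the idea, not Zillabyte’s API: the <code>source</code>, <code>each</code>, and <code>run</code> helpers are hypothetical names, and the two scraping functions are replaced by stand-ins that return canned placeholder data instead of making HTTP requests.</p>

```python
def source(artists):
    """Source stage: emit one tuple per seeded artist."""
    for artist in artists:
        yield {"artist": artist}


def each(stream, execute):
    """'each' stage: run execute on every tuple and forward what it emits."""
    for tup in stream:
        for out in execute(tup):
            yield out


# Hypothetical stand-ins for the two scraping functions (placeholder data).
FAKE_SONGS = {"ArtistA": ["song-1", "song-2"], "ArtistB": ["song-3"]}
FAKE_META = {
    "song-1": ("Artist A", ["Artist B"], ["Producer X"]),
    "song-2": ("Artist A", [], ["Producer Y"]),
    "song-3": ("Artist B", [], ["Producer X"]),
}


def get_songs(tup):
    """One input tuple (an artist) fans out into one tuple per song."""
    for song in FAKE_SONGS.get(tup["artist"], []):
        yield {"song": song, "artist": tup["artist"]}


def build_graph(tup):
    """One input tuple (a song) becomes one row of metadata."""
    artist, featured, producers = FAKE_META[tup["song"]]
    yield {"song": tup["song"], "artist": artist,
           "featuredArtists": featured, "producers": producers}


def run(artists):
    """Chain source, each, each and collect the rows a sink would receive."""
    stream = source(artists)
    stream = each(stream, get_songs)
    stream = each(stream, build_graph)
    return list(stream)


rows = run(["ArtistA", "ArtistB"])
```

<p>Each stage only sees one tuple at a time, which is what lets a platform like Zillabyte spread the work across several workers; the generator version runs everything in a single process.</p>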
<p>An example row of the sunk CSV:</p>
<p><strong>"3070","http://rap.genius.com/Kanye-west-mercy-lyrics","Kanye West","["2 Chainz"\,"Big Sean"\,"Pusha T"]","["Hudson Mohawke"\,"Kanye West"\,"Lifted"\,"Mike Dean"\,"Mike WiLL Made It"]","2014-10-21 19:51:36","1",""</strong></p>
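<p>A row like this needs a little care to read back, because the array columns contain commas escaped as <code>\,</code> and unescaped inner quotes. Here is a minimal parsing sketch using only the standard library. It assumes (this is not confirmed by the post) that every field is wrapped in double quotes, that commas inside arrays are always escaped as <code>\,</code>, and that the second through fifth fields are the song URL, artist, featured artists, and producers. <code>parse_row</code> and <code>parse_array</code> are hypothetical helper names.</p>

```python
import json

# Example row from the post, written with straight quotes.
ROW = (
    '"3070","http://rap.genius.com/Kanye-west-mercy-lyrics","Kanye West",'
    '"["2 Chainz"\\,"Big Sean"\\,"Pusha T"]",'
    '"["Hudson Mohawke"\\,"Kanye West"\\,"Lifted"\\,"Mike Dean"\\,"Mike WiLL Made It"]",'
    '"2014-10-21 19:51:36","1",""'
)

PLACEHOLDER = "\x00"  # protects escaped commas while the row is split


def parse_array(field):
    """Turn an array field back into a Python list (empty field gives [])."""
    return json.loads(field.replace(PLACEHOLDER, ",")) if field else []


def parse_row(row):
    """Return (song_url, artist, featured, producers) for one sunk row."""
    protected = row.strip().replace("\\,", PLACEHOLDER)
    # Assumption: every field is wrapped in double quotes; drop the outer pair.
    fields = [field[1:-1] for field in protected.split(",")]
    return fields[1], fields[2], parse_array(fields[3]), parse_array(fields[4])


song_url, artist, featured, producers = parse_row(ROW)
```

<p>A name that itself contains a comma or a double quote would need extra handling, since the escaping rules for that case are not visible in this one example row.</p>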
<h2>Visualization</h2>
<p>The full collaboration graph is so densely connected that individual rendered edges become extremely difficult to distinguish. An excellent and simple way to limit the number of rendered edges is to build a minimum spanning tree (MST) from the full graph. Keeping my workflow in Python, I used an extraction script and NetworkX on the output CSV to generate the graph. Check out the code on <a href="https://github.com/nkarnik/RapNetwork" target="_blank" class="btn">GITHUB</a> </p>
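<p>To make the minimum spanning tree step concrete, here is a small standard-library sketch of Kruskal’s algorithm over collaboration data. It is not the code from the repository, which uses networkx. The edge weighting is an assumption made for illustration (weight = 1 / number of shared songs, so frequent collaborators are joined first), and the rows use placeholder names rather than real credits.</p>

```python
from collections import Counter
from itertools import chain


def collaboration_weights(rows):
    """Count how often each pair of names shares a song credit."""
    counts = Counter()
    for artist, featured, producers in rows:
        for other in set(chain(featured, producers)):
            if other != artist:
                counts[frozenset((artist, other))] += 1
    return counts


def minimum_spanning_forest(counts):
    """Kruskal's algorithm with union-find; assumed weight = 1 / count."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    # Lightest edges first; ties are broken by name so the result is stable.
    ordered = sorted(counts.items(), key=lambda kv: (1.0 / kv[1], sorted(kv[0])))
    for pair, n in ordered:
        a, b = sorted(pair)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the edge only if it joins two components
            parent[root_a] = root_b
            tree.append((a, b, 1.0 / n))
    return tree


# Placeholder rows: (primary artist, featured artists, producers).
rows = [
    ("A", ["B"], ["C"]),
    ("A", ["B"], []),
    ("B", ["C"], []),
]
tree = minimum_spanning_forest(collaboration_weights(rows))
```

<p>With three names and three candidate edges, only two edges survive, which is the point of the technique: a graph with <em>n</em> connected nodes is drawn with just <em>n</em> - 1 edges. If the input graph is disconnected, the function returns one tree per component.</p>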
<p>Kanye West’s quadrant of the MST:</p>
<p><img src="http://nkarnik.github.io/images/mstgraphkanye.png" alt="Network 3" /></p>
<p>Eminem’s quadrant: </p>
<p><img src="http://nkarnik.github.io/images/mstgraphem.png" alt="Network 4" /></p>
<p>I struggled a bit to generate user-friendly and informative layouts for the graphs, and I definitely noticed some strange connections along the way. </p>
<h2 id="odd-surprises">Odd Surprises</h2>
<p>Anyone familiar with the last 30 years of hip hop should be skeptical of some of the connections in this graph. For example, it should be obvious that 2Pac and Eminem never collaborated while 2Pac was alive, yet they are first-degree connections in the minimum spanning tree. While bizarre on its face, the connection is legitimate: Eminem is credited as a producer on some of 2Pac’s posthumous material. Inevitably, the quality of the results depends heavily on the cleanliness of the input data source, in this case Rap Genius. There are certainly some duplicated songs, but the most significant source of confusion is that a nontrivial number of songs on Rap Genius come from mixtapes and unofficial remixes, with which many hip hop fans may be unfamiliar. </p>
<h2 id="moving-towards-data-analysis">Moving towards Data Analysis…</h2>
<p>Many of the questions I posed initially, especially the one in the title, remain to be answered. So far, I have completed much of the legwork of generating a structured graph with mostly validated data. In the next post, I will actually determine the PageRank of our favorite rappers (and Lil Wayne) in a distributed fashion using Zillabyte.</p>
<p><a href="http://nkarnik.github.io/articles/lil-wayne-pagerank-1/">The Rap Network</a> was originally published by Nikhil Karnik at <a href="http://nkarnik.github.io">Nikhil Karnik</a> on October 28, 2014.</p>