Friday, September 30, 2022

The Universal Solvent for REST APIs – O’Reilly


Data scientists working in Python or R typically acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that’s never as simple as we’d like. When you’re reading a lot of data from a REST API, you need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.
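To make that concrete, here is a minimal Python sketch of the per-API pagination bookkeeping this paragraph describes. The endpoint is simulated by a stubbed fetch function, and the `next_token` field and response shape are invented for illustration; real APIs vary (page numbers, cursors, Link headers):

```python
# Stand-in for requests.get(...).json() against a hypothetical paginated
# endpoint. The "next_token" field and response shape are invented; real
# APIs paginate with page numbers, cursors, or Link headers instead.
PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_token": "p2"},
    "p2": {"items": [{"id": 3}], "next_token": None},
}

def fetch_page(token=None):
    return PAGES[token]

def fetch_all():
    """Follow next_token links until the API reports no more pages."""
    rows, token = [], None
    while True:
        page = fetch_page(token)
        rows.extend(page["items"])
        token = page["next_token"]
        if token is None:
            return rows

print([row["id"] for row in fetch_all()])  # [1, 2, 3]
```

Every API demands its own variant of this loop, plus its own unpacking of the JSON that comes back; that is the grunt work referred to below.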

What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That’s exactly what Steampipe does. It’s a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.





1. Twitter: What are recent tweets that mention PySpark?

Here’s a SQL query to ask that question:

select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark'
order by
  created_at desc
limit 5;

Here’s the answer:

+---------------------+------------------------------------------------------------------------------------------------>
| id                  | text                                                                                           >
+---------------------+------------------------------------------------------------------------------------------------>
| 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
| 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡                                                            >
|                     |                                                                                                >
|                     | When to use what StorageLevel for Cache / Persist?                                             >
|                     |                                                                                                >
|                     | StorageLevel decides how and where data should be s…                                           >
| 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
| 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark      >
|                     |                                                                                                >
|                     | https://t.co/qf0gIvNmyx                                                                        >
| 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS                        >
|                     | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode                                    >
|                     | -> http…                                                                                      >
+---------------------+------------------------------------------------------------------------------------------------>

The table that’s being queried here, twitter_search_recent, receives the output from Twitter’s /2/tweets/search/recent endpoint and formulates it as a table with these columns. You don’t need to make an HTTP call to that API endpoint, or unpack the results; you just write a SQL query that refers to the documented columns. One of those columns, query, is special: it encapsulates Twitter’s query syntax. Here, we are simply looking for tweets that match PySpark, but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet, is:reply), properties (has:mentions, has:media), etc. That query syntax is the same no matter how you’re accessing the API: from Python, from R, or from Steampipe. It’s a lot to think about, but it’s all you really need to know when crafting queries to mine Twitter data.

2. GitHub: What are repositories that mention PySpark?

Here’s a SQL query to ask that question:

select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark'
order by stargazers_count desc
limit 10;

Here’s the answer:

+----------------------+-------------------+------------------+
| name                 | owner_login       | stargazers_count |
+----------------------+-------------------+------------------+
| SynapseML            | microsoft         | 3297             |
| spark-nlp            | JohnSnowLabs      | 2725             |
| incubator-linkis     | apache            | 2524             |
| ibis                 | ibis-project      | 1805             |
| spark-py-notebooks   | jadianes          | 1455             |
| petastorm            | uber              | 1423             |
| awesome-spark        | awesome-spark     | 1314             |
| sparkit-learn        | lensacom          | 1124             |
| sparkmagic           | jupyter-incubator | 1121             |
| data-algorithms-book | mahmoudparsian    | 1001             |
+----------------------+-------------------+------------------+

This looks very similar to the first example! In this case, the table that’s being queried, github_search_repository, receives the output from GitHub’s /search/repositories endpoint and formulates it as a table with these columns.

In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.

Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 41 tables.

3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?

To answer this question we need to consult two different APIs, then join their results. That’s even harder to do, in a consistent way, when you’re reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here’s one way to ask the question in SQL.

-- find pyspark repos
with github_repos as (
  select
    name,
    owner_login,
    stargazers_count
  from
    github_search_repository
  where
    query = 'pyspark' and name ~ 'pyspark'
  order by stargazers_count desc
  limit 50
),

-- find twitter handles of repo owners
github_users as (
  select
    u.login,
    u.twitter_username
  from
    github_user u
  join
    github_repos r
  on
    r.owner_login = u.login
  where
    u.twitter_username is not null
),

-- find corresponding twitter users
twitter_userids as (
  select
    id
  from
    twitter_user t
  join
    github_users g
  on
    t.username = g.twitter_username
)

-- find tweets from those users
select
  t.author->>'username' as twitter_user,
  'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url,
  t.text
from
  twitter_user_tweet t
join
  twitter_userids u
on
  t.user_id = u.id
where
  t.created_at > now()::date - interval '1 week'
order by
  t.author
limit 5

Here is the answer:

+----------------+---------------------------------------------------------------+------------------------------------->
| twitter_user   | url                                                           | text                                >
+----------------+---------------------------------------------------------------+------------------------------------->
| idealoTech     | https://twitter.com/idealoTech/status/1524688985649516544     | Can you find creative soluti>
|                |                                                               |                                     >
|                |                                                               | Join our @codility Order #API Challe>
|                |                                                               |                                     >
|                |                                                               | #idealolife #codility #php          >
| idealoTech     | https://twitter.com/idealoTech/status/1526127469706854403     | Our #ProductDiscovery team at idealo>
|                |                                                               |                                     >
|                |                                                               | Think you can solve it? 😎          >
|                |                                                               | ➡️  https://t.co/ELfUfp94vB https://t>
| ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
|                |                                                               | What's new? You can check the releas>
|                |                                                               |                                     >
|                |                                                               | pip install -U…                     >
| andfanilo      | https://twitter.com/andfanilo/status/1524999923665711104      | @edelynn_belle Thanks! Sometimes it >
| andfanilo      | https://twitter.com/andfanilo/status/1523676489081712640      | @juliafmorgado Good luck on the reco>
|                |                                                               |                                     >
|                |                                                               | My advice: power through it + a dead>
|                |                                                               |                                     >
|                |                                                               | I hated my first few short videos bu>
|                |                                                               |                                     >
|                |                                                               | Looking forward to the video 🙂

When APIs frictionlessly become tables, you can devote your full attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: “Easy things should be easy, hard things should be possible.” The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straight-ahead SQL that requires no wizardry at all.

The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in several nice ways. The solution is made of concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
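For contrast, here is a toy version of that same staged join over plain Python records rather than SQL. The datasets are invented stand-ins for the github_search_repository, github_user, and twitter_user tables used above:

```python
# Invented stand-ins for the three Steampipe tables the CTE pipeline reads.
repos = [{"name": "pyspark-demo", "owner_login": "alice"}]
users = [{"login": "alice", "twitter_username": "alice_tw"},
         {"login": "bob", "twitter_username": None}]
tweeters = [{"id": 101, "username": "alice_tw"}]

# Stage 1 (github_users): owners of matching repos who list a Twitter handle.
owner_logins = {r["owner_login"] for r in repos}
handles = {u["twitter_username"] for u in users
           if u["login"] in owner_logins and u["twitter_username"] is not None}

# Stage 2 (twitter_userids): map handles to Twitter user ids.
ids = [t["id"] for t in tweeters if t["username"] in handles]
print(ids)  # [101]
```

Each stage mirrors one CTE, and as in SQL you can inspect each intermediate result; the difference is that the joins, null checks, and intermediate names are all hand-rolled.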

Do I have to use SQL?

No! If you like the idea of mapping APIs to tables, but you would rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it’s Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python’s psycopg2 and R’s RPostgres as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
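As a minimal sketch of that workflow in Python (assuming a local `steampipe service start`, whose Postgres endpoint defaults to host localhost, port 9193, database `steampipe`, user `steampipe`; the password is printed when the service starts):

```python
# Assumed connection defaults for a local `steampipe service start`;
# substitute the password your service prints at startup.
STEAMPIPE = dict(host="localhost", port=9193, dbname="steampipe", user="steampipe")

def pyspark_repos(password):
    """Read a Steampipe table into a pandas dataframe, like any Postgres table."""
    # Imported here so the sketch only needs the drivers when actually run.
    import pandas as pd
    import psycopg2

    with psycopg2.connect(password=password, **STEAMPIPE) as conn:
        return pd.read_sql_query(
            "select name, owner_login, stargazers_count "
            "from github_search_repository "
            "where query = 'pyspark' "
            "order by stargazers_desc limit 10".replace("stargazers_desc",
                                                        "stargazers_count desc"),
            conn,
        )
```

From R, the same connection parameters work with RPostgres::dbConnect; nothing about the query changes, only the driver.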

But if you haven’t used SQL in this way before, it’s worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.

SQL:    select * from airports
Pandas: airports

SQL:    select * from airports limit 3
Pandas: airports.head(3)

SQL:    select id from airports where ident = 'KLAX'
Pandas: airports[airports.ident == 'KLAX'].id

SQL:    select distinct type from airport
Pandas: airports.type.unique()

SQL:    select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

SQL:    select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

We can argue the merits of one style versus the other, but there’s no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don’t have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.


