The Alternative to Web Scraping. The “lazy” programmer’s guide to… | by Doug Guthrie

One of the better sites for financial data is Yahoo Finance. This makes it a prime target for web scraping by finance enthusiasts. There are nearly daily questions on StackOverflow that reference some sort of data retrieval (oftentimes through web scraping) from Yahoo Finance.

Web Scraping Problem #1

trying to test a code that scrap from yahoo finance

I’m a python beginner but I like to learn the language by testing it and trying it. so there is a yahoo web scraper…

stackoverflow.com

The OP is trying to find the current price for a specific stock, Facebook. Their code is below:

And that code produced the following output:

the current price: 216.08

It’s a pretty simple problem with an equally simple web scraping solution. However, it’s not lazy enough. Let’s look at the next one.

Web Scraping Problem #2

Web Scraping Yahoo Finance Statistics — Code Errors Out on Empty Fields

I found this useful code snippet: Web scraping of Yahoo Finance statistics using BS4 I have simplified the code as per…

stackoverflow.com

The OP is trying to extract two values from the statistics tab: the stock’s enterprise value and the number of shares short. His problem actually revolves around retrieving nested dictionary values that may or may not be there, but he seems to have found a better way to retrieve the data.

Take a look at line 3: the OP was able to find the data he’s looking for inside a variable in the page’s JavaScript:

root.App.main = { .... };

From there, the data is retrieved pretty simply by accessing the appropriate nested keys within the dictionary, data. But, as you may have guessed, there is a simpler, lazier solution.
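For reference, the extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the OP’s code: the helper name `extract_app_main`, the sample HTML, and the keys inside it are all made up for the example, and the real page embeds a far larger object.

```python
import json

MARKER = "root.App.main = "


def extract_app_main(html: str) -> dict:
    """Return the JSON object assigned to root.App.main in the page source."""
    start = html.index(MARKER) + len(MARKER)
    # raw_decode parses a single JSON value beginning at `start`
    # and ignores whatever follows it (the trailing ";" and script code).
    data, _end = json.JSONDecoder().raw_decode(html, start)
    return data


# Hypothetical, heavily trimmed page source; the key names are placeholders.
sample_html = """
<script>
(function (root) {
root.App.main = {"stores": {"price": {"symbol": "FB"}}};
}(this));
</script>
"""

data = extract_app_main(sample_html)
symbol = data["stores"]["price"]["symbol"]
print(symbol)
```

Using `raw_decode` avoids writing a regular expression that has to guess where the object ends.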

Lazy Solution #1

Look at the URL on line 3

Output:

{
    'quoteSummary': {
        'error': None,
        'result': [{
            'price': {
                'averageDailyVolume10Day': {},
                'averageDailyVolume3Month': {},
                'circulatingSupply': {},
                'currency': 'USD',
                'currencySymbol': '$',
                'exchange': 'NMS',
                'exchangeDataDelayedBy': 0,
                'exchangeName': 'NasdaqGS',
                'fromCurrency': None,
                'lastMarket': None,
                'longName': 'Facebook, Inc.',
                'marketCap': {
                    'fmt': '698.42B',
                    'longFmt': '698,423,836,672.00',
                    'raw': 698423836672
                },
                'marketState': 'REGULAR',
                'maxAge': 1,
                'openInterest': {},
                'postMarketChange': {},
                'postMarketPrice': {},
                'preMarketChange': {
                    'fmt': '-0.90',
                    'raw': -0.899994
                },
                'preMarketChangePercent': {
                    'fmt': '-0.37%',
                    'raw': -0.00368096
                },
                'preMarketPrice': {
                    'fmt': '243.60',
                    'raw': 243.6
                },
                'preMarketSource': 'FREE_REALTIME',
                'preMarketTime': 1594387780,
                'priceHint': {
                    'fmt': '2',
                    'longFmt': '2',
                    'raw': 2
                },
                'quoteSourceName': 'Nasdaq Real Time Price',
                'quoteType': 'EQUITY',
                'regularMarketChange': {
                    'fmt': '0.30',
                    'raw': 0.30160522
                },
                'regularMarketChangePercent': {
                    'fmt': '0.12%',
                    'raw': 0.0012335592
                },
                'regularMarketDayHigh': {
                    'fmt': '245.49',
                    'raw': 245.49
                },
                'regularMarketDayLow': {
                    'fmt': '239.32',
                    'raw': 239.32
                },
                'regularMarketOpen': {
                    'fmt': '243.68',
                    'raw': 243.685
                },
                'regularMarketPreviousClose': {
                    'fmt': '244.50',
                    'raw': 244.5
                },
                'regularMarketPrice': {
                    'fmt': '244.80',
                    'raw': 244.8016
                },
                'regularMarketSource': 'FREE_REALTIME',
                'regularMarketTime': 1594410026,
                'regularMarketVolume': {
                    'fmt': '19.46M',
                    'longFmt': '19,456,621.00',
                    'raw': 19456621
                },
                'shortName': 'Facebook, Inc.',
                'strikePrice': {},
                'symbol': 'FB',
                'toCurrency': None,
                'underlyingSymbol': None,
                'volume24Hr': {},
                'volumeAllCurrencies': {}
            }
        }]
    }
}
the current price: 241.63
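As a rough sketch of the final step, here is how the price could be pulled out of a response shaped like the output above. The request itself is left out, so `sample` is a trimmed stand-in for the parsed JSON, and `current_price` is a hypothetical helper name used only for illustration.

```python
def current_price(response: dict) -> float:
    """Pull the raw regular-market price out of a parsed quoteSummary response."""
    result = response["quoteSummary"]["result"][0]
    return result["price"]["regularMarketPrice"]["raw"]


# Trimmed stand-in for the parsed JSON response (values copied from the output above).
sample = {
    "quoteSummary": {
        "error": None,
        "result": [
            {
                "price": {
                    "regularMarketPrice": {"fmt": "244.80", "raw": 244.8016},
                    "symbol": "FB",
                }
            }
        ],
    }
}

print(f"the current price: {current_price(sample):.2f}")
```

Each numeric field carries a `raw` value for calculations and a `fmt` string for display, so pick whichever suits the use case.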

Lazy Solution #2

Again, look at the URL on line 3

Output:

{
    'quoteSummary': {
        'result': [{
            'defaultKeyStatistics': {
                'maxAge': 1,
                'priceHint': {
                    'raw': 2,
                    'fmt': '2',
                    'longFmt': '2'
                },
                'enterpriseValue': {
                    'raw': 13677747200,
                    'fmt': '13.68B',
                    'longFmt': '13,677,747,200'
                },
                'forwardPE': {},
                'profitMargins': {
                    'raw': 0.07095,
                    'fmt': '7.10%'
                },
                'floatShares': {
                    'raw': 637754149,
                    'fmt': '637.75M',
                    'longFmt': '637,754,149'
                },
                'sharesOutstanding': {
                    'raw': 639003008,
                    'fmt': '639M',
                    'longFmt': '639,003,008'
                },
                'sharesShort': {},
                'sharesShortPriorMonth': {},
                'sharesShortPreviousMonthDate': {},
                'dateShortInterest': {},
                'sharesPercentSharesOut': {},
                'heldPercentInsiders': {
                    'raw': 0.0025499999,
                    'fmt': '0.25%'
                },
                'heldPercentInstitutions': {
                    'raw': 0.31033,
                    'fmt': '31.03%'
                },
                'shortRatio': {},
                'shortPercentOfFloat': {},
                'beta': {
                    'raw': 0.365116,
                    'fmt': '0.37'
                },
                'morningStarOverallRating': {},
                'morningStarRiskRating': {},
                'category': None,
                'bookValue': {
                    'raw': 12.551,
                    'fmt': '12.55'
                },
                'priceToBook': {
                    'raw': 1.3457094,
                    'fmt': '1.35'
                },
                'annualReportExpenseRatio': {},
                'ytdReturn': {},
                'beta3Year': {},
                'totalAssets': {},
                'yield': {},
                'fundFamily': None,
                'fundInceptionDate': {},
                'legalType': None,
                'threeYearAverageReturn': {},
                'fiveYearAverageReturn': {},
                'priceToSalesTrailing12Months': {},
                'lastFiscalYearEnd': {
                    'raw': 1561852800,
                    'fmt': '2019-06-30'
                },
                'nextFiscalYearEnd': {
                    'raw': 1625011200,
                    'fmt': '2021-06-30'
                },
                'mostRecentQuarter': {
                    'raw': 1577750400,
                    'fmt': '2019-12-31'
                },
                'earningsQuarterlyGrowth': {
                    'raw': 0.114,
                    'fmt': '11.40%'
                },
                'revenueQuarterlyGrowth': {},
                'netIncomeToCommon': {
                    'raw': 938000000,
                    'fmt': '938M',
                    'longFmt': '938,000,000'
                },
                'trailingEps': {
                    'raw': 1.434,
                    'fmt': '1.43'
                },
                'forwardEps': {},
                'pegRatio': {},
                'lastSplitFactor': None,
                'lastSplitDate': {},
                'enterpriseToRevenue': {
                    'raw': 1.035,
                    'fmt': '1.03'
                },
                'enterpriseToEbitda': {
                    'raw': 6.701,
                    'fmt': '6.70'
                },
                '52WeekChange': {
                    'raw': -0.17621362,
                    'fmt': '-17.62%'
                },
                'SandP52WeekChange': {
                    'raw': 0.045882702,
                    'fmt': '4.59%'
                },
                'lastDividendValue': {},
                'lastCapGain': {},
                'annualHoldingsTurnover': {}
            }
        }],
        'error': None
    }
}
{'AGL.AX': {'Enterprise Value': '13.73B', 'Shares Short': 'N/A'}}
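Notice that fields with no data, such as sharesShort, come back as empty dictionaries (and a few as None). That is exactly the "may or may not be there" problem from the original question. A minimal sketch of one way to handle it, where `get_fmt` is a hypothetical helper and `stats` is a two-field excerpt of the output above:

```python
def get_fmt(stats: dict, key: str, default: str = "N/A") -> str:
    """Return the formatted value for `key`, or `default` if the field is empty or missing."""
    # `or {}` covers both a missing key and a field whose value is None.
    return (stats.get(key) or {}).get("fmt", default)


# Two fields copied from the defaultKeyStatistics output above.
stats = {
    "enterpriseValue": {"raw": 13677747200, "fmt": "13.68B", "longFmt": "13,677,747,200"},
    "sharesShort": {},
}

summary = {
    "Enterprise Value": get_fmt(stats, "enterpriseValue"),
    "Shares Short": get_fmt(stats, "sharesShort"),
}
print(summary)
```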

The lazy alternatives simply altered the request from utilizing the front-end URL to a somewhat unofficial API endpoint, which returns JSON data. It’s simpler and results in more data! What about speed though (pretty sure I promised simpler, more data, and a faster alternative)? Let’s check:

web scraping #1 min time is 0.5678426799999997
lazy #1 min time is 0.11238783999999953
web scraping #2 min time is 0.3731000199999997
lazy #2 min time is 0.0864451399999993

The lazy alternatives are 4x to 5x faster than their web scraping counterparts!
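The exact benchmarking code isn’t shown here, so the following is only a sketch of how a "min time" could be measured, assuming something like the standard library’s `timeit`. `min_time` is a hypothetical helper; the speedups are computed directly from the timings quoted above.

```python
import timeit


def min_time(func, repeat: int = 5) -> float:
    """Run `func` once per trial and return the fastest trial, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=1))


# Speedups implied by the timings quoted above.
speedup_1 = 0.5678426799999997 / 0.11238783999999953
speedup_2 = 0.3731000199999997 / 0.0864451399999993
print(f"#1: {speedup_1:.1f}x, #2: {speedup_2:.1f}x")
```

Taking the minimum rather than the mean reduces the influence of one-off slowdowns, which matters when network latency dominates the measurement.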

You might be thinking though, “That’s great, but where did you find those URLs?”.

The Lazy Process

Think about the two problems we walked through above: the OPs were trying to retrieve the data after it had been loaded into the page. The lazier solutions went right to the source of the data and didn’t bother with the front-end page at all. This is an important distinction and, I think, a good approach whenever you’re trying to extract data from a website.

Step 1: Examine XHR Requests

An XHR (XMLHttpRequest) object is an API available to web browser scripting languages such as JavaScript. It is used to send HTTP or HTTPS requests to a web server and load the server response data back into the script. Basically, it allows the client to retrieve data from a URL without having to do a full page refresh.

I’ll be using Chrome for the following demonstrations, but other browsers will have similar functionality.

  • If you’d like to follow along, navigate to https://finance.yahoo.com/quote/AAPL?p=AAPL
  • Open Chrome’s developer console: open the Chrome menu in the upper-right-hand corner of the browser window and select More Tools > Developer Tools. You can also use the shortcut Option + ⌘ + J (on macOS) or Shift + CTRL + J (on Windows/Linux).
  • Select the “Network” tab

  • Then filter the results by “XHR”

  • Your results will be similar but not the same. You should notice though that there are a few requests that contain “AAPL”. Let’s start by investigating those. Click on one of the links in the left-most column that contain the characters “AAPL”.

  • After selecting one of the links, you’ll see an additional screen that provides details into the request you selected. The first tab, Headers, provides details into the request made by the browser and the response from the server. Immediately, you should notice the Request URL in the Headers tab is very similar to what was provided in the lazy solutions above. Seems like we’re on the right track.
  • If you select the Preview tab, you’ll see the data returned from the server.

  • Perfect! It looks like we just found the URL to get OHLC data for Apple!
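If clicking through requests by hand gets tedious, the Network tab can also export everything it captured as a HAR file (a JSON document), which can then be filtered in code. The sketch below assumes the HAR layout `log.entries[].request.url` and Chrome’s non-standard `_resourceType` field; `xhr_urls` and the sample data (including the example.com URLs) are hypothetical.

```python
def xhr_urls(har: dict, needle: str) -> list[str]:
    """Return URLs of XHR entries in a HAR export whose URL contains `needle`."""
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        # `_resourceType` is assumed to be a Chrome-specific extension field,
        # so it is read with .get() in case an entry lacks it.
        if entry.get("_resourceType") == "xhr" and needle in url:
            urls.append(url)
    return urls


# Hypothetical, trimmed HAR data; a real export would be loaded with json.load().
sample_har = {
    "log": {
        "entries": [
            {"_resourceType": "xhr", "request": {"url": "https://example.com/api/chart/AAPL?range=1d"}},
            {"_resourceType": "script", "request": {"url": "https://example.com/static/app.js"}},
            {"_resourceType": "xhr", "request": {"url": "https://example.com/api/news"}},
        ]
    }
}

print(xhr_urls(sample_har, "AAPL"))
```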

Step 2: Search

Now that we’ve found some of the XHR requests that are made via the browser, let’s search the JavaScript files to see if we can find any more information. The common elements I’ve found in the URLs of the XHR requests are “query1” and “query2”. In the top-right corner of the developer console, select the three vertical dots and then select “Search” in the dropdown.

Search for “query2” in the search bar:

Select the first option. An additional tab will pop up showing where “query2” was found. You should notice something familiar here as well:

It’s the same variable that web scraping solution #2 targeted to extract its data. The console should give you an option to “pretty-print” the variable. You can select that option, copy and paste the entire line (line 11 above) into something like https://beautifier.io/, or, if you use VS Code, install the Beautify extension, which does the same thing. Once it’s formatted, paste the code into a text editor and search for “query2” again. You should find one result inside something called “ServicePlugin”. That section contains the URLs that Yahoo Finance uses to populate data in its pages. The following is taken right out of that section:

"tachyon.quoteSummary": {
    "path": "\u002Fv10\u002Ffinance\u002FquoteSummary\u002F{symbol}",
    "timeout": 6000,
    "query": ["lang", "region", "corsDomain", "crumb", "modules", "formatted"],
    "responseField": "quoteSummary",
    "get": {"formatted": true}
},

Once the \u002F escapes are decoded to /, this is the same path that is used in the URLs of the lazy solutions provided above.
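To make that connection concrete, here is a minimal sketch that turns the entry above into a request URL. The host in `BASE_URL` is an assumption (the article only notes that the XHR URLs contain “query1” or “query2”), `build_url` is a hypothetical helper, and since this is an unofficial endpoint, Yahoo may require other parameters from the `query` list, such as `crumb`, or change the path entirely.

```python
import json
from urllib.parse import urlencode

# The entry as it appears in the page source, wrapped in braces so it parses as JSON.
raw_entry = r'''{"tachyon.quoteSummary": {"path": "\u002Fv10\u002Ffinance\u002FquoteSummary\u002F{symbol}", "timeout": 6000, "query": ["lang", "region", "corsDomain", "crumb", "modules", "formatted"], "responseField": "quoteSummary", "get": {"formatted": true}}}'''

# json.loads decodes the \u002F escapes into plain "/" characters.
entry = json.loads(raw_entry)["tachyon.quoteSummary"]

# Assumed host, not taken from the entry itself.
BASE_URL = "https://query2.finance.yahoo.com"


def build_url(symbol: str, **params: str) -> str:
    """Fill in the path template and append only the query parameters the entry lists."""
    allowed = {k: v for k, v in params.items() if k in entry["query"]}
    return BASE_URL + entry["path"].format(symbol=symbol) + "?" + urlencode(allowed)


print(build_url("FB", modules="price"))
```

The `modules` parameter selects which section of the summary comes back, which would explain why one lazy solution returned a `price` block and the other `defaultKeyStatistics`.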

TL;DR

  • While web scraping can be necessary because of how a website is structured, it’s worth investigating whether you can find the source of the data. The resulting code is simpler, and it extracts more data, faster.
  • The source of a website’s data can often be found by looking through XHR requests or by searching the site’s JavaScript files using your browser’s developer console.

More Information

  • What if you can’t find any XHR requests? Check out The Alternative to Web Scraping, Part II: The DRY approach to retrieving web data

  • If you’re interested specifically in the Yahoo Finance aspect of this article, I’ve written a python package, yahooquery, that exposes most of those endpoints in a convenient interface. I’ve also written an introductory article that describes how to use the package as well as a comparison to a similar one.

The (Unofficial) Yahoo Finance API

A Python interface to endless amounts of data

towardsdatascience.com

  • Please feel free to reach out if you have any questions or comments.

Source: The Alternative to Web Scraping. The “lazy” programmer’s guide to… | by Doug Guthrie | Towards Data Science
