Youtube Keyword Scraper

(YoutubeKWS)

Search for keywords in any given topic

[Box and whisker plot of search query results]

What your results mean

Graph Axes

From left to right along the bottom line (X axis) are the 50 most frequently occurring words from the titles of the top 100 videos returned for your query. Some words are likely used together in common phrases.

From bottom to top along the left line (Y axis) is the engagement rate, expressed as a decimal. To convert to a percentage (%), multiply the value by 100. Engagement rate is calculated as (0.5 * like count + 0.5 * comment count) / views, so an engagement rate of 1 (100%) would mean every viewer both liked and commented.
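Here is a minimal Python sketch of that calculation, using made-up counts for a single hypothetical video:

```python
# Hypothetical counts for one video; the real numbers come from the
# YouTube Data API's statistics for each video.
views, likes, comments = 12_000, 480, 35

# Engagement rate = (0.5 * like count + 0.5 * comment count) / views
engagement = (0.5 * likes + 0.5 * comments) / views

print(f"{engagement:.4f} as a decimal, {engagement * 100:.2f}% as a percentage")
```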

Frequency of words (n)

Seen next to every word, n= indicates the number of video titles that word has occurred in out of the top 100 videos. The maximum would be (n=100), indicating that the word was used in every video title on the first two pages of YouTube results for your query.

Generally, the smaller the sample size (n, the number of occurrences), the less reliable the data (engagement rates).
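The real counting happens in the Python Lambda, but here is an illustrative sketch of how a count like n could be produced, using a few invented titles standing in for the 100 returned by your query:

```python
from collections import Counter

# Invented titles standing in for the 100 titles returned by the query.
titles = [
    "Build a REST API in Python",
    "API tutorial for beginners",
    "How to get a free API key",
]

counter = Counter()
for title in titles:
    # Count each word at most once per title, so n can never exceed
    # the number of videos.
    counter.update(set(title.lower().split()))

# The 50 most common words become the X axis of the plot.
for word, n in counter.most_common(50):
    print(f"{word} (n={n})")
```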

Data distributions

Each column extending vertically corresponds to a single word. The blue box and lines, called a box and whiskers, map out the distribution of data points for that word. For example, "api (n=98)" tells us there are 98 videos out of the 100 that include the word "api".

Every video has an engagement rate, so here we can see the engagement rates from all 98 videos that use the word "api" - that gives us 98 data points for "api"! Visualising 98 dots crowded into the same space becomes very messy, so instead of plotting each dot itself, we can plot the distribution of the data, either as a distribution curve (bell curve) or as a box and whisker plot.
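The site's plot is drawn in R, but for illustration here is a minimal Python/matplotlib sketch of the same idea, using randomly generated engagement rates in place of the scraped ones:

```python
import random
import matplotlib.pyplot as plt

# Randomly generated engagement rates standing in for the scraped values.
rates = {
    "api": [random.uniform(0.0, 0.02) for _ in range(98)],
    "how": [random.uniform(0.0, 0.02) for _ in range(55)],
}

labels = [f"{word} (n={len(values)})" for word, values in rates.items()]
plt.boxplot(list(rates.values()), labels=labels)
plt.ylabel("Engagement rate (decimal)")
plt.savefig("boxplot.png")
```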

Spreads and Ranges

The whiskers and boxes represent the distribution of values (engagement rates) from videos including a certain word. The longer/taller the whisker or box, the more variety the values have - meaning the larger the ranges.

For example, the distribution belonging to "how" (n=55) shows that the lowest 75% of the data covers roughly the same range as the top 25%: the lowest 75% of videos sit between 0% and 1% engagement rate, while the highest 25% sit between 1% and 2% engagement rate.

Conversely, the "key" data distribution has a much smaller range, and videos using this word are likely to have below 1% engagement.

Documentation

How it works

How does my web browser get all this data?
  1. By typing in a query and clicking submit, your browser submits it to a script I have developed.
  2. This data submission occurs through an API gateway - which secures and manages incoming requests to make sure my scripts don't get overrun.
  3. The API Gateway is hosted on AWS, and it routes all incoming requests to a script I have sitting idly, known as a Lambda function.
  4. This Lambda function is not actually running until it receives a request; it can only be "invoked", or started, by my API Gateway.
  5. This Lambda function contains all the Python and R code necessary to take your search query, request data from the YouTube Data API v3 (an API, just like my API Gateway!), process it, and make a nice graph.
  6. But getting the data back to you is quite difficult! If my Lambda function could just upload code, files, or data onto your computer, every computer in the world would be infested with malware.
  7. Instead, my Lambda function uploads it to a publicly accessible storage service on AWS - Amazon Simple Storage Service (Amazon S3) - with a unique identifier so we know that data is yours!
  8. My Lambda function then returns an OK to your browser, along with the unique identifier.
  9. The JavaScript running on this website receives the response from the Lambda function, and sends ANOTHER request to Amazon S3, using your unique identifier to display the publicly available image and data directly on the website, without touching your computer.
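To make steps 5-8 more concrete, here is a hypothetical, simplified sketch of what the Python side of such a Lambda handler could look like. The bucket name and payload shape are illustrative placeholders, not the real ones:

```python
import json
import time

import boto3

s3 = boto3.client("s3")

# Illustrative placeholder, not the real bucket name.
RESULTS_BUCKET = "example-public-results-bucket"


def lambda_handler(event, context):
    """Sketch of the flow above: read the query from the API Gateway
    request, build the result, store it in S3, and return the unique
    identifier so the browser can fetch it."""
    query = json.loads(event["body"])["query"]

    # ... call the YouTube Data API and render the graph here ...
    graph_bytes = b"placeholder for the rendered PNG"

    # Query-and-timestamp identifier so the browser can find its own result.
    identifier = f"{query}-{int(time.time())}.png"
    s3.put_object(Bucket=RESULTS_BUCKET, Key=identifier, Body=graph_bytes)

    return {
        "statusCode": 200,
        "body": json.dumps({"identifier": identifier}),
    }
```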

How to interpret results

Statistics, oh boy what have I done?
  1. If statistics and analytics aren't your thing, fret not!
  2. The most important considerations revolve around the completeness of the data: does the data give you the full picture?
  3. The answer is no. Although this tool gives you insight into some keywords that are commonly used in popular videos, we cannot assume the relationship extends past that.
  4. This means we cannot assume that using these keywords statistically increases your views or engagement rate. Establishing that would take more advanced statistical analysis, and a lot more data!
  5. One of the reasons this doesn't work is that we don't account for other things influencing the engagement rate. For example:
    • Channels with more subscribers get more views, so any keyword they use in titles will seem "better"
    • Some topics/niches have higher engagement or views than others
    • This data only looks at the highest-ranking videos within a topic, so we know what might work, but what about low-performing videos - what doesn't work?
    • A difference that looks big on the graph may not be big in practice! The percentage differences in engagement rates are often very small, and won't make much of a difference.
    • Similarly, although a 2% difference might look big, we have no idea whether the value is statistically significant, or "important"
  6. Take these readings with a pinch of salt. If you don't have any idea about what keywords to use in your titles, this might give you some direction! But if you are seeing success with your current strategy, I would use this more as a tool for ideas.

Data Architecture

The technical infrastructure underlying the magic
  • Lambda function was used for cost-effectiveness and ease of deployment through API Gateway
  • The initial query is passed from API Gateway to an initial Python Lambda, which recursively makes API calls to the YouTube Data API (see the first sketch after this list):
    • The script uses the GET method on the search resource to obtain the first 50 results, ordered by relevance to the query
    • The script stores the IDs and titles of the videos, using the IDs to make a second GET request on the videos resource to obtain statistics about each video
    • This is done recursively, as the YouTube Data API allows a maximum of 50 items (maxResults) per request
    • Currently my script iterates twice, to produce 100 videos worth of data. This can easily be scaled up or down.
  • After extraction and processing with Pandas, a semi-cleaned output CSV is stored in a private S3 One Zone-IA bucket for cost-effective medium-term storage. The CSV is named by query and timestamp
  • The Python Lambda then calls a second Lambda in an R environment (see the second sketch after this list), which finalises cleaning, produces the dynamic boxplot, and will, in the future, perform ANOVA or multiple linear modelling for association testing.
  • This final graph is then saved to a publicly available S3 bucket, time- and query-stamped for unique, short-term retrieval
  • The R Lambda returns a status 200 along with the unique identifier, after which the website's JavaScript retrieves the graph and loads the data.
  • Future steps include: creating S3 object lifecycle rules to ensure a short TTL and minimised costs, and CloudWatch alarms for S3 storage with automated clearance/removal of old data.
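First sketch: a hypothetical, simplified Python version of the recursive extraction step. The API key is a placeholder and the field handling is trimmed down, but the search-then-videos two-step and the 50-item page size follow the YouTube Data API v3 flow described above:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"
VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def fetch_top_videos(query, pages=2):
    """Fetch up to pages * 50 videos: search.list for IDs and titles,
    then videos.list for per-video statistics, 50 items per request."""
    videos, page_token = [], None
    for _ in range(pages):
        params = {
            "part": "snippet",
            "q": query,
            "type": "video",
            "order": "relevance",
            "maxResults": 50,
            "key": API_KEY,
        }
        if page_token:
            params["pageToken"] = page_token
        search = requests.get(SEARCH_URL, params=params).json()

        ids = [item["id"]["videoId"] for item in search.get("items", [])]
        stats = requests.get(VIDEOS_URL, params={
            "part": "snippet,statistics",
            "id": ",".join(ids),
            "key": API_KEY,
        }).json()

        for item in stats.get("items", []):
            s = item["statistics"]
            videos.append({
                "title": item["snippet"]["title"],
                "views": int(s.get("viewCount", 0)),
                "likes": int(s.get("likeCount", 0)),
                "comments": int(s.get("commentCount", 0)),
            })

        page_token = search.get("nextPageToken")
        if not page_token:
            break
    return videos
```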
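Second sketch: a hypothetical version of the hand-off from the Python Lambda to the R Lambda. The bucket and function names are illustrative placeholders, not the real resources:

```python
import io
import json
import time

import boto3
import pandas as pd

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# Illustrative placeholders, not the real bucket or function names.
CSV_BUCKET = "example-private-csv-bucket"
R_FUNCTION = "example-r-plotting-lambda"


def store_and_plot(videos, query):
    """Write the semi-cleaned CSV to the private bucket, then invoke the
    R Lambda that finishes cleaning and draws the boxplot."""
    key = f"{query}-{int(time.time())}.csv"

    buffer = io.StringIO()
    pd.DataFrame(videos).to_csv(buffer, index=False)
    s3.put_object(Bucket=CSV_BUCKET, Key=key, Body=buffer.getvalue())

    # Synchronous invocation: wait for the R Lambda to return the graph's
    # unique identifier in the public bucket.
    response = lambda_client.invoke(
        FunctionName=R_FUNCTION,
        InvocationType="RequestResponse",
        Payload=json.dumps({"bucket": CSV_BUCKET, "key": key}),
    )
    return json.loads(response["Payload"].read())
```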

Contact Me

Feel free to contact me with any questions or requests you may have. Data can be provided in .csv format upon request, and further explanation and interpretation of the graphs can be offered if needed.

omegabytten@gmail.com