The Metaverse: NLP Multi-Classification Modeling

Created By:

Meredith Wang	Brad Gauvin	Mijail Mariano
github.com/m3redithw	github.com/bradgauvin	github.com/mijailmariano

September 2022

Codeup, Kalpana Cohort

Project Description

Like many business buzzwords, "The Metaverse" is no different and is a term many businesses are rapidly working to understand and define before it arrives. The "Metaverse" term has traditionally been closely associated to virtual realities much like a video game where individuals can enter and interact with simulated environments and other players.

Recent studies by McKensey Co. and Wharton School of Business estimate the Metaverse economy to be a roughly $5-13 trillion dollar market by the year 2030. With businesses investing ~$120 billion in the first five months of 2022, and a total of ~$50 billion in 2021 -- The Metaverse is appearing more like a virtual utopia where possibility is only limited by imagination.

Potentially more important than the idea of entering a virtual reality is the potential access to information that a Metaverse environment can help create. Imagine a doctor being able to enter the Metaverse from anywhere in the world and advising or leading highly specialized medical procedures in communities with unequal access to health care. With the changing landscape of working environments, the Metaverse can offer a future for both employer and employee to work from anywhere and still create the special interactions once only possible in the physical space.

As interesting and alluring as a Metaverse future appears to be, we believe there is still much to understand about this topic and the language people use to describe the Metaverse or better yet, the programming language and code that will create it.

Project Goal

This analysis uses GitHub README.md data commonly found in cloudbased repositories to understand the common text patterns in Metaverse related topics and analysis.

We use computational linguistic-based rules to examine and learn from unique text data found in these repositories. From learned patterns, we build a multiclass machine learning classification model capable of predicting future Metaverse repository coding languages.

In modeling we test and deploy several unique non-linear models and chose to deploy an XGBoost model on the final out-of-sample README text data. The XGBoost model returns an overall predictive accuracy score of ~46%.

On average, these results provide us with the potential to accurately predict and study the contents of a "Metaverse" related repo and its primary programming language by ~56% better than a baseline prediction.

The predicted programming language used in this analysis is the primary language in overall repository percentage in the GitHub repository. *

Process: In Brief

We search and acquire the GitHub README text data of "Metaverse" associated repositories.
We perform the following data cleaning processes on the full dataset: filter most non-alphanumeric characters, lemmatize the text, remove stop words and words < 3 letters.
We classify & label repositories that contain README text but no primary programming languages as "text".
We bucket/group affiliated and low-frequency programming languages to higher-domain languages.
We split the larger text dataset into train, validate, and test subset dataset for exploration and modeling.
We set exploratory questions and conduct analysis on programmatic text conducting frequency counts, outlier analysis, word clouds, and visual graphs.
We identify unique single, paired, and grouped words associated to programming languages.
We set a baseline language prediction score and prep the train and validate datasets for modeling.
We identify, and train five unique classes of non-linear models.
We deploy models on train and validate datasets and measure comparative scores.
We determine the best performing model on out-of-sample data and deploy this model on the test dataset.
Finally, we analyze our findings and provide recommendations for future analysis and actionable steps.

Steps to Reproduce

Key Libraries & Modules:

Note that unique to this analysis are several functions which will require a GitHub profile and API token to connect to GitHub’s API address and extract similar repository data. The script and instructions for creating a GitHub token can be found in the “acquire.py” file.
However, if wanting to reproduce this exact analysis, you will first need to download/import the necessary environment(s) for recreating the analysis.
There are several libraries and modules such as Pandas, Matplotlib, Seaborn, NumPy, SKLearn, and “XGBoost” that are used to conduct the analysis. This environment can be referenced under “notebook dependencies” in the final jupyter notebook.
Once the proper environment has been imported, you can read-in the “metaverse.csv” file using Pandas “read_csv” method as shown in the final jupyter notebook and run the report.

Data Dictionary

Metaverse

Refers to a virtual-reality space in which users can interact in a computer-generated environment and potentially other users.

Bigram

Refers to pair and consecutive words found in the text data.

Language

Refers to a programmatic language used to create, manipulate, customize, and/or automate files found in the GitHub repository.

ReadMe Contents

Refers to the text data found in the repository's description file. This file is typically used to provide foundational information on the repository's contents.

Repo

Refers to a unique GitHub web-based folder where project files are stored, edited, and updated over time.

Trigram

Refers to a group of three consecutive words found in the text data.

Word Cloud

Refers to visual representations of the text and where the text's relative size is determined by word frequency found in the text data.

Exploratory Questions

1. What are the most common words in READMEs?

Across all Metaverse related repos and programming languages, we learned that key words such as "href", "detail", "summary", "open", and "project" were prominent words used when describing Metaverse repositories. This inferred to us that the "Metaverse" topic may still be developing.

By identifying words such as "href" and "open" individuals or entities may be leveraging insights from other references or even repositories to better understand the "Metaverse" topic.

Words like "detail" and "summary" can highlight a "zooming-in" or "zooming-out" approach to defining both the practical and theoretical application of the Metaverse or associated topics.

2. Does the length of the README vary by programming language?

We found that the average length of Metaverse repositories differed by programming languages. On average, programming languages such as Rust, Python, TypeScript, C, and regular text tended to contain more words in their README.md files.

On average, programming languages such as HTML, CSS, and Google's "Go" languages contained the least number of texts.

The relatively high text volume from repos coding in "Rust" and "C" languages indicates the potential need for adaptable programming to create and store ideas.

The frequent use of common text may infer the use of exploring, recording, or summarizing ideas about Metaverse related findings.

3. Do different programming languages use a different number of unique words?

Yes, our analysis showed that primary programming languages tended to differ in the number of unique words/texts found in their README.md files.

True programming languages such as Java, Rust, and Python - to include text writing languages such as TypeScript and regular text contained the highest number of unique words when describing repositories.

Since Java is considered a "high-class" programming language, this may be consistent with the high number of unique characters or text found in their respective README.md.

Consistent with previous findings, languages such as C, Rust, Python, TypeScript, and Text represents to us the relatively high-frequency use and thus, unique words found in repos that code in these languages.

We also found that words such as "project" and "app" consistently appeared across most languages.

4. Are there any words that uniquely identify a programming language?

Yes, across the 11 programming classification languages studied we found the following three (3) words/text that uniquely identifies the individual language.

C

"opensource"
"inventory"
"sound"

CSS

"h1"
"landing"
"screenshot"

Go

"hyperspace"
"global"
"argument"

HTML

"metaversity"
"immersive"
"vision"

Java

"chance"
"accidentally"
"committing"

Python

"weronikazak"
"hackathon"
"linkedin"

Rust

"german"
"payment"
"attempt"

Solidity

"metaudio"
"audio"
"clean"

Text

"international"
"snowcrashdao"
"art"

TypeScript

"playground"
"explorer"
"blockbynumber"

Other

"welcome"
"escape"
"play"

Modeling

Algorithms Tested

Decision Tree
SVM (Support vector machine) classifier
KNN (k-nearest neighbors) classifier
Naive Bayes classifier
XGBoost

Top 5 Model Results

Model	Train	Validate
XGBoost, TF-IDF (1,3)	71%	49%
XGBoost, Bag of Words (1, 2)	73%	47%
SVM, TF-IDF (1, 3)	53.6%	45.4%
Extra Tree, TF-IDF (1, 3)	71%	47%
XGBoost TF-IDF (1, 4)	71%	49%
Baseline	31.7%	31.7%

XGBoost Performance Through Test Dataset

XGBoost	Accuracy	Relative % Difference
Baseline	31.5%	-
Train	71%	123%
Validate	49%	55%
Test (final)	49%	56%

Recommendations & Next Steps

Extract more GitHub "Metaverse" repositories over time
Expand the amount of data acquired from Metaverse related repositories
Experiment different ways of categorizing programming languages
Rather than encoding lower frequency program languages to higher associated languages -- explore these program languages and evaluate against model predictions
Research and study additional "Metaverse" related topics or programs being built through "Rust", "Python", "TypeScript", and "C" which on average contained the highest README.md word counts

Project Delivery

Presentation

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
reference_documents		reference_documents
wordclouds		wordclouds
.gitignore		.gitignore
README.md		README.md
acquire.py		acquire.py
final_nlp_notebook.ipynb		final_nlp_notebook.ipynb
metaverse.csv		metaverse.csv
model_performance.xlsx		model_performance.xlsx
modeling.ipynb		modeling.ipynb
prepare.py		prepare.py
repos.csv		repos.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Metaverse: NLP Multi-Classification Modeling

Table of Contents

Project Description

Project Goal

Process: In Brief

Steps to Reproduce

Data Dictionary

Exploratory Questions

1. What are the most common words in READMEs?

2. Does the length of the README vary by programming language?

3. Do different programming languages use a different number of unique words?

4. Are there any words that uniquely identify a programming language?

Modeling

Top 5 Model Results

Recommendations & Next Steps

Project Delivery

About

Releases

Packages

Contributors 3

Languages

MBM-nlp/github_classification_project

Folders and files

Latest commit

History

Repository files navigation

The Metaverse: NLP Multi-Classification Modeling

Table of Contents

Project Description

Project Goal

Process: In Brief

Steps to Reproduce

Data Dictionary

Exploratory Questions

1. What are the most common words in READMEs?

2. Does the length of the README vary by programming language?

3. Do different programming languages use a different number of unique words?

4. Are there any words that uniquely identify a programming language?

Modeling

Top 5 Model Results

Recommendations & Next Steps

Project Delivery

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages