ai code snippets
satya - 1/12/2024, 4:38:09 PM
Similar word set
import gensim.downloader as api
# Load a pre-trained Word2Vec model (or any other KeyedVectors model)
word_vectors = api.load("word2vec-google-news-300")
# Given set of words
given_words = ["king", "queen", "man"]
# Calculate vectors for each word in the given set
word_vectors_set = [word_vectors[word] for word in given_words]
# Calculate the mean vector of the set
mean_vector = sum(word_vectors_set) / len(word_vectors_set)
# Find similar words to the mean vector
similar_words = word_vectors.similar_by_vector(mean_vector, topn=10)
# Print the similar words and their similarity scores
for word, score in similar_words:
    print(f"{word}: {score:.4f}")
satya - 1/12/2024, 7:23:44 PM
is there a word2vec model online that I can run queries against in a browser?
satya - 1/13/2024, 1:18:26 PM
NLTK
satya - 1/13/2024, 1:18:50 PM
Getting stem words using NLTK
from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import word_tokenize
text = "Long sentences"
# list of strings
words = word_tokenize(text)
print(words)
stemmer = EnglishStemmer()
#list of strings
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
satya - 1/13/2024, 1:20:56 PM
Lemmatizing with NLTK
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "some sentence"
words = word_tokenize(string_for_lemmatizing)
print(words)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
satya - 1/13/2024, 1:22:11 PM
Key classes and functions of NLTK
- EnglishStemmer
 - word_tokenize
 - WordNetLemmatizer
 
satya - 1/13/2024, 1:29:14 PM
NLTK uses
- Text Processing: Tokenization, stemming
 - Part-of-Speech Tagging: label each word with its grammatical role
 - Parsing: for sentence structure
 - Named Entity Recognition (NER): identify proper names such as people, organizations, and places
 - Sentiment Analysis
 - Machine Learning for Text Classification
 - Text Corpora and Lexical Resources: Brown corpus, WordNet, linguistic databases
 - Text Summarization
 - Concordance Analysis: frequency of words and their contexts (see the sketch after this list)
 - Language Learning and Teaching
 - Research and Experimentation in Linguistics and NLP
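A minimal sketch of the concordance / frequency-analysis item above; the sample sentence is made up for illustration.
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize
nltk.download('punkt')
text = "the quick brown fox jumps over the lazy dog and the fox runs"
tokens = word_tokenize(text)
# Frequency of each token
freq = FreqDist(tokens)
print(freq.most_common(3))   # e.g. [('the', 3), ('fox', 2), ...]
# Concordance: show each occurrence of a word in its context
nltk.Text(tokens).concordance("fox")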
 
satya - 1/13/2024, 1:45:37 PM
Useful Python segment for testing library files
def localTest():
    print ("Starting local test")
    print ("End local test")
if __name__ == '__main__':
    localTest()
satya - 1/13/2024, 1:52:46 PM
For some NLTK functions to work, download these data packages first
import nltk
nltk.download('punkt')
nltk.download('wordnet')
satya - 1/13/2024, 4:09:59 PM
Additional nltk initializations
from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
satya - 1/13/2024, 4:20:50 PM
Here is how you navigate a ChunkedTree
import nltk
text = "John and Mary are living in New York City since 2020."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
chunked = nltk.ne_chunk(tagged)
for subtree in chunked:
    if isinstance(subtree, nltk.Tree):
        label = subtree.label()
        entity = " ".join([word for word, pos in subtree.leaves()])
        print(f"Named Entity: {entity}, Label: {label}")
satya - 1/13/2024, 4:21:24 PM
Here is an example of a chunked tree
(S
  (PERSON John/NNP)
  and/CC
  (PERSON Mary/NNP)
  are/VBP
  (GPE living/VBG)
  in/IN
  (GPE New/NNP York/NNP)
  City/NNP
  since/IN
  (DATE 2020/CD)
  ./.)
satya - 1/13/2024, 4:23:21 PM
More on the chunk tree
- The structure of the ChunkTree returned by ne_chunk() typically consists of nodes and leaves, where nodes represent named entity chunks, and leaves represent individual words or tokens.
 - Each node in the tree has a label indicating the type of named entity, and it can have children nodes and leaves that form a hierarchical structure; the sketch after this list walks those nodes and groups the entities by label.
 - (S ...): Represents the top-level sentence.
 - (PERSON John/NNP): Represents a named entity "John" classified as a person (PERSON).
 - (PERSON Mary/NNP): Represents a named entity "Mary" classified as a person.
 - (GPE New/NNP York/NNP): Represents a named entity "New York" classified as a geopolitical entity (GPE).
 - (DATE 2020/CD): Represents a named entity "2020" classified as a date (DATE).
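A small sketch (my own helper, not from the notes) that walks the tree described above and groups the entities by label:
import nltk
from collections import defaultdict
def entities_by_label(chunked):
    # chunked is the tree returned by nltk.ne_chunk(tagged)
    groups = defaultdict(list)
    for subtree in chunked:
        if isinstance(subtree, nltk.Tree):
            entity = " ".join(word for word, pos in subtree.leaves())
            groups[subtree.label()].append(entity)
    return dict(groups)
# For the example above this yields something like:
# {'PERSON': ['John', 'Mary'], 'GPE': ['living', 'New York'], 'DATE': ['2020']}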
 
satya - 1/13/2024, 4:29:06 PM
Sample code NLTK name recognition and chunking
# *********************
# Import and download some stuff!!
# You have to do this only once per session I believe
# *********************
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Key functions
# *********************
from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
def example():    
    ner_text = "Some sentence with peoples and corp names"
    #word list: strings
    tokens = word_tokenize(ner_text)
    print(tokens)
    #A list of key/value tuples
    pos_tagged = pos_tag(tokens)
    print(pos_tagged)
    # Chunked tree object
    result = ne_chunk(pos_tagged)
    print(result)
    #result.draw() 
    #this will open a new window with the tree rendering
satya - 1/13/2024, 4:49:36 PM
Parts of speech
def posTaggingExercise():
    text = """
    We hold these truths to be self-evident, that all men are created equal, 
    that they are endowed by their Creator with certain unalienable Rights, 
    that among these are Life, Liberty and the pursuit of Happiness.
    """
    words = word_tokenize(text)
    taggedWords = pos_tag(words)
    print(taggedWords)
    return taggedWords
# Example
[('We', 'PRP'), ('hold', 'VBP'), ('these', 'DT'), ('truths', 'NNS'),..]
satya - 1/13/2024, 4:50:54 PM
NLTK parts of speech attributes
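NLTK can describe its own tag set; a minimal sketch (the 'tagsets' data download is an assumption this helper needs):
import nltk
nltk.download('tagsets')
# Print the meaning of a single tag, e.g. NNP: proper noun, singular
nltk.help.upenn_tagset('NNP')
# A regular expression lists several related tags at once
nltk.help.upenn_tagset('VB.*')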
satya - 1/13/2024, 7:48:23 PM
What we have done so far with word2vec
- How to distinguish between a script (executable) and a library file in python. Use this mechanism to test functions in library files by executing the library file
 - Use list comprehensions to process each record in a list and make a new list from the result
 - Find vector representations for a word (gensim word2vec)
 - Find the most similar words in the corpus for a given word
 - Find the most similar words for a set of words in a list
 - Find the dissimilar word in a list (see the sketch after this list)
 - Average words by averaging their vectors
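A minimal sketch of the remaining gensim calls mentioned above, assuming word_vectors is the model loaded earlier:
# Most similar words in the corpus for a single word
print(word_vectors.most_similar("king", topn=5))
# The word that does not belong with the others
print(word_vectors.doesnt_match(["breakfast", "lunch", "dinner", "king"]))
# Raw 300-dimensional vector for a word
print(word_vectors["king"].shape)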
 
satya - 1/13/2024, 7:49:38 PM
What we have done so far with NLTK
- Find the stem word for the many variations of the same word
 - Find the lemmatization of a word
 - Tokenize a sentence into words
 - Categorize or tag words in a sentence as to their grammatical "parts of speech"
 - Identify nouns and their classification in a sentence (Ex: Proper names, organizations, Geopolitical entities, dates etc.). Uses a concept called "chunking"
 
satya - 1/13/2024, 8:04:55 PM
Idea of a list comprehension in python
A language critique:
#
# Conceptual record processing
# in any language, with python as an example
# In python these are called list comprehensions.
#
# Take this procedural idea for an example
for every-record in a list
   do-something with that record
   store that record in a list
#
#This is expressed in python as
#
[do-something for every-record] #put each processed record in a list
#
# Now your target container can be set or a dictionary as well
# in addition to a list
#
{do-something for every-record} #put each record in a set
{do-something-for-akey: do-something-for-value for every-record} #Put it in a dictionary
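A concrete sketch of the three forms above, using a made-up list of words as the records:
records = ["King", "Queen", "Man"]
# List: process each record into a new list
lowered = [word.lower() for word in records]            # ['king', 'queen', 'man']
# Set: the target container is a set instead of a list
lengths = {len(word) for word in records}               # {3, 4, 5}
# Dictionary: compute a key and a value for each record
word_lengths = {word: len(word) for word in records}    # {'King': 4, 'Queen': 5, 'Man': 3}
print(lowered, lengths, word_lengths)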
satya - 1/19/2024, 5:46:16 PM
Hugging Face home page: https://huggingface.co/
satya - 1/19/2024, 6:23:18 PM
Where to get the API keys
- as a link: /settings/account: access tokens
 - or in ui: icon, profile, settings
 - or link: /settings/tokens
 
satya - 1/19/2024, 6:23:31 PM
You have to verify your email first for this to work
satya - 1/21/2024, 7:28:32 PM
Where is hugging face text inference api request and response documented?
satya - 1/21/2024, 7:34:26 PM
Here is a list of inputs and outputs to the api
satya - 1/21/2024, 7:42:04 PM
Each type of task has different inputs and outputs to the API
satya - 1/21/2024, 7:43:14 PM
Some task names
- Question Answering
 - Summarization
 - Text Generation
 - Text Classification
 - Named entity recognition
 - Translation
 - ...
 - etc.
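To illustrate that inputs and outputs differ per task, here is a hedged sketch of a summarization call; the model name (facebook/bart-large-cnn) and the summary_text output key are recalled from the API docs, not taken from these notes, so verify them there:
import requests
# API_TOKEN is your Hugging Face access token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
payload = {"inputs": "A long article to be summarized ..."}
response = requests.post(API_URL, headers=headers, json=payload)
# Summarization returns a list of {"summary_text": ...} dicts,
# a different shape from other tasks
print(response.json())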
 
satya - 1/21/2024, 7:43:51 PM
Inputs and outputs to the text generation task are documented here
satya - 1/21/2024, 7:44:57 PM
Example input
import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
data = query({"inputs": "The answer to the universe is"})
# There are other parameters other than input
# See the api docs
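A hedged sketch of a fuller payload for the text generation task; the parameter names below are from the Inference API docs as I recall them, so check the docs before relying on them:
data = query({
    "inputs": "The answer to the universe is",
    "parameters": {
        "max_new_tokens": 50,       # length of the generated continuation
        "temperature": 0.7,         # sampling temperature
        "return_full_text": False,  # return only the new text, not the prompt
    },
    "options": {
        "wait_for_model": True,     # block until the model is loaded
    },
})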
satya - 1/21/2024, 7:45:49 PM
Return value is either a dict or a list of dicts if you sent a list of inputs
satya - 1/21/2024, 7:46:25 PM
Example
data == [ 
  {"generated_text": 'hello'}
]
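So, assuming a single input was sent, the generated text can be read back like this:
print(data[0]["generated_text"])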