One example conversation, recorded while a Python script that ChatGPT had written for me (to identify broad patent claims, in part using least common words [LCW] per patent class [CPC] in patent claim texts) was running; see the shared session.
Andrew Schulman
Okay, that’s running. Some things I’d like you to brainstorm with me on, give me pros/cons or alternatives:
1. This is becoming helpful in starting to find broad claims, and that’s good. But if, for example, there were CPC classifications to go with each patent number (as we had with the PatentSBERTa csvs), then we could add some testing to see whether the claim scope/weight as determined with TF-IDF on the one hand aligns with CPC on the other. I may be able to get CPC codes for these; well, of course I can, but it may be time-consuming and/or expensive right now. So let’s assume we don’t have them. What else could each patent/claim be compared against? What other predictions might be made from the TF-IDF score, predictions that could then be checked?
2. One other potential external attribute is whether the patent was litigated. In other sessions, we’ve worked with the PTO’s litigation dataset. For better or worse, there are only about 50,000 or 60,000 distinct patents in the particular litigation dataset I have. And the time period it covers doesn’t necessarily overlap cleanly with the time period here; without that overlap, it would be difficult to tell for any given patent whether it hasn’t been litigated, only whether it has. ChatGPT and I toiled over this in another session. You came up with some reasonable solutions, but basically it’s a small dataset.
3. In addition to the 31 GB patent_claims_fulltext.csv file we’re using here, there’s a related 28 GB pgpub_claims_fulltext.csv file. Perhaps use that too, not merely as a source of yet more patent claims, but to exploit the fact that these pgpub claims were not necessarily granted. Might there be a way to use this with the list of granted patents, e.g. to use what we know about granted patents to “predict” whether a given patent application would be likely to be granted? Well, sure, but that is probably a job for something more than these TF-IDF scores; more likely a job for SentenceTransformers and a neural network trained to predict likelihood of grant. But leaving that aside, is there something meaningful to be done with the pgpub claims, besides just adding a ton of additional claims?
4. We’re using TF-IDF. Is it time to move to e.g. GloVe? Dumb question in a way, since it all depends on what I’m trying to do with this data. Basically, it’s to see whether there is some simple, predictable, useful way to determine the scope of a given claim against the universe of patent claims. This might be used, for example, in patent-portfolio ranking. Given this as one goal, give me pros/cons on TF-IDF vs. GloVe. One difference we’ve discussed in other sessions is interpretability or explainability, with TF-IDF winning out there.
5. Another big goal is, given a patent claim, find similar claims. That is mostly a job for sentence embeddings and probably SentenceTransformer, but I can already see in the partial output a tendency for sorted TF-IDF scores to group together patent claims that might be considered to have been “separated at birth”. I might want a loop that lets the user input a claim (either by text or by patent/claim number), and shows the 5 or 10 claims with the closest scores. But somehow TF-IDF doesn’t sound suitable for that. GloVe? SentenceTransformer?
6. Finally, is there something that could be done with a [MASK] of selected tokens, to see whether it can be predicted? We discussed this a bit in another session. For example, if the least-common token in a claim is [MASK]ed, can we predict it? If so, that might be indicia of its not being particularly novel/non-obvious (using those terms informally, not as the legal terms of art). Okay, that’s a lot for you to chew on…
ChatGPT
Thought for a second
Below are some thoughts, pros/cons, and ideas for each of your points. This is a brainstorming exercise, so not all suggestions will be equally feasible or directly implementable, but they’re meant to help you consider your next steps.
1. Without CPC codes, what else can we compare claim scope/weight against?
- Comparable Classifications:
- If CPC codes are unavailable, you might consider other classification systems (e.g., USPC classes, IPC codes) if you have them. They may not be as fine-grained as CPC, but they offer a way to anchor claims in a technology domain.
- Citation Network:
- Patents cite prior patents. If you have citation data, you can measure how “broad” claims relate to the number or diversity of citations. Broader claims might cite fewer, more fundamental patents, or conversely might sit in a wide and diverse citation network (a minimal correlation sketch appears after this list).
- Temporal Data:
- Look at the era a patent was filed. Broader claims might have been more common in older patents (before examination standards tightened) or vice versa. Comparing claim breadth over time could validate whether your TF-IDF measure correlates with known historical shifts in patent prosecution.
- Prosecution Data:
- If available, consider characteristics like the number of office actions, rejections, or amendments. A broad claim might have undergone many rounds of narrowing. If you can get that data, you might see a correlation.
- Family Size or Geographical Coverage:
- The size of the patent family or how widely it’s filed internationally might correlate with claim breadth. Companies tend to seek broad protection internationally. If you have family and jurisdictional data, you could look for correlations.
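As a concrete illustration of the citation idea above, here is a minimal sketch of checking whether a TF-IDF-based breadth score tracks forward-citation counts. It assumes you have already merged the two into one table; the file and column names are hypothetical.

```python
# Minimal sketch: correlate a TF-IDF-based breadth score with forward-citation
# counts. Assumes a merged table with hypothetical columns 'patent_id',
# 'tfidf_breadth', and 'citations_received' (e.g., from the NBER citation data).
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("breadth_with_citations.csv")  # hypothetical merged file

# Spearman is rank-based, so it tolerates the skewed, long-tailed distribution
# typical of citation counts.
rho, p = spearmanr(df["tfidf_breadth"], df["citations_received"])
print(f"Spearman rho={rho:.3f}, p={p:.3g}")

# Median citations by breadth decile, to see whether any relationship is monotonic.
df["breadth_decile"] = pd.qcut(df["tfidf_breadth"], 10, labels=False, duplicates="drop")
print(df.groupby("breadth_decile")["citations_received"].median())
```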
2. Using Litigation Data
- Partial Overlap / Sparse Data:
- Even if the litigation dataset is small and covers a limited set of patents, you can still look for patterns. For example, do litigated patents have any correlation with TF-IDF-based broadness measures? Even a small correlation or directionality might be informative.
- Enrichment via Negative Sampling:
- If litigated patents are relatively rare, select a random sample of non-litigated patents from the same era and see whether the TF-IDF scores differ in aggregate. While you won’t be sure every non-litigated patent truly had zero litigation risk, a rough comparison might still yield useful insights (see the sketch after this list).
- Focus on Certain Tech Areas:
- If you know which tech fields are well-represented in your litigation set, you can focus on that subset to get a better controlled comparison.
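As a rough sketch of the negative-sampling idea above: compare breadth scores for litigated patents against a same-size, same-era random sample of non-litigated ones. The file and column names are hypothetical, and the Mann-Whitney test is just one simple choice.

```python
# Minimal sketch: compare TF-IDF breadth scores for litigated patents against a
# random sample of non-litigated patents from the same grant years. The columns
# 'patent_id', 'tfidf_breadth', and 'grant_year' are hypothetical.
import pandas as pd
from scipy.stats import mannwhitneyu

scores = pd.read_csv("claim_breadth_scores.csv")                  # output of the TF-IDF script
litigated_ids = set(pd.read_csv("litigation.csv")["patent_id"])   # PTO litigation dataset

litigated = scores[scores["patent_id"].isin(litigated_ids)]
pool = scores[~scores["patent_id"].isin(litigated_ids)]

# Sample the same number of non-litigated patents, restricted to the same grant
# years, as a crude control for era effects.
control = (pool[pool["grant_year"].isin(litigated["grant_year"].unique())]
           .sample(n=len(litigated), random_state=0))

stat, p = mannwhitneyu(litigated["tfidf_breadth"], control["tfidf_breadth"])
print(f"Mann-Whitney U={stat:.0f}, p={p:.3g}")
print("median breadth, litigated:", litigated["tfidf_breadth"].median())
print("median breadth, control:  ", control["tfidf_breadth"].median())
```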
3. Using pgpub_claims_fulltext.csv Data
- Applications vs. Granted Patents:
- One angle: Compare the distribution of TF-IDF scores (or any broadness measure) in granted patents vs. published applications. Are granted patents generally “broader” or “narrower” as measured by TF-IDF?
- If you had a dataset of applications that eventually granted vs. those that didn’t, you might see if breadth correlates with likelihood of grant. Without that info, you could still compare distributions of claim scopes between granted and published-only sets.
- Temporal or Procedural Insights:
- PGPub claims are often amended or narrowed before grant. Comparing an application claim’s initial TF-IDF (broadness) to its final granted form (if you can match them) might show how claims evolve through prosecution.
- Additional Baseline Data:
- At the very least, pgpub claims provide a broader corpus for building a more stable global vocabulary and IDF statistics. More data could mean more stable and representative IDF values.
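A minimal sketch of that last idea: fit the vocabulary and IDF statistics on a combined granted + pgpub corpus, then compare a crude breadth measure (here, the mean IDF of a claim’s terms) across the two sets. This assumes small extracted samples rather than the full 31 GB / 28 GB files, and the file and column names are placeholders.

```python
# Minimal sketch: use pgpub claims to stabilize the IDF statistics, then compare
# a crude breadth measure (mean IDF of a claim's in-vocabulary terms) between
# granted and published-application claims. Assumes pre-extracted samples with a
# hypothetical 'claim_text' column; the real files would need chunked reading.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

granted = pd.read_csv("granted_claims_sample.csv")
pgpub = pd.read_csv("pgpub_claims_sample.csv")

vec = TfidfVectorizer(min_df=5, stop_words="english")
vec.fit(pd.concat([granted["claim_text"], pgpub["claim_text"]]))
idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
analyze = vec.build_analyzer()  # same tokenization/stopword handling as the vectorizer

def mean_idf(text):
    vals = [idf[t] for t in analyze(text) if t in idf]
    return np.mean(vals) if vals else np.nan

granted["breadth"] = granted["claim_text"].map(mean_idf)
pgpub["breadth"] = pgpub["claim_text"].map(mean_idf)
print("granted, median mean-IDF:", granted["breadth"].median())
print("pgpub,   median mean-IDF:", pgpub["breadth"].median())
```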
4. TF-IDF vs. GloVe for Determining Claim Scope
- TF-IDF Pros:
- Very interpretable: you know which terms contribute to breadth vs. narrowness.
- Easy to implement and scales well with large text.
- TF-IDF Cons:
- Ignores context. “Platform” could mean something different in a software vs. mechanical context, but TF-IDF treats it uniformly.
- Purely frequency-based: rare but substantively unimportant words can still make a claim look “narrow.”
- GloVe (or other static word embeddings) Pros:
- Captures semantic relationships between words: words that appear in similar contexts get similar vectors.
- Potentially reduces emphasis on mere rarity and focuses more on semantic content.
- GloVe Cons:
- Less interpretable: harder to say why a claim is broad or narrow.
- Still context-agnostic at sentence level (words have only one embedding).
- Requires a large corpus and training time. Pre-trained GloVe might not reflect patent-specific jargon well.
For scope determination, TF-IDF’s clarity can be very beneficial. GloVe could help cluster semantically similar claims, but it might not directly correspond to the legal concept of breadth.
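To make the interpretability point concrete, here is a minimal sketch of pulling out which terms carry the most weight in one claim’s TF-IDF vector; the two example claims are placeholders for the real corpus.

```python
# Minimal sketch: show which terms contribute most weight to one claim's TF-IDF
# vector -- the kind of term-level explanation that dense GloVe-style vectors
# don't provide directly.
from sklearn.feature_extraction.text import TfidfVectorizer

claims = [
    "A method comprising receiving data over a network and storing the data.",
    "An apparatus comprising a cyclonic separator coupled to a dust collection vessel.",
    # ... in practice, the full claims corpus
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(claims)
terms = vec.get_feature_names_out()

row = X[1].toarray().ravel()  # the claim we want to explain
top = sorted(zip(terms, row), key=lambda tw: -tw[1])[:5]
for term, weight in top:
    print(f"{term:20s} {weight:.3f}")
```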
5. Finding Similar Claims
- TF-IDF for Similarity:
- You could represent claims as TF-IDF vectors and compute cosine similarity. This would find claims with similar token distributions. It’s a baseline approach and might group similar technologies reasonably well, but will fail to understand synonyms or paraphrases.
- GloVe for Similarity:
- A bit better than TF-IDF alone because embeddings capture some semantic similarity. Still, averaging word vectors loses sentence structure and nuance.
- SentenceTransformer (Contextual Embeddings) for Similarity:
- Likely the best approach for semantic similarity. Models trained on patent texts (like PatentSBERTa) can handle patent-specific language.
- Produces embeddings at the sentence/claim level, capturing synonyms and context.
- You can easily find top-5 or top-10 similar claims by cosine similarity in embedding space.
- Less interpretable, but much better semantic matching.
If similarity search is a key goal, a SentenceTransformer-based approach seems the most promising.
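A minimal sketch of the top-k lookup with a SentenceTransformer model; the model id and the example claims below are assumptions, and any sentence-embedding checkpoint would slot in the same way.

```python
# Minimal sketch: embed claims once, then find the top-k most similar claims to a
# query by cosine similarity. The model id is an assumption; substitute whatever
# PatentSBERTa checkpoint you are actually using.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")

claims = [
    "A method comprising receiving data over a network and storing the data.",
    "An apparatus comprising a cyclonic separator coupled to a dust collection vessel.",
    "A system for transmitting information across a communications network and saving it.",
    # ... in practice, the full set of claim texts
]
corpus_emb = model.encode(claims, convert_to_tensor=True, normalize_embeddings=True)

query = "A method of receiving information via a network and writing it to storage."
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {claims[hit['corpus_id']]}")
```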
6. Using a [MASK] for Prediction
- Idea:
- If you mask the least-common token and ask a language model (like a BERT-based model) to predict it, the model’s confidence could indicate how predictable (and thus less unique) the token is.
- Potential Insight:
- If even a rare token is easy for the model to guess from context, maybe it’s not truly “novel” in a meaningful sense. A token that the model finds hard to predict might be genuinely unique or out-of-distribution, suggesting more uniqueness or narrowness.
- Caveats:
- Language models trained on general corpora may not understand patent-specific rare terms well.
- The logic that “predictable = not novel” is heuristic. Many tokens are predictable due to syntactic patterns, not conceptual commonness. Conversely, some genuinely novel terms might be guessable by a model that has seen similar morphology or contexts.
Still, this could be an interesting experiment: measure how often a masked token can be correctly inferred. It might provide a “semantic predictability” measure alongside TF-IDF-based rarity.
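A minimal sketch of that experiment with a generic fill-mask model; a patent-tuned BERT would likely behave differently, and the claim text and target token here are placeholders.

```python
# Minimal sketch: mask a chosen token in a claim and see how much probability a
# masked-language model assigns to the original token. A high score suggests the
# token is predictable from context; a low score suggests it is more surprising.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

claim = "An apparatus comprising a cyclonic separator coupled to a dust collection vessel."
target = "vessel"  # in practice, the least-common token per your IDF table
masked = claim.replace(target, fill.tokenizer.mask_token, 1)

# Probability assigned to the original token (single-wordpiece targets only;
# rarer patent jargon often splits into multiple wordpieces and needs more care).
preds = fill(masked, targets=[target])
print(f"P({target!r} | context) ~ {preds[0]['score']:.4f}")

# The model's unconstrained top guesses for the blank, for comparison.
for p in fill(masked, top_k=5):
    print(f"{p['token_str']:15s} {p['score']:.4f}")
```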
Summary of Brainstorm:
- Without CPC, look at other metadata (citations, time, family) or classification systems, or the textual evolution of claims.
- Litigation data is sparse, but partial insights might still emerge; consider careful sampling.
- pgpub claims could be used as a reference set or to compare granted vs. non-granted claim breadth.
- TF-IDF is straightforward and interpretable, while GloVe could add some semantic nuance but at the cost of interpretability. SentenceTransformers are best for similarity tasks.
- Masking tokens and testing predictability by a language model could offer another dimension of “uniqueness” or “broadness,” though it’s a heuristic and requires careful interpretation.
All these considerations depend heavily on your end goals—whether you prioritize interpretability, semantic understanding, or practical application like predictive modeling.
Andrew Schulman
This is exactly the sort of commentary I was hoping for! A few follow-ups:
1. You’re absolutely right to refer to citations. I have NBER (Nat’l Bureau of Economic Research) data listing from-citations and to-citations for patents, roughly in the same time period. This is the data used in the 2002 book “Patents, Citations, and Innovations: A Window on the Knowledge Economy” by Jaffe and Trajtenberg. I think there’s an update to the data files, and even if not, at least some of it will overlap. There is the usual problem that citations-to are a lagging and noisy indicator. But you’re absolutely right, this would be a very interesting comparison, in both directions: are citations-to good predictors of claim scope? Is claim scope related to whether a patent gets cited (even though it’s the entire specification that is the target of the cite)?
2. The litigation data from the PTO is something ChatGPT and I have enhanced with CPC data, the claims text, and the name of the assignee (which ChatGPT has given me code to normalize). The original data has both the plaintiff and defendant names. At the very least, given that we have a subset of patent claims with associated CPC codes etc., we could interleave these with the claims for which the current Python code is computing TF-IDF scores. Perhaps mark the ones from the litigation dataset with “LITIG:” as a prefix to the patent number.
2b. Come to think of it, something similar could be done with the smaller set of patent claims/CPC codes that we have from the PatentSBERTa train/test csvs, after filtering out the incomplete claim text. In some cases they’ll overlap with what we have in the big PTO csv, but we could prefix the numbers with “PatBERT:” or something, get the TF-IDF scores, and they would at least appear interleaved (see the sketch at the end of this message).
3. pgpub: I’ll come back to this; thinking more about your comments.
4. TF-IDF: yes, no semantics is a problem.
5. Yes, SentenceTransformers with PatentSBERTa is the way to go for this. You’ve already given me (in a different session) working code based on a smaller set of claims (the very train/test data used to build PatentSBERTa). I would simply want to tweak that code to use this big PTO data in addition.
6. Very insightful comments on [MASK], predictability, etc. For now, let me look into what I have for #1 citations. On #2 and #2b, tell me if you see any problem adding in more data using LITIG: and PATBERT: prefixes to the patent numbers.
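For concreteness, here is roughly what I have in mind for the prefixing and interleaving; just a sketch, with made-up file and column names:

```python
# Sketch of the interleaving idea: tag patent numbers from each source with a
# prefix, concatenate, and run everything through the same TF-IDF scoring.
# The columns 'patent_id' and 'claim_text' are hypothetical.
import pandas as pd

pto = pd.read_csv("pto_claims_sample.csv")        # extract from patent_claims_fulltext.csv
litig = pd.read_csv("litigation_claims.csv")      # claims joined to the PTO litigation dataset
patbert = pd.read_csv("patentsberta_claims.csv")  # PatentSBERTa train/test claims, filtered

litig["patent_id"] = "LITIG:" + litig["patent_id"].astype(str)
patbert["patent_id"] = "PatBERT:" + patbert["patent_id"].astype(str)

combined = pd.concat(
    [pto[["patent_id", "claim_text"]],
     litig[["patent_id", "claim_text"]],
     patbert[["patent_id", "claim_text"]]],
    ignore_index=True,
)
# Note: claims that appear in more than one source will be double-counted in the
# IDF unless deduplicated; the prefixes at least make such duplicates easy to
# spot in the sorted output.
```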
ChatGPT
Thought for a second
No inherent problem, but do think about how mixing the datasets affects the TF-IDF statistics themselves. A few additional thoughts that might help: