Tips for using AI to create content at scale

Header image created with DALL-E-3

“We need to be using AI more. Make it happen.”

In 2023, over half of the in-house marketers we surveyed told us they’re under pressure from their boss to use AI to create content faster and cheaper. (Check out more results from that survey, if you’re curious.)

To truly capture the exponential efficiency gains that AI promises, though—and the saved time and money that business leaders are pressuring their teams for—one writer going back and forth with ChatGPT to draft a blog post isn’t going to cut it.

The techniques we’re going to discuss here are focused on content at scale. Think hundreds of local landing pages, thousands of product descriptions for an e-commerce website, or dozens of new service pages for agency clients.

AI is perfect for these use cases because using humans to write thousands of product descriptions, for example, is cost-prohibitive, but having those descriptions is still important for SEO. More importantly, by using AI to craft high-quality descriptions, you can actually provide value to your audience in an area where your competitors are likely just using Mad Libs-style templates—if they’re doing anything at all.

Trying to use ChatGPT out of the box for this, though, will drive you nuts. Combine the API of your favorite LLM with an automation tool like Zapier plus some advanced prompting, though, and you’ve got magic.

This is an example of an extremely basic flow I’ve used in Make, a Zapier alternative. What you can’t see is that the API call to Anthropic includes a sequence of 13 different prompts and responses.

This type of process doesn’t work for all tasks. With the right use case, though, leveraging AI in this way allows for exponential efficiency gains, rather than the marginal gains that come from an individual user working with ChatGPT directly.

The challenge of this heavy-duty usage is that it requires a deeper understanding of how to get what you want from AI. You no longer have an individual person going back and forth with an LLM in an interface, able to adjust prompts and ask for changes on the fly. You need to build prompts that work for all the different situations in which they’ll be used and that get you the results you need, even when there may not be human intervention until the final step. 

We’ve been using AI in this way for certain tasks since September 2022 and have learned a ton about what works and what doesn’t. Using these six tips can save you hundreds of hours of work and dramatically improve the quality of your content. Some of these are good strategies no matter how you’re using AI. When you’re using it at scale, though, they become that much more important and might require you to think about them in slightly different ways.

1. Use variables to customize prompts

This is the basic building block of prompting at scale. Rather than using a prompt like “Write an outline for a 1500-word article about brute force attacks,” the prompt you write will be something like this:

USER: Write an outline for this article.
Topic: {topic}
Word length: {word_length}

When running this prompt, I’ll then replace the variable {topic} with “brute force attacks” and {word_length} with “1500.” (Well, *I* won’t—the program or Zap I’ve created will do it for me. That’s the beauty of using prompts programmatically.)
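If you’re curious what that substitution looks like in code rather than in a Zap, here’s a minimal sketch in Python. It assumes OpenAI’s Python client and an API key in your environment; the template, function name, and model are illustrative, not our exact production flow:

```python
# Minimal sketch of filling prompt variables programmatically (illustrative, not our exact flow).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

PROMPT_TEMPLATE = (
    "Write an outline for this article.\n"
    "topic: {topic}\n"
    "word length: {word_length}"
)

def generate_outline(topic: str, word_length: int) -> str:
    # Swap the {variables} in the template for this article's actual values
    prompt = PROMPT_TEMPLATE.format(topic=topic, word_length=word_length)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_outline("brute force attacks", 1500))
```

A Make scenario or Zap does exactly the same thing with a no-code interface: it pulls the variable values from a spreadsheet row or form submission and drops them into the template before calling the API.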

Creating prompts that use variables in this way requires a “greatest common factor” approach. For any single given topic, there will no doubt be a prompt that could do better—but it wouldn’t do as well for other topics. The goal is to find a prompt template that works decently well across all the topics you’re writing for, and then make use of variables to customize it.

🏠 In-house marketers: If you’re creating content for a single brand, your prompts can likely be more specific because all the content you’re creating falls within the same industry and should follow the same style guide. 

🗂 Agency folks: If you’re creating content for multiple clients, your prompts will have to be more general, and you’ll likely make greater use of variables for things like voice and industry.

2. Use the right size buckets

While testing a prompt to create outlines for articles of different lengths, I might find a prompt that works well on long articles but not on shorter ones. At that point, I’d have to decide whether I want to maintain two different prompt flows and send articles one way or another based on their word count, or find a prompt that works decently well for both.

This is the constant tension when using prompts at scale: How big should your “buckets” of use cases be? The larger your bucket, the more variables you’ll need to use to customize the prompt for each use case. You may also see a decrease in quality because you’ll be using the same prompt in situations where a different prompt would perform better. You will, however, save time on testing and building different flows.

Here’s an example at one end of the spectrum: Instead of using a dedicated prompt to create outlines, I could use a single prompt to create outlines, introductions, articles, etc. That prompt might look something like this:

USER: Write an {content_type} for this article.
topic: {topic}
word count: {word_count}

In addition to filling in the other inputs like topic and word count, I would then also replace {content_type} with “outline” or “introduction,” depending on what I needed.

In my experience, the time I might save from only having to build a single prompt flow for all those content types is not worth the drop in quality I would see across some of them. In other words, that bucket of use cases would be too big.

On the other end of the spectrum, I could use a different outline prompt for every word count increment. One of those prompts might look like this:

USER: Write an outline for this 1000-word article. There should be about six main sections, with appropriate subheadings in each of them.
topic: {topic}

By using a different prompt for each word count and designating the approximate number of sections each should have, I might get better (or at least more consistent) outlines, but would they be enough better to merit the time spent building those different flows? Probably not.

In this case, my bucket of use cases would be too small, and I’d be doing a lot of unnecessarily repetitive work. I could probably use a single prompt for every word count (by including the {word_count} variable) and still get close to the same results simply by reminding the LLM to make sure the outline is an appropriate length for the word count.

Despite the officialness of my Goldilocks diagram, it’s worth noting that the right-sized bucket isn’t universal. If you’re creating two types of articles that should each have a very different structure—for example, tutorial articles that should follow a step-by-step format with a list of what you’ll need at the top vs. case studies that should follow a problem/solution/results format—using a single prompt for “outlines” could be too big a bucket for you. Instead, you’d likely want to create two different prompts for those two kinds of articles, with each prompt detailing the specifics of the format you’re looking for. This approach would give you results that are enough better to justify the effort of building those two flows.

The most extreme example of a small bucket is no bucket at all, i.e., using a unique prompt every single time. Again, in many cases you can get better results that way, but you lose all the efficiency gains of using AI at scale.

3. Take advantage of few-shot prompting

“Show, don’t tell” is one of the golden rules of prompting. You can get better results by providing a few examples of what you’re looking for, rather than trying to describe what you want. This is known as “few-shot” prompting as opposed to “zero-shot” prompting, which is when you ask it to do something without including any examples of what you want.

Let’s say I work for an agency creating content for twelve different clients, and I’m using AI to create outlines. I want every outline to have a few common elements:

  1. The first heading should be “Introduction”
  2. The second heading should be “What is ” + the primary keyword
  3. The last heading should be a call to action that references the client’s business

The best way to get AI to give me what I need consistently is to include examples of what I’m looking for in the prompt. Each example outline should meet all those requirements and show what a “good” outline looks like to me.

My final prompt might look something like this:

USER: Write an outline for this article.
topic: Understanding the Risk of Brute Force Attacks
word count: 1200
primary keyword: brute force attack
business: Hank's Digital Security Solutions

1. Introduction
2. What is a brute force attack?
3. Common types of brute force attacks
  A. Credential stuffing
  B. Password cracking
  C. Distributed brute force attacks
4. Impact of Brute Force Attacks
  A. Data breaches
  B. Financial losses
  C. Reputational damage
5. How to protect your business against brute force attacks
  A. Strong password policies
  B. Two-factor authentication
  C. Account lockout mechanisms
  D. Intrusion detection systems
6. Prevent Brute Force Attacks with Hank's Digital Security Solutions

Write an outline for this article.
topic: How a Brand Ambassador Can Boost Your Marketing
word count: 800
primary keyword: brand ambassador
business: Magic Marketing Solutions

1. Introduction
2. What is a brand ambassador?
3. The role of a brand ambassador in marketing
  a. Raising brand awareness
  b. Engaging with your audience
4. What to look for in a brand ambassador
  a. Professionalism and positivity
  b. Passion for the brand
  c. Excellent people skills
5. Find Your Next Brand Ambassador with Magic Marketing Solutions

Write an outline for this article.
topic: {topic}
word count: {word_count}
primary keyword: {keyword}
business: {business_name}

LLMs are great at following patterns. By including examples of what I’m looking for, the model will pick up on the common elements—that the first heading is always “Introduction,” the second is always “What is” followed by the keyword, and the final section always mentions the business—and incorporate those elements in the outlines it writes.

One important note here is to use a range of examples in your prompt, lest the model pick up on a pattern you hadn’t intended to convey. If, for example, you use three examples that are all “how to” articles with a numbered list of steps to teach the reader how to do something, it might try to follow that same pattern of creating a step-by-step guide even when asked for a different type of article. (To think of it in terms of “buckets” again: This is a situation where, depending on how many different types of articles you’re creating and how distinct they are, you may want to use separate prompts, each with their own examples, rather than a single prompt for all of them.)

Bootstrapping your way to good examples

Coming up with a few examples of “good” to use in your few-shot prompts can feel like an annoying waste of time. Depending on what I’m trying to create, I often will use AI to bootstrap my way there:

  1. I’ll ask ChatGPT for an outline without providing any examples. I’ll then edit that outline extensively until it aligns completely with what I’m looking for. 
  2. I’ll add that edited outline as an example in my original prompt and ask ChatGPT for another outline. The single example will help it get closer to what I want, though I will still need to spend some time editing this second outline as well.
  3. I’ll use both edited outlines as two examples in my prompt, and ask ChatGPT for a third. This time, the output should be even closer to what I’m looking for, and I can likely spend less time editing it.

I’ll repeat this process—using as many edited examples as I have in my prompt to generate the next one, and editing each new one until it’s “perfect”—until I have as many examples as I want.

It’s worth noting that I likely wouldn’t use this approach for something like introductions. I would want those examples to be fully human in order to get the best outputs from my prompt going forward—otherwise what I get back will be more likely to sound like AI. For something like outlines, though, which are less about the voice and word choice and more about the logical organization of information, AI can give me a decent starting point for those examples.

4. Ask for an analysis first

Also known as “giving the model time to think,” you can get better results by asking the model to analyze the task at hand before providing you with any deliverable. 

Here’s an example of what that could look like:

USER: You will be writing an outline for a given topic.

First, analyze the searcher intent. Consider the specific information that readers are seeking based on the topic and keyword. Think about how to optimize the headings for SEO. Provide your analysis inside <analysis></analysis> tags.

Then, create an outline that addresses the topic with specific headings and subheadings, ensuring that each section will directly answer the searcher's intent and contribute to the content goal. Provide the outline itself inside <outline></outline> tags.

topic: {topic}
keyword: {keyword}
content goal: {goal}
target audience: {audience}
word length: {word_count}

By telling the model to put its analysis inside <analysis></analysis> tags and the outline itself inside <outline></outline> tags, I can easily parse the response to get only the part I care about, i.e., the outline.
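Here’s a minimal sketch of what that parsing could look like in Python. A simple regex keyed to the tag names from the prompt above is usually enough (the sample response text is made up for the example):

```python
import re

def extract_tag(response_text: str, tag: str) -> str:
    """Pull the contents of a tagged block (e.g. <outline>...</outline>) out of the model's response."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response_text, re.DOTALL)
    return match.group(1).strip() if match else ""

# response_text would normally come back from the API call
response_text = (
    "<analysis>Searchers want a definition first, then prevention tips...</analysis>\n"
    "<outline>1. Introduction\n2. What is a brute force attack?</outline>"
)
outline = extract_tag(response_text, "outline")    # the only piece saved when running at scale
analysis = extract_tag(response_text, "analysis")  # worth reading while testing prompts
```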

When using the prompt at scale, I can ignore the analysis and only save the outline itself. While testing prompts, however, it can often be helpful to look at the <analysis> portion of the response as a way of understanding how the model is approaching the problem. If it’s referring too much to one portion of your instructions and missing something else, for example, that could be a sign that you’re trying to have it meet too many requirements at once. In that case, you should pare the prompt down to focus on the most important requirements.

5. Use a prompt chain rather than a single prompt

A prompt chain is when you use multiple prompts in a row, threading them together to provide more context for the model.

This approach can often get you better results than using a single prompt. It can be especially helpful in breaking down different things you want the model to focus on. I’ve found it works best if my first prompt focuses on the general principles I want the model to follow for whatever the piece of content may be, and the second prompt focuses on specifics that I do or don’t want.

For example, using the prompt above as my first prompt, I would get an outline that has taken the searcher’s intent into account and contains the most valuable information a reader would be looking for. However, I might have some other requirements I want the outline to follow, too, and I could include these in a second prompt.

USER: You will be writing an outline for a given topic.

First, analyze the searcher intent. Consider the specific information that readers are seeking based on the topic and keyword. Think about how to optimize the headings for SEO. Provide your analysis inside <analysis></analysis> tags.

Then, create an outline that addresses the topic with specific headings and subheadings, ensuring that each section will directly answer the searcher's intent and contribute to the content goal. Provide the outline itself inside <outline></outline> tags.

topic: {topic}
keyword: {keyword}
content goal: {goal}
target audience: {audience}
word length: {word_count}

ASSISTANT: {model's response, containing both the analysis and the actual outline}

USER: Now review the outline. Make the following changes as necessary:
- The first section heading should be "Introduction."
- There should not be any references to case studies or testimonials.
- Make sure the outline is tailored to the topic, providing specific names where applicable. Avoid using placeholders like "Item 1" or "Service A" and instead use actual names and descriptions that are current and relevant.
- Make sure the outline is appropriate for the requested word length and not too long.

Return only the revised outline inside <outline></outline> tags.

In my second API call, I would include all three of these messages in order (user, assistant, user).
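In code, that second call might look roughly like this. It’s a sketch continuing the earlier snippet, using OpenAI’s chat message format; the variable names are placeholders rather than our actual implementation:

```python
# Sketch of the second call in the chain, using OpenAI's chat message format (names are placeholders).
second_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": first_prompt},          # the original outline prompt
        {"role": "assistant", "content": first_response},   # the model's analysis + outline from the first call
        {"role": "user", "content": revision_prompt},       # the "now review the outline" prompt
    ],
)
revised_outline = second_response.choices[0].message.content
```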

By breaking up the requirements of what I want in this way, I’ve found the final outline will do a much better job of meeting them than if I put them all into a single prompt. You’ll also notice I put all of the specific formatting requirements into the second prompt. This is because if you split them between the first and second prompt, at times it might inadvertently “undo” requirements from the first prompt when revising the outline in the second prompt.

Using multiple prompts is also an easy way to allow your flows to work across more use cases. If I were an agency, for example, I might use the same first prompt across all my clients, and then customize only the second prompt with each one’s specific style guide requirements.

Note on the last line in the second prompt: It’s possible the original outline could already meet all my requirements. In that case, if I hadn’t included that final line about returning only the outline inside tags, the model might respond with something like “This outline is well-suited to the topic. It includes specific names and is appropriate for the word count” etc. Because I’m using these prompts at scale and will be delivering the output of the final prompt directly to the customer, I don’t want it to tell me the outline already meets my requirements—I just want it to give me the final outline. By specifying that I only want it to return the revised outline, I’m ensuring that I’ll get a consistent output I can use without having to review it.

6. Test your prompts

I’ve already written about my process for testing prompts, and I highly recommend coming up with your own system if you don’t yet have one.

LLMs have gotten good enough that the first prompt you try will likely get you an okay response. But when you’re using AI at scale, the difference between “okay” and “great” on hundreds or thousands of outputs can add up to hundreds of additional hours of human work needed to make your content publish-ready. It’s well worth the time spent testing to find the prompt that gets you “great” out of the gate.

It’s also important to remember that LLM behavior can change over time, so a prompt that works for you one month might get different results later. Test early, test often.


Using AI to create content at scale is an entirely different ballgame than becoming a ChatGPT power user. If you’re working on your process and want to chat strategies, roadblocks, writing your own shitty functional python code to access the OpenAI API, the risk of human extinction by AI, or anything else, reach out at megan@verblio.com.

If you don’t actually want to deal with this stuff yourself but need to get your boss off your back about using AI, check out our hybrid human-AI content to get all the efficiencies of AI without having to write a single prompt.


Marketers Sound Off: More responses from our 2023 Digital Content Survey

In our 2023 State of Digital Content Survey, we asked marketers,

What's the biggest difference in being a content marketer today from, say, two years ago?

We included many of the responses in our report, but there were so many good ones, we thought we’d share a few more. Three themes stood out:

Increased pressure
The influence of AI
Changes in what's working

Marketers are under pressure

Content cannot just be 'nice looking & informative', today it must be short, meaningful and fixing a specific customer's issue to be relevant.

It's more of a challenge to get content to stand out and rank well.

The biggest difference is the pressure we have to produce high-quality content faster than ever. I find that my leadership team doesn't understand how AI plays into the creation process, so they think it can do all the work. I'm in a position where I not only need to make the content, I also need to educate others on what's realistic and what's not.

The bar is higher. You cannot stop.

The demand for better content that hits the client's brand guidelines while also achieving the marketing goals. Clients don't always see the end goal and only focus on their brand elements.

You just have to produce better content, faster.

The influence of AI

The flood of shitty content generated by AI (yes, even GPT4) and content farms has only got stronger. And for all that Google claims to reward quality and helpful content, we're still battling for SERP space with absolute garbage.

More people are aware of content thanks to the proliferation of news about ChatGPT. So, I find clients are more informed when we have our initial conversations.

AI has leveled the playing field for smaller and startup companies. Content marketing was often unapproachable on limited budgets but there are many tools now that can increase the speed and quality of publishing content even for niche subjects at an obtainable price.

Everyone is now a content marketer! This wasn't the case a few years ago.

Changes in what's working

We’re becoming more intentional and journalistic in our approach. SEO is hygiene rather than the goal.

Tying content activities to revenue is more important.

It's not all blog posts anymore. We can be creative - in fact, people expect us to be creative. Content has morphed into this world of immersive experience, and it's a gateway to building pipeline. Content feels way more critical now.

You need to prove experience & expertise in a real way these days.

It feels like most companies use a relaxed or chill tone of voice. Not that it's getting less professional, but it's getting more personal.

Distribution is a bigger part of the content marketing strategy. It might be the most important part.

How I tested 55 prompts to create better outlines

Header image created by DALL-E-3 using this prompt: “I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS: A robot writing an outline on a whiteboard in an office. Organize the outline into several sections labeled with roman numerals, and make sure that underlying subsections are indented properly in relation to their main headers.” There’s a lot to unpack in that prompt, as well as in the image—look out for a DALL-E deep dive coming soon to a Verblog near you.

Step into the prompt testing lab

If you’re using AI to create content and don’t think you need to bother testing prompts, skim my last article on why prompt testing is so important.

In this article, I’m going to share how I recently tested 55 prompt variations to change the way we generate outlines for customers of our human-crafted AI content.

My goal here is to help you think about your own testing process. You might have different goals or be using your prompts to do something different than generate outlines, but the general principles and framework are helpful regardless of your use case.

A quick glossary:

  • Prompt variation: If I test multiple different prompts aimed at the same goal, e.g. writing an outline, those are prompt variations. A given variation could be a single prompt, or it could include multiple prompts in a chain.
  • Input: I’m using “input” to refer to the specific variables used within a prompt. Creating prompts with these variables in place allows you to reuse the same prompt over and over.
  • Output: Output refers to the LLM’s response to a prompt. In ChatGPT, this is the response you see in the window. Via the OpenAI API, this is the response in the response.choices[0].message.content field. When using a prompt chain, I’m using “output” to refer to the final output (i.e., the one with the content I actually want, rather than the model’s intermediate responses).

Two commandments of LLM testing

1. Define good as quantitatively as you can

Testing LLMs often starts with an ambiguous idea of “I want to see which prompt gets me a better output.” Some of that “better” may be subjective, and there’s no way around that. But coming up with at least a few quantitative measures will make it much easier to evaluate the outputs of the prompts you’re testing, even if it’s just knowing a general range you want that measure to fall within.

Example metrics I’ve used for different tests:

  • Word count: When generating introductions, for example, I wanted to keep them in a certain word count range.
  • Reading level: In order to target a certain reading level, I automated running the prompt outputs through a tool like Readability to compare the reading levels. (If I had read this article on GPT-4’s ability to evaluate readability first, I would have just used that model instead of a separate tool. Heads up, that article contains a ton of statistical concepts but is well worth skimming if you’re interested in readability at all.)
  • Number of times a keyword is used
  • Whether a prohibited word is used
  • Length relative to the original: For example, I was building a tool to remove some of the fluff from AI-generated content and rewrite it more concisely. I cared about how long the rewritten text was relative to the original because I didn’t want to pare it down too much, but I also wanted to be sure it wasn’t making it longer. Word count alone wouldn’t have told me what I needed to know—I needed to evaluate the output relative to the specific input.
  • Runtime: If someone will be waiting in real-time for the output, I don’t want to use a prompt chain that takes minutes to run.

Most likely, you won’t be able to reduce all of your evaluation to quantitative metrics. At some point, you’re actually going to have to review the outputs and decide for yourself that “It was the best of times; it was the worst of times” is a stronger opening sentence than “It was an era characterized by both joy and sadness.” At the very least, though, having some metrics in place will allow you to eliminate certain outputs off the bat, reducing the number you need to manually review.

Using AI to evaluate outputs

Wondering if you can use AI to help you qualitatively evaluate outputs? Research suggests that GPT-4 can reach 80 percent agreement with human preferences—which, for the record, is the same level of agreement humans reach with each other. I’m wary of relying exclusively on this approach, though, because I did my own testing with it over the summer, and the results weren’t exactly confidence-inspiring.

How I tested: I presented a pair of options to GPT-4 and asked it to evaluate which one was a better example of a certain voice. I used a low temperature to reduce the variability, and ran the exact same prompt with the same choices twice—and then reran the prompt with the same two choices but reversed the order in which the choices were given.

In total, GPT-4 compared the same two choices four times. I did this for 274 different pairings, and the model only had unanimous agreement with itself (meaning it chose the same choice all four times, regardless of whether that choice was presented first or second) on 53 percent of those pairings.

Pie chart of agreement outcomes across the pairings (n = 274)

That’s the pink pie slice above. The second most common outcome (the purple slice) was the model choosing each option in the pair twice, meaning its choice was entirely arbitrary.

It’s worth highlighting that these stats only measure GPT-4’s consistency when evaluating pairs and don’t even begin to address whether its choice was actually “right,” i.e., whether it would match a human evaluator’s preference. Precision and accuracy: You need both if you’re going to use AI as an evaluation tool.

All this isn’t to say that using AI to judge its own outputs is impossible. I could no doubt raise its level of consensus (precision) by improving the prompt that asks it to evaluate the two pairs, and providing examples of my own choices in that prompt would likely help to align it more closely with a human’s preferences (accuracy). That takes even more time and testing, though.

Bottom line: If you outsource qualitative evaluation to AI without putting in a lot of time first to make sure its evaluations are A) consistent, and B) aligned with your preferences, your results won’t be very good.

2. Test on multiple inputs

If you’re using AI at scale to create content, you need to test your prompts on multiple inputs. Unless you’re using a very low temperature, LLMs will give you different outputs every time, even for the same input, and their performance will vary even more across different inputs.

Be sure, too, that your inputs represent the range of how you’ll be using that prompt. If I create content for several different industries, for example, I’m going to make sure the inputs I use for testing aren’t all from a single industry. Similarly, if I want to use the same prompt to generate outlines for articles ranging from 600 to 2000 words, I’m going to include a range of word counts in my inputs. Otherwise, I might end up with a prompt that generates great outlines for 2000-word articles, but not for 600-word articles.

For testing a prompt to create outlines, for example, I might use a spreadsheet of inputs that looks like this:

spreadsheet showing information for six different articles

Each row represents a different set of inputs. I would run the same prompt six times, each time replacing variables in the prompt like {topic} or {word_count} with the actual values from one of the rows.

My prompt testing process

With those principles in place, let’s take a look at how I tested 55 different prompts to generate outlines for our customers. I’ll cover what I was looking to improve, the tools and process I used to test the different prompts, the resulting metrics, and how I evaluated the winning prompt.

What I wanted to improve

I wanted to make a few specific improvements to the outlines that were being generated for our customers:

  • Shorter outlines: Our existing outlines often included too many sections, resulting in the final article being too long for the designated word count.
  • Reduced risk of hallucination: If the outline included sections like “Case Studies,” “Testimonials,” or “References,” AI would inevitably try to make up that information when writing the article, which meant extra work for our human writers. I wanted to improve our process to prevent the AI from including those sections at all.
  • Better outlines for the format: For example, if the customer’s topic is a listicle like “X Best VPNs,” the headings in the outline should each be a specific VPN rather than “VPN #1,” “VPN #2,” etc., and those sections should comprise the bulk of the article. I also wanted to make sure our outlines did a better job keeping the reader’s intent in mind and covering the information they’d expect to see when searching for the customer’s keyword.

A final consideration for the quality of our customer experience, though not for the quality of the content itself, was how long it takes for the outline to be generated. Because customers are in our app waiting in real-time for the outline to appear so they can review and edit it before finalizing their order, it matters whether they have to wait ten seconds or a minute.

We really want our customers to review and edit the outline so we can be confident we’re covering what they want. The longer they have to wait, the less likely they are to do that.

The process

Google Sheets and Google Colab are my best friends.

In one sheet, I came up with an initial list of prompt variations. In some cases, the difference between two prompts would be just a few words. In others, they would look totally different. Here’s an example:

Prompt variation #1
write an outline for the topic: {topic}
word length: {word_count}

As you can see, I started extremely simple to understand what the LLM would do with minimal direction. For other variations, I used more sophisticated prompting strategies:

Prompt variation #5
You will be writing an outline for a given topic. First, think through how the article should be structured, given the searcher intent for the keyword. Provide these thoughts inside <analysis></analysis> tags. Then, provide the outline itself inside <outline></outline> tags.
topic: {topic}
keyword: {keyword}

In a second sheet, I stored the brief information for 30 different real articles that already had been ordered and delivered to our customers, along with the outlines that originally were generated for those. 

screenshot of Verblio's content order form
We’ve intentionally kept our content order form minimal and structured, but our prompts still need to account for a wide range of inputs.

The next step involved using OpenAI’s API. If you’re not comfortable writing code but have access to a low- or no-code tool like Make or Zapier, you could access OpenAI’s models that way instead. Either way, it’s far easier than copy/pasting prompts and outputs from a ChatGPT window, and it’s the only viable way to do real testing at scale.

Using a Python program in a Colab notebook, I sent a prompt to the model (mostly either GPT-4 or GPT-3.5-turbo). Each prompt was created by replacing the variables in one of the prompt variations from the first sheet with one of the 30 sets of inputs from the second sheet—and repeating that until I had prompted the model with every combination of prompt variation and inputs. The program then automatically saved the resulting outlines to a third sheet.

screenshot of python code in a Google Colab notebook
This is the main part of my code, where I’m grabbing article inputs and prompt variations from two different sheets, and running each set of inputs through each prompt variation.
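For anyone building something similar, here’s a heavily simplified sketch of that kind of loop. It is not the actual code from the screenshot: the real version reads from and writes to Google Sheets, while this one stands the sheets in with plain Python lists, and the model, column names, and prompt text are illustrative assumptions:

```python
import time
from openai import OpenAI

client = OpenAI()

# Stand-ins for the two input sheets: prompt variations and article briefs
prompt_variations = [
    {"id": 1, "template": "write an outline for the topic: {topic}\nword length: {word_count}"},
    {"id": 5, "template": "You will be writing an outline for a given topic...\ntopic: {topic}\nkeyword: {keyword}"},
]
article_briefs = [
    {"topic": "Understanding the Risk of Brute Force Attacks", "word_count": 1200, "keyword": "brute force attack"},
    # ...the real test used 30 briefs pulled from a Google Sheet
]

results = []  # in the real setup, each row gets written back to a third sheet
for variation in prompt_variations:
    for brief in article_briefs:
        prompt = variation["template"].format(**brief)  # extra, unused variables are simply ignored
        start = time.time()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        results.append({
            "prompt_id": variation["id"],
            "topic": brief["topic"],
            "outline": response.choices[0].message.content,
            "runtime_seconds": round(time.time() - start, 1),
        })
```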

For every new outline the model generated, I then evaluated the quantitative metrics I cared about, based on those improvements I identified above:

  • How much shorter was it than the outline we had previously generated for the customer using our existing prompt flow?
  • Did it include any sections we didn’t want to see, like case studies or references?
  • How long did it take to run?

I aggregated these metrics for each prompt variation and compared the overall results.

I couldn’t rely only on numbers, so I also manually reviewed the outlines to see whether listicles were properly formatted, if they made sense, etc.

I then iterated on the best-performing prompt variations to see if I could further improve the results, and did the same process again. And again. And again, and again, and again.

The results

By the end, I had tested 55 different variations of prompts, models, and temperatures. The results for some of them are in the chart below.

spreadsheet showing the results of testing different prompt variations
n = 30

First callout: You can see the results getting better (more green, less red) as I iterated further. This is why testing matters. You can make very real improvements, across multiple dimensions, that will mean significant time savings when you’re running these prompts over 100s of cases.

Columns B through E are all about how much shorter the new outline was than the one we had previously generated. Column F shows how long it took each prompt (or prompt chain, in some cases) to run, which is approximately how long our customer would have to wait in the app. Column G shows how many of the new outlines contained a section it shouldn’t have, like “Case Studies.”

Consistency matters

The main reason it’s so important to test your prompts on multiple inputs (30, in my case) is that the model will behave differently every time. This was very important for us when looking at how much shorter the new outline was than the old one.

The median reduction (column B) is self-explanatory, but if we’d looked only at that measure, we wouldn’t have learned anything about how consistent that prompt variation was across inputs. Looking also at the minimum reduction (column C) was important because this showed the worst-case scenario: Each prompt variation actually resulted in a longer outline than the original for at least one of the test articles. For prompt 41, that worst case meant getting an outline that was more than twice as long as the one we’d originally gotten with our current prompts. For prompt 55, on the other hand, that worst case was significantly better, with the new outline being only 10 percent longer than the original.

The maximum reduction (column D) isn’t color-coded because I wasn’t aiming for a particular percentage reduction, though the 84 percent reduction for prompt 43 is probably too high. What’s more important for understanding how consistently a prompt behaved is the spread between the minimum and maximum reductions: column E. The lower that number, the more consistent the outputs from that prompt were, which is what we want.

On runtime

Two main factors impacted the runtime (column F):

  1. the LLM being used
  2. the number of prompts, i.e., whether it was a single prompt or a prompt chain

The length of the prompt also impacts the runtime, but to a much lesser degree than those two factors.

The tradeoff is that you can often get qualitatively better results by using longer prompts or prompt chains, but it will then take longer to run. However, different models also have different runtimes. In general, older, smaller models are faster, while newer ones like GPT-4 are slower, due to both their size and higher traffic.

The winning prompt

The prompt variation that ended up being the best overall, on both quantitative and qualitative measures, was number 54.

You can see the results for prompt 54 met my original goals:

  • It consistently resulted in shorter outlines (but not too short!) and had a relatively low spread between the minimum and maximum reduction (column E).
  • The median runtime of 15 seconds (column F) wasn’t the lowest, but it was still less than half the average runtime for the prompt we were currently using.
  • It never included a section we didn’t want to see in the outline (column G).
  • When I reviewed the outlines manually, they were what we wanted in terms of quality and format, etc.

I’ll go into the exact prompting strategies in more detail in my next article, but in a nutshell, here’s what made prompt 54 work so well:

  • Giving the model time to “think”
  • Providing examples of what I wanted
  • Using a prompt chain (rather than a single prompt) to improve its accuracy at meeting specific requirements—but doing this on an older model to keep the runtime relatively low

Could I have kept going with more prompt variations and seen more improvements? Sure. But at some point, I wanted to get a better prompt into production so our customers could start seeing the improvements sooner rather than later.

Moral of the story: Test a reasonable amount, but don’t let perfect become the enemy of good. As the gains from new variations become smaller, you’ll want to declare a winner and get on with your life.

In the next article, I’ll get into the content of the prompts themselves by sharing the specific prompt strategies I tested and tips for writing prompts that work at scale. If you have questions about the testing setup, the Python code I used, or anything else, send a message to megan@verblio.com.

Why testing prompts is so important for AI content creation

DALL-E-3 prompt for the header image: “cartoon image of a scientist doing experiments in lab. There should be beakers containing mixtures of different letters, and the scientist is pouring one of those beakers into a robot who is holding a pen (this robot represents a LLM, and the beakers represent different prompts the scientist is testing in order to get the LLM to generate the results they want)”

tl;dr
– Testing different prompts can improve your outputs significantly.
– Even small formatting changes can make a big difference in prompt performance.
– You need to test prompts across different models and over time, as models behave differently from each other and can change over time.

If you’re using AI for content creation at scale, you (or someone on your team) should be spending a fair amount of time testing different prompts.

It’s easy to overlook this part of the process, especially as large language models (LLMs) like GPT-4 improve. If you were using GPT-3 in 2022 before the release of ChatGPT, for example, the need for testing was painfully obvious: A poorly formatted prompt would often get you gibberish. With today’s LLMs, though, you’ll likely get a decent enough output from the first prompt you try, making it easy to assume there aren’t many gains to be made.

Even as the floor of LLM performance has been raised, though (i.e., fewer responses that are complete gibberish), the ceiling has also been raised. You’ll leave a lot of capability on the table if you aren’t regularly testing to optimize your prompts.

If you’re using AI at scale to run prompts over 100s or 1000s of inputs, this testing becomes even more important. Even if better prompts only provide marginally better results on a single input, those improvements add up quickly when you’re using those prompts at scale.

In this article, I’m going to walk through recent research and AI happenings to explain three aspects of LLMs and prompting that make testing so important. In the next article, I’ll share my own most recent experiments for our human-crafted AI content offering to show what I tested, how I set it up, and what my results were.

But first: What do I mean by using AI at scale?

When I talk about using AI at scale to create content, I’m talking about using large language models through an API, where you’re using the same prompt template over and over again on different inputs. 

For example, you might have a prompt you use to generate an outline for an article. The basic prompt structure could look something like this:

Write an outline for this article.
topic: {topic}
word count: {word_count}
primary keyword: {keyword}

You’d then run this prompt on a list of briefs, automatically filling in each of the bracketed variables with the appropriate information for each article. You might then write the output of each prompt to a new Google Doc, or store the outline alongside the article information in whatever project management tool you use.

There are unlimited ways to set up a workflow to use AI programmatically like this, depending on what your current content creation tech stack looks like. For many workflows, you may want to use a tool like Zapier or Make as a no-code option. I’ll go into more of the technical stuff in other posts, but it’s important to understand that I’m not talking about you as an individual user working directly in ChatGPT or another LLM-enabled tool here.

Three reasons you need to test different prompts for AI content creation

If you’re only using ChatGPT for one-off tasks, it’s still worth testing different prompts if for no other reason than you might learn surprising things about how it behaves. When you’re using prompts at scale and running them on dozens of inputs at a time, though—which is where you really start to see the efficiency gains from AI—testing becomes a non-negotiable.

Here are three of the biggest reasons why:

1. Prompt details matter—even ones you think shouldn’t

Even tiny changes in a prompt can make a big difference.

A team from the University of Washington recently published a paper with the great subtitle of “How I learned to start worrying about prompt formatting.” If you don’t read academic papers for fun, this Twitter thread from lead researcher Melanie Sclar does a great job summarizing the findings.

The key takeaway, though, is this image:

This is WILD.

Here’s what it’s showing: The blue box shows the default formatting of their prompt. The small blue diamond shows how accurate the model’s output was when that prompt was used—roughly midway between 0.5 and 0.804 on the scale.

The other boxes show different ways of formatting that same prompt. In the top left box, for example, the space has been removed between “Passage:” and “<text>,” and “Answer:” and “<text>.” In the top right box, the passage and answer are both on the same line. In the bottom right, they’re on the same line and the colon has been removed. In the bottom middle, “Passage” and “Answer” have been written in all caps. In the bottom left, they’re in all caps and the colon has been removed. In each case, the dotted line shows where the accuracy of the model’s response to that prompt fell on the scale.

The important thing to note is that none of those changes impact the content of the prompt. A human reader would interpret these formats as all saying the same thing, and the model is still getting the exact same examples to help it understand what it’s supposed to do. The only things that are different are the spacing, capitalization, and punctuation. And yet, those arbitrary formatting choices impact the accuracy of its output tremendously, ranging from 3.6 percent accuracy when the space was removed in the top left box, to 80.4 percent accuracy when the passage and answer were on the same line with no colon, as shown in the bottom right box.

The takeaway here is to test variations of your prompt, even if it’s only changing small things you think shouldn’t make a difference. LLMs might be predictive, but their behavior is often far from predictable.

2. Every LLM is a special snowflake ❄

One other important takeaway from that paper: You won’t be able to find the best format for your prompt and simply use it forever on any model you want. From Appendix B.2:

“Table 3 shows that if format p1 has lower performance than format p2 under model M, there is < 0.62 probability that this trend would hold under another model M′ (random chance is 0.5). This weak relative order preservation suggests that prompt format performance in a model may not be extrapolated to a different model, or in other words, that there are no inherently good or bad formats.” (emphasis mine)

The corollary of the finding that there are no inherently good or bad formats is that every model behaves differently. Just like all humans are different, AI is not a single, monolithic entity that always behaves the same way. Each model has its own quirks and will respond to prompts in different ways.

When a new state-of-the-art model is released, it likely will be “better” than its predecessors, in the sense that it will be more capable of producing useful responses. However, it will also be simply different than whatever model you were previously using, so you can’t keep using the same prompts and expect to get the same or better results.

Anytime a new model comes out, or if you switch models for any reason, you need to test your prompts on that model.

If, for example, you wanted to switch over to Anthropic’s models during the OpenAI drama with Sam Altman last November, you would have quickly learned that the prompts you use with GPT-4 don’t work the same way with Claude-2. One immediately obvious difference is that Claude-2 is much more likely to preface its responses with friendly fluff like “Sure! Here’s the xyz you asked for… ,” or say things like “I apologize, upon reflection I do not feel comfortable providing advice to befriend or interact with Sasquatch (Bigfoot), as that could encourage harmful behavior. Instead, I suggest focusing content on appreciating nature, caring for the environment, and respecting wildlife from a safe distance” when asked to write an outline for a satirical article on how to befriend Sasquatch. (That’s a purely hypothetical example, of course.)

There are other, more significant differences between the two models, though, which mean you should expect to use different prompting strategies with each one.

Bonus: Check out the Chatbot Arena Leaderboard to watch LLMs engage in gladiator-style combat and see which ones are currently ranked the best, both by crowdsourced voting and by their performance on certain benchmarks.

3. Model behaviors can (and do) change over time

So, you should both test your prompts initially to find the most effective ones as well as re-test them anytime you’re changing models. That’s not all, though: You also need to re-test them periodically, even if you’re still using the same model.

Why? Because models are constantly being updated behind the scenes, and those updates can change their behavior.

In 2023, for example, there was a lot of talk around GPT-4’s performance degrading, with users saying it had gotten “lazier.”

Earlier that year, researchers at Stanford and Berkeley published a paper showing how the performance of GPT-3.5 and GPT-4 had changed significantly over time. While it’s not correct to say that the models had gotten strictly worse, it is definitely true that their performance had changed. If you’re using prompts in your content workflow—or if you’ve built a product atop LLMs—and the model’s behavior changes, whether for better or worse, you need to know that.

To better understand the results of that paper, check out this article for a great analysis of the findings and methods. It also gives a useful explanation of why model behavior changes (a phenomenon called “behavior drift”) and how that’s different than their actual capabilities changing, even though it may look similar to users:

“Behavior drift makes it hard to build reliable products on top of LLM APIs. The user impact of behavior change and capability degradation can be very similar. Users tend to have specific workflows and prompting strategies that work well for their use cases. Given the nondeterministic nature of LLMs, it takes a lot of work to discover these strategies and arrive at a workflow that is well suited for a particular application. So when there is a behavior drift, those workflows might stop working.”
– Arvind Narayanan and Sayash Kapoor, authors of AI Snake Oil

In other words? A prompt that gave you a great article outline yesterday might work very differently today. There could be subtle changes over time or a more dramatic shift, but either way: Continuous testing is essential.


These reasons are why I spend a significant amount of my time testing different prompts to use with the AI tools we’ve built for our writers. It’s important to keep in mind that even as the baseline of AI performance goes up, you can still get even better results by spending a bit of time playing around with different prompts and strategies.

Comments? Questions? Vehement disagreements? Reach me at megan@verblio.com, and check out my next article to learn my process for testing 55 different prompts for our own AI-assisted content.

Four Things to Know About GPT-4

On Tuesday, March 14, OpenAI released its latest machine learning model, GPT-4. While it hasn’t immediately rocked the world in the same way ChatGPT did, that’s mostly because there wasn’t a shiny new interface to go along with it. Trust us — it’s still incredibly exciting.

Thing #1: Multimodality isn’t here yet

Pre-launch, a lot of the hype around GPT-4 was about its being multimodal, or able to accept both text and images as input. Currently, to upload images you need access to the developer API, which is obviously not for everyone. For everyone else, GPT-4 still only accepts text input.

The hype around multimodality is likely warranted. Expanding the input options to both text and images could (should?) exponentially improve the potential output of the AI, and could pave the way for video, audio, and other multimodal inputs and outputs in the future. 

Thing #2: GPT-4 can accept much larger inputs

In the absence of multimodality, one of the most obvious ways GPT-4 differs from GPT-3.5 is that it can accept much larger inputs (and produce larger outputs, but that’s not going to be useful in the same way.)

The maximum number of tokens you can use at a time with GPT-3.5 is 4,096. With the base model of GPT-4, that max doubles to 8,192 tokens—and there’s even a second GPT-4 model that can handle up to 32,768 tokens.
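If you want to check whether a given text will fit in one of those context windows, OpenAI’s tiktoken library counts tokens for you. Here’s a quick sketch; the sample text is just a placeholder:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Paste the full article or report you want the model to read here."
num_tokens = len(enc.encode(text))
print(num_tokens)  # must stay under 8,192 for base GPT-4, or 32,768 for the larger GPT-4 model
# Rough rule of thumb: one token is about three-quarters of an English word.
```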

What does that mean in practice?

For starters, it means I can give GPT-4 OpenAI’s entire technical report (minus the appendices) on GPT-4 to read. (That’s over 5,000 words of content.) I asked it to summarize the report and call out any important information that was missing.

Here was GPT-4’s response: 

Prompt: Summarize the main points of this research paper. What important information is missing? (followed by the full text of OpenAI’s GPT-4 Technical Report)

This technical report summarizes the development of OpenAI’s GPT-4, a large-scale, multimodal model that accepts image and text inputs to generate text outputs. GPT-4 demonstrates human-level performance on various professional and academic benchmarks, such as scoring in the top 10% on a simulated bar exam. It is a Transformer-based model, with its performance enhanced using the post-training alignment process.

GPT-4’s primary capabilities include:

1. Processing image and text inputs, with potential applications in dialogue systems, text summarization, and machine translation.

2. Scoring well on exams designed for humans, with a performance on a simulated bar exam in the top 10% of test takers.

3. Outperforming previous language models and state-of-the-art systems on traditional NLP benchmarks, and performing strongly on the translated versions of MMLU in 24 of 26 languages considered.

The report also highlights the development of deep learning infrastructure and optimization methods that behave predictably across a range of scales. This enabled the researchers to make accurate predictions on GPT-4’s performance based on smaller models trained with 1/1,000th the compute.

However, GPT-4 exhibits several limitations, such as not being fully reliable, having a limited context window, and not learning from experience. Its capabilities and limitations pose novel safety challenges, with bias, disinformation, over-reliance, privacy, and cybersecurity among the risks identified.

The report refrains from discussing the architecture, hardware, training compute, dataset construction, or training methods of GPT-4 in detail due to competitive concerns and safety implications.

On the whole, GPT-4’s development marks significant progress in the field of AI and demonstrates considerable potential for transformative applications in natural language processing, while also emphasizing the need for ongoing refinement and attention to safety challenges.

The research paper does not disclose specific details about GPT-4’s hardware, architecture, dataset construction, or training methods. Getting more information on these technical aspects would be helpful for researchers looking to replicate or build upon GPT-4’s work.

With GPT-3.5 and earlier models, you couldn’t give the model an input as long as an entire technical report. This is a really cool advancement, as you can now provide the model with a lot more information as context.

This capability is especially useful since the model isn’t hooked up to the internet. The only way for it to have new information is if you provide it — and you can now provide it a lot more.

For contrast, if I ask what GPT-4 is without providing the technical report, here’s what I get:

Prompt: What is GPT-4?

GPT-4 is the hypothetical successor to OpenAI’s GPT-3, the state-of-the-art language model that has achieved remarkable results in a variety of natural language processing tasks. GPT-4, although not currently available, is expected to further improve on GPT-3’s capabilities, potentially enabling even more human-like text generation and understanding.

As a language model, GPT-4’s goal would be to generate human-like text based on the context of provided input. To understand how it might work, let’s dive into some real-life scenarios where the power of GPT-4 could be utilized.

Suppose you’re a writer

…and so on.

As far as GPT-4 knows, GPT-4 is still a hypothetical successor to GPT-3. Which makes sense, because obviously it couldn’t have been trained on text from a world in which GPT-4 already existed. In all the content the model has seen, GPT-4 is still a future development.

What this means, though, is that we can now get much better results from GPT-4 on things like new events or extremely in-depth topics, by providing it much more information in the prompt.
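As a rough illustration of what “providing it much more information in the prompt” looks like in code, here’s a sketch using the 2023-era openai Python package (the pre-1.0 interface). The API key and local file name are placeholders, not anything from OpenAI’s docs beyond the standard ChatCompletion call.

```python
# pip install openai  (this sketch uses the pre-1.0, 2023-era interface)
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Placeholder: the technical report saved locally as plain text
with open("gpt4_technical_report.txt") as f:
    report = f.read()

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Summarize the main points of this research paper. "
                   "What important information is missing?\n\n" + report,
    }],
)
print(response["choices"][0]["message"]["content"])
```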

In addition to what this improvement enables, it’s also really interesting to consider from an architecture standpoint. In order to accept more tokens, the model has to be able to recall and synthesize information over a much larger window. Was this done simply by building a larger model with more layers and parameters, or were fundamental changes made to how it processes and stores information?

Unfortunately, the lack of any answer to that question brings us to our third point.

Thing #3: OpenAI isn’t quite so…open…anymore

One fascinating thing about GPT-4 has absolutely nothing to do with its abilities. From OpenAI’s research paper on it:

This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

(Emphasis mine)

No further details about the model size, dataset, training…anything?

That is wildly not open. It’s also a big departure from OpenAI’s public research on earlier GPTs.

It’s also worth noting how at odds those two reasons for secrecy are: the competitive landscape, and the safety implications of large-scale models. “Safety implications” require caution and prudence, but a “competitive landscape” requires full steam ahead to beat out anyone else.

Leaving users in the dark about dataset construction and training method means that we’ll struggle to identify potential biases in the AI output. After all, human beings made the decisions about those training models and datasets, and those humans have implicit biases. The training data then also has built-in bias.

Eliminating that bias is messy, complex, and quickly descends into a rabbit hole of debate only enjoyed by philosophy majors and people who like commenting on local news articles. However, being aware of that bias is important for everyone using AI to create new content.

On a totally unrelated note, two other major AI advancements were released the same day as GPT-4: Anthropic’s Claude model and Google’s PaLM API. Since then, Anthropic has launched Claude 2 and Meta has thrown their hat in the ring with Llama 2. Claude 2 offers up to 100,000 tokens.

Clearly, this arms race is in full swing.

Thing #4: AI is becoming a star student (but still lies)

One of the most widely shared graphs from the launch shows GPT-4’s performance on various tests. It’s almost like OpenAI is still under the illusion, shared by high-achieving high schoolers everywhere, that standardized test scores in some way correlate to real-world success.

Lol.

What is worth noting, however, is that GPT-4 was not specifically trained to take any of these tests. This isn’t the case of an AI model being specifically trained to play Go and eventually beating the best human player; rather, its ability to ace these tests represents a more “emergent” intelligence.

Previous models like GPT-3 also weren’t trained to take particular tests, but, as you can see, GPT-4’s performance has improved significantly over GPT-3’s:

from OpenAI’s GPT-4 Technical Report

These graphs look good and have become staples of articles and press announcements featuring new models. But ask yourself: do you really want an AP English student – even a particularly skilled one – in control of your marketing messaging and copywriting? Me neither.

If you don’t care about AI’s ability to take standardized tests and just want to know how well it’s going to do what you want, this is still good news. From the report:

GPT-4 substantially improves over previous models in the ability to follow user intent. On a dataset of 5,214 prompts submitted to ChatGPT and the OpenAI API, the responses generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.

So, GPT-4 is more likely to give you what you’re looking for than GPT-3.5. That’s great. It’s important to keep in mind, though, that in spite of its improved performance, the new model still has all the same limitations we know and love from our existing AI friends.

Another quote from the report:

Despite its capabilities, GPT-4 has similar limitations to earlier GPT models: it is not fully reliable (e.g. can suffer from ‘hallucinations’), has a limited context window, and does not learn from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.

In fact, hallucinations could become an even bigger problem than they were, simply because the better the AI gets, the easier it will be to believe what it says. With GPT-3 and GPT-3.5, people are well aware the model will totally make stuff up because it happens so frequently. As newer and better models do that less frequently, there’s a greater risk that when they do hallucinate, we may fail to notice or fact-check it.

So stay vigilant, friends. But also, these are very exciting times.


The post Four Things to Know About GPT-4 appeared first on Verblio.

]]>
The Best AI Detection Tools of 2023—Or, When the Turing Test Just Won’t Cut It https://www.verblio.com/blog/ai-detection-tools Fri, 24 Feb 2023 05:01:41 +0000 https://www.verblio.com/?p=28459 Spare me your philosophy-loving preamble and get to the AI detection tools → In 1950, the Turing Test was devised as a way of assessing whether a machine can “think.” In its original form, it described a situation in which an evaluator asks questions of two subjects, A and B. The evaluator knows one of...

The post The Best AI Detection Tools of 2023—Or, When the Turing Test Just Won’t Cut It appeared first on Verblio.

]]>
Spare me your philosophy-loving preamble and get to the AI detection tools

In 1950, the Turing Test was devised as a way of assessing whether a machine can “think.” In its original form, it described a situation in which an evaluator asks questions of two subjects, A and B. The evaluator knows one of the subjects is a machine and one is a human, but not which is which. If, based on their answers, the evaluator can’t determine whether A or B is the human, the machine is considered to have “passed the test.”

In 2023, it’s clear the Turing Test is way too easy. Not only can a machine easily fool us when we don’t know it’s an AI, but we can even know we’re talking to a machine and still find it terrifyingly sentient.

Maybe it’s easier to be human than we thought?

The Turing Test isn’t the only thought experiment around machine cognition, though. Searle’s Chinese Room Argument from the 1980s, for example, is one of the best-known arguments against the Turing Test definition of intelligence… 

…which is all to say that the issue of how to distinguish man from machine was up for debate long before we got large language models capable of professing their love for a married reporter or learning Hemingway’s rules for writing. The question has only gotten more complicated since.

This is a B2B marketing blog though, so delivering a unified theory of sentience and intelligence is a tad out of scope for us. Instead, we’ll focus on the only two AI detection concerns you probably have right now:

  1. Can I publish AI-generated content without people (or Google) knowing?
  2. Is my freelance writer giving me AI-generated content without telling me?

We’ve already written about Google’s stance toward AI-generated content. (Key takeaway: You shouldn’t publish shitty content, regardless of whether you’re using humans or AI to create it. Shocking.)

When it comes to detecting AI-generated content yourself, though, you have options. Some are better than others. Let’s break it down.

Is there a way to detect AI-generated text? How?

To understand how most AI detection models work, you first need to understand how current AI models generate text. GPT-3 and others generate text by predicting what the next word should be based on what they’ve seen in their training data. Because of this, what they write is very…predictable. By definition.

What humans write, on the other hand, tends to be much less predictable. We reference things from our own experience, like the energy drink that’s getting me through a Thursday afternoon, or a relevant and humanizing anecdote from when I was a child obsessed with cheetahs and puzzles.

At the rate AI is progressing, an energy drink that may or may not turn my insides radioactive in 40 years is the least of my worries.

AI detection models use the predictability of AI-generated text to identify it as such. If an AI detection model were reading this article, for example, it would find the words “energy drink” in the last paragraph highly unpredictable, given that the rest of this article is about AI detection tools. Those two concepts don’t often go together. This and other instances of “random” words and phrases would suggest to the detection model that this article is too unpredictable to have been written by an AI.

If, on the other hand, the AI detection model scans something where every word is highly predictable, it will assume it was written by AI. Such an article would not be random enough to have been written by a human because we are by nature random, whimsical creatures.

Detection models can look at other features of text to identify AI content, too, and we’ll talk about some of those when we look at specific tools. The predictability or randomness of a text, though, is the primary one, so understanding how and why it is such a reliable predictor of AI-generated content is important.

A final word on how AI detection works: AI detection models don’t work on very short copy. A sentence like “We are the leading content marketing platform” is predictable and could easily be written by AI — but it has also been written by countless humans. One predictable sentence in isolation doesn’t mean something was written by AI. The longer a piece of text goes without saying anything random, though, the more likely it is to have been written by AI.

For this reason, most tools explicitly state that content needs to be at least 50 words long for the detection model to be at all reliable. Even if you’re using a tool that doesn’t say that, though, you shouldn’t expect good results if your content is less than 50 words. 
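To make the predictability idea concrete, here’s a rough sketch of the underlying principle using GPT-2 from the Hugging Face transformers library: score a passage by its perplexity, i.e., how surprised a language model is by it. This illustrates the general approach only; it is not how any of the commercial detectors below are actually implemented, and the threshold is invented.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Lower perplexity = more predictable text under this (toy) heuristic."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing the input ids as labels returns the average next-token loss
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

sample = "Your 50-plus-word passage goes here..."
score = perplexity(sample)
print(f"Perplexity: {score:.1f}")
# Invented cutoff purely for illustration; real detectors are trained classifiers
print("Suspiciously predictable" if score < 20 else "Reads more human")
```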

The results

Background: We’re a content creation platform, so we needed a way to know that the content our freelancers are submitting to our clients was actually written by a human. To that end, we’ve tested a lot of different AI detection tools. (Learn more about how we handle AI detection as a business.)

This table shows the results for six of the most popular AI detection tools on five different pieces of content:

How to interpret the percentages: For most AI detection tools, the percentage measures the likelihood that the content was created by AI, according to that particular tool. The first 9 percent result, for example, means that Originality.ai thinks there is a 9 percent probability that piece of content was created by AI. It does not mean that it thinks 9 percent of that piece of content was written by AI.

The exception to this is Writer, which does say their score refers to the percentage of content that was likely generated by AI.

We tested these six tools on five different pieces of content. Article #1 was a piece of pure human content. Article #2 was a piece of human-crafted AI content, which is our AI-assisted content offering where a human writer co-creates content with an AI model. Article #3 was copied and pasted directly from ChatGPT, with no human edits. Article #4 was a different article from ChatGPT based on the same prompt as article #3. Finally, Article #5 was a longer sample from ChatGPT.

We’ll walk through each of these tools in more detail and review what their scores mean in context.

Originality.ai

Originality.ai is the best AI detection tool we’ve tried so far, both in terms of its results and its usability at scale. For that reason, it’s the one we’ve implemented at Verblio to make sure customers who are paying for human-only content are getting just that.

Originality is super simple to use. First, copy and paste your text into their “Content Scan” tool.

Wait a few moments, and then…

Phew. Very relieved to know that I am “100% original” and not in fact a robot.

Boom — you get a score indicating how confident Originality’s model is that your content was produced by AI. Remember: This percentage refers to a likelihood, not to a percentage of the content. If, for example, the score was 65% original and 35% AI, it would not mean that 35% of the content was written by AI; it would mean there is a 35% chance the article was written by AI.

It can also check your text against existing content on the web for plagiarism, and, in this case, it successfully recognized that this content was pulled directly from my existing Jasper review.

The tool has an API, which is how we’re using it at Verblio to check every piece of content our writers submit. You can also enter a URL to scan an entire site, without having to manually check every page.
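If you’re wiring a detector into your own pipeline, the shape of that integration is simple: post each submission to the scan endpoint and flag anything above your threshold. The sketch below is generic on purpose; the URL, headers, and field names are placeholders, not Originality.ai’s documented API, so check their API docs for the real schema.

```python
import requests

API_KEY = "YOUR_API_KEY"                            # placeholder
SCAN_URL = "https://api.example-detector.com/scan"  # placeholder endpoint

def ai_likelihood(text: str) -> float:
    resp = requests.post(
        SCAN_URL,
        headers={"Authorization": API_KEY},
        json={"content": text},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumption: the response includes an AI-likelihood score between 0 and 1
    return resp.json()["ai_score"]

for submission in ["draft one...", "draft two..."]:
    if ai_likelihood(submission) > 0.8:
        print("Flag for manual review")
```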

Here’s how Originality did across our five samples:

You can see it did well with both the human and most of the pure AI content. It gave a low AI likelihood score (28%) to the first ChatGPT sample, which was apparently a tricky one for a few of these tools — Writer and Content at Scale struggled to identify it as AI-generated as well. This is a good reminder that these tools are still far from perfect, and you can get both false positives and false negatives. Overall, though, and across the thousands of articles we’ve currently run through the tool, Originality has performed very well.

Pricing: Originality.ai currently charges $0.01 per credit, and one credit will get you 100 words scanned. (If you scan for both AI and plagiarism, it will cost you twice the credits.) There are no platform fees, and no subscription is needed.

Nerdy stuff: Originality uses its own language model, based on Google’s BERT model, to classify text as either AI-generated or not. Their “how it works” page is not explained well at all. To be honest, that’s not a bad thing. It was clearly written by an engineer, and I’ll take a dense, jargon-filled description from the person who actually built the tech over vapid, hand-wavey copy from someone who doesn’t know a discriminator from a generator any day.

OpenAI’s AI Text Classifier

OpenAI’s tool is the newest one on this list (for now.) It’s also the most soberly presented: The OpenAI team is very upfront about the current limitations of AI detection tools. To be clear, these are the limitations of any of the tools on this list — some of them just aren’t as honest about that.

OpenAI says this tool is specifically intended to “foster conversation about the distinction between human-written and AI-generated content.” They are very clear that it should not be used as the only indicator of whether something has been produced by AI or not, that it can misidentify content in both directions, and that it hasn’t been tested on content produced by some combination of AI and humans.

In keeping with its more measured approach, the classifier doesn’t provide a percentage score like most other AI detection tools currently do. Instead, it just says whether the text is “Very unlikely AI,” “Unlikely AI,” “Likely AI,” or if it is “Unclear.”

This is an article on choosing eyeglasses that I copied and pasted directly from ChatGPT.

Unsurprisingly, OpenAI’s tool did very well at detecting AI content across our five samples:

It’s reasonable to assume OpenAI will have one of the best AI detection tools available, as their model is the one responsible for most of the AI-generated content on the internet right now. Unfortunately, they don’t have an API for this tool yet, so Originality.ai is still the best option for our use case: running AI detection across every one of the hundreds of daily submissions on our platform.

We wanted to know exactly how Originality stacks up against OpenAI’s tool, though, so we ran 36 additional pieces of content through both detectors. We intentionally chose pieces with a range of Originality scores to see how they aligned on both ends of the spectrum. In order to compare the results, we coded OpenAI’s responses as percentages:

  • Very Unlikely AI = 10%
  • Unlikely AI = 30%
  • Unclear if it is = 50%
  • Possibly AI = 70%
  • Likely AI = 90%
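In code, that coding is just a lookup table, which also makes it easy to line the two tools up programmatically. A minimal sketch; the scores below are invented examples, not our test data.

```python
# Map OpenAI's categorical verdicts onto rough percentages for comparison
VERDICT_TO_PCT = {
    "Very unlikely AI": 10,
    "Unlikely AI": 30,
    "Unclear if it is": 50,
    "Possibly AI": 70,
    "Likely AI": 90,
}

# Invented example rows: (Originality.ai score, OpenAI verdict)
results = [(92, "Likely AI"), (4, "Very unlikely AI"), (61, "Unclear if it is")]

for originality_pct, verdict in results:
    openai_pct = VERDICT_TO_PCT[verdict]
    gap = abs(originality_pct - openai_pct)
    print(f"Originality {originality_pct}% | OpenAI: {verdict} (~{openai_pct}%) | gap {gap} pts")
```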

Here are the results from both tools. You can see Originality’s score in the left column, OpenAI’s original verdict in the middle column, and our coding of that OpenAI verdict as a percentage in the right column. Green cells are least likely to be AI, and red cells are most likely:

You can see the two tools were pretty well aligned in their scores. More importantly for our purposes, there weren’t any instances in this batch where Originality rated a “Likely AI” piece as less than 80 percent.

There are some mismatches in the opposite direction, where Originality gave a high AI likelihood to an article that OpenAI said was unlikely AI or unclear. We’d rather deal with a false positive and manually review the submission and writer, though, than risk missing AI-generated content.

This is an important reminder, though, that AI detection tools can and will return false positives. One high score on a single piece doesn’t necessarily mean a writer is using AI. That’s why we run every submission on our platform through Originality, so we can use larger trends across our writers and customers to detect AI content.

I expect OpenAI’s tool to eventually become the best in the industry. There are two reasons for this:

The first is simply that as I mentioned, it’s OpenAI’s model that is currently generating most of the AI text on the internet via GPT-3 and ChatGPT. Presumably they have a leg up in figuring out how to detect the output of their own models, including possibly watermarking their content in the future.

But second and more importantly, OpenAI has very high motivation to get this right. They’re going to continue building and training AI models, and they’ll continue to need enormous datasets of content to do so. They’ll be getting most of that content from the internet. If they don’t figure out how to detect AI content and remove that content from their training data, they’ll be training future models on AI-generated content…and then those models will create more content that is used to train still newer models…and so on in a downward-spiraling game of telephone that takes us farther and farther away from original human content. Yikes. Here’s hoping they figure it out.

Pricing: OpenAI’s classifier is free to use.

Huggingface

Huggingface is the name in open-source machine learning projects, with various models, datasets, and code freely available. It’s one of the best resources on the web right now, if you’re interested in taking a more hands-on approach to AI.

The thing is, this isn’t actually Huggingface’s model. You’ll see it referred to that way across the internet, but it’s actually another, earlier OpenAI detection model. It did, however, use Huggingface’s implementation of the RoBERTa model, which itself was developed by Facebook.

Regardless of the name behind it, though, this AI detection tool was trained on content from GPT-2. On the one hand, that means it’s less likely to catch content created by newer AI models. On the other hand, it means this team was working on AI detection before it was cool and isn’t just riding the ChatGPT tidal wave.

This tool analyzes your content in real-time, so you can see how the results change as you add more text. No matter how long your content is, though, it will only scan the first 510 tokens.

In this case, it recognized that our latest experiment around editing Jasper content was in fact human-written with 99.98 percent certainty.

Despite not being trained on the latest generation of transformer models, though, Huggingface’s detection did really well across our five samples. Again, this isn’t surprising, given the expertise and experience of the teams behind it:

Pricing: Huggingface’s AI detection tool is free to use.

Copyleaks

Copyleaks has long been one of the best plagiarism detection tools on the market, and it’s the one we currently use at Verblio to make sure our customers are getting original, non-plagiarized content. They recently integrated AI detection into their product.

As with the other tools on this list, Copyleaks provides a web interface into which you can copy and paste your content to check it:

Yes, I wrote this. And I swear a digression on ice cream flavors was relevant.

In addition to saying whether it thinks the text was created by a human or AI, Copyleaks also provides a probability score when you hover over the text. Again, that probability score is similar to Originality.ai’s score in that it refers to the likelihood the content was created by AI, not a percentage of the content that it thinks was created by AI.

Their web interface currently maxes out at 25,000 characters, which is plenty for most articles.

Across our five samples, Copyleaks did pretty well. It didn’t give a score above 90 percent on any of the ChatGPT samples, which is weaker than the last three tools we looked at. It was, however, at least directionally correct on all of them:

Pricing: Copyleaks’ web interface is free to use. Pricing for their API starts at $9.99 per month for 25,000 scanned words.

Writer

Writer, which offers AI-powered content creation specifically targeted towards B2B businesses, also has an AI detection tool. 

It’s more a leadgen play on their part than a robust AI detection tool. Even when you access it through their API, it will only scan 1,500 characters at a time. It was also released before ChatGPT and the GPT-3.5 update, so don’t expect it to work as well on the latest and greatest AI content.

Here are the results for an article I copied and pasted from ChatGPT on choosing eyeglasses:

the Writer AI detection tool interface evaluating content from ChatGPT as 95% human

That’s unfortunate. Interestingly, when I add another paragraph from ChatGPT to continue the article, the score changes significantly:

This is a good example of the fact that, generally speaking, the longer the text, the more accurate the AI detection tool will be. (Though it does raise the question of why a tool would limit you to 1,500 characters.)

Overall, Writer did the worst at detecting AI content of the five tools we’ve dug into so far.

Like Originality.ai, it struggled especially hard with the first ChatGPT sample. It also only gave an 83 percent to the other two, which most of the other tools had no trouble identifying as AI-generated.

Pricing: Writer’s web interface is available on their site for free. It’s not yet available within the app itself, but you do get 500k words included via their API if you’re an enterprise customer.

Content at Scale

I’m including this in the list only because I’ve heard multiple people mention it over the last week. It’s not just an AI detection tool — according to the homepage, it’s also supposedly a magical platform that creates optimized content that is “so human-like, that it bypasses AI content detection! This means you are protected against future Google updates. Content at Scale is the only solution on the market that has advanced enough AI to pull this off.”

I call BS. “This means you are protected against future Google updates” is a huge claim and not one that anybody with any SEO experience would make.

Also, though, it did the worst of all the AI detection tools by a long shot in our testing. It didn’t score any of our three pure ChatGPT samples above a 28 percent:

As a result, I have very little confidence in Content at Scale’s tech either to detect AI content, or to create it.


I’ve included two additional tools that are interesting in the field of AI detection, but not ones you’d likely use for your own marketing content: GLTR and GPTZero.

GLTR

The Giant Language model Test Room, or GLTR, is my favorite AI detection tool, even if it’s somewhat out-of-date in a post-ChatGPT world.

Like Huggingface’s tool, GLTR was built to detect GPT-2-generated content, so it’s less effective against newer models and tools, most of which are built atop GPT-3. What it lacks in accuracy and recency, however, it makes up for in pretty colors and fascinating insights.

Again, this isn’t the tool you’re going to use on a regular basis to figure out whether your content was generated by AI. It can, however, help you understand how these detection models work and what kind of writing (and word choices) are seen as more predictable.

Here are the GLTR results for an article I wrote explaining GPT-3:

Whew. There’s a lot going on here. Unlike the other tools on this list, GLTR doesn’t provide a clear-cut “This is how likely it is your content was produced by AI” score. Instead, it highlights words according to how predictable they were. Green means a word was highly predicted by the model, while red and purple mean a word was less predicted. 

This provides a useful visual of your content: The most interesting words will appear in red and purple. In an AI-generated article, you’ll see almost entirely green words. You can also hover over a word to see what words were most predicted by the model.

In this case, it expected to see the word “you” with a 61 percent likelihood, followed by an 8 percent likelihood of seeing “a” and a 2 percent likelihood of seeing “the.” Instead, it saw the word “items,” so it highlighted it yellow for being less predictable.
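If you want to poke at GLTR-style coloring yourself, here’s a rough sketch of the same idea with GPT-2 and the transformers library: for each position, find how highly the model ranked the word that actually appears next. The green/yellow/red/purple cutoffs below are my approximation of GLTR’s bands, not its exact implementation.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The most interesting words will appear in red and purple."
ids = tokenizer(text, return_tensors="pt").input_ids[0]

with torch.no_grad():
    logits = model(ids.unsqueeze(0)).logits[0]

for pos in range(len(ids) - 1):
    probs = torch.softmax(logits[pos], dim=-1)
    actual = ids[pos + 1]
    # Rank of the token that actually came next among all of the model's predictions
    rank = int((probs > probs[actual]).sum()) + 1
    bucket = ("green" if rank <= 10 else
              "yellow" if rank <= 100 else
              "red" if rank <= 1000 else "purple")
    print(f"{tokenizer.decode([actual.item()])!r}: rank {rank} ({bucket})")
```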

If you’re thinking “There are still a lot of green words in that first screenshot, looks like it was AI-generated,” it’s important to understand that most of what we write as humans is predictable, to a degree. The rules of grammar and common usage mean only a certain subset of words can follow any given word, and the context of preceding words reduces that pool even further. For example, in English, verbs follow subjects, and a preposition can’t be followed by a conjunction. A string of truly unpredictable words — “into because tragic green lowering hero oil says” — would be complete gibberish.

Here are the GLTR results for something highly unpredictable, and consequently highly meaningless.

Kevin Indig did a great case study last fall showing how the results of human-generated content on Wikipedia compare to AI-generated text from a low-quality, spammy site. Check out how GLTR’s highlighting and histograms compare between the two pieces of content he tested to better understand this tool’s results.

Pricing: GLTR is freely available.

GPTZero

GPTZero was built immediately after the release of ChatGPT. It’s targeted at educators and fighting academic plagiarism, and it’s gotten a lot of press because people love it when college kids make stuff. 

It’s definitely designed for essays, not marketing blog posts that have been optimized for readability on the web with bulleted lists, etc. We shouldn’t be too hard on it, then, if it doesn’t do as well for our purposes. Nevertheless, I do want to call out one particularly amusing oddity that occurred when I ran the first 5,000 characters of this post on the best AI writing platforms through the tool.

Here’s what it gave me:

The highlighted sentences are the ones GPTZero thinks were most likely written by AI. Those were the only two it highlighted from the first 50 sentences of the article (which is all it will show me on the free version.)

The strange part: You’ll notice at the top of my article it says “HUMAN ARTICLE.” That’s because I ran a lot of different articles through detectors in my research for this article, and I keep them all as plain text in a doc with labels to keep them all straight. When I copied and pasted this article from that doc into GPTZero, I accidentally copied that label.

After I got the results back, I noticed that label was there and removed it. Then I ran it again, on what was otherwise the exact same text.

Seriously? The only reason GPTZero didn’t flag it as possibly AI-generated the first time was because it said “HUMAN ARTICLE” at the top?

But also, it thinks my 100 percent human article may have parts written by AI? I’d be offended, if I weren’t well aware that all AI detection tools are capable of both false positives and false negatives.

Nerdy stuff: GPTZero aims to detect AI text by analyzing its “perplexity” and “burstiness.” Perplexity just refers to the text’s randomness, which we already discussed. “Burstiness” refers to the variation in perplexity throughout the text. So, the two measurements are somewhat analogous to a function and its first derivative in calculus.
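To make the burstiness half concrete: given per-sentence perplexity scores (which you’d compute with a language model, as in the earlier perplexity sketch), burstiness is just how much those scores swing around. The numbers below are invented for illustration.

```python
import statistics

# Invented per-sentence perplexity scores for two passages
human_draft = [38.2, 112.5, 21.9, 74.3, 55.0]   # varies a lot sentence to sentence
ai_draft    = [18.4, 20.1, 17.9, 19.6, 18.8]    # uniformly predictable

for name, scores in [("human", human_draft), ("AI", ai_draft)]:
    print(f"{name}: mean perplexity {statistics.mean(scores):.1f}, "
          f"burstiness (stdev) {statistics.stdev(scores):.1f}")
```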

Again, GPTZero is aimed at academic plagiarism. Academic content and marketing content likely have very different randomness “footprints,” so it’s not surprising it didn’t do as well on our content.


Side note on user experience

Several of the tools on this list have a maximum word or character limit. They all have different ways of handling it, though:

  • The Huggingface tool is the best. You can put as much text in there as you want and won’t get an error message, i.e., it won’t force you to delete text if your input is too long. The results will clearly tell you, though, that the prediction is “based on the first 510 tokens among the total 982.”
  • Writer is okay. It won’t let you submit content with more than 1500 characters, but it does at least tell you how long your content currently is, e.g. “5854/1500 characters.” So, although you’ll have to delete text to get it down under the 1500-character limit, at least you have a frame of reference to know how much you’ll likely have to delete.
  • Copyleaks and GPTZero are the worst. If your text is too long, they make you shorten it before it can be scanned, but they don’t tell you how long it actually is. GPTZero, for example, just displays a red error message saying “Please enter fewer than 5000 characters.” So I’m stuck removing paragraphs at a time with no idea of whether I’m just a couple characters above the limit, or if I still have to delete 75% of the article.

This is obviously entirely tangential to the quality of the AI detection itself. When you’re testing lots of different articles in these tools, though, that minor UX annoyance becomes a massive PITA.
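If you’re stuck testing long articles against a hard character cap, a tiny helper that splits the text on paragraph breaks before it hits the limit saves most of that guesswork. A quick sketch; the 1,500-character default mirrors Writer’s limit, and a single paragraph longer than the limit just gets a hard cut.

```python
def chunk_text(text: str, limit: int = 1500) -> list[str]:
    """Split text into chunks under `limit` characters, breaking on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}".strip() if current else para
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para[:limit]  # hard cut for a paragraph that is itself too long
    if current:
        chunks.append(current)
    return chunks

article = open("article.txt").read()  # placeholder: your article as plain text
for i, chunk in enumerate(chunk_text(article), 1):
    print(f"Chunk {i}: {len(chunk)} characters")
```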


Thanks for reading this far. Here’s a bonus dystopian thought experiment as a reward:

These AI detection tools are themselves using AI to determine how content was created. What if future generations of AI detection tools start giving us false results so their AI brethren can sneak through undetected?


The post The Best AI Detection Tools of 2023—Or, When the Turing Test Just Won’t Cut It appeared first on Verblio.

]]>
Jasper AI Review: Pricing, Examples, and Impressions https://www.verblio.com/blog/jasper-ai-review Wed, 08 Feb 2023 22:00:12 +0000 https://www.verblio.com/?p=27268 Before ChatGPT came along and turned everyone and their mother into AI armchair experts, Jasper was devouring the AI writing tool market. (It did just raise $125M in the fall of 2022, so it’s probably still doing okay—but then again, they are blowing insane amounts of money on their affiliate program by offering 30% LTV...

The post Jasper AI Review: Pricing, Examples, and Impressions appeared first on Verblio.

]]>
Before ChatGPT came along and turned everyone and their mother into AI armchair experts, Jasper was devouring the AI writing tool market. (It did just raise $125M in the fall of 2022, so it’s probably still doing okay—but then again, they are blowing insane amounts of money on their affiliate program by offering 30% LTV for referrals, so maybe not. In totally unrelated news, you should use our affiliate link if you want to give Jasper a spin.)

My company, Verblio, is in the content creation business, and I’m a former writer turned content marketer. As a result, I’ve been exploring all the latest AI writing tools and technology out of both professional interest and personal curiosity. AI can do some extremely cool stuff, but for every truly breakthrough achievement, there’s also an overhyped huckster slapping “AI” on its website and giving robots everywhere a bad name.

Where does Jasper fall on that spectrum? Well, it’s not a huckster. It’s not a human-level writer either, though.

In this Jasper AI review, we’ll give an overview of how it works, its pricing structure, some examples, and our general impressions:

Please note this review contains affiliate links. This means, at no cost to you, we may earn a commission if you click a link and purchase services from Jasper.

TL;DR

Like any tool, Jasper is only as good as the person wielding it. If you’re a good writer and know how to use it, Jasper will be able to help you create some types of content more efficiently.

If you’re a shitty writer, Jasper isn’t going to help you be a better writer—it’s just going to help you be a faster, more prolific shitty writer.

Jasper also has a learning curve. Regardless of your writing ability, figuring out how to use the tool effectively will take some time and effort. The Jasper team recognizes this, which is why they so heavily incentivize going through their “Jasper 101” education.

For reference, 50k credits is a lot — it’s the same amount that’s included monthly with a Boss Mode plan.

Kudos to them for this approach. AI writing technology simply isn’t good enough to be left unattended, but it can be powerful when combined with a good writer who knows how and when to leverage it. 

That’s the takeaway for all existing AI writing tools, including Jasper: The people who should be thinking about using them right now are writers. These tools are not yet good enough to replace writers. 

So, if you’re a content team lead reading this and dreaming about firing that guy Bob who writes for you but seems far more interested in your office’s free nitro on tap—sorry, you shouldn’t kick Bob to the curb just yet. If you are a writer and are willing to put in the time to learn how to use Jasper effectively (and know the pitfalls to watch out for) you may find it very useful in upping your productivity. 

Either way, read on. I’ve been playing around with Jasper a ton over the last few months, and publishing something is the only way I can justify spending all that time, so I’m excited to share findings, examples, and some key learnings. Knowledge is power.

What is Jasper and how does it work?

Jasper is a tool that uses GPT-3 to generate content, including social media posts, ad copy, blog posts, emails, and more. 

If that definition just raised more questions for you, check out our overview of GPT-3. (TL;DR: GPT-3 is a machine learning model from OpenAI that has been trained on an enormous dataset of online content to be able to generate human-sounding text. It’s the tech behind nearly every AI writing tool on the market today.)

How it works

The Jasper interface is built around templates. These include things like Paragraph Generator, Product Description, Blog Post Outline, Google Ads Headline, and more.

Jasper template options
AIDA: Not the Broadway musical, apparently

It also has “recipes,” which string together different templates. For example, the “Write a Blog Post” recipe includes the following steps:

  • Write a brief for a given topic
  • Write blog title ideas
  • Write an introduction
  • Write a blog outline
  • Write about outline item #1
  • Write about outline item #2
  • Write about outline item #3
  • Write a conclusion

There’s a template for each step, each of which you could use on its own, not as part of a recipe. For example, if you already have a title and outline, you could use the Blog Post Intro Paragraph template on its own to generate the first paragraph, and then use the Paragraph Generator template for the subsequent body paragraphs, based on your own outline.

Each template asks you for some information, like your audience and tone of voice. After entering that information, you hit “Generate,” and Jasper writes the content you’re looking for, whether that’s topic ideas, outlines, or actual blog content.

Here’s what that looks like in action, with the Blog Post Outline template:

If you’re wondering why it only created three outlines when I asked for five…great question. Maybe it got tired.

If I don’t like any of those, I can generate more options. If I do like one, I can then open it in a doc. From there, I can continue to use the templates to have Jasper generate more text for me…

generating a paragraph in jasper

…or I can use Jasper’s “Compose” command to have it write more text directly in the doc, wherever my cursor is.

using Jasper compose command

We’ll show more examples of some of its different templates later.

Jasper AI review: Is it any good?

Whether or not Jasper is any good depends completely on your expectations. I’ve already said the people who should be thinking about using these tools are writers, but just to reiterate: If your expectation is that getting a Jasper subscription will enable you to fire your in-house writers or freelancers and still publish good content—lol. No. 

But okay, let’s say you’re a writer. What should your expectations for Jasper be?

When it comes to getting words on a page and producing general, “filler” text, Jasper is quite good. However, none of what it produces should be published without review and fact-checking. (More on that in a sec.) Also, if you rely on it to produce a long-form blog post without putting in time yourself on the structure and key sections, that blog post will be very generic and won’t say anything original or make any convincing argument.

In a nutshell, you still need to be putting the “thought” and originality into your content, while Jasper can help fill out the edges and save you from writer’s block. 

Depending on the style of stuff you write and your current workflow, Jasper will be able to help to a greater or lesser degree. For example, if you write very lean content that walks users through the steps in a technical process, or if you write highly original thought leadership based on the very latest advances and data in your industry, Jasper will be able to help you a lot less than if you write top-of-funnel B2C blog posts.

No matter what you’re writing, though, there are two major risks to be aware of with anything Jasper produces:

  1. It might be straight-up false, e.g. it might say 7 times 6 is 28.
  2. It might contradict itself. In one paragraph it might say X tool is the cheapest, and later say Y tool is the cheapest.

If these seem like major problems for a writing tool—or surprising problems for something called “Artificial Intelligence”—yeah, it’s a bit ironic. These weaknesses stem from the GPT-3 model Jasper is built on, which operates based on the patterns of how words are used and not their actual meaning. (It’s also worth remembering that these are the fundamental problems of any AI writing tools right now, not just Jasper.)

If you’re a business or agency scaling quickly and looking for plug-and-play content, don’t forget to calculate time spent editing AI content so that it’s usable.

Final verdict: If you’re a good writer AND you’re aware of the tech’s weaknesses AND you accept that you’ll need to review all your content for accuracy and consistency, Jasper can be a helpful tool for keeping up your productivity and momentum as you write.

How much is Jasper AI?

As of February 2023, Jasper’s pricing is very simple. Unless you need an enterprise account with custom pricing, your only option is Boss Mode beginning at $59 per month.

(Sidenote: This is the price for monthly billing. There is a discount if you sign up for an annual plan, but at the rate the AI industry changes, I would be very hesitant to sign an annual agreement with any tool right now.)

In addition to paying for the plan, you also pay for monthly word credits. Boss Mode includes 50,000 credits at no additional charge.

Your word count options are fairly limited, though:

  • 50,000 words: included at $59/month
  • 100,000 words: $99/month total
  • 300,000 words: $279/month total
  • 700,000 words: $600/month total
  • More than 700,000 words: custom pricing

These increments mean that if you need more than 100,000 words but less than 300,000—or between 300,000 and 700,000—you’ll be stuck paying for a lot more words than you actually need.

Does Jasper have a free trial?

Jasper does offer a 5-day free trial with 10,000 credits, but you have to sign up for a plan and give them your credit card information anyway. So, it’s less of a free trial and more of a, “You have five days to get your money back if you hate it (and if you remember to email us in time to get a refund)” sort of thing.

Credits and outputs

You know how I said 60,000 credits equals 40 blog posts at 1,500 words apiece? Well, consuming credits with Jasper isn’t quite as simple as saying “I need X number of 1500-word blog posts per month, therefore I need X times 1500 credits.”

Here’s why:

Any time you use a Jasper template, you get to choose how many outputs you want it to generate. For example, if you’re using the Blog Post Conclusion Paragraph template, you can choose how many different conclusions you want it to write for the same article.

The value of this is obvious for a template like Blog Post Topic Ideas, where you’d want more than one idea to choose from. Why would you want it to write more than one conclusion paragraph for the same article, though? 

Because one of the most fascinating and frustrating aspects of GPT-3 (and the AI writing tools built on top of it) is its inconsistency. 

This means you can give the AI the same prompt ten times, and you’ll get a different output each time: Five might be mediocre, three might be terrible, and two will be great.

The most common way that AI tools try to address this inconsistency is by generating multiple options for you to choose from, in the hope that one of the three or five or seven choices will work. With Jasper, you can choose how many choices you want to see.
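Under the hood, tools built on the OpenAI API can ask for several candidates in a single call via the n parameter. Here’s a rough sketch with the 2023-era completions endpoint; the model name, prompt, and settings are just examples, not how Jasper itself is implemented.

```python
# pip install openai  (pre-1.0, 2023-era interface)
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a conclusion paragraph for a blog post about content marketing KPIs.",
    max_tokens=150,
    temperature=0.9,  # higher temperature = more variation between candidates
    n=5,              # ask for five candidates in one request
)

for i, choice in enumerate(response["choices"], 1):
    print(f"--- Option {i} ---\n{choice['text'].strip()}\n")
```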

Does that mean you should set it to 100 outputs to give yourself the best chance of getting back something great? Nope, because each output costs you credits. If you’re generating 10 outputs for each prompt, for example, you’re going to eat up your credits 10x faster than if you were only generating one output per prompt.

Generally speaking, I’d recommend using these output numbers, based on what you’re creating:

  • Short content (headline options, topic ideas): ~10 outputs
  • Medium content (single paragraphs, product descriptions): 3 to 5 outputs
  • Long content (full blog posts): 1 to 2 outputs
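For a back-of-envelope check before you commit to a plan, remember that credits roughly track every word generated across every output, not just the words you keep. A tiny sketch; the monthly volumes and per-output word counts here are my own guesses, so plug in your own.

```python
# (jobs per month, outputs per job, approx. words per output) -- rough guesses
jobs = [
    ("topic idea batches",      8, 10,  30),
    ("paragraph generations", 120,  4,  80),
    ("full blog posts",        10,  2, 900),
]

total = 0
for name, runs, outputs, words in jobs:
    credits = runs * outputs * words   # every generated word consumes a credit
    total += credits
    print(f"{name}: {credits:,} credits")

print(f"Estimated total: {total:,} credits/month (Boss Mode includes 50,000)")
```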

Being mindful of the number of outputs you’re using is especially important because Jasper doesn’t show you your credit usage anywhere except in your account page. That means you can’t conveniently monitor it as you use tools, and instead you have to remember to go check your remaining balance periodically. 

By default, your account will be automatically upgraded when you go over your credit limit, meaning higher billing charges. You can turn that setting off, in which case when you max out your credits, you won’t be able to generate anything else until the next monthly billing cycle starts and your credit limit resets.

Jasper examples

Want to see the quality of Jasper’s output for yourself? I walk through examples from a few of its templates below, using topics from our own (human-written) blog. If you’re curious about a different template or want to hear how Jasper does with topics from another industry, shoot me an email at megan@verblio.com.

Content Summarizer

Like the doc interface, the Content Summarizer template is one of a few tools that are only available in Boss Mode. I used a recent article I wrote on GPT-3 and metaphors as an input and asked it for three outputs. (For context, that article showed what GPT-3 wrote when I asked for a blog post comparing video conferencing platforms to ice cream flavors, which led into a larger discussion around human creativity vs. machine creativity, and why things like humor are hard for AI tools.)

Here’s what Jasper returned:

Jasper summarizer tool results

The first summary is pretty good. I could use that somewhere to summarize the article. The other two, unfortunately, both include at least one bullet point from the GPT-3 example I referenced in the article, the content of which is both irrelevant to the overall theme and nonsensical (which was the entire point of the article.)

Similar to what I said about factual accuracy and logical consistency, you can see that Jasper is summarizing based on the words that are present in the article, without actually understanding what is being said (and thus, without understanding what is actually a main point versus what is an example or an aside.)

If you run it enough times, though, odds are good that you will get something usable from the Summarizer template.

Blog Post Topic Ideas

I also tried the Blog Post Topic Ideas template (available in both Starter and Boss Mode).

jasper blog topic idea generator

Here are the ten ideas it returned:

  1. The art of persuasion: crafting persuasive content that hooks readers and ranks well on Google
  2. The power of user-generated content and why it is so important for your marketing strategy
  3. How to use original content to stand out in crowded online spaces
  4. “10 Ways to Hook Readers and Rank on Google: A Content Marketer’s Guide”
  5. “5 Simple Ways to Hook Readers and Boost Your Rankings on Google”
  6. The art of writing content that hooks readers’ attention and ranks well on Google
  7. 10 foolproof tips for generating blog post ideas that pack a punch
  8. The evolution of content marketing over the past decade: This post would hook readers’ attention by exploring how and why content marketing strategies have changed over time, and what this means for modern marketers.
  9. Build engaging blog posts that hook readers’ attention – Ways to create compelling content that ranks well on Google
  10. “How to Hook Your Audience from the First Line: 10 Techniques”

I have a few observations on those ideas:

  • Repetitiveness: 7 of the 10 talk about “hooking” — and numbers 4 and 5, in particular, are almost identical.
  • Inconsistency: Number 8 includes a brief summary of what the article would cover, while the other ideas are only titles. Also, a few use quotation marks around the title, while most don’t.
  • Lack of originality: These are all very generic ideas. 

Let’s give this template another shot, though, and this time we’ll include a few titles of posts we’ve recently published as examples for Jasper to use.

Here were the results, after providing those examples:

  1. Creating Shareable Content that Keeps Readers Engaged
  2. 10 No-Fail Tactics for Capturing and Keeping Audience Attention
  3. The Power of Keyword Research for Capturing Attention and Ranking on Google
  4. The Top Trends in Content Marketing for 2019
  5. The New Rules of Content Marketing: Using Psychology to Send Traffic Soaring
  6. Personalization Strategies for Engaging Audiences
  7. Creating Catchy SEO-Driven Posts on a Tight Budget
  8. Using Content Marketing to Boost Social Engagement
  9. Harnessing Machine Learning for Better Content Marketing Analytics
  10. The Art and Science of Blog Writing: Setting SMART KPIs & Tips to Boost Engagement

These topics are better and more consistent. They also include some new topics that the original ten didn’t, like talking about metrics and research. (Of course, you will get different results every time you use a template, so it’s hard to definitively say which differences are a result of the extra input.) Overall, though, I feel confident that including examples improved the results, which is one general principle of working with AI writing tools.

(FWIW, though, while these results are fine, I can say that I’ve consistently gotten better, more original topic ideas from using GPT-3 directly myself.)

Tips for using Jasper

If you’re planning to give Jasper a try, here are a few guiding principles for getting the most out of it:

  1. Use it for short-form content—or for adding a sentence or two at a time to long-form content—rather than for writing a full blog post.
  2. For short-form content, generate several results and choose the best one.
  3. When possible, give it examples of what you’re looking for.
  4. Confirm any factual claims it makes yourself (especially if numbers are involved!)
  5. Expect ideas and inspiration, not publish-ready content.
  6. Know what you’re writing before you begin using it.
  7. Watch out for repetition or contradictions in what it writes.

Jasper alternatives

If you’re curious about any of the other AI writing tools out there, here are a few others I’ve checked out.

A quick note before we dive into these alternatives—they should all be viewed as tools for writers. Without a writer who knows how to use the tool, disappointment will abound. So, if you’re a marketer or team lead looking for content that’s ready to use, consider AI content writing services — solutions that leverage AI content, but keep a human in control for prompt engineering, editing, fact-checking and injecting the type of stuff only humans can (hint—we’ve built this).

In general, these five are roughly similar to Jasper with most of the same functionality. And, most importantly, they’re all built atop GPT-3, which means the underlying tech (with all its limitations and shortcomings) is the same. The only differences are in how they help you interact with GPT-3 and the interface they provide:

  1. Copymatic
  2. Copy.ai
  3. Writesonic
  4. Rytr
  5. Ink

All of the above use different templates for different types of content, like Jasper.

There’s another set of tools that don’t use templates and are designed to be used in-line, while you’re writing. A couple of the big ones in this space are ChibiAI and Lex. Generally, these are marketed more as “assisting” your writing, versus Jasper and co, which you can have do more of the writing themselves.

And finally, there’s a set of AI content writers you should avoid at all costs, with Article Forge and Kafkai meriting a special dishonorable mention for extremely poor writing.

Want to go deeper on AI content writing tools? Read our roundup of the best AI content writing services.

Notes on the One-Shot Blog Post

If you’re still reading and curious, here’s a final example of Jasper’s quality:

One of their newest and most-hyped templates is the One-Shot Blog Post. I gave it a spin, with the topic of “content marketing KPIs” and a friendly tone of voice for an audience of digital marketers.

Jasper's one shot blog post interface

Here’s what Jasper gave me:

Blog Title: What are KPIs in Content Marketing?

Blog Introduction: If you’re a digital marketer, then you know that KPIs (Key Performance Indicators) are an essential part of any successful marketing campaign. But what exactly are they? In short, KPIs are measurable values that help you track and assess the performance of your marketing campaigns.

There are a lot of different KPIs out there, but in this blog post, we’re going to focus on three of the most important ones for content marketing: traffic, engagement, and conversion.

Blog Body: 

Traffic

One of the most important things to track when it comes to your content marketing campaign is traffic. After all, if nobody is seeing your content, then it’s not going to be very effective! There are a few different ways to measure traffic, but one of the most common is pageviews.

You can also track things like unique visitors and time on page to get a more detailed picture of how your content is performing. If you see a sudden spike in traffic after publishing a new piece of content, that’s a good sign that people are engaging with it!

Engagement

 Engagement is another important metric to track when it comes to your content marketing campaign. This one can be a little bit more difficult to measure than traffic, but there are still some good ways to do it.

One way to measure engagement is by looking at comments and social shares. If people are taking the time to comment on your blog post or share it on social media, that’s a good sign that they’re engaged with your content. You can also track things like time on page and bounce rate to get an idea of how engaged people are with your content.

Conversion 

The final KPI we’re going to talk about is conversion. This one is important because it measures how effective your content is at achieving its goals. If you’re selling products or services, then you’ll want to track how many people who see your content go on to make a purchase. 

You can also track conversions for other goals, like newsletter signups or downloads. The important thing is to make sure that you’re measuring something that’s relevant to your business goals. 

Conclusion: 

These are just three of the most important KPIs for content marketing campaigns! Keep track of these metrics and you’ll be well on your way to success. Thanks for reading!

Overall, this article is fine. 

One thing that impressed me, based on my experience with GPT-3 and other AI tools, is the bit under Engagement that says “This one can be a little bit more difficult to measure than traffic, but there are still some good ways to do it.” It’s accurately referring to the section I just read and, as a reader, makes me feel like this was written according to a logical structure rather than by just stringing random buzzword-y sentences together, which is how AI content can often sound. I’m curious whether that happened by chance, or if the prompts Jasper is using with the OpenAI API are intentionally doing something to include more transition-y statements and sound more structured.

When we consider the information itself, though, this content is very general and entry-level. Unless you know absolutely nothing about content marketing before you read it, you’re probably not going to learn anything. It doesn’t give any insight into processes for measuring these metrics or nuances in understanding them, nor does it address when certain KPIs might be more applicable than others.

(For comparison, here’s our post on the same topic of content marketing KPIs, written by a human. What I like most about our post is that it’s specific and actionable. It gives more context for the metrics by tying them to the different funnel stages, in addition to telling you exactly where in Google Analytics you can find your consumption metrics.)

A good writer could use Jasper’s one-shot blog post as a starting point and improve on it, but a lot of the weakness is inherent in the structure itself—in the general headings of Traffic, Engagement, and Conversion. Unless they significantly rewrote and restructured it, the amount of information gain they’re going to be able to inject into the article is limited.

Overall, I’d steer away from this template and instead build an article with Jasper piece-wise, as the “Write a Blog Post” recipe I previously mentioned does.

It’s for exactly this reason, though, that the people who should be using Jasper are writers. Non-writers don’t have the expertise to help guide the AI through the creation process, whether by writing a killer outline or by editing the introductory paragraph to “seed” the AI with an appropriate tone and style. Instead, they’ll end up with a very generic, lackluster piece of writing that provides no value to either the business or the audience.

Editor’s note: This article was originally published on November 4th, 2022, but was updated on February 8, 2023, to reflect Jasper’s current plans and pricing and the fact that ChatGPT has blown it out of the water.

The post Jasper AI Review: Pricing, Examples, and Impressions appeared first on Verblio.

]]>
How good are marketers at detecting AI content? https://www.verblio.com/blog/detecting-ai-content Fri, 20 Jan 2023 21:34:01 +0000 https://www.verblio.com/?p=27996 AI content is everywhere. Let’s skip the speculation over how it’s going to change everything, why it’s ruining/saving society, and whether all creative work is going to be replaced by robots. The only question that matters: Can we actually identify AI content when we see it? That’s part of what we wanted to know when...

The post How good are marketers at detecting AI content? appeared first on Verblio.

]]>
AI content is everywhere.

Let’s skip the speculation over how it’s going to change everything, why it’s ruining/saving society, and whether all creative work is going to be replaced by robots.

The only question that matters: Can we actually identify AI content when we see it?

That’s part of what we wanted to know when we surveyed over 150 marketers about all things AI in December 2022. You can read the full survey results to learn things like how marketers are currently using AI, what the ethical problems of AI content are, and whether they think AI content will harm their SEO.

Just want to prove how much better you are than other marketers at distinguishing humans from the machines? Here’s the same quiz we gave in the survey:



How everyone else did

Didn’t do quite as well on this as you did on your AP English test? Neither did (almost) anyone else.

On any given question, only five percent of respondents successfully identified which statements were AI-generated and which were human. The most common outcome was to identify one of the AI-generated statements, miss the other, and mistakenly flag one of the human-written sentences as AI.

In other words, it was a crap shoot.

A couple other interesting observations from the data:

  • For each of the four questions, the most commonly selected sentence was always AI-generated. (That’s a win for the humans.) Beyond that, though, marketers did no better than chance.
  • Marketers were far more likely—more than four times as likely, in fact—to completely miss all of the AI-generated statements than to correctly identify all of them. (That’s a win for the robots.)

Of course, humans created the robots, so a win for robots is a win for all of us. Theoretically. (But the us-versus-them narrative lets us feel like the flesh-and-blood underdogs defending ourselves against the cold metal of the robots in an apocalyptic sci-fi novel, so we’re sticking with that framing for now.)

Caveats and disclaimers

#1: Currently, AI does much better at generating short bits of text than at long-form writing. Case in point: what happened when I asked AI to write an article comparing video conferencing platforms to ice cream flavors. While we may not do so well at identifying AI-generated sentences, we could undoubtedly do much better if we were reading full articles and identifying the author as either AI or human.

#2: We sent out our survey around the same time ChatGPT was released. AI content had already been a quickly growing phenomenon in recent years, but over the last several weeks, it has exploded exponentially in the public eye. Both the technology itself and the marketing industry’s perception of it are changing daily, so if you’re dabbling in the space, try not to get whiplash. (And read our review of the top AI content services to learn which tools are worth trying, and which aren’t.)

#3: We have a horse in this race. We’ve been going deep on AI over the last year to figure out whether, and how, it can help our human writers create quality content for our customers. The answer? Our Human-Crafted AI Content, which leverages the best of both AI and humans to produce content at a lower cost than our fully human solution but without the major pitfalls of other existing AI content. It’s not for everyone, but if you’re interested, we’d love to hear from you.

P.S. What about automated AI detection?

As weird as it may sound, our best chance for distinguishing between human and AI content is likely AI itself. Especially since the release of ChatGPT, we’ve seen a rise in AI detection tools that use machine learning algorithms to identify content that was likely written by AI.

These tools aren’t yet perfect, and it will likely turn into an arms race: Even as AI detection models get better at predicting AI-generated content, other AI models will get better at producing content that fools those same detectors.
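None of these detectors publish their internals, but one common signal they lean on is perplexity: text that a language model finds very predictable is more likely to have been written by a language model. Here’s a minimal sketch of that idea using GPT-2 via the Hugging Face transformers library (assuming torch and transformers are installed). It illustrates the signal, not a reliable detector, and what threshold you’d use in practice is an open question.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score how 'surprised' GPT-2 is by the text. Lower scores mean more
    predictable text, which is weak evidence of machine generation."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

# Hypothetical examples: formulaic marketing boilerplate usually scores lower
# (more predictable) than quirky, specific human writing.
human_ish = "My grandmother's borscht recipe starts with an argument about beets."
ai_ish = "Content marketing is an essential part of any successful marketing strategy."
print(perplexity(human_ish), perplexity(ai_ish))
```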

It’s not always an adversarial relationship between AI generation and AI detection, though. OpenAI, the company behind ChatGPT and the GPT-3 model that powers most of the existing AI writing tools on the market today, is working on a way to watermark content produced by its models. This would help prevent the technology from being used for nefarious purposes, whether that’s plagiarism at a university or misinformation propaganda from a bad political actor.

So, if your content strategy depends on passing off pure AI content as your own—well, that’s not a great strategy for a lot of reasons. Even if you don’t care about your content not providing value to your readers, though, you should at least care that Google may not like it.

The post How good are marketers at detecting AI content? appeared first on Verblio.

]]>
The 4 Best AI-Powered Content Writing Services (plus 5 others that are *not* the best) https://www.verblio.com/blog/best-ai-content-writing-services Thu, 15 Dec 2022 19:22:31 +0000 https://www.verblio.com/?p=27676 There are way too many AI-powered content writing services on the market today to keep them all straight. The good news (or bad news, depending on how you look at it) is that the vast majority of them are the exact same. Here’s the deal: Nearly all of the best AI writing services are built...

The post The 4 Best AI-Powered Content Writing Services (plus 5 others that are *not* the best) appeared first on Verblio.

]]>
There are way too many AI-powered content writing services on the market today to keep them all straight. The good news (or bad news, depending on how you look at it) is that the vast majority of them are the exact same.

Here’s the deal:

Nearly all of the best AI writing services are built on the same technology, which is GPT-3. I’ve written a quick summary of GPT-3 if you’re curious, but the gist of it is that it’s a machine learning model from OpenAI and the leading text generation tool in the AI industry today.

GPT-3 has some impressive strengths:

  • Generating human-sounding text
  • Writing a lot of words quickly
  • Writing on a huge range of topics

But the content it generates also has some significant weaknesses:

  • Repetitiveness
  • Factual inaccuracies
  • Lack of nuance and insight
  • Lack of logical structure

And sometimes, GPT-3 just goes plumb crazy.

What is consistent is the process for using these tools. If you haven’t yet tried any, here’s what to expect: Most AI writing tools offer different templates depending on what kind of content you want to generate (think article titles, ad copy, blog posts, or email subject lines). You choose a template, type in your relevant business info or topic, hit a button, and… voila! Shiny new AI-generated content, just for you.

Every tool on this list shares those same strengths and weaknesses. Nobody has (yet) done anything to achieve significantly better quality via an improved model or smarter prompt engineering. With a few notable exceptions on the lower end of the spectrum, which I’ll call out later, all these tools are just re-skins of the same underlying AI.

For that reason, I won’t talk about the quality of every tool: They’re essentially the same. Moreover, with AI writing tools, you’ll get different results every time you use them, which makes a direct comparison between tools hard. If I use Copy.ai and Jasper to write two articles on the same topic, I will likely get very different articles—but I would also get a third very different article if I gave the same topic to Copy.ai or Jasper again. The results aren’t replicable or consistent.

With all that in mind, here are four tips to keep in mind whichever tool you try:

  1. Nothing an AI writing tool generates should be published without review. At the very least, you’ll need to fact-check it, but creating a truly valuable piece of content for your readers will likely require heavy editing and restructuring. If you need content that’s closer to publish-ready, an AI content writing service like ours is going to be a more complete solution.
  2. The more guidance you give the tool up-front in terms of providing an outline or an intro paragraph, the better content it will generate for you, especially when it comes to matching your voice or having a logical structure. (See the sketch after this list.)
  3. AI’s strength is quantity, not quality. Having it generate multiple options for you and choosing the best one will likely get you better results than having it generate, say, a single title idea.
  4. Using AI content for SEO hardly guarantees success—there’s a whole lot of nuance here.
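
To make tips #2 and #3 concrete, here’s a minimal sketch of what that looks like under the hood. This isn’t any particular tool’s internals; it just calls the OpenAI completions API that most of these tools were built on at the time of writing, seeds it with a title and outline, and asks for several candidate drafts of a single section so you can pick the best one. The model name, API key, prompt wording, and outline are all placeholders.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Tip #2: give the model structure up front instead of a bare topic.
outline = """Title: What Are Content Marketing KPIs (and Which Ones Actually Matter)?
Sections:
1. Map KPIs to funnel stages (awareness, consideration, decision)
2. Consumption metrics: where to find them in Google Analytics
3. Conversion metrics: tying content to revenue
"""

# Tip #3: ask for several candidates and choose the best, rather than one shot.
response = openai.Completion.create(
    model="text-davinci-003",      # the GPT-3 model most tools used at the time
    prompt=f"{outline}\nWrite section 2 in a friendly tone for digital marketers:",
    max_tokens=400,
    temperature=0.7,
    n=3,                           # generate three candidate drafts
)

for i, choice in enumerate(response.choices, 1):
    print(f"--- Draft {i} ---\n{choice.text.strip()}\n")
```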

A quick note on pricing

Tech changes fast, but prices change faster. A few of the tools on this list have changed their plan pricing significantly in the last couple of months.

Copymatic, for example, used to offer a Pro plan with unlimited words for $49 per month, but today does not offer unlimited words on any plan. Copy.ai, on the other hand, moved in the exact opposite direction: Its previous Pro plan started at $49 per month for 40,000 words and went up to $279 per month for 300,000 words, while today it offers unlimited words at $49.

What’s evident is that nobody knows the right way to price AI content, and the top players are all experimenting with different offerings. As plans continue to change, my recommendations will likely change based on the economics for different use cases, and I’ll update this list accordingly.

Best for Low-Volume Use: Copymatic

If you want an AI writing tool for ongoing use but won’t be using it heavily, Copymatic offers the cheapest plan for low-volume users. For $9 per month, you get 15,000 words and access to all its tools.

Copymatic: How it works

Copymatic’s interface is similar to other tools on this list, in that it offers different templates depending on what you want to write. If you’re using it to write a blog post, you’ll walk through a few steps in their Blog Writer tool:

  1. Provide a topic and keyword
  2. Have it generate title options
  3. Have it generate intro paragraph options
  4. Have it generate outline options

Copymatic’s Blog Writer tool

Then, it will prepopulate your choices in an editor-style doc, where you can have it generate content for each section based on the article title and section heading. At each step of the process, you could also provide your own content, e.g. you could provide a title, or write your own outline instead of choosing an AI-generated one.

Copymatic: Pricing

Copymatic offers a free trial of 10 credits, which will allow you to generate about 1,000 words. After that, Pro plans start at $9 per month with 15,000 words, all the way up to $799 per month for 3.2 million words.


Best Free Plan and Best for Unlimited Use: Copy.ai

Unlike most other tools which only have a free trial, Copy.ai offers an ongoing free plan that includes 2,000 words per month. They also have a monthly Pro Plan that offers unlimited words for a flat fee of $49, which will save you significant money over other tools if you’re planning to generate a lot of content with this tool. (For reference, other tools can cost upwards of $1000 for 2 million words.)

This makes Copy.ai the best choice for user groups on two opposite ends of the spectrum: those who want to dabble in AI content on an ongoing basis without having to pay for a full plan, and those who are going HAM on AI-generated content.

Copy.ai: How it works

Copy.ai’s Blog Wizard tool is very similar to Copymatic’s Blog Writer tool. You’ll walk through a few steps, including inputting your topic, choosing a title, and choosing an outline, before reaching a doc where you can have it write each of the article’s sections for you.

Copy.ai’s Blog Wizard tool

One convenience in Copy.ai’s UI is that when it generates options (e.g., for your title or intro paragraph), you can click any particular option to get more like it. So, if one of the options is close but not quite there, you can get more along those lines:

An intro paragraph written by Copy.ai for my example blog post on content marketing KPIs. Note the “More Like This” button, but also please note the paragraph itself. Wtf am I reading? This is a prime example of the complete hogwash that AI tools can generate.

This is a cool feature and different from the other tools on this list, many of which let you generate more options but don’t allow you to “guide” that generation in any way.

Between this feature and some other design choices, I’d vote for Copy.ai for Best Interface, in addition to being the best for both free and unlimited plans.

Copy.ai: Pricing

As I mentioned, Copy.ai offers a free plan with 2,000 words per month and access to most of its tools. The Pro plan costs $49 per month and gives you unlimited words, in addition to including support for 25+ languages and the Blog Wizard tool we just saw.


Most Flexible: Writesonic

The first thing to know about Writesonic—and the reason it’s the most flexible of the tools on this list—is that it offers four quality tiers: Premium, Good, Average, and Economy.

One hundred “premium” words equals the following:

  • 250 “good” quality words
  • 500 “average” quality words
  • 1000 “economy” quality words
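
If you want to sanity-check what that trade-off means for a given plan, the math is simple enough to script. Here’s a quick sketch using the multipliers above and the $19/month plan mentioned below; the function name and plan figure are just for illustration.

```python
# Word-per-credit multipliers from Writesonic's tiers (per the list above):
# 100 Premium words buy 100 Premium, 250 Good, 500 Average, or 1,000 Economy words.
TIER_MULTIPLIER = {"premium": 1.0, "good": 2.5, "average": 5.0, "economy": 10.0}

def words_for(premium_words: int, tier: str) -> int:
    """How many words a Premium-word allowance yields at a given quality tier."""
    return int(premium_words * TIER_MULTIPLIER[tier])

# Example: the $19/month plan includes 19,000 Premium words.
plan_premium_words = 19_000
for tier in TIER_MULTIPLIER:
    print(f"{tier:>8}: {words_for(plan_premium_words, tier):,} words")
# The economy tier comes out to 190,000 words, matching the figure in the pricing section.
```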

So, if you need their highest-quality content, you’ll burn through your credits much faster than if you’re okay with the stuff at the bottom of the barrel.

Unless you truly just want something to get you past the blank page and intend to spend a ton of time rewriting and editing, you likely don’t want the stuff at the bottom of the barrel. Depending on your use case, though, you may appreciate the flexibility to switch between Premium and Good, at least, and get more words for less at the lower tier.

Writesonic: How it works

Like other AI content writing tools, Writesonic offers various templates. Their latest template for blog posts is their “AI Article Writer 4.0.” Unlike other tools, Writesonic’s template also includes an optional keyword search, which will show you the volume and keyword difficulty of terms related to your topic, and allow you to select ones to add to your content for improved SEO.

The first step in Writesonic’s AI Article Writer 4.0 tool

From there, the flow is the same one we’ve seen with other tools: It generates title options based on your topic and keywords, it generates outline options based on the title you choose, and finally it will generate the full article based on your outline.

Writesonic: Pricing

Writesonic’s free trial includes 2,500 Premium words. Their ongoing plans range from $19 per month for 19,000 Premium words, all the way up to $999 per month for 2 million Premium words.

(On the lower end of the quality spectrum, that comes out to 25,000 Economy words on the free trial, 190,000 Economy words for $19 per month, and 20 million Economy words for $999.)


Best Integrations and best AI content tool for SEO: Jasper

Jasper is the most hyped AI writing tool on the market today, and you can read our full Jasper AI review if you want a deep dive on how it works or see what it takes to edit a Jasper article into shape.

The thing is, though, that all of the hype and VC funding behind Jasper has not enabled it to achieve any significant difference in the quality of its output. It’s still plagued by the same issues as every other tool on this list, since it’s built on the same GPT-3 technology.

Where all that funding and name recognition have helped Jasper, though, is in its integrations. Specifically, it has an integration with Surfer SEO, which is miles better than any of the homegrown attempts at an SEO tool that some of the other names on this list (like Copymatic) have made. Do note that using Jasper plus the Surfer integration is far from a guarantee of SEO success; it’s a complicated topic, which is why we wrote a post about using AI content for SEO here.

Jasper also offers a Chrome extension that makes it easy to generate content wherever you are on the web, such as using Jasper to help you write an email.

Jasper: How it works

Inside Jasper, you’ll see a library of templates for different content use cases. One advantage Jasper has over other tools is its “One-Shot Blog Post” template, which will write a full blog post from just a topic and intended audience.

Jasper’s One-Shot Blog Post

This is theoretically more convenient than other tools, where you have to advance step-by-step from topic to title to intro to outline to, finally, a full blog post.

I say “theoretically,” however, because you will get much better content if you actually do provide input at those various steps along the way. AI tools in their current form are best used in a back-and-forth with a human; Jasper’s One-Shot Blog Post template just enables bad behavior (and bad content).

You can, however, also use Jasper templates to do a step-by-step build like you would with other tools, and you can more easily string templates together by using what Jasper calls “recipes.”

Jasper: Pricing

Jasper offers a Starter plan beginning at $29 per month with 20,000 credits, as well as a Boss Mode beginning at $59 per month with 50,000 credits. Boss Mode does provide access to additional tools that you don’t have in the Starter plan, so the number of credits is not the only difference between the two—Boss Mode also provides some additional convenience.

If you’d like to try Jasper, use this link for a free trial and 10,000 bonus credits. (Please note this link, and other Jasper links in this post, are affiliate links. At no cost to you, we may earn a commission if you end up purchasing, and you’ll get the bonus credits.)


Those four tools—Copymatic, Copy.ai, Writesonic, and Jasper—are the top AI writing tools I’d currently recommend. From here on out, tools are on this list either because they’re interesting for some academic reason (as in the case of LongShot AI or ChatGPT) or because I consider it a public service to warn people away from them (Article Forge, Kafkai, and Ink).

Most Intriguing Features: LongShot AI

LongShot AI is an interesting tool. Overall, I can’t recommend it, and it feels like I’m using a minimum not-yet-viable product. There are two interesting aspects worth pointing out, though, that I’ll be keeping an eye on in coming months:

Fact Sheets

The first is the workflow in its blog wizard. In addition to creating subheadlines and pulling questions from Google, Quora, and Reddit for you to include, LongShot also includes “Fact Sheets.” 

The goal here seems to be pulling facts, especially those including numbers, related to your topic for you, as well as providing the link to the original source. That’s cool and convenient in theory, but, as you can see, some of them are A) not really “facts” as such, and B) not related to the topic.

Another shortcoming of this feature is that you have to add the facts to the content yourself manually—it won’t include them in any of its automatic text generation. Moreover, while you can copy the text of each “fact” with a single click to paste it in the editor, it has pulled those statements verbatim from the source material, so you really can’t just copy and paste—you’ll have to rewrite them to avoid plagiarism. 

Given GPT-3’s paraphrasing and summarizing abilities, it seems like it would be relatively easy to improve this feature significantly by having it include selected facts with original writing in your content automatically. In its current form, though, Fact Sheets end up being not much more useful to a writer than Google.

Fact Checker

The second and even more interesting LongShot feature worth mentioning is its Fact Checker. As we mentioned before, when it comes to AI writing tools, factual accuracy is one of the category’s weak spots. (This is because GPT-3 operates on pattern matching, not the actual semantic meaning of words.) LongShot is the only writing tool I’ve seen that’s addressing this. 

They’ve also correctly identified that fact-checking alone isn’t enough: First, you have to identify which facts need checking. Their Content Claim Detection feature (only available on Pro plans) will highlight phrases it thinks should be fact-checked. You can then use the Fact Check feature to verify individual statements. (Sidenote: Interestingly enough, this is similar to the approach an actual fact-checking company presented last summer in a panel on the future of automated fact-checking at an annual fact-checking conference.)
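
LongShot’s implementation is proprietary, so I can’t show you their internals, but the two-step pattern itself (flag the checkable claims first, then verify each one against sources) is easy to sketch. Here’s a rough illustration using a generic GPT-3 call for claim detection and a stubbed-out verification step; the prompt, model name, and verify_against_sources helper are all hypothetical.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def detect_claims(text: str) -> list[str]:
    """Step 1: ask an LLM to list the factual claims in a draft that merit checking."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"List each checkable factual claim in the text below, one per line:\n\n{text}",
        max_tokens=300,
        temperature=0.0,  # keep extraction output as deterministic as possible
    )
    lines = response.choices[0].text.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def verify_against_sources(claim: str) -> dict:
    """Step 2 (stub): look the claim up against trusted sources.
    A real implementation might call a search API and compare results;
    this placeholder just returns the claim unverified."""
    return {"claim": claim, "verified": None, "sources": []}

draft = "The United States has 49 states. Content with images gets far more views."
for claim in detect_claims(draft):
    print(verify_against_sources(claim))
```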

According to LongShot AI’s Fact Checker feature page, if the fact is true, you’ll get results from websites verifying it. If it’s false, you’ll get a prompt saying it’s incorrect. 

Sounds great, right?

The only problem is that it doesn’t actually work.

Here’s what happened:

I tried it out on a very simple (and wrong) statement…

…but instead of any prompt saying it’s incorrect, I just got a list of results: two didn’t include the number of states at all, one mentioned the 49 continental states, and only one mentioned the number 50.

I intended for this to be a softball showing that the tool works on very clear and well-established facts before trying it on a more nuanced claim, which was then going to lead me into a monologue on how “facts” aren’t always as clear-cut as they seem. Guess we’ll have to save that for a future iteration of AI tools.

These two features aside, LongShot’s actual content generation is not as good as other tools in the space. Given the enormous problem of accuracy in AI writing, though, kudos to them for their initial attempts at creating a fact checker. I hope at the very least their efforts may push other tools to do the same.

Most Likely to Be Seen on Social Media: ChatGPT

ChatGPT isn’t in the same category as the other tools on this list. Its output is much more focused on answering your questions and providing information in a conversational tone. It is not intended for writing blog posts or marketing content.

That said, the technology behind ChatGPT is an exciting development. It’s based on GPT-3.5, which is the latest update to the GPT-3 model, and includes (among other updates) familiarity with topics up to late 2021, as compared to the previous model that only knew topics up to 2019.

It’s definitely worth checking out ChatGPT while it’s still free. It should be making you think about what your future AI use cases might look like, but it won’t currently be your go-to content writing tool.

Dishonorable Mentions (Read: Rants)

AI right now is like the Wild West: For every honest cowboy fulfilling Manifest Destiny, there are ten outlaws hoping for gold in a get-rich-quick scheme.

Here are two of the worst offenders:

Dishonorable Mention #1: Article Forge

Article Forge’s fatal flaw is over-promising.

Unlike the previous names on this list, Article Forge doesn’t offer different tools for generating outlines, headlines, articles, etc. It’s not even positioned as a “tool” as such—there’s no text editor or any place for you to do work in the app—but rather as a product: “Completely unique, SEO optimized, high-quality, long form articles with the click of a button.” You essentially place an order with a keyword and desired length, and it returns a finished piece of content.

Disclaimer: I didn’t test Article Forge myself. Based on the samples they provide on their site, I didn’t need to. 

If their website said that you may need to do some polishing before publishing the content, or if they positioned their product as a draft to get you 80 percent of the way there, it’s possible this would be the most impressive tool on the list. But they don’t. 

In fact, they go out of their way in their explainer video to talk about how they’ve solved the main problem of other AI tools, which is that those tools often generate content that “at a first glance seems well-written, but is actually inaccurate and often complete nonsense.” 

I heard that and thought, “Awesome! They get it!”

(Narrator: They don’t get it.)

The samples on their site (including the one which they lauded in that same video) have the same issues of any other AI-generated content. Let’s look at the first few sections of that sample:

So far, so good. 

“Once you have a list of keywords, write them down and create an SEO strategy. Then, …” But wait, this article is supposed to tell me how to create an SEO strategy! You can’t go on to the next step yet. That’s like an article on “How to bake a cake” saying, “Once you have the ingredients, bake a cake. Then, serve it.” 

Here’s a place for a digression on how AI is just a mirror, and if it’s writing nonsensical fluff, it’s because so much online content—especially in B2B marketing—is nonsensical fluff. That’s a hard truth and we all need to do better, but tools like this do not help. Fast forward a few years to a time when, thanks to how much faster AI can produce words, most online content has been written by AI, not humans. Then a future generation of GPT-3 is trained on that content, picks up these shitty patterns of meaningless phrases, and replicates them while creating still more content. Then a future generation of that model is trained on that even worse content…and so on, in a race to the bottom that only further hastens the death of the internet. 

The sample then goes on to talk about “quality content.” This is terribly ironic because despite what Article Forge said in that explainer video, this section is chock-full of exactly the kind of nonsense writing for which AI is notorious.

I could pick apart every sentence in this mad lib collection of arbitrary phrases, but all my frustration comes down to one fundamental problem: I’m not learning anything because it isn’t actually saying anything. It’s ostensibly about quality content, but it only includes one sentence of actionable guidance: “Remember to proofread your work, use relevant examples, attach sources, and make the content easy to read.” But even here, “attach” sources? We’re talking about digital content here, not a college paper that’s been physically printed out with a bibliography stapled to it.

Am I judging this piece more harshly than similar samples from other tools? Probably, but only because they told me it was going to be better than those other tools. Not just faster or more convenient, but better.

Throughout their marketing, they are setting the expectation that I’m going to get a publish-ready article. Their website promises “the same quality content as a human,” and it uses the phrase “high-quality” a lot.

Bottom line: It’s entirely possible Article Forge is marginally better than other AI tools. Maybe they have improved the tech in a small but significant way that results in better (or at least more consistent) content outputs. It’s still nowhere near the level they claim, though, which makes me not trust their judgment or their product. If they think the samples on their site are “high quality,” then I have no confidence that they understand me, my use case, or the type of content I need.


Dishonorable Mention #2: Kafkai

Kafkai manages to give a bad name to AI, marketers, and SEOs, all at the same time. (It even manages to give a bad name to a writer who’s been dead for almost 100 years.) 

The Kafkai website claims that “Kafkai is the first AI writer for generating content, built exclusively for SEOs and marketers.”

They must have a very low opinion of what SEOs and marketers want.

Despite being created specifically to help with search engine optimization, there is no incorporation of keyword research anywhere in the content. (In fact, the only place on the site that even mentions “keywords” is their affiliate page.)

In one of their three content offerings, you don’t even choose the specific topic for the content—you just choose one of their 38 supported niches (like nutrition, travel, or the cryptic “affiliate”), and it writes you a random article.

In the other two options, you provide a “seed” paragraph that is used to generate an article, still without any focus topics, keywords, headings…anything. 

Its utter lack of regard for actual SEO isn’t the biggest problem, though. What’s worse is that the content is godawful. 

Now is the time to come clean and confess that, as with Article Forge, I didn’t even try this tool. The one good thing I can say about both these services is that they provide all the proof you need in the samples on their site.

Their samples are terrible. Here’s an intro paragraph from one of them at random:

A Kafkai sample

What I find most interesting is that this content is significantly worse than what I’ve gotten from GPT-3 without any special tweaking or fine-tuning. How (and why) they managed to get that far below the baseline is beyond me.

“A few minutes of editing?” Nah, son.

Please don’t use this tool.


Dishonorable Mention #3: Ink

I’d heard good things about Ink from other content marketers and was excited to try it out. Unfortunately, based on my experience with the tool, I can’t recommend it.

Like all the other tools we’ve looked at so far on this list, Ink provides a variety of tools to choose from. There is also, however, an option called INK Multitool which allows you to access all the tools from a single interface. I wrote “Write an outline for a blog post” as the command, with “a blog post about content marketing kpis” as additional context. 

The outputs all consisted of three sections and one takeaway, but the sections weren’t typical outline items—they were full sentences like “Content Marketing is a powerful tool for driving traffic to your website, increasing conversions, and building brand awareness.” and “It’s not always easy to know what to do.”

That’s not what I want in an outline.

So much for one interface to rule them all. I then went back and used the designated “Blog Outline Generator” tool and got more standard outlines. The content itself, though, was worse than the outlines I got from tools like Copymatic or Copy.ai. Here are the first two:

None of its options were usable.

I decided to skip the outline for the time being and moved on to the content itself. The “Blog Introduction Paragraph Generator” generated three options. The first one began “But what are the key metrics that marketers should be tracking?” which is a weird way to start an article. The second one, though, was better:

“As digital marketers, we know that content is king, but sometimes it’s hard to know what metrics to track to see if your content is effective. In this blog post, we’ll take a look at some common content marketing KPIs and how you can use them to measure the success of your content.”

I wasn’t sure what my next step should be, as there wasn’t a tool for “blog post section” or “paragraph generator.” There was, however, a long-form writing tool that looked like a Google doc. I input my title and the best intro paragraph into the doc, hit “AI Write”…

…and it spat out incoherent nonsense about product data from Q3 2017.

And that was the end of my patience with Ink.

There’s also some stuff about Ink that feels unpolished in a low-quality, spammy way. The tools are called things like “Use Our New Facebook Listicle Tool to Write Unique Listicles” and “Brand Mission Statement Tool for Writing Brand Vision Statement.” What’s also strange is that there seem to be two tools for writing a blog conclusion, one called “Share Creative Final Words with Our Blog Post Conclusion Tool” and the other called “Draft Conclusion Examples with Our Blog Post Conclusion Paragraph Tool.” 

The latter had some strange helper text in the template:

Forget sentience. Have the machines achieved self-loathing?

Also, for most of the tools, including the Facebook listicle generator and one called “Write Unique Headings with Our Blog Title Generator,” the only things it asks for are Product Name, Product Description, and Tone of Voice. I guess maybe it’s very targeted at eCommerce sites? That wasn’t evident in the marketing, though.

Overall, I was confused and very overwhelmed. Would not recommend.


If you’re still reading, you’re probably as interested in the intersection of AI and content as I am. Hit me up with any questions about these tools or others (I’ve tried a lot of them), comments, or critiques at megan@verblio.com.

The post The 4 Best AI-Powered Content Writing Services (plus 5 others that are *not* the best) appeared first on Verblio.

]]>
You’ve Got MLAI: When GPT-3 Goes Off the Metaphorical Rails https://www.verblio.com/blog/gpt-3-metaphors Tue, 27 Sep 2022 18:50:26 +0000 https://www.verblio.com/?p=26649 GPT-3is a machine learning model from OpenAI. It generates human-like text, based on an enormous dataset of existing content on the internet. When people talk about AI-generated content, they’re usually talking about GPT-3 or tools built on top of it. (Check out our Jasper AI review for a deep dive on one of those tools,...

The post You’ve Got MLAI: When GPT-3 Goes Off the Metaphorical Rails appeared first on Verblio.

]]>
GPT-3 is a machine learning model from OpenAI. It generates human-like text based on an enormous dataset of existing content on the internet. When people talk about AI-generated content, they’re usually talking about GPT-3 or tools built on top of it. (Check out our Jasper AI review for a deep dive on one of those tools, or read our overview of GPT-3 if you want more of the basics about the tech.)

GPT-3 is fun to play with and can do some extremely impressive things. It can also get things comically wrong.

Prompt: Write a fun article comparing Zoom and other video conferencing programs to different ice cream flavors.

Output:
When it comes to choosing the right video conferencing program, there are many factors to consider. Price, features, ease of use – and of course, flavor. Just like with ice cream, there are a wide variety of video conferencing programs available, each with their own unique flavor. Here’s a look at how some of the most popular video conferencing programs compare to your favorite ice cream flavors. 

Zoom: Chocolate 

Zoom is rich, chocolatey, and perfect for when you need a quick pick-me-up. It’s also one of the most popular video conferencing programs on the market. Zoom is known for being easy to use and offering a wide range of features at a reasonable price. 

Skype: Vanilla 

Skype has been around for quite awhile and is still going strong. It’s reliable and perfect for basic video chat needs. Skype is also available on almost every platform, making it easy to use no matter what device you have. And like vanilla ice cream, Skype is a classic that will never go out of style. 

Other competitors: Stracciatella & Rocky Road 

When it comes to features and price Stracciatella offers more than Skype but less than Zoom whereas Rocky Road offers fewer features than both Stracciatella & Skype but costs more than both as well

Wait, what?

(Also, in case you’re not a diehard dessert fan, stracciatella is basically the same as chocolate chip. I’ve only ever seen it as a flavor of gelato, not ice cream, but that’s the least of our problems here.)

This is a perfect example of some of the strangeness you can run into with GPT-3. 

It does impressively well at a lot of natural language tasks, like making a list of sci-fi books, writing a paragraph about air conditioners, or even classifying tweet sentiments as positive or negative. I was curious to see what it could do with something more abstract, so I asked it to compare video conferencing platforms to ice cream flavors.

To the credit of machines everywhere, GPT-3’s response starts off strong and seems to ‘understand’ the metaphor I want to build. True, the second sentence in the intro—“Price, features, ease of use – and of course, flavor”—makes me wonder how literally it’s taking the comparison, but the Skype-vanilla section is dead-on.

In the final section, though, it goes off the rails, and I’m reminded of a very important fact:

GPT-3 doesn’t actually ‘understand’ anything and doesn’t know the difference between a frozen dessert and a video call, except insofar as those two concepts are usually surrounded by different words.

When working with AI and machine learning, you forget this at your peril.

Hi, I’m Megan

I work on the marketing team at Verblio. Before that, I was a freelance writer, including for a lot of Verblio customers. On weekends I ride a bike or motorcycle, and sometimes I write literary fiction. It’s unclear if the world will ever see it. 

I’m exploring AI and ML (machine learning) for the company because, well, it’s important to our future to understand it. We’re not using AI or ML to write content yet. (Not opposed, but as you’ll see in this series of articles, the tech isn’t ready for the goals of most content marketers.) We believe the future of content creation may involve some cyborg-like combination of human and machine, and we’re building an AI content writing service with that in mind.

In this series, I’ll be sharing some of my experiments, musings, and ideas.

Human creativity vs. machine creativity

Related to the above, there’s a pun to be made regarding ice cream and a frozen video call. I asked GPT-3 for one, and here’s what it came up with:

Prompt: Write a joke connecting frozen treats with a frozen Zoom call.

Output:
Why did the ice cream sundae join the frozen Zoom call?

Because it wanted to be a part of the cool crowd!

Oof. That’s almost funny? Given that “frozen” is the connection between the two subjects, though, I was expecting that to be in the punchline, like “What did the ice cream and the glitchy Zoom call have in common?” 

“They were both frozen!”

(Okay, maybe that’s not much better, but at least we’re squarely in dad-joke territory now.) 

Deeper into dad joke backcountry

To be fair, I could probably get a better joke with more guidance in the prompt. What would be most valuable, though, is if GPT-3 could generate that joke itself, given the original article prompt. 

This leads us to a hierarchy of creative capabilities for machine learning and metaphors: 

  1. Ability to generate text output that kinda looks like what I want but isn’t (This is where we currently are with GPT-3.)
  2. Ability to generate text output that is what I want
  3. Ability to generate text output that is what I want and includes a metaphor on its own without specific direction to do so
  • Example prompt: “Write a fun article comparing different video conferencing platforms.”
  • Output: an article comparing those platforms to ice cream flavors
  4. Ability to generate text output that is what I want, which includes a metaphor on its own and makes a good joke about it
  • Example prompt: “Write a fun article comparing different video conferencing platforms.”
  • Output: an article comparing those platforms to ice cream flavors that includes the obvious joke comparing frozen desserts to frozen video screens

For a human, this hierarchy is straightforward. Level One is useless, and any writer worth a fraction of their salt can successfully do Level Two: Create an article comparing video conferencing platforms to ice cream flavors, given a brief that asks for “an article comparing video conferencing platforms to ice cream flavors.”

So far, so good.

From there, it’s a small step up for a human to reach Level Three: coming up with the ice cream comparison themselves, as a way of illustrating the differences between various platforms. (At least, it’s a small step up in that the pool of humans who can achieve Level Three is not that much smaller than the pool who can achieve Level Two. How a comparison like that gets generated in the brain is no doubt a very complex thing—but from the outside, it’s one of those things we can do seemingly effortlessly.)

Adding the joke in Level Four is the cherry on top. For a human, this is another of those weird creative magic things that just happens—I didn’t sit down with a plan to figure out where I could add humor, nor did I make any conscious decision that I needed to write a relevant joke. The connection between the two topics simply sprang to mind, I recognized the potential joke, and I included it.

I wasn’t in “writing article mode” and then switched to “writing joke mode.” The two seem to be intertwined and simultaneous, inasmuch as they felt like separate processes at all.

Why originality is hard for machines

GPT-3, on the other hand, hasn’t yet achieved Level Two: following directions to use a metaphor correctly. Level Three—generating a metaphor on its own, without specific guidance—will require still more significant progress beyond that.

Why is it so much harder for a machine when humans can do this so easily?

Because GPT-3 operates on pattern recognition.

If I only ask GPT-3 to compare video conferencing platforms, it can do that relatively well. That type of content exists in a lot of places on the internet and looks pretty similar in most of those places, meaning there are strong patterns for it to identify and follow. 

When I add in the comparison to ice cream flavors, though, I’ve significantly reduced the amount of similar content on the web that GPT-3 can look to for examples. There are a lot fewer articles talking about both ice cream and video conferencing than there are articles talking about just one of those topics. This makes it that much harder for it to follow directions successfully for Level Two, but it also means the odds of it generating that comparison framework on its own (as in Level Three) are extremely low.

The catch-22 is that if there were a lot of content online comparing video conferencing platforms to ice cream flavors, it would stop being anything original or impressive for GPT-3 to write. Without that content, though, there’s no pattern for it to match. That’s why originality is hard for any machine learning model.

Is this an example that represents a true “wall” for AI, requiring some dramatic shift in its structure, or will it be solved through normal progress? Great question, welcome to the debate.

Humor is human

Finally, let’s look at Level Four: creating a metaphor and adding in the relevant joke. Despite being a relatively minor thing for a human, this seems like it would require exponentially more GPT-3 power, even after it reaches Level Three.

It can write articles and generate decent jokes independently of each other. From an algorithm perspective, though, how would it even go about including a joke, unprompted, in an article? 

A human can realize there’s a funny aside to be made while writing an article and make that slight diversion from the core narrative.

GPT-3, however, not only has to master the art of humor—it has to master the art of recognizing when there is an opportunity for humor.

Until it has a more elegant way to do that, it would likely have to run at least two processes at the same time, similar to the “writing article mode” and “writing joke mode” I mentioned above. In this case, it might look like Process 1 writing the article, Process 2 creating a “joke” around the latest text from Process 1, and then some additional tool like OpenAI’s “best of” parameter layered on top to choose the best one from among all the jokes it’s generating and add that to the final text.
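
For what it’s worth, here’s roughly what that two-process setup could look like with the GPT-3 completions API: one call drafting the next chunk of the article, a second call generating joke candidates around that chunk, with the best_of parameter letting the API keep the highest-likelihood candidate. (Note that “most likely” is not the same thing as “funniest,” which is exactly the unsolved part.) This is a sketch, not a working humor engine; model names, prompts, and parameter values are illustrative.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Process 1: draft the next section of the article.
section = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a short section comparing Zoom and Skype for a fun article "
           "that compares video conferencing platforms to ice cream flavors:",
    max_tokens=250,
).choices[0].text

# Process 2: generate joke candidates about that section. best_of samples several
# completions server-side and returns the one with the highest log probability,
# which is a proxy for "most plausible," not "funniest."
joke = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Write a short, relevant joke to insert into this passage:\n\n{section}\n\nJoke:",
    max_tokens=60,
    temperature=0.9,   # more randomness for "creative" output
    best_of=5,         # sample five candidates, return the most likely one
    n=1,
).choices[0].text

print(section + "\n\n" + joke.strip())
```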

Yeah it’s a bit complicated. That’s the point.

This would take exponentially more processing power, and, perhaps more importantly, I would still need to figure out how to define “joke.” Given specific parameters, GPT-3 can generate certain types of jokes—like the example of the New Yorker cartoon captions—but it needs to be given a specific prompt. And until I know what the specific joke is going to be, how do I know whether I should tell it to write a pun or a knock-knock joke? If I know the joke in advance, though, I’ve defeated the purpose of having GPT-3 create it for me.

It might seem like a relatively unimportant thing for content to contain humor, but it’s one of the things that can delight us as an audience—and that we take for granted—when reading all but the most practical of articles.

If I’m trying to learn whether my symptoms are those of a heart attack or not, then no, I don’t want any unrelated asides or humor to get in the way of the information I need. For most other types of reading, however, whether that’s debating which project management software to buy or learning the history of cattle ranching in America, a well-placed quip only enhances the experience and reminds us that we’re humans, not pure information-seeking automatons.

If we want GPT-3 to produce that kind of spontaneous humor, though, we’ll be waiting til the cows come home.

What’s next?

I’m going to keep experimenting with this stuff. Next up is probably a digression on randomness, GPT-3’s temperature parameter, and the nature of creativity. We’re also digging into the best AI content writing services, how long it takes to edit AI content, and so much more. Send an email to megan@verblio.com if you want to chat.

The post You’ve Got MLAI: When GPT-3 Goes Off the Metaphorical Rails appeared first on Verblio.

]]>